From "Word Unit" to "Symbol Unit": The Debate Over AI's Foundational Cognition Behind the Chinese Name of Token

Guest Columnist

2026-04-10 10:33

After "Token" was officially named as "词元" (word unit), this article points out from perspectives such as computational ontology, multimodal evolution, and back-translation consistency that this naming suffers from path dependency and semantic anchoring issues. The essence of Token is a discrete symbolic unit across modalities, not a linguistic "word." In comparison, "符元" (symbol unit) better aligns with its computational nature, offering long-term stability and cognitive consistency.

AI Summary

Core Argument: The article argues that while translating "Token" in the field of artificial intelligence as "词元" (word unit) has its advantages in terms of dissemination, it poses long-term adaptability risks when examined from the dimensions of technical essence, multimodal development, and terminology system consistency. It proposes that "符元" (symbol unit) is a translation scheme with greater structural consistency and cross-contextual stability.
Key Elements:
1. The definition of "词元" (word unit) is based on Token's "initial application scenario" in NLP, but the essence of Token is a "discrete symbolic unit" that processes various signals such as text, images, and speech. Multimodal development has allowed it to break through the narrow context of "word."
2. "词元" (word unit) relies on the analogy of a "generalized word" to explain multimodal applications, but analogy should not replace definition, as it can easily lead to semantic drift and cognitive bias. In contrast, "符" (symbol) as a neutral concept possesses inherent cross-modal adaptability.
3. In linguistics and NLP, "词元" (word unit) has long corresponded to "Lemma" (the canonical base form of a word), which differs in meaning from Token. Mixing the two would violate the principle of monosemy for terms and cause ambiguity in academic communication.
4. From the perspectives of information theory and computational theory, Token is a "symbol" index processed at the model's foundational level, not a "word" carrying semantics. "符元" (symbol unit) more accurately reflects its ontological attribute as a fundamental carrier of computation.
5. In cross-language back-translation, "词元" (word unit) lacks a clear English equivalent and is easily confused with several similar concepts. In contrast, "符元" (symbol unit) can more stably correspond to "symbolic unit," facilitating semantic consistency in international academic exchange.

Recently, the China National Committee for Terminology in Science and Technology issued an announcement recommending that "Token" in the field of artificial intelligence be translated as "词元" (word unit); the term is now open for public trial use. Subsequently, the People's Daily published an article titled "Experts Explain Why the Chinese Name for Token Was Set as '词元'", which systematically interpreted this naming from a professional perspective.

The article mentions that the word "token" originates from the Old English "tācen", meaning "symbol" or "mark". In language models, a token is the smallest discrete unit obtained after text segmentation or byte-level encoding, which can manifest in different forms such as words, subwords, affixes, or characters. It is precisely through modeling sequences of tokens that models exhibit certain intelligent capabilities.
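The two segmentation routes mentioned here can be illustrated with a minimal Python sketch. It is not how any production tokenizer works: the function names are made up for this illustration, whitespace splitting stands in for word-level segmentation, and raw UTF-8 byte values stand in for byte-level encoding.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Toy word-level segmentation: split on whitespace (illustrative only)."""
    return text.split()


def byte_level_tokens(text: str) -> list[int]:
    """Toy byte-level encoding: each UTF-8 byte becomes one ID in 0..255."""
    return list(text.encode("utf-8"))


words = whitespace_tokens("tokens are discrete units")
ids = byte_level_tokens("AI")          # two ASCII characters -> two byte IDs
cjk_ids = byte_level_tokens("词")      # one Chinese character -> three byte IDs

# The byte sequence is lossless: decoding the IDs restores the text.
restored = bytes(byte_level_tokens("AI 词元")).decode("utf-8")
```

In either route the result is a sequence of discrete units; under byte-level encoding a single Chinese character already spans several units, so the units need not line up with words at all.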

This translation is considered within the expert evaluation system to comply with the principles of monosemy, scientificity, conciseness, and coordination, and it also possesses a certain foundation for use in the current Chinese context. However, after reading the related interpretations, I have formed a different understanding regarding this naming path.

From a standardization perspective, this naming proposal has advantages in comprehensibility and dissemination in the short term. But if examined from dimensions such as computational ontology, information structure, multimodal evolution, and back-translation consistency, its long-term adaptability still requires further testing. In this context, an alternative path equally worthy of attention—"符元" (symbol unit)—gradually reveals stronger structural consistency and cross-context stability.

1. Misalignment of Definition: "Origin" Cannot Replace "Essence"

Article Viewpoint (Chen Xilin, Researcher at the Institute of Computing Technology, Chinese Academy of Sciences): Token's initial role in artificial intelligence is the "basic semantic unit of language", therefore "词元" can better align with its essence.

This judgment is reasonable within the historical context, but in the current era of major technological paradigm shifts, this line of thinking is essentially a form of "academic rigidity".

At the logical level of term definition, a strict distinction must be made between "initial application scenario" and "structural essential attribute".

Token indeed originated in Natural Language Processing (NLP), but in the evolutionary path towards AGI, it has long since broken through the boundaries of language models, evolving into a fundamental unit for uniformly processing text, images, speech, and even physical signals. In modern computational systems, the true structural ontology of Token is the "discrete symbolic unit", not a single-modal linguistic unit.

If things were named according to their "initial role", the computer should still be called an "electronic calculator" (after its initial function of replacing human calculators), and the Internet should be called a "Cold War military network". The fatal flaw in this naming logic is that it sees only a technology's "temporary function" at a specific historical moment and overlooks its "physical ontology", which transcends eras.

Historical path cannot be equated with essential attributes. Similarly, we cannot permanently lock Token into the narrow context of "word" simply because it was initially used to process text.

Using the "initial application scenario" to define a foundational concept essentially substitutes historical path dependency for the ontological truth of the structure. This type of definition might provide convenience for understanding in the early stages of a technology, but during the paradigm expansion phase of multimodal explosion, it quickly becomes obsolete and turns into a shackle hindering cognition. In contrast, "符元" directly aligns with the symbolic ontology of cross-modal computation; it defines not Token's "past", but Token's "truth".

2. The Boundary of Analogy: When Explanation Becomes Definition, It Begins to Deviate

Article Viewpoint (Dong Yuxiao, Associate Professor, Department of Computer Science, Tsinghua University): Through analogies like "word cloud" and "bag of words", the discrete units in multimodal contexts can be understood as "generalized words".

Professor Dong Yuxiao's analogy aids understanding but should not replace definition. This line of thinking has some explanatory power at the interpretative level, but if elevated further to become the basis for naming, it may lead to categorical misalignment at the conceptual level.

Methodologically, the role of analogy is to lower the barrier to understanding, while the duty of definition is to delineate semantic boundaries. When "word" is extended to cover image patches, speech segments, vector representations (embeddings), and even broader perceptual signals, its original linguistic attributes are continuously diluted, and its semantic boundaries become blurred. This "analogy-driven" expansion path can maintain explanatory consistency in the short term but is prone to semantic drift in long-term evolution.

Regarding cross-modal expansion capability, we must be vigilant against the slippage from "analogy" to "definition". In the context of term standardization, the boundary between "explanatory metaphor" and "ontological definition" must be distinguished to prevent the former from substituting the latter.

A more intuitive comparison is: in popular science contexts, we can analogize a light bulb to an "artificial sun" to enhance intuitive understanding; but in scientific naming systems, it would be impossible to rename the unit of electric current "Ampere" as "light unit" based on this. The former belongs to descriptive expression, while the latter involves strict measurement systems and standardized definitions; the two cannot be conflated.

Similarly, terms like "word cloud" and "bag of words" are essentially descriptive or statistical metaphors; their function is to aid in understanding data structure or distribution patterns. However, Token, as a fundamental measurement unit in large models, is deeply embedded in systems for computing power billing, model training, and academic measurement. When its usage scale reaches daily volumes of tens of billions to trillions of calls, its naming carries not just explanatory function, but also a foundational concept with engineering and standard significance. At this level, terminology needs to align more with its ontological attributes rather than relying on analogical extensions.

Pushing this analogical logic to the naming level implies a dangerous premise: since people are already accustomed to understanding Token through "word", why not keep using the analogy? But this is a continuation of path dependency: the convenience of existing cognition is substituted for the correction of conceptual ontology. In this sense, the naming is closer to "linguistic romanticism" than to a strict alignment with computational ontology.

We cannot insist on looking for "electronic horses" in electric motors just because "horsepower" contains "horse". Analogy can inspire understanding, but it cannot define standards.

In contrast, "符" (symbol) as a more neutral concept naturally possesses cross-modal adaptability, capable of covering various information forms like text, images, and speech without requiring additional explanation. Therefore, the naming path centered on "symbolic unit" is closer to the structural essence of Token at the definitional level. Under this logic, "符元" as a corresponding translation exhibits higher conceptual consistency and long-term adaptability.

3. The Cost of Cognition: When Semantic Anchors Create Systemic Misunderstanding

Article Viewpoint (Synthesized Expert Opinions): The expression "词元" is concise, conforms to Chinese language habits, and is easy to disseminate.

This judgment is reasonable at the dissemination level, but its implicit premise is that the public can accept the cross-modal analogy of "word". However, analogy is essentially an expert thinking tool, not a natural cognitive method for the general public. For ordinary users, "word" has a strong semantic anchoring effect: on hearing "word", they intuitively think of language, not of other modalities such as images, sounds, or actions. This is not a technical matter but a stable pattern in cognitive psychology.

On this basis, when "word" is extended to the so-called "generalized word", it actually creates bias in user cognition. Users first form an intuitive understanding of "word = linguistic unit", not the abstract concept of "cross-modal symbolic unit". Once this misunderstanding is established, all subsequent explanations become corrections to existing cognition, rather than natural extensions of understanding.

For example, when media reports that "the model was trained using 10 trillion 词元", the public can easily interpret it as "read a vast amount of text", overlooking the large amounts of image, speech, and other modal data included. This misunderstanding is not an isolated case but is systematically induced by the semantic anchoring of the term itself.

In practical engineering contexts, this naming may also introduce friction in cross-disciplinary communication. When discrete units in visual or speech models are called "words", it not only easily causes semantic misunderstandings but also creates unnecessary linguistic conflicts between different fields. Multimodal systems require unification at the "symbolic layer", not the expansion of linguistic categories.

In comparison, "符" as a more abstract concept, although having a slightly higher initial barrier to understanding, has a more neutral semantic direction, not pre-locking cognition at the linguistic layer. In long-term use, it is more conducive to establishing a stable, unified cognitive framework, thereby reducing overall explanatory costs and providing a more stable cognitive foundation for multimodal unification.

The cost of naming does not occur at the moment of definition, but at the moment of correction; once early naming forms a semantic anchor, the cost of subsequent cognitive repair increases exponentially.

Experts can expand the boundary of "word" through analogy, but the public does not understand concepts through analogy. Naming does not serve experts alone; it is responsible for the entire era's cognitive system.

4. The Illusion of Monosemy: When One Word Attempts to Bear Two Systems

Article Viewpoint (Principle of Term Standardization): "词元" complies with the principle of monosemy, helping to resolve translation confusion.

Regarding the monosemy of terms, special attention must be paid to the systemic risks that "one word with two meanings" may trigger. In the standardization of scientific terms, "monosemy" is one of the fundamental principles. If a term requires context or additional explanation to distinguish its meaning, its value as a standard component is already lost.

However, from the perspective of the existing academic system, this judgment still has room for further discussion. The term "词元" has long been "taken" in the fields of linguistics and Natural Language Processing (NLP). In classical linguistics, its long-standing corresponding English concept is Lemma, i.e., the canonical base form of a word (e.g., the lemma for is/am/are is be). This usage has formed a stable consensus in foundational linguistics and NLP textbooks and academic papers.

Against this background, if Token is also translated as "词元", semantic conflicts can easily arise in specific expressions, with seriously confusing results.

For example, when describing the "lemmatization operation in NLP (lemmatize a token)", the Chinese expression would present a structure like "perform '词元化' on a '词元'". This expression not only increases the cost of understanding but also introduces ambiguity in academic writing and information retrieval, making it difficult for readers to distinguish whether "词元" refers to the segmented discrete unit or the canonical base form of a word.

From a conceptual function perspective, the two also have clear distinctions: Lemma emphasizes "reduction" at the linguistic level, corresponding to the canonical expression after word inflection; whereas Token emphasizes "segmentation" in the computational process, corresponding to the smallest discrete unit when the model processes information. This difference between "reduction" and "segmentation" corresponds precisely to different dimensions of the semantic layer and the symbolic layer.
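The difference between "segmentation" and "reduction" can be made concrete with a small Python sketch. The lemma table below is a hypothetical three-entry lookup built from the is/am/are example above, and whitespace splitting is only a stand-in for real tokenization; neither reflects an actual NLP library.

```python
# Hypothetical toy lemma table, taken from the is/am/are -> be example.
LEMMA_TABLE = {"is": "be", "am": "be", "are": "be"}


def segment(text: str) -> list[str]:
    """Segmentation: cut the input into discrete units (tokens)."""
    return text.lower().split()


def lemmatize(token: str) -> str:
    """Reduction: map an inflected form to its canonical base form (lemma)."""
    return LEMMA_TABLE.get(token, token)


tokens = segment("I am here and you are there")
lemmas = [lemmatize(t) for t in tokens]
```

Here `tokens` keeps "am" and "are" as distinct units, while `lemmas` collapses both to "be". The two operations act on different layers, which is why a shared Chinese name for their outputs invites confusion.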

Therefore, when a term needs to be "generalized" to simultaneously cover multiple existing concepts, its monosemy has actually transformed into "unification at the explanatory level", not "stability at the semantic level".

When a term needs to rely on explanation to maintain unity, its stability as a standard term is often already beginning to waver.

In contrast, "符元" does not have semantic conflicts in the existing terminology system. On one hand, it preserves Token's ontological attribute as a discrete symbol; on the other hand, it avoids overlap with the existing translation of Lemma, thereby exhibiting higher stability in terms of semantic clarity and systemic consistency.

5. The Return to Ontology: Token is Essentially a "Symbol", Not a "Word"

Article Viewpoint (General Explanation): Token is the smallest unit used to process text in language models.

This statement is valid at the functional level but remains at the level of "how to use", without touching its ontological attributes in computational theory. From the perspectives of information theory and computational theory, the basic objects processed by computing systems are not "words", but "symbols".

This can be further understood from two levels:

On one hand, from an information theory perspective, the essence of information lies in eliminating uncertainty, its unit of measurement is the bit, and its carrier entity is the discrete symbol. Symbols do not care about semantic content; they are only related to probability distributions and encoding structures.
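The claim that symbols depend only on probability distributions can be checked with a short Python sketch of Shannon entropy. The sample sequences are invented for illustration; the point is only that relabeling the symbols leaves the measured information unchanged.

```python
from collections import Counter
from math import log2


def entropy_bits(symbols: list) -> float:
    """Empirical Shannon entropy, in bits per symbol, of a symbol sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Two invented sequences with the same frequency pattern but different labels.
letters = ["a", "b", "a", "c"]
indices = [7, 9, 7, 3]

h_letters = entropy_bits(letters)
h_indices = entropy_bits(indices)
```

Both sequences have frequencies 1/2, 1/4, 1/4, so both come out at 1.5 bits per symbol. What the labels "mean" plays no role in the calculation.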

On the other hand, at the level of computational implementation, the underlying large models do not "read characters"; the objects they process are discrete index representations (IDs). Whether this ID corresponds to a Chinese character, an image patch, or an audio sample point, it participates in computation in a unified symbolic form during the computational process.
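A toy Python sketch can show what "participating in a unified symbolic form" looks like. Everything here is an assumption made for illustration: the vocabulary sizes, the uniform quantizer, and the function names are hypothetical, and real multimodal systems typically learn their codebooks rather than binning values this way.

```python
TEXT_VOCAB = 256    # byte values 0..255
PATCH_LEVELS = 16   # hypothetical number of image-patch codes
AUDIO_LEVELS = 16   # hypothetical number of audio-sample codes


def quantize(values: list[float], levels: int) -> list[int]:
    """Map floats in [0, 1] to integer codes 0..levels-1 (uniform bins)."""
    return [min(int(v * levels), levels - 1) for v in values]


def text_ids(text: str) -> list[int]:
    return list(text.encode("utf-8"))


def patch_ids(patch_means: list[float]) -> list[int]:
    # Offset so that image codes do not collide with text IDs.
    return [TEXT_VOCAB + c for c in quantize(patch_means, PATCH_LEVELS)]


def audio_ids(samples: list[float]) -> list[int]:
    # Offset past both the text and the image ranges.
    return [TEXT_VOCAB + PATCH_LEVELS + c for c in quantize(samples, AUDIO_LEVELS)]


# One flat sequence of integer IDs, whatever the source modality.
sequence = text_ids("hi") + patch_ids([0.0, 0.5, 1.0]) + audio_ids([0.25, 0.75])
```

Once the inputs are mapped into a shared ID space, downstream computation sees only integers; nothing in `sequence` marks which entries were once text.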

Within this framework, Token can serve as a unified unit across modalities precisely because its essence lies at the "symbolic layer", not the "semantic layer". Symbols themselves do not carry semantics; they exist as the basic carriers for encoding and computation.

Naming Token as "词元" introduces, to some extent, an implicit direction towards the linguistic semantic layer, pulling this originally symbolic-layer concept back into a language-centric path of understanding. This naming method might provide intuitiveness at the explanatory level, but at the theoretical level, it easily blurs the boundary between "symbolic computation" and "semantic understanding".

In contrast, "符元" conceptually remains within the symbolic layer. On one hand, it accurately reflects Token's computational attribute as a discrete symbol; on the other hand, it avoids introducing semantic features into the ontological definition, thus aligning better with the basic framework of information theory and computational theory.

From a broader perspective, as artificial intelligence systems continue to evolve towards multimodality and general intelligence, if the naming of foundational concepts can directly align with their mathematical and computational ontology, it will be more conducive to constructing a stable, extensible cognitive system. In this sense, the naming path centered on "symbolic unit" is not merely a linguistic choice issue but also a consistent expression of the essence of computation, and "符元" is the natural correspondence within this framework.

Defining concepts from the symbolic layer is an alignment with the essence of computation; naming concepts from the semantic layer is closer to explanation than definition.

6. The Fracture of Language: Mapping Failure in Back-Translation Mechanisms

Article Viewpoint (Synthesized Interpretation): "词元" has gradually formed a foundation for use in the Chinese academic community, possessing certain dissemination advantages.

In cross-linguistic contexts, we must be vigilant against the systemic impact brought about by "back-translation fracture" of terms. Measuring whether a scientific term possesses long-term vitality depends not only on its ability to convey meaning within the Chinese context but also on whether it can achieve stable mapping within the international academic system. An ideal term should possess "reversibility", i.e., achieving semantically consistent round-trips between different languages.

The above judgment reflects the acceptability of "词元" in the local context, but from a cross-linguistic perspective, there is still room for further discussion. If a term is only valid within a single language system and cannot form a stable corresponding relationship in the international context, it may introduce additional understanding costs in academic exchanges.

Specifically, "词元" lacks a clear, unique corresponding path during back-translation. When restored to English, it tends to diverge among several approximate concepts: "word unit" has no strict academic definition; "morpheme" is the linguistic term for the smallest meaning-bearing unit of a language; "lexeme" is the linguistic term for an abstract vocabulary item underlying a set of inflected forms. None of these concepts accurately covers the meaning of Token in the computational context; instead, they introduce categorical shifts.

In contrast, "符元" can more naturally correspond to "symbolic unit". This concept has a clear theoretical foundation and stable usage in fields such as information theory, discrete mathematics, and multimodal representation, enabling it to maintain consistent semantic direction across different contexts. Therefore, it is easier to form a one-to-one mapping relationship between Chinese and English.
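The one-to-one criterion can be stated as a simple check. The Python sketch below encodes only the candidate renderings claimed in this article (including "lemma" from section 4); the table is not an authoritative glossary, and the function name is made up for illustration.

```python
# Candidate English renderings as claimed in this article,
# not an authoritative glossary.
BACK_TRANSLATIONS = {
    "词元": ["lemma", "word unit", "morpheme", "lexeme"],
    "符元": ["symbolic unit"],
}


def is_one_to_one(term: str) -> bool:
    """True if the term has exactly one candidate rendering in the table."""
    return len(BACK_TRANSLATIONS.get(term, [])) == 1


results = {term: is_one_to_one(term) for term in BACK_TRANSLATIONS}
```

Under these assumed candidates, "符元" passes the check and "词元" does not. Whether the candidate lists are complete is exactly what the terminology debate has to settle.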

From a practical perspective, once a term enters academic papers, technical documentation, and international communication scenarios, its back-translation capability will directly impact expression efficiency and understanding accuracy. If a term requires additional explanation to complete cross-linguistic conversion, its long-term usage cost will continuously accumulate.

Therefore, within cross-linguistic systems, the main issue faced by "词元" lies in the instability of its mapping path, whereas "符元" exhibits higher certainty in terms of semantic correspondence and conceptual consistency. Against the backdrop of increasingly globalized artificial intelligence, choosing terms with good back-translation characteristics will be more conducive to constructing an open, interoperable academic and technological system.

The international reversibility of a term is essentially a key indicator of whether it possesses long-term academic vitality.

7. The Misconception of Unification: Formal Consistency Does Not Equal Structural Consistency

Article Viewpoint (Synthesized Expert Opinions): The expression style of "词元" is consistent with terms like "embedding" and "attention"—concise, abstract, and conforming to the Chinese technical context.

Conclusion first: The unification of a terminology system should be built upon "conceptual isomorphism", not "linguistic homomorphism".

A common rationale offered in support of "词元" is this stylistic consistency with terms like "embedding" and "attention". The rationale captures a real need for unity in terminology systems, but if unification stays at the linguistic level and never reaches the structural level, it slides from "order" into "illusion".

The reason why "embedding" and "attention"
