The new tokenizer has 200,000 tokens in total, and about 25% are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. Das used language filters to count the number of tokens in different languages, and the top languages besides English are Russian, Arabic, and Vietnamese.
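A rough version of this kind of language filter can be sketched by bucketing vocabulary entries by the Unicode script of their characters. Everything below is illustrative: the helper names, the tiny sample vocabulary, and the script buckets are invented for this sketch, not a description of how Das or OpenAI actually did the count.

```python
import unicodedata

def char_script(ch: str) -> str:
    """Very rough script bucket for a single character."""
    # CJK Unified Ideographs block (covers most common Chinese characters)
    if "\u4e00" <= ch <= "\u9fff":
        return "CJK"
    name = unicodedata.name(ch, "")
    for script in ("CYRILLIC", "ARABIC", "LATIN"):
        if name.startswith(script):
            return script.title()
    return "Other"

def dominant_script(token: str) -> str:
    """Most common script among a token's letters; 'None' if it has no letters."""
    counts: dict[str, int] = {}
    for ch in token:
        if not ch.isalpha():
            continue
        script = char_script(ch)
        counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "None"

# Toy "vocabulary" standing in for real tokenizer entries
vocab = ["hello", " the", "привет", "مرحبا", "你好"]
by_script: dict[str, list[str]] = {}
for tok in vocab:
    by_script.setdefault(dominant_script(tok), []).append(tok)
```

Note that Vietnamese would land in the Latin bucket here, since it uses Latin letters with diacritics; a real filter would need proper language identification, not just script detection.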
“So the main impact of the tokenizer, in my opinion, is that you lower the cost in those languages, not that the quality in those languages increases dramatically,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze prompts faster and charge users less for the same answer. With the new tokenizer, “you’re looking at almost a fourfold cost reduction,” he says.
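The cost effect follows directly from token counts, since API pricing is per token. A toy greedy longest-match segmenter makes the arithmetic visible; this is not real byte-pair encoding, and both vocabularies below are invented purely to show how a richer vocabulary shrinks the token count.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a token vocabulary.
    Falls back to single characters when nothing in the vocab matches.
    (Real tokenizers like BPE work differently; the counting effect is the same.)"""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# A "small" vocabulary forces more, shorter tokens than a "large" one.
small_vocab = {"he", "llo", "wor", "ld"}
large_vocab = {"hello", " world"}

few = greedy_tokenize("hello world", large_vocab)   # 2 tokens
many = greedy_tokenize("hello world", small_vocab)  # 5 tokens
```

Since the bill scales with the token count, cutting a message from five tokens to two for the same text is exactly the kind of reduction Das is describing, just on a much larger scale.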
Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect discussions happening in those languages, so they include words like “Narendra” or “Pakistan,” but common English terms like “Prime Minister,” “university,” and “international” also come up frequently. They also don’t exhibit the issues surrounding the Chinese tokens.
This probably reflects the training data in those languages, Das says: “My working theory is that the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s going to be mostly in English.”
Contaminated data and lack of cleanup
However, the situation is drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scam contexts. Even shorter tokens, like three-character-long Chinese words, reflect those topics to a significant degree.
“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai of Princeton University. It is not rare for a language model to scoop up spam when collecting training data, but usually there will be significant effort taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data cleaning when it comes to Chinese,” he says.
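The kind of cleanup Cai is describing can be sketched in a few lines; a minimal, hypothetical version might drop documents whose spam-phrase density crosses a threshold before the tokenizer is trained on them. The phrase list, threshold, and function names here are all invented for illustration; real pipelines use trained classifiers, deduplication, and far larger blocklists.

```python
# Illustrative pre-training cleanup step (hypothetical phrases and threshold).
SPAM_PHRASES = {"casino bonus", "free bets", "watch now"}

def spam_density(doc: str) -> float:
    """Spam-phrase hits per word of the document."""
    text = doc.lower()
    hits = sum(text.count(phrase) for phrase in SPAM_PHRASES)
    return hits / max(len(doc.split()), 1)

def clean_corpus(docs: list[str], threshold: float = 0.05) -> list[str]:
    """Keep only documents below the spam-density threshold."""
    return [d for d in docs if spam_density(d) < threshold]

docs = [
    "tokenizers map text to integer ids",
    "casino bonus casino bonus watch now here",
]
kept = clean_corpus(docs)  # the spam-heavy document is dropped
```

If a step like this is skipped for one language, its most frequent strings, and therefore its longest learned tokens, end up dominated by whatever spam floods that language's web.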
The content of these Chinese tokens suggests that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages.
These messages are often advertisements for pornography videos and gambling websites. They could be real businesses or just scams. And the language is inserted into content-farm websites, or sometimes legitimate websites, so they can be indexed by search engines, circumvent spam filters, and surface in random searches. For example, Google indexed one search result page on a US National Institutes of Health website that lists a Chinese porn site. The same site name also appeared in at least five Chinese tokens in GPT-4o.