GPT-4o’s Chinese token-training data is polluted by spam and porn websites

The new tokenizer has 200,000 tokens in total, and about 25% are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.
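The kind of count Das describes can be approximated with a rough sketch like the one below. It is not his method: it guesses each token's writing system from Unicode character names, and the `sample_vocab` list is a hypothetical stand-in for a real vocabulary. A script check is also coarser than a language filter, since it cannot tell Vietnamese from English (both use Latin letters).

```python
import unicodedata
from collections import Counter

def dominant_script(token: str) -> str:
    """Guess a token's writing system from the Unicode names of its letters."""
    scripts = Counter()
    for ch in token:
        if not ch.isalpha():
            continue  # ignore spaces, digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        # The first word of a character name is usually the script,
        # e.g. "CYRILLIC SMALL LETTER DE" or "CJK UNIFIED IDEOGRAPH-4E2D".
        scripts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return scripts.most_common(1)[0][0] if scripts else "OTHER"

def count_by_script(vocab):
    """Tally how many tokens in the vocabulary fall under each script."""
    return Counter(dominant_script(tok) for tok in vocab)

# Hypothetical mini-vocabulary, for illustration only.
sample_vocab = [" the", "ing", " привет", " مرحبا", " việc", "中文", "123"]
counts = count_by_script(sample_vocab)
for script, n in counts.most_common():
    print(f"{script}: {n}")
```

Running the same tally over a real 200,000-token vocabulary would give a per-script breakdown, which is only a first approximation of a per-language one.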

“So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze the prompts faster and charge users less for the same answer. With the new tokenizer, he says, the cost reduction is almost fourfold.
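The arithmetic behind that claim is simple: if usage is billed per token and the same text needs fewer tokens, the bill shrinks by the same factor. The sketch below uses made-up numbers (a prompt that drops from 400 tokens to 100, and a price of $5 per million tokens) and assumes the per-token price itself does not change.

```python
def prompt_cost(num_tokens: int, price_per_million: float) -> float:
    """Cost of processing num_tokens at a flat per-token price."""
    return num_tokens / 1_000_000 * price_per_million

def cost_reduction(old_tokens: int, new_tokens: int) -> float:
    """How many times cheaper the same text becomes under the new tokenizer."""
    return old_tokens / new_tokens

# Hypothetical figures, chosen only to illustrate a fourfold reduction.
OLD_TOKENS = 400   # token count for a prompt under the old tokenizer
NEW_TOKENS = 100   # token count for the same prompt under the new one
PRICE = 5.0        # assumed price in dollars per million tokens

old_cost = prompt_cost(OLD_TOKENS, PRICE)
new_cost = prompt_cost(NEW_TOKENS, PRICE)
factor = cost_reduction(OLD_TOKENS, NEW_TOKENS)
print(f"old: ${old_cost:.6f}, new: ${new_cost:.6f}, reduction: {factor:.1f}x")
```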

Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect discussions happening in those languages, so they include words like “Narendra” or “Pakistan,” but common English terms like “Prime Minister,” “college,” and “worldwide” also come up frequently. They also don’t exhibit the issues surrounding the Chinese tokens.

That likely reflects the training data in those languages, Das says: “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There aren’t many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”

Polluted data and a lack of cleaning

However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, reflect those topics to a significant degree.

“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. It’s not rare for a language model to crawl spam when collecting training data, but usually there will be significant effort taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data clearing when it comes to Chinese,” he says.

The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages.

These messages are often advertisements for pornography videos and gambling websites. They could be real businesses or merely scams. The language is inserted into content farm websites, or sometimes legitimate websites, so that it can be indexed by search engines, circumvent spam filters, and come up in random searches. For example, Google indexed one search result page on a US National Institutes of Health website that lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.
