Large language models (LLMs) such as GPT-4 (OpenAI), LLaMA-2 (Meta), and Gemini (Google) are trained on huge amounts of text data. It's well-known that Wikipedia is an important source of training data, but news articles are used too.
The advantage of using news articles for LLM training is obvious: news articles reflect reality and how people perceive and talk about the world.
One ethical danger is that the Overlords of AI will attempt to make LLMs politically correct in some sense by filtering out news articles that don’t support their agendas. In my opinion, AI systems should represent actual reality, not the desired reality of any group that wants to impose their opinions on others.
Here’s a screenshot of the Yahoo website news feed for a single day. It illustrates the point that reality isn’t always pretty.
Training text, including news articles, is converted into integer tokens before it's fed to a model. OpenAI's GPT-4 uses the tiktoken tokenizer, specifically its cl100k_base encoding. I slapped together a quick demo.
My demo source text is “The future of AI is impredictable.” where I deliberately used a word, impredictable, that doesn't exist. The tokenizer breaks the source text into “The”, “future”, “of”, “AI”, “is”, “imp”, “redict”, “able”, “.” and then maps those pieces to the integers [791, 3938, 315, 15592, 374, 3242, 9037, 481, 13]. More common words and punctuation marks tend to have smaller integer IDs.
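The encoding is lossless: decoding the integer list reproduces the original string exactly. Here's a minimal round-trip sketch (assuming tiktoken has been installed via pip; the n_vocab property reports the size of an encoding's vocabulary):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 encoding
tokens = enc.encode("The future of AI is impredictable.")
print(tokens)              # [791, 3938, 315, 15592, 374, 3242, 9037, 481, 13]
print(enc.decode(tokens))  # round-trip: "The future of AI is impredictable."
print(enc.n_vocab)         # cl100k_base has roughly 100,000 token IDs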
I’m optimistic that as AI systems evolve, they will use all available accurate data, not just data that promotes a particular point of view. Data is just data and it’s not inherently good or evil. What matters is how data is used, or not used.
Demo code:
# tiktoken_demo.py
# OpenAI tokenizer demo
# pip install tiktoken

import tiktoken

print("\nBegin tiktoken tokenizer demo ")

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4

txt = "The future of AI is impredictable."
print("\nsource text: ")
print(txt)

encoded = enc.encode(txt)  # text to list of integer token IDs
print("\nencoded integer tokens: ")
print(encoded)

print("\nsplit text: ")
for i in range(len(encoded)):
  t = enc.decode([encoded[i]])  # decode() expects a list of IDs
  print(t)
  # print(t.strip())  # remove leading space, if any

print("\nEnd demo ")
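One thing to notice when running the demo: in the split-text output, tokens that begin a new word come back with a leading space (“ future”, “ of”, and so on) because cl100k_base folds the preceding space into the token itself. That's why the program includes the commented-out t.strip() alternative.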