How close is AI to human-level intelligence?

OpenAI’s latest artificial intelligence (AI) system dropped in September with a bold promise. The company behind the chatbot ChatGPT showcased o1 — its latest suite of large language models (LLMs) — as having a “new level of AI capability”. OpenAI, which is based in San Francisco, California, claims that o1 works in a way that is closer to how a person thinks than do previous LLMs.

The release poured fresh fuel on a debate that’s been simmering for decades: just how long will it be until a machine is capable of the whole range of cognitive tasks that human brains can handle, including generalizing from one task to another, abstract reasoning, planning and choosing which aspects of the world to investigate and learn from?


Such an ‘artificial general intelligence’, or AGI, could tackle thorny problems, including climate change, pandemics and cures for cancer, Alzheimer’s and other diseases. But such huge power would also bring uncertainty — and pose risks to humanity. “Bad things could happen because of either the misuse of AI or because we lose control of it,” says Yoshua Bengio, a deep-learning researcher at the University of Montreal, Canada.

The revolution in LLMs over the past few years has prompted speculation that AGI might be tantalizingly close. But given how LLMs are built and trained, they will not be sufficient to get to AGI on their own, some researchers say. “There are still some pieces missing,” says Bengio.

What’s clear is that questions about AGI are now more relevant than ever. “Most of my life, I thought people talking about AGI are crackpots,” says Subbarao Kambhampati, a computer scientist at Arizona State University in Tempe. “Now, of course, everybody is talking about it. You can’t say everybody’s a crackpot.”

Why the AGI debate changed

The phrase artificial general intelligence entered the zeitgeist around 2007, after it appeared as the title of a book edited by AI researchers Ben Goertzel and Cassio Pennachin. Its precise meaning remains elusive, but it broadly refers to an AI system with human-like reasoning and generalization abilities. Fuzzy definitions aside, for most of the history of AI, it’s been clear that we haven’t yet reached AGI. Take AlphaGo, the AI program created by Google DeepMind to play the board game Go. It beats the world’s best human players at the game, but its superhuman qualities are narrow, because that’s all it can do.

The new capabilities of LLMs have radically changed the landscape. Like human brains, LLMs have a breadth of abilities that have caused some researchers to seriously consider the idea that some form of AGI might be imminent¹, or even already here.

This breadth of capabilities is particularly startling when you consider that researchers only partially understand how LLMs achieve it. An LLM is a neural network, a machine-learning model loosely inspired by the brain; the network consists of artificial neurons, or computing units, arranged in layers, with adjustable parameters that denote the strength of connections between the neurons. During training, the most powerful LLMs — such as o1, Claude (built by Anthropic in San Francisco) and Google’s Gemini — rely on a method called next token prediction, in which a model is repeatedly fed samples of text that has been chopped up into chunks known as tokens. These tokens could be entire words or simply a set of characters. The last token in a sequence is hidden or ‘masked’ and the model is asked to predict it. The training algorithm then compares the prediction with the masked token and adjusts the model’s parameters to enable it to make a better prediction next time.
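The training loop described above can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not how any production LLM is built: the tokenizer simply splits on spaces, the "model" is a table of scores that looks only at the single previous token rather than the full context, and the names tokenize, train_step and logits_table are invented for the example.

```python
import math

def tokenize(text):
    # Toy tokenizer (assumption): split on spaces. Real LLMs use subword tokens.
    return text.split()

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(logits_table, vocab_index, prev_token, hidden_token, lr=0.5):
    """Predict the hidden token from the previous one, then adjust parameters."""
    row = logits_table[vocab_index[prev_token]]
    probs = softmax(row)
    target = vocab_index[hidden_token]
    # Cross-entropy loss: small when the hidden token was given high probability.
    loss = -math.log(probs[target])
    # The gradient of this loss with respect to each score is (prob - one_hot).
    for j in range(len(row)):
        row[j] -= lr * (probs[j] - (1.0 if j == target else 0.0))
    return loss

tokens = tokenize("the cat sat on the mat")
vocab = sorted(set(tokens))
vocab_index = {tok: i for i, tok in enumerate(vocab)}
# Adjustable parameters: one row of scores per possible previous token.
logits_table = [[0.0] * len(vocab) for _ in vocab]

# Hide "sat", show "cat", and update twice on the same example.
first_loss = train_step(logits_table, vocab_index, "cat", "sat")
second_loss = train_step(logits_table, vocab_index, "cat", "sat")
```

The second loss is lower than the first because the update has shifted probability towards the hidden token, which is the sense in which the model makes "a better prediction next time".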


The process continues — typically using billions of fragments of language, scientific text and programming code — until the model can reliably predict the masked tokens. By this stage, the model parameters have captured the statistical structure of the training data, and the knowledge contained therein. The parameters are then fixed and the model uses them to predict new tokens when given fresh queries or ‘prompts’ that were not necessarily present in its training data, a process known as inference.
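Inference with fixed parameters can be illustrated with another toy sketch. Here the "parameters" are simple counts of which token follows which in a tiny made-up corpus, standing in for the billions of learned weights in a real model; the function names and the corpus are assumptions made for the example.

```python
from collections import Counter, defaultdict

def fit_bigram_counts(tokens):
    # "Training": record how often each token follows each other token.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, prompt_tokens, max_new_tokens=5):
    # Inference: the counts are frozen; only the growing sequence changes.
    out = list(prompt_tokens)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # The last token was never seen with a successor.
        # Greedy choice: most frequent follower, ties broken alphabetically.
        out.append(min(followers, key=lambda t: (-followers[t], t)))
    return out

corpus = "the model predicts the next token and the model repeats".split()
frozen = fit_bigram_counts(corpus)
completion = generate(frozen, ["next"], max_new_tokens=3)
# completion == ["next", "token", "and", "the"]
```

Each new token is appended to the sequence and fed back in as context for the next prediction, which is how a model produces a long response one token at a time.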

The use of a type of neural network architecture known as a transformer has taken LLMs significantly beyond previous achievements. The transformer allows a model to learn that some tokens have a particularly strong influence on others, even if they are widely separated in a sample of text. This permits LLMs to parse language in ways that seem to mimic how humans do it — for example, differentiating between the two meanings of the word ‘bank’ in this sentence: “When the river’s bank flooded, the water damaged the bank’s ATM, making it impossible to withdraw money.”
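The mechanism at the heart of the transformer, known as scaled dot-product attention, can be sketched as follows. This is a simplified illustration: the two-dimensional token vectors are invented, and the same vectors serve as queries, keys and values, whereas a real transformer learns separate projections for each and adds positional information.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Each token scores every token, then takes a weighted mix of their vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dimension).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, vectors))
                 for j in range(dim)]
        all_weights.append(weights)
        outputs.append(mixed)
    return outputs, all_weights

# Made-up vectors (assumption): "river" and "bank" point in similar directions.
tokens = ["river", "the", "of", "bank"]
vectors = [[2.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.2]]
outputs, weights = self_attention(vectors)
bank_weights = dict(zip(tokens, weights[-1]))
```

In this toy case, "bank" places its largest weight on "river" even though two other tokens sit between them, because the weights depend on how similar the vectors are and not on how far apart the tokens are.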

This approach has turned out to be highly successful in a wide array of contexts, including generating computer programs to solve problems that are described in natural language, summarizing academic articles and answering mathematics questions.

And other new capabilities have emerged along the way, especially as LLMs have increased in size, raising the possibility that AGI, too, could simply emerge if LLMs get big enough. One example is chain-of-thought (CoT) prompting. This involves showing an LLM an example of how to break down a problem into smaller steps to solve it, or simply asking the LLM to solve a problem step by step. CoT prompting can lead LLMs to correctly answer questions that previously flummoxed them. But the process doesn’t work very well with small LLMs.
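The two styles of CoT prompting mentioned above amount to different ways of assembling the text sent to a model. The sketch below only builds the prompt strings and does not call any model; the function names, the "Q:/A:" layout and the worked example are assumptions chosen for illustration.

```python
def zero_shot_cot(question):
    # Simply ask the model to reason step by step.
    return f"Q: {question}\nA: Let's think step by step."

def few_shot_cot(question, worked_examples):
    # Show worked examples that break a problem into steps, then ask the new one.
    parts = []
    for example_question, steps, answer in worked_examples:
        reasoning = " ".join(steps)
        parts.append(f"Q: {example_question}\nA: {reasoning} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

worked = [(
    "A shop has 3 boxes of 12 pens and sells 4 pens. How many are left?",
    ["There are 3 * 12 = 36 pens.", "After selling 4, 36 - 4 = 32 remain."],
    "32",
)]
prompt = few_shot_cot("A farm has 5 pens of 8 sheep and buys 6 more. How many sheep?", worked)
```

The few-shot prompt ends with an open "A:" so that the model's continuation is nudged to follow the same step-by-step pattern as the worked example.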

The limits of LLMs

CoT prompting has been integrated into the workings of o1, according to OpenAI, and underlies the model’s prowess. Francois Chollet, who was an AI researcher at Google in Mountain View, California, and left in November to start a new company, thinks that the model incorporates a CoT generator that creates numerous CoT prompts for a user query and a mechanism to select a good prompt from the choices. During training, o1 is taught not only to predict the next token, but also to select the best CoT prompt for a given query. The addition of CoT reasoning explains why, for example, o1-preview — the advanced version of o1 — correctly solved 83% of problems in a qualifying exam for the International Mathematical Olympiad, a prestigious mathematics competition for high-school students, according to OpenAI. That compares with a score of just 13% for the company’s previous most powerful LLM, GPT-4o.
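The generate-then-select idea that Chollet describes can be illustrated with a toy example. OpenAI has not published how o1 works, so this is purely a sketch of the general pattern and not of o1 itself: the candidate chains are hard-coded rather than generated by a model, and the scoring rule (the fraction of arithmetic steps that check out) is a hypothetical stand-in for whatever learned selection mechanism a real system might use.

```python
def step_is_valid(step):
    # A step looks like "36 + 4 = 40"; it is valid if the arithmetic checks out.
    left, right = step.split("=")
    a, op, b = left.split()
    a, b, claimed = int(a), int(b), int(right)
    return {"+": a + b, "-": a - b, "*": a * b}[op] == claimed

def score_chain(chain):
    # Hypothetical scorer: fraction of steps in the chain that are correct.
    return sum(step_is_valid(step) for step in chain) / len(chain)

def select_best_chain(chains):
    # Keep the candidate chain of thought with the highest score.
    return max(chains, key=score_chain)

# Hard-coded candidates (assumption) for "What is 12 * 3 + 4?"
candidates = [
    ["12 * 3 = 36", "36 + 4 = 41"],  # slip in the final step
    ["12 * 3 = 36", "36 + 4 = 40"],  # every step is correct
    ["12 + 3 = 16", "16 + 4 = 20"],  # wrong from the first step
]
best = select_best_chain(candidates)
```

Here the second candidate wins because it is the only chain in which every step is correct, showing how sampling several lines of reasoning and choosing among them can outperform committing to the first one produced.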