To hear companies such as OpenAI, the maker of ChatGPT, tell it, artificial general intelligence, or AGI, is the ultimate goal of machine learning and AI research. But what is the measure of a generally intelligent machine? In 1970 computer scientist Marvin Minsky predicted that soon-to-be-developed machines would “read Shakespeare, grease a car, play office politics, tell a joke, have a fight.” Years later the “coffee test,” often attributed to Apple co-founder Steve Wozniak, proposed that AGI would be achieved when a machine could enter a stranger’s home and make a pot of coffee.
Few people agree on what AGI is to begin with—never mind achieving it. Experts in computer and cognitive science, and others in policy and ethics, often have their own distinct understanding of the concept (and different opinions about its implications or plausibility). Without a consensus it can be difficult to interpret announcements about AGI or claims about its risks and benefits. Meanwhile, though, the term is popping up with increasing frequency in press releases, interviews and computer science papers. Microsoft researchers declared last year that GPT-4 shows “sparks of AGI”; at the end of May OpenAI confirmed it is training its next-generation machine-learning model, which would boast the “next level of capabilities” on the “path to AGI.” And some prominent computer scientists have argued that with text-generating large language models, it has already been achieved.
To know how to talk about AGI, test for AGI and manage the possibility of AGI, we’ll have to get a better grip on what it actually describes.
General Intelligence
AGI became a popular term among computer scientists who were frustrated by what they saw as a narrowing of their field in the late 1990s and early 2000s, says Melanie Mitchell, a professor and computer scientist at the Santa Fe Institute. This was a reaction to projects such as Deep Blue, the chess-playing system that bested grandmaster Garry Kasparov and other human champions. Some AI researchers felt their colleagues were focusing too much on training computers to master single tasks such as games and losing sight of the prize: broadly capable, humanlike machines. “AGI was [used] to try to get back to that original goal,” Mitchell says—it was coinage as recalibration.
But viewed in another light, AGI was “a pejorative,” according to Joanna Bryson, an ethics and technology professor at the Hertie School in Germany who was working in AI research at the time. She thinks that the term arbitrarily divided the study of AI into two groups of computer scientists: those deemed to be doing meaningful work toward AGI, who were explicitly in pursuit of a system that could do everything humans could do, and everyone else, who was assumed to be spinning their wheels on more limited—and therefore frivolous—aims. (Many of these “narrow” goals, such as teaching a computer to play games, later helped advance machine intelligence, Bryson points out.)
Other definitions of AGI can seem equally wide-ranging and slippery. At its simplest, it is shorthand for a machine that equals or surpasses human intelligence. But “intelligence” itself is a concept that’s hard to define or quantify. “General intelligence” is even trickier, says Gary Lupyan, a cognitive neuroscientist and psychology professor at the University of Wisconsin–Madison. In his view, AI researchers are often “overconfident” when they talk about intelligence and how to measure it in machines.
Cognitive scientists have been trying to home in on the fundamental components of human intelligence for more than a century. It’s generally established that people who do well on one set of cognitive questions tend to also do well on others, and many have attributed this to some yet-unidentified, measurable aspect of the human mind, often called the “g factor.” But Lupyan and many others dispute this idea, arguing that IQ tests and other assessments used to quantify general intelligence are merely snapshots of current cultural values and environmental conditions. Elementary school students who learn computer programming basics and high schoolers who pass calculus classes have achieved what was “completely outside the realm of possibility for people even a few hundred years ago,” Lupyan says. Yet none of this means that today’s kids are necessarily more intelligent than adults of the past; rather, humans have amassed more knowledge as a species and shifted our learning priorities away from, say, tasks directly related to growing and acquiring food—and toward computational ability instead.
“There’s no such thing as general intelligence, artificial or natural,” agrees Alison Gopnik, a professor of psychology at the University of California, Berkeley. Different kinds of problems require different kinds of cognitive abilities, she notes; no single type of intelligence can do everything. In fact, Gopnik adds, different cognitive abilities can be in tension with each other. For instance, young children are primed to be flexible and fast learners, allowing them to make many new connections quickly. But because of their rapidly growing and changing minds, they don’t make great long-term planners. Similar principles and limitations apply to machines as well, Gopnik says. In her view, AGI is little more than “a very good marketing slogan.”
General Performance
Moravec’s paradox, first described in 1988, states that what’s easy for humans is hard for machines, and what humans find challenging is often easier for computers. Many computer systems can perform complex mathematical operations, for instance, but good luck asking most robots to fold laundry or twist doorknobs. When it became obvious that machines would continue to struggle to manipulate objects effectively, common definitions of AGI lost their connection to the physical world, Mitchell notes. AGI came to represent mastery of cognitive tasks and, eventually, anything a human could do while sitting at a computer connected to the Internet.
In its charter, OpenAI defines AGI as “highly autonomous systems that outperform humans at most economically valuable work.” In some public statements, however, the company’s co-founder and CEO, Sam Altman, has espoused a more open-ended vision. “I no longer think [AGI is] like a moment in time,” he said in a recent interview. “You and I will probably not agree on the month or even the year that we’re like, ‘Okay, now that’s AGI.’”
Other arbiters of AI progress have drilled down into specifics instead of embracing ambiguity. In a 2023 preprint paper, Google DeepMind researchers proposed six levels of intelligence by which various computer systems can be graded: systems with “No AI” capability at all, followed by “Emerging,” “Competent,” “Expert,” “Virtuoso” and “Superhuman” AGI. The researchers further separate machines into “narrow” (task-specific) or “general” types. “AGI is often a very controversial concept,” lead author Meredith Ringel Morris says. “I think people really appreciate that this is a very practical, empirical definition.”
To come up with their characterizations, Morris and her colleagues explicitly focused on demonstrations of what an AI can do instead of how it can do tasks. There are “important scientific questions” to be asked about how large language models and other AI systems achieve their outputs and whether they’re truly replicating anything humanlike, Morris says, but she and her co-authors wanted to “acknowledge the practicality of what’s happening.”
According to the DeepMind proposal, a handful of large language models, including ChatGPT and Gemini, qualify as “emerging AGI,” because they are “equal to or somewhat better than an unskilled human” at a “wide range of nonphysical tasks, including metacognitive tasks like learning new skills.” Yet even this carefully structured qualification leaves room for unresolved questions. The paper doesn’t specify what tasks should be used to evaluate an AI system’s abilities, how many tasks distinguish a “narrow” system from a “general” one, or how to establish the human skill levels against which machines are compared. Determining the correct tasks to compare machine and human skills, Morris says, remains “an active area of research.”
Yet some scientists say answering these questions and identifying proper tests is the only way to assess whether a machine is intelligent. Here, too, current methods may be lacking. Benchmarks that have become popular for evaluating AI, such as the SAT, the bar exam and other standardized tests designed for humans, fail to distinguish between an AI that regurgitates training data and one that demonstrates flexible learning and ability, Mitchell says. “Giving a machine a test like that doesn’t necessarily mean it’s going to be able to go out and do the kinds of things that humans could do if a human got a similar score,” she explains.
General Consequences
As governments attempt to regulate artificial intelligence, some of their official strategies and policies reference AGI. Variable definitions could change how those policies are applied, Mitchell points out. Temple University computer scientist Pei Wang agrees: “If you try to build a regulation that fits all of [AGI’s definitions], that’s simply impossible.” Real-world outcomes, from what sorts of systems are covered under emerging laws to who holds responsibility for those systems’ actions (is it the developers, the training data compilers, the prompter or the machine itself?), might be altered by how the terminology is understood, Wang says. All of this has critical implications for AI safety and risk management.
If there’s an overarching lesson to take away from the rise of LLMs, it might be that language is powerful. With enough text, it’s possible to train computer models that appear, at least to some, like the first glimpse of a machine whose intelligence rivals that of humans. And the words we choose to describe that advance matter.
“These terms that we use do influence how we think about these systems,” Mitchell says. At a pivotal 1956 Dartmouth College workshop at the start of AI research, scientists debated what to call their work. Some advocated for “artificial intelligence” while others lobbied for “complex information processing,” she points out. Perhaps if AGI were instead named something like “advanced complex information processing,” we’d be slower to anthropomorphize machines or fear the AI apocalypse—and maybe we’d agree on what it is.