AGI and jumping to the New Inference Market S-Curve

The Gist

Evolving rapidly. AI continues to evolve at a rapid pace toward artificial general intelligence (AGI), but we may need a new way to measure whether we have arrived.
Not there yet. Achieving AGI will require strong capabilities in reasoning and logic that go beyond the current capabilities of LLMs.
Expect a slowdown. The exponential growth in large language model parameters and GPU compute power will start to slow, as training data sizes and hardware improvements hit practical limits.
Highly efficient predictions. New entrants in the inference space will enable highly efficient predictions using commodity devices at the edge, enabling a host of new, low-latency use cases.
New focus. Inference will become the new wave of focus as AI applications move from experimentation to production.

Artificial general intelligence (AGI) has been the Holy Grail of AI for many decades. AGI is a form of “strong AI,” defined as AI that can perform as well as or better than humans on a wide range of cognitive tasks. There is much debate over when artificial general intelligence may be fully realized, especially with the current evolution of large language models (LLMs). For many people, AGI is something out of a science fiction movie that remains mostly theoretical. Others believe we have already reached AGI with the latest releases of ChatGPT with GPT-4o and Gemini Advanced.

In Search of Artificial General Intelligence

Historically, we have used the Turing test as the measurement to determine if a system has reached artificial general intelligence. Created by Alan Turing in 1950 and originally called the “Imitation Game,” the test is based on three participants: an interrogator who asks questions of both the machine and the human, the machine or system being tested, and the human who answers the questions alongside the machine for comparison.

The criticism of the test is that it doesn’t measure intelligence or any other human qualities. The foundational assumption that an interrogator can determine if a machine is “thinking” by comparing its behavior with human behavior has a lot of subjectivity and is not necessarily deterministic.

There is also a lack of consensus on whether modern LLMs have actually achieved AGI. In June 2022, reports claimed that Google’s LaMDA had passed the test, but critics quickly dismissed this as an advance in “fooling” people into believing a system is intelligent rather than an advance toward AGI. The reality is that the test has outlived its usefulness.

Ray Kurzweil, a technology futurist, has spent much of his career making predictions on when we will reach AGI. In his recent talk at SXSW, he said he is sticking to his original 1999 prediction that AI will match or surpass human intelligence by 2029.

But how will we know?

A New Way to Think About AGI

Horizontal AI products like ChatGPT, Gemini, Midjourney and DALL-E have given millions of users exposure to the power of AI. To many, these AI platforms seem very smart as they can generate answers, compose songs and write code in seconds.

However, there is a big difference between AI and AGI. These current AI platforms are essentially highly efficient prediction machines because they have been trained on a large corpus of data. However, that does not enable creativity, logical reasoning and sensory perception.

As we move closer to artificial general intelligence, we need an accepted definition of AGI and a framework that truly measures these critical aspects of intelligence such as reasoning, creativity and sentience.

One approach is to consider artificial general intelligence as an end-to-end “intelligence supply chain” encompassing all the capabilities needed to achieve AGI.

We can group the critical components needed for AGI into four major categories as follows:

1. Observations and Learning – the ability to observe an environment and understand actions to gain context, along with the training that imparts knowledge of past behavior into the system.
2. Pattern Matching and Predictions – the ability to match what is happening currently with what has happened in the past and use that to make predictions.
3. Abstractions and Reasoning – the ability to apply parameters and constraints in the decision-making process. Human decision-making frequently weighs connections and interactions across multiple domains.
4. Creativity and Emotions – the ability to generate new ideas that are far removed from what has happened before.

Today’s AI systems are mostly excelling at 1 and 2. For artificial general intelligence to be attained, we will need systems that can accomplish 3 and 4.

Achieving AGI will require advances in algorithms, computing and data beyond what powers the models of today. Mimicking complex human behavior such as creativity, perception, learning and memory will require embodied cognition, or learning from a multitude of senses or inputs. We also need systems and infrastructure that go beyond training.

Image: glass chess pieces arranged on a chessboard. Credit: shahrilkhmd on Adobe Stock Photos.

Human intelligence is heavily based on logical reasoning. We understand cause and effect, deduce information from existing knowledge and make inferences. Reasoning algorithms let a system traverse knowledge representations, drawing conclusions and finding solutions. This goes beyond basic pattern matching, enabling a more humanlike problem-solving ability. Replicating similar processes is fundamental for an AI to achieve AGI.

The timing of artificial general intelligence remains uncertain, but when it arrives, it’s going to impact our lives, businesses and society significantly.

The real power of AI technology is still ahead of us.

Enabling AGI by Shifting Focus to Inference

One of the prerequisites for achieving artificial general intelligence is the capability for AI inference, which is when an AI model produces accurate predictions or conclusions. Much of the computing power today is focused on model training. Model training is the stage when data is fed into a learning algorithm to produce a model. Training enables AI models to make accurate predictions when prompted.

AI can be divided into two major market segments: training and inference. Today, many companies are focused on creating high-performance hardware for data center providers to conduct massive AI model training. For instance, Nvidia controls more than 95% of the specialized AI chip market. It sells to major tech companies like Amazon, Meta and Microsoft, which are believed to make up roughly 40% of its revenue.

However, the market will soon shift its focus to building inferencing infrastructure for generative AI applications. The inferencing market will quickly grow as Fortune 500 companies that are currently testing generative AI applications move into production deployment. New applications will also emerge that will require scale to support workloads across centralized cloud, edge computing and IoT (Internet of Things) devices.

Model training is a very computationally intensive process that takes a lot of time to complete. Inference is usually faster and much less resource-intensive. Inferencing boils down to running AI applications or workloads after models have been trained.
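To make the distinction concrete, here is a minimal sketch in plain Python. It assumes a toy one-variable linear model fitted by gradient descent; the data, learning rate and function names are illustrative assumptions rather than anything from a production system. Training loops over the data thousands of times to find the weights, while inference is a single cheap calculation with those fixed weights.

```python
# Toy illustration: training is iterative and costly, inference is one forward pass.
# All data and hyperparameters below are made up for illustration.

def train(xs, ys, epochs=2000, lr=0.01):
    """Fit y = w * x + b by gradient descent; returns (w, b, update_steps)."""
    w, b = 0.0, 0.0
    steps = 0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
        steps += 1
    return w, b, steps


def infer(w, b, x):
    """Inference: apply the fixed, already-trained weights to a new input."""
    return w * x + b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w, b, steps = train(xs, ys)
prediction = infer(w, b, 10.0)
print(f"trained in {steps} update steps: w={w:.3f}, b={b:.3f}")
print(f"inference for x=10: {prediction:.3f}")
```

Even in this tiny case, training touches every data point on each of its 2,000 passes, while answering a new request takes one multiplication and one addition.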

Inference is going to be 100 times bigger than training. Nvidia is really good at training but is not ideal for inference.

A pivot from training to inference may not be easy.

Dominance in Training Doesn’t Mean Dominance in Inference

Nvidia was founded in 1993, long before the AI craze we see today. The company was not initially focused on supplying AI hardware and software solutions and instead concentrated on creating graphics cards. As the PC market expanded and new applications such as Windows and gaming became prevalent, dedicated hardware became necessary to handle the complicated tasks of 3D graphics processing. An opportunity to create high-performance processing units to support intensive computational operations in the PC and gaming market is not something that comes along very often.

It turns out Nvidia struck gold with its GPU architectures. GPUs are well suited for AI for three primary reasons: they employ parallel processing; their systems scale up through high-performance interconnections, creating supercomputing capabilities; and the software for managing and tuning the stack for AI is broad and deep.

The idea of having separate graphics hardware existed before Nvidia came onto the scene. For instance, the first Atari video game consoles, shipped in the 1970s, had graphics chips inside. IBM had also released the Professional Graphics Controller (PGC, also known as PGA), which used an onboard Intel 8088 microprocessor to do video tasks. Silicon Graphics Inc., or SGI, also emerged as a dominant graphics player in the market in the late 1980s.

Things changed rapidly in 1993 with the release of a 3D game called Doom by game developer Id Software. Doom was the first mature, action-packed first-person shooter game on the market. Quake quickly followed and offered brand-new technical breakthroughs such as full real-time 3D rendering and online multiplayer. This paved the way for the dedicated graphics card market.

Nvidia didn’t immediately rise to fame. Its first product, the NV1, came in May 1995: a multimedia PCI card with graphics, sound and gamepad support. However, the product flopped because the NV1 was not compatible with the leading graphics APIs of the time, such as OpenGL and 3Dfx’s Glide. It wasn’t until the Riva 128, launched in 1997, that the company saw success. At the time of launch, Nvidia had less than six weeks of cash left in the bank!

By the early 2000s, the graphics card market had drastically consolidated from over 30 vendors to just three: Nvidia, ATI and Intel, with Intel taking up the low end. Nvidia popularized the term graphics processing unit, or GPU, and set its sights on the broader compute market.

Finding an Adjacency Was a Big Win

The opportunity to create new businesses in adjacent markets, outside your core business, is not something you see frequently. A shining example is Amazon, an online commerce company that created a cloud computing platform, Amazon Web Services (AWS), from the technology components it built to run a massively scalable commerce platform. Uber, a ride-sharing company, leveraged its backend infrastructure to launch a food delivery service called Uber Eats.

In a similar fashion, Nvidia realized that its graphics processing units (GPUs), which powered many of the graphics hardware boards in PCs and gaming consoles, had another use in accelerating mathematical operations. By investing in making GPUs programmable, it opened up their parallel processing capabilities to a wider variety of applications. This made high-performance computing more readily accessible and able to run on commodity hardware.

Nvidia’s first venture into the high-performance computing (HPC) space came with its CUDA parallel computing architecture, which enabled GPUs to be used for general-purpose computing tasks. This capability helped spark early breakthroughs in modern AI. One early AI application, AlexNet, a convolutional neural network (CNN) used to classify images, was unveiled in 2012. It was trained using just two of Nvidia’s programmable GPUs.

The big discovery was that GPUs could massively accelerate neural network processing, or model training. As this began to spread among computer and data scientists, demand for Nvidia’s GPUs soared. In some ways, the AI revolution found Nvidia.

But that was just the beginning. Nvidia’s relentless pursuit of innovation led to a series of breakthrough architectures starting with the Turing architecture in 2018, which fused real-time ray tracing, AI, simulation, and rasterization to fundamentally change the way graphics processing worked. Turing featured new tensor cores, processors that accelerate deep learning training and inference, providing up to 500 trillion tensor operations per second. Tensor cores are essential building blocks of the NVIDIA solution that incorporates hardware, networking, software, libraries and optimized AI models. Tensor cores deliver significantly faster AI training times compared to traditional CUDA cores alone, which are primarily designed for general-purpose processing tasks and excel in parallel computing.

Nvidia’s rapid rate of innovation spans the Volta, Ampere, Ada Lovelace and Hopper architectures, and now Blackwell. The H100 Tensor Core GPU was the first based on the Hopper architecture, with 80 billion transistors, a built-in transformer engine, advanced NVLink inter-GPU communications and second-generation multi-instance GPU (MIG) technology.

The growth of computational power used to be governed by Moore’s Law, which predicted that transistor counts would double roughly every two years. Nvidia’s new Blackwell GPU caps a run that the company says increased AI computational speed by about a thousand times in just eight years.
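The gap between those two growth rates is easier to see with the arithmetic written out. The short Python sketch below compares a doubling every two years with a thousandfold gain in eight years; the function names are illustrative, and the only inputs are the figures quoted above.

```python
import math


def growth_from_doubling(years, doubling_period_years):
    """Total growth factor if capacity doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


def implied_doubling_period(years, growth_factor):
    """Doubling period, in years, implied by a total growth factor over `years`."""
    return years / math.log2(growth_factor)


moore_growth = growth_from_doubling(8, 2)           # doubling every two years
implied_period = implied_doubling_period(8, 1000)   # 1,000x in eight years

print(f"Moore's Law pace over 8 years: {moore_growth:.0f}x")
print(f"Doubling period implied by 1,000x in 8 years: {implied_period:.2f} years")
```

A two-year doubling yields a 16x gain over eight years, while a thousandfold gain over the same period implies a doubling roughly every 0.8 years, or about every ten months.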

It’s a Difficult Shift to Inference

What’s good for training may not be good for inference.

There is still a limited number of AI applications in production today. Outside of a few large tech companies, very few corporations have advanced to running large-scale AI models in production. As a result, most of the hardware focus has been on optimizing platforms for training.

As the number of AI applications increases, the amount of compute a company uses for running models to respond to end-user requests will increase significantly. This will exceed the cost they’re spending on training today. The focus will then shift to optimizing hardware to reduce inference costs.
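A rough way to reason about that crossover is to compare a one-time training bill with inference spending that accumulates with every request. The sketch below is only an illustration: every dollar figure and traffic volume in it is a hypothetical placeholder, not data from any company.

```python
def breakeven_requests(training_cost, cost_per_1k_requests):
    """Requests served before cumulative inference spend equals the training cost."""
    return training_cost * 1000 / cost_per_1k_requests


def cumulative_inference_cost(requests_per_day, cost_per_1k_requests, days):
    """Total inference spend after `days` of steady traffic."""
    return requests_per_day * days * cost_per_1k_requests / 1000


# Hypothetical figures chosen only to show the shape of the calculation.
training_cost = 1_000_000      # one-time training spend, in dollars
cost_per_1k_requests = 2       # inference spend per 1,000 requests, in dollars
requests_per_day = 5_000_000   # steady production traffic

requests_needed = breakeven_requests(training_cost, cost_per_1k_requests)
days_needed = requests_needed / requests_per_day
first_year = cumulative_inference_cost(requests_per_day, cost_per_1k_requests, 365)

print(f"break-even after {requests_needed:,.0f} requests ({days_needed:.0f} days)")
print(f"inference spend in the first year: ${first_year:,.0f}")
```

Under these assumed numbers, inference spending overtakes the training bill after 100 days of production traffic, and it keeps growing for as long as the application is in use.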

GPUs are well suited for the computational complexity of training. Training workloads are typically split across a small number of tightly interconnected GPUs that exchange data constantly, which makes spreading the work across low-end CPUs to reduce latency unrealistic.

However, this is not true for inference. The model weights are fixed and can easily be duplicated across many machines, so no communication is needed. This makes an army of commodity PCs and CPUs very appealing for applications relying on inference.
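The sketch below illustrates why fixed weights make inference easy to spread out. It assumes a toy model that is just a weighted sum, and the `Replica` class, node names and round-robin dispatcher are hypothetical constructs for illustration. Each replica holds its own read-only copy of the weights and answers requests without talking to any other replica.

```python
class Replica:
    """One machine holding its own read-only copy of the trained weights."""

    def __init__(self, name, weights):
        self.name = name
        self.weights = tuple(weights)  # frozen: inference never updates weights
        self.served = 0

    def infer(self, features):
        """Answer one request using only local state."""
        self.served += 1
        return sum(w * x for w, x in zip(self.weights, features))


def serve(requests, replicas):
    """Round-robin dispatch; each request is handled by exactly one replica."""
    results = []
    for i, features in enumerate(requests):
        replica = replicas[i % len(replicas)]
        results.append((replica.name, replica.infer(features)))
    return results


# Hypothetical weights standing in for the output of a finished training run.
trained_weights = [0.5, -1.0, 2.0]
replicas = [Replica(f"node-{i}", trained_weights) for i in range(3)]

requests = [[1, 2, 3], [4, 5, 6], [0, 0, 1], [2, 2, 2]]
results = serve(requests, replicas)
for name, value in results:
    print(f"{name} -> {value}")
```

Adding capacity in this model means copying the weights to another machine and adding it to the list. Training offers no such shortcut, because the weights change at every step and each worker has to stay in sync with the others.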

The Battle for the Future of AI Computing

New companies like Groq are emerging that have the potential to be serious competitors in the AI chip market. This could pose a threat to Nvidia’s dominance in the AI world.

Today, all the AI giants rely heavily on Nvidia to supply computing cards, mostly for AI training, with smaller demand for inference. The latest product, the H100, is still in high demand, remains costly (about $35,000 each) and achieves inference speeds of only 30-40 tokens per second. Compared with inference, training requires more stringent computing card specifications, especially in terms of memory size, which is approaching 300 GB per card.

Groq’s approach to neural network acceleration is radically different from Nvidia’s. The architecture opts for a single large processor with hundreds of functional units, which significantly reduces instruction decoding overhead. This architecture allows superior performance and reduced latencies, ideal for cloud services requiring real-time inferences.

Groq’s secret sauce is its Language Processing Unit (LPU) inference engine, which is specifically engineered to address the two major bottlenecks faced by large language models (LLMs): compute capacity and memory bandwidth. The LPU systems boast comparable, if not superior, compute power to GPUs and have eliminated external memory bandwidth bottlenecks, enabling faster generation of text sequences.

The realization that computational power was a bottleneck for AI’s potential led to the inception of Groq and the creation of the LPU. Jonathan Ross, who initially began what became the TPU project at Google, started Groq in 2016.

Nvidia remains well entrenched and will likely not be easy to dethrone. However, Groq has demonstrated that its vision of an innovative processor architecture can compete with industry giants.

Enabling High-Performance Inference on Commodity Devices

There are tools emerging for machine learning that enable more efficient inferencing. Developed by Georgi Gerganov (the “GG” in GGML), GGML has emerged as a powerful and versatile tensor library, empowering developers to build and deploy high-performance machine learning applications across a wide spectrum of devices. It is designed to bring large-scale machine-learning models to commodity devices.

GGML is a lightweight tensor library written in C and C++ for running neural networks. This is significant because it’s fast, has no external dependencies, is multi-platform and can be easily ported to devices such as mobile phones. It defines a binary format for distributing large language models (LLMs) and uses quantization, a technique that stores model weights at lower numerical precision so that LLMs can run on consumer hardware with effective CPU inferencing. The goal is to run these big models on the CPU as fast as possible.

The benefit of GGML is that quantized models require fewer resources to run, typically about 4x less RAM and 4x less memory bandwidth, which in turn means faster inference on the CPU.
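The idea behind quantization can be sketched in a few lines of plain Python. This is a simplified illustration of block-wise 4-bit quantization, not GGML’s actual file format or code: the block size of 32, the 2-byte scale and the function names are all assumptions made for the example. Each block of weights is stored as small integers plus one shared scale factor.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to signed integers plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit values
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest else 1.0
    quants = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quants]


def bytes_per_block(block_size, bits, scale_bytes=2):
    """Storage for one block: packed integers plus one scale value."""
    return block_size * bits / 8 + scale_bytes


# Made-up weights for illustration.
weights = [0.12, -0.53, 0.97, -0.08, 0.31, -0.76, 0.44, 0.05]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

block_size = 32                                  # assumed block size
fp16_bytes = block_size * 2                      # 16-bit floats use 2 bytes each
q4_bytes = bytes_per_block(block_size, bits=4)
ratio = fp16_bytes / q4_bytes

print(f"quantized values: {quants}")
print(f"largest rounding error: {max_error:.4f}")
print(f"memory per block: {fp16_bytes} bytes vs {q4_bytes:.0f} bytes ({ratio:.2f}x smaller)")
```

Going from 16-bit to 4-bit values is a 4x reduction on its own; storing a scale per block, as this sketch does, brings the saving to about 3.6x. The price is a small rounding error in each weight, bounded here by half the scale.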

Traditionally, inference is done on centralized servers in the cloud. However, tools like GGML are making it possible to do model inference on commodity devices at the network’s edge. That is critical for low-latency use cases such as self-driving cars.

GGML is empowering AI developers to harness the full potential of machine learning on everyday hardware. It provides an impressive array of features, is an open standard and has been optimized for Apple Silicon. GGML is poised to play a pivotal role in shaping the future of edge computing.

Preparing for Inference-Centric Workloads

The future of AI is undoubtedly headed toward inference-centric workloads. While the training of LLMs and other complex AI models gets a lot of current attention, inference makes up the vast majority of actual AI workloads.

Enterprises should begin to understand how inference works and how it will help enable better use of AI to improve their products and services.
