The biggest bottleneck in large language models

Large language models (LLMs) like OpenAI’s GPT-4 and Anthropic’s Claude 2 have captured the public’s imagination with their ability to generate human-like text. Enterprises are just as enthusiastic, with many exploring how to leverage LLMs to improve products and services. However, a major bottleneck is severely constraining the adoption of the most advanced LLMs in production environments: rate limits. There are ways to get past these rate limit toll booths, but real progress may not come without improvements in compute resources.

Paying the piper

Public LLM APIs that give access to models from companies like OpenAI and Anthropic impose strict limits on the number of tokens (units of text) that can be processed per minute, the number of requests per minute, and the number of requests per day. This sentence, for example, would consume nine tokens.

API calls to OpenAI’s GPT-4 are currently limited to three requests per minute (RPM), 200 requests per day, and a maximum of 10,000 tokens per minute (TPM). The highest tier allows for limits of 10,000 RPM and 300,000 TPM.

For larger production applications that need to process millions of tokens per minute, these rate limits make using the most advanced LLMs essentially infeasible. Requests stack up, taking minutes or hours, precluding any real-time processing.
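To put rough numbers on that backlog, here is a minimal sketch that estimates how long a workload takes to clear a tokens-per-minute cap. The function name and the one-million-token workload are illustrative assumptions. The two TPM values are the limits quoted above. The estimate ignores the separate requests-per-minute and requests-per-day caps, which can only make things slower.

```python
import math


def minutes_to_drain(total_tokens: int, tokens_per_minute: int) -> int:
    """Whole minutes needed to push total_tokens through a TPM cap."""
    if tokens_per_minute <= 0:
        raise ValueError("tokens_per_minute must be positive")
    return math.ceil(total_tokens / tokens_per_minute)


# Hypothetical workload: one million tokens waiting to be processed.
backlog = 1_000_000

print(minutes_to_drain(backlog, 10_000))   # 100 minutes at 10,000 TPM
print(minutes_to_drain(backlog, 300_000))  # 4 minutes at 300,000 TPM
```

An application that generates a million tokens of work every minute never catches up at either limit. The queue simply keeps growing.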

Most enterprises are still struggling to adopt LLMs safely and effectively at scale. But even when they work through challenges around data sensitivity and internal processes, the rate limits pose a stubborn block. Startups building products around LLMs hit the ceiling quickly when product usage and data accumulate, but larger enterprises with big user bases are the most constrained. Without special access, their applications won’t work at all.

What to do?

Routing around rate limits

One path is to skip the rate-limiting technologies altogether. For example, there are use-specific generative AI models that don’t come with LLM bottlenecks. Diffblue, an Oxford, UK-based startup, relies on reinforcement learning technologies that impose no rate limits. It does one thing very well and very efficiently and can cover millions of lines of code. It autonomously creates Java unit tests at 250 times the speed of a developer, and those tests compile 10 times faster.

Unit tests written by Diffblue Cover enable rapid understanding of complex applications, allowing enterprises and startups alike to innovate with confidence. That is ideal for moving legacy applications to the cloud, for example. It can also autonomously write new code, improve existing code, accelerate CI/CD pipelines, and provide deep insight into risks associated with change without requiring manual review. Not bad.

Of course, some companies have to rely on LLMs. What options do they have?

More compute, please

One option is simply to request an increase in a company’s rate limits. This is fine so far as it goes, but many LLM providers don’t actually have additional capacity to offer. This is the crux of the problem. GPU availability is capped by the total number of silicon wafer starts at foundries like TSMC. Nvidia, the dominant GPU maker, cannot procure enough chips to meet the explosive demand driven by AI workloads, where inference at scale requires thousands of GPUs clustered together.

The most direct way of increasing GPU supplies is to build new semiconductor fabrication plants, known as fabs. But a new fab costs as much as $20 billion and takes years to build. Major chipmakers such as Intel, Samsung Foundry, TSMC, and Texas Instruments are building new semiconductor production facilities in the United States. Someday, that will be awesome. For now, everyone must wait.

As a result, very few real production deployments leveraging GPT-4 exist. Those that do are modest in scope, using the LLM for ancillary features rather than as a core product component. Most companies are still evaluating pilots and proofs of concept. The lift required to integrate LLMs into enterprise workflows is substantial on its own, before even considering rate limits.

Looking for answers

The GPU constraints limiting throughput on GPT-4 are driving many companies to use other generative AI models. AWS, for example, has its own specialized chips for training and inference (running the model once trained), allowing its customers greater flexibility. Importantly, not every problem requires the most powerful and expensive computational resources. AWS offers a range of models that are cheaper and easier to fine-tune, such as Titan Lite. Some companies are exploring alternatives like fine-tuning open source models such as Meta’s Llama 2. For simple use cases involving retrieval-augmented generation (RAG) that require appending context to a prompt and generating a response, less powerful models are sufficient.
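The RAG pattern mentioned above, appending retrieved context to a prompt, can be sketched in a few lines. This is a toy illustration under stated assumptions: the `retrieve` and `build_prompt` helpers are hypothetical, and documents are ranked by simple word overlap, whereas production systems typically rank by embedding similarity. The resulting prompt would then be sent to whichever model is in use.

```python
def retrieve(question, documents, top_k=2):
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())

    def overlap(doc):
        return len(q_words & set(doc.lower().split()))

    # sorted() is stable, so ties keep their original order.
    return sorted(documents, key=overlap, reverse=True)[:top_k]


def build_prompt(question, documents, top_k=2):
    """Append the best-matching documents to the prompt as context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


docs = [
    "Rate limits cap tokens per minute.",
    "Fabs take years to build.",
    "Quantization shrinks model weights.",
]
print(build_prompt("What do rate limits cap?", docs, top_k=1))
```

Because the answer is largely contained in the supplied context, the model mostly has to read and rephrase, which is why a smaller model can be enough.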

Techniques such as parallelizing requests across multiple older LLMs with higher limits, chunking up data, and model distillation can also help. Other techniques make inference itself cheaper and faster. Quantization reduces the precision of the weights in the model, which are typically 32-bit floating point numbers. This isn’t a new approach. For example, Google’s first-generation inference hardware, the Tensor Processing Unit (TPU), only worked with models whose weights had been quantized to eight-bit integers. The model loses some accuracy but becomes much smaller and faster to run.
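The quantization trade-off can be shown with a minimal sketch. It assumes a simple symmetric scheme that maps each weight onto an integer between -127 and 127 using a single scale factor. The function names and sample weights are illustrative, and real toolchains use more elaborate schemes (per-channel scales, calibration data, and so on).

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in q_weights]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical 32-bit weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)         # [50, -127, 0, 100]
print(restored)  # close to the originals, but 0.003 has collapsed to 0.0
```

Each weight now fits in one byte instead of four, and the smallest weight has been rounded away entirely. That is the accuracy penalty in miniature.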

A newly popular technique called “sparse models” can reduce the costs of training and inference, and it is less labor-intensive than distillation. You can think of an LLM as an aggregation of many smaller language models. For example, when you ask GPT-4 a question in French, only the French-processing part of the model needs to be used, and this is what sparse models exploit.

You can do sparse training, where you only need to train a subset of the model on French, and sparse inference, where you run just the French-speaking part of the model. When used with quantization, this can be a way of extracting smaller special-purpose models from LLMs that can run on CPUs rather than GPUs (albeit with a small accuracy penalty). The problem? GPT-4 is famous because it’s a general-purpose text generator, not a narrower, more specific model.
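The idea of running only the relevant part of a model can be illustrated with a toy router. Everything here is a hypothetical stand-in: the "experts" are plain functions rather than neural sub-networks, and the router uses hand-written keyword rules, whereas real sparse models learn their routing during training. The sketch only shows the control flow, in which one expert runs per input and the rest stay idle.

```python
from collections import Counter

calls = Counter()  # records which experts actually ran


def make_expert(name):
    """Build a stand-in expert that tags its input and logs the call."""
    def expert(text):
        calls[name] += 1
        return f"[{name}] {text}"
    return expert


EXPERTS = {name: make_expert(name) for name in ("french", "english", "code")}
FRENCH_WORDS = {"bonjour", "merci", "le", "la"}  # toy routing rule (assumption)


def route(text):
    """Pick exactly one expert for this input."""
    lowered = text.lower()
    if FRENCH_WORDS & set(lowered.replace(",", " ").split()):
        return "french"
    if "def " in lowered or "{" in lowered:
        return "code"
    return "english"


def sparse_infer(text):
    """Run only the selected expert; the others do no work."""
    return EXPERTS[route(text)](text)


print(sparse_infer("Bonjour, merci"))  # [french] Bonjour, merci
print(dict(calls))                     # {'french': 1}
```

Extracting a special-purpose model amounts to keeping one of these experts and discarding the rest, which is also why the result is no longer a general-purpose generator.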

On the hardware side, new processor architectures specialized for AI workloads promise gains in efficiency. Cerebras has built a gigantic Wafer-Scale Engine optimized for machine learning, and Manticore is repurposing “rejected” GPU silicon discarded by manufacturers to deliver usable chips.

Ultimately, the greatest gains will come from next-generation LLMs that require less compute. Combined with optimized hardware, future LLMs could break through today’s rate limit barriers. For now, the ecosystem strains under the load of eager companies lined up to tap into the power of LLMs. Those hoping to blaze new trails with AI may need to wait until GPU supplies open further down the long road ahead. Ironically, these constraints may help temper some of the frothy hype around generative AI, giving the industry time to settle into positive patterns for using it productively and cost-effectively.
