
IEEE Spectrum “explains large language models, the transformer architecture, and how it all works”


From the electrical/electronic brainiacs at IEEE Spectrum, February 14:

Generative AI is today’s buzziest form of artificial intelligence, and it’s what powers chatbots like ChatGPT, Ernie, LLaMA, Claude, and Command—as well as image generators like DALL-E 2, Stable Diffusion, Adobe Firefly, and Midjourney.
Generative AI is the branch of AI that enables machines to learn patterns from vast datasets and then to autonomously produce new content based on those patterns. Although generative AI is fairly new, there are already many examples of models that can produce text, images, videos, and audio.

Many “foundation models” have been trained on enough data to be competent in a wide variety of tasks. For example, a large language model can generate essays, computer code, recipes, protein structures, jokes, medical diagnostic advice, and much more. It can also theoretically generate instructions for building a bomb or creating a bioweapon, though safeguards are supposed to prevent such types of misuse.

What’s the difference between AI, machine learning, and generative AI?
Artificial intelligence (AI) refers to a wide variety of computational approaches to mimicking human intelligence. Machine learning (ML) is a subset of AI; it focuses on algorithms that enable systems to learn from data and improve their performance. Before generative AI came along, most ML models learned from datasets to perform tasks such as classification or prediction. Generative AI is a specialized type of ML involving models that perform the task of generating new content, venturing into the realm of creativity. 

What architectures do generative AI models use?
Generative models are built using a variety of neural network architectures—essentially the design and structure that defines how the model is organized and how information flows through it. Some of the most well-known architectures are variational autoencoders (VAEs), generative adversarial networks (GANs), and transformers. It’s the transformer architecture, first described in the seminal 2017 Google paper “Attention Is All You Need,” that powers today’s large language models. However, the transformer architecture is less suited for other types of generative AI, such as image and audio generation.

Autoencoders learn efficient representations of data through an encoder-decoder framework. The encoder compresses input data into a lower-dimensional space, known as the latent (or embedding) space, that preserves the most essential aspects of the data. A decoder can then use this compressed representation to reconstruct the original data. Once an autoencoder has been trained in this way, it can use novel inputs to generate what it considers the appropriate outputs. These models are often deployed in image-generation tools and have also found use in drug discovery, where they can be used to generate new molecules with desired properties.
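The encode-compress-decode cycle described above can be sketched in a few lines of numpy. This is a minimal illustrative example, not code from the article: a linear encoder matrix squeezes 4-dimensional inputs into a 2-dimensional latent space, and a decoder matrix reconstructs them, with both trained by plain gradient descent on the reconstruction error. All names and dimensions here are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 4, 2

# Encoder and decoder are just matrices in this linear sketch.
W_enc = rng.normal(scale=0.5, size=(input_dim, latent_dim))
W_dec = rng.normal(scale=0.5, size=(latent_dim, input_dim))

# Toy data lying near a 2-D subspace, so two latent dimensions suffice.
X = rng.normal(size=(256, 2)) @ rng.normal(size=(2, input_dim))

def mse(A, B):
    return float(np.mean((A - B) ** 2))

initial_error = mse(X @ W_enc @ W_dec, X)

lr = 0.01
for _ in range(2000):
    Z = X @ W_enc        # encode: compress into the latent space
    X_hat = Z @ W_dec    # decode: reconstruct from the latent code
    err = X_hat - X      # reconstruction error to be minimized
    # Gradient descent on the mean squared reconstruction error.
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_error = mse(X @ W_enc @ W_dec, X)
```

After training, the reconstruction error has dropped well below its starting value, which is exactly the sense in which the latent space "preserves the most essential aspects of the data." Real autoencoders use nonlinear deep networks rather than single matrices, but the encode-decode structure is the same.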

With generative adversarial networks (GANs), the training involves a generator and a discriminator that can be considered adversaries. The generator strives to create realistic data, while the discriminator aims to distinguish between those generated outputs and real “ground truth” outputs. Every time the discriminator catches a generated output, the generator uses that feedback to try to improve the quality of its outputs. But the discriminator also receives feedback on its performance. This adversarial interplay results in the refinement of both components, leading to the generation of increasingly authentic-seeming content. GANs are best known for creating deepfakes but can also be used for more benign forms of image generation and many other applications.
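The adversarial interplay above can be sketched as a toy training loop. This is an illustrative sketch under simplifying assumptions, not from the article: the "generator" is a one-parameter-pair function G(z) = a·z + b trying to mimic real samples from a normal distribution centered at 3, and the "discriminator" is a logistic classifier D(x) = sigmoid(w·x + c). Each round, the discriminator updates to tell real from fake, and the generator updates using the discriminator's feedback.

```python
import numpy as np

rng = np.random.default_rng(0)

a, b = 1.0, 0.0    # generator parameters: G(z) = a*z + b
w, c = 0.1, 0.0    # discriminator parameters: D(x) = sigmoid(w*x + c)
lr = 0.05

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(2000):
    real = rng.normal(3.0, 0.5, size=64)   # "ground truth" samples
    z = rng.normal(size=64)                # noise fed to the generator
    fake = a * z + b                       # generated samples

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real = sigmoid(w * real + c)
    d_fake = sigmoid(w * fake + c)
    w -= lr * np.mean(-(1 - d_real) * real + d_fake * fake)
    c -= lr * np.mean(-(1 - d_real) + d_fake)

    # Generator step: the discriminator's judgment is the feedback that
    # pushes the fakes toward looking real (non-saturating loss -log D(fake)).
    d_fake = sigmoid(w * fake + c)
    a -= lr * np.mean(-(1 - d_fake) * w * z)
    b -= lr * np.mean(-(1 - d_fake) * w)
```

Over the loop, the generator's offset b drifts from 0 toward the real mean of 3: each side's update is driven by the other's current behavior, which is the refinement-by-adversarial-feedback the paragraph describes.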

The transformer is arguably the reigning champion of generative AI architectures for its ubiquity in today’s powerful large language models (LLMs). Its strength lies in its attention mechanism, which enables the model to focus on different parts of an input sequence while making predictions. In the case of language models, the input consists of strings of words that make up sentences, and the transformer predicts what words will come next (we’ll get into the details below). In addition, transformers can process all the elements of a sequence in parallel rather than marching through it from beginning to end, as earlier types of models did; this parallelization makes training faster and more efficient. When developers added vast datasets of text for transformer models to learn from, today’s remarkable chatbots emerged.
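The attention mechanism at the heart of the transformer can be sketched compactly. In the standard scaled dot-product formulation from the 2017 paper, each position in the sequence is projected into query, key, and value vectors; the query-key similarities become weights saying how much each position focuses on every other, and the whole thing is a handful of matrix products, which is why all positions can be processed in parallel. The toy dimensions and random projection matrices below are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1: where to "focus"
    return weights @ V, weights         # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model))  # toy embeddings for 5 tokens

# Learned projections in a real model; random here, just to show the shapes.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Row i of `weights` is token i's attention distribution over the whole sequence, and every row is computed at once in the single `Q @ K.T` product rather than step by step, which is the parallelism the paragraph credits for faster training.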

How do large language models work?…

…MUCH MORE



