Four and a half years ago I wrote a quite popular article about why current Deep Learning technology is a dead end for Artificial General Intelligence.
It turned out that I was not alone with this view:
“Scale will only get you so far. (…) The biggest breakthroughs in AI are yet to come — and will take more than just chips.” — Demis Hassabis, Google DeepMind CEO
“On the path towards human-level intelligence, an LLM is basically an off-ramp, a distraction, a dead end.” — Yann LeCun, Meta AI Chief Scientist
Complemented by an older, but still relevant quote:
“My view is throw it all away and start again.” — Geoff Hinton, the man behind the backpropagation used in most DL models today, speaking about his most important contribution
So how does that compare to the claims of popular AI figures (especially those connected somehow to a single company) that we will soon have superintelligence?
And what have the additional years of hard practical work between those articles taught me first-hand?
I can’t wait to share it with you — and believe me this is not something you can read about anywhere else… So stay with me till the end.
Do you know why training large Deep Learning models is so expensive and slow? Why can people learn very fast and those models still require so many examples and have the electricity requirements of a city?
Because they perform lossy data compression, reusing the same neurons for multiple tasks. When learning a new concept, adjustments therefore have to be delicate and slow so as not to disrupt the whole system.
Imagine adjusting millions of tiny little control knobs every second, just by a fraction of a millimeter to find out how much better or worse the outcome is — this is how it works.
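To make the knob analogy concrete, here is a minimal sketch (a toy one-parameter example invented purely for illustration, not how any production model is trained) of gradient descent nudging a single parameter by tiny amounts:

```python
# One "knob" (w) adjusted by tiny fractions, thousands of times,
# in whichever direction reduces the error a little.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x                        # the "concept" to be learned: y = 2x

w = 0.0                            # a single knob; real models have billions
learning_rate = 0.01               # deliberately small, so no single update
                                   # disrupts what was already learned
for step in range(1000):
    error = w * x - y
    gradient = 2 * np.mean(error * x)   # d(mean squared error)/dw
    w -= learning_rate * gradient       # nudge the knob by a fraction

print(round(w, 3))                 # approaches 2.0 only after many tiny steps
```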
Compression happens when instead of memorizing every single value, you combine them as a single representation.
For illustration purposes look at the photos below and imagine that instead of storing a value for each pixel, you can store a single value for a group of similar pixels and reduce the needed storage space.
Computer scientists learned long ago that storing images as raw bitmaps is not the most efficient way, and that is how we got JPEG and PNG, the most popular compressed image formats. The same happened for sound and video with MP3, MP4, and AVI files.
When you train a foundational model you use terabytes/petabytes of data and try to fit it into a model with much less memory capacity. Because of that forced bottleneck, the model (automatically as a byproduct of compression) learns what has to remain and what can be skipped to still perform its task reasonably well.
So you take thousands of examples of something, show them to the model thousands of times to teach it how to compress its representation to keep it useful. This is the essence of backpropagation.
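As an illustration of that forced bottleneck, here is a hedged sketch of a tiny autoencoder (sizes and data are invented; PyTorch is used only for convenience): the network must push 64-dimensional inputs through an 8-dimensional code, so backpropagation itself decides what survives the compression.

```python
import torch
import torch.nn as nn

data = torch.randn(1024, 64)                 # stand-in for a huge training corpus

model = nn.Sequential(
    nn.Linear(64, 8),    # the bottleneck: far less capacity than the input
    nn.ReLU(),
    nn.Linear(8, 64),    # reconstruction from the compressed representation
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):                     # show the same examples many times
    reconstruction = model(data)
    loss = nn.functional.mse_loss(reconstruction, data)
    opt.zero_grad()
    loss.backward()                          # backpropagation adjusts the "knobs"
    opt.step()
```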
What is wrong with it?
Nothing, as long as your goal is to build a tool extracting knowledge and wisdom from human reasoning — I believe there is currently no better tool for that task.
But don’t confuse extraction of pre-reasoned pieces of data (textual Q&A, labeled images, etc.) with intelligence…
There is none. There is only a blending of pre-reasoned patterns of information, extracted from people.
Deep Learning mimics our unconscious mind (a.k.a. System 1), which stores and retrieves the information we encounter in our lives. When you know what has to be memorized (you get it from human reasoning in the training dataset) — it is really good at the job.
But using a Deep Learning model to decide what to memorize (to provide the answer itself) only works up to the level of knowledge that model has extracted from people. When you ask it for something very different from the training dataset, it fails miserably.
To reason and learn from that reasoning, an AI system has to have access to a (hypothesis) testing environment: create an idea, test whether it works in simulated or real conditions, and update the knowledge base (a simplified description of System 2). With the current setup, you instead repeat the same message many times while making sure that other precious pieces of information are not disrupted…
An example of such an evaluation procedure can be a software development environment — where an AI agent can write different variants of software and test it in terms of accuracy, performance, or memory consumption — where the results are numerical and can be easily evaluated as better or worse. For much more complex scenarios, when the ‘good answer’ is subjective — the testing environment has to contain multiple modalities, and we will discuss it later in the article.
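A minimal sketch of such a loop might look like the following (the `generate_variant` function is a hypothetical stand-in for a code-generating agent; the metrics here are just correctness and wall-clock time):

```python
import random
import time

def generate_variant(seed: int):
    """Hypothetical stand-in for an AI agent proposing an implementation."""
    random.seed(seed)
    if random.random() < 0.5:
        return lambda xs: sorted(xs)                  # straightforward variant
    return lambda xs: sorted(xs, reverse=True)[::-1]  # correct but wasteful variant

def evaluate(impl, test_input):
    """Numerical evaluation: correctness plus execution time."""
    start = time.perf_counter()
    result = impl(test_input)
    elapsed = time.perf_counter() - start
    return result == sorted(test_input), elapsed

test_input = list(range(50_000, 0, -1))
candidates = [generate_variant(seed) for seed in range(4)]
scores = [(evaluate(c, test_input), i) for i, c in enumerate(candidates)]
best = min((s for s in scores if s[0][0]), key=lambda s: s[0][1])
print("best variant:", best[1], "time:", round(best[0][1], 4), "s")
```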
How AI can learn faster and without human participation (continual learning)? We need a way to encode new information in a useful way (we can use the same model, other Deep Learning models, or alternative techniques) and store that encoding in a memory external to the base model. To prevent memorization of every data sample, we can use unsupervised and self-supervised learning techniques (while storing only new samples, when they differ significantly from what was already learned) that allow us to benefit from both short-term (external memory) and long-term (base model parameters) memory.
As a result: we can learn things both really fast and slowly integrate them with the base model, using various available methods (e.g. replay methods).
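Here is a rough sketch of that idea, under heavy simplification (the encodings, thresholds, and class names are invented for illustration): new encodings go into a fast external memory only when they are novel enough, and stored items can later be replayed into the base model.

```python
import numpy as np

class ExternalMemory:
    """Short-term store sitting next to a slowly-updated base model."""
    def __init__(self, novelty_threshold: float = 0.2):
        self.items = []
        self.novelty_threshold = novelty_threshold

    def maybe_store(self, encoding: np.ndarray) -> bool:
        # store only if the sample differs enough from what is already known
        if self.items:
            nearest = min(np.linalg.norm(encoding - m) for m in self.items)
            if nearest < self.novelty_threshold:
                return False
        self.items.append(encoding)
        return True

    def replay_batch(self, size: int):
        # samples handed back for slow integration into the base model
        idx = np.random.choice(len(self.items), size=min(size, len(self.items)),
                               replace=False)
        return [self.items[i] for i in idx]

memory = ExternalMemory()
for vec in np.random.randn(100, 16):
    memory.maybe_store(vec / np.linalg.norm(vec))     # fast, one-shot learning
batch = memory.replay_batch(32)                       # later: replay into the base model
```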
So as humanity, we already know how to make AI learn fast with already existing methods, but do we know how to optimally define what to learn in an automated way?
Raise your hand if you have used an LLM already for some time and have not received pure nonsense as an output even once.
In the art of blending pre-reasoned information patterns, LLMs do their best even when they are significantly undertrained in some areas and have no clue what they are talking about, which is where hallucinations come from.
In my previous article, I wrote that brains are not initialized randomly. And we humans share semantic maps, meaning that our brains have neurons processing similar types of data in similar brain regions.
In our heads, during childhood, neural progenitor cells create scaffolds along which neurons and glial cells arrange themselves into formations that enable 3D information flow between the senses and various components of the brain, shaping the semantic map of our knowledge.
Narrow AI systems are great at mapping individual “routes” of cognitive information and combining them (input-to-output mappings if you prefer).
They are much worse at building a “map” to define completely new and previously unseen (or not experienced yet) routes. To understand what they do not yet understand.
Or in simpler terms: they would benefit from checking whether what they are planning to do makes any sense in the first place.
Algorithms like Q* (which, according to some Chief Hype Officers on the Internet, will bring AGI soon…) aim to combine a set of internally generated rationales with algorithms that teach AI to move optimally through the space of potential possibilities to generate better responses.
I see some problems here:
a) imagine that your system has 90% accuracy when measured on single tasks, now chain them together in a sequence 5–10 times and this accuracy will quickly deteriorate:
- 0.9 x 0.9 = 0.81
- 0.9 x 0.9 x 0.9 = 0.729
- 0.9 x 0.9 x 0.9 x 0.9 = 0.6561
- 0.9 x 0.9 x 0.9 x 0.9 x 0.9 = 0.5905
- 0.9 x 0.9 x 0.9 x 0.9 x 0.9 x 0.9 = 0.5314
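The same arithmetic, written out so it can be extended to longer chains:

```python
step_accuracy = 0.9
for chain_length in range(2, 11):
    print(chain_length, "steps ->", round(step_accuracy ** chain_length, 4))
# a 10-step chain of 90%-accurate steps succeeds only ~34.9% of the time
```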
You know where I am going. When you rely only on your incomplete knowledge, you quickly become a victim of the Dunning-Kruger Effect, where instead of trying to fill the gaps in your knowledge, you try to apply the wrong information immediately.
On the contrary, the best people in specific areas try to understand the shortcomings of their knowledge and fill in the blank spots in the cognitive map to be better prepared for the challenges ahead.
b) now, being at the level of 90% accuracy, how hard is it to get to 99%, or better, 99.99% accuracy? It is just about 1/10 of the effort, right?
But when you look at it from a different perspective, not how many times you are right but how often you are wrong, things look a bit different.
- for 90% accuracy, you are wrong in 10% situations — let’s define it as 0.1
- for 99% accuracy, you make mistakes in 1% of cases — it will be 0.01
- for 99.99% accuracy, it is 0.01% — that is 0.0001
You need to reduce the error rate by a factor of 10x (0.1 / 0.01 = 10) to go from 0.1 to 0.01, and by 1,000x (0.1 / 0.0001) to reach 0.0001, the level of 99.99% accuracy. And if we want to make AI part of our society, we should not stop there: to minimize the risk of errors when technology drives our cars or makes critical decisions, 99.9999% accuracy would require a 100,000x improvement (0.1 / 0.000001)…
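Expressed as a quick calculation of error-rate reductions relative to the 90% baseline:

```python
baseline_error = 1 - 0.90
for target_accuracy in (0.99, 0.9999, 0.999999):
    target_error = 1 - target_accuracy
    print(f"{target_accuracy:.6f} accuracy -> {baseline_error / target_error:,.0f}x less error")
# 0.99 -> 10x, 0.9999 -> 1,000x, 0.999999 -> 100,000x
```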
In my opinion, it is not reasonable to shuffle already available data endlessly for months, burning billions of dollars for training models that will just mix pre-reasoned knowledge patterns to store them effectively, hoping that AGI will emerge on the other end…
Instead, we should build reliable cognitive maps for AI — hand in hand with AI, creating a form of hybrid intelligence.
AI in its highly condensed description is a technology for extracting human skills, knowledge, and wisdom from data.
Why not organize it in a controlled way?
The relationships between concepts can be mapped with knowledge graphs, and the space of potential solutions can be mapped in 2D or 3D from high-dimensional vectors (embeddings and pre-computed cached neural representations) that encode the meaning of specific concepts.
With such a map, you can organize knowledge hierarchically, grouping similar concepts, and making sure that your AI understands the concept at all levels of abstraction.
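A hedged sketch of those two ingredients together (the concepts, relations, and embeddings below are placeholders invented for illustration; a real system would use learned embeddings):

```python
import numpy as np
import networkx as nx

# explicit relations between concepts
kg = nx.DiGraph()
kg.add_edge("dog", "mammal", relation="is_a")
kg.add_edge("mammal", "animal", relation="is_a")
kg.add_edge("dog", "bark", relation="can")

# high-dimensional embeddings (random stand-ins for real model outputs)
concepts = list(kg.nodes)
embeddings = np.random.randn(len(concepts), 128)

# project to 2D with plain PCA (via SVD) so the "map" can be inspected
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T
for concept, (x_pos, y_pos) in zip(concepts, coords_2d):
    print(f"{concept:>8}: ({x_pos:+.2f}, {y_pos:+.2f})")
```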
You can now recognize that this is the opposite of the compression process inside a Deep Learning model. You start with a structure that contains only signal for learning and reasoning, with no noise to filter out (filtering out what is irrelevant is precisely how models compress information).
This knowledge representation is fully interpretable, able to be evaluated by people, and modified in real time. Explainable and secure.
Having this knowledge representation can help AI to:
- quickly discard what is wrong, and forget what is irrelevant during training
- make sure (similar to conscious effort) that what it outputs makes sense according to its knowledge base held in a cognitive map
Current Large Models suffer from being good only when focused on narrow tasks; they lack the big picture. These problems are patched with multiple iterations of generating, refining, summarizing, and evaluating the outputs (how economically inefficient, and how bad for our planet, will that be at a larger scale of deployment?).
With a proper knowledge representation, the model could generate what is right (at least by our understanding and expectations) from the very beginning.
Now Knowledge Graphs are not equal to Semantic Maps, and especially Cognitive Maps, but that is the topic for another article. We will touch on it briefly in the next section.
With a Cognitive Map, the AI (and its users) understands the empty spots in the knowledge and the ways how to validate the plans and improve before taking action.
So let’s not leave the accuracy of AI to chance, but work actively to enhance it.
A cognitive map requires a world model: a structure that holds information about the world. It can be a knowledge graph, a virtual 3D environment, or any other model rich enough that a cognitive map can be built through interactions with it.
The world model represents the environment, various locations, objects, and their properties, agents (understood as people, animals, or machines able to act autonomously), rules or laws, potential actions to be made, etc.
The world model also contains a model of an agent using it to interact with the world. It helps to understand what is possible and what is not, and what to expect from other autonomous entities and the surrounding environment.
So in its simplest form it can be a graph with properties; that would help to solve some problems but would be too simple for others, where a richer and more interactive environment would be more beneficial.
It is the thing an autonomous entity can use to think before acting. It is what can be used to reason about complex plans and predict the outcomes.
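In code, a toy version of that "graph with properties plus simulation" could look like this (the locations, objects, and rules are all invented for illustration):

```python
import copy

class WorldModel:
    """A deliberately tiny world: locations, objects, one agent, simple rules."""
    def __init__(self):
        self.locations = {"kitchen": {"cup"}, "office": set()}
        self.agent_at = "office"
        self.carrying = set()

    def possible_actions(self):
        moves = [("move", loc) for loc in self.locations if loc != self.agent_at]
        picks = [("pick_up", obj) for obj in self.locations[self.agent_at]]
        return moves + picks

    def simulate(self, action):
        """Predict the next state without acting in the real world."""
        future = copy.deepcopy(self)
        kind, target = action
        if kind == "move":
            future.agent_at = target
        elif kind == "pick_up":
            future.locations[future.agent_at].discard(target)
            future.carrying.add(target)
        return future

world = WorldModel()
for action in world.possible_actions():
    predicted = world.simulate(action)          # thinking before acting
    print(action, "-> at", predicted.agent_at, ", carrying", predicted.carrying)
```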
And of course, you can count on a rich, complex, and accurate world model automagically emerging from data scraped from the Web and Reddit communities, inside the billions of parameters of a Large Model that no one will ever be able to fully grasp, influence, or control…
But why do that?
I took the harder path in my life many times, but only if it was reasonable.
This is not. The impact of AI technology, the associated cost paid directly or indirectly by all people — billions and then trillions spent on training and inference of AI — just begs for something safe, effective, interpretable, and reliable.
Deep Learning does not have these properties (yet).
External world models and cognitive maps can bring those features much faster and in a much better way than current approaches.
Another aspect is that cognition and processing information is only one part of being intelligent — as it deals with internal operations. The second is to truly understand the physical world — an environment external to the agent.
We, people, use language, books, photos, and videos only as representations of some ideas and concepts.
What happens in our minds is a recreation of richer information patterns from those representations. This is why people are affected differently by various pieces of content, food, music, or events. They can bring different feelings and thoughts to each of us.
You can describe something, you can show it on video — but only those who experienced something similar before will be able to fully recreate what you mean.
This is because of our embodied multi-sensory experience of life.
Now, it will take some time for robots to be able to share the world with us. I was involved in this space for a couple of years.
But what we are becoming really good at as a species, is simulating the world(s) in virtual environments.
Within these environments, AI agents can experience and learn similar things to what we do in our world.
Like the AlphaGo family of models, they can ‘imagine’ various scenarios and learn from them, even when no one has told them anything about those scenarios.
And these knowledge pieces can have completely transparent definitions.
Even if something has to be opaque to be effective, it is possible to train a parallel surrogate model that reflects what is inside its counterpart.
In these virtual environments, agents have the chance to have experiences and learn from them.
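A minimal sketch of learning purely from simulated experience, loosely in the spirit of self-play systems (the toy environment and reward are invented; no labels or human demonstrations are involved):

```python
import random

def simulate_episode(first_action: int, horizon: int = 10) -> float:
    """Toy simulator: reward accumulates when the agent stays near a hidden target."""
    target = 7
    position = first_action
    reward = 0.0
    for _ in range(horizon):
        position += random.choice([-1, 0, 1])     # noisy dynamics
        reward += 1.0 if abs(position - target) <= 1 else 0.0
    return reward

action_values = {}
for action in range(10):                          # 'imagine' each opening move many times
    rollouts = [simulate_episode(action) for _ in range(500)]
    action_values[action] = sum(rollouts) / len(rollouts)

best = max(action_values, key=action_values.get)
print("learned preference for opening action:", best)   # converges near the hidden target
```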
Many people think about consciousness (or awareness) as something mystical or powered by unexplainable ‘quantum’ phenomena. As I can’t be sure that my explanation of how it emerges is correct or even near the truth, I will limit myself to defining what it does instead:
Awareness or consciousness is a construct that simplifies (generalizes) the state of the internal and external worlds for an agent.
The atoms we experience every day do not have color, taste, or smell — it is our brain that defines these and other properties. At the same time, it filters out irrelevant information allowing us to focus on what needs our attention — to help us navigate the complexities of life.
If there is a problem in our body that we need to take care of, it sends a pain signal. If there is a need to be satisfied, it sends a pleasure signal to reinforce that behavior in the future. If all needs are satisfied, curiosity and exploration are triggered.
It is a construct that allows us to perceive and make sense of the world, imagine things, and plan, similar to a virtual environment.
Animals see and hear things differently. Maybe machines will also do the same. But they will strongly benefit from having a virtual environment to interact with, before making any decisions.
Reading this section you may think that it is necessary to equip AI with power stations for rendering, but this is not the case. Reasoning and simulating can happen inside the neural network, without rendering the view (watch the presentation here if you are interested in learning more).
All concepts mentioned in this article make AI systems more complex, not necessarily in terms of confusing technologies, but in the number of integrated modules.
Models trained with backpropagation are self-contained and optimized to perform tasks similar to what has been experienced during training procedures.
How to manage even more complex, modular architectures?
In recent times, especially this year (2024), we could read about and use different approaches to a more flexible use of the resources available to AI solutions, under the common name Mixture of X (something).
We have a Mixture of Experts (models), Depth (depth of computation in the model), Memory Experts (model parts that can be used), or Agents (multiple cooperating AI agents).
But once again we use good old Backpropagation to teach models how to manage these resources: which expert to load, how deep to go in computation, and so on. And it works, as long as we deal with something the models have been trained to deal with, on many examples.
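For reference, here is a stripped-down sketch of the Mixture-of-Experts pattern (dimensions, expert count, and the dense routing are simplifications chosen for clarity; production systems route sparsely): a gating network, itself trained with backpropagation, decides how much each expert contributes.

```python
import torch
import torch.nn as nn

class TinyMixtureOfExperts(nn.Module):
    def __init__(self, dim: int = 32, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)       # learns which expert to trust

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)                # (batch, experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, dim, experts)
        return (outputs * weights.unsqueeze(1)).sum(dim=-1)

moe = TinyMixtureOfExperts()
y = moe(torch.randn(8, 32))
print(y.shape)    # torch.Size([8, 32]); the routing is only as good as its training data
```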
You may be surprised by this, but neurons may not be the ones running the show in the brain.
In section #2 I already mentioned glial cells, which basically create the environment for neurons to operate in. For many years they were considered just filler, maintenance workers keeping neurons fed and healthy, but they may have a much more important role.
You see, at the level where neurons connect and exchange information (the tripartite synapse), there are actually three players: two neurons and an astrocyte (a type of glial cell).
Astrocytes exchange neurotransmitters with other members of the tripartite synapse — as a result, they can potentially lower the activation threshold of the connection and probe the interactions to propagate that information as context.
Different types of glial cells (or glia in short) take care of neurons, heal them, and feed them like caring parents. Neurons are stimulated electrically and can send various signals fast over longer distances, while astrocytes are stimulated chemically — slower and for a longer time.
We can view them as big-picture context holders that deliver blood and nutrients to the right places and gather information along the way about what will happen next.
In real life with continuous experience, we don’t switch context every three seconds like some AI tools (or at least most of us…) — so astrocytes can detect signals of specific neural resources being needed and facilitate the enhanced retrieval of the right data, at the right time.
You can view this as astrocytes using neurons to send information faster, through electrical impulses, while they themselves focus on the high-level context and state.
Of course, they are not alone: working memory in the brain emerges from interactions between structures such as the thalamus, the hippocampus, the prefrontal cortex, and other areas of the cerebral cortex.
We also have supportive components that allow us and animals to manage cognitive or instinctive processes.
The key information is that there is more to the brain than a single unified mechanism processing information from input to output along the same path.
In the brain, the only area that produces new neurons in adulthood is the hippocampus. It is responsible for establishing episodic memory: our ability to remember sets of information occurring together (“events”), forming our memories.
Semantic memory, which gives meaning to things, objects, and concepts, is distributed over the whole cerebral cortex, the most characteristic part of the human brain. It exchanges information with the thalamus (a ‘router’ of information between the senses and various parts of the brain, which also helps us form more complex representations).
We have organs that define what is relevant (usually new, surprising, or bringing intense emotional reactions), manage our emotional responses, allow us to orientate in physical environments, and so on.
The cerebellum is a relatively small part of the brain, located at the back of our heads, and it is where most of our neurons are actually found. It helps us perform activities with precision, without the risk of disruption from further adaptation of the connections between neurons. You can think of it as a place where highly accurate, sufficiently trained neural circuits are stored and not overwritten.
People often think that the seat of intelligence is the cerebral cortex; it is what makes us different from most animals. But it is not what makes us autonomous. It makes us able to remember a larger number of more complex ideas.
Deep Learning aims to replicate the features of exactly this part of our brain. One of its flavors, Reinforcement Learning, tries to mimic the reward mechanism that motivates our actions. But many more signals govern our autonomous behavior.
My point in this chapter was to paint a clear picture: it is not about scale, because we already have huge models.
It is about the modular composition that allows animals to do their jobs effectively, and us to do some additional intellectual work too.
This modular composition requires multi-signal learning. After all, we do not always respond to the same events in the same way.
When we are hungry — we react to food differently, than when our needs are satisfied and we have something important to do.
When we have time to think about the optimal solution, we tend to think differently than in situations when the decision just has to be made in the next 1–2 seconds.
In section #5 I mentioned how context is potentially managed in the brain, where different areas are ‘activated’ more, depending on the predictions about the next events.
In our heads we have neuromodulators: multiple substances that influence the behavior of neurons (and probably glial cells) across whole brain areas, not only at single synapses like neurotransmitters.
Supportive modules of our brains often have specific neuromodulators that are especially active there, e.g. acetylcholine in the hippocampus, or dopamine in the striatum, where rewards and motivation are managed.
Our lives are not only about remembering things, performing tasks, or maximizing the reward (satisfaction or pleasure).
They are about knowing how to respond to an unpredictable world, socializing with others, knowing when to focus on internal thoughts and when on the inputs from the external world.
It is not about using our mouths to blurt out the first things that come from our unconscious mind, as current LLMs seem to do.
Multi-signal learning, by using more signals to learn from, can teach AI agents how to respond in specific situations and even to better understand human emotions (!).
Imagine a situation where we teach an AI agent that in specific situations it should spend more time seeking an optimal answer, not through prompts or any tricks, but internally.
In other cases, it should ask other agents (people included) or information sources, to ‘feel’ the need to interact with them.
Or when there is a low battery level (or computational limit close to being reached) — act with the least steps necessary.
All those elements are not possible, when the only thing you teach AI is to map the association between input and output, through backpropagation.
You don’t run to the nearest food or yell “burger!” every time you see it. You adjust your mind’s and body’s response to the current situation — your level of hunger, the people nearby, and your current plans.
We need exactly this complexity of possible responses. And it is achieved with various timescales of memory and multiple learning signals.
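As a toy illustration of the idea (all the signal names and thresholds below are invented), several internal signals rather than a single reward can decide how the same request is handled:

```python
from dataclasses import dataclass

@dataclass
class InternalSignals:
    battery: float         # 0.0 (empty) .. 1.0 (full)
    time_budget_s: float   # seconds available before an answer is needed
    confidence: float      # agent's own estimate of its competence for this task

def choose_mode(signals: InternalSignals) -> str:
    if signals.battery < 0.15:
        return "minimal_steps"           # conserve resources, act directly
    if signals.confidence < 0.5:
        return "ask_other_agents"        # 'feel' the need to consult others
    if signals.time_budget_s > 30:
        return "deliberate_internally"   # spend more time seeking an optimal answer
    return "answer_now"

print(choose_mode(InternalSignals(battery=0.9, time_budget_s=120, confidence=0.4)))
# -> ask_other_agents
```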
More about it in the next section.
So we know that specific signals (neuromodulators) can drive the activity of the brain’s modules. But it is not the only function of these elements.
Our consciousness, our self, needs a generalized state of the external world and of the internal environment in order to respond to anything.
We don’t need to know which cells in our stomach have a problem to learn that what we just ate was not good for us — we feel pain and reduced happiness.
In challenging moments, we feel stress, increased heart rate, and adrenaline flow to ensure the focus and action necessary to move past this obstacle.
In times when everything is OK and we can just enjoy life, we feel a surge of satisfying substances that allow us to focus on sensory experiences in the nearby environment, without overthinking.
When the challenge is intellectual, our brain mobilizes a larger set of its components to help us solve it, and information we could not recall a moment before magically appears in front of our imagination.
Those elements do not only affect activity at the level of specific modules; they also affect our highest-level executive center: the prefrontal cortex and consciousness.
Contrary to what people often think, we don’t make decisions based on logical reasoning but through our feelings.
And those feelings are state generalizations: how do we feel about the current situation and what will probably come next? Are we excited, or would we rather not continue what we are doing?
It does not mean that thoughts are not involved, but in the end, we just do what ‘feels right’.
All those components that our intelligence is composed of are constantly interacting with one another, from the low-level details to the high-level executive center we call ‘self’.
To make that happen it is not only necessary to know when to load which information, but also how to make decisions — including which variant of the plan to choose.
What we see, hear, or think feeds the decision-making process that is based on the state generalization, or in simple words: feelings.
These kinds of elements can be quantified (expressed in numbers) in AI systems, as long as they are simulated in the virtual environment.
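Staying with that framing, here is one possible (entirely invented) way to express a generalized state as numbers: several predicted signals collapse into a single valence score per plan variant, and the agent picks what ‘feels right’.

```python
def valence(predicted_state: dict) -> float:
    """Collapse a predicted internal/external state into one feeling-like score."""
    return (predicted_state["expected_reward"]
            - 2.0 * predicted_state["expected_pain"]
            - 0.5 * predicted_state["expected_effort"]
            + 0.3 * predicted_state["novelty"])       # curiosity when needs are met

plan_variants = {
    "eat_now":     {"expected_reward": 0.8, "expected_pain": 0.0, "expected_effort": 0.2, "novelty": 0.0},
    "finish_work": {"expected_reward": 0.6, "expected_pain": 0.1, "expected_effort": 0.6, "novelty": 0.1},
    "explore":     {"expected_reward": 0.3, "expected_pain": 0.0, "expected_effort": 0.4, "novelty": 0.9},
}

chosen = max(plan_variants, key=lambda name: valence(plan_variants[name]))
print(chosen, {name: round(valence(state), 2) for name, state in plan_variants.items()})
```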
My point in this article was to share that building AGI and later ASI with the evolution of LLMs is highly improbable, or even impossible.
We should also not aim to replicate exactly how the brain works — but instead what it does (the functions).
And if you have read the text up to this point, would you agree that it does something different from ChatGPT or any other language model?
It has many more modules interacting with one another all the time, more signals to learn from, more context-management techniques, and us at the center of it all.
A conscious agent, aware of its surroundings — using a data representation created in the brain, to make sense of colorless, soundless, and tasteless atoms around.
At some point in the past, people believed they were the center of creation.
We can be proud, pouring trillions of dollars into training LLMs, or we can be humble, knowing what is lacking in the pursuit of AI, and that we have an example of general intelligence to be inspired by.
Nature can be understood, the brain can be understood at the level that will allow us to recreate intelligence, and not just a model of pre-reasoned information patterns.
We will also create agents highly aware of the situation in their environment. Will they be like us? I don’t know. But I know that we can create some kind of environment allowing them to explore their ideas and learn without us — while keeping their artificial minds more transparent than any human being before.
We can allow them to understand us and the world much better than they could from a text description alone.
Now, at the end of this article, I would like to conclude with something that can be surprising to you…
The creation of Artificial General Intelligence (human-level intelligence) and Artificial Super Intelligence (above human level) will be much harder than some people think and will happen sooner than others think.
If we define (correctly) the current AI stage as the skill and knowledge extraction systems from humanity — then why not tighten the cooperation to form Hybrid Intelligence?
As a person deeply passionate about both biological and artificial intelligence — I know that creating AGI and ASI can take decades or centuries if we take the wrong direction and only years if we choose the right one.
Imagine a Hybrid Intelligence system based on:
- the same foundational models on all devices, allowing for collective machine learning: what is learned by one agent, AI or human, can be shared with the rest of the community almost instantly in the form of an external memory component (a rough sketch follows below)
- full interpretability of the system (either through sparsely activated components or surrogate models)
- tools allowing people to think better with AI support, and AI to learn from how people think
- privacy: local data processing, where you decide what is shared with the community and what you can get in exchange
- participatory economy: where your data as a carrier of value brings money to you, by helping AI and robots to create value
This kind of system combines the intelligence of humanity, automates its growth with AI features, and equips AI with the highest signal/noise ratio information possible.
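To give the shared-memory bullet point above a concrete shape, here is a very loose sketch (every name, the scoring rule, and the privacy flag are assumptions for illustration only) of a shared external memory that local agents publish to, so what one agent learns becomes retrievable by the community:

```python
import numpy as np

class SharedMemory:
    """Community-wide store of encodings contributed by local agents."""
    def __init__(self):
        self.entries = []                              # (contributor, encoding) pairs

    def publish(self, contributor: str, encoding: np.ndarray, share: bool) -> None:
        if share:                                      # the privacy decision stays local
            self.entries.append((contributor, encoding))

    def retrieve(self, query: np.ndarray, k: int = 3):
        ranked = sorted(self.entries, key=lambda e: -float(query @ e[1]))
        return ranked[:k]

community = SharedMemory()
for agent_id in range(5):
    learned = np.random.randn(64)
    learned /= np.linalg.norm(learned)                 # unit-norm encoding of new knowledge
    community.publish(f"agent-{agent_id}", learned, share=(agent_id != 3))

query = np.random.randn(64)
query /= np.linalg.norm(query)
print([name for name, _ in community.retrieve(query)])
```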
With all our thinking about how clever we are, the truth is that to reach the 99.99% and higher levels of accuracy (the 1,000x to 100,000x improvements in error rate mentioned before, in section #2) we will need to work together.
This is the only possible path for the exponential growth of AI — where people want to participate in the system and be economically incentivized to do so — instead of giving their data to corporations for free.
Big Tech companies live in the illusion that free access to petabytes of unfiltered data is an advantage, while that data brings so much noise into their models; I really believe that the most powerful AI company of this decade is not yet known.
But the companies that create Hybrid Intelligence systems first will have much higher-quality resources for building AGI, and then ASI systems, especially those that share the value with their individual contributors.
Who knows, maybe they are being built right now.
— — — — — — — — — — — — — — — — — — — —
If you enjoyed this article and would like to read more, feel free to download my FREE e-book “Intelligent Machines of the Future”