
AIs trained on AI-generated content are at risk of model collapse.



Could the very work we are doing with Large Language Models (LLMs) to develop current artificial intelligence (AI) prove to be the technology’s Achilles’ heel? Two days ago, The Globe and Mail published an article by Joe Castaldo entitled “AI models ‘collapse’ and spout gibberish over time, research finds. But there could be a fix.” It described research showing that when chatbots like ChatGPT use text scraped from the Internet as training material, future AIs trained on that material can end up producing rubbish. In May 2023, researchers including authors at the University of Toronto published a paper called “The Curse of Recursion: Training on Generated Data Makes Models Forget.” It concluded that training AIs on AI-generated data can render them useless: their output degrades into gibberish and garbage when the training content was itself produced by AI. They labelled the phenomenon “model collapse.” It is sometimes likened to “data poisoning.”

The underlying concern stems from the explosion of AI-generated content appearing on the Internet today. AIs trained on Internet-scraped datasets create feedback loops that amplify errors and pass them along to succeeding generations of trained AIs (a toy simulation of this loop follows the list below). This produces:

  • Data Quality Degradation: Even small quantities of AI-generated content can degrade a model during training.
  • Loss of Human-Created Content: As synthetic text crowds out human writing, the human qualities of the training material fade.
  • Data Inbreeding Effect: Outputs increasingly turn to gibberish as models train on other models’ output.
  • Reduced Performance: Models trained on AI-synthesized data become less useful with each training cycle.
  • Content Homogenization: AI feeding off other AI output leads to a loss of diversity and creativity.
  • Differentiation Difficulties: People using the Internet will find it increasingly difficult to distinguish human from AI-generated content, and AIs will find it just as difficult.
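
To make the feedback loop concrete, below is a minimal toy simulation in Python. It is not from the Globe article or the research paper; the Zipf-like “human” distribution, vocabulary size and corpus size are illustrative assumptions. Each “generation” trains only on text sampled from the previous generation, so rare tokens that happen not to be sampled vanish for good and diversity shrinks with every cycle.

```python
import random
from collections import Counter

def sample_corpus(model, n_tokens):
    """Draw a synthetic 'corpus' of tokens from the current model's distribution."""
    tokens, weights = zip(*model.items())
    return random.choices(tokens, weights=weights, k=n_tokens)

def fit_model(corpus):
    """'Train' the next model: estimate token probabilities from observed counts only.
    Tokens that were never sampled silently disappear -- the distribution's tail is lost."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

random.seed(42)
vocab_size, corpus_size = 5_000, 20_000
# Generation 0: a Zipf-like "human" distribution over a large vocabulary.
model = {i: 1.0 / (i + 1) for i in range(vocab_size)}

for gen in range(1, 11):
    corpus = sample_corpus(model, corpus_size)  # this generation writes the next training set
    model = fit_model(corpus)                   # the next model trains only on that output
    print(f"generation {gen:2d}: distinct tokens remaining = {len(model)}")
```

Because a token that is never sampled can never reappear, the count of distinct tokens can never recover, and in practice it shrinks generation after generation, a crude analogue of the homogenization and inbreeding effects listed above.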

The study shows that when LLMs are repeatedly exposed to their own text outputs, within a few cycles they start “vomiting nonsense.” The reason, the researchers hypothesize, is that mistakes in the original training data become compounded over time because the data is no longer being properly curated by humans.

The Globe article uses the metaphor of “a snake eating its own tail.” In other words, the inbreeding becomes amplified in “just a few training cycles.” In one test using a visual AI tool and an image of a person, the research shows that after 20 cycles the output had fused the person into the background.

Why Is The AI We See Now Subject To Model Collapse?

It is the rush to market by AI competitors that has created the potential for model collapse. When OpenAI launched ChatGPT on November 30, 2022, with the backing of billions of dollars from Microsoft, every other big tech company with a stake in the game rushed out its own AI. The carefully curated experiments developed at universities like Princeton and UC Berkeley fell by the wayside as Google, Meta, Amazon and others began one-upping each other with releases that seemed to arrive almost monthly. The result is that we are awash in first-generation commercial LLMs whose synthetic output is starting to fill up the Internet.

At what proportion of Internet content would synthetic online data lead to AI model collapse, and are we there yet? Google DeepMind’s AlphaGo Zero, the AI that mastered the game of Go, trained entirely on games it played against itself, and in that case model collapse didn’t happen. Likely this was because Go, like checkers and chess, follows mathematically defined rules and parameters that are not subject to the nuances of language and interpretation.

Saving AI From Itself And For The Future

Are there solutions to save AI? The answer, say those doing development in the field, is yes. What it comes down to is better training and better data curation.

Opportunities for employment training AIs abound on LinkedIn today. The Globe article illustrates the training problem, noting how AI companies use “legions of typically underpaid workers to annotate, rank and label data for models to learn from.” The skill set of those employed to do this work clearly needs upgrading.

Then there is vigilance in data curation, to ensure that AI-generated Internet content is not used to train new AIs. That means better detection methods are required to differentiate human-written from AI-produced content; a rough sketch of one common heuristic follows.
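
One common (and imperfect) detection heuristic is to score text with a reference language model and flag passages whose perplexity is unusually low, since LLM output tends to be statistically “smoother” than most human writing. Here is a minimal sketch assuming the Hugging Face transformers and torch packages and the publicly available GPT-2 checkpoint; the threshold is an arbitrary placeholder, not a validated cutoff.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes the small public "gpt2" checkpoint is available for download.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Score how statistically 'predictable' a passage is to the reference model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over the tokens
    return float(torch.exp(loss))

SUSPECT_THRESHOLD = 20.0  # placeholder value: would need tuning on labelled samples

def looks_ai_generated(text: str) -> bool:
    """Unusually low perplexity is a weak signal that text may be machine-generated."""
    return perplexity(text) < SUSPECT_THRESHOLD

print(looks_ai_generated("The rain in Spain falls mainly on the plain."))
```

Heuristics like this are known to misfire in both directions, which is exactly why the article argues that stronger detection methods are still needed.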

The LLMs themselves need improvement so that they can train on smaller, curated datasets and thus avoid corrupted AI content. The datasets used in training should include offline and pre-Internet data sources. Finally, content specifically designed and produced by AIs for training purposes can be developed and used to help train future LLMs. A sketch of what such a curation pipeline might look like follows.
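
As a sketch of what that curation might look like in code, the function below anchors a training mix to verified human or pre-Internet sources and admits web text only after it passes a detector such as the one sketched above. The document structure, helper names and the 10% cap on web text are illustrative assumptions, not recommendations from the research.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Document:
    text: str
    source: str  # e.g. "book_scan", "newspaper_archive", "web_2024"

def build_training_mix(
    curated_human: Iterable[Document],
    web_text: Iterable[Document],
    is_probably_ai: Callable[[str], bool],
    max_web_fraction: float = 0.10,  # illustrative cap on Internet-era content
    seed: int = 0,
) -> List[Document]:
    """Anchor the training set in curated human sources; admit only filtered web text."""
    human_docs = list(curated_human)
    # Drop any web document the detector flags as likely machine-generated.
    clean_web = [d for d in web_text if not is_probably_ai(d.text)]
    # Cap web text so synthetic leakage cannot dominate the mix.
    budget = int(len(human_docs) * max_web_fraction)
    rng = random.Random(seed)
    rng.shuffle(clean_web)
    mix = human_docs + clean_web[:budget]
    rng.shuffle(mix)
    return mix
```

The design choice here is defensive: even if the detector misses some machine-generated text, capping the web fraction keeps curated human and offline sources as the bulk of every training cycle.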


