There was a lot of press ~2 years ago about this paper, and the term “model collapse”:
Training on generated data makes models forget

There was concern that the AI models had slurped up the Whole Internet and needed more data to get any smarter. Generated “synthetic data” was mooted as a possible solution. And there’s the fact that the Internet increasingly contains AI-generated content.

As so often happens (and happens fast in AI), research and industry move on, but the flashy news item remains in people’s minds. To this day I see posts from people who mistakenly think this is still a problem (and as such, one more reason the whole AI house of cards is about to fall).

In fact, the big frontier models today (GPT, Gemini, Llama, Phi, etc.) are all trained at least partly on synthetic data.

As it turns out, the quality of the data is what really matters, not whether it’s synthetic; see "Textbooks Are All You Need".

And then some folks figured out how to use an AI Verifier to automatically curate that quality data: "Escaping Model Collapse via Synthetic Data Verification"
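
To make the verifier idea concrete, here is a toy sketch of the general pattern: score each generated sample and keep only the ones that clear a bar. This is my own illustration, not the method from the paper. The names `curate` and `toy_verifier` are hypothetical, and the "verifier" is a crude repetition check standing in for what would really be a trained model.

```python
from typing import Callable, List


def curate(samples: List[str],
           verifier: Callable[[str], float],
           threshold: float = 0.5) -> List[str]:
    """Keep only the samples whose verifier score meets the threshold."""
    return [s for s in samples if verifier(s) >= threshold]


def toy_verifier(text: str) -> float:
    """Hypothetical stand-in for a real verifier model.

    Scores text by the fraction of distinct words, so highly
    repetitive (degenerate) output gets a low score.
    """
    words = text.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Pretend these came out of a generator model.
synthetic = [
    "the cat sat on a mat",
    "data data data data data",
    "",
]

kept = curate(synthetic, toy_verifier, threshold=0.8)
print(kept)
```

Only the first sample survives the filter here. The point is just that the filter, not the origin of the text, decides what goes into the training set.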

And people used clever math to make the synthetic data really high quality: "How to Synthesize Text Data without Model Collapse?"

Summary:
“Model collapse” from AI-generated content is largely a Solved Problem.

There may be reasons the whole AI thing will collapse, but this is not one.

  • BootLoop@sh.itjust.works
    2 months ago

    Nuh uh, the smart people of Lemmy keep telling me that any day now, ChatGPT will stop working because of Ouroboros. Any day now.

    • panda_abyss@lemmy.ca
      2 months ago

      Technically OpenAI haven’t succeeded in pretraining a new model in like 18 months, so maybe it has.

      But I agree with you.

      • nymnympseudonym@piefed.social (OP)
        2 months ago

        I think there’s general consensus that we need another breakthrough like Reasoning or Attention.

        What / where / when / whether that breakthrough is… does not yet have general consensus.