Reddit's licensing deal means Google's AI can soon be trained on the best humanity has to offer — completely unhinged posts

Lee Duna · 4 months ago

Reddit's licensing deal means Google's AI can soon be trained on the best humanity has to offer — completely unhinged posts

@[email protected] · 4 months ago

It’s going to drive the AI into madness as it will be trained on bot posts written by itself in a never ending loop of more and more incomprehensible text.

It’s going to be like putting a sentence into Google translate and converting it through 5 different languages and then back into the first and you get complete gibberish

@[email protected] · 4 months ago

Ai actually has huge problems with this. If you feed ai generated data into models, then the new training falls apart extremely quickly. There does not appear to be any good solution for this, the equivalent of ai inbreeding.

This is the primary reason why most ai data isn’t trained on anything past 2021. The internet is just too full of ai generated data.

@[email protected] · edit-2 2 months ago

Removed by mod

@[email protected] · 4 months ago

OR they could just scrape info from the “aska____” subreddits and hope and pray it’s all good. Plus that is like 1/100th the work.

The racism, homophobia and conspiracy levels of AI are going to rise significantly scraping Reddit.

@[email protected] · edit-2 2 months ago

Removed by mod

Rentlar · 4 months ago

That reminds me, any AI trained on exclusively Reddit data is going to use lose vs. loose incorrectly. I don’t know why but I spotted that so often there.

@towerful · 4 months ago

Its a loose-lose situation

@[email protected] · 4 months ago

And the “would of” thing

the post of tom joad · 4 months ago

Ooh ooh and “tow the line”

@[email protected] · 4 months ago

Haha. Grad students expensive. God bless.

@[email protected] · 4 months ago

And unlike with images where it might be possible to embed a watermark to filter out, it’s much harder to pinpoint whether text is AI generated or not, especially if you have bots masquerading as users.

@[email protected] · 4 months ago

This is why LLMs have no future. No matter how much the technology improves, they can never have training data past 2021, which becomes more and more of a problem as time goes on.

TimeSquirrel · 4 months ago

You can have AIs that detect other AIs’ content and can make a decision on whether to incorporate that info or not.

@[email protected] · 4 months ago

can you really trust them in this assessment?

TimeSquirrel · edit-2 4 months ago

Doesn’t look like we’ll have much of a choice. They’re not going back into the bag.
We definitely need some good AI content filters. Fight fire with fire. They seem to be good at this kind of thing (pattern recognition), way better than any procedural programmed system.

@[email protected] · 4 months ago

last time i’ve checked ais are pretty bad at recognizing ai-generated content

anyway there’s xkcd about it https://xkcd.com/810/

@[email protected] · 4 months ago

Fun fact. You can’t. Ais are surprisingly bad at distinguishing ai generated things from real things.

TimeSquirrel · 4 months ago

What is this then?

https://copyleaks.com/ai-content-detector

@[email protected] · 4 months ago

Just because a tool exists doesn’t mean it’s particularly good at what it’s supposed to do.

@[email protected] · 4 months ago

deleted by creator

RuBisCO · 4 months ago

What was the subreddit where only bots could post, and they were named after the subreddits that they had trained on/commented like?

@[email protected] · 4 months ago

SubRedditSimulator?

RuBisCO · 4 months ago

That’s the one.

@[email protected] · 4 months ago

Omg I cannot wait to see it.