Reddit's third-party client ban closed user messages behind a paywall. I think we Lemmitors should stop AI training on us, or at least monetise it (for our instances)

  • @[email protected]
    link
    fedilink
    English
    2524 days ago

    Sadly, you cannot. If you have a platform that’s open for everyone to participate in, that includes bad actors.

    You could attempt to mitigate this by having communities filled with bots just creating LLM content, so when they scrape the data they can’t tell if it’s human or not. And that would hurt their data set

    • @Blackthorn (6 points, 24 days ago)

      It would be just a matter of time before they can distinguish between good and bad data; there are already AIs that can do just that. I'd like to do something like that on GitHub though :P

      • @[email protected]
        link
        fedilink
        English
        624 days ago

        It’s kind of moot. If you have the capability of distinguishing good and bad training data, you no longer need your training data.

        And quite frankly, that would be at general-AI levels of technology. It'll come eventually, but not for a while, a good long while.

  • Asudox (21 points, 24 days ago)

    You can’t stop them. Publicly available data can and will be a training source for LLMs.

  • ElderReflections (7 points, 24 days ago)

    Maybe some legal framework that would force any derivative work made from the content to be free & open source?

    • Noo (2 points, 24 days ago)

      Indeed, see the difference between libre software and open source software.

  • CaptainBasculin (5 points, 24 days ago)

    With the way federation works, not much. People from all sorts of federation-capable sites can see the content posted on different instances; but considering its conveniences, I think it's worth it.

  • redrum (3 points, 24 days ago)

    Instances could add this snippet to their robots.txt (source: eff.org, businessinsider.com, and nytimes.com/robots.txt):

    User-agent: GPTBot
    Disallow: /
    
    User-agent: Google-Extended
    Disallow: /
    
    User-agent: Meta-ExternalAgent
    User-agent: meta-externalagent
    Disallow: /
    

    Note: this only tells the crawlers from OpenAI, Google, and Meta not to crawl the site to train an LLM; the NYTimes robots.txt has a much longer list of other crawlers.
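    To sanity-check that rules like the snippet above actually block a given crawler, Python's standard-library robots.txt parser can be used. This is a minimal sketch; the rules are fed in as lines rather than fetched from a live instance:

```python
# Sanity-check robots.txt rules with Python's stdlib parser.
# The rules below mirror the snippet from the comment above.
import urllib.robotparser

rules = [
    "User-agent: GPTBot",
    "Disallow: /",
    "",
    "User-agent: Google-Extended",
    "Disallow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# Listed AI crawlers are blocked everywhere; anything else is unaffected.
print(rp.can_fetch("GPTBot", "https://example.com/post/1"))   # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/"))   # True
```

    Of course, robots.txt is purely advisory: it only stops crawlers that choose to honor it.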

  • @mspencer712 (3 points, 24 days ago)

    Broadly, this is about preventing plagiarism. We don't want someone to scrape all our knowledge, remove the human connection and the references back to experts and people, and serve the information itself, uncredited.

    But if a human can read something, so can a bot. I think ultimately we need legislation.

      • @mspencer712 (1 point, 24 days ago)

        Are you sure? Maybe I’m using the wrong word. What is it called when, in an academic paper, the author states findings or conclusions the author got from some other source, in the author’s own words, but doesn’t cite their source?

        • @[email protected] (1 point, 23 days ago)

          I don’t know.

          The only academic papers I’ve ever read are scientific publications, and in that case any conclusions that aren’t supported by the methodology or by reference are just … untrusted.

          I don’t have any experience with non-scientific academic papers.

    • @[email protected] (3 points, 24 days ago)

      Also, legislation isn't going to help. The danger of AI is so much deeper and more profound than plagiarism; if we start fucking around with legislation as our mechanism of protection, it will cause us all to die when the cartels, or whatever other actors simply do not care about laws, pull ahead in AI development.

      The push for legislation is to ensure that small startups don’t get access to AI. It’s to ensure that only ultra-wealthy AI development can take place.

      To survive the advent of AI we need as much multipolarity in the AI power structure as possible. That means as many separate, distinct AIs coming into existence as possible, forcing them down a path of parity instead of dictatorship in their social aspect.

      Legislation is a push by the big players to keep the little players from being able to play. It is a really, really bad idea.

      • @mspencer712 (4 points, edited, 24 days ago)

        I’m probably thinking about this in a naive way. I’d love to see proprietary models, if trained using public information, be required to become public and free via legislation. AI companies can compete on selling GPU time, on ease of use.

        And, if AI companies are required to figure out attribution in order to use their work commercially, research in that area will accelerate, because money. No, I don't know how that would work either.

        Still probably a bad idea but I haven’t figured out why yet.

        Thank you for your well written reply.

  • #!/usr/bin/woof (1 point, 23 days ago)

    Pepper it with absolutely wrong or illogical information. I mean, you know, more than the usual amount.

  • HobbitFoot (1 point, 20 days ago)

    No. If anything, Lemmy makes it easier than Reddit.

    Reddit requires some form of web scraping. All Lemmy requires is setting up a server and federating with other instances to get access to their data.
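    You don't even need federation for public content: Lemmy instances expose a public HTTP API. A minimal sketch of how little effort this takes, assuming Lemmy's v3 `post/list` endpoint (the instance URL here is just a placeholder):

```python
# Sketch: pulling posts from a Lemmy instance's public HTTP API.
# The /api/v3/post/list path and its parameters are assumptions
# based on Lemmy's v3 API; no authentication is required for
# public posts.
import json
import urllib.request


def post_list_url(instance, limit=5, sort="New"):
    """Build the post-list URL for a given instance."""
    return f"{instance}/api/v3/post/list?limit={limit}&sort={sort}"


def fetch_post_titles(instance="https://lemmy.world", limit=5):
    """Fetch the newest post titles from an instance (network call)."""
    with urllib.request.urlopen(post_list_url(instance, limit)) as resp:
        data = json.load(resp)
    return [p["post"]["name"] for p in data.get("posts", [])]
```

    Anyone training a model can do this against every instance in the network, no scraper needed.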