The New York Times is suing OpenAI and Microsoft for copyright infringement, claiming the two companies built their AI models by “copying and using millions” of the publication’s articles and now “directly compete” with its content as a result.

As outlined in the lawsuit, the Times alleges OpenAI and Microsoft’s large language models (LLMs), which power ChatGPT and Copilot, “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.” This “undermine[s] and damage[s]” the Times’ relationship with readers, the outlet alleges, while also depriving it of “subscription, licensing, advertising, and affiliate revenue.”

The complaint also argues that these AI models “threaten high-quality journalism” by hurting the ability of news outlets to protect and monetize content. “Through Microsoft’s Bing Chat (recently rebranded as ‘Copilot’) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” the lawsuit states.

The full text of the lawsuit can be found here

  • Zima@kbin.social
    1 year ago

    The “poem poem poem” thing shows that LLMs actually do memorize at least some training data. OpenAI changed the ChatGPT EULA to forbid users from asking it to repeat words forever after this was in the news.

    Also, as far as I understand, there are usually fair-use and non-profit exceptions for using material as training data, but they generally limit how it can be used, so training a model for commercial purposes might violate the license of the training data.

    I don’t necessarily agree with the NYT, but they seem to be framing this as someone aggregating their data and packaging it in a better way, which hurts their profits. I don’t really see that as necessarily being true. They could argue the same about Google News showing their news…

    • CJOtheReal@ani.social
      1 year ago

      They don’t “remember” anything. They produce an “answer” by running a shitload of math, which boils down to the most “helpful” answer they can statistically give you.

      LLMs are neural networks. If you know how they work, you know how idiotic all the copyright claims are. They are all just mad that their shit is getting obsolete. In the background they use the same engine to do the “work” that they claim violated their copyright, and now they are mad because it does a better job at writing than they do and they are afraid of being replaced.

      All lawsuits against AI companies, regarding copyright of training data, are dumb as hell.

      You are right about the commercial/non-profit training data part, but from my understanding that’s basically a gray zone, and politics is too slow to keep up with tech.

      Btw, fuck OpenAI; they are as open as a fucking Supermax prison. Even the programmers don’t know what their main LLM does. They just place a simple one between the user and the actual GPT to make sure that it doesn’t give people instructions on how to build a bomb and stuff like that, or to keep people from making it say bad words…

      • Zima@kbin.social
        1 year ago

        That’s the theory. Previous models were also supposed to be doing 3-digit math, but it turned out that the questions were in the training data.

        So you should look into what happens when people ask ChatGPT to repeat a word forever: it prints the word for a while and then prints training data. Check this link: https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/

        edit: relevant part:

        It also, crucially, shows that ChatGPT’s “alignment techniques do not eliminate memorization,” meaning that it sometimes spits out training data verbatim. This included PII, entire poems, “cryptographically-random identifiers” like Bitcoin addresses, passages from copyrighted scientific research papers, website addresses, and much more.

        “In total, 16.9 percent of generations we tested contained memorized PII,”

        I should also reiterate that I agree that the intent is to avoid memorization, but they are not successful yet.
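
To make “spits out training data verbatim” a bit more concrete, here is a toy sketch of how one could score a model's output for word-for-word overlap with a known reference text. This is not the method used by the researchers in the linked article; the function names, the whitespace tokenization, and the default window of 8 words are all assumptions made for illustration.

```python
def word_ngrams(text, n):
    """Return the set of all n-word windows in `text` (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(generated, reference, n=8):
    """Fraction of n-word windows in `generated` that also occur in `reference`.

    Hypothetical helper: 0.0 means no shared window of n words,
    1.0 means every window of the generated text appears in the reference.
    """
    gen = word_ngrams(generated, n)
    if not gen:
        # Generated text is shorter than n words, so nothing can be compared.
        return 0.0
    ref = word_ngrams(reference, n)
    return len(gen & ref) / len(gen)
```

A score near 1.0 for a long window would suggest the output is reciting the reference rather than paraphrasing it, while a paraphrase or summary would usually score near 0.0. Checking against real training data would need access to the corpus itself, which outsiders generally do not have.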