Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.
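
As a deliberately tiny illustration of "keeping statistics rather than passages", here is a sketch of a bigram counter that predicts the next word from frequency counts alone. It is a toy, not how large neural models are actually trained, and the names `train_bigrams` and `predict_next` are made up for this example. On a corpus this small the counts can still reproduce the input closely, so it shows the mechanism only.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical toy corpus; after training, only the counts are consulted.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # prints "cat" (seen twice after "the")
```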

This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.

Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled fair use in Authors Guild v. Google despite protests from authors and publishers. AI training is arguably even more transformative.
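
For anyone unsure what a "searchable index" involves, here is a minimal inverted-index sketch. It is an assumption-laden toy, not Google's actual system, and `build_index` and `search` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical two-document collection.
documents = {
    "a": "the old man and the sea",
    "b": "the sun also rises",
}
index = build_index(documents)
print(sorted(search(index, "the sea")))  # prints ['a']
```

In this sketch the indexer has to read every word of every document, but what it keeps is a word-to-document mapping rather than the passages in reading order.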

While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.

For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744

  • @Michal
    49 days ago

    What do you think “ingesting” means if not learning?

    Bear in mind that training AI does not involve copying content into its database, so copyright is not an issue. AI is simply predicting the next token/word based on statistics.

    You can train AI on a book and it will give you information from the book - information is not copyrightable. You can read a book and talk about its contents on TV. That’s not illegal if you’re a human; should it be illegal if you’re a machine?

    There may be moral issues with training on someone’s hard-won knowledge, but there is no legislation against it. Reading books and using that knowledge to provide information is legal. If you try to outlaw automating this process with computers, there will be side effects, such as search engines no longer being able to index data.

    • @[email protected]
      29 days ago

      Bear in mind that training AI does not involve copying content into its database, so copyright is not an issue.

      Wrong. The infringement is in obtaining the data and presenting it to the AI model during the training process. It makes no difference that the original work is not retained in the model’s weights afterwards.

      You can train AI on a book and it will give you information from the book - information is not copyrightable. You can read a book and talk about its contents on TV. That’s not illegal if you’re a human; should it be illegal if you’re a machine?

      Yes, because copyright law is intended to benefit human creativity.

      If you try to outlaw automating this process with computers, there will be side effects, such as search engines no longer being able to index data.

      Wrong. Search engines retain a minimal amount of the indexed website’s data, and the purpose of the search engine is to generate traffic to the website, which benefits both the engine and the website (increased visibility, the opportunity to show ads to make money). Banning the use of copyrighted content for AI training (which uses the entire copyrighted work and whose purpose is to replace the organizations whose work is being used) will have no effect on search engines.

      • @Michal
        19 days ago

        What do you mean when you say search engines retain a minimal amount of a site’s data? Obviously they need to index all of the content to make it searchable. If you search for keywords within an article, you can find the article, therefore all of it needs to be indexed.

        Indexing is nothing more than “presenting data to the algorithm” so it’d be against the law to index a site under your proposed legislation.

        Wrong. The infringement is in obtaining the data and presenting it to the AI model during the training process. It makes no difference that the original work is not retained in the model’s weights afterwards.

        This is an interesting take, and I’d be inclined to agree, but you’re still facing the problem of how to distinguish training AI from indexing for search purposes. I’m afraid you can’t have it both ways.