I’m rather curious to see how the EU’s privacy laws are going to handle this.

(Original article is from Fortune, but Yahoo Finance doesn’t have a paywall)

  • @[email protected]
    English
    15
    edit-2
    10 months ago

    How is “don’t rely on content you have no right to use” literally impossible?

    We teach children that there is a Google filter that shows only CC-licensed images (which they should use for their presentations).

    Also, it’s not like we are talking about small companies here. A new billion-making industry is being born, and it could totally afford contracts with big platforms that would allow it to use their content.

    • @[email protected]
      English
      8
      10 months ago

      This is an article about unlearning data, not about not consuming it in the first place.

      LLMs are not storing learned data in its raw, original form. They are ingesting it and building an understanding of language based on it.

      Attempting to peel out that knowledge would be incredibly difficult, if not impossible, because there’s really no way to identify it.

      • @[email protected]
        English
        4
        10 months ago

        And we’re saying that if peeling out knowledge that someone has a right to have forgotten is difficult or impossible, that knowledge should not have been used to begin with. If enforcement means big tech companies have to throw out models because they used personal information without knowledge or consent, boo fucking hoo, let me find a Lilliputian to build a violin for me to play.

        • @[email protected]
          English
          2
          edit-2
          10 months ago

          Okay, I get it, but that’s a different argument. Starting fresh only gets you so far. Once an LLM exists and is exposed to the public, users can submit any data they like, and the LLM has no idea of the source.

          You could argue then that these models shouldn’t be able to use user-submitted data, but that would be a devastating restriction on the technology, and it starts to become a question of whether we want this tech to exist at all.

        • @[email protected]
          English
          0
          10 months ago

          If enforcement means big tech companies have to throw out models because they used personal information without knowledge or consent, boo fucking hoo

          A) this article isn’t about a big tech company, it’s about an academic researcher. B) he had consent to use the data when he trained the model. The participants later revoked their consent to have their data used.

    • @[email protected]
      English
      1
      10 months ago

      How is “don’t rely on content you have no right to use” literally impossible?

      At the time they used the data, they had a right to use it. The participants later revoked their consent for their data to be used, after the model was already trained at an enormous cost.

      • @[email protected]
        English
        1
        10 months ago

        I have to admit my comment is not really relevant to the article itself (also, I read only the free part of it).

        It was more a reaction to the comment above, which felt more generic. My concern about LLMs is that I could never find an auditable list of websites that were crawled, which would be reasonable to ask for, I think.

    • @[email protected]
      English
      -1
      10 months ago

      And the rest of the data Google has been viewing, cataloging and selling back to everyone for years, because they’re legally allowed to do so… you don’t see the irony in that?

      • @[email protected]
        English
        11
        edit-2
        10 months ago

        Are they selling back scraped content? I thought it was only user behavior collected through the ad network?

        As for cataloging, at least it is opt-out through robots.txt 🤷

        EDIT: Plus, “we are already doing bad” is never a good argument for continuing to do bad. If Google were found to be at fault, this could build the traction needed to slap their ass.

        • @[email protected]
          English
          5
          10 months ago

          Google crawls the internet, archives entire actual photos, large snippets (at least) from every website it sees, aggregates it into a different form and serves it back to people for profit. It’s the same business model, different results with the processing of the data.

          • bobettes_bob
            2
            10 months ago

            Google doesn’t sell the data they collect… They sell ads and use their data to better target people with said ads. Third parties are paying Google to target their ads to the right people.

            • @[email protected]
              English
              3
              10 months ago

              You go to Google because of the data they collected from the open internet: people’s photos, articles they’ve written, books, etc. They aggregate it, process it and serve it back to you alongside ads. They also collect data about you and sell that as well. But no one would go to Google if they hadn’t aggregated, processed and repackaged the internet’s data.

    • BraveSirZaphod
      -2
      10 months ago

      Because the question of what data one has the right to use is a very open legal question right now.

      There is absolutely nothing illegal about a person examining publicly accessible artwork or text, learning from it, and attempting to reproduce a similar style. AIs are, in essence, doing basically the same thing. However, the sheer difference in time and scale may warrant a different legal treatment. That has not yet been settled, and it will probably take a fair amount of societal debate and new legislation before we have a definite answer.