• stembolts
    arrow-up
    139
    ·
    edit-2
    7 months ago

    This is similar to what I did when I heard Reddit was doing the API lockdown: I wrote an automation bot over the weekend that self-destructed my subreddit and its entire post history. The bot also automatically downloaded and archived all of the content on my local machine.

    It was annoying because at first I couldn’t get access to older posts: at the time, Reddit had changed their API to only show the first X posts (100 or 1,000 or whatever). So I told my bot to delete the posts as it archived them; as I deleted content, Reddit had no choice but to populate the page with the older posts.
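
    A minimal sketch of that archive-then-delete loop, assuming a listing that only ever returns the newest few posts. `FakeListingApi`, `PAGE_LIMIT`, and `archive_and_delete` are hypothetical names made up for illustration; this is not the actual bot and it does not call Reddit's real API.

```python
from typing import Callable, Dict, List

PAGE_LIMIT = 3  # hypothetical cap on how many posts a listing returns


class FakeListingApi:
    """Stand-in for an API that only exposes the newest PAGE_LIMIT posts."""

    def __init__(self, posts: List[Dict[str, str]]) -> None:
        self._posts = list(posts)  # newest first

    def list_posts(self) -> List[Dict[str, str]]:
        return self._posts[:PAGE_LIMIT]

    def delete_post(self, post_id: str) -> None:
        self._posts = [p for p in self._posts if p["id"] != post_id]


def archive_and_delete(
    api: FakeListingApi, save: Callable[[Dict[str, str]], None]
) -> int:
    """Archive each visible post, then delete it so older posts surface."""
    archived = 0
    while True:
        page = api.list_posts()
        if not page:
            break  # nothing left to list, so everything has been archived
        for post in page:
            save(post)                   # write the local copy first
            api.delete_post(post["id"])  # only then remove it remotely
            archived += 1
    return archived


posts = [{"id": str(i), "title": f"post {i}"} for i in range(10)]
api = FakeListingApi(posts)
archive: List[Dict[str, str]] = []
count = archive_and_delete(api, archive.append)
```

    Saving before deleting matters here: if the save step fails, the post is still on the site and the loop can be rerun.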

    And that’s how I archived my subreddit. Reddit banned me two days later for automation, lol. I did not break any of the reddit or reddit api ToS during this process but I guess I upset someone.

    • ubergeek77@lemmy.ubergeek77.chat
      arrow-up
      27
      ·
      7 months ago

      I don’t think I’ve been banned, but I did a similar thing. I requested all my data from Reddit, then used that list of comment/post IDs to mass-edit them. I think I’m in the clear because I used the official third party API, with an official “app.” If you used the private API or instrumented this via the browser, that may be why you were banned.

      Anyway, if you or someone else wants their full history, Reddit will give it to you via a data export request.
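
      A rough sketch of the "export, then mass-edit by ID" approach, assuming the export contains a CSV with an ID column. The column name `id`, the sample rows, and the `record_edit` callback are assumptions for illustration; a real run would pass a function that calls the official API through a registered app.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List


def read_ids(export_csv: Iterable[str], id_column: str = "id") -> List[str]:
    """Collect the non-empty values of the ID column from an export CSV."""
    reader = csv.DictReader(export_csv)
    return [row[id_column] for row in reader if row.get(id_column)]


def mass_edit(
    ids: Iterable[str], edit: Callable[[str, str], None], new_body: str
) -> int:
    """Call edit(item_id, new_body) for every ID; return how many were edited."""
    edited = 0
    for item_id in ids:
        edit(item_id, new_body)
        edited += 1
    return edited


# Hypothetical export contents; a real export may use other column names.
sample_export = io.StringIO(
    "id,body\nabc123,first comment\ndef456,second comment\n"
)

edits: Dict[str, str] = {}


def record_edit(item_id: str, body: str) -> None:
    """Placeholder for a real API call; it just records the requested edit."""
    edits[item_id] = body


ids = read_ids(sample_export)
n = mass_edit(ids, record_edit, "[removed by author]")
```

      Working from the exported ID list avoids depending on what the site's listing pages are willing to show.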

    • GBU_28@lemm.ee
      English
      arrow-up
      19
      ·
      7 months ago

      Unfortunately they still have everything. Deleting is good for reducing “human” visibility, but they still have the data.

      • stembolts
        arrow-up
        13
        ·
        edit-2
        7 months ago

        Oh I know, I just wanted a copy too.

        Deleting posts from the user PoV was the only way I could come up with to force the API to show them to me.

  • henfredemars@infosec.pub
    English
    arrow-up
    82
    ·
    7 months ago

    I feel like this content craze is going to evaporate soon because all the new content from here forward is sure to be polluted by LLM output already. AI is fast becoming a snake eating its own tail.

    That reminds me. I should go update my licenses to spit in the face of AI training companies.

  • pe1uca@lemmy.pe1uca.dev
    arrow-up
    60
    arrow-down
    1
    ·
    7 months ago

    It’s just a matter of time until all your messages on Discord, Twitter etc. are scraped, fed into a model and sold back to you

    As if it didn’t happen already

  • darkphotonstudio@beehaw.org
    arrow-up
    44
    ·
    7 months ago

    I think people would have fewer issues with AI training if it were non-profit and for the common good. And there are open source AI projects, many in fact. But yeah, these deals by companies like this are sleazy.

  • verassol@lemmy.ml
    arrow-up
    43
    ·
    7 months ago

    StackOverflow: *grabs money by monetizing massive amounts of user-contributed content without consulting or compensating the users in any way*

    Users: *try to delete it all to prevent it*

    StackOverflow: *your contributions belong to the community, you can’t do that*

    Pretty fucked-up laws. A lot of lawsuits going on right now against AI companies for similar issues. In this case, StackOverflow is entitled to be compensated for its partnership, and because the answers are all CC BY-SA 3.0, no one can complain. Now, that SA? Whatever.

    • 9point6@lemmy.world
      arrow-up
      15
      ·
      7 months ago

      That SA part needs to be tested in court against the AI models themselves

      A lot of this shittiness would probably go away if there was a risk that ingesting certain content would mean you need to release the actual model to the public.

      • verassol@lemmy.ml
        arrow-up
        4
        ·
        edit-2
        7 months ago

        Yeah, their assumption though is you don’t? Neither attribution nor sharealike, not even full-on all-rights-reserved copyright is being respected. Anything public goes and if questions are asked it’s “fair use”. If the user retains CC BY-SA over their content, why is giving a bunch of money to StackOverflow entitling OpenAI to use it all under whatever terms they settled on? Boggles me.

        Now, say, Reddit Terms of Service state clearly that by submitting content you are giving them the right to “a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness (…) in all media formats and channels now known or later developed anywhere in the world.” Speaks volumes on why alternatives (like Lemmy) to these platforms matter.

          • verassol@lemmy.ml
            arrow-up
            3
            ·
            7 months ago

            That’s interesting. I was looking up “Lemmy Terms of Service” for comparison after getting that quote from the Reddit ToS and could not find anything for Lemmy.ml. Now after you mentioned it, looking on my Mastodon instance, nothing either, just a privacy policy. That is indeed kinda weird. Some instances do have their own ToS though. At least something stating a sublicense for distribution should be there for protection of people running instances in locations where it’s relevant.

      • verassol@lemmy.ml
        arrow-up
        5
        ·
        7 months ago

        the claimants were set back because they’ve been asked to prove the connection between AI output and their specific inputs

        I mean, how do you do that for a closed-source model with secretive training data? As far as I know, OpenAI has admitted to using large amounts of copyrighted content, countless books and newspaper material, all on the basis of fair use claims. Guess it would take a government entity actively going after them at this point.

          • verassol@lemmy.ml
            arrow-up
            2
            ·
            7 months ago

            Thank you for sharing. Your perspective broadens mine, but I feel a lot more negative about the whole “must benefit business” side of things. It is fruitless to hold any entity whatsoever accountable when a whole worldwide economy is in a free-for-all nuke-waving doom-embracing realpolitik vibe.

            Frankly, not sure what would be worse, economic collapse and the consequences to the people, or economic prosperity and… the consequences to the people. Long term, and from a country that is not exactly thriving in the scheme side of things, I guess I’d take the former.

      • bitfucker
        arrow-up
        2
        ·
        7 months ago

        Yep. Can’t wait to overfit an LLM on a lot of copyrighted work and release it into the public domain. Let’s see if OpenAI gets pushback from copyright owners down the road.

  • jubilationtcornpone@sh.itjust.works
    English
    arrow-up
    45
    arrow-down
    3
    ·
    7 months ago

    Data Rule Numero Uno:

    Garbage in, garbage out.

    Have fun training your LLM on a big steaming pile of hot garbage. That’s 80% of Stack Overflow’s content.

      • ddh@lemmy.sdf.org
        English
        arrow-up
        9
        ·
        7 months ago

        Can’t wait until the top answer to every Google search is “just google it”

    • LostXOR@fedia.io
      arrow-up
      7
      ·
      edit-2
      7 months ago

      The other 20% is mostly high quality however, and I’m sure they’d filter out the heavily downvoted crud.

    • mnemonicmonkeys@sh.itjust.works
      English
      arrow-up
      6
      ·
      7 months ago

      One time I went on there to figure out an issue in Arduino. The answer one guy gave was “I don’t know how to do this in Arduino, here’s how you do this in Java”. Not only did the mods prevent any other answers from being posted, but I tried the guy’s suggestion in Java and it didn’t even work.

  • davel [he/him]@lemmy.ml
    English
    arrow-up
    41
    arrow-down
    3
    ·
    7 months ago

    Good luck with the deleting. It often just means UPDATE comments SET is_deleted = 1 WHERE ID = 666;.
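
    To make that concrete, here is a small sketch of a soft delete using Python's built-in `sqlite3` module. The `comments` table and its columns are assumed for illustration and do not reflect any real site's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, body TEXT, is_deleted INTEGER DEFAULT 0)"
)
conn.execute("INSERT INTO comments (id, body) VALUES (666, 'my answer')")

# "Deleting" only flips the flag; the row and its text stay in the table.
conn.execute("UPDATE comments SET is_deleted = 1 WHERE id = 666")

# What a public page would query: flagged rows are filtered out.
visible = conn.execute(
    "SELECT body FROM comments WHERE is_deleted = 0"
).fetchall()

# What the operator can still read straight from the backend.
stored = conn.execute(
    "SELECT body, is_deleted FROM comments WHERE id = 666"
).fetchone()
```

    After the update, `visible` is empty while `stored` still holds the original text, which is the whole point of the joke.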

    • chiisana@lemmy.chiisana.net
      arrow-up
      16
      arrow-down
      2
      ·
      7 months ago

      There were similar things done on Reddit during the big exit. I doubt it achieved what people expected it to achieve. Even if the posts are not visible externally, I’m sure they can easily access (and thereby make deals to license) the data from their backend or backups; it’s just a matter of how hard they want to try (hint: it’s really not very hard).

      • duncesplayed@lemmy.one
        English
        arrow-up
        15
        ·
        7 months ago

        Yeah during the reddit exodus, people were recommending to overwrite your comment with garbage before deleting it. This (probably) forces them to restore your comment from backup. But realistically they were always going to harvest the comments stored in backup anyway, so I don’t think it caused them any more work.

        If anything, this probably just makes reddit’s/SO’s partnership more valuable because your comments are now exclusive to reddit’s/SO’s backend, and other companies can’t scrape it.

        • Lemongrab@lemmy.one
          arrow-up
          10
          ·
          7 months ago

          It was to make the data inaccessible to general people, therefore removing the reason people visit reddit. Even if reddit could still get the data, regular people would be inconvenienced (in theory) and look somewhere else.

    • plz1@lemmy.world
      English
      arrow-up
      5
      ·
      7 months ago

      They are not deleting, they are editing. So the platform would have to undo those edits rather than just flipping the visibility flag.
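
      A sketch of the difference, assuming a hypothetical schema where the platform may or may not keep old revisions. The `comments` and `revisions` tables and the `edit_comment` helper are invented for illustration; whether a given site keeps such history is not something this thread establishes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE revisions (comment_id INTEGER, old_body TEXT)")
conn.execute("INSERT INTO comments VALUES (1, 'original answer')")


def edit_comment(comment_id: int, new_body: str, keep_history: bool) -> None:
    """Overwrite a comment body, optionally saving the old text first."""
    (old_body,) = conn.execute(
        "SELECT body FROM comments WHERE id = ?", (comment_id,)
    ).fetchone()
    if keep_history:
        conn.execute(
            "INSERT INTO revisions VALUES (?, ?)", (comment_id, old_body)
        )
    conn.execute(
        "UPDATE comments SET body = ? WHERE id = ?", (new_body, comment_id)
    )


edit_comment(1, "garbage", keep_history=True)

# The live row now holds only the overwritten text.
(live,) = conn.execute("SELECT body FROM comments WHERE id = 1").fetchone()

# The original survives only because a revision was stored.
recoverable = [
    row[0]
    for row in conn.execute(
        "SELECT old_body FROM revisions WHERE comment_id = 1"
    )
]
```

      With `keep_history=False` the original text would exist only in backups, so undoing the edit takes more than flipping a flag.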

  • Captain Beyond@linkage.ds8.zone
    arrow-up
    31
    arrow-down
    5
    ·
    edit-2
    7 months ago

    There is, I believe, a fundamental misunderstanding as to what exactly a site like Stack Overflow is. It’s not a forum; there’s no such thing as “your posts.” It’s more like Wikipedia, as in a collaborative question-and-answer site, or a knowledgebase. Each question and answer can be edited like a mini wiki page. They aren’t “yours” any more than the Wikipedia page you created ten years ago is; you contributed it to the commons, so (at least in theory) you don’t have the right to take it back.

    Whether whatever "Open"AI is doing is right is another question, of course. But I don’t think destroying or poisoning the commons to strike back at it is at all helpful either; it feels like “destroying it to save it.”

    • tetris11@lemmy.ml
      arrow-up
      17
      ·
      7 months ago

      Fine, but when coding projects undergo licensing changes that the contributors are against, the code author has to remove those contributions and replace them.

    • wuphysics87@lemmy.ml
      arrow-up
      21
      arrow-down
      3
      ·
      7 months ago

      Those answers were given in good faith under the presumption that they would be read and used by another person. Not used to train something to remove the interactions which motivated the answer in the first place.

      • jsomae@lemmy.ml
        arrow-up
        5
        arrow-down
        4
        ·
        7 months ago

        Can you elaborate on what you mean by “remove the interactions which motivated the answer in the first place”? I’m not sure I follow.

          • forgotmylastusername@lemmy.ml
            arrow-up
            4
            arrow-down
            1
            ·
            7 months ago

            The internet had a social contract. The reason people put effort into brain dumping good posts is because the internet was a global collaborative knowledge base for everybody.

            Of course there were always capitalists who sought to privatize and profit from resources. The source materials were generally part of the big giant digital continuum of knowledge. For the parts that weren’t, there were anarchists who sought to free that knowledge for anyone who wanted to access it.

            AI is bringing about the end of all this as platforms are locking down everything. Old boards and forums had already been shuttering for years as social media was centralizing everything around a few platforms. Now those few platforms are being swallowed up by AI where the collective knowledge of humanity is being put behind paywalls. People no longer want to work directly for the profit of private companies.

            Capitalists can only see dollar signs. They care not for the geological-epoch-scale forces of nature required to form petroleum. All that matters is whether it can all be sold and how quickly. Nor do they care about the environmental damage they cause. In the same way, the AI data miners do not care about the digital ecological disaster they are causing.

            Moreover, it’s a thought-terminating cliché when someone says, “<thing> existed before so why’s it suddenly a problem?”. It seems to be yet another trick out of the bag of rhetorical tricks that wipes the slate of discourse clean, as if all the arguments against it suddenly need to be explained again, as if none of them had any validity. Not only that, but the OPs are often seemingly disingenuously naive. It provides the OP with a blank slate to continually “just ask questions”, where every response is “but why?”, which forces their interlocutors to keep on elaborating in excruciating detail to the point where they give up trying to explain minutiae. Thus the OP can conclude by default that they were correct that it’s not a problem after all, because they declare nobody has provided them with answers to their satisfaction.

    • haui@lemmy.giftedmc.com
      arrow-up
      18
      ·
      7 months ago

      Simple answer: people vs corporations. A dev or homelabber getting help from you is very different from a company making billions just by mass shoveling your knowledge to the highest bidder.

      The reason we need this as a fediverse service is that everyone can take in this knowledge and one corp doesn’t have the ability to sell it. That’s what the worth comes from: someone holding the key to it.

      • i_am_not_a_robot@discuss.tchncs.de
        English
        arrow-up
        6
        ·
        7 months ago

        That’s not what I mean. When you contribute content to Stack Exchange, it is licensed CC BY-SA. There are websites that scrape this content and rehost it, or at least there used to be. I’ve had a problem before where all the search results were unanswered Stack Overflow posts or copies of those posts on different sites. Maybe similar to Reddit they restricted access to the data so they could sell it to AI companies.

    • mbirth@lemmy.mbirth.uk
      arrow-up
      13
      ·
      7 months ago

      Currently, all answers are properly attributed. But once OpenAI has trained and is selling a “hackerman” persona, do you really think it will answer people’s questions with “This answer was contributed by i_am_not_a_robot”, or will it just sell this as its own answer?

      • Taleya@aussie.zone
        arrow-up
        8
        ·
        7 months ago

        As a tech, I’m fucking howling because 99% of answers to any given question are already bullshit that ranges from useless to dangerous.

        “The machine” can’t tell the difference and it’s going to be considered authoritative in its blithe stupidity. Hoover up SA all you want, you’re just gonna aggregate it with bullshit and poison your own well anyway.

  • baseless_discourse@mander.xyz
    arrow-up
    18
    arrow-down
    1
    ·
    edit-2
    7 months ago

    This is a violation of GDPR, no?

    EDIT: user-created content is not directly protected under GDPR; only personally identifiable data is protected under GDPR.

    • lemmyreader@lemmy.ml (OP)
      English
      arrow-up
      16
      ·
      7 months ago

      Dunno. GDPR is a Europe-only thing, and isn’t it only about how your private data (like name, IP address, phone number) is handled?

      • AccountMaker@slrpnk.net
        arrow-up
        7
        ·
        7 months ago

        Right, I think it only covers personal information: companies can only collect what they need to run their service, users can request to see their data etc. I don’t think it applies to comments and posts.

      • Captain Beyond@linkage.ds8.zone
        arrow-up
        3
        ·
        7 months ago

        I would certainly hope so. Stack Overflow content is Creative Commons licensed, so the argument is basically that the GDPR would take precedence over the CC license grant. It’d be scary if GDPR could be weaponized against forks of free software projects in this manner.

        • flux@lemmy.ml
          arrow-up
          4
          ·
          7 months ago

          Would that kind of provision allow me to have my code removed from a git repository history, if that git repository is hosted by a company?

          • interdimensionalmeme@lemmy.ml
            arrow-up
            2
            ·
            7 months ago

            As long as you didn’t give those rights by signing a CLA or a copyleft license. Never sign a CLA unless you’re fully compensated.

          • baseless_discourse@mander.xyz
            arrow-up
            1
            ·
            edit-2
            7 months ago

            I am not a lawyer, but I believe in general, yes.

            Git is not even that convoluted, as all the history is stored in the .git folder within the repo. Unless there is some convoluted structure built on top, they would only need to move the repo folder to a trash disk, waiting to be formatted.

            That being said, GDPR is somewhat poorly enforced at the moment, unfortunately. I don’t know if you can sue the company and expect some result within a couple of years.

            • refalo
              arrow-up
              3
              ·
              7 months ago

              No because user generated content is not protected.

        • WldFyre@lemm.ee
          arrow-up
          3
          arrow-down
          1
          ·
          7 months ago

          Doesn’t that just mean the data would have to be anonymized?

          • baseless_discourse@mander.xyz
            arrow-up
            3
            ·
            7 months ago

            I am not an expert or a lawyer, but I believe users actually hold the right to completely erase personal data:

            The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay

            https://gdpr.eu/right-to-be-forgotten/

            Note the word “erasure” as opposed to “anonymize”

            • WldFyre@lemm.ee
              arrow-up
              5
              ·
              7 months ago

              I don’t think that addresses my point. Is my opinion on the new Star Wars movies that I post online or some lines of code I suggest “personal data”? I thought personal data had a specific definition under GDPR

              • nefonous@lemmy.world
                arrow-up
                5
                ·
                7 months ago

                You’re totally right, the content of your posts is not considered personal data (because it isn’t). It’s more about profiling data that can be connected back to your actual person.

              • baseless_discourse@mander.xyz
                arrow-up
                3
                ·
                7 months ago

                I think you are right, user-generated content doesn’t seem to be protected. This is surprising to me, as users should hold the right to their content, which in my mind should enjoy stronger protection than personal data.

              • Spaenny@discuss.tchncs.de
                arrow-up
                2
                ·
                7 months ago

                Technically, they could retain posts from users if they are irreversibly anonymized. However, ensuring with 100% certainty that none of your posts ever contained any personal data that could lead to the identification of you as an individual is challenging. The safest option is therefore to also delete your posts.

    • refalo
      arrow-up
      1
      ·
      7 months ago

      How does GDPR get away with not defining what a website is when referring to them directly in the law? Like what counts, only html? http? ftp? gopher?