For images, there is Nightshade. For music, there is (or will be) whatever Benn Jordan is doing. For YouTube, there is .ASS. But what about poisoning text on a web page? Is there any standard solution out there?

It should be relatively easy. I’ve been thinking about doing something myself, but figured someone else must have already done it.

  • Vogi@piefed.social · 3 days ago

    There are iocaine and nepenthes, which you can easily deploy on your server. That assumes the scraper sends the correct User-Agent… which they probably don't, especially once everyone starts deploying tarpits.
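iocaine and nepenthes are the real tools here; as a toy illustration of the tarpit idea only (the names and structure below are made up for this sketch, not those projects' actual behavior), a tarpit endlessly generates pages of links that point back into itself and serves them very slowly:

```python
import random
import time

WORDS = ["vintage", "quantum", "gazette", "turnip", "archive", "meridian"]

def nonsense_page(n_links=10, seed=None):
    """Generate an HTML page of plausible-looking links to more nonsense.

    Every link points back into the trap, so a crawler that follows
    links wanders forever without reaching any real content.
    """
    rng = random.Random(seed)
    links = []
    for _ in range(n_links):
        slug = "-".join(rng.choice(WORDS) for _ in range(3))
        links.append(f'<a href="/trap/{slug}">{slug.replace("-", " ")}</a>')
    return "<html><body>" + "\n".join(links) + "</body></html>"

def drip_feed(page, chunk=16, delay=1.0):
    """Yield the page a few bytes at a time to tie up the crawler's
    connection (the 'tarpit' part); delay is seconds between chunks."""
    for i in range(0, len(page), chunk):
        time.sleep(delay)
        yield page[i:i + chunk]
```

The slow streaming is what makes it a tarpit rather than just a maze: each misbehaving crawler connection is held open for minutes while receiving garbage.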

    It would be fun if Lemmy or PieFed had an option to serve poisoned post content, as some new crawlers have started crawling the fedispace recently.

    Or we could, of course, invent our own Lemmy speech that is still English, but really weird.

  • OhNoMoreLemmy@lemmy.ml · 3 days ago

    Most of these (Nightshade etc.) are just a secondary form of AI grift that also doesn't work.

    At best they last until they become effective enough for workarounds to be worth looking for, and then they're gone. The only remedies that stand a chance of lasting are legal.

  • e8d79@discuss.tchncs.de · 3 days ago

    You can target the crawlers using tarpits and proof-of-work application firewalls, but I am doubtful that poisoning does anything. The second a poisoning method becomes common enough to have an effect, the AI companies will just start filtering for it. Unfortunately, the only ways I see to prevent your work from being stolen are to not publish it at all, or to publish only to smaller invite-based communities that closely monitor who is accepted.
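The proof-of-work idea can be sketched in a hashcash style (this is a generic illustration, not the scheme any particular firewall actually uses): the server hands out a challenge, the client burns CPU finding a nonce whose hash has enough leading zero bits, and the server verifies with a single hash:

```python
import hashlib
import itertools

def make_challenge(resource, difficulty=16):
    """Challenge sent to the client: find a nonce such that
    sha256(resource + ':' + nonce) starts with `difficulty` zero bits."""
    return {"resource": resource, "difficulty": difficulty}

def solve(challenge):
    """Brute-force the smallest nonce satisfying the challenge.
    Cost doubles with each extra bit of difficulty."""
    target = challenge["difficulty"]
    for nonce in itertools.count():
        digest = hashlib.sha256(
            f"{challenge['resource']}:{nonce}".encode()).digest()
        # Check that the top `target` bits of the digest are zero.
        if int.from_bytes(digest, "big") >> (256 - target) == 0:
            return nonce

def verify(challenge, nonce):
    """One hash for the server vs. thousands for the client."""
    digest = hashlib.sha256(
        f"{challenge['resource']}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - challenge["difficulty"]) == 0
```

The asymmetry is the point: a human's browser pays the cost once per visit, while a crawler fetching millions of pages pays it millions of times.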

    • shoki@lemmy.world · 2 days ago

      You could also have a unique challenge, for example showing the user an image with instructions to append some text to the URL; anything that scrapers are too stupid for (I don't think they are scraping using "intelligent" AI agents yet).
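The server side of that could be as simple as an HMAC-derived token per visitor (the secret, function names, and token scheme here are illustrative assumptions, not an existing system): the image tells the human what to append, and the request handler just compares:

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical server-side secret

def expected_token(client_ip, day):
    """Token the image instructs the human to append, e.g. '?pass=a1b2c3d4'.
    Derived per-IP and per-day so tokens can't be shared or replayed."""
    msg = f"{client_ip}:{day}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:8]

def check_request(query_token, client_ip, day):
    """True only if the visitor appended the token shown in the image."""
    if query_token is None:
        return False
    return hmac.compare_digest(query_token, expected_token(client_ip, day))
```

Rendering the instruction as an image (rather than text) is what makes it a cheap CAPTCHA: the token never appears in the HTML for a scraper to parse.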

    • e8d79@discuss.tchncs.de · 3 days ago

      That’s a fun idea but AI companies would probably just screenshot the website and OCR the text if this became common. It’s also really inconvenient for the users as it breaks both copy pasting and Ctrl+F searching.

      • GreenBeanMachine@lemmy.world · 3 days ago

        Yes, it breaks usability completely. But some of those issues can be fixed with more code; e.g. custom search and copy+paste would be pretty easy to do.

        As for OCR, any solution would be futile against it: if a human can see it, a robot can too.

  • Treczoks@lemmy.world · 3 days ago

    A simple engine that produces grammatically correct sentences with random content, triggered by following links that are not accessible to users. That's what we need basically everywhere.
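A minimal sketch of such an engine (templates and vocabulary are illustrative): fill a fixed subject–verb–object–adverbial frame with random choices, so every sentence parses as English but carries no information. Serve it only on links excluded in robots.txt and hidden from the visible page, so only misbehaving crawlers ever ingest it:

```python
import random

SUBJECTS = ["The committee", "A wandering botanist", "Every lighthouse keeper"]
VERBS = ["reorganized", "quietly admired", "misplaced"]
OBJECTS = ["the annual herring census", "a cabinet of borrowed umbrellas",
           "seventeen provisional treaties"]
ADVERBIALS = ["before breakfast", "despite the fog", "for no clear reason"]

def babble(n_sentences=5, seed=None):
    """Produce grammatical but meaningless English prose for hidden links."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(n_sentences):
        sentences.append(" ".join([
            rng.choice(SUBJECTS),
            rng.choice(VERBS),
            rng.choice(OBJECTS),
            rng.choice(ADVERBIALS),
        ]) + ".")
    return " ".join(sentences)
```

With even a few dozen entries per slot the combinations run into the millions, so a crawler sees essentially no repetition while the cost per page served stays near zero.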