Furbland@lemmy.world to 196@lemmy.blahaj.zone · 9 months ago

rulebots.txt

lemmy.world

34 comments
321

  • SteveFromMySpace@lemmy.blahaj.zone
    113 upvotes · 9 months ago

    but not the misuse of public content

    HA

    • unrelatedkeg@lemmy.sdf.org
      15 upvotes · edited · 9 months ago

      but not the misuse of public content

      Is that an admission that they don’t own the content others posted on their site?

      • Furbland@lemmy.worldOP
        4 upvotes · 9 months ago

        you would be a good lawyer

  • shikogo@pawb.social
    70 upvotes · 9 months ago

    I am confused, does this mean Reddit is not going to be searchable on search engines anymore?

    • Zagorath@aussie.zone
      85 upvotes · 9 months ago

      Unfortunately yes. It was reported on last month.

    • Aeri@lemmy.world
      66 upvotes · 9 months ago

      oh no, Reddit is like, the only way to have google still be useful.

      • germanatlas@lemmy.blahaj.zone
        54 upvotes · 9 months ago

        Funnily enough, google is also the only way to have Reddit be useful.

        Their own search function has been nothing but garbage.

      • morgunkorn@discuss.tchncs.de
        43 upvotes · 9 months ago

        That’s the catch: Google made a deal with Reddit and remains the only search engine allowed to access its data for indexing. It cuts off every other search engine.

        • vortic@lemmy.world
          27 upvotes · 9 months ago

          Tell me that there is an anti trust suit over this.

          • Furbland@lemmy.worldOP
            26 upvotes · 9 months ago

            There’s a suit over google in general so this may well be part of it

        • TriflingToad@lemmy.world
          3 upvotes · 9 months ago

          really? ddg will show me reddit links, did they have to make a webscraper or something

          • morgunkorn@discuss.tchncs.de
            4 upvotes · 9 months ago

            There’s a cutoff date, anything indexed before the robots.txt was changed stays in the index

      • riodoro1@lemmy.world
        31 upvotes · 9 months ago

        We fucked the internet. It’s proprietary now.

        • Furbland@lemmy.worldOP
          11 upvotes · edited · 9 months ago

          we fucked the internet

          kinky

          • Pup Biru@aussie.zone
            English · 8 upvotes · 9 months ago

            cat5 sounding you say?

            • Swedneck@discuss.tchncs.de
              2 upvotes · 9 months ago

              cat5-o-nine-tails

      • Norah (pup/it/she)@lemmy.blahaj.zone
        English · 9 upvotes · 9 months ago

        Good news! Google paid up and still has access I’m pretty sure.

        • Furbland@lemmy.worldOP
          1 upvote · 9 months ago

          That’s bad news, that means the internet is dying

          • Norah (pup/it/she)@lemmy.blahaj.zone
            English · 2 upvotes, 1 downvote · 9 months ago

            Sorry, the /s was sort of implied.

            • Furbland@lemmy.worldOP
              2 upvotes · 9 months ago

              Ah, sorry. I have trouble with that sometimes :P

    • Furbland@lemmy.worldOP
      9 upvotes · 9 months ago

      Perhaps, likely depends on the crawler though

      • unexposedhazard@discuss.tchncs.de
        12 upvotes · 9 months ago

        Yeah, I don’t think ignoring robots.txt is even illegal. They can of course just block your crawler’s IP, but that would be a cat-and-mouse game that they would lose in the end.
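
To illustrate why robots.txt is only advisory: the file is just text that a crawler may choose to consult before fetching a page. A minimal sketch in Python, using the standard library's `urllib.robotparser`; the robots.txt contents, the `ExampleBot` name, the URL, and the `is_allowed` helper are all made up for illustration and are not Reddit's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler runs this check and skips the page when it returns False.
# Nothing technically stops a crawler from skipping the check entirely.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/some/page"))
```

The check happens entirely on the crawler's side, which is why a site that wants real enforcement has to fall back on measures such as IP blocking.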

  • JusticeForPorygon@lemmy.world
    55 upvotes, 1 downvote · 9 months ago

    Not gonna lie, this seems like ultimately a win for the Internet. The years of troubleshooting solutions Reddit provided can be archived (hopefully), but the less people rely on the site itself, the better. At least in my opinion.

    • TriflingToad@lemmy.world
      2 upvotes · 9 months ago

      I disagree, kinda. Stack Overflow is the other option for questions, and it is a lot less user-friendly, and Lemmy has never shown up in search results for me. If something comes along and makes it simple, great! However, I just see a lot more ad-filled hellhole sites in the meantime.

  • Kojichan@lemmy.world
    52 upvotes · 9 months ago

    I remember finding Google’s robots.txt when they first came out. It was a cute little text ASCII art of a robot with a heart that said, “We love robots!”

  • jabathekek@sopuli.xyz
    50 upvotes · 9 months ago

    An ancient text from the before-fore.

    • Furbland@lemmy.worldOP
      60 upvotes · 9 months ago

      this is actually quite recent. the old one was much funnier and clearly had actual soul put into it.

      • asudox
        6 upvotes · 9 months ago

        my shiny metal ass

  • itsnicodegallo@lemm.ee
    8 upvotes · 9 months ago

    As annoying as this is, it’s to prevent LLMs from training themselves using Reddit content, and that’s probably the greater of the two evils.

    • Furbland@lemmy.worldOP
      37 upvotes · 9 months ago

      That’s all well and good, but how many LLMs do you think actually respect robots.txt?

      • colin@lemmy.uninsane.org
        English · 14 upvotes · 9 months ago

        from my limited experience, about half? i had to finally set up a robots.txt last month after Anthropic decided it would be OK to crawl my Wikipedia mirror from about a dozen different IP addresses simultaneously, non-stop, without any rate limiting, and bring it to its knees. fuck them for it, but at least it stopped once i added robots.txt.

        Facebook, Amazon, and a few others are ignoring that robots.txt, on the other hand. they have the decency to do it slowly enough that i’d never notice unless i checked the logs, at least.
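
For reference, a rough sketch of what a well-behaved crawler could do with a robots.txt like the one described above: check per-agent rules and honor a crawl delay. It uses Python's standard `urllib.robotparser`; the file contents, the `ExampleBot` agent, the URLs, and the `crawl_plan` helper are hypothetical, and `Crawl-delay` is a non-standard directive that not every crawler honors.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named bot gets limited, throttled access,
# and every other crawler is asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /
"""


def crawl_plan(robots_txt: str, user_agent: str, urls: list[str]):
    """Return (delay_in_seconds, allowed_urls) for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when no delay applies to this agent;
    # the 1-second fallback is an arbitrary choice for this sketch.
    delay = parser.crawl_delay(user_agent) or 1
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    return delay, allowed


delay, allowed = crawl_plan(
    ROBOTS_TXT,
    "ExampleBot",
    ["https://example.com/wiki/Page", "https://example.com/private/x"],
)
print(delay, allowed)
```

A crawler following this plan would fetch only the allowed URLs and sleep for `delay` seconds between requests, rather than hitting the site from many addresses at once.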

    • jbk@discuss.tchncs.de
      32 upvotes · 9 months ago

      I thought major LLMs ignored robots.txt

    • cheddar
      25 upvotes · 9 months ago

      It’s to profit from training LLMs: https://arstechnica.com/information-technology/2024/02/your-reddit-posts-may-train-ai-models-following-new-60-million-agreement/

    • Anas@lemmy.world
      12 upvotes · 9 months ago

      It’s to prevent LLMs from training themselves using reddit content, unless they pay the party that took no part in creating said content

      FTFY

196@lemmy.blahaj.zone


Be sure to follow the rule before you head out.


Rule: You must post before you leave.



Other rules

Behavior rules:

  • No bigotry (transphobia, racism, etc…)
  • No genocide denial
  • No support for authoritarian behaviour (incl. Tankies)
  • No namecalling
  • Accounts from lemmygrad.ml, threads.net, or hexbear.net are held to higher standards
  • Other things seen as clearly bad

Posting rules:

  • No AI generated content (DALL-E etc…)
  • No advertisements
  • No gore / violence
  • Mutual aid posts are not allowed

NSFW: NSFW content is permitted but it must be tagged and have content warnings. Anything that doesn’t adhere to this will be removed. Content warnings should be added like: [penis], [explicit description of sex]. Non-sexualized breasts of any gender are not considered inappropriate and therefore do not need to be blurred/tagged.

If you have any questions, feel free to contact us on our matrix channel or email.

Other 196’s:

  • [email protected]
  • [email protected]
Visibility: Public

This community can be federated to other instances and be posted/commented in by their users.

  • 630 users / day
  • 2.29K users / week
  • 5.45K users / month
  • 18.1K users / 6 months
  • 164 local subscribers
  • 17.5K subscribers
  • 19K Posts
  • 221K Comments
  • Modlog
  • mods:
  • Moss@lemmy.blahaj.zone
  • greembow@lemmy.blahaj.zone
  • moss@lemmy.world
  • Queue@beehaw.org
  • funky-rodent [he/him]@lemmy.blahaj.zone
  • Peachy [they/she] @lemmy.blahaj.zone
  • threegnomes@lemmy.blahaj.zone
  • greembow@lemmy.world
  • remotelove@lemmy.ca
  • Roflmasterbigpimp@feddit.de
  • A_Very_Big_Fan@lemm.ee
  • qaz@lemmy.blahaj.zone
  • A_Very_Big_Fan@lemmy.world
  • qaz@lemmy.sdf.org
  • qaz@lemmy.world
  • qaz@sh.itjust.works