As I was browsing lemmy and the fediverse at large, this question kept popping into my head.

Multimedia files have a much bigger footprint than raw text, so I worry that, as time goes on, massive resources will be needed to keep up with all the data coming in.

I do wonder whether instances have taken the cloud route and just put all of it in something like AWS S3, or whether they use self-hosted storage with something like MinIO for object storage.

  • laenurd
    48
    11 months ago

    This will differ greatly from instance to instance. The people running lemmy.world have published some info on their infrastructure. My instance is running on a rather small VPS with 100GB storage, but I will have to rethink my solution rather soon as images and videos from my subbed communities [Edit: which are stored on outside sites] are eating around a gigabyte per day and I think this is likely to increase.

    Edit: I want to clarify that I was partially wrong - Lemmy only locally caches content which is hosted on outside sites (e.g. imgur). It does not cache content that was directly uploaded to another Lemmy instance and just embeds the source media.
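To put the storage figures above into perspective, here is a rough back-of-the-envelope sketch. The 100GB disk and the growth of about a gigabyte per day come from the comment; the amount already used (40GB) is purely an assumed value for illustration, and the function name is hypothetical.

```python
import math


def days_until_full(capacity_gb: float, used_gb: float, growth_gb_per_day: float) -> float:
    """Estimate how many days remain before the disk fills up.

    Assumes a constant growth rate, which is optimistic if traffic keeps rising.
    """
    if growth_gb_per_day <= 0:
        return math.inf  # no growth means the disk never fills
    free_gb = max(capacity_gb - used_gb, 0.0)
    return free_gb / growth_gb_per_day


# 100 GB capacity and ~1 GB/day are from the comment; 40 GB used is an assumption.
print(days_until_full(100, 40, 1.0))  # 60.0
```

With those assumed numbers the disk would last about two months, and less if the daily growth increases as expected.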

    • @BlurkerOP
      15
      11 months ago

      Thank you for your work for the community! I think that as more people use Lemmy, we as users should also look out for the infra we are using, because the admins are not a mega corporation ready to spin up infinite resources.

      • laenurd
        14
        11 months ago

        No need to thank me, currently I am the only non-bot-user of my instance and do not allow registrations 😅

        Many of the bigger instances have links to donate to their operators, but I am doubtful that relying solely on donations will be enough in the long run.

        • Nat
          1
          11 months ago

          Since you’re the only one, you might consider setting an expiration on the media so your local storage serves as more of a cache. Like, I’m sure you’re far more likely to revisit a recent thread than a super old one, and as long as the original instance is still around you could redownload the media. This might require software patches though idk
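The expiration idea above could look roughly like the following sketch. It is not something Lemmy or pict-rs is known to provide; `CachedFile`, `select_expired` and `select_to_fit` are hypothetical names, and the code only decides which files to evict, leaving the actual deletion and any re-download to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedFile:
    path: str
    size_bytes: int
    last_access: float  # Unix timestamp in seconds


def select_expired(files, now, max_age_seconds):
    """Return the files that have not been accessed within max_age_seconds."""
    return [f for f in files if now - f.last_access > max_age_seconds]


def select_to_fit(files, max_total_bytes):
    """Return the least recently accessed files to evict until the cache fits the budget."""
    total = sum(f.size_bytes for f in files)
    evict = []
    for f in sorted(files, key=lambda f: f.last_access):
        if total <= max_total_bytes:
            break
        evict.append(f)
        total -= f.size_bytes
    return evict
```

Evicting by age keeps recent threads fast, while the size budget acts as a hard ceiling if a day of unusually heavy media comes in.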

          • laenurd
            2
            11 months ago

            Firstly: I was partially wrong about what gets cached, see my original comment.

            There is an open pull request which is meant to give some options regarding media serving. Right now it’s only a rough sketch though and does not implement a lot of functionality.

      • @[email protected]
        10
        11 months ago

        Gfycat is shutting down, sadly. There’s no money to be had in hosting pictures and videos for other sites that are viewed without ads. We already saw the Imgur clampdown a month or so ago. If these instances can’t self-host the content, it’s all going to have an invisible expiration date.

      • @[email protected]
        6
        11 months ago

        This is the best idea in my opinion. Even on Reddit people used Imgur a bunch. It would be cool if either Lemmy itself or the various apps that have sprung up tied into an API for Imgur or similar to store pictures/videos; then it would be seamless to the end user.

        EDIT: looks like Memmy on iOS actually already does this!

      • laenurd
        3
        11 months ago

        I was wrong about what gets cached: media that is hosted directly on remote instances is not cached, while media from outside sources (imgur etc.) is cached and served from that cache.

        So, from a small instance’s point of view, the best case scenario would be if everyone used Lemmy’s own media hosting exclusively. But that would, of course, greatly increase the storage requirements of larger instances.

      • @[email protected]
        3
        11 months ago

        The issue is that if they shut down or change policies we get bit rot. Idk though it’s starting to sound pretty good.

      • laenurd
        4
        11 months ago

        Everything. It does some re-encoding when it retrieves content from other instances and you can set limits for pictrs (the software Lemmy uses to host media) regarding file sizes etc.

        Edit: I was partially wrong about what is cached, see my original comment

        • 🌴 𝓣𝓸𝓾𝓻𝓲𝓼𝓽
          2
          11 months ago

          I need to look more into pictrs and what it can do. Is this done on purpose for image redundancy? I get the reason if the original instance goes offline then I’d still have a copy but maybe I don’t really want a copy? Also would be nice if I could get it to convert everything to webp

          • jsqribe
            5
            11 months ago

            I think by default it already converts everything into webp. The repo will have more information on how it all works.

        • Skyhighatrist
          2
          11 months ago

          When I was looking into hosting my own instance I thought I saw an option to disable media file replication entirely so that they would always have to be fetched from their home instance.

          • laenurd
            2
            11 months ago

            That would be great to know, any chance you remember where you read that?

            • Skyhighatrist
              2
              11 months ago

              It’s possible that I’ve misunderstood. And it’s also important to note that I was looking into this for the purposes of creating my own, single user instance. I wasn’t planning on posting to my own instance, just using it as a single logon where I could control what other instances I federated with.

              Here it mentions not installing pict-rs and removing its configuration if you don’t need image hosting. My interpretation at the time was that it would mean that no images would be hosted locally on my instance. But that was very early on before I understood more about federation, and now I realize that it may in fact also mean that any content coming from federated instances could have images broken, not that it would load the images from the remote instance. So now, I no longer think that this is a solution for not syncing images, but I’m not at all sure of that.

            • Skyhighatrist
              2
              11 months ago

              No, but I bet I could find it again if I hadn’t just imagined it and made it up for this comment. Give me a few.

    • @[email protected]
      3
      11 months ago

      Honestly you may benefit from a cloud system that can put old images into cold storage when they go unaccessed for more than a few days

  • RCMaehl [Any]
    14
    11 months ago

    Edit: I am partially wrong. (See below)

    They’re stored on their host Instance. Only text is copied across instances.

    • laenurd
      20
      11 months ago

      That is not true. As long as a user on your instance is subscribed to a community, the media content of posts [Edit: only posts linking to outside sources, e.g. imgur] of that community is stored locally on your instance as well.

      This, of course, only applies to media which is uploaded to Lemmy, links to media hosted externally are not downloaded.

      See this issue for more context.

      Edit: I want to clarify that I was partially wrong - Lemmy only locally caches content which is hosted on outside sites. It does (should?) not cache content that was directly uploaded to a Lemmy instance and just embeds the source media.

      • @[email protected]
        25
        11 months ago

        I think this could be a ticking DOS time bomb.

        Someone who manages to spam-upload massive files to the largest Lemmy instances could wipe out a ton of smaller ones.

        Not to mention that, scalability-wise, this seems like a nightmare… eventually the largest Lemmy instances will have petabytes of media data with hundreds of GB coming in per day, giving other instances no chance to sync with them.

        I think the system architecture needs a significant review. This won’t scale.

        • @[email protected]
          5
          11 months ago

          I agree. It’s also a tremendous waste of resources. I’m all for redundancy (like CDNs), but this seems incredibly poorly thought out. If Lemmy (as a whole) ever scales to the size of other social media, the space requirements will start to become unreasonable.

          Why wouldn’t something like symlinks be implemented? Not saying specifically use symlinks, but there has to be a similar, better way.

          • laenurd
            6
            11 months ago

            The obvious way would be to just not cache content locally and always link to the source instance. While this would concentrate the strain immensely, it would also greatly decrease the storage space used by all other instances.

            There might also be other viable alternatives such as using a CDN and having it selectively cache content which is requested often etc.

            ~~As of now, Lemmy does not support either, though.~~

            Edit: I want to clarify that I was partially wrong - Lemmy only locally caches content which is hosted on outside sites. It does (should?) not cache content that was directly uploaded to a Lemmy instance and just embeds the source media.

        • laenurd
          3
          11 months ago

          Agree. If I’m not mistaken, you can only disable the caching of sensitive (NSFW) content on your instance by disabling NSFW in general. This doesn’t go for SFW content though.

          It shouldn’t be very hard to do this for all content though. If I find the time, I might look into implementing this.

        • AFK BRB Chocolate
          3
          11 months ago

          I feel like the developers should spend some time adding features to reduce malicious activity. They could provide settings to the admins to limit the number of things one user can do in a day, like number of images, total size of images, number of communities created, etc. Sure, someone could create multiple accounts, but it would still make it harder to attack Lemmy.
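The per-user limits suggested above could be sketched along these lines. This is only an illustration of the idea, not an existing Lemmy setting: `UploadQuota`, its fields and `try_upload` are hypothetical names, and the counters live in memory rather than in a database.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UploadQuota:
    max_uploads_per_day: int
    max_bytes_per_day: int
    # Maps (user, day) to [upload_count, total_bytes]; in-memory only.
    _usage: dict = field(default_factory=lambda: defaultdict(lambda: [0, 0]))

    def try_upload(self, user: str, day: str, size_bytes: int) -> bool:
        """Record the upload and return True if it stays within both daily limits."""
        count, total = self._usage[(user, day)]
        if count + 1 > self.max_uploads_per_day:
            return False
        if total + size_bytes > self.max_bytes_per_day:
            return False
        self._usage[(user, day)] = [count + 1, total + size_bytes]
        return True


quota = UploadQuota(max_uploads_per_day=2, max_bytes_per_day=100)
print(quota.try_upload("alice", "2023-07-01", 60))  # True
print(quota.try_upload("alice", "2023-07-01", 60))  # False, would exceed 100 bytes
```

As the comment notes, multiple accounts can still get around a limit like this, so it raises the cost of an attack rather than preventing one.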

        • ndguardian
          1
          11 months ago

          I mentioned something akin to this possibility a couple days ago, but was told this likely wasn’t the case. I’ll have to see if I can dig up the argument for that.

      • Remy Rose
        3
        11 months ago

        This actually brings up another question for me! Say your account is on an instance that doesn’t allow something, like nudity. If you subscribe to a community on another instance that DOES allow it, you’re saying that everything you see there does end up (redundantly) hosted by your home instance. Has the Lemmy moderation/admin community in general decided on whether or not that’s breaking the home instance’s rules?

        • laenurd
          2
          11 months ago

          Right now you can only disable caching of nsfw content by disabling NSFW for the instance, but of course this has nothing to do with “soft” rules that are only written out in text.

          Imo the best solution would be to allow admins to have more granular control over caching, e.g. disabling caching for specific instances / communities or whitelisting. And we need an option to disable caching altogether.
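A minimal sketch of what such granular control might look like, assuming the decision is made per media host. None of this is an existing Lemmy or pict-rs option: `CachePolicy` and its fields are hypothetical, and per-community rules are left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Optional, Set
from urllib.parse import urlparse


@dataclass
class CachePolicy:
    enabled: bool = True                                   # False disables caching altogether
    blocked_hosts: Set[str] = field(default_factory=set)   # never cache media from these hosts
    allowed_hosts: Optional[Set[str]] = None               # if set, acts as a whitelist

    def should_cache(self, media_url: str) -> bool:
        """Decide whether media at media_url may be cached locally."""
        if not self.enabled:
            return False
        host = (urlparse(media_url).hostname or "").lower()
        if not host or host in self.blocked_hosts:
            return False
        if self.allowed_hosts is not None:
            return host in self.allowed_hosts
        return True


policy = CachePolicy(blocked_hosts={"nsfw.example"})
print(policy.should_cache("https://i.imgur.com/a.png"))     # True
print(policy.should_cache("https://nsfw.example/pic.jpg"))  # False
```

Putting the global switch first means an admin can turn caching off entirely without touching the block or allow lists.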

        • laenurd
          1
          11 months ago

          I updated my comment as I was partially wrong about what gets cached.

    • @BlurkerOP
      2
      11 months ago

      That sounds good for reliability since an instance can still look up posts even if another fails.

      For videos and images, do they store them as blobs in the database or do they use something more catered to files like object storage or maybe a regular filesystem with metadata on a database?

        • @BlurkerOP
          2
          11 months ago

          That’s an awesome name. The Rust community never fails to deliver lol