inb4: IPFS doesn’t work here, unfortunately, as you cannot provide the hash of an arbitrarily large file and retrieve it from the network. IPFS content IDs (CIDs) are a hash of the tree of chunks, and changes to the chunk size can also change the hash!

Basically, I’d like to take the SHA-256, SHA-3, BLAKE2, or MD5 hash of a file and either retrieve the file from a network or get a list of sources for it. Does something like that already exist, or will I have to build it?

If I have to build it, it will be a really simple, dumb HTTP service with:

  • GET /uris/:hash:?alg=sha256|md5|blake
  • POST /uri/:hash: with the contents being a URI to the file
    supported URI schemes would probably be HTTP(S) and FTP, maybe P2P protocols like IPFS, and, if there’s a way to target a specific file in a torrent, maybe magnet links too. But that feels like risky territory.

Of course, for hashing requests it would have a bounded task queue (maybe 5 in parallel?), rate limiting by IP, and a size limit on retrieval (1 GB feels like more than enough).
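
A minimal sketch of what I have in mind, assuming Python with Flask and an in-memory store (the names here are illustrative; the real thing would fetch and hash a submitted URI before listing it):

    from collections import defaultdict
    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)

    ALLOWED_ALGS = {"sha256", "md5", "blake2b"}
    # (alg, hex digest) -> URIs claiming to serve the matching bytes
    URIS: dict[tuple[str, str], set[str]] = defaultdict(set)

    @app.get("/uris/<digest>")
    def get_uris(digest):
        alg = request.args.get("alg", "sha256")
        if alg not in ALLOWED_ALGS:
            abort(400, "unsupported algorithm")
        return jsonify(sorted(URIS[(alg, digest.lower())]))

    @app.post("/uri/<digest>")
    def post_uri(digest):
        alg = request.args.get("alg", "sha256")
        if alg not in ALLOWED_ALGS:
            abort(400, "unsupported algorithm")
        uri = request.get_data(as_text=True).strip()
        if not uri.startswith(("http://", "https://", "ftp://")):
            abort(400, "unsupported URI scheme")
        # Real version: enqueue the fetch on a bounded queue (~5 workers),
        # rate-limit by IP, cap downloads at 1 GB, and only publish the
        # URI once the fetched bytes actually hash to the digest.
        URIS[(alg, digest.lower())].add(uri)
        return "", 202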

Can’t think of a way to do it with a DHT 🤷

  • tal@lemmy.today

    BitTorrent and Hyphanet have mechanisms that do this.

    Magnet URIs are a standard way of encoding this.
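
    For example, a BitTorrent magnet link carries the info-hash right in the URI (placeholders below, not a real link):

        magnet:?xt=urn:btih:<infohash>&dn=<display-name>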

    EDIT: You typically want a slightly-more-elaborate approach than just handing the network a hash and then getting a file.

    You typically want to be able to “chunk” a large file, so that you can pull it from multiple sources. The problem is that with a plain hash, you can only validate the data once you have the whole file. Say you chunk the file and get part of it from one source and part from another, and a malicious source feeds you incorrect data. You can see that the assembled file does not hash to the right value, but you have no idea which part is invalid, so you don’t know which source to re-fetch from.

    What’s more-common is a system where you have the hash of a hash tree of a file. That way, you can take the hash, request the hash tree from the network, validate that the hash tree hashes to the hash, and then start requesting chunks of the file, where a leaf node in the hash tree is the hash of a chunk. That way, you can validate data at a chunk level, and know that a chunk is invalid after requesting no more than one chunk from a given source.

    See Merkle tree, which also mentions Tiger Tree Hash; TTH is typically used as a key in magnet URIs.
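
    A toy sketch in Python of the chunk-level verification idea (the chunk size and helper names here are arbitrary, and a real hash tree like TTH hashes pairs of nodes upward rather than this one flat level):

        import hashlib

        CHUNK = 1 << 18  # 256 KiB; arbitrary for this sketch

        def leaf_hashes(data: bytes) -> list[bytes]:
            # Leaf level of the tree: one hash per fixed-size chunk.
            return [hashlib.sha256(data[i:i + CHUNK]).digest()
                    for i in range(0, len(data), CHUNK)]

        def root_hash(data: bytes) -> bytes:
            # Flattened "tree": hash of the concatenated leaf hashes.
            # This root is the short identifier you would publish.
            return hashlib.sha256(b"".join(leaf_hashes(data))).digest()

        def chunk_ok(chunk: bytes, index: int, leaves: list[bytes]) -> bool:
            # A downloader that trusts the root (and has validated the
            # leaf list against it) can check each chunk as it arrives,
            # so a bad source is caught after at most one chunk.
            return hashlib.sha256(chunk).digest() == leaves[index]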

    EDIT2:

    > Can’t think of a way to do it with a DHT

    All of the DHTs that I can think of exist to implement this sort of thing.

    EDIT3: Oh, I skimmed over your concern and didn’t notice that you took issue with using a hash tree. I think that one normally does want a hash tree, and that it’s a mistake to use a straight hash. I mean, you can generate the hash of a hash tree as easily as the hash of a file if you have that file, which it sounds like you do. On Linux, rhash(1) can generate hashes of hash trees. So if you already have the file, that’s probably what you want.

    Hypothetically, I guess you could go build some kind of index mapping plain hashes to hashes of hash trees. I don’t know whether you can pull the plain hash off BitTorrent or something, but I wouldn’t be surprised if you can. But… you’re probably better off with hash trees, unless you can’t see the file and are already committed to a straight hash of it.

    EDIT4:

    I mean:

    $ rhash --sha1 --hex pkgs 
    7d3a772009aacfe465cb44be414aaa6604ca1ef0  pkgs
    $ rhash -T --hex pkgs 
    18cab20ffdc55614ed45c5620d85b0230951432cdae2303a  pkgs
    $
    

    Either way, straight hash or hash of a hash tree, you’re getting a hex string that identifies your file uniquely. It’s just that in the hash-tree case, you solve some significant problems related to the other thing you want to do: fetch the file. It might be more compute-intensive to generate the hash of a hash tree, but unless you’re really compute-constrained… shrugs

    • tinkralgeOP

      > You typically want a slightly-more-elaborate approach than just handing the network a hash and then getting a file.

      > […]

      BLAKE3 supports verified streaming, as it is built upon Merkle trees like you described; so is IPFS. As I mentioned, though, IPFS hashes are of the tree, not of the file contents themselves, and none of that helps when all you have is the SHA-256 sum of a file and want to download it. Maybe there are networks that map a SHA-256 sum to a BLAKE3 sum, an IPFS CID, or even an HTTP URI, but I don’t know of one; hence the question here.
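
      Building such a mapping yourself is easy if you hold the files, since one pass can produce every digest you want to cross-reference. A sketch with Python’s hashlib (BLAKE3 isn’t in hashlib and would need the third-party blake3 package, so BLAKE2b stands in here):

          import hashlib

          def digests(path: str) -> dict[str, str]:
              # One read of the file feeds all hashers, so indexing a
              # large collection costs a single pass per file.
              hashers = {name: hashlib.new(name)
                         for name in ("sha256", "md5", "blake2b")}
              with open(path, "rb") as f:
                  while block := f.read(1 << 20):
                      for h in hashers.values():
                          h.update(block)
              return {name: h.hexdigest() for name, h in hashers.items()}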

      > BitTorrent and Hyphanet have mechanisms that do this.

      Do you know of a way to exploit that? A library maybe?

      • tal@lemmy.today

        Nah, I wrote that when I thought that you just wanted content-based addressing, not that you specifically objected to hash trees being used for the addressing.

  • litchralee@sh.itjust.works

    > provide the hash of an arbitrarily large file and retrieve it from the network

    I sense an XY problem scenario. Can you explain what you’re ultimately seeking to build and what requirements you have?

    Does the solution need to be distributed? Does the retrieval need to complete ASAP, or can it wait until the data becomes available? What sort of reliability/availability does this need? If only certain hash algorithms can be supported, which ones do you need, and why?

    I ask this because the answer will be drastically different if you’re building the content distribution system for a small video game versus building the successor to Kim Dotcom’s Mega file-sharing service.

    • tinkralgeOP

      It’s quite simple: I want to retrieveFile(fileHash), where fileHash is the output of md5sum $file, sha256sum $file, or whatever other hashing algorithm exists.
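
      Something like this sketch, against the kind of index service from my post (the index URL is a placeholder; the point is that whatever comes back is verified against the digest before being accepted):

          import hashlib, json, urllib.request

          INDEX = "https://example.invalid"  # placeholder index service

          def retrieve_file(file_hash: str, alg: str = "sha256") -> bytes:
              # Ask the index for candidate sources for this digest.
              url = f"{INDEX}/uris/{file_hash}?alg={alg}"
              with urllib.request.urlopen(url) as resp:
                  uris = json.load(resp)
              for uri in uris:
                  with urllib.request.urlopen(uri) as resp:
                      data = resp.read()
                  # Only accept bytes that actually hash to file_hash.
                  if hashlib.new(alg, data).hexdigest() == file_hash.lower():
                      return data
              raise LookupError("no source served matching bytes")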

  • AnAmericanPotato

    > IPFS content IDs (CIDs) are a hash of the tree of chunks, and changes to the chunk size can also change the hash!

    I don’t understand why this is a deal-breaker. It seems like you could accomplish what you describe within IPFS simply by committing to a fixed chunk size. That’s valid within IPFS, right?

    Is it important to use any specific hashing algorithm(s)? If not, then isn’t an IPFS CID (with a fixed, predetermined chunk size) a stable hash algorithm in and of itself?
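
    For instance, with the kubo CLI I believe you can pin down the parameters that affect the CID (the flags below are from kubo’s ipfs add; the values are just one consistent choice):

        $ ipfs add --chunker=size-262144 --cid-version=1 --raw-leaves file.bin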

    • tinkralgeOP

      If you sha256sum $file and send that hash to somebody, they can’t use it to download the file from IPFS (unless the file is < 2 MB, IINM); that’s the problem. And it can be any hashing algorithm: MD5, BLAKE, whatever.

    • tinkralgeOP

      How do I retrieve a file from BitTorrent with just its hash? Does WebMirror solve that? I’ll have a look at it…