inb4: IPFS doesn’t work, unfortunately, as you cannot provide the hash of an arbitrarily large file and retrieve it from the network. IPFS content IDs (CIDs) are a hash of the tree of chunks, and changes to chunk size can also change the hash!
Basically, I’d like to take the SHA256, SHA3, blake2, or md5 of a file and either retrieve it from a network or get a list of sources for that file. Does something like that exist already, or will I have to build it?
If I have to build it, it will be a really simple, dumb HTTP service with

`GET /uris/:hash:?alg=sha256|md5|blake`
`POST /uri/:hash:`

with the contents being a URI to the file.
Supported URI schemes would probably be HTTP/S and FTP. Maybe P2P protocols like IPFS, and if there’s a way to target a specific file in a torrent, maybe magnet links too. But that feels like risky territory.
Of course, for hashing requests it would have a limited task queue (maybe 5 in parallel?), rate limiting by IP, and a size limit for retrieval (1 GB feels like more than enough).
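Roughly what I have in mind, as an untested in-memory Python sketch (the function names `register_uri`/`lookup_uris` are made up, and a real version would sit behind the HTTP endpoints above):

```python
# Hypothetical sketch of the "dumb HTTP service": an in-memory index
# mapping (algorithm, hex digest) -> list of URIs claiming to serve it.
from collections import defaultdict

SUPPORTED_ALGS = {"sha256", "md5", "blake2b"}

# (alg, hex_digest) -> list of source URIs
_index: dict[tuple[str, str], list[str]] = defaultdict(list)

def register_uri(alg: str, digest: str, uri: str) -> None:
    """POST /uri/:hash: -- record a URI claiming to serve this hash."""
    if alg not in SUPPORTED_ALGS:
        raise ValueError(f"unsupported algorithm: {alg}")
    if not uri.startswith(("http://", "https://", "ftp://")):
        raise ValueError(f"unsupported URI scheme: {uri}")
    _index[(alg, digest.lower())].append(uri)

def lookup_uris(alg: str, digest: str) -> list[str]:
    """GET /uris/:hash:?alg=... -- list known sources for this hash."""
    return list(_index.get((alg, digest.lower()), []))

# Example:
register_uri("sha256", "ab" * 32, "https://example.org/big.iso")
print(lookup_uris("sha256", "AB" * 32))  # ['https://example.org/big.iso']
```

The service would still have to fetch and hash each submitted URI itself before trusting the mapping, which is where the task queue and size limit come in.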
Can’t think of a way to do it with a DHT 🤷
IPFS with chunking disabled, then.
BitTorrent and Hyphanet have mechanisms that do this.
Magnet URIs are a standard way of encoding this.
EDIT: You typically want a slightly-more-elaborate approach than just handing the network a hash and then getting a file.
You typically want to be able to “chunk” a large file so that you can pull it from multiple sources. The problem is that you can only validate that information is correct once you have the whole file. So, say you “chunk” the file, get part of it from one source and part from another. A malicious source could feed you incorrect data. You can see that the end file does not hash to the right value, but then you have no idea which part of the file, fed to you by some source, is invalid, so you don’t know whom to re-fetch data from.
What’s more common is a system where you have the hash of a hash tree of a file. That way, you can take the hash, request the hash tree from the network, validate that the hash tree hashes to the hash, and then start requesting chunks of the file, where a leaf node in the hash tree is the hash of a chunk. That way, you can validate data at a chunk level, and know that a chunk is invalid after requesting no more than one chunk from a given source.
See Merkle tree, which also mentions Tiger Tree Hash; TTH is typically used as a key in magnet URIs.
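To make the idea concrete, here’s a toy Python sketch of the scheme (assumptions: fixed 1 MiB chunks and SHA-256 for both leaves and interior nodes; real systems like BitTorrent v2 differ in padding and details):

```python
import hashlib

CHUNK = 1024 * 1024  # assumed fixed chunk size; real systems vary

def hash_chunks(data: bytes) -> list[bytes]:
    """SHA-256 of each fixed-size chunk (the leaves of the tree)."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise until one root hash remains."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:                    # odd count: duplicate last
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

data = b"x" * (3 * CHUNK)                     # pretend 3 MiB file
root = merkle_root(hash_chunks(data))         # this is what you'd publish

# A downloader holding `root` can check each chunk as it arrives:
good = hashlib.sha256(data[:CHUNK]).digest()
evil = hashlib.sha256(b"y" * CHUNK).digest()
assert good == hash_chunks(data)[0]           # genuine chunk validates
assert evil != hash_chunks(data)[0]           # tampered chunk is caught
```

The key property: a bad chunk is caught the moment it arrives, so you know exactly which source to stop fetching from.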
EDIT2:
Can’t think of a way to do it with a DHT
All of the DHTs that I can think of exist to implement this sort of thing.
EDIT3: Oh, I skimmed over your concern and didn’t notice that you took issue with using a hash tree. I think that one normally does want a hash tree, and that it’s a mistake to use a straight hash. I mean, you can generate the hash of a hash tree as easily as the hash of a file, if you have that file, which it sounds like you do. On Linux, rhash(1) can generate hashes of hash trees. So if you already have the file, that’s probably what you want. Hypothetically, I guess you could go build some kind of index mapping hashes to hashes of hash trees. I don’t know whether you can pull the hash off BitTorrent or something, but I wouldn’t be surprised if you can. But… you’re probably better off with hash trees, unless you can’t see the file and are already committed to a straight hash of the file.
EDIT4:
I mean:
```
$ rhash --sha1 --hex pkgs
7d3a772009aacfe465cb44be414aaa6604ca1ef0 pkgs
$ rhash -T --hex pkgs
18cab20ffdc55614ed45c5620d85b0230951432cdae2303a pkgs
$
```
Either way, straight hash or hash of a hash tree, you’re getting a hex string that uniquely identifies your file. Just that in the hash-tree case, you solve some significant problems related to the other thing you want to do: fetch your file. It might be more compute-intensive to generate a hash of a hash tree, but unless you’re really compute-constrained… *shrugs*
You typically want a slightly-more-elaborate approach than just handing the network a hash and then getting a file.
[…]
Blake3 supports verified streaming, as it is built upon Merkle trees as you described; so is IPFS. As I mentioned, IPFS hashes are those of the tree, not of the file contents themselves, but that doesn’t help when you have a SHA256 sum of a file and want to download it. Maybe there are networks that map a SHA256 sum to a blake3 sum, an IPFS CID, or even an HTTP URI, but I don’t know of one, hence the question here.
BitTorrent and Hyphanet have mechanisms that do this.
Do you know of a way to exploit that? A library maybe?
Nah, I wrote that when I thought that you just wanted content-based addressing, not that you specifically objected to hash trees being used for the addressing.
provide the hash of an arbitrarily large file and retrieve it from the network
I sense an XY Problem scenario. Can you explain what you’re seeking to ultimately build and what requirements you have?
Does the solution need to be distributed? Does the retrieval need to complete ASAP, or can it wait until data becomes available? What sort of reliability/availability does this need? If only certain hash algorithms can be supported, which ones do you need and why?
I ask this because the answer will be drastically different if you’re building the content distribution system for a small video game versus building the successor to Kim Dotcom’s Mega file-sharing service.
It’s quite simple: I want to `retrieveFile(fileHash)`, where `fileHash` is the output of `md5sum $file` or `sha256sum $file`, or whatever other hashing algorithm exists.

It’s a string, dawg. Just maintain a database of hash and resource location. Look up hash, return location.
you have to fucking hope no one figures out how to backwards engineer the algorithm you choose
Why?
If two files have the same hash, you may receive the file you request by hash, or you may receive a different, possibly malicious file.
https://en.m.wikipedia.org/wiki/Collision_attack
Strong cryptographic hashes are resistant to such attacks, but md5 is relatively weak.
Absolutely. An example of a malicious collision would be to request the file with the SHA-1 of 38762cf7f55934b34d179ae6a4c80cadccbb7f0a. But… there are two of them here.
MD5 is so broken that its former status as a cryptographic hash function has been stripped. And efforts are underway to replace SHA-1 where it’s used, since although it takes some prerequisites to intentionally create a SHA-1 collision today, it’s worth remembering that “attacks always get better, they never get worse”.
I’m not sure what your concern is. I’d basically like to call a function `retrieveFile(fileHash)` and get bytes back. Or call `retrieveFileLocations(fileHash)` and get URIs back to where the file can be downloaded. Also, it’ll be open source, so nothing to reverse engineer.
IPFS content IDs (CID) are a hash of the tree of chunks. Changes to chunk size can also change the hash!
I don’t understand why this is a deal-breaker. It seems like you could accomplish what you describe within IPFS simply by committing to a fixed chunk size. That’s valid within IPFS, right?
Is it important to use any specific hashing algorithm(s)? If not, then isn’t an IPFS CID (with a fixed, predetermined chunk size) a stable hash algorithm in and of itself?
If you `sha256sum $file` and send that hash to somebody, they can’t download the file from IPFS (unless it’s <2MB, IINM); that’s the problem. And it can be any hashing algorithm: md5, blake, whatever.
There is BitTorrent, which I’m sure you’re aware of, and then there is also WebTorrent, which you may not be.
I’m also actively working on this exact problem with WebMirror with the key difference being that it works in browsers without requiring any additional software. Here is its demo: https://webmirror-demo.netlify.app/
How do I retrieve a file from BitTorrent with just its hash? Does WebMirror solve that? I’ll have a look at it…