• provide the hash of an arbitrarily large file and retrieve it from the network

    I sense an XY Problem scenario. Can you explain what you’re seeking to ultimately build and what requirements you have?

    Does the solution need to be distributed? Does the retrieval need to complete ASAP, or can it wait until data becomes available? What sort of reliability/availability does this need? If only certain hash algorithms can be supported, which ones do you need and why?

    I ask this because the answer will be drastically different if you’re building the content distribution system for a small video game versus building the successor to Kim Dotcom’s Mega file-sharing service.

    • It’s quite simple: I want to retrieveFile(fileHash) where fileHash is the output of md5sum $file or sha256sum $file, or whatever other hashing algorithm exists.

      • This seems like a restatement of X. We still don’t understand Y. I’m especially confused about:

        • Why are SHA-256 and friends ok, but IPFS CIDs are not? They have basically the same functionality.
        • Do you need a distributed network, or is a single server ok?

        There was some hint that maybe you’re concerned about reproducibility for CIDs? If you fix the block size, hash algorithm, and content codec, you’ll get consistent results. As it happens, SHA-256 itself also processes data in blocks of 64 bytes.

        Anyway Wikipedia has a list of content-addressable store implementations. A couple that stand out to me are git and git-annex.

      • It’s a string, dawg. Just maintain a database of hash and resource location. Lookup hash, return location.
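
        A minimal sketch of that lookup, assuming a plain in-memory table. The names `announce` and `retrieve_file_locations` and the example.com URLs are made up for illustration; a real service would put a database or a DHT behind the same two calls.

```python
import hashlib
from collections import defaultdict

# Hypothetical in-memory index: hex digest -> set of URIs where the file lives.
# A real deployment would back this with a database or a DHT.
_index: dict[str, set[str]] = defaultdict(set)


def announce(file_hash: str, location: str) -> None:
    """Record that the file with this digest can be fetched from `location`."""
    _index[file_hash.lower()].add(location)


def retrieve_file_locations(file_hash: str) -> list[str]:
    """Return every known location for the digest (empty list if unknown)."""
    return sorted(_index.get(file_hash.lower(), ()))


# Sample digest computed locally, standing in for `sha256sum $file`.
digest = hashlib.sha256(b"example file contents").hexdigest()
announce(digest, "https://mirror-a.example.com/file.bin")
announce(digest, "https://mirror-b.example.com/file.bin")
print(retrieve_file_locations(digest.upper()))  # lookup is case-insensitive
```

        The hard part is not the lookup but who fills the table and why you would trust them, which is why the client should still re-hash whatever it downloads.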

  • 1 year

    BitTorrent and Hyphanet have mechanisms that do this.

    Magnet URIs are a standard way of encoding this.
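
    As a rough illustration, here is how a plain SHA-1 of a file could be wrapped in a magnet URI. This assumes the `urn:sha1` form of the `xt` parameter with a Base32-encoded digest; the function name and sample data are made up. Note that BitTorrent magnets use `urn:btih` with the torrent’s info-hash instead, which is not a hash of the file bytes.

```python
import base64
import hashlib
from urllib.parse import quote


def magnet_for_sha1(data: bytes, display_name: str) -> str:
    """Build a magnet URI whose exact-topic (xt) is the file's SHA-1 digest."""
    digest = hashlib.sha1(data).digest()
    # urn:sha1 topics conventionally carry the digest Base32-encoded, not hex.
    b32 = base64.b32encode(digest).decode("ascii")
    return f"magnet:?xt=urn:sha1:{b32}&dn={quote(display_name)}"


print(magnet_for_sha1(b"example file contents", "example.bin"))
```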

    EDIT: You typically want a slightly-more-elaborate approach than just handing the network a hash and then getting a file.

    You typically want to be able to “chunk” a large file, so that you can pull it from multiple sources. The problem is that with a plain hash of the whole file, you can only validate the data once you have the whole file. So, say you “chunk” the file and get part of it from one source and part from another. A malicious source could feed you incorrect data. You can detect that the assembled file does not hash to the right value, but you have no idea which part is invalid, so you don’t know which source to re-fetch from.

    What’s more common is a system where you have the hash of a hash tree of a file. That way, you can take the hash, request the hash tree from the network, validate that the hash tree hashes to the hash, and then start requesting chunks of the file, where a leaf node in the hash tree is the hash of a chunk. That way, you can validate data at a chunk level, and know that a chunk is invalid after requesting no more than one chunk from a given source.

    See Merkle tree, which also mentions Tiger Tree Hash; TTH is typically used as a key in magnet URIs.
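
    To make the chunk-level check concrete, here is a toy sketch. It assumes SHA-256, a tiny 4-byte chunk size, and a simple binary tree with an odd node promoted unchanged; it is not the Tiger Tree Hash layout or BitTorrent’s format, and all function names are made up.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)] or [b""]


def leaf_hashes(chunks: list[bytes]) -> list[bytes]:
    # Prefix leaves with 0x00 and inner nodes with 0x01 so that a leaf can
    # never be confused with an inner node.
    return [sha256(b"\x00" + c) for c in chunks]


def merkle_root(leaves: list[bytes]) -> bytes:
    level = leaves
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(sha256(b"\x01" + level[i] + level[i + 1]))
            else:
                nxt.append(level[i])  # odd node is promoted unchanged
        level = nxt
    return level[0]


def chunk_is_valid(chunk: bytes, index: int, leaves: list[bytes]) -> bool:
    return sha256(b"\x00" + chunk) == leaves[index]


# Publisher side: one short identifier (the root) for the whole file.
data = b"hello merkle tree!"
chunks = split_chunks(data)
leaves = leaf_hashes(chunks)
root = merkle_root(leaves)

# Downloader side: fetch the leaf list, check it against the trusted root...
assert merkle_root(leaves) == root
# ...then verify each chunk as it arrives, whichever peer sent it.
print(chunk_is_valid(chunks[2], 2, leaves))  # genuine chunk
print(chunk_is_valid(b"evil", 2, leaves))    # tampered chunk from a bad peer
```

    The downloader only ever trusts `root`; a bad chunk is caught the moment it arrives, so you know exactly which peer to drop.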

    EDIT2:

    Can’t think of a way to do it with a DHT

    All of the DHTs that I can think of exist to implement this sort of thing.

    EDIT3: Oh, skimmed over your concern, didn’t notice that you took issue with using a hash tree. I think that one normally does want a hash tree, that it’s a mistake to use a straight hash. I mean, you can generate the hash of a hash tree as easily as the hash of a file, if you have that file, which it sounds like you do. On Linux, rhash(1) can generate hashes of hash trees. So if you already have the file, that’s probably what you want.

    Hypothetically, I guess you could go build some kind of index mapping hashes to hashes of hash trees. Don’t know whether you can pull the hash off BitTorrent or something, but I wouldn’t be surprised if you can. But…you’re probably better off with hash trees, unless you can’t see the file and already are committed to a straight hash of the file.

    EDIT4:

    I mean:

    $ rhash --sha1 --hex pkgs
    7d3a772009aacfe465cb44be414aaa6604ca1ef0  pkgs
    $ rhash -T --hex pkgs
    18cab20ffdc55614ed45c5620d85b0230951432cdae2303a  pkgs

    Either way, straight hash or hash of a hash tree, you’re getting a hex string that identifies your file uniquely. Just that in the hash tree case, you solve some significant problems related to the other thing that you want to do, fetch your file. Might be more compute-intensive to generate a hash of a hash tree, but unless you’re really compute-constrained…shrugs

    • You typically want a slightly-more-elaborate approach than just handing the network a hash and then getting a file.

      […]

      BLAKE3 supports verified streaming, since it is built on a Merkle tree as you described. So is IPFS. As I mentioned, IPFS hashes are those of the tree, not of the file contents themselves, but that doesn’t help when you have a SHA-256 sum of a file and want to download it. Maybe there are networks that map the SHA-256 sum to a BLAKE3 sum, an IPFS CID, or even an HTTP URI, but I don’t know of one, hence the question here.

      BitTorrent and Hyphanet have mechanisms that do this.

      Do you know of a way to exploit that? A library maybe?

      • 1 year

        Nah, I wrote that when I thought that you just wanted content-based addressing, not that you specifically objected to hash trees being used for the addressing.

  • you have to fucking hope no one figures out how to backwards engineer the hashing algorithm you choose

    • I’m not sure what your concern is. I’d basically like to call a function retrieveFile(fileHash) and get bytes back. Or call retrieveFileLocations(fileHash) and get URIs back to where the file can be downloaded. Also, it’ll be open source, so nothing to reverse engineer.

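      • md5, for example, is already broken. People have figured out how to craft two different inputs that share the same md5 hash (a collision attack), so someone who controls the original file could prepare a malicious twin and serve you that instead. Forging a file to match an arbitrary hash they didn’t pick themselves (a preimage attack) isn’t known to be practical, even for md5.

        SHA-256 doesn’t (I think) have either issue, so far hah.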
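
        Whatever the network hands back, the client can re-hash it before trusting it. A rough sketch, assuming you’re willing to refuse md5 and friends outright; the `ALLOWED` set and the `verify` helper are my own invention, not part of any existing library.

```python
import hashlib
import hmac

# Assumption for this sketch: only algorithms without known practical
# collision attacks are accepted.
ALLOWED = {"sha256", "sha512", "blake2b"}


def verify(data: bytes, algorithm: str, expected_hex: str) -> bool:
    """Return True only if `data` hashes to `expected_hex` under an allowed algorithm."""
    if algorithm not in ALLOWED:
        raise ValueError(f"refusing weak or unknown algorithm: {algorithm}")
    actual = hashlib.new(algorithm, data).hexdigest()
    return hmac.compare_digest(actual, expected_hex.lower())


payload = b"example file contents"
good = hashlib.sha256(payload).hexdigest()
print(verify(payload, "sha256", good))            # bytes match the digest
print(verify(b"something else", "sha256", good))  # bytes do not match
```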

    • How do I retrieve a file from BitTorrent with just its hash? Does WebMirror solve that? I’ll have a look at it…

  • IPFS content IDs (CID) are a hash of the tree of chunks. Changes to chunk size can also change the hash!

    I don’t understand why this is a deal-breaker. It seems like you could accomplish what you describe within IPFS simply by committing to a fixed chunk size. That’s valid within IPFS, right?

    Is it important to use any specific hashing algorithm(s)? If not, then isn’t an IPFS CID (with a fixed, predetermined chunk size) a stable hash algorithm in and of itself?
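
    Here’s a toy illustration of what I mean. This is not the real IPFS CID algorithm, just a made-up `tree_id` that hashes the list of chunk hashes, but it shows the same property: the identifier depends on the chunk size, and is perfectly stable once the chunk size is fixed.

```python
import hashlib


def tree_id(data: bytes, chunk_size: int) -> str:
    """Toy tree hash: SHA-256 over the concatenated SHA-256 of each chunk.

    This is NOT the real IPFS CID algorithm, only an illustration of why the
    identifier depends on chunking parameters.
    """
    chunk_hashes = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(chunk_hashes)).hexdigest()


data = bytes(range(256)) * 64  # 16 KiB of sample data

# Same parameters on two machines: same identifier.
print(tree_id(data, 1024) == tree_id(data, 1024))
# Different chunk size: different identifier for the same bytes.
print(tree_id(data, 1024) == tree_id(data, 4096))
```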

    • If you sha256sum $file and send that hash to somebody, they can’t download the file from IPFS (unless it’s <2MB IINM), that’s the problem. And it can be any hashing algorithm: md5, blake, whatever.