YouTube would be a prime example: I’m guessing the storage required for the metadata of all videos is too large to fit on a single server, so how do they achieve millisecond-level performance on searches and handle millions of queries routinely?

What kind of infrastructure and technology is required for this?

Do you have any resources I could use to learn more on this subject?

  • HelloRoot@lemy.lol · 2 months ago

    At massive scale, indexing is done by distributing the data rather than relying on a single machine. The index is split into shards, each holding a subset of the data, commonly partitioned by hashing IDs or dividing term ranges. Every shard is replicated to multiple machines so reads can be load-balanced and failures do not take the system down.
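    A minimal sketch of the partitioning idea described above, hashing document IDs onto shards and mapping each shard to several replicas (the shard count, replica count, and hostnames are made-up illustrations, not anyone's real topology):

```python
# Toy hash-based sharding sketch -- illustrative only, not YouTube's actual scheme.
import hashlib

NUM_SHARDS = 8
REPLICAS_PER_SHARD = 3

def shard_for(doc_id: str) -> int:
    """Map a document ID to a shard by hashing it (deterministic, evenly spread)."""
    digest = hashlib.sha1(doc_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_for(shard: int) -> list:
    """Hypothetical hostnames holding copies of this shard, for load-balanced reads."""
    return ["shard%d-replica%d" % (shard, r) for r in range(REPLICAS_PER_SHARD)]

print(shard_for("video-12345"), replicas_for(0))
```

    Because the hash is deterministic, any node can compute where a document lives without a lookup; real systems usually add consistent hashing or a placement service so shards can move without rehashing everything.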

    Search queries are handled by a coordinator that sends the query to the relevant shards in parallel, collects their partial results, merges and ranks them, and returns the final result. Because all shards work at the same time, query latency depends on the slowest shard, not on total index size.
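    The scatter-gather step can be sketched like this, with canned per-shard results standing in for the RPCs a real coordinator would make (the shard data and function names are invented for illustration):

```python
# Scatter-gather coordinator sketch: query shards in parallel, merge top-k results.
from concurrent.futures import ThreadPoolExecutor
import heapq

# Fake per-shard hit lists as (score, doc_id) pairs; higher score = better match.
SHARDS = [
    [(0.9, "a1"), (0.4, "a2")],
    [(0.8, "b1"), (0.7, "b2")],
    [(0.95, "c1")],
]

def search_shard(shard_hits, query, k):
    # In a real system this is a network call to a shard server.
    return shard_hits[:k]

def coordinator(query, k=3):
    # Scatter: send the query to every shard concurrently.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), SHARDS))
    # Gather: merge the partial lists and keep the global top-k by score.
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc for _, doc in merged]

print(coordinator("cats"))  # ['c1', 'a1', 'b1']
```

    Note that each shard only returns its local top-k, so the coordinator merges small lists, not whole indexes; this is why latency tracks the slowest shard rather than total data size.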

    This setup is built on search engines based on inverted indexes, usually derived from Lucene, either via systems like Elasticsearch or via custom implementations. Metadata and related data are stored in distributed databases or key-value stores, while index updates are streamed asynchronously so writes do not block reads. Caching at multiple layers keeps frequently accessed data in memory, and the whole system runs on large clusters that automatically handle placement, scaling, and failures.
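    The inverted index at the core of all of this is simple in miniature: a map from each term to the set of documents containing it. A toy version (real engines like Lucene store compressed postings lists with positions and scoring data, not plain sets):

```python
# Minimal inverted index sketch: term -> set of doc IDs containing that term.
from collections import defaultdict

docs = {
    "v1": "funny cat video",
    "v2": "cat compilation",
    "v3": "cooking tutorial",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """AND-query: return docs containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index[terms[0]])
    for term in terms[1:]:
        result &= index[term]
    return result

print(sorted(search("cat")))        # ['v1', 'v2']
print(sorted(search("cat video")))  # ['v1']
```

    Lookup cost depends on the length of the postings lists for the query terms, not on the total number of documents, which is what makes the sharded version of this fast at scale.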

    idk where you are, but where I live anybody can attend university lectures for free as long as they’re not full, or go to the library and browse the relevant section. Personally I learned everything IT-related from uni courses and from searching for my topics of interest in the uni library. So that’s my shitty recommendation; I’m sure there are online resources and courses on it too, though.