
(Edit: FYI, my knowledge is 5 years old now. I know they've done some things to keep the index more current than they did back then.)

Archived data was in ARC file format (predecessor to WARC), which is essentially a concatenation of separately gzipped records. That is, you can seek to a particular offset and start decompressing a record.
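Because each record is its own gzip member, a reader only needs the byte offset of a record to pull it out of the file. Here's a minimal sketch of that in Python; the read_record helper and its chunk size are my own illustration, not IA code:

    import zlib

    def read_record(path, offset, chunk_size=64 * 1024):
        """Decompress the single gzip member starting at byte `offset`."""
        decomp = zlib.decompressobj(wbits=31)   # 31 => expect a gzip header/trailer
        parts = []
        with open(path, "rb") as f:
            f.seek(offset)
            while not decomp.eof:                # .eof flips once this member's trailer is read
                chunk = f.read(chunk_size)
                if not chunk:
                    break                        # hit end of file before the member ended
                parts.append(decomp.decompress(chunk))
        return b"".join(parts)

Using a raw zlib decompressor (rather than gzip.open) matters here: it stops at the end of the one member instead of reading on into the next concatenated record.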


Playback is accomplished by binary searching a 2-level index of pointers into the WARC data. The second level of this index is a 20TB compressed sorted list of (url, date, pointer) tuples called CDX records.

It was implemented by building a sorted text file (first sorted on the url, second on the time) and sharding it across many machines by simply splitting it into N roughly equal pieces. Binary search across a sorted text file is surprisingly fast - in part because the first few points you look at in the file remain cached in RAM, since you hit them frequently.
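For illustration, the seek-based binary search over a sorted, newline-delimited file looks roughly like the sketch below. This is a hypothetical example against an uncompressed plain-text index (the real CDX shards are compressed, so the actual mechanics differ), and cdx_lookup is my own name for it:

    import os

    def cdx_lookup(path, key):
        """Return the first line >= key from a sorted, newline-delimited file."""
        want = key.encode("utf-8")
        with open(path, "rb") as f:
            lo, hi = 0, os.fstat(f.fileno()).st_size
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid)
                f.readline()                 # discard the partial line we landed in
                line = f.readline()
                if not line or line.rstrip(b"\n") >= want:
                    hi = mid                 # first matching line starts at or before mid
                else:
                    lo = mid + 1             # first matching line starts after mid
            # lo is now within a line or two of the answer; finish with a short scan
            f.seek(lo)
            if lo:
                f.readline()
            while True:
                line = f.readline()
                if not line:
                    return None
                if line.rstrip(b"\n") >= want:
                    return line.decode("utf-8").rstrip("\n")

The caching effect mentioned above falls out naturally: every lookup touches the same midpoint offsets near the top of the search tree, so those pages stay hot in the OS page cache.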

(Here's where I'm a little rusty.) The web frontend would get a request and query the appropriate index machine. Then it would use a little mechanism (network broadcast, maybe?) to find out which server that (unique) filename was on, and then it would request the particular record from that server.
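Tying the pieces together, the pointer the index hands back has to name a file and an offset the frontend can act on. The field layout below is purely hypothetical (the real CDX format differs), but it shows the shape of the flow: look up the line, parse out (filename, offset), ask whichever storage node holds that file for the record, and decompress it as in read_record above.

    from typing import NamedTuple

    class CdxPointer(NamedTuple):
        url: str
        timestamp: str
        filename: str   # which archive file holds the record
        offset: int     # byte offset of the gzipped record within that file

    def parse_cdx_line(line):
        """Parse a hypothetical space-delimited CDX-style line into a pointer."""
        url, timestamp, filename, offset = line.split()
        return CdxPointer(url, timestamp, filename, int(offset))

    # e.g. the line returned by cdx_lookup() might be parsed like this:
    ptr = parse_cdx_line("example.com/ 20010401123456 IA-001234.arc.gz 48211")
    print(ptr.filename, ptr.offset)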
