
(Edit: FYI, my knowledge is 5 years old now. I know they've done some things to keep the index more current than they did back then.)

Archived data was in ARC file format (predecessor to WARC), which is essentially a concatenation of separately gzipped records. That is, you can seek to a particular offset and start decompressing a record.
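Because each record is its own gzip member, a reader only needs the byte offset of a record to pull it out of the file. Here's a minimal sketch of that in Python; the read_record helper and its chunk size are my own illustration, not IA code:

    import zlib

    def read_record(path, offset, chunk_size=64 * 1024):
        """Decompress the single gzip member starting at byte `offset`."""
        decomp = zlib.decompressobj(wbits=31)   # 31 => expect a gzip header/trailer
        parts = []
        with open(path, "rb") as f:
            f.seek(offset)
            while not decomp.eof:                # .eof flips once this member's trailer is read
                chunk = f.read(chunk_size)
                if not chunk:
                    break                        # hit end of file before the member ended
                parts.append(decomp.decompress(chunk))
        return b"".join(parts)

Using a raw zlib decompressor (rather than gzip.open) matters here: it stops at the end of the one member instead of reading on into the next concatenated record.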


Playback is accomplished by binary searching a 2-level index of pointers into the WARC data. The second level of this index is a 20TB compressed sorted list of (url, date, pointer) tuples called CDX records.

It was implemented by building a sorted text file (first sorted on the url, second on the time) and sharding it across many machines by simply splitting it into N roughly equal pieces. Binary search across a sorted text file is surprisingly fast - in part because the first few points you look at in the file remain cached in RAM, since you hit them frequently.
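For illustration, the seek-based binary search over a sorted, newline-delimited file looks roughly like the sketch below. This is a hypothetical example against an uncompressed plain-text index (the real CDX shards are compressed, so the actual mechanics differ), and cdx_lookup is my own name for it:

    import os

    def cdx_lookup(path, key):
        """Return the first line >= key from a sorted, newline-delimited file."""
        want = key.encode("utf-8")
        with open(path, "rb") as f:
            lo, hi = 0, os.fstat(f.fileno()).st_size
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid)
                f.readline()                 # discard the partial line we landed in
                line = f.readline()
                if not line or line.rstrip(b"\n") >= want:
                    hi = mid                 # first matching line starts at or before mid
                else:
                    lo = mid + 1             # first matching line starts after mid
            # lo is now within a line or two of the answer; finish with a short scan
            f.seek(lo)
            if lo:
                f.readline()
            while True:
                line = f.readline()
                if not line:
                    return None
                if line.rstrip(b"\n") >= want:
                    return line.decode("utf-8").rstrip("\n")

The caching effect mentioned above falls out naturally: every lookup touches the same midpoint offsets near the top of the search tree, so those pages stay hot in the OS page cache.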

(Here's where I'm a little rusty.) The web frontend would get a request and query the appropriate index machine. Then it would use a little mechanism (network broadcast, maybe?) to find out which server that (unique) filename was on, and then it would request the particular record from that server.
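Tying the pieces together, the pointer the index hands back has to name a file and an offset the frontend can act on. The field layout below is purely hypothetical (the real CDX format differs), but it shows the shape of the flow: look up the line, parse out (filename, offset), ask whichever storage node holds that file for the record, and decompress it as in read_record above.

    from typing import NamedTuple

    class CdxPointer(NamedTuple):
        url: str
        timestamp: str
        filename: str   # which archive file holds the record
        offset: int     # byte offset of the gzipped record within that file

    def parse_cdx_line(line):
        """Parse a hypothetical space-delimited CDX-style line into a pointer."""
        url, timestamp, filename, offset = line.split()
        return CdxPointer(url, timestamp, filename, int(offset))

    # e.g. the line returned by cdx_lookup() might be parsed like this:
    ptr = parse_cdx_line("example.com/ 20010401123456 IA-001234.arc.gz 48211")
    print(ptr.filename, ptr.offset)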
