Design of a data structure for disk spillover and external sort (rolled into one!).
Introduction
This document describes an 'ExtendedHashtable' design which is, essentially, a hash table that grows in memory to a certain size and then spills itself onto disk. It uses memory to cache a 'working set' of objects, and disk to enable working on very large datasets.
Whenever it spills, it merges the in-memory data with the data already on disk to maintain sorted key order (which we get implicitly as a side effect of using STL maps). Thus, at all times the data is stored in sorted order and fully merged: no key appears more than once.
Details
These are early design notes, with pseudocode.
Create a new data structure to store PAOs for you. Its interface is an insert method; the structure grows to a maximum size (a count of inserted objects) and then flushes itself to disk. It keeps track of its latest flushed file (for simplicity for now, assume we clean up later) and of a working set of PAOs in memory. Insert basically never fails (until the disk fills and the working set size is reached).
It also has a retrieval method: given a key, it returns the corresponding PAO or some suitable error value.
Also, give it a destructor that performs a final merge/flush of data to disk (that is, when it destructs or goes out of scope, it clears the in-memory HT and flushes to disk---so when map and reduce tasks die, their data transparently gets flushed to local disk properly). This also means that, in case of early job termination, we should be able to get what is in memory out to disk as long as the failure isn't a segfault. Fairly defensive programming.
It will also need a finalize method that takes in a path and merges the local disk file and the in-memory data into that path---straight into the DFS for reducer output. I think this is an OK place to put this function?
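To make the interface concrete, here is a minimal C++ sketch of what such a class might look like. The names (ExtendedHashtable, insert, get, finalize) and the PAO forward declaration are placeholders for illustration, not a settled API:

  #include <cstddef>
  #include <map>
  #include <string>

  class PAO;  // assumed to provide serialize/deserialize and a merge operation

  class ExtendedHashtable {
   public:
    explicit ExtendedHashtable(size_t max_size) : max_size_(max_size) {}

    // Insert a PAO; when the working set hits max_size_, spill/merge to disk.
    // Basically never fails until both the working set and the disk fill up.
    bool insert(const std::string& key, PAO* pao);

    // Return the PAO for 'key' (from memory or disk), or NULL if absent.
    PAO* get(const std::string& key);

    // Merge the on-disk dump file and the in-memory table into 'path'
    // (e.g. straight into the DFS for reducer output).
    void finalize(const std::string& path);

    // Final merge/flush so in-memory data survives normal task exit.
    ~ExtendedHashtable();

   private:
    size_t max_size_;                 // max number of PAOs held in memory
    std::map<std::string, PAO*> ht_;  // in-memory working set, sorted by key
    std::string dump_file_;           // latest flushed/merged file (empty if none)
  };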
So now, instead of directly iterating over/using maps in Mapper.cpp and Reducer.cpp, use this data structure.
That will keep the interface clean, and we can transparently modify this data structure in the future. I think that will be very important as we explore different ways of doing this (log-structured methods on SSD, etc.).
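For example, building on the sketch above, the emit path in a mapper could be as simple as the following (emit is a hypothetical name; the point is that the map/reduce code only calls insert and lets the structure decide when to spill):

  // Hypothetical emit path: Mapper.cpp no longer iterates a std::map directly;
  // it just inserts PAOs and lets ExtendedHashtable decide when to spill/merge.
  void emit(ExtendedHashtable& table, const std::string& key, PAO* pao) {
    table.insert(key, pao);
  }

  // A reducer would call table.finalize(dfs_output_path) when it finishes;
  // a mapper can simply let the destructor do the final flush to local disk.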
when the HT hits #size PAOs --> flush to disk:
  serialize each object, then clear the HT (don't erase as you go)
  if dump_file is not null:
    merge: open a new dump file and merge the HT with the old dump_file into it
    while (bytes left in the old file or elements left in the HT (iter not at end)):
      if (HT key < file key)        // compared with the map comparator, not raw char*
        write the HT PAO
        get the next HT PAO
        continue
      if (HT key > file key)
        write the file PAO
        get the next file PAO
        continue
      write(merge of HT PAO and file PAO)   // equal keys: combine, advance both
    (when one side runs out, just copy the rest of the other side)
    dump_file = the new dump file
    clear HT
  else:
    dump_file = new dump file
    write each PAO from HT to dump_file
    clear HT
  insert the new PAO, continue
At the end of a map or reduce job, merge the in-memory table with the on-disk file, writing to the final output files as you go.
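Here is a rough, self-contained C++ sketch of that merge loop. Everything concrete in it is an assumption made for illustration: PAOs are reduced to a single count, records are stored one per line as "key count", and mergeSpill/writeRecord/readRecord are made-up names standing in for the real serialize/deserialize methods; the comparison reuses the map's own comparator, per the gotchas below.

  #include <iostream>
  #include <map>
  #include <string>

  struct PAO {
    long count;                                             // toy payload
    void merge(const PAO& other) { count += other.count; }  // combine equal keys
  };

  static void writeRecord(std::ostream& out, const std::string& key, const PAO& p) {
    out << key << ' ' << p.count << '\n';
  }

  static bool readRecord(std::istream& in, std::string* key, PAO* p) {
    return static_cast<bool>(in >> *key >> p->count);
  }

  // Merge the in-memory table with the old dump file into 'out', smallest key
  // first, combining PAOs with equal keys, then clear the in-memory table.
  void mergeSpill(std::map<std::string, PAO>& ht, std::istream& old_dump,
                  std::ostream& out) {
    std::map<std::string, PAO>::key_compare less = ht.key_comp();
    std::map<std::string, PAO>::iterator it = ht.begin();
    std::string fkey;
    PAO fpao;
    bool have_file = readRecord(old_dump, &fkey, &fpao);

    while (it != ht.end() && have_file) {
      if (less(it->first, fkey)) {             // HT key is smaller: emit it
        writeRecord(out, it->first, it->second);
        ++it;
      } else if (less(fkey, it->first)) {      // file key is smaller: emit it
        writeRecord(out, fkey, fpao);
        have_file = readRecord(old_dump, &fkey, &fpao);
      } else {                                 // equal keys: merge, emit once
        it->second.merge(fpao);
        writeRecord(out, it->first, it->second);
        ++it;
        have_file = readRecord(old_dump, &fkey, &fpao);
      }
    }
    for (; it != ht.end(); ++it)               // one side is exhausted:
      writeRecord(out, it->first, it->second); //   copy the rest of the HT...
    while (have_file) {                        //   ...or the rest of the file
      writeRecord(out, fkey, fpao);
      have_file = readRecord(old_dump, &fkey, &fpao);
    }
    ht.clear();
  }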
Functions to use/gotchas:
- Compare keys using the map comparator.
- Use the serialize and deserialize methods to read from and write to files.
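One possible (assumed, not decided) on-disk record layout for those serialize/deserialize calls is a simple length-prefixed format; writeString/readString below are illustrative helper names:

  // Assumed record layout: [4-byte key length][key bytes] followed by the
  // PAO's own serialized bytes. Length-prefixing avoids delimiter problems
  // with arbitrary binary keys.
  #include <iostream>
  #include <stdint.h>
  #include <string>

  static void writeString(std::ostream& out, const std::string& s) {
    uint32_t len = s.size();
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(s.data(), len);
  }

  static bool readString(std::istream& in, std::string* s) {
    uint32_t len;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof(len))) return false;
    s->resize(len);
    if (len != 0 && !in.read(&(*s)[0], len)) return false;
    return true;
  }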