Notes from weekly meetings
Jan 25
Generalized libminni infrastructure:
* New Mapper class and PAO class need to be defined
** PAO class has add and merge (rough interface sketch below)
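A rough sketch of how the PAO and Mapper interfaces above could look. Only add and merge come from the notes; the other names, signatures, and the use of std::string are assumptions, not the actual libminni API.
{{{
// Hypothetical sketch of the PAO and Mapper interfaces discussed above.
// Only add() and merge() come from the notes; everything else is assumed.
#include <string>
#include <vector>

class PAO {
public:
    virtual ~PAO() {}
    // Fold one new value into this partial aggregate.
    virtual void add(const std::string& value) = 0;
    // Combine another PAO for the same key into this one.
    virtual void merge(const PAO* other) = 0;
    virtual const std::string& key() const = 0;
};

class Mapper {
public:
    virtual ~Mapper() {}
    // Emit zero or more PAOs for one input record.
    virtual void map(const std::string& record, std::vector<PAO*>& out) = 0;
};
}}}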
By default, how many passes do we want on the map-side?
* Google-sponsored Google Code projects
- vary the size of the 25G dataset and see what happens to the difference
- bin the PAOs and do a pass at the end;
** parametrize the number of buckets (how many partitions on flash vs. disk)
Feb 1
Look at: Piccolo
Architecture: log store per bucket; sort buckets (see the sketch after this list).
- Is there a fundamental relationship between the number of output files, the number of buckets, and the number of reducers that we can come up with?
- Size of the bucket: smaller is better since we have to sort them
- Number of buckets: if it is too large there is more overhead, and the SSD adds overhead if we are appending to too many files
- make a list of designs explored (Hrishi)
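A rough sketch of the log-store-per-bucket design: appends go to a per-bucket log during the map pass, and each bucket is sorted independently at the end. The Record type and the in-memory vectors standing in for per-bucket files are assumptions, not the actual implementation.
{{{
// Hypothetical sketch of the "log store per bucket; sort buckets" design.
// Record layout and in-memory logs are stand-ins for the real on-flash files.
#include <algorithm>
#include <string>
#include <vector>

struct Record { std::string key, value; };

class BucketStore {
public:
    explicit BucketStore(size_t num_buckets) : logs_(num_buckets) {}

    // Map pass: append each record to the log of its bucket (cheap, sequential).
    void append(const Record& r, size_t bucket) { logs_[bucket].push_back(r); }

    // End of pass: sort each bucket independently; smaller buckets sort cheaply
    // and equal keys can then be aggregated (PAO::merge) before writing out.
    void finalize() {
        for (auto& log : logs_)
            std::sort(log.begin(), log.end(),
                      [](const Record& a, const Record& b) { return a.key < b.key; });
    }

private:
    std::vector<std::vector<Record>> logs_;  // in-memory stand-in for per-bucket log files
};
}}}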
Erik's note on partition functions: Suppose that there are R reducers, and we want there to be B bins per mapper. Then we need a universal partition function that specifies P = lcm(R, B) different partitions if it is to be perfectly useful for both (note that it is not the gcd, as I claimed in the meeting). You then group P/R partitions together to get a partition function for reducers, and P/B partitions together to get a partition function for bins (sketch below).
The perfect partition function would be the P-quantiles of the intermediate
key space (for some ordering on the keys, e.g. lexicographic). Since we don't
know these, we would have to determine them either by:
1. Domain knowledge
2. Experimentation
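A small sketch of the lcm-based scheme in Erik's note; the hash used in fine_partition() is only a stand-in for the unknown P-quantile function, and all names here are hypothetical.
{{{
// Hypothetical sketch of the lcm-based partition scheme from Erik's note.
#include <cstddef>
#include <functional>
#include <numeric>   // std::lcm (C++17)
#include <string>

struct PartitionScheme {
    size_t R;  // number of reducers
    size_t B;  // number of bins per mapper
    size_t P;  // lcm(R, B): finest partitioning useful for both

    PartitionScheme(size_t reducers, size_t bins)
        : R(reducers), B(bins), P(std::lcm(reducers, bins)) {}

    // Fine-grained partition in [0, P); ideally the P-quantiles of the key space,
    // here approximated by a hash for lack of that knowledge.
    size_t fine_partition(const std::string& key) const {
        return std::hash<std::string>{}(key) % P;
    }

    // Group P/R consecutive fine partitions per reducer, and P/B per bin.
    size_t reducer(const std::string& key) const { return fine_partition(key) / (P / R); }
    size_t bin(const std::string& key) const     { return fine_partition(key) / (P / B); }
};
}}}
For example, with R = 6 reducers and B = 4 bins, P = lcm(6, 4) = 12, so each reducer covers P/R = 2 fine partitions and each bin covers P/B = 3.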
Feb 8
- for the smallest dataset, is nsort doing the sort in memory?
nsort parameters
need to talk about the following cases (see the sketch after this list):
- dataset fits in memory
- dataset larger than memory, but the SSD supports writing N buckets at maximum performance such that each bucket fits in memory
- dataset larger than that
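A toy sketch of the three regimes above, assuming roughly balanced buckets; memory_bytes and the bucket count N are hypothetical parameters, not anything we have measured.
{{{
// Toy classification of the three dataset-size regimes listed above.
// Assumes buckets are roughly balanced; all parameters are hypothetical.
#include <cstdint>
#include <string>

std::string regime(uint64_t dataset_bytes, uint64_t memory_bytes,
                   uint64_t max_buckets_at_full_speed /* N */) {
    if (dataset_bytes <= memory_bytes)
        return "fits in memory: sort/aggregate entirely in RAM";
    if (dataset_bytes <= memory_bytes * max_buckets_at_full_speed)
        return "spill to N buckets on SSD: each bucket still fits in memory";
    return "larger than memory * N: buckets themselves need external sorting";
}
}}}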
why is hash not doing much better than sort in the first section?
** mod function on Atoms
** replace with Hsieh's hash and check
** other hash functions? Look at partition functions such that we can keep all the buckets at around the same size (sketch below).
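A small sketch contrasting the plain mod on a numeric key with hashing the key bytes first. FNV-1a is shown purely as an illustrative stand-in (the note above suggests trying Hsieh's hash), and the key types are assumptions.
{{{
// Illustrative comparison: naive mod on a numeric key vs. hashing the key bytes
// first (FNV-1a shown as a stand-in; the notes suggest trying Hsieh's hash).
#include <cstdint>
#include <string>

// FNV-1a over the bytes of the key; spreads clustered keys across buckets.
uint64_t fnv1a(const std::string& key) {
    uint64_t h = 14695981039346656037ULL;   // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;              // FNV prime
    }
    return h;
}

size_t bucket_mod(uint64_t key, size_t num_buckets) { return key % num_buckets; }
size_t bucket_hash(const std::string& key, size_t num_buckets) { return fnv1a(key) % num_buckets; }
}}}
The thing to check is whether the resulting bucket sizes actually come out close to equal for our key distribution.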