Minni: Lightweight MapReduce Library

MeetingNotes

You are looking at an old revision of the page MeetingNotes. This revision was created by Hrishikesh Amur.

Notes from weekly meetings

Jan 25

Generalized libminni infrastructure:
* A new Mapper class and a PAO class need to be defined
** The PAO class has add and merge
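The Mapper/PAO split above could look something like the following sketch. This is an illustration only, not the actual libminni API; all class and method names beyond `add` and `merge` are assumptions.

```python
class PAO:
    """Partial Aggregation Object: an intermediate key plus a partial value.

    Sketch only: a sum-style aggregate is assumed for illustration.
    """
    def __init__(self, key, value=0):
        self.key = key
        self.value = value

    def add(self, value):
        # fold one new input value into the partial aggregate
        self.value += value

    def merge(self, other):
        # combine two partial aggregates for the same key
        assert self.key == other.key
        self.value += other.value


class Mapper:
    """Emits PAOs; PAOs with equal keys are merged in place before any spill."""
    def __init__(self):
        self.paos = {}

    def emit(self, key, value):
        if key not in self.paos:
            self.paos[key] = PAO(key)
        self.paos[key].add(value)
```

Under this sketch, repeated emits for one key accumulate into a single PAO, and `merge` combines PAOs produced in different passes or buckets.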

By default, how many passes do we want on the map-side?
* Google-sponsored Google code projects

  • vary the size of the 25G dataset and see what happens to the difference
  • bin the PAOs and do a pass at the end;
    ** parametrize the number of buckets (how many partitions on flash vs. disk)
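Binning the PAOs with a parameterized bucket count could be sketched as below. The function name, the `(key, value)` pair representation, and the use of a plain hash-mod bucket assignment are all assumptions for illustration.

```python
def bin_paos(paos, num_buckets):
    """Assign each PAO to one of num_buckets bins by hashing its key.

    paos: iterable of (key, value) pairs. A final pass can then sort or
    merge each bucket independently, so num_buckets directly controls
    how many partitions live on flash vs. disk.
    """
    buckets = [[] for _ in range(num_buckets)]
    for key, value in paos:
        buckets[hash(key) % num_buckets].append((key, value))
    return buckets
```

Because the bucket index depends only on the key, all PAOs for a given key land in the same bucket, which is what makes the end-of-run merge pass per bucket sufficient.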

Feb 1

Look at: Piccolo

Architecture: log store per bucket; sort buckets.

  • Is there a fundamental relationship among the number of output files, the number of buckets, and the number of reducers that we can come up with?
  • Bucket size: smaller is better, since we have to sort each bucket
  • Number of buckets: too many buckets adds overhead, and the SSD incurs extra overhead if we are appending to too many files
  • make a list of designs explored (Hrishi)

Erik's note on partition functions: So suppose that there are R reducers, and we want there to be B bins per mapper. Then we need a universal partition function that specifies P = lcm(R,B) different bins if it is to be perfectly useful for both (note that it is not the gcd like I claimed in the meeting). Then you need to group P / R partitions together to get a partition function for reducers, and group P/B partitions together to get a partition function for bins.
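Erik's construction can be written down directly: compute P = lcm(R, B), give every key a fine-grained partition in [0, P), then group P/R consecutive cells per reducer and P/B consecutive cells per bin. The sketch below assumes a hash-based fine partition; a range/quantile partition would slot in the same way. All names are illustrative.

```python
from math import gcd

def fine_partition(key, P):
    # universal partition into P = lcm(R, B) cells
    return hash(key) % P

def make_partitioners(R, B):
    """Derive consistent reducer and bin partitioners from one fine partition.

    Grouping P/R consecutive cells yields the reducer partition function;
    grouping P/B consecutive cells yields the bin partition function.
    """
    P = R * B // gcd(R, B)  # lcm(R, B), as in Erik's note
    to_reducer = lambda key: fine_partition(key, P) // (P // R)
    to_bin = lambda key: fine_partition(key, P) // (P // B)
    return P, to_reducer, to_bin
```

For example, R = 6 reducers and B = 4 bins give P = 12 fine cells, grouped in twos for reducers and in threes for bins, so one fine partition serves both uses.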

The perfect partition function would be the P-quantiles of the intermediate key-space (for some ordering on the keys, e.g. lexicographic). Since we don't know these, we would have to determine them by either:
1. Domain knowledge
2. Experimentation
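The experimentation route could be as simple as the sketch below: estimate the P-quantiles from a sample of intermediate keys, then range-partition on the resulting splitters. Function names and the uniform-sample assumption are mine, not the project's.

```python
import bisect

def quantile_splitters(sample, P):
    """Estimate P-quantile splitters from a sample of intermediate keys.

    Returns P-1 cut points that divide the sorted sample into P
    roughly equal parts.
    """
    s = sorted(sample)
    return [s[len(s) * i // P] for i in range(1, P)]

def range_partition(key, splitters):
    # binary search over the splitters gives the partition in O(log P)
    return bisect.bisect_right(splitters, key)
```

If the sample is representative of the full key distribution, the resulting partitions are close to equal-sized, which is exactly the property the bucket-sizing discussion asks for.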

Feb 8

  • for the smallest dataset, is nsort doing the sort in memory?
  • nsort parameters

  • need to talk about the following cases:

  1. dataset fits in memory
  2. dataset larger than memory, but the SSD supports writing N buckets with max perf. such that each bucket fits in memory
  3. dataset larger than that
  • why is hash not doing much better than sort in the first section?
    ** mod function on Atoms
    ** replace with Hsieh and check
    ** other hash functions?

  • Look at partition functions such that we can keep all the buckets at around the same size.
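One way to compare candidate hash functions (e.g. a plain mod against Hsieh's function) is to measure how evenly each one fills the buckets. The sketch below uses FNV-1a as a stand-in for a well-mixed hash, since its constants are simple enough to write inline; the skew metric and names are assumptions.

```python
def fnv1a(data: bytes) -> int:
    # FNV-1a: a simple, well-mixed 32-bit hash, used here only as a
    # stand-in for stronger candidates such as Hsieh's SuperFastHash
    h = 0x811C9DC5
    for b in data:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

def bucket_skew(keys, num_buckets, hash_fn):
    """Return max bucket size divided by mean bucket size.

    1.0 means perfectly even buckets; larger values mean more skew,
    i.e. some buckets will be disproportionately expensive to sort.
    """
    counts = [0] * num_buckets
    for k in keys:
        counts[hash_fn(k.encode()) % num_buckets] += 1
    return max(counts) / (len(keys) / num_buckets)
```

A degenerate hash (say, taking only the last byte of the key) shows visibly higher skew than FNV-1a on the same keys, which is one concrete way to check whether the mod-on-Atoms function is to blame.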

Created: 14 years 5 months ago
by Hrishikesh Amur
