Minni: Lightweight MapReduce Library

MeetingNotes

Notes from weekly meetings

Jan 25

Generalized libminni infrastructure:
* New Mapper class and PAO class need to be defined
** The PAO class has add and merge operations
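
A rough sketch of what a PAO with add and merge could look like. Assumptions: "PAO" is read here as a partial aggregation object (the notes do not expand the acronym), the names `PAO`, `CountPAO`, and `map_words` are hypothetical, and Python is used only for brevity. This is not the libminni API.

```python
from abc import ABC, abstractmethod


class PAO(ABC):
    """Hypothetical base class: one partial aggregate for a single key."""

    def __init__(self, key):
        self.key = key

    @abstractmethod
    def add(self, value):
        """Fold one raw input value into this partial aggregate."""

    @abstractmethod
    def merge(self, other):
        """Fold another PAO for the same key into this one."""


class CountPAO(PAO):
    """Example PAO that sums counts for a key (word count)."""

    def __init__(self, key):
        super().__init__(key)
        self.count = 0

    def add(self, value):
        self.count += value

    def merge(self, other):
        if other.key != self.key:
            raise ValueError("cannot merge PAOs with different keys")
        self.count += other.count


def map_words(words):
    """Hypothetical mapper: build one CountPAO per distinct word."""
    paos = {}
    for word in words:
        pao = paos.get(word)
        if pao is None:
            pao = paos[word] = CountPAO(word)
        pao.add(1)
    return paos
```

In this sketch, add is used on the map side for each input value, and merge combines PAOs for the same key that were produced separately (for example by two mappers or two passes).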

By default, how many passes do we want on the map-side?
* Google-sponsored Google code projects

  • vary the size of the 25G dataset and see what happens to the difference
  • bin the PAOs and do a pass at the end
    ** parametrize the number of buckets (how many partitions on flash vs. disk)
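
A minimal sketch of "bin the PAOs and do a pass at the end", with the number of buckets as a parameter. Assumptions: partial results are modeled as plain (key, count) pairs, buckets are in-memory lists rather than files on flash or disk, and CRC32 is an arbitrary stand-in for the bucket function. All function names are hypothetical.

```python
import zlib
from collections import defaultdict


def bucket_of(key, num_buckets):
    """Stable bucket index for a string key (CRC32 is an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bin_partials(partials, num_buckets):
    """Append (key, partial_count) pairs to buckets without aggregating."""
    buckets = defaultdict(list)
    for key, count in partials:
        buckets[bucket_of(key, num_buckets)].append((key, count))
    return buckets


def final_pass(buckets):
    """Aggregate each bucket independently; a key lives in exactly one bucket."""
    totals = {}
    for bucket in buckets.values():
        merged = defaultdict(int)
        for key, count in bucket:
            merged[key] += count
        totals.update(merged)
    return totals
```

The result does not depend on `num_buckets`; only the size of each bucket (and so the memory needed for the final pass) changes, which is the knob the experiment above would vary.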

Feb 1

Look at: Piccolo

Architecture: log store per bucket; sort buckets.

  • Is there a fundamental relationship among the number of output files, the number of buckets, and the number of reducers that we can come up with?
  • Size of each bucket: smaller is better, since we have to sort these
  • Number of buckets: too many buckets means more overhead; the SSD adds overhead if we are appending to too many files
  • make a list of designs explored (Hrishi)

Erik's note on partition functions: Suppose that there are R reducers, and we want there to be B bins per mapper. Then we need a universal partition function that specifies P = lcm(R, B) different partitions if it is to be perfectly useful for both (note that it is the lcm, not the gcd as I claimed in the meeting). Then you need to group P/R partitions together to get a partition function for reducers, and group P/B partitions together to get a partition function for bins.
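
A small worked sketch of the grouping described in the note. Assumptions: partitions are numbered 0..P-1 and consecutive partitions are grouped together; the function names are hypothetical.

```python
import math


def universal_partition_count(num_reducers, num_bins):
    """P = lcm(R, B): the smallest count divisible by both R and B."""
    return num_reducers * num_bins // math.gcd(num_reducers, num_bins)


def reducer_of(partition, num_reducers, num_partitions):
    """Group P / R consecutive partitions into one reducer."""
    return partition // (num_partitions // num_reducers)


def bin_of(partition, num_bins, num_partitions):
    """Group P / B consecutive partitions into one bin."""
    return partition // (num_partitions // num_bins)
```

For example, with R = 4 and B = 6 we get P = 12: each reducer covers 3 consecutive partitions and each bin covers 2. With the gcd (2) instead, there would be too few partitions to split among either 4 reducers or 6 bins.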

The perfect partition function would be the P-quantiles of the intermediate key space (for some ordering on the keys, e.g. lexicographic). Since we don't know these, we would have to determine them by either:
1. Domain knowledge
2. Experimentation
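
One way the "experimentation" option could look: estimate the P-quantiles from a sample of intermediate keys and use them as split points. This is only a sketch under the assumption that a representative sample is available; the function names are hypothetical.

```python
import bisect


def quantile_boundaries(sample_keys, num_partitions):
    """Pick P - 1 split keys that cut the sorted sample into P equal parts."""
    ordered = sorted(sample_keys)
    n = len(ordered)
    return [ordered[(i * n) // num_partitions] for i in range(1, num_partitions)]


def partition_of(key, boundaries):
    """Index of the partition whose key range contains `key`."""
    return bisect.bisect_right(boundaries, key)
```

Keys are compared with Python's default ordering, which is lexicographic for strings. How well the partitions balance on the real data depends on how closely the sample matches the true key distribution.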

Feb 8

  • for the smallest, is nsort doing it in memory?
  • nsort parameters

  • need to talk about the following cases:

  1. dataset fits in memory
  2. dataset larger than memory, but the SSD supports writing N buckets at maximum performance such that each bucket fits in memory
  3. dataset larger than that
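
The three cases above can be told apart with a little arithmetic. A sketch, assuming the partition function splits the data evenly across buckets and that N (here `max_fast_buckets`) is known from measurement; the function and parameter names are hypothetical.

```python
def classify_dataset(dataset_bytes, memory_bytes, max_fast_buckets):
    """Return 1, 2, or 3 for the three cases listed above.

    max_fast_buckets is the (assumed known) number of buckets N that the
    SSD can write at maximum performance.
    """
    if dataset_bytes <= memory_bytes:
        return 1  # whole dataset fits in memory
    # Smallest bucket count for which every bucket fits in memory,
    # assuming the data is split evenly across buckets.
    buckets_needed = -(-dataset_bytes // memory_bytes)  # ceiling division
    if buckets_needed <= max_fast_buckets:
        return 2  # N buckets are enough for each bucket to fit in memory
    return 3  # would need more buckets than the SSD handles at max perf.
```

With skewed bucket sizes the boundary between cases 2 and 3 moves, so this is only a first-order estimate.
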
  • why is hash not doing much better than sort in the first section?
    ** mod function on Atoms
    ** replace with Hsieh's hash function (SuperFastHash) and check
    ** other hash functions?
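
A sketch of how bucket balance under plain modulo could be compared against a mixing hash. Assumptions: keys are modeled as non-negative integers (the notes do not say how Atoms are represented), and 32-bit FNV-1a is used as a stand-in mixing hash; it is not Hsieh's function. All function names are hypothetical.

```python
def mod_bucket(key, num_buckets):
    """Baseline: bucket an integer key by plain modulo."""
    return key % num_buckets


def fnv1a_32(data):
    """32-bit FNV-1a hash of a bytes object."""
    h = 2166136261  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h


def fnv_bucket(key, num_buckets):
    """Bucket an integer key by hashing its 8-byte little-endian encoding."""
    return fnv1a_32(key.to_bytes(8, "little")) % num_buckets


def imbalance(keys, bucket_fn, num_buckets):
    """Largest bucket size divided by the ideal (even) bucket size."""
    sizes = [0] * num_buckets
    for key in keys:
        sizes[bucket_fn(key, num_buckets)] += 1
    return max(sizes) * num_buckets / len(keys)
```

An imbalance of 1.0 means perfectly even buckets. Keys that are all multiples of the bucket count land in a single bucket under plain modulo, which is the kind of pattern worth checking for in the real key set.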

  • Look at partition functions such that we can keep all the buckets at around the same size.

  • List of contributions for the paper

Created: 14 years 5 months ago
by Hrishikesh Amur

Updated: 14 years 5 months ago
by Hrishikesh Amur
