
Comment 1 by Wolfgang Richter, Jul 1, 2010
Labels:
Priority:Critical
Priority:Medium
Status: Accepted
Owner: abalacha

Comment 2 by Hrishikesh Amur, Jul 1, 2010
Attaching tarball of job0_partition0.map.
- part0.tar.bz2 - 206 bytes

Comment 3 by Hrishikesh Amur, Jul 1, 2010
The last one was corrupted; this one should be fine. Also attaching the test and workdaemon log files from the next run.
- part0.tar.bz2 - 25.26 kB
- test.log - 6.78 kB
- workdaemon.log - 3.30 kB

Comment 4 by Hrishikesh Amur, Jul 1, 2010
Reposting the local map file after the bound fix.
- job0_partition0.map - 160.78 kB

Comment 6 by Wolfgang Richter, Jul 1, 2010
The master was not sending the message that notifies the workdaemon of nodes with finished map data. Fixed in commit f8f76b9; patch attached. This uncovered a new issue, which could be in lib code or workdaemon code:
"Reducer: Going to do reduce on the file /home/wolf/Dropbox/CMU/Courses/OS_and_Distributed_Systems/wolf_repo/src/worker/gen-cpp/job3.reduce
I am here to do reduce on /home/wolf/Dropbox/CMU/Courses/OS_and_Distributed_Systems/wolf_repo/src/worker/gen-cpp/job3.reduce
Inside and now going to read the file
Reading error
Reducer: Done with reducing"
- missing_master_message.patch - 576 bytes

Comment 7 by Wolfgang Richter, Jul 2, 2010
New hot fix in commit 6ac2399, potentially fixing the previous issue. This one will guarantee that a message about nodes with data arrives before the allmapsfinished message. If Erik's code is correct, it will then have a chance to grab the data before running reducers without a list of nodes to pull intermediate data from. This time the list should *definitely* be populated with the new synchronous call. This patch needs to be verified...
- hotfix1.patch - 8.39 kB
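
For anyone verifying the hot fix, the ordering guarantee described above can be sketched roughly as follows. This is only an illustration under assumptions: `worker_state`, `on_node_has_data`, `on_all_maps_finished`, and `can_start_reduce` are hypothetical names, not the identifiers used in the workdaemon or in commit 6ac2399.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16

/* Hypothetical workdaemon-side state; these names are NOT from the project. */
struct worker_state {
    const char *nodes_with_data[MAX_NODES]; /* nodes to pull intermediate data from */
    size_t node_count;
    bool all_maps_finished;
};

/* Synchronous handler: once it returns true, the node is in the list. */
bool on_node_has_data(struct worker_state *w, const char *node)
{
    if (w->node_count == MAX_NODES)
        return false;
    w->nodes_with_data[w->node_count++] = node;
    return true;
}

/* Handler for the allmapsfinished message. */
void on_all_maps_finished(struct worker_state *w)
{
    w->all_maps_finished = true;
}

/* Reducers should start only when maps are done AND a source node is known. */
bool can_start_reduce(const struct worker_state *w)
{
    return w->all_maps_finished && w->node_count > 0;
}
```

If the master calls the node notification synchronously before sending allmapsfinished, `can_start_reduce` should never see an empty list at the moment the maps are reported as finished.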

Comment 8 by Wolfgang Richter, Jul 2, 2010
The reducer is still hitting an error; partial log attached. It seems to be getting a file name, but for some reason reads on that file fail (or are being reported as failures). After job completion the file exists; it is also attached to this issue.
- reducer_read_error_log.log - 1.23 kB
- job3.reduce - 30 bytes

Comment 9 by Wolfgang Richter, Jul 2, 2010
I think we'll need more debug info for this one. Athula should probably add a bunch of debug statements telling us the length of the file, read bytes, etc. so we can try and narrow down the issue.

Comment 10 by Wolfgang Richter, Jul 2, 2010
Commit 7e05f68 should provide a final fix for this issue. Patch is attached; please verify. The problem was that the return value of fread was being interpreted as a byte count, when it is really the number of complete "elements" read (the size of an element in bytes is whatever the caller passes as the size argument).
- final_fix?.patch - 8.71 kB
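
To illustrate the fread behaviour described above, here is a minimal sketch. It is not the code from commit 7e05f68; `read_bytes` and `read_one_element` are hypothetical helpers written only for this example. With an element size of 1 the return value equals the number of bytes read; with a single large element, a file shorter than that element yields a return value of 0 even though the file is not empty.

```c
#include <stdio.h>

/* Hypothetical helper: read up to cap bytes from path into buf.
   Returns the number of bytes read, or -1 on open/read error. */
long read_bytes(const char *path, char *buf, size_t cap)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    /* size = 1, nmemb = cap: the returned element count is a byte count. */
    size_t n = fread(buf, 1, cap, fp);

    /* A short count is an error only if ferror() is set; otherwise it is EOF. */
    int failed = ferror(fp);
    fclose(fp);
    return failed ? -1 : (long)n;
}

/* Hypothetical helper: read the file as ONE element of cap bytes.
   Returns the number of complete elements read, i.e. 0 or 1. */
size_t read_one_element(const char *path, char *buf, size_t cap)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return 0;

    /* size = cap, nmemb = 1: a file shorter than cap gives 0, not its length. */
    size_t n = fread(buf, cap, 1, fp);
    fclose(fp);
    return n;
}
```

Treating the result of the second form as a byte count would make a short but valid file look like a failed read, so checking `ferror()` and `feof()` is the safer way to tell a real error from end of file.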

Comment 11 by Wolfgang Richter, Jul 2, 2010
Hrishi has verified the fix. It looks like this chain of fixes and discussion is coming to an end, so closing out.
Status: Fixed
Reported by Hrishikesh Amur, Jul 1, 2010