Tuesday, February 21, 2017

Speed up FileOutputCommitter


If a job generates many files, the commitJob call at the end of the job can take minutes or even hours, depending on the number of mappers and reducers. This is a performance regression from MR1: in MR1, tasks committed directly to the final output directory as they completed, so commitJob had very little to do. In YARN (Hadoop 2.x), the commit is single-threaded and waits until all tasks have completed before it starts, which can delay job completion significantly even though every map and reduce task has already finished.

The JIRA below describes the problem in detail; to address it, the community introduced a new commit algorithm.

https://issues.apache.org/jira/browse/MAPREDUCE-4815

In algorithm version 1: 
1. commitTask renames the directory 

$joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/_temporary/$appAttemptID/$taskID/ 

2. recoverTask also does a rename 

$joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/_temporary/($appAttemptID + 1)/$taskID/ 

3. commitJob merges every task's output, renaming 

$joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/, then deleting $joboutput/_temporary/ and writing $joboutput/_SUCCESS 

This stage does nothing but rearrange and copy output into the job output directory, yet the job is not successful until it completes.
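The v1 flow above can be sketched as a local-filesystem simulation. This is plain Python, not Hadoop code; the path layout mirrors the $joboutput/_temporary structure, and the task/attempt names are hypothetical examples.

```python
# Simulation of FileOutputCommitter algorithm v1 using local directories.
import os, shutil, tempfile

joboutput = tempfile.mkdtemp()
app_attempt = "appattempt_0"  # hypothetical $appAttemptID

def write_task_output(task_attempt, filename):
    # Tasks write under $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/
    d = os.path.join(joboutput, "_temporary", app_attempt, "_temporary", task_attempt)
    os.makedirs(d, exist_ok=True)
    open(os.path.join(d, filename), "w").close()

def commit_task_v1(task_attempt, task_id):
    # Step 1: rename the task-attempt directory to a committed per-task directory.
    src = os.path.join(joboutput, "_temporary", app_attempt, "_temporary", task_attempt)
    dst = os.path.join(joboutput, "_temporary", app_attempt, task_id)
    os.rename(src, dst)

def commit_job_v1():
    # Step 3: single-threaded merge of every committed task directory into
    # $joboutput/ -- the pass that dominates commit time for large jobs.
    attempt_dir = os.path.join(joboutput, "_temporary", app_attempt)
    for task_id in os.listdir(attempt_dir):
        if task_id == "_temporary":
            continue
        for f in os.listdir(os.path.join(attempt_dir, task_id)):
            shutil.move(os.path.join(attempt_dir, task_id, f), joboutput)
    shutil.rmtree(os.path.join(joboutput, "_temporary"))
    open(os.path.join(joboutput, "_SUCCESS"), "w").close()

write_task_output("attempt_0_m_000000_0", "part-m-00000")
commit_task_v1("attempt_0_m_000000_0", "task_0_m_000000")
commit_job_v1()
print(sorted(os.listdir(joboutput)))  # ['_SUCCESS', 'part-m-00000']
```

With thousands of tasks, the commit_job_v1 loop runs once per task directory in a single thread, which is exactly where the hours-long delay comes from.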

So if a job generates many files, the commitJob call at the end of the job can take hours. The commit is single-threaded and has to wait until all tasks have completed before it starts. Algorithm version 2 changes this behavior to make the commit more efficient. 

In algorithm version 2: 

1. commitTask renames all files in $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/ 

2. recoverTask does not need to do anything

3. commitJob simply deletes $joboutput/_temporary and writes $joboutput/_SUCCESS 

This algorithm reduces the output commit time for large jobs by having tasks commit directly to the final output directory as they complete, leaving commitJob very little to do.
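The v2 steps can be sketched the same way. Again this is a plain-Python simulation with hypothetical task names, not Hadoop code: each commitTask moves its files straight into $joboutput/, so commitJob only cleans up and drops the _SUCCESS marker.

```python
# Simulation of FileOutputCommitter algorithm v2 using local directories.
import os, shutil, tempfile

joboutput = tempfile.mkdtemp()
app_attempt = "appattempt_0"  # hypothetical $appAttemptID

def commit_task_v2(task_attempt):
    # Step 1: rename every file of the task attempt directly into $joboutput/,
    # done per task as it finishes rather than once at the end of the job.
    src = os.path.join(joboutput, "_temporary", app_attempt, "_temporary", task_attempt)
    for f in os.listdir(src):
        os.rename(os.path.join(src, f), os.path.join(joboutput, f))

def commit_job_v2():
    # Step 3: nothing left to merge -- delete _temporary and write _SUCCESS.
    shutil.rmtree(os.path.join(joboutput, "_temporary"))
    open(os.path.join(joboutput, "_SUCCESS"), "w").close()

# Simulate one map task writing and committing its output.
task_dir = os.path.join(joboutput, "_temporary", app_attempt,
                        "_temporary", "attempt_0_m_000000_0")
os.makedirs(task_dir)
open(os.path.join(task_dir, "part-m-00000"), "w").close()
commit_task_v2("attempt_0_m_000000_0")
commit_job_v2()
print(sorted(os.listdir(joboutput)))  # ['_SUCCESS', 'part-m-00000']
```

The per-task rename work now happens in parallel across tasks while the job is still running, and the final single-threaded commitJob step is reduced to a delete and a touch.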

Implementation:

To enable the new algorithm, set mapreduce.fileoutputcommitter.algorithm.version to 2 in mapred-site.xml on all the nodes, or pass it at job submission with 
-Dmapreduce.fileoutputcommitter.algorithm.version=2 .
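For the cluster-wide option, a minimal mapred-site.xml fragment would look like this (property name as in the JIRA above):

```xml
<!-- mapred-site.xml: switch FileOutputCommitter to algorithm version 2 -->
<property>
  <name>mapreduce.fileoutputcommitter.algorithm.version</name>
  <value>2</value>
</property>
```

The -D form above sets the same property for a single job without touching the cluster configuration.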

Verify :

We can verify whether a job used algorithm 1 or 2 from the AM container logs, as below. 

2017-02-14 00:46:06,817 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1




