Tuesday, June 20, 2017

Debugging Failed Jobs due to Failed AM Containers

As described in YARN-522, when the AM crashes, all we see in the job client is the log message below. It gives no indication of what made the AM crash twice and the job eventually fail.

17/06/20 17:12:27 INFO mapreduce.Job:  map 0% reduce 0%
17/06/20 17:12:27 INFO mapreduce.Job: Job job_1497941938392_0007 failed with state FAILED due to: Application application_1497941938392_0007 failed 2 times due to AM Container for appattempt_1497941938392_0007_000002 exited with  exitCode: 1
For more detailed output, check application tracking page:http://node107rhel72:8088/cluster/app/application_1497941938392_0007Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e02_1497941938392_0007_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:563)
 at org.apache.hadoop.util.Shell.run(Shell.java:460)
 at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:748)


--- To debug this, the first thing to check is the container logs, either via "yarn logs -applicationId <application ID>" (if log aggregation is turned on) or under the userlogs directory, to understand why the AM is failing. Once we have those details, the error message in the container logs almost always lets us fix the issue quickly.
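For example, with log aggregation on, the AM logs for the failed application above can be pulled directly (run this as the user who submitted the job, here mapr):

yarn logs -applicationId application_1497941938392_0007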


--- There have been cases where the logs get deleted as soon as the job fails, so there is no way to debug further. To capture the details, below is what can be done:

i) Run the job on a single node via label-based scheduling and verify it fails with the exact same error message. (Choose a node which doesn't have a bunch of other jobs running.)
ii) Now when the job is submitted again, back up the contents of the location below every second (see the sketch after it). This will capture the container logs before they are deleted, so we can get to the root cause.

/opt/mapr/hadoop/hadoop-2.7.0/logs/userlogs
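A minimal sketch of such a backup loop, assuming the stock MapR log location above and a hypothetical destination directory with enough free space:

#!/bin/bash
# Snapshot the NodeManager userlogs directory once a second so the
# container logs survive after YARN deletes the originals.
SRC=/opt/mapr/hadoop/hadoop-2.7.0/logs/userlogs
DST=/tmp/userlogs-backup    # hypothetical destination; pick any disk with space
mkdir -p "$DST"
while true; do
    cp -r "$SRC" "$DST/snapshot-$(date +%H%M%S)"
    sleep 1
done

Stop the loop with Ctrl-C once the job has failed, then inspect the newest snapshots.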



If all the above tricks fail to diagnose the YARN application problem (the container logs are not being created, or are cleaned up before you can get to them), set the property below to a value large enough (for example, 600 = 10 minutes) to permit examination of these directories. After changing the property's value in yarn-site.xml, you must restart the NodeManager for it to take effect.

<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>
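On a MapR cluster like this one, the NodeManager can be restarted through maprcli (a sketch assuming the standard service syntax; substitute your own node name):

# Restart the NodeManager so the new debug-delay value takes effect
maprcli node services -name nodemanager -action restart -nodes node107rhel72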

Now when the job is run, the localized directory below will still have the jars and the launch_container.sh script, which can be inspected for any issues/clues.

[mapr@node107rhel72 root]$ ls -l /tmp/hadoop-mapr/nm-local-dir/usercache/mapr/appcache/application_1497941938392_0004/container_e02_1497941938392_0004_01_000001
total 12
-rw-------. 1 mapr mapr  100 Jun 20 14:38 container_tokens
lrwxrwxrwx. 1 mapr mapr  105 Jun 20 14:38 job.jar -> /tmp/hadoop-mapr/nm-local-dir/usercache/mapr/appcache/application_1497941938392_0004/filecache/11/job.jar
drwxr-s---. 2 mapr mapr   46 Jun 20 14:38 jobSubmitDir
lrwxrwxrwx. 1 mapr mapr  105 Jun 20 14:38 job.xml -> /tmp/hadoop-mapr/nm-local-dir/usercache/mapr/appcache/application_1497941938392_0004/filecache/13/job.xml
-rwx------. 1 mapr mapr 4394 Jun 20 14:38 launch_container.sh
drwxr-s---. 2 mapr mapr    6 Jun 20 14:38 tmp

[mapr@node107rhel72 root]$ ls -l /tmp/hadoop-mapr/nm-local-dir/usercache/mapr/appcache/application_1497941938392_0004/container_e02_1497941938392_0004_01_000005
total 104
-rw-------. 1 mapr mapr   129 Jun 20 14:38 container_tokens
lrwxrwxrwx. 1 mapr mapr   105 Jun 20 14:38 job.jar -> /tmp/hadoop-mapr/nm-local-dir/usercache/mapr/appcache/application_1497941938392_0004/filecache/11/job.jar
-rw-r-----. 1 mapr mapr 96256 Jun 20 14:38 job.xml
-rwx------. 1 mapr mapr  4081 Jun 20 14:38 launch_container.sh
drwxr-s---. 2 mapr mapr     6 Jun 20 14:38 tmp
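The most useful part of launch_container.sh is usually its final exec line, which shows the exact java command (classpath, JVM options, log settings) the NodeManager runs to start the container. A quick way to pull that out (the grep patterns are just a convenience):

cd /tmp/hadoop-mapr/nm-local-dir/usercache/mapr/appcache/application_1497941938392_0004/container_e02_1497941938392_0004_01_000001
grep -n "exec" launch_container.sh        # the actual container start command
grep -n "CLASSPATH" launch_container.sh   # classpath problems show up here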
