Wednesday, July 26, 2017

Job affecting other Jobs in cluster


Recently we saw an interesting occurrence of an issue described earlier (below). The only variation was that a bad job was impacting a very important job by randomly killing its containers with the message "Killing taskAttempt because it is running on unusable node".


One bad job was filling up the NM local cache dirs on various nodes (/opt/mapr/tmp/hadoop-mapr/nm-local-dir), causing those nodes to be marked unhealthy and all their running containers to be terminated, as explained in the earlier blog.

Below are the steps that were followed to find the culprit job.

1) Found the latest killed attempt of my job and got the name of the node on which it ran (killed due to the NM being unhealthy).

2) On that node, pulled the NM logs from the time the start message below was logged until the time the corresponding end message was logged.

Start Message :    "local-dirs are bad"
------
------
-----
End Message :    "local-dirs are good"

3) During this window, as we know, all running containers are killed and cleanups happen so the NM can get back to a healthy state. Got a list of all the containers that were killed, to get a sense of which jobs could be the culprit. Alternatively, we can grep for the applications that are cleaned up between the time the NM goes bad and the time it becomes good again (a small grep/sed sketch for this is shown after the step list).

Example log line :


INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /opt/mapr/tmp/hadoop-mapr/nm-local-dir/usercache/User/appcache/application_1496463308975_240690


4) Repeated the above steps for another similar occurrence and looked for jobs common to both, to narrow down the list of suspects (in case the suspected job count is huge).

5) In my case I was left with 3 suspect jobs that could have impacted the important job; one of them appeared to have data skew or some other issue that was filling up the local NM cache.
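For steps 2) and 3), something like the sketch below can pull the unhealthy window out of the NM log and list the application IDs that were cleaned up inside it. This is only a sketch; the NM log path is a placeholder, so point it at the NodeManager log file on your node.

#!/bin/bash
# Sketch only: extract the NM log window between "local-dirs are bad" and
# "local-dirs are good", then list the applications cleaned up inside it.
NM_LOG=/path/to/yarn-nodemanager.log   # placeholder - use your NM log file

# grab everything from the "bad" message up to the following "good" message
sed -n '/local-dirs are bad/,/local-dirs are good/p' "$NM_LOG" > /tmp/nm_bad_window.log

# distinct application IDs whose local cache was deleted while the node was unhealthy
grep -oE 'application_[0-9]+_[0-9]+' /tmp/nm_bad_window.log | sort | uniq -c | sort -nr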

Bingo: on inspecting the jobs we saw that one of them had 8K maps completed and ~1K reducers, all of which had completed except one that was still running. On checking the reducer stats it was clear this reducer was doing pretty much all the work, likely due to skew, since its shuffle bytes were ~262 GB and still increasing. On checking further it was seen that this reducer had been spun up on multiple nodes and was killed every time, since it used up all the local NM space and made the node unusable. Now we were clear this job would never succeed and needed to be stopped after getting the app team involved (to make sure such issues don't happen in the future).

Name                                     Value
Combine input records                    0
Combine output records                   0
CPU time spent (ms)                      1,533,690
Failed Shuffles                          0
GC time elapsed (ms)                     85,562
Merged Map outputs                       3,702
Physical memory (bytes) snapshot         7,814,942,720
Reduce input groups                      0
Reduce input records                     0
Reduce output records                    0
Reduce shuffle bytes                     262,595,013,166
Shuffled Maps                            3,750
Spilled Records                          0
Total committed heap usage (bytes)       7,265,714,176
Virtual memory (bytes) snapshot          9,836,429,312

As a long-term solution, the application team was informed about this issue and asked to fix this kind of skew to prevent further production impact.

ADMIN TASK: Proactively find such bad jobs and kill them before they impact production users.

We realized we need to monitor usage of the NM local dir, find out if any job is filling up the space, and alert the admin to take action. Below is an example where we caught a culprit job filling up the NM local dir live and were able to point out the application ID.

[mapr@NodeA:/opt/mapr/tmp/hadoop-mapr/nm-local-dir/usercache] sudo du -sh *  |grep G
1.2G    UserA
98G     UserB
5.4G    UserC

[mapr@NodeA:/opt/mapr/tmp/hadoop-mapr/nm-local-dir/usercache/UserB] sudo du -sh *  |grep G
99G     appcache
[mapr@dbslp1027:/opt/mapr/tmp/hadoop-mapr/nm-local-dir/usercache/UserB/appcache] sudo du -sh *  |grep G
99G     application_1499818693665_1082462
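A quicker way to jump straight to the offending application ID is to ask du for per-application totals under usercache directly. This is just a sketch using the same nm-local-dir path as above:

sudo du -sh /opt/mapr/tmp/hadoop-mapr/nm-local-dir/usercache/*/appcache/application_* 2>/dev/null | sort -hr | head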

Automation :

To automate this, I wrote a small script that monitors the NM local dir and finds any huge files that could be suspects.

[root@node107rhel72 ~]# cat NM_Temp_SpaceMonitor.sh 
#!/bin/bash

##################################################################
#
#     This script will find any huge files (> 50 GB) under NM_LOCAL_DIR
#     and report them, for the Admin to take action.
#
##################################################################
NM_LOCAL_DIR=/opt/mapr/tmp/hadoop-mapr/nm-local-dir/

if [ ! -d "$NM_LOCAL_DIR" ]; then
  echo "ERROR: Not a directory: $NM_LOCAL_DIR"
  exit 1
fi

# Full path(s) of any file bigger than 50 GB under the NM local dir
CULPRIT_FILE_LOC=$(find "$NM_LOCAL_DIR" -type f -size +50G)
echo "Complete Path $CULPRIT_FILE_LOC"

# Pull the application ID(s) out of the path(s) and report them
echo "$CULPRIT_FILE_LOC" | grep -oE 'application_[0-9]+_[0-9]+' | sort -u | xargs -r echo "CULPRIT_APPLICATION is"

##################################################################
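One possible way to run this check periodically is a cron entry like the one below. This is only illustrative (the script location and log file name are assumptions); wire the output into whatever alerting your environment already uses.

*/15 * * * * /root/NM_Temp_SpaceMonitor.sh >> /var/log/nm_temp_space_monitor.log 2>&1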


Note :- The above script only catches individual files larger than 50 GB, but recently I found a Spark job that was making the NM unusable because it had a huge blockmgr dir while its individual files were only ~2-3 GB. I used the command below to get the culprit application ID and added extra automation to find such Spark jobs as well; a sketch of that extra check follows the command output below.

[root@tssperf09 ~]# du -h /opt/mapr/tmp/hadoop-mapr/nm-local-dir |awk '$1 ~ /[0-9]*G/ {print}' |sort -nr|sed 's/G//g' |awk '{ if ( $1 > 50.0 ) print }' | grep blockmgr |cut -d'/' -f10

application_1521597354345_3389657


[root@tssperf09 ~]# 
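A sketch of that extra check is below. It uses the same nm-local-dir path and 50 GB threshold as earlier in this post, but the awk aggregation is illustrative rather than the exact automation I run.

#!/bin/bash
# Sketch: flag any application whose combined blockmgr-* directories under the
# NM local dir exceed ~50 GB, even when no single file is large.
NM_LOCAL_DIR=/opt/mapr/tmp/hadoop-mapr/nm-local-dir
THRESHOLD_KB=$((50 * 1024 * 1024))   # 50 GB expressed in KB (du -k units)

find "$NM_LOCAL_DIR" -type d -name 'blockmgr-*' -exec du -sk {} + 2>/dev/null | awk -v limit="$THRESHOLD_KB" '
  {
    # pull the application_<cluster>_<seq> component out of the path and sum sizes per app
    if (match($2, /application_[0-9]+_[0-9]+/)) {
      app = substr($2, RSTART, RLENGTH)
      size[app] += $1
    }
  }
  END {
    for (app in size)
      if (size[app] > limit)
        printf "CULPRIT_SPARK_APPLICATION is %s (%.1f GB of blockmgr data)\n", app, size[app] / 1024 / 1024
  }'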
