Wednesday, July 26, 2017

Job affecting other Jobs in cluster


Recently we saw an interesting occurrence of an issue described earlier (below); the only variation was that a bad job was impacting a very important job by randomly killing its containers with the message "Killing taskAttempt because it is running on unusable node".


One bad job was filling up the NM local cache dirs on various nodes (/opt/mapr/tmp/hadoop-mapr/nm-local-dir), causing those nodes to go bad and terminate all running containers, as explained in the earlier blog.

The steps below were followed to find the culprit job.

1) Found the latest attempt of my job which was killed (due to the NM becoming unstable) and noted the node it ran on; one way to locate the kill message is sketched below.
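A minimal sketch of one way to locate that kill message and the node it points at, assuming log aggregation is enabled and the impacted job has finished (the application ID below is a placeholder); for a still-running job the same message can be found in the AM logs via the ResourceManager UI:

# The AM logs the node ID on the same line as the kill message
yarn logs -applicationId application_1496463308975_240000 2>/dev/null \
  | grep "because it is running on unusable node"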

2) On that node, collected all the NM logs from the time the start message below was logged to the time the corresponding end message was logged.

Start Message :    "local-dirs are bad"
------
------
-----
End Message :    " local-dirs are good"

3) Within this window, as we know, all the running containers would be killed and cleanups would happen for the NM to get back to a healthy state. Got a list of all the containers which were killed to narrow down the jobs which could be the culprit. Alternatively, we can grep for the applications which are cleaned up between the time the NM goes bad and the time it becomes good again (see the sketch after the example log line below).

Example log line :


INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /opt/mapr/tmp/hadoop-mapr/nm-local-dir/usercache/User/appcache/application_1496463308975_240690
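A minimal sketch of steps 2 and 3 combined, assuming the NM log path below (it varies by install and version) and that the "bad"/"good" messages appear in order:

# Path is an assumption; adjust to your NM log location and file name
NM_LOG=/opt/mapr/hadoop/hadoop-2.7.0/logs/yarn-mapr-nodemanager-$(hostname).log
# Extract the window between the node going bad and becoming healthy again
awk '/local-dirs are bad/{flag=1} flag{print} /local-dirs are good/{flag=0}' "$NM_LOG" > /tmp/nm_bad_window.log
# List the applications cleaned up in that window; these are the suspects
grep -o 'application_[0-9]*_[0-9]*' /tmp/nm_bad_window.log | sort -u > /tmp/suspects_run1.txt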


4) Repeated the above steps for another similar occurrence and intersected the two lists to find the common jobs and narrow down the suspects (useful in case the suspect count is huge); see the sketch below.
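A minimal sketch of the intersection, assuming the suspect lists from each occurrence were saved as in the previous sketch (file names are placeholders):

# Applications that show up in both occurrences are the strongest suspects
comm -12 <(sort /tmp/suspects_run1.txt) <(sort /tmp/suspects_run2.txt)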

5) In my case I was left with 3 suspect jobs; one of them appeared to have data skew or some similar issue which was filling up the local NM cache and, in turn, impacting the important job.

Bingo: on inspecting the jobs we saw that one of them had 8K maps completed and ~1K reducers, all of which had completed except one that was still running. On checking the reducer stats it was clear this reducer was doing pretty much all the work, most likely due to skew, since its shuffle bytes were ~262 GB and still increasing. On checking further it was seen that this reducer had been spun up on multiple nodes and was killed every time because it used up all the local NM space and made the node unusable. At this point we were clear this job would never succeed and needed to be stopped after getting the app team involved (to make sure such issues don't happen in future).

Name                          Value
Combine input records 0
Combine output records 0
CPU time spent (ms) 1,533,690
Failed Shuffles 0
GC time elapsed (ms) 85,562
Merged Map outputs 3,702
Physical memory (bytes) snapshot 7,814,942,720
Reduce input groups 0
Reduce input records 0
Reduce output records 0
Reduce shuffle bytes 262,595,013,166
Shuffled Maps 3,750
Spilled Records 0
Total committed heap usage (bytes) 7,265,714,176
Virtual memory (bytes) snapshot 9,836,429,312 

As a long-term solution, the application team was informed about this issue and asked to fix this kind of skew to prevent further production impact.

ADMIN TASK: Pro-actively find such bad jobs and kill them before they impact production users.

We realized we needed to monitor usage of the NM local dir, find out if any job is filling up the space, and alert the admin to take action. Below is an example where we caught a culprit job filling up the NM local dir live and were able to point out the application ID.

[mapr@NodeA:/opt/mapr/tmp/hadoop-mapr/nm-local-dir/usercache] sudo du -sh *  |grep G
1.2G    UserA
98G     UserB
5.4G    UserC

[mapr@NodeA:/opt/mapr/tmp/hadoop-mapr/nm-local-dir/usercache/UserB] sudo du -sh *  |grep G
99G     appcache
[mapr@dbslp1027:/opt/mapr/tmp/hadoop-mapr/nm-local-dir/usercache/UserB/appcache] sudo du -sh *  |grep G
99G     application_1499818693665_1082462

Automation :

To automate this, I wrote a small script which monitors the NM local dir and reports any huge files that could be suspects.

[root@node107rhel72 ~]# cat NM_Temp_SpaceMonitor.sh 
#!/bin/bash
##################################################################
#
#     This script finds any huge files (> 50 GB) under NM_LOCAL_DIR
#     and reports them, for the admin to take action.
#
##################################################################
NM_LOCAL_DIR=/opt/mapr/tmp/hadoop-mapr/nm-local-dir/

if [ ! -d "$NM_LOCAL_DIR" ]; then
  echo "ERROR: Not a directory: $NM_LOCAL_DIR"
  exit 1
fi

# Full paths of any files larger than 50 GB under the NM local dir
CULPRIT_FILE_LOC=$(find "$NM_LOCAL_DIR" -type f -size +50G)
echo "Complete Path $CULPRIT_FILE_LOC"
# Pull the application_<id> component out of each offending path
echo "$CULPRIT_FILE_LOC" | tr '/' '\n' | grep '^application_' | sort -u | xargs -r echo CULPRIT_APPLICATION is

##################################################################


Note: The above script only catches files larger than 50 GB, but recently I found a Spark job which was making NMs unusable because it had a huge blockmgr directory while its individual files were only ~2-3 GB. I used the command below to get the culprit application ID, and added extra automation to catch such Spark jobs as well (a sketch follows the output).

[root@tssperf09 ~]# du -h /opt/mapr/tmp/hadoop-mapr/nm-local-dir |awk '$1 ~ /[0-9]*G/ {print}' |sort -nr|sed 's/G//g' |awk '{ if ( $1 > 50.0 ) print }' | grep blockmgr |cut -d'/' -f10

application_1521597354345_3389657


[root@tssperf09 ~]# 
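A minimal sketch of that extra check, keyed on directory sizes instead of file sizes; the 50 GB threshold and the blockmgr pattern match the command above, and the application ID extraction assumes the usual usercache/<user>/appcache/<application_id>/blockmgr-* layout:

#!/bin/bash
# Flag applications whose blockmgr-* directories under the NM local dir exceed the threshold
NM_LOCAL_DIR=/opt/mapr/tmp/hadoop-mapr/nm-local-dir
THRESHOLD_GB=50

du -BG "$NM_LOCAL_DIR" 2>/dev/null \
  | awk -v limit="$THRESHOLD_GB" '$1+0 > limit' \
  | grep blockmgr \
  | grep -o 'application_[0-9]*_[0-9]*' \
  | sort -u \
  | xargs -r echo CULPRIT_APPLICATION is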

Thursday, July 20, 2017

Distcp Across Secure MapR clusters




This Blog assumes you have 2 clusters up and running securely


Note: For this blog I have just 1 node in each cluster, but where needed I will mention which files are required on all nodes.

Source Cluster  - Node node106rhel72/10.10.70.106
Destination Cluster  - Node node107rhel72/10.10.70.107  

1)  i) On all nodes in the SOURCE CLUSTER verify that maprserverticket, cldb.key, ssl_truststore and ssl_keystore are the same. Run md5sum on these files on each node to confirm.
    ii) Make sure the destination cluster details are added to the "mapr-clusters.conf" file.


[root@node106rhel72 ~]# cat /opt/mapr/conf/mapr-clusters.conf
Container-cluster secure=true 10.10.70.106:7222
Container-cluster2 secure=true 10.10.70.107:7222

2) On all nodes in the DESTINATION CLUSTER verify that maprserverticket, cldb.key, ssl_truststore and ssl_keystore are the same. Run md5sum on these files on each node to confirm (a minimal sketch of this check follows).
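A minimal sketch of the md5sum check, assuming passwordless ssh from the node where it runs and that NODES holds the cluster's node names (the value below is a placeholder); note that cldb.key is typically present only on CLDB/ZooKeeper nodes:

# Compare checksums of the security files across all nodes; a mismatching line stands out after sorting
NODES="node106rhel72"   # space-separated list of all nodes in the cluster
for n in $NODES; do
  ssh "$n" md5sum /opt/mapr/conf/maprserverticket /opt/mapr/conf/cldb.key \
                  /opt/mapr/conf/ssl_truststore /opt/mapr/conf/ssl_keystore
done | sort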
3)  i) Copy /opt/mapr/conf/ssl_truststore from the DESTINATION CLUSTER to the CLDB node of the SOURCE CLUSTER under /tmp/.

[root@node107rhel72 conf]# scp  /opt/mapr/conf/ssl_truststore  10.10.70.106:/tmp/
root@10.10.70.106's password: 
ssl_truststore                                                                                                                         100%  798     0.8KB/s   00:00    
[root@node107rhel72 conf]#

    ii) Now run the commands below to merge the ssl_truststore on the SOURCE CLUSTER.

Note: Skip the ssl_truststore merge step if you have already done it earlier.
$ chmod 644 /opt/mapr/conf/ssl_truststore
$ /opt/mapr/server/manageSSLKeys.sh merge /tmp/ssl_truststore /opt/mapr/conf/ssl_truststore
$ chmod 444 /opt/mapr/conf/ssl_truststore

4) Copy the merged truststore file '/opt/mapr/conf/ssl_truststore' to all the nodes in the SOURCE CLUSTER under /opt/mapr/conf/.

5) Generate a cross-cluster ticket on the DESTINATION CLUSTER for the user who wants to do the distcp (mapr in our case); in this case I created a ticket that lasts for 10 years.

$ maprlogin generateticket -type crosscluster -out /tmp/destination-ticket -duration 3650:0:0 

Note: It is critical to specify an appropriate value for the duration. After the ticket expires, communication between the clusters will stop. In this example, a duration of ten years is used for convenience of explanation; use a value that is consistent with your security policies.

6) Copy the file /tmp/destination-ticket from the DESTINATION CLUSTER to the SOURCE CLUSTER's CLDB node under /tmp.


scp /tmp/destination-ticket  10.10.70.106:/tmp/

7) On the SOURCE CLUSTER, append the content of the file /tmp/destination-ticket to /opt/mapr/conf/maprserverticket.

$ cat /tmp/destination-ticket >> /opt/mapr/conf/maprserverticket


8) Copy the file /opt/mapr/conf/maprserverticket to all the nodes in the SOURCE CLUSTER.

9) Stop Warden and ZooKeeper on the SOURCE CLUSTER, then start ZooKeeper and, once ZK is up, start Warden (a typical restart sequence is sketched below).
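A minimal sketch of that restart sequence, assuming the standard mapr-warden and mapr-zookeeper service names (the ZooKeeper commands apply only to the ZK nodes):

# On every node: stop Warden first
service mapr-warden stop
# On ZooKeeper nodes: restart ZooKeeper and confirm the quorum is healthy
service mapr-zookeeper stop
service mapr-zookeeper start
service mapr-zookeeper qstatus
# On every node: start Warden again once ZK is up
service mapr-warden start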

10) On the SOURCE CLUSTER, create user tickets for the user mapr for both the source and the destination cluster.

maprlogin password
maprlogin password -cluster Dest

cat /tmp/maprticket_2000
Source KV34qQ0jtmQXObJglDiZqqHHm507pbYOsHd4qIEEavC+0PGDlB/YeTBGReOxf+EleSEO78pYvNqzoqK5uK+5Gibx0v+XPEyl2UuDgBR6GUBwx4yUUxnUY7Ct4STdcHmvcyE47AVM4gXc9ivQCvkokyIvZwYiGtwVQ8rnTNrLuzuUPAH8GMbR486UgMQ8axy8QIcA2zexIT0K0Ct7Fj612UPVonXZDfnAB2yG5gEhdmxLOMPmQLm9qt6f49Pzrn96IwHGLXQtUAmfrTwrbPPPOSUshA==
Dest 4D9Z469Y3j7h3sy2CVZwQrlXDEWHCtmCENQQGFvVzoGsytXp4K3OLOf+BZhLIoTBZuu2uzmV/1SbnqYUfO9NXsxAx3Bomez9iZ3ni7Kfk9m9CTEPydl9updp8IFQZ83jQ7IERM3WgN/rouEg3T/BnwPA2+U2cnGjeeCgXH3lmopJGiYFCegXWhhn9TmKawH0Vp4f3tDBBo2nWjr1sCnBvsBXhYP6DQzA3vLdmbGWQn6d2IJRNUA0irG8MSjxzZ4E9y4S2hu4gnLYE0IXgXNoWWhawQ==


Validation :


1) Created a test file and pushed it to the source cluster.
[root@node106rhel72 ~]# vi abi
[mapr@node106rhel72 ~]# hadoop fs -put abi /
[mapr@node106rhel72 ~]# hadoop fs -ls /abi
Found 7 items
-rwxr-xr-x   3 root root        266 2017-07-20 19:16 /abi

2) Now run distcp across the clusters, preserving the permissions.

[root@node106rhel72 ~]# hadoop distcp -p /mapr/Container-cluster/abi /mapr/Container-cluster2/tmp/
17/07/20 19:16:59 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[/mapr/Container-cluster/abi], targetPath=/mapr/Container-cluster2/tmp, targetPathExists=true, preserveRawXattrs=false}
17/07/20 19:16:59 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to node106rhel72/10.10.70.106:8032
17/07/20 19:17:00 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
17/07/20 19:17:00 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
17/07/20 19:17:00 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to node106rhel72/10.10.70.106:8032
17/07/20 19:17:00 INFO mapreduce.JobSubmitter: number of splits:1
17/07/20 19:17:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1500592302342_0009
17/07/20 19:17:01 INFO security.ExternalTokenManagerFactory: Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
17/07/20 19:17:01 INFO impl.YarnClientImpl: Submitted application application_1500592302342_0009
17/07/20 19:17:01 INFO mapreduce.Job: The url to track the job: https://node106rhel72:8090/proxy/application_1500592302342_0009/
17/07/20 19:17:01 INFO tools.DistCp: DistCp job-id: job_1500592302342_0009
17/07/20 19:17:01 INFO mapreduce.Job: Running job: job_1500592302342_0009
17/07/20 19:17:09 INFO mapreduce.Job: Job job_1500592302342_0009 running in uber mode : false
17/07/20 19:17:09 INFO mapreduce.Job:  map 0% reduce 0%
17/07/20 19:17:14 INFO mapreduce.Job:  map 100% reduce 0%
17/07/20 19:17:14 INFO mapreduce.Job: Job job_1500592302342_0009 completed successfully
17/07/20 19:17:14 INFO mapreduce.Job: Counters: 34
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=100285
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
MAPRFS: Number of bytes read=631
MAPRFS: Number of bytes written=266
MAPRFS: Number of read operations=33
MAPRFS: Number of large read operations=0
MAPRFS: Number of write operations=1
Job Counters 
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=2800
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=2800
Total vcore-seconds taken by all map tasks=2800
Total megabyte-seconds taken by all map tasks=2867200
DISK_MILLIS_MAPS=1400
Map-Reduce Framework
Map input records=1
Map output records=0
Input split bytes=144
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=0
CPU time spent (ms)=440
Physical memory (bytes) snapshot=271511552
Virtual memory (bytes) snapshot=2987470848
Total committed heap usage (bytes)=904396800
File Input Format Counters 
Bytes Read=221
File Output Format Counters 
Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=266
BYTESEXPECTED=266
COPY=1

3) Validated that the file exists on the destination cluster.

[root@node106rhel72 ~]# hadoop fs -ls /mapr/Container-cluster2/tmp/abi
-rwxr-xr-x   3 root root        266 2017-07-20 19:16 /mapr/Container-cluster2/tmp/abi