Sunday, May 14, 2017

Jobs running Slow (RM allocating Containers Very slowly)



I am sure almost every Hadoop admin has, at one time or another, received a complaint or ticket from a client team saying that their jobs are running slow. Most of the time the cause is straightforward: data skew, a queue out of resources, a node swapping or otherwise running out of system resources, and so on. Last week, however, I had an interesting issue where all jobs were running slow even though the queue/cluster was only about 10% utilized.

Observation: When a job was submitted, the AM would spin up and start asking for containers to execute its attempts. But even when the AM needed 100 containers, it would get only 10-15, while the remaining attempts sat in the UNASSIGNED state for a long time, even though the queue had capacity to run far more containers in parallel. So it was obvious that the slowdown was due to containers not being allocated by the scheduler.

2017-05-11 09:01:30,424 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1492222427384_2309252_m_029818_0 TaskAttempt Transitioned from NEW to UNASSIGNED
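As a quick way to quantify this, you can count how many times attempts transitioned into each state by tallying the messages in the AM log. This is just a sketch based on the log format above; the location of the AM syslog depends on your log aggregation setup (here it is assumed to be a local file named syslog):

# Count "TaskAttempt Transitioned" messages per target state (last field),
# e.g. how many attempts went to UNASSIGNED vs RUNNING.
grep 'TaskAttempt Transitioned' syslog | awk '{print $NF}' | sort | uniq -c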

As everyone knows, performance issues are the most painful to troubleshoot, and they can only be troubleshot effectively if we have stats from a time when the system was functioning as expected, so that we know what to expect and have a benchmark to compare against.

Fortunately, the org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode class logs messages like the one below, which tell us when the scheduler assigned a container on an NM (based on the request from the NM):

Assigned container container_XXX of capacity <memory:2048, vCores:1, disks:0.07> on host XXX

During the time of the issue, I ran the command below to see how many containers were getting allocated per minute. As we can see, it was roughly 300 containers per minute.

grep -i allo yarn-mapr-resourcemanager-<Hostname>.log |grep  Assigned|grep ^'2017-05-11 16'|awk '{print $1,$2}'|awk -F: '{print $1,$2}' |sort|uniq -c
    
    287 2017-05-11 16 07
    383 2017-05-11 16 08
    213 2017-05-11 16 09

Now I wasn't sure whether 300 was a good number or a bad one, so I needed the same detail from a time when the issue was not seen. I found a window with no slowness and ran the same command; as shown below, it was roughly 10k containers per minute.

grep -i alloc yarn-mapr-resourcemanager-<Hostname>.log | grep  Assigned|grep ^'2017-05-13' |awk '{print $1,$2}' |awk -F: '{print $1,$2}' | sort | uniq -c
  
  11244 2017-05-13 14 19
  10903 2017-05-13 14 20

So now we knew something was definitely wrong in the ResourceManager (most likely the scheduler thread), since jobs were moving, but very slowly.

Troubleshooting:

1) Since we had confirmed that container allocation was slow, we checked system stats on the RM node as well as the AM node. They came out clean; we didn't see issues with memory, CPU, network, or disks. A few of the standard checks are shown below.
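For reference, this is the kind of quick health check run on the RM and AM hosts (a sketch; iostat and sar assume the sysstat package is installed):

vmstat 2 5          # run queue, swapping, CPU
free -m             # memory and swap usage
iostat -x 2 3       # per-disk utilization and await
sar -n DEV 2 3      # network throughput per interface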

2) Next we turned on DEBUG logging for the RM scheduler class and for the AM of a sample job.

RM Debug:  yarn daemonlog -setlevel <ActiveRM-IP>:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler DEBUG
AM Debug:  set yarn.app.mapreduce.am.command-opts = "-Xmx1024m -Dhadoop.root.logger=DEBUG,CLA" in mapred-site.xml (client node)
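If you only want DEBUG on the AM of a single test job, you can also pass the property at submit time instead of editing mapred-site.xml. A minimal sketch, assuming the stock hadoop-mapreduce-examples jar is available on the client node:

# Submit a small sample job with DEBUG logging enabled only for its AM.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi \
    -D yarn.app.mapreduce.am.command-opts="-Xmx1024m -Dhadoop.root.logger=DEBUG,CLA" \
    10 100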

But there was no useful information logged that could give us a clue as to why the scheduler was not assigning containers fast enough. Yet looking at the Scheduler UI on the RM page, we could see that the queue had enough resources, but no more than 25 containers were getting scheduled:

Used Resources: <memory:51200, vCores:25, disks:1.680000000000001>
Num Active Applications: 1
Num Pending Applications: 0
Min Resources: <memory:820000, vCores:100, disks:8.0>
Max Resources: <memory:1210000, vCores:150, disks:12.0>
Steady Fair Share: <memory:1210000, vCores:150, disks:8.0>
Instantaneous Fair Share: <memory:1210000, vCores:150, disks:12.0> 

3) Finally, I ran jstack / "kill -3" against the RM pid every few seconds to find out what every thread in the JVM was doing at that particular point in time and whether the threads were progressing.
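A minimal sketch of the dump collection (RM_PID is a placeholder for the actual ResourceManager process id):

# Capture a thread dump of the RM every 10 seconds, 10 times, into
# timestamped files so progress between snapshots can be compared.
for i in $(seq 1 10); do
    jstack "$RM_PID" > "rm_jstack_$(date +%H%M%S).txt"
    sleep 10
done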

Then I needed to analyze the dumps to see where threads were stuck, if any, or where (in which code) they were spending most of their time. I recently started using https://spotify.github.io/threaddump-analyzer/ to make thread dump analysis less painful.

i) There were 287 threads in total.

101 threads with this stack:
"IPC Server handler 9 on 8032/8030": awaiting notification on [0x00000003c90eea78]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at org.apache.hadoop.ipc.CallQueueManager.take(CallQueueManager.java:109)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2004)
49 threads with this stack:
"IPC Server handler 9 on 8031": waiting to acquire [0x00000003c00aab10]
at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.isValidNode(NodesListManager.java:144)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(ResourceTrackerService.java:387)
at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceTrackerPBServiceImpl.nodeHeartbeat(ResourceTrackerPBServiceImpl.java:68)
at org.apache.hadoop.yarn.proto.ResourceTracker$ResourceTrackerService$2.callBlockingMethod(ResourceTracker.java:81)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2032)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2030)

There were also some other threads that were sleeping, etc.

Finally, I found the thread of interest: the only running thread, on which all the other threads were waiting to get a lock.


0x00000003c00aab10
HostsFileReader
Held by:
IPC Server handler 19 on 8031
49 threads waiting to take lock:
Complete stack :
"IPC Server handler 19 on 8031": running, holding [0x00000003c00aab10]
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:563)
at org.apache.hadoop.yarn.server.resourcemanager.NodesListManager.isValidNode(NodesListManager.java:147)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(ResourceTrackerService.java:387)
at org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceTrackerPBServiceImpl.nodeHeartbeat(ResourceTrackerPBServiceImpl.java:68)
at org.apache.hadoop.yarn.proto.ResourceTracker$ResourceTrackerService$2.callBlockingMethod(ResourceTracker.java:81)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2032)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2030)

ii) To understand which service is served by each port, here is the mapping:

Service                                                   Port
ResourceManager Client RPC                                8032
ResourceManager Resource Tracker RPC (for NodeManagers)   8031
ResourceManager Scheduler RPC (for ApplicationMasters)    8030
Since the main thread in question was running on port 8031, which all the other threads were waiting on, I had a strong clue that the RM scheduling delay was happening in the Resource Tracker RPC handling (NodeManager heartbeats).

From the stack trace it is clear that the slowness most likely occurs when some kind of hostname lookup happens, i.e. when the "lookupAllHostAddr" method is called.

4) To verify my suspicion I wrote a small script that runs from the RM node, SSHes to every node in the cluster, and runs the command below in a loop (a rough sketch follows the command). Bingo: we found a few nodes where it was taking almost 20 seconds to get a response, while from most nodes the response came back in milliseconds.

getent hosts `hostname`
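A rough sketch of such a script (cluster_hosts is a made-up file listing all node hostnames, and the remote login shell is assumed to be bash so the time keyword is available):

# From the RM node: time the lookup of each node's own hostname on that node.
# Anything measured in whole seconds rather than milliseconds is suspect.
for h in $(cat cluster_hosts); do
    printf '%s: ' "$h"
    ssh -o ConnectTimeout=5 "$h" 'time getent hosts "$(hostname)"' 2>&1 | grep real
done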

So now we had confirmed that the issue was due to infrastructure (DNS resolution / network) and Hadoop was out of the picture.

5) I logged into one of the nodes where lookups were experiencing huge delays and observed that /etc/resolv.conf had a wrong entry for the first nameserver (this nameserver was painfully slow to respond). I verified that some nodes had this obsolete nameserver entry, while the good nodes had a nameserver that was local to the network and quick to respond.

Solution in my case:

As a solution, across the cluster we made the first nameserver entry point to the server that was local to the network and quick to respond, and then restarted the nscd service. As soon as this was done, all jobs completed quickly. Yay!
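For illustration only, a sketch of the change on one node (the IP addresses are made up; if resolv.conf is managed by NetworkManager or dhclient in your environment, make the change there instead so it is not overwritten):

# Put the fast, network-local nameserver first; keep the old one only as a fallback.
printf 'nameserver 10.10.0.2\nnameserver 192.0.2.53\n' | sudo tee /etc/resolv.conf
# Restart nscd so cached (slow) lookups are flushed and the new order takes effect.
sudo service nscd restart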


Other possibilities (which could add delays):

Check whether IPv6 is enabled on the hosts. If it is and you are not using it, disable IPv6 (first on the control nodes and then gradually on every node, in batches). A sketch of one way to do this is shown below.
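One common way to disable IPv6 at runtime on a Linux node (persist the settings in /etc/sysctl.conf if they should survive a reboot):

# Disable IPv6 on all current and future interfaces.
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1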

As seen in the stack trace, the method below is called, so there could be lookup issues between IPv6 and IPv4 in some cases: if the Domain Name System (DNS) server is not configured to handle IPv6 queries, the application may have to wait for the IPv6 query to time out. By default, java.net.InetAddress resolutions requiring the network are carried out over IPv6 first if the operating system supports both IPv4 and IPv6.

java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)

To reverse the default behavior and use IPv4 over IPv6, add the following generic JVM argument to the YARN daemons (RM, NM, etc.):

  -Djava.net.preferIPv4Stack=true
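For example, one way to add the flag for the RM and NM, assuming your distribution picks up daemon options from yarn-env.sh (a sketch; restart the daemons afterwards):

# In yarn-env.sh on the RM and NM nodes:
export YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -Djava.net.preferIPv4Stack=true"
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -Djava.net.preferIPv4Stack=true"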




