Monday, February 26, 2018

NM blacklisting due to LogAggregationService


Recently I had a very interesting issue where a few NodeManagers (NMs) would get blacklisted within 10 minutes of a restart, and the reasons were not the usual ones like tmp space filling up or a node issue leading to attempt failures.

I will walk through the troubleshooting procedure that was followed, the workaround, and the root cause of why the cluster ran into this issue.

1) The error below was seen on the RM node. It indicated the NM was not heartbeating with the RM, hence its state was transitioning to LOST.


2018-02-25 16:06:31,727 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:NodeA:8091 Timed out after 600 secs
2018-02-25 16:06:31,727 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node NodeA:8091 as it is now LOST
2018-02-25 16:06:31,727 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: NodeA:8091 Node Transitioned from RUNNING to LOST
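To cross-check which nodes the RM currently considers lost, the stock YARN CLI can be used; a minimal sketch (run on any node with the YARN client configured):

# List all nodes known to the RM, including unusable ones, and look for LOST entries
yarn node -list -all | grep -i lost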

2) Checking the system stats made it clear the NM was using all the CPU cores and putting a huge load on the system.
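For reference, a quick way to attribute the CPU load to the NM process and its threads (a sketch, assuming a single NodeManager JVM on the host):

# Find the NodeManager pid and watch its per-thread CPU usage
NM_PID=$(pgrep -f NodeManager)
top -H -p $NM_PID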

3) In the NM .out file I saw messages like the ones below, which suggested user lookups were not working well, NM threads were hitting OutOfMemoryError, and the process was seeing GC pauses. Although this pointed towards an infra issue where user lookups were not being served fast enough, it was just a distraction and made the problem harder to debug.



2018-02-25 16:09:04,1783 ERROR Client fs/client/fileclient/cc/client.cc:787 Thread: 29949 GetUid failed, could not get userinfo for user UserA, error 0

2018-02-25 16:09:04,1802 ERROR Client fs/client/fileclient/cc/client.cc:787 Thread: 29949 GetUid failed, could not get userinfo for user UserB, error 0

Halting due to Out Of Memory Error...
Halting due to Out Of Memory Error...
Halting due to Out Of Memory Error...
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "LogAggregationService #754"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "LogAggregationService #12498"

We noted the NM was running with a 4 GB Xmx; an easy workaround would be to increase Xmx to, say, 12 or 16 GB and see if that took care of the problem.
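For reference, on a stock Apache Hadoop layout the NM heap is usually raised via yarn-env.sh; a minimal sketch, assuming that layout (the exact variable and file location can differ per distribution):

# In yarn-env.sh: raise the NodeManager heap from 4 GB to 12 GB (value is in MB)
export YARN_NODEMANAGER_HEAPSIZE=12288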

4) Before applying any workaround I wanted to check which internal threads were being kicked off and causing the huge load on the system (almost all available CPUs were in use), so I took a jstack of the NM process. I saw a huge number of LogAggregationService threads being spawned.
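The thread dump itself can be captured as below (a sketch; pid lookup assumed via pgrep):

# Dump all NM threads, including lock information, into a file for analysis
jstack -l $(pgrep -f NodeManager) > abizer.txt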


[root@ip-10-165-137-99 Jstack]# grep LogAggregationService abizer.txt | grep prio | head -2
"LogAggregationService #19661" #19812 prio=5 os_prio=0 tid=0x000000000e627000 nid=0x1438 waiting for monitor entry [0x00007f2520652000]
"LogAggregationService #19660" #19811 prio=5 os_prio=0 tid=0x000000000e623800 nid=0x1437 waiting for monitor entry [0x00007f2520753000]

[root@ip-10-165-137-99 Jstack]# grep LogAggregationService abizer.txt | grep prio | wc -l
38397
[root@ip-10-165-137-99 Jstack]#


At this point it was clear the issue was the huge number of aggregation threads being triggered. We tried to find a way to limit the number of aggregation threads, but came across YARN-4697, which confirms the NM aggregation thread pool is not bounded and needs a code fix, and this cluster did not have that fix.
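If the cluster is on a release that already carries the YARN-4697 fix, the aggregation thread pool can be capped through configuration instead; this is only a pointer, and the exact property should be verified against that release's yarn-default.xml:

# Check the running Hadoop/YARN release first
yarn version
# Releases with the YARN-4697 fix bound the LogAggregationService thread pool via a
# configurable maximum; confirm the property name in yarn-default.xml before setting it.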


Workaround: As a workaround we disabled NodeManager recovery by setting the property below to false and restarted the NM. Once the NM restarted it came up cleanly and did not trigger log aggregation. Bingo! We had a good workaround to get out of the problem.

yarn.nodemanager.recovery.enabled
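A minimal sketch of applying the workaround, assuming a vanilla Apache Hadoop layout (the property goes into yarn-site.xml; restart commands and paths differ per distribution):

# yarn-site.xml entry to disable NM recovery:
#   <property>
#     <name>yarn.nodemanager.recovery.enabled</name>
#     <value>false</value>
#   </property>
# Then restart the NodeManager on the affected host
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager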

To be on the safe side we also cleaned up all the older user logs that were still present on the local filesystem and had a chance of being aggregated.
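A sketch of that cleanup, assuming the NM local log directory is /data/yarn/userlogs (the actual path comes from yarn.nodemanager.log-dirs on the cluster); do a dry listing before deleting anything:

# List per-application log directories older than 7 days under the NM local log dir
find /data/yarn/userlogs -maxdepth 1 -type d -name 'application_*' -mtime +7
# After reviewing the list, re-run the same find with -exec rm -rf {} + to remove them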

Note: There can be a scenario where it takes longer for jobs to start executing normally again, since the NM may still hold reservations for attempts that have already failed or exited, and we would need to wait for those reservations to expire. If a quick recovery is needed, RM recovery can be disabled temporarily.
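If that quicker path is taken, RM recovery is controlled by the standard property below (a sketch; remember to re-enable it afterwards):

# yarn-site.xml entry to temporarily disable RM recovery:
#   <property>
#     <name>yarn.resourcemanager.recovery.enabled</name>
#     <value>false</value>
#   </property>
# Restart the ResourceManager for the change to take effect
$HADOOP_HOME/sbin/yarn-daemon.sh stop resourcemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager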

After the cluster recovered, I wanted to find out why so many aggregation threads were getting triggered in the first place. Below is what we found, which explains why the system behaved the way it did.

Root Cause: The reason so many log aggregation threads were kicked off after the NM restart was that log aggregation for quite a few apps had been pending for a few days. The name container of the YARN aggregation volume had grown so large that it was equivalent to the size of the storage pool (SP) and had no space left to grow, so all aggregation attempts failed once the space on the SP was completely used up. Aggregation was retried once a new volume location for aggregation was specified and the NM was restarted; the NM then went through container recovery and triggered an aggregation attempt for every pending app.
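A quick check that would have surfaced this earlier is monitoring the space behind the remote aggregation location (a sketch, assuming the default yarn.nodemanager.remote-app-log-dir of /tmp/logs; this cluster used a dedicated volume):

# Free space of the filesystem/volume backing the aggregation directory
hadoop fs -df -h /tmp/logs
# Total size of aggregated logs accumulated so far
hadoop fs -du -s -h /tmp/logs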



  


