Monday, November 13, 2017

Fair Scheduler resource updates




Recently I had an interesting case where some resources in a Fair Scheduler queue were updated, yet the scheduler page didn't show the updated values. The main concerns were whether the application team would get the extra resources they were paying for and, if so, whether there was a bug in the scheduler UI.

1) On checking the RM logs it was clear the file was indeed being read, so the question was why the values were not updated.

2017-11-13 16:18:18,124 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:28,127 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:38,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:48,133 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:58,137 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:19:08,141 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml

2) I turned on scheduler debug logging to narrow down whether this was just a scheduler UI issue or whether the values were really not updated in the scheduler. Reviewing the debug logs showed that the scheduler only knew about the old values.

yarn daemonlog -setlevel <RM Hostname>:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler DEBUG

One step in the right direction: now we knew it was not the scheduler UI but something wrong with the scheduler itself.

3) Assuming the user was hitting some kind of bug when updating multi-level queues, I decided to repro the issue in-house, but everything worked fine and I could see all the updates on the scheduler page.

This pointed to an issue with the customer's environment or their fair-scheduler.xml file.

4) I loaded the customer's fair-scheduler.xml in my local repro to check whether the file was readable or had some kind of formatting issue. Unfortunately my logs also showed the file being read with no error reported, but the scheduler page didn't show the new queues.

5) Finally I restarted the RM, hoping it would read the file and display the queues in the scheduler.


Bingo! This time the RM failed to come up and logged the messages below, which gave me the clue to the root cause of the issue.

Caused by: java.io.IOException: Failed to initialize FairScheduler
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1441)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1458)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 7 more
Caused by: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: Bad fair scheduler config file: queue name (mapr.general) shouldn't contain period.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:437)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:516)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:516)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:355)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1439)


As the stack trace shows, someone updating fair-scheduler.xml had added a queue named mapr.general. A queue name can never contain a period, so the file failed to load and the scheduler was never updated with the new values.
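A check like this can be run against the file before it ever reaches the RM. Below is a minimal sketch, assuming the allocation file uses nested <queue name="..."> elements under a top-level <allocations> element. The function name find_bad_queue_names and the sample XML are hypothetical and only for illustration; the sketch mirrors the single rule from the exception above (no period in a queue name) and is not a full validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample for illustration; not the customer's actual file.
SAMPLE_ALLOCATIONS = """
<allocations>
  <queue name="root">
    <queue name="mapr">
      <queue name="mapr.general"><weight>1.0</weight></queue>
      <queue name="etl"/>
    </queue>
  </queue>
</allocations>
"""


def find_bad_queue_names(xml_text):
    """Return the paths of queues whose own name contains a period.

    Paths are joined with "/" so a period inside a name stays visible.
    """
    root = ET.fromstring(xml_text)
    bad = []

    def walk(element, parent_path):
        for child in element:
            # Only <queue> elements are inspected in this sketch.
            if child.tag != "queue":
                continue
            name = child.get("name", "")
            path = f"{parent_path}/{name}" if parent_path else name
            if "." in name:
                bad.append(path)
            # Queues can be nested, so recurse into child queues.
            walk(child, path)

    walk(root, "")
    return bad


if __name__ == "__main__":
    for path in find_bad_queue_names(SAMPLE_ALLOCATIONS):
        print("queue name contains a period:", path)
```

To check a real file, read it and pass its contents to find_bad_queue_names; an empty list means no queue name contains a period.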


Key takeaway:

After updating your configs, always validate that the updated queue resources show up. It is easier to catch an issue right after a change than to backtrack later with no clue what someone else has done.
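One way to do that validation without opening the scheduler page is to ask the RM's REST endpoint /ws/v1/cluster/scheduler which queues it knows about and compare them with the queues you expect. The sketch below makes some assumptions: that the response carries the queue tree under scheduler.schedulerInfo.rootQueue, and that each node has queueName and childQueues fields. Verify that shape against your Hadoop version. The function names and the rm-host:8088 address are placeholders.

```python
import json
import urllib.request


def collect_queue_names(queue_node):
    """Recursively gather queueName values from one queue node."""
    names = set()
    name = queue_node.get("queueName")
    if name:
        names.add(name)
    children = queue_node.get("childQueues", [])
    # Assumption: some versions wrap the list as {"queue": [...]}.
    if isinstance(children, dict):
        children = children.get("queue", [])
    # A single child may be serialized as an object instead of a list.
    if isinstance(children, dict):
        children = [children]
    for child in children:
        names |= collect_queue_names(child)
    return names


def missing_queues(expected, scheduler_json):
    """Return expected queue names the scheduler does not report."""
    # Assumed response layout; adjust if your RM returns something else.
    root = scheduler_json["scheduler"]["schedulerInfo"]["rootQueue"]
    return sorted(set(expected) - collect_queue_names(root))


def fetch_scheduler_json(rm_address):
    """Fetch scheduler info from the RM web address, e.g. "rm-host:8088"."""
    url = f"http://{rm_address}/ws/v1/cluster/scheduler"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

For example, missing_queues(["root.mapr.general"], fetch_scheduler_json("rm-host:8088")) returns the queues you expected but the scheduler does not report. A non-empty result right after a config change is the signal to look at the allocation file.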



