Tuesday, March 6, 2018

CLDB failing over with NoSuchMethod

                  CLDB failing over with NoSuchMethod


Had an interesting issue where CLDB would fail over randomly with below error .

2018-03-05 13:03:27,571 FATAL CLDB [RPC-85]: CLDBShutdown: CldbError
2018-03-05 13:03:27,571 FATAL CLDB [RPC-85]: CLDB Exception
java.lang.NoSuchMethodError: com.mapr.fs.cldb.topology.StoragePool.getInTransitContainersCount()I
        at com.mapr.fs.cldb.CLDBServer.dumpInfo(CLDBServer.java:6953)
        at com.mapr.fs.cldb.CLDBServer.processRpc(CLDBServer.java:4537)
        at com.mapr.fs.cldb.CLDBServer.requestArrived(CLDBServer.java:3266)
        at com.mapr.fs.Rpc$RpcExecutor.run(Rpc.java:160)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)


1) From looking at the stack issue was clear i.e method was not found in the class for some kind of dumpinfo command. After reviewing maprcli logs , problem was triggered when running " maprcli dump balancerinfo " command.  So first workaround was not to run this command in any script or manually till the issue is fixed .

2) Now we want to find which jar is suppose to have the class and method in question   (StoragePool.getInTransitContainersCount) .  U checked on my local test machine at same patch level , we could narrow down to cldb-5.2.2-mapr.jar having the required class and method .


i)  [root@node107rhel72 logs]# grep -iR StoragePool /opt/mapr/lib/*.jar
Binary file /opt/mapr/lib/cldb-5.2.2-mapr.jar matches
Binary file /opt/mapr/lib/maprfs-5.2.2-mapr.jar matches

ii)  [root@node107rhel72 logs]# unzip -t /opt/mapr/lib/cldb-5.2.2-mapr.jar | grep -i StoragePool
    testing: com/mapr/fs/cldb/StoragePoolHandler.class   OK
    testing: com/mapr/fs/cldb/TrackContainerInMemory$PerStoragePoolInfo.class   OK
    testing: com/mapr/fs/cldb/processors/StoragePoolOfflineTask.class   OK
    testing: com/mapr/fs/cldb/processors/StoragePoolOfflineWorker.class   OK
    testing: com/mapr/fs/cldb/table/Table$StoragePoolPropertiesEntry.class   OK
    testing: com/mapr/fs/cldb/topology/LoadTracker$StoragePoolBinsCount.class   OK
    testing: com/mapr/fs/cldb/topology/StoragePool.class   OK

iii)  [root@node107rhel72 logs]# javap -classpath /opt/mapr/lib/cldb-5.2.2-mapr.jar com.mapr.fs.cldb.topology.StoragePool | grep getInTransitContainersCount
  public int getInTransitContainersCount();
[root@node107rhel72 logs]# 


3) Now I wanted to make sure all the CLDB nodes have the same version of jar which has the method needed which was the case since md5sum matched on all nodes .

[root@node107rhel72 ~]# md5sum /opt/mapr/lib/cldb-5.2.2-mapr.jar
e74ccc9340eaaee077b87dec76ffe3c9  /opt/mapr/lib/cldb-5.2.2-mapr.jar
[root@node107rhel72 ~]# 


4) So the question was if this jar was there and i checked for any older duplicate jars "cldb-*.jar" which were not present why was CLDB not able to get the method . We suspected the class picked up by JVM is from another jar and that is missing the method .

i) To narrow this down in customer env we created a file with all the jar path under "/opt/mapr/lib/".

$ls -l /opt/mapr/lib/* | awk '{print $NF}' | grep ".jar" > /tmp/jar.txt 

ii) Then we wanted to check which other jar's have StoragePool Class .

$ cat /tmp/jar.txt | while read line; do unzip -t $line | grep -Hsi "StoragePool"&& echo "$line" ; done

(standard input):    testing: com/mapr/fs/proto/Common$StoragePoolInfoOrBuilder.class   OK
(standard input):    testing: com/mapr/fs/proto/Common$StoragePoolInfo$1.class   OK
(standard input):    testing: com/mapr/fs/proto/Common$StoragePoolInfo$Builder.class   OK
(standard input):    testing: com/mapr/fs/proto/Common$StoragePoolInfo.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$StoragePoolProperties$1.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$StoragePoolProperties$Builder.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$StoragePoolProperties.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$FSMasterForContainerOnStoragePoolOrBuilder.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$FSMasterForContainerOnStoragePool$1.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$FSMasterForContainerOnStoragePool$Builder.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$FSMasterForContainerOnStoragePool.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$ActiveStoragePool.class   OK
/opt/mapr/lib/maprfs-5.2.2-mapr.jar

(standard input):    testing: com/mapr/fs/cldb/StoragePoolHandler.class   OK
(standard input):    testing: com/mapr/fs/cldb/processors/StoragePoolOfflineTask.class   OK
(standard input):    testing: com/mapr/fs/cldb/processors/StoragePoolOfflineWorker.class   OK
(standard input):    testing: com/mapr/fs/cldb/table/Table$StoragePoolPropertiesEntry.class   OK
(standard input):    testing: com/mapr/fs/cldb/topology/StoragePool.class   OK
/opt/mapr/lib/cldb-5.2.2-mapr.jar

(standard input):    testing: com/mapr/fs/cldb/topology/StoragePool.class   OK
/opt/mapr/lib/case00055209_debug_180_v3.jar


Bingo ! we found the culprit jar to be old Debug jar which has the same class but from older version which was built for troubleshooting issue while ago "/opt/mapr/lib/case00055209_debug_180_v3.jar" which was in place 


Fix was simple to remove the culprit jar from the location and restart CLDB's to not pick up this debug Jar any more.



Alternative DEBUG Method (Not preferred since it needs restart of service) :

Another method to debug this issue was to enable Verbose Class loading and check which jar "com.mapr.fs.cldb.topology.StoragePool" is getting loaded from . I took the debug jar and put in my test env and i could repo the same issue . Below steps would confirm findings from earlier part of the blog.

1) Added verbose class options to load all the details on which jars the class are loading from .

[root@node107rhel72 logs]# grep verbose /opt/mapr/initscripts/mapr-cldb 
CLDB_OPTS="${CLDB_OPTS} -verbose:class -Dmapr.library.flatclass -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=${CLDB_JMXREMOTE_PORT} -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false "
[root@node107rhel72 logs]# 

2) Now run below command for CLDB to bounce .

[root@node107rhel72 logs]# maprcli dump balancerinfo ;tailf -10 cldb.log 
ERROR (10009) -  Couldn't connect to the CLDB service. dump balancerinfo RPC failed
2018-03-06 23:58:47,232 ERROR CLDB [RPC-6]: java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2018-03-06 23:58:47,232 ERROR CLDB [RPC-6]: java.lang.Thread.run(Thread.java:745)
2018-03-06 23:58:47,232 INFO CLDBServer [RPC-6]: Shutdown: Stopping CLDB

After it restarts it will start loading class from the jars and every loading activity will be logged . Below log lines clearly shows which jar the class is loading from.

[Loaded com.mapr.fs.cldb.counters.FileServerMetrics from file:/opt/mapr/lib/cldb-5.2.2-mapr.jar]
[Loaded com.mapr.fs.cldb.topology.Node from file:/opt/mapr/lib/cldb-5.2.2-mapr.jar]
[Loaded com.mapr.fs.cldb.topology.StoragePool from file:/opt/mapr/lib/case00055209_debug_180_v3.jar]

Fix is to remove the culprit jar from the location and restart CLDB's to not pick up this debug Jar any more for class to be loaded from CLDB jar instead .




No comments:

Post a Comment