Tuesday, March 20, 2018

Turning on FileClient Debug dynamically

    As explained in the earlier blogs below, enabling FileClient (FC) debug logging can be very helpful when debugging issues, since it shows what calls the client is making to the cluster in cases where there are timeouts or retry errors but no obvious issue on the cluster side. While those blogs explain how to turn on FC debug when the process starts up, it can be priceless if FC debug can be turned on dynamically.

http://abizeradenwala.blogspot.com/2016/11/interacting-with-hadoop-cluster-file.html
http://abizeradenwala.blogspot.com/2016/11/file-client-debug-logging.html

Let's say we need to get FC debug logs for the NodeManager (NM) process.

1) Get the NM PID.

[root@node107rhel72 logs]# jps | grep 12531
12531 NodeManager

[root@node107rhel72 logs]# 
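
If the PID is not already known, it can be looked up by process name first (a hedged alternative to the above; it assumes jps is on the PATH, with pgrep as a fallback):

jps | grep -i nodemanager
# or, if no JDK tools are available on the node:
pgrep -f nodemanager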

2) Run the commands below to list details of the shared memory segments. Cross-referencing the segments created by PID 12531 with the 1234-byte segments, we know the shmid of interest is 1006731277.

[root@node107rhel72 logs]# ipcs -mp | grep 12531

------ Shared Memory Creator/Last-op PIDs --------
shmid      owner      cpid       lpid    
1006731277 mapr       12531      17541     
1006764046 mapr       12531      12531     
1006796815 mapr       12531      12531     

[root@node107rhel72 logs]# ipcs -m | egrep "bytes|1234"

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status 
0x00000000 0          mapr       644        1234       0                       
0x00000000 1002995714 mapr       644        1234       1          dest               
0x00000000 1006731277 mapr       644        1234       1          dest         

[root@node107rhel72 logs]# 


3) Now that we know the shmid, we can enable FC DEBUG via the command below.

/opt/mapr/server/tools/fcdebug -s 1006731277 -l DEBUG


4) Verify that debug messages are being logged.

i) The stdout of the process will log messages like the ones below.

[root@node107rhel72 logs]# tailf yarn-mapr-nodemanager-node107rhel72.out
2018-03-20 19:48:42,1632 DEBUG Client fs/client/fileclient/cc/client.cc:3686 Thread: 19570 atime in millisec =  1520406282000
2018-03-20 19:48:42,1632 DEBUG Client fs/client/fileclient/cc/client.cc:3703 Thread: 19570 nlink - 1
2018-03-20 19:48:42,1632 DEBUG Client fs/client/fileclient/cc/client.cc:3526 Thread: 19570 PathWalk: Adding /var/mapr/local/node107rhel72/mapred/nodeManager/fidservers Fid 2113.38.1969838 to Fidcache
2018-03-20 19:48:42,1632 DEBUG Client fs/client/fileclient/cc/client.cc:3530 Thread: 19570 PathWalk: WalkDone File /var/mapr/local/node107rhel72/mapred/nodeManager/fidservers, resp fid 2113.38.1969838
2018-03-20 19:48:42,1632 DEBUG JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:1147 Thread: 19570  -- Exit JNI getattr -- /var/mapr/local/node107rhel72/mapred/nodeManager/fidservers

                            or

ii) You can also verify that debug is turned on for the modules in question.

[root@node107rhel72 logs]# /opt/mapr/server/tools/fcdebug -i -s 1006731277
              Global :      DEBUG 
                 RPC :      DEBUG 
       MapServerFile :      DEBUG 
        MapServerDir :      DEBUG 
           Container :      DEBUG 
            Snapshot :      DEBUG 
                Util :      DEBUG 
         Replication :      DEBUG 
           PunchHole :      DEBUG 
             KvStore :      DEBUG 
            Truncate :      DEBUG 
           Orphanage :      DEBUG 
          FileServer :      DEBUG 
           Heartbeat :      DEBUG 
               Defer :      DEBUG 
       ServerCommand :      DEBUG 
               Write :      DEBUG 
                  DB :      DEBUG 
               DBRpc :      DEBUG 
         HighLatency :      DEBUG 
             DBLocks :      DEBUG 
          DBMemIndex :      DEBUG 
        DBPermission :      DEBUG 
              DBRepl :      DEBUG 
            AceCache :      DEBUG 
              Marlin :      DEBUG 
            MarlinLG :      DEBUG 
           LogStream :      DEBUG 
          FileAccess :      DEBUG 
           FSProfile :      DEBUG 
                NFSD :      DEBUG 
            Cidcache :      DEBUG 
              Client :      DEBUG 
            Fidcache :      DEBUG

5) Turn off FC debug dynamically after collecting all the logs of interest.

[root@node107rhel72 logs]# /opt/mapr/server/tools/fcdebug -s 1006731277
Applying default level on all modules 
[root@node107rhel72 logs]#



Note: If finding the right shmid is a challenge, you can run the command below against the PID of interest to set DEBUG on every shared memory segment it created; debug logging will be enabled on the fly within a minute.

[root@node107rhel72 logs]# for shm in `ipcs -mp | grep -w 12531 | cut -f 1 -d " "`; do echo $shm ;/opt/mapr/server/tools/fcdebug  -s $shm -l DEBUG; done
1006731277
1006764046
1006796815
[root@node107rhel72 logs]# 
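
Once the logs of interest have been collected, the same loop without the -l flag resets every segment back to the default level, exactly as in step 5 (a sketch of the reverse operation):

for shm in `ipcs -mp | grep -w 12531 | cut -f 1 -d " "`; do echo $shm; /opt/mapr/server/tools/fcdebug -s $shm; done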


Thursday, March 15, 2018

Tear down K8S cluster


1) List all the nodes in the K8S cluster.

[root@tssperf09 ~]# kubectl get nodes
NAME            STATUS    ROLES     AGE       VERSION
tssperf09.lab   Ready     master    44d       v1.9.2
tssperf10.lab   Ready     <none>    44d       v1.9.2
tssperf11.lab   Ready     <none>    44d       v1.9.2

[root@tssperf09 ~]#

2) Drain each node and make sure it is empty before shutting it down.

i) First drain all the non-master nodes.

[root@tssperf09 ~]# kubectl drain tssperf10.lab --delete-local-data --force --ignore-daemonsets
node "tssperf10.lab" cordoned
WARNING: Ignoring DaemonSet-managed pods: calico-node-k8wcz, kube-proxy-87mt7, mapr-volplugin-pk5q8; Deleting pods with local storage: mapr-volplugin-pk5q8
pod "calico-kube-controllers-d554689d5-4456k" evicted
node "tssperf10.lab" drained
[root@tssperf09 ~]#
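
On a cluster with many workers, the same drain can be looped over every non-master node (a minimal sketch based on the "kubectl get nodes" output above; verify the ROLES column position for your kubectl version before relying on it):

for n in $(kubectl get nodes --no-headers | awk '$3 != "master" {print $1}'); do kubectl drain "$n" --delete-local-data --force --ignore-daemonsets; done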

ii) Now drain the master node itself.

[root@tssperf09 ~]# kubectl drain tssperf09.lab --delete-local-data --force --ignore-daemonsets
node "tssperf09.lab" cordoned
WARNING: Ignoring DaemonSet-managed pods: calico-etcd-rvxmg, calico-node-qrw75, kube-proxy-2ztzj, mapr-volplugin-rcsz9; Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: etcd-tssperf09.lab, kube-apiserver-tssperf09.lab, kube-controller-manager-tssperf09.lab, kube-scheduler-tssperf09.lab, maprfs-volume-example; Deleting pods with local storage: mapr-volplugin-rcsz9
pod "calico-kube-controllers-d554689d5-6j5b5" evicted
pod "maprfs-volume-example" evicted
pod "kube-dns-6f4fd4bdf-jndmv" evicted
node "tssperf09.lab" drained
[root@tssperf09 ~]#

3) Now we should see the nodes have scheduling disabled, so new pods shouldn't land on them.

[root@tssperf09 ~]# kubectl get nodes
NAME            STATUS                     ROLES     AGE       VERSION
tssperf09.lab   Ready,SchedulingDisabled   master    44d       v1.9.2
tssperf10.lab   Ready,SchedulingDisabled   <none>    44d       v1.9.2
tssperf11.lab   Ready,SchedulingDisabled   <none>    44d       v1.9.2

4) Finally, delete the slave (non-master) nodes, followed by the master node.

[root@tssperf09 ~]# kubectl delete node tssperf10.lab
node "tssperf10.lab" deleted

[root@tssperf09 ~]# kubectl get nodes
NAME            STATUS                     ROLES     AGE       VERSION
tssperf09.lab   Ready,SchedulingDisabled   master    44d       v1.9.2
tssperf11.lab   Ready,SchedulingDisabled   <none>    44d       v1.9.2

[root@tssperf09 ~]# kubectl delete node tssperf11.lab
node "tssperf11.lab" deleted
[root@tssperf09 ~]# kubectl delete node tssperf09.lab
node "tssperf09.lab" deleted

[root@tssperf09 ~]# kubectl get nodes
No resources found.
[root@tssperf09 ~]# 
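
If the whole cluster is being torn down anyway, the per-node deletes above can be collapsed into a single bulk delete (a hedged shortcut with the same end state):

kubectl delete nodes --all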

5) Finally, reset all kubeadm-installed state on every node.

[root@tssperf09 ~]# kubeadm reset
[preflight] Running pre-flight checks.
[reset] Stopping the kubelet service.
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Removing kubernetes-managed containers.
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes /var/lib/etcd]
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
[root@tssperf09 ~]# 
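
Since the reset has to run on every node, one convenient way (a sketch, assuming passwordless root ssh to each host) is to loop over the hosts listed in step 1:

for host in tssperf10.lab tssperf11.lab tssperf09.lab; do ssh root@"$host" kubeadm reset; done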


Tuesday, March 6, 2018

CLDB failing over with NoSuchMethod



Had an interesting issue where CLDB would fail over randomly with the error below.

2018-03-05 13:03:27,571 FATAL CLDB [RPC-85]: CLDBShutdown: CldbError
2018-03-05 13:03:27,571 FATAL CLDB [RPC-85]: CLDB Exception
java.lang.NoSuchMethodError: com.mapr.fs.cldb.topology.StoragePool.getInTransitContainersCount()I
        at com.mapr.fs.cldb.CLDBServer.dumpInfo(CLDBServer.java:6953)
        at com.mapr.fs.cldb.CLDBServer.processRpc(CLDBServer.java:4537)
        at com.mapr.fs.cldb.CLDBServer.requestArrived(CLDBServer.java:3266)
        at com.mapr.fs.Rpc$RpcExecutor.run(Rpc.java:160)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)


1) From looking at the stack trace the issue was clear: a method was not found in the class during some dumpInfo call. After reviewing the maprcli logs, the problem was triggered by running the "maprcli dump balancerinfo" command. So the first workaround was not to run this command, either in scripts or manually, until the issue was fixed.

2) Now we want to find which jar is supposed to contain the class and method in question (StoragePool.getInTransitContainersCount). Checking on my local test machine at the same patch level, we could narrow it down to cldb-5.2.2-mapr.jar as having the required class and method.


i)  [root@node107rhel72 logs]# grep -iR StoragePool /opt/mapr/lib/*.jar
Binary file /opt/mapr/lib/cldb-5.2.2-mapr.jar matches
Binary file /opt/mapr/lib/maprfs-5.2.2-mapr.jar matches

ii)  [root@node107rhel72 logs]# unzip -t /opt/mapr/lib/cldb-5.2.2-mapr.jar | grep -i StoragePool
    testing: com/mapr/fs/cldb/StoragePoolHandler.class   OK
    testing: com/mapr/fs/cldb/TrackContainerInMemory$PerStoragePoolInfo.class   OK
    testing: com/mapr/fs/cldb/processors/StoragePoolOfflineTask.class   OK
    testing: com/mapr/fs/cldb/processors/StoragePoolOfflineWorker.class   OK
    testing: com/mapr/fs/cldb/table/Table$StoragePoolPropertiesEntry.class   OK
    testing: com/mapr/fs/cldb/topology/LoadTracker$StoragePoolBinsCount.class   OK
    testing: com/mapr/fs/cldb/topology/StoragePool.class   OK

iii)  [root@node107rhel72 logs]# javap -classpath /opt/mapr/lib/cldb-5.2.2-mapr.jar com.mapr.fs.cldb.topology.StoragePool | grep getInTransitContainersCount
  public int getInTransitContainersCount();
[root@node107rhel72 logs]# 


3) Now I wanted to make sure all the CLDB nodes had the same version of the jar containing the needed method, which was the case since the md5sum matched on all nodes.

[root@node107rhel72 ~]# md5sum /opt/mapr/lib/cldb-5.2.2-mapr.jar
e74ccc9340eaaee077b87dec76ffe3c9  /opt/mapr/lib/cldb-5.2.2-mapr.jar
[root@node107rhel72 ~]# 
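
The same check can be run from one node against all CLDB nodes in one go (a sketch only; the hostnames below are placeholders and passwordless root ssh is assumed):

# compare the CLDB jar checksum on every CLDB node (hostnames are hypothetical)
for host in cldb-node1 cldb-node2 cldb-node3; do echo "== $host =="; ssh root@"$host" md5sum /opt/mapr/lib/cldb-5.2.2-mapr.jar; done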


4) So the question was: if this jar was present (and I had checked for older duplicate "cldb-*.jar" jars, which were not present), why was CLDB not able to find the method? We suspected the class picked up by the JVM came from another jar that was missing the method.

i) To narrow this down in the customer environment, we created a file listing all the jar paths under "/opt/mapr/lib/".

$ ls -l /opt/mapr/lib/* | awk '{print $NF}' | grep ".jar" > /tmp/jar.txt

ii) Then we wanted to check which other jars contain a StoragePool class.

$ cat /tmp/jar.txt | while read line; do unzip -t $line | grep -Hsi "StoragePool"&& echo "$line" ; done

(standard input):    testing: com/mapr/fs/proto/Common$StoragePoolInfoOrBuilder.class   OK
(standard input):    testing: com/mapr/fs/proto/Common$StoragePoolInfo$1.class   OK
(standard input):    testing: com/mapr/fs/proto/Common$StoragePoolInfo$Builder.class   OK
(standard input):    testing: com/mapr/fs/proto/Common$StoragePoolInfo.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$StoragePoolProperties$1.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$StoragePoolProperties$Builder.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$StoragePoolProperties.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$FSMasterForContainerOnStoragePoolOrBuilder.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$FSMasterForContainerOnStoragePool$1.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$FSMasterForContainerOnStoragePool$Builder.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$FSMasterForContainerOnStoragePool.class   OK
(standard input):    testing: com/mapr/fs/cldb/proto/CLDBProto$ActiveStoragePool.class   OK
/opt/mapr/lib/maprfs-5.2.2-mapr.jar

(standard input):    testing: com/mapr/fs/cldb/StoragePoolHandler.class   OK
(standard input):    testing: com/mapr/fs/cldb/processors/StoragePoolOfflineTask.class   OK
(standard input):    testing: com/mapr/fs/cldb/processors/StoragePoolOfflineWorker.class   OK
(standard input):    testing: com/mapr/fs/cldb/table/Table$StoragePoolPropertiesEntry.class   OK
(standard input):    testing: com/mapr/fs/cldb/topology/StoragePool.class   OK
/opt/mapr/lib/cldb-5.2.2-mapr.jar

(standard input):    testing: com/mapr/fs/cldb/topology/StoragePool.class   OK
/opt/mapr/lib/case00055209_debug_180_v3.jar
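
A slightly tighter variant of the same scan (a hedged sketch, not the command that was actually run above) reports only the jars that contain the exact class file in question:

while read -r jar; do unzip -l "$jar" 2>/dev/null | grep -q "com/mapr/fs/cldb/topology/StoragePool.class" && echo "$jar"; done < /tmp/jar.txt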


Bingo! We found the culprit to be an old debug jar, "/opt/mapr/lib/case00055209_debug_180_v3.jar", that was still in place. It contains the same class, but from an older version, and was built a while ago for troubleshooting another issue.


The fix was simple: remove the culprit jar from that location and restart the CLDBs so this debug jar is no longer picked up.



Alternative DEBUG method (not preferred, since it needs a restart of the service):

Another method to debug this issue is to enable verbose class loading and check which jar "com.mapr.fs.cldb.topology.StoragePool" is loaded from. I took the debug jar, put it in my test environment, and could reproduce the same issue. The steps below confirm the findings from the earlier part of the blog.

1) Add the verbose-class option so the JVM logs which jar every class is loaded from.

[root@node107rhel72 logs]# grep verbose /opt/mapr/initscripts/mapr-cldb 
CLDB_OPTS="${CLDB_OPTS} -verbose:class -Dmapr.library.flatclass -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=${CLDB_JMXREMOTE_PORT} -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false "
[root@node107rhel72 logs]# 

2) Now run the command below; the dump balancerinfo call hits the missing method again and CLDB bounces.

[root@node107rhel72 logs]# maprcli dump balancerinfo ;tailf -10 cldb.log 
ERROR (10009) -  Couldn't connect to the CLDB service. dump balancerinfo RPC failed
2018-03-06 23:58:47,232 ERROR CLDB [RPC-6]: java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2018-03-06 23:58:47,232 ERROR CLDB [RPC-6]: java.lang.Thread.run(Thread.java:745)
2018-03-06 23:58:47,232 INFO CLDBServer [RPC-6]: Shutdown: Stopping CLDB

After it restarts, it begins loading classes from the jars, and every class-loading event is logged. The log lines below clearly show which jar each class is loaded from.

[Loaded com.mapr.fs.cldb.counters.FileServerMetrics from file:/opt/mapr/lib/cldb-5.2.2-mapr.jar]
[Loaded com.mapr.fs.cldb.topology.Node from file:/opt/mapr/lib/cldb-5.2.2-mapr.jar]
[Loaded com.mapr.fs.cldb.topology.StoragePool from file:/opt/mapr/lib/case00055209_debug_180_v3.jar]
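
To spot the offending load quickly, grep the class name in the CLDB stdout (a hedged one-liner; it assumes the -verbose:class output is captured in /opt/mapr/logs/cldb.out, so adjust the path if stdout is captured elsewhere on your install):

grep "topology.StoragePool from" /opt/mapr/logs/cldb.out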

Again, the fix is to remove the culprit jar from that location and restart the CLDBs so the debug jar is no longer picked up and the class is loaded from the CLDB jar instead.