Monday, February 26, 2018

Installing K8S

Follow the process below to install Kubernetes (K8S) on all nodes of the cluster.

1) Install Docker and start it on all nodes in the cluster.

yum install -y docker
systemctl enable docker && systemctl start docker

2) Set up the yum repo for Kubernetes.

[root@tssperf10 ~]# cat /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
[root@tssperf10 ~]# 

Note: Turn off SELinux (switch it to permissive mode).

setenforce 0
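
setenforce 0 only changes the running mode; if you also want SELinux to stay permissive across reboots, updating /etc/selinux/config works as well (an extra step, not part of the original walkthrough above):

sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config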

3) Install the Kubernetes packages and enable/start the kubelet service.

yum install -y kubelet kubeadm kubectl

systemctl enable kubelet && systemctl start kubelet
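
If you want all nodes on the same Kubernetes version, the packages can also be pinned explicitly at install time (a sketch; the version shown is just the one this walkthrough ended up with, adjust as needed):

yum install -y kubelet-1.9.2 kubeadm-1.9.2 kubectl-1.9.2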

4) To avoid traffic being routed incorrectly because iptables is bypassed for bridged traffic, add the settings below to /etc/sysctl.d/k8s.conf and reload the system configuration from files.

[root@tssperf10 ~]# cat /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
[root@tssperf10 ~]#

sysctl --system
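
If the two keys above fail to apply, it is usually because the bridge netfilter module is not loaded yet; loading it and re-running sysctl takes care of that (an extra check, not part of the original steps):

modprobe br_netfilter
sysctl --system
sysctl net.bridge.bridge-nf-call-iptables   # should print 1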


5) Initialize the Kubernetes master on the first node with kubeadm init.


[root@tssperf09 ~]# kubeadm init
[init] Using Kubernetes version: v1.9.2
[init] Using Authorization modes: [Node RBAC]
[preflight] Running pre-flight checks.
[WARNING FileExisting-crictl]: crictl not found in system path
[certificates] Generated ca certificate and key.
[certificates] Generated apiserver certificate and key.
[certificates] apiserver serving cert is signed for DNS names [tssperf09.lab kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.10.72.249]
[certificates] Generated apiserver-kubelet-client certificate and key.
[certificates] Generated sa key and public key.
[certificates] Generated front-proxy-ca certificate and key.
[certificates] Generated front-proxy-client certificate and key.
[certificates] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "scheduler.conf"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
[init] Waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests".
[init] This might take a minute or longer if the control plane images have to be pulled.
[apiclient] All control plane components are healthy after 47.002868 seconds
[uploadconfig] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[markmaster] Will mark node tssperf09.lab as master by adding a label and a taint
[markmaster] Master tssperf09.lab tainted and labelled with key/value: node-role.kubernetes.io/master=""
[bootstraptoken] Using token: f663f7.9e62cd14e10df69f
[bootstraptoken] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstraptoken] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstraptoken] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstraptoken] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[addons] Applied essential addon: kube-dns
[addons] Applied essential addon: kube-proxy

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of machines by running the following on each node
as root:

  kubeadm join --token f663f7.9e62cd14e10df69f 10.10.72.249:6443 --discovery-token-ca-cert-hash sha256:9b7f89467f5fcb50a9007ba2f44fd6b2f4f9d25d62a97ff0902e81e9da894ba5

[root@tssperf09 ~]#

Note: For testing we will just run the command below to export the kubeconfig in the current shell (instead of copying admin.conf to $HOME/.kube as shown above).

 export KUBECONFIG=/etc/kubernetes/admin.conf


6) Install a pod network add-on so that Pods can communicate with each other. Here we use Calico.

[root@tssperf09 ~]# kubectl apply -f https://docs.projectcalico.org/v2.6/getting-started/kubernetes/installation/hosted/kubeadm/1.6/calico.yaml
configmap "calico-config" created
daemonset "calico-etcd" created
service "calico-etcd" created
daemonset "calico-node" created
deployment "calico-kube-controllers" created
deployment "calico-policy-controller" created
clusterrolebinding "calico-cni-plugin" created
clusterrole "calico-cni-plugin" created
serviceaccount "calico-cni-plugin" created
clusterrolebinding "calico-kube-controllers" created
clusterrole "calico-kube-controllers" created
serviceaccount "calico-kube-controllers" created
[root@tssperf09 ~]#

7) Now we can see the Kubernetes master is up.

[root@tssperf09 ~]# kubectl cluster-info
Kubernetes master is running at https://10.10.72.249:6443
KubeDNS is running at https://10.10.72.249:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
[root@tssperf09 ~]# kubectl get all
NAME             TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
svc/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   7m
[root@tssperf09 ~]# kubectl get nodes
NAME            STATUS    ROLES     AGE       VERSION
tssperf09.lab   Ready     master    7m        v1.9.2

[root@tssperf09 ~]# kubectl get pods --all-namespaces
NAMESPACE     NAME                                      READY     STATUS    RESTARTS   AGE
kube-system   calico-etcd-rvxmg                         1/1       Running   0          5m
kube-system   calico-kube-controllers-d554689d5-6j5b5   1/1       Running   0          5m
kube-system   calico-node-qrw75                         2/2       Running   0          5m
kube-system   etcd-tssperf09.lab                        1/1       Running   0          8m
kube-system   kube-apiserver-tssperf09.lab              1/1       Running   0          7m
kube-system   kube-controller-manager-tssperf09.lab     1/1       Running   0          8m
kube-system   kube-dns-6f4fd4bdf-jndmv                  3/3       Running   0          8m
kube-system   kube-proxy-2ztzj                          1/1       Running   0          8m
kube-system   kube-scheduler-tssperf09.lab              1/1       Running   0          7m
[root@tssperf09 ~]# kubectl taint nodes --all node-role.kubernetes.io/master-
node "tssperf09.lab" untainted
[root@tssperf09 ~]# 
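
The taint removal above lets regular workloads schedule on the master node as well, which is handy for a small lab cluster. If you later want a dedicated master again, the taint can be re-added (a rough sketch using the node name from this cluster):

kubectl taint nodes tssperf09.lab node-role.kubernetes.io/master=:NoSchedule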


8) Now join the other two nodes to the Kubernetes cluster by running kubeadm join on each of them.


i)  [root@tssperf10 ~]# kubeadm join --token f663f7.9e62cd14e10df69f 10.10.72.249:6443 --discovery-token-ca-cert-hash sha256:9b7f89467f5fcb50a9007ba2f44fd6b2f4f9d25d62a97ff0902e81e9da894ba5
[preflight] Running pre-flight checks.
[WARNING FileExisting-crictl]: crictl not found in system path
[discovery] Trying to connect to API Server "10.10.72.249:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.10.72.249:6443"
[discovery] Requesting info from "https://10.10.72.249:6443" again to validate TLS against the pinned public key
[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "10.10.72.249:6443"
[discovery] Successfully established connection with API Server "10.10.72.249:6443"

This node has joined the cluster:
* Certificate signing request was sent to master and a response
  was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the master to see this node join the cluster.

ii)  [root@tssperf11 ~]# kubeadm join --token f663f7.9e62cd14e10df69f 10.10.72.249:6443 --discovery-token-ca-cert-hash sha256:9b7f89467f5fcb50a9007ba2f44fd6b2f4f9d25d62a97ff0902e81e9da894ba5
[preflight] Running pre-flight checks.
[WARNING FileExisting-crictl]: crictl not found in system path
[discovery] Trying to connect to API Server "10.10.72.249:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.10.72.249:6443"
[discovery] Requesting info from "https://10.10.72.249:6443" again to validate TLS against the pinned public key
[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "10.10.72.249:6443"
[discovery] Successfully established connection with API Server "10.10.72.249:6443"

This node has joined the cluster:
* Certificate signing request was sent to master and a response
  was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the master to see this node join the cluster.
[root@tssperf11 ~]# 
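
In case the join token printed by kubeadm init is lost or has expired (by default tokens expire after 24 hours), a new one can be generated on the master, and the CA cert hash can be recomputed from the cluster CA; a sketch, not something we needed in this run:

kubeadm token list
kubeadm token create
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | \
  openssl dgst -sha256 -hex | sed 's/^.* //'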

9) The Kubernetes cluster is now up, with one master and two worker nodes.

[root@tssperf09 ~]# kubectl get nodes
NAME            STATUS    ROLES     AGE       VERSION
tssperf09.lab   Ready     master    17m       v1.9.2
tssperf10.lab   Ready     <none>    2m        v1.9.2
tssperf11.lab   Ready     <none>    1m        v1.9.2
[root@tssperf09 ~]# 
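
As a quick smoke test (purely illustrative, not part of the original setup), you can deploy something small and confirm that pods get scheduled onto the worker nodes:

kubectl run nginx-test --image=nginx --replicas=2
kubectl get pods -o wide
kubectl delete deployment nginx-test   # clean up afterwards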




NM blacklisting due to LogAggregationService

Recently I had a very interesting issue where a few NodeManagers (NMs), after a restart, would get blacklisted within 10 minutes, and the reasons were not the usual ones like tmp-space exhaustion or node issues leading to attempt failures.

I will walk through the troubleshooting procedure we followed, the workaround, and the root cause of why the cluster ran into this issue.

1) The error below was seen on the RM node. It indicated the NM was not heartbeating with the RM, hence its state was transitioning to LOST.


2018-02-25 16:06:31,727 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:NodeA:8091 Timed out after 600 secs
2018-02-25 16:06:31,727 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node NodeA:8091 as it is now LOST
2018-02-25 16:06:31,727 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: NodeA:8091 Node Transitioned from RUNNING to LOST

2) Checking the system stats, it was clear the NM was using all the CPU cores and putting a huge load on the system.

3) In the NM .out file I saw messages like the ones below, which suggested user lookups were not working well, NM threads were hitting OutOfMemoryError, and the JVM was seeing GC pauses. Although this pointed to an infra issue where user lookups were not being served fast enough, it was just a distraction and made the problem harder to debug.



2018-02-25 16:09:04,1783 ERROR Client fs/client/fileclient/cc/client.cc:787 Thread: 29949 GetUid failed, could not get userinfo for user UserA, error 0

2018-02-25 16:09:04,1802 ERROR Client fs/client/fileclient/cc/client.cc:787 Thread: 29949 GetUid failed, could not get userinfo for user UserB, error 0

Halting due to Out Of Memory Error...
Halting due to Out Of Memory Error...
Halting due to Out Of Memory Error...
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "LogAggregationService #754"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "LogAggregationService #12498"

We noted the NM was running with a 4 GB Xmx; an easy workaround would have been to increase Xmx to, say, 12 or 16 GB and see if that took care of the problem.
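
For reference, a rough sketch of what bumping the NM heap would look like (the exact file and variable depend on the Hadoop distribution, and the value here is only illustrative), followed by an NM restart:

# in yarn-env.sh on the NodeManager hosts
export YARN_NODEMANAGER_HEAPSIZE=12288   # MB, roughly -Xmx12g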

4) Before applying any workaround I wanted to check which internal threads were being kicked off and causing the huge load on the system (almost all available CPUs were in use), so I took a jstack of the NM process and saw a huge number of LogAggregationService threads being spawned.
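
For completeness, a sketch of how the thread dump was collected (the PID lookup shown is illustrative; run jstack as the same user that owns the NM process):

NM_PID=$(pgrep -f org.apache.hadoop.yarn.server.nodemanager.NodeManager)
jstack -l $NM_PID > abizer.txt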


[root@ip-10-165-137-99 Jstack]# grep LogAggregationService abizer.txt | grep prio | head -2
"LogAggregationService #19661" #19812 prio=5 os_prio=0 tid=0x000000000e627000 nid=0x1438 waiting for monitor entry [0x00007f2520652000]
"LogAggregationService #19660" #19811 prio=5 os_prio=0 tid=0x000000000e623800 nid=0x1437 waiting for monitor entry [0x00007f2520753000]

[root@ip-10-165-137-99 Jstack]# grep LogAggregationService abizer.txt | grep prio | wc -l
38397
[root@ip-10-165-137-99 Jstack]#


At this point it was clear the issue was the huge number of aggregation threads being triggered. We tried to find a way to limit the number of aggregation threads but came across YARN-4697, which confirms the NM log-aggregation thread pool is not bounded and needs a code fix; this cluster did not have that fix.


Workaround: As a workaround we disabled NodeManager recovery by setting the property below to false and restarted the NM. Once restarted, the NM came up and did not trigger log aggregation. Bingo! We had a good workaround to get out of the problem.

yarn.nodemanager.recovery.enabled
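
In yarn-site.xml terms that is roughly the following (a sketch; set it back to true once the backlog is cleared if you want work-preserving NM recovery again):

<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>false</value>
</property>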

To be safe, we also cleaned up all the older user logs that were still present on the local filesystem and could still be picked up for aggregation.
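
A rough sketch of that cleanup (the log directory must be taken from yarn.nodemanager.log-dirs on your cluster; the path placeholder and the 7-day cutoff are only illustrative):

# <nm-log-dir> = value of yarn.nodemanager.log-dirs
find <nm-log-dir> -maxdepth 1 -type d -name 'application_*' -mtime +7 -exec rm -rf {} +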

Note: There can be a scenario where jobs take longer to start executing normally again, since the nodes may still have reservations from jobs that have already failed or exited, so we would need to wait for those reservations to expire. If quick recovery is needed, RM recovery can be disabled temporarily.

After the cluster recovered, I wanted to find the real reason why so many aggregation threads were being triggered in the first place. Below is what we found, which explained why the system behaved the way it did.

Root Cause: The reason so many log aggregations kicked in after the NM restart was that log aggregation for quite a few apps had been pending for days. The name container of the yarn aggregation volume had grown very large, to roughly the size of its SP (storage pool), and had no space left to grow, so all aggregation attempts failed once the space on the SP was completely used up. Aggregation was retried once a new volume location for aggregation was specified and the NM was restarted, which recovered the containers and then triggered the aggregation attempts.



  



Tuesday, February 13, 2018

Analyze broken pipe error in Hive client via tcp dump

I recently had a strange issue where a Hive job would randomly fail for a specific source data pipeline with a broken pipe exception. I didn't see any exception in the Metastore logs to indicate why the connection was broken, nor any sign of system resources being an issue.

So to investigate the issue I took TCP dumps on the client node and the Metastore node.

client node : tcpdump -i <any or eth0> host <HS meta IP> and port <HS meta Port> -w client.pcap
metastore node :  tcpdump -i <any or eth0> host <Client IP>  -w meta.pcap
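
To spot the teardown quickly without scrolling through the whole capture, the pcap can also be filtered for FIN/RST packets on the command line (a sketch using tshark, the terminal counterpart of Wireshark; the fields are standard display-filter names):

tshark -r client.pcap -Y 'tcp.flags.fin == 1 or tcp.flags.reset == 1'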

Viewing the pcap files in Wireshark shows the connection was terminated gracefully, and a new connection was only created about 7 minutes later.




From the first 3 lines we can see the client initiated a FIN to terminate the session with the server on port 11000, the server accepted the graceful termination, and the client then acknowledged the closing of the connection.


  • Host B (server) sends a data packet to host A (client),
  • and then host A wants to close the connection.
  • Host A (depending on timing) can respond with [FIN,ACK], indicating that it received the sent packet and wants to close the session.
  • Host B should then respond with a [FIN,ACK], indicating that it received the termination request (the ACK part) and that it too will close the connection (the FIN part).

  • Note:

    [ACK] is the acknowledgement that the previously sent data packet was received.
    [FIN] is sent by a host when it wants to terminate the connection; the TCP protocol requires both endpoints to send a FIN.

    [PSH] marks the packet as a PUSH packet. In the normal flow the receiver does not acknowledge each packet immediately after receiving it; it keeps the received data in a buffer for some time before handing it to the application. A PUSH packet tells the receiver to give the data to the application immediately and then acknowledge.

    [RST] resets the connection. One typical example of sending an RST packet is in response to a packet received for a closed socket.

    [SYN] starts the connection and synchronizes the sequence numbers. Only the first packet from each end has this flag set.

    Interestingly, from the client logs the broken pipe exception was seen at 15:04, which means the client code tried to get info from the metastore after the connection between the client and the metastore had already been closed.

    FIX: The client code was updated to check for an active connection before any calls were made to the metastore.


    Note: Just to understand the TCP 3-way handshake, the new connection initiated in the earlier screenshot at "2018-02-08 15:07:57" shows the handshake.

    (#1) The client sends a packet with the SYN flag set and a random number (R1) in the sequence number field.
    (#2) The server sends a packet with both the SYN and ACK flags set. The sequence number field contains a new random number (R2), and the acknowledgement number field contains the client's sequence number + 1 (R1+1), which is the next sequence number the server expects from the client.

    (#3) The client acknowledges the server's SYN packet by sending a packet with the ACK flag set and the acknowledgement number field set to R2+1, which is the next sequence number the client expects from the server.
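
    To isolate just the handshake packets for each new connection in the capture, filtering on the SYN flag shows the (#1) and (#2) packets described above (another small tshark sketch):

    tshark -r client.pcap -Y 'tcp.flags.syn == 1'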