Saturday, December 31, 2016

Identify Network Latency

Sometimes file client processes do not get timely responses from MFS processes. The problem could be in the MFS server process itself, i.e. it is busy handling many requests and churning CPU, or the delay could be coming from network latency. For this post, let's assume there is no problem in MFS and that MFS responds to each request as soon as it lands in the MFS work queue.

First, do a quick RPC test to check for any obvious network issue where the connection between some nodes is simply failing. If the test in the blog post below comes out clean, continue reading.

http://abizeradenwala.blogspot.com/2016/11/quick-network-test-for-mapr-cluster.html

To troubleshoot network latency, one useful check is the send queue size of open TCP connections (the Send-Q value, i.e. the third column in the output of "netstat -pan"). On a normally operating network, where the source and destination machines have free memory/CPU and there is no bottleneck on their network interfaces, the send queue should hold no more than a few thousand bytes. If you see a lot of connections with 10K+ bytes in the send queue, that generally indicates some sort of problem.
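As a minimal sketch of that check on a single node (assuming the Linux net-tools netstat, where Send-Q is the third column; the 10000-byte threshold is just an illustrative cutoff):

    # Show TCP connections whose send queue (Send-Q, column 3) exceeds ~10 KB.
    # Prints Send-Q, local address, foreign address, and owning PID/program.
    netstat -pan | awk '$1 ~ /^tcp/ && $3 > 10000 {print $3, $4, $5, $7}'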

First check the send queue sizes for all open connections on all nodes; from that you can tell whether there is a cluster-wide network issue, a node-specific network issue, or no network issue at all. One way to collect this cluster-wide is sketched below, followed by the three patterns to look for (with a classification sketch after the list).
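A small loop like this can gather the cluster-wide view (a sketch only: nodes.txt, passwordless ssh, and the sendq.txt output file are assumptions for illustration, not part of the original post):

    # Collect connections with large send queues from every node in nodes.txt.
    while read node; do
      echo "== $node =="
      ssh "$node" "netstat -pan 2>/dev/null | awk '\$1 ~ /^tcp/ && \$3 > 10000'"
    done < nodes.txt > sendq.txt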

i) If you see connections on many different nodes with large send queue sizes, and the destinations of those connections span a wide variety of nodes, that indicates a cluster-wide issue such as a faulty switch.

ii) If you see lots of connections with large send queue sizes but only on one particular node, that typically indicates the node is having trouble sending data out onto the network.

iii) If you see connections on lots of different nodes with large send queue sizes, and the destinations of those connections all point to one particular node, that indicates that node is having trouble receiving data.
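As referenced above, one way to distinguish the three patterns is to aggregate the collected output by source node and by destination address. A rough sketch against the sendq.txt file produced earlier (the "== node ==" headers come from that collection loop):

    awk '/^== / {src=$2; next}
         $1 ~ /^tcp/ {srcs[src]++; split($5, d, ":"); dsts[d[1]]++}
         END {
           print "large-send-queue connections per source node:"
           for (s in srcs) print " ", srcs[s], s
           print "large-send-queue connections per destination:"
           for (d in dsts) print " ", dsts[d], d
         }' sendq.txt

Counts spread across many sources and many destinations suggest case (i); counts concentrated on one source node suggest case (ii); counts concentrated on one destination suggest case (iii).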

Further, after observing these large send queue sizes, take a packet capture on the specific ports of the connections with large send queues. Analyzing the capture in Wireshark will typically show lots of TCP retransmits and connection resets, indicating packet loss in the network.
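For example, a capture could be taken as below (a sketch: the interface, output path, and port 5660 are illustrative assumptions; substitute the port of the suspect connection):

    # Capture full packets on the suspect port for later analysis in Wireshark.
    tcpdump -i eth0 -s 0 -w /tmp/suspect_port.pcap port 5660

In Wireshark, display filters such as "tcp.analysis.retransmission" and "tcp.flags.reset == 1" make the retransmits and resets easy to spot.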

The moral of the story: collect "netstat -pan" output from across the cluster a few times, say every 10 seconds, to identify connections with persistently large send queues. If you find them, you likely have a network issue of some sort.
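A simple sampling loop, run on each node, would look like this (a sketch; the 30-sample count and log path are arbitrary choices):

    # Append a timestamped netstat snapshot every 10 seconds, 30 times.
    for i in $(seq 1 30); do
      { date; netstat -pan; } >> /var/tmp/netstat.$(hostname).log
      sleep 10
    done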
