Sunday, June 16, 2019

Network debugging from Databricks notebook



Network setup issues:

Messages like the one below, in init script logs or when you run a simple apt-get install, usually mean the notebook cannot reach archive.ubuntu.com on port 80.

Cannot initiate the connection to archive.ubuntu.com:80 (2001:67c:1560:8001::14). - connect (101: Network is unreachable) [IP: 2001:67c:1560:8001::14 80]


Or you may see messages like this, with a stack trace, in the driver logs:

Caused by: java.util.concurrent.TimeoutException: Timeout during file system operation after 1200 seconds. Please configure databricks.data.rpcTimeout to change this timeout.


This can happen for a number of reasons.

1) Outgoing traffic to the internet is blocked, so the cluster cannot reach archive.ubuntu.com and you see "Network is unreachable". A quick check is to ping an external host:

%sh ping -c 2 google.com

PING google.com (172.217.9.238): 56 data bytes
--- google.com ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss

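Most likely the traffic is blocked by a security group (SG) or network ACL (NACL); if not, the route setup for the instances is incorrect. The blog below has all the details on which ports need to be open for communication. Note that ping uses ICMP, which can be blocked even when TCP traffic is allowed, so confirm with the port checks in step 2.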



2) Another test is to telnet to the specific hostname and port. You should see a message like the one below; any other message, such as a timeout or connection refused, means the driver cannot reach that host on that port.

%sh telnet archive.ubuntu.com 80

Trying 91.189.88.24...
Connected to archive.ubuntu.com.
Escape character is '^]'.


Or use nc to check whether a connection to the port can be opened.


%sh nc -zv archive.ubuntu.com 80

archive.ubuntu.com [91.189.88.149] 80 (http) open

If you don't see output like the above, the issue could be due to incorrect VPC peering, ports not being open, or a firewall. The blog below can help debug.
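The same port check can be run from a Python notebook cell when telnet or nc is not installed. This is a minimal sketch, not the author's method: `check_tcp` and its status strings are hypothetical names chosen for illustration, and the 5-second timeout is an assumption.

```python
import errno
import socket


def check_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection and return a short status string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: often a security group, NACL or firewall dropping packets
        return "timeout"
    except ConnectionRefusedError:
        # Host is reachable but nothing is listening on that port
        return "refused"
    except socket.gaierror:
        # Hostname could not be resolved (see the DNS checks in step 3)
        return "dns_failure"
    except OSError as exc:
        if exc.errno in (errno.ENETUNREACH, errno.EHOSTUNREACH):
            # Same condition as "Network is unreachable" in the apt-get log
            return "unreachable"
        return "error: %s" % exc
```

Calling `check_tcp("archive.ubuntu.com", 80)` from the driver should return "open" when the network setup is correct; any other value hints at which of the causes above to look at first.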


3) It is also possible that the driver cannot resolve the hostname. You can quickly check whether the hostname resolves to the IP address you expect; if it does not, try step 2 with the IP instead of the hostname to confirm.

%sh nslookup archive.ubuntu.com

Server: 10.177.0.2
Address: 10.177.0.2#53

Non-authoritative answer:
Name: archive.ubuntu.com
Address: 91.189.91.23
Name: archive.ubuntu.com
Address: 91.189.91.26
Name: archive.ubuntu.com
Address: 91.189.88.24
Name: archive.ubuntu.com
Address: 91.189.88.31
Name: archive.ubuntu.com
Address: 91.189.88.149
Name: archive.ubuntu.com
Address: 91.189.88.162

Note: in the unlikely event that nslookup returns an incorrect address, it could be because DNS is caching an old record and the server IP has changed. Querying the SOA record shows the default TTL for the record:

%sh  nslookup -type=SOA novartis-prod.cloud.databricks.com

Server: 10.177.0.2
Address: 10.177.0.2#53

Non-authoritative answer:
novartis-prod.cloud.databricks.com canonical name = dbc-67a72554-618d.cloud.databricks.com.
dbc-67a72554-618d.cloud.databricks.com canonical name = ec2-3-90-35-224.compute-1.amazonaws.com.

Authoritative answers can be found from:
compute-1.amazonaws.com
origin = dns-external-master.amazon.com
mail addr = root.amazon.com
serial = 2013233418
refresh = 28800
retry = 900
expire = 2592000
minimum = 7211 -> Default TTL for record
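The hostname check from step 3 can also be done from Python, which makes it easy to compare the result against the addresses you expect. This is a rough sketch only: `resolve` is a hypothetical helper, and it uses the driver's own resolver through the standard `socket.getaddrinfo` call, so it reflects whatever DNS server and cache the driver is using.

```python
import socket


def resolve(hostname, family=socket.AF_INET):
    """Return the sorted, de-duplicated addresses the local resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # Resolution failed: the driver cannot look this name up at all
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return sorted({info[4][0] for info in infos})
```

An empty list from `resolve("archive.ubuntu.com")` means the name is not resolving at all, while a list that differs from the addresses you expect points to a stale or incorrect DNS record.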




Flaky Network issues:


The steps above resolve most issues, but there can be cases where a connection fails randomly. This usually points to a flaky network rather than a firewall block or incorrect configuration.

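You may see messages like the one below:

Caused by: java.io.IOException: SQL Server did not return a response. The connection has been closed. ClientConnectionId:XXX-XXX-XXXX

In this case we need to capture a TCP dump from the driver for all traffic going to the database port. The init script below captures port 3306, the MySQL default; for SQL Server, as in the message above, the default port is 1433, so adjust the port to match your database.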

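%scala
dbutils.fs.put("dbfs:/databricks/init_scripts/take_tcpdump.sh", """#!/bin/bash
echo "initiating tcp dump"
# Make sure the output directory exists
mkdir -p /dbfs/databricks/tcpdump
# -G 1800: new file every 30 min; -W 1000: stop after 1000 files
# -K: skip checksum checks; -n: no name resolution
sudo tcpdump -w /dbfs/databricks/tcpdump/trace_%Y_%m_%d_%H_%M_%S.pcap -W 1000 -G 1800 -K -n port 3306 > /dbfs/databricks/tcpdump/tcpdump.log 2>&1 &
""", true)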

Once the issue is hit, download the pcap file of interest and use Wireshark to understand when and how the connection is getting closed.

Below is a good example where a connection was getting closed abruptly and tcpdump helped to understand the problem.

https://abizeradenwala.blogspot.com/2018/02/analyze-broken-pipe-error-in-hive.html
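To help pick the pcap file of interest, it can be useful to record when connections actually fail. The following is a minimal Python sketch and not part of the original setup: `probe` and its parameters are hypothetical names, and it only tests whether a TCP connection can be opened, not whether the database answers queries.

```python
import socket
import time
from datetime import datetime


def probe(host, port, attempts, interval_seconds, timeout=3.0):
    """Attempt a TCP connect `attempts` times; return (timestamp, error) per failure."""
    failures = []
    for i in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass  # Connection opened fine; close it straight away
        except OSError as exc:
            # Covers timeouts, refusals, unreachable networks and DNS failures
            stamp = datetime.now().isoformat(timespec="seconds")
            failures.append((stamp, str(exc)))
        if i < attempts - 1:
            time.sleep(interval_seconds)
    return failures
```

The timestamps of any failures can then be matched against the rotated trace file names, which carry the capture start time.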



Network Latency issues:

Network latency issues are due either to a bad node or to network congestion caused by a bad or slow network.

To troubleshoot network latency, one thing to check is the send queue size for open TCP connections (the Send-Q value, the third column in the output of "netstat -pan"). On a normally operating network, where the source and destination machines have free memory/CPU and there is no network bottleneck on the interfaces of the source and destination machines, the send queue size should not be more than a few thousand bytes at most. If you see a lot of connections with 10K+ bytes in the send queue, that generally indicates some sort of problem.


Collect "netstat -pan" output every 30 seconds, with timestamps, while the problem is occurring, for later review.
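One way to collect these snapshots is a small loop that runs the command and writes a timestamp header before each output. This is a sketch under assumptions: `collect_snapshots` is a hypothetical helper, the output path is up to you, and it assumes netstat is installed on the node.

```python
import subprocess
import time
from datetime import datetime


def collect_snapshots(command, count, interval_seconds, out_path):
    """Run `command` `count` times, appending timestamped output to out_path."""
    with open(out_path, "a") as out:
        for i in range(count):
            stamp = datetime.now().isoformat(timespec="seconds")
            result = subprocess.run(command, capture_output=True, text=True)
            # Header line makes it easy to find the snapshot for a given time
            out.write("=== %s ===\n" % stamp)
            out.write(result.stdout)
            if result.stderr:
                out.write(result.stderr)
            out.flush()
            if i < count - 1:
                time.sleep(interval_seconds)
```

For example, `collect_snapshots(["netstat", "-pan"], 20, 30, "/tmp/netstat.log")` would take 20 snapshots 30 seconds apart; the file name here is only an illustration.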


i)  If you see lots of connections with large send queue sizes but only on one particular node, that would typically indicate that one node is having trouble sending data out onto the network.


ii) If you see connections on lots of different nodes with large send queue sizes and the destinations of those connections are all to one particular node then that indicates that one particular node is having trouble receiving data.

iii) If you see connections on many different nodes with large send queue sizes, and the destinations of those connections are also to a wide variety of different nodes then that indicates a cluster wide issue such as a faulty switch.
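The three cases above can be told apart by counting, across the netstat output gathered from all nodes, which hosts appear as senders and which as destinations of connections with large send queues. The sketch below is illustrative only: the function names are hypothetical, the 10000-byte threshold mirrors the "10K+" rule of thumb above, and it assumes the Linux netstat column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address).

```python
from collections import Counter


def large_send_queues(netstat_text, threshold=10000):
    """Return (local, remote, send_q) for TCP lines whose Send-Q is at or above threshold."""
    rows = []
    for line in netstat_text.splitlines():
        parts = line.split()
        # Skip headers, UDP and unix-socket lines
        if len(parts) < 5 or not parts[0].startswith("tcp"):
            continue
        if not parts[2].isdigit():
            continue
        send_q = int(parts[2])
        if send_q >= threshold:
            rows.append((parts[3], parts[4], send_q))
    return rows


def host_of(address):
    """Strip the trailing :port from an address such as 10.0.0.1:443."""
    return address.rsplit(":", 1)[0]


def summarize(rows):
    """Count large-queue connections per sending host and per destination host."""
    senders = Counter(host_of(local) for local, _, _ in rows)
    receivers = Counter(host_of(remote) for _, remote, _ in rows)
    return senders, receivers
```

A single dominant sender suggests case i, many senders with one dominant receiver suggests case ii, and a wide spread on both sides suggests case iii.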