Tuning for Fast Failover
I recently worked on a very interesting case where the requirement was to test the reliability of a MapR cluster, i.e., what happens when nodes running core services such as ZooKeeper and CLDB fail, followed by a data node failure in quick succession.
Below are the tunings done to improve recovery time after node failures, in addition to isolating CLDB and ZooKeeper on dedicated nodes with their own topology.
1) Check network settings (all cluster and client nodes)
i) Verify that tcp_syn_retries is set to 4. tcp_syn_retries controls the number of TCP-layer retries for a new connection before it fails. The first retry occurs after the initial timeout (1 second), and the system then waits for each retry to time out before retrying again or failing once all retries are used. It is an exponential back-off algorithm, so each additional retry doubles the wait for that attempt: going from 2 retries to 3 doubles the time spent on the 3rd retry, 3 to 4 doubles it again, and so on. Thus going from 2 to, say, 4 is not just 2x slower, it is more than 4x slower (1+2+4 = 7 seconds vs. 1+2+4+8+16 = 31 seconds). For optimal failover behavior we recommend a low value of 4 (the Linux default is usually 5) on all nodes involved: client nodes, NFS gateways, POSIX clients, and all cluster nodes.
cat /proc/sys/net/ipv4/tcp_syn_retries
4
To set the TCP retry count, set the value of tcp_syn_retries to 4 in the /proc/sys/net/ipv4/ directory.
echo 4 > /proc/sys/net/ipv4/tcp_syn_retries
ii) Verify that tcp_retries2 is set to 5. If not, set it to 5, since MapR also relies on TCP timeouts to detect transmission timeouts on active connections. Like tcp_syn_retries, this value controls the number of retries, and again the back-off is exponential.
cat /proc/sys/net/ipv4/tcp_retries2
5
echo 5 > /proc/sys/net/ipv4/tcp_retries2
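Both values live under /proc and therefore revert at reboot. Below is a minimal sketch for applying and persisting them on every node; it assumes a distribution that reads drop-in files from /etc/sysctl.d, and the file name is only an example:
# Apply immediately (equivalent to the echo commands above)
sysctl -w net.ipv4.tcp_syn_retries=4
sysctl -w net.ipv4.tcp_retries2=5
# Persist across reboots (file name is illustrative)
cat <<'EOF' > /etc/sysctl.d/99-mapr-failover.conf
net.ipv4.tcp_syn_retries = 4
net.ipv4.tcp_retries2 = 5
EOF
sysctl -p /etc/sysctl.d/99-mapr-failover.conf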
2) Check MapR-FS timeout settings (FUSE and Hadoop client nodes)
To reduce the time it takes for Hadoop and FUSE-based POSIX clients to detect CLDB and data node failure, define the property fs.mapr.connect.timeout in the core-site.xml file. The minimum value for this property is 100 milliseconds.
Your entry in the core-site.xml file should look similar to the following:
<property>
  <name>fs.mapr.connect.timeout</name>
  <value>100</value>
  <description>file client wait time of 100 milliseconds</description>
</property>
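To double-check that a client is actually picking up the property, grep the core-site.xml it uses; the path below is the usual location on a MapR Hadoop client and may differ on your install:
grep -A 2 fs.mapr.connect.timeout /opt/mapr/hadoop/hadoop-*/etc/hadoop/core-site.xml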
3) Verify fast failover is enabled (cluster-wide parameter)
[root@node106rhel72 ~]# maprcli config load -json | grep fastfailover
"mfs.feature.fastfailover":"1",
[root@node106rhel72 ~]#
If the value is not set to 1, run the command below to enable fast failover.
maprcli config save -values {mfs.feature.fastfailover:1}
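After saving, re-run the check above to confirm the feature now shows as enabled:
maprcli config load -json | grep fastfailover
"mfs.feature.fastfailover":"1",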
4) Default RPC request configuration (all NFS, loopback NFS, and client nodes)
The default RPC request configuration can negatively impact performance and memory use. To avoid performance and memory issues, configure the number of outstanding RPC requests to the NFS server to be 128. The kernel tunables sunrpc.tcp_slot_table_entries and sunrpc.tcp_max_slot_table_entries control the number of simultaneous Remote Procedure Call (RPC) requests.
echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
echo 128 > /proc/sys/sunrpc/tcp_max_slot_table_entries
Remount the NFS client to the NFS gateway for the above values to take effect.
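Since the sunrpc values also reset at reboot and sunrpc is a kernel module, one common way to persist them is a modprobe options file rather than sysctl.conf. The sketch below is illustrative: the file name, gateway host, mount options, and mount point are placeholders for your environment.
# Persist the slot table sizes as module options (file name is an example)
echo "options sunrpc tcp_slot_table_entries=128 tcp_max_slot_table_entries=128" > /etc/modprobe.d/sunrpc.conf
# Remount so the client picks up the new values; <nfs_gateway> is a placeholder
umount /mapr
mount -o hard,nolock <nfs_gateway>:/mapr /mapr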
Check data failover timing (recovered in ~5 seconds)
1) Create volume
maprcli volume create -name vol1 -path /vol1
2) List the node/IP on which the name container master resides
maprcli dump volumeinfo -volumename vol1 -json
3) Run dd to write a large file to the volume.
time dd if=/dev/zero of=/mapr/<Cluster>/vol1/storagefile bs=1M count=2000000 (note the time for the write to complete)
Now run the same command again and, after a few seconds, stop warden on the node/IP you got in step 2. (Note the time for the write to complete.)
The difference between the two timings from step 3 (the run with warden stopped versus the uninterrupted run) is the time recovery took. In my test recovery took ~5 seconds; ideally we expect all recovery to complete well within 90 seconds. A rough script tying these steps together is sketched below.
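Here is a rough end-to-end sketch of the timing test. The cluster name, volume name, the grep used to spot the name container master, and the warden stop command are all illustrative and may need adjusting for your MapR version and environment.
# Placeholders: adjust the cluster name and volume for your environment
CLUSTER=my.cluster.com
VOL=vol1
# Step 1: create the test volume
maprcli volume create -name $VOL -path /$VOL
# Step 2: note the node/IP hosting the name container master
# (the exact JSON field name may vary between MapR versions)
maprcli dump volumeinfo -volumename $VOL -json | grep -i master
# Step 3a: baseline write with no failure; note the elapsed time
time dd if=/dev/zero of=/mapr/$CLUSTER/$VOL/storagefile bs=1M count=2000000
# Step 3b: run the same write again and, a few seconds in, stop warden on the
# master node from step 2 (for example "service mapr-warden stop" on that
# host; the exact command may vary by version and init system)
time dd if=/dev/zero of=/mapr/$CLUSTER/$VOL/storagefile bs=1M count=2000000
# The difference between the two elapsed times approximates the recovery time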