Friday, October 27, 2017

Tuning for Fast failover



I recently worked on a very interesting case where the requirement was to test the reliability of a MapR cluster, i.e., how core services like ZooKeeper (ZK) and CLDB handle a node failure followed by a data-node failure in quick succession.

Below are the tunings done to improve recovery time after node failures, in addition to isolating CLDB and ZK on dedicated nodes with their own topology.

1) Check network settings (all cluster and client nodes)


i) Verify that tcp_syn_retries is set to 4. tcp_syn_retries controls the number of TCP-layer retries for a new connection before it fails. The first retry occurs after the initial timeout (1 second), and the system then waits to see whether the retry succeeded before retrying again or failing once the configured number of retries is exhausted. It is an exponential back-off algorithm: going from 2 to 3 retries doubles the wait for the 3rd retry, 3 to 4 doubles it again, and so on. Thus going from 2 to, say, 4 retries is not just 2x slower, it is more than 4x slower (1+2+4 = 7 seconds vs. 1+2+4+8+16 = 31 seconds). For optimal failover behavior we recommend a low value of 4 (the Linux default is usually 5) for all nodes involved (client nodes, NFS gateways, POSIX clients, and all cluster nodes).

cat /proc/sys/net/ipv4/tcp_syn_retries
4

To set the TCP retry count, set the value of tcp_syn_retries to 4 in the /proc/sys/net/ipv4/ directory. 

echo 4 > /proc/sys/net/ipv4/tcp_syn_retries

ii) Verify that tcp_retries2 is set to 5. If not, set it to 5, since MapR also relies on TCP timeouts to detect transmission timeouts on active connections. Like tcp_syn_retries, this value controls the number of retries, and again the back-off is exponential.


cat /proc/sys/net/ipv4/tcp_retries2
5
echo 5 > /proc/sys/net/ipv4/tcp_retries2
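
The echo commands above take effect immediately but do not survive a reboot. A minimal sketch for persisting both values with sysctl (assuming a standard /etc/sysctl.conf; on some distros a file under /etc/sysctl.d/ is preferred):

echo "net.ipv4.tcp_syn_retries = 4" >> /etc/sysctl.conf
echo "net.ipv4.tcp_retries2 = 5" >> /etc/sysctl.conf
sysctl -p    # apply without a reboot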


2) Check MapR-FS timeout settings (FUSE and Hadoop client nodes)


To reduce the time it takes for hadoop and FUSE-based POSIX clients to detect CLDB and data-node failure, define the property fs.mapr.connect.timeout in the core-site.xml file. The minimum value for this property is 100 milliseconds.
Your entry in the core-site.xml file should look similar to the following:
<property>
 <name>fs.mapr.connect.timeout</name>
 <value>100</value>
 <description>file client wait time of 100 milliseconds</description>
</property>


3) Verify fast failover is enabled (cluster-wide parameter)


[root@node106rhel72 ~]# maprcli config load -json | grep fastfailover
"mfs.feature.fastfailover":"1",
[root@node106rhel72 ~]#

If the value is not set to 1, run the command below to enable fast failover.


maprcli config save -values {mfs.feature.fastfailover:1}


4) Default RPC request configuration (all NFS gateway, loopback NFS, and client nodes)

The default RPC request configuration can negatively impact performance and memory. To avoid these issues, configure the number of outstanding RPC requests to the NFS server to be 128. The kernel tunable sunrpc.tcp_slot_table_entries represents the number of simultaneous Remote Procedure Call (RPC) requests.


echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
echo 128 > /proc/sys/sunrpc/tcp_max_slot_table_entries

Remount the NFS client to the NFS gateway for the above values to take effect.
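
These sunrpc values also revert after a reboot. One common way to persist them (an assumption about your environment; the exact mechanism is distro-dependent) is a module options file such as /etc/modprobe.d/sunrpc.conf containing:

options sunrpc tcp_slot_table_entries=128 tcp_max_slot_table_entries=128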

Check data failover timing (recovered in ~5 seconds)

1) Create volume   
maprcli volume create -name vol1 -path /vol1

2) List the node/IP on which the name container master resides
maprcli dump volumeinfo -volumename vol1 -json


3) Run dd to write a huge file to the volume and note the time for the write to complete:
time dd if=/dev/zero of=/mapr/<Cluster>/vol1/storagefile bs=1M count=2000000

Now run the command again, and after a few seconds stop Warden on the IP you got from step 2. Again note the time for the write to complete.


The difference in time between the two dd runs in step 3 is the time recovery took. In my test recovery took ~5 seconds. Ideally we expect all recovery to happen well within 90 seconds.
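
A rough sketch of the same comparison in one place (cluster path and file names are placeholders, and it assumes GNU time is installed; Warden is stopped by hand on the node from step 2 during the second run):

/usr/bin/time -f "baseline: %e s" dd if=/dev/zero of=/mapr/<Cluster>/vol1/file1 bs=1M count=2000000
# Start the second run, then a few seconds later run this on the node from step 2:
#   service mapr-warden stop
/usr/bin/time -f "with failover: %e s" dd if=/dev/zero of=/mapr/<Cluster>/vol1/file2 bs=1M count=2000000
# recovery time ~= "with failover" elapsed time - "baseline" elapsed time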

Tuesday, October 17, 2017

Designing Subnets



By default, MapR automatically uses all available network interface cards (NICs) on each node in a network. MFS will try to use any interface that appears to be up to communicate with other machines, and other machines will use any of the IP addresses that MFS used to register with the CLDB.

The MAPR_SUBNETS variable can be set to restrict the set of interfaces that MFS advertises, but it is only read when MFS starts up. So if someone purposefully takes down the link to an interface on an MFS node, other nodes will continue to try to contact it on that interface, and MFS on that node will continue to try to use it. There is no way around this: there is no way to tell a running MFS process to stop using one of its interfaces. When MFS starts, it sends the list of its IP addresses to the CLDB; when something needs to talk to MFS, it retrieves that list of IP addresses from the CLDB. In some scenarios you might want MapR to use a restricted subset of NICs. For example, if you use multiple NICs of mixed speeds (such as 1GbE and 10GbE) on each node, you might want to separate them into two subnets. That way, you can use the faster NICs for MapR and the slower NICs for other functions.

https://maprdocs.mapr.com/home/AdministratorGuide/Designating-Subnets-for-MapR.html?hl=mapr_subnets

Let's consider a host with multiple NICs, each on a different subnet.

[root@vm52 ~]# ip a
1: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:bf:39:96 brd ff:ff:ff:ff:ff:ff
    inet 10.10.80.91/24 brd 10.10.80.255 scope global eth1
    inet6 fe80::20c:29ff:febf:3996/64 scope link 
       valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:39:5c:31 brd ff:ff:ff:ff:ff:ff
    inet 10.10.70.99/24 brd 10.10.70.255 scope global eth0
    inet6 fe80::20c:29ff:fe39:5c31/64 scope link 
       valid_lft forever preferred_lft forever


We can use a subnet/CIDR calculator to work out the CIDR range (IP range) for the interface we want, which can then be set in env.sh so that MapR uses only that particular subnet for communication.

[root@vm52 ~]# grep "SUBNETS=" /opt/mapr/conf/env.sh 
#export MAPR_SUBNETS=
[root@vm52 ~]#
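
For the host above, restricting MapR to the 10.10.70.0/24 network would look like the line below (a sketch based on this example host; MAPR_SUBNETS takes a comma-separated list of CIDR ranges, e.g. 10.10.70.0/24,10.10.80.0/24, and because it is only read at startup, MapR services must be restarted for it to take effect):

export MAPR_SUBNETS=10.10.70.0/24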








Friday, October 6, 2017

Label based Scheduling (YARN)




This blog assumes you already have at least a 2-node MapR 5.2.1 cluster installed. Below are the steps we will go through.


 To set up node labels for the purpose of scheduling YARN applications (including MapReduce applications) on a specific node or group of nodes, follow the steps below:
       1.   Create a text file and specify the labels you want to use for the nodes in your cluster. In this example, the file is named node.labels and covers 2 nodes.

    [mapr@vm52 root]$ cat node.labels
    vm52-1 fast
    vm52 slow
    [mapr@vm52 root]$
  2. Copy the file to a location on MapR-FS where it will not be modified or deleted, such as /var/mapr.
    hadoop fs -put ~/node.labels /var/mapr
  3. Edit yarn-site.xml on all ResourceManager nodes and set the node.labels.file parameter and the optional node.labels.monitor.interval parameter as shown:
    <property>
       <name>node.labels.file</name>
       <value>/var/mapr/node.labels</value>
       <description>The path to the node labels file.</description>
    </property>
    
    <property>
       <name>node.labels.monitor.interval</name>
       <value>120000</value>
       <description>Interval for checking the labels file for updates (default is 2 min)</description>
    </property>

     
  4.  Modify fair-scheduler.xml to add the "fast" label to the mapr queue:

    [mapr@vm52 hadoop]$ cat fair-scheduler.xml
    <allocations>
      <queue name="root">
        <aclSubmitApps>mapr</aclSubmitApps>
        <aclAdministerApps>mapr</aclAdministerApps>
        <queue name="mapr">
          <minResources>20000 mb,1 vcores,0 disks</minResources>
          <maxResources>30000 mb,4 vcores,2 disks</maxResources>
          <maxRunningApps>10</maxRunningApps>
          <weight>1.0</weight>
          <label>fast</label>
          <schedulingPolicy>fair</schedulingPolicy>
          <aclSubmitApps>mapr</aclSubmitApps>
        </queue>
        <queue name="abizer">
          <minResources>20000 mb,40 vcores,5 disks</minResources>
          <maxResources>30000 mb,50 vcores,50 disks</maxResources>
          <maxRunningApps>10</maxRunningApps>
          <weight>1.0</weight>
          <schedulingPolicy>fair</schedulingPolicy>
          <aclSubmitApps>abizer</aclSubmitApps>
        </queue>
      </queue>
    </allocations>
    [mapr@vm52 hadoop]$
  5.  Restart the ResourceManager to pick up the labels from the node labels file for the first time. For subsequent changes to take effect, manually tell the ResourceManager to reload the node labels file:
  • For any YARN applications, including MapReduce jobs, run "yarn rmadmin -refreshLabels" to refresh the labels after any changes.
  6. Verify that the labels are implemented correctly by running the following commands:

    [mapr@vm52 hadoop]$ yarn rmadmin -showLabels
    17/10/04 00:33:58 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to vm52/10.10.70.99:8033
                      Nodes     Labels
                     vm52-1     [fast]
                       vm52     [slow]
    [mapr@vm52 hadoop]$ 


  7.  Run a sample teragen job in the mapr queue so that it runs only on nodes with the "fast" label.

    [mapr@vm52 root]$ hadoop jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0-mapr-1607.jar teragen -Dmapreduce.job.queue=mapr 100000000 /teragen21
    17/10/04 00:58:58 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to vm52/10.10.70.99:8032
    17/10/04 00:59:00 INFO terasort.TeraSort: Generating 100000000 using 2
    17/10/04 00:59:00 INFO mapreduce.JobSubmitter: number of splits:2
    17/10/04 00:59:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1506992471684_0015
    17/10/04 00:59:03 INFO security.ExternalTokenManagerFactory: Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
    17/10/04 00:59:03 INFO impl.YarnClientImpl: Submitted application application_1506992471684_0015
    17/10/04 00:59:03 INFO mapreduce.Job: The url to track the job: http://vm52:8088/proxy/application_1506992471684_0015/
    17/10/04 00:59:03 INFO mapreduce.Job: Running job: job_1506992471684_0015
    17/10/04 00:59:12 INFO mapreduce.Job: Job job_1506992471684_0015 running in uber mode : false
    17/10/04 00:59:12 INFO mapreduce.Job:  map 0% reduce 0%
    17/10/04 00:59:23 INFO mapreduce.Job:  map 33% reduce 0%
    17/10/04 01:07:30 INFO mapreduce.Job:  map 66% reduce 0%
    17/10/04 01:07:31 INFO mapreduce.Job:  map 100% reduce 0%
    17/10/04 01:07:32 INFO mapreduce.Job: Job job_1506992471684_0015 completed successfully
    17/10/04 01:07:32 INFO mapreduce.Job: Counters: 33
    File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=192826
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    MAPRFS: Number of bytes read=170
    MAPRFS: Number of bytes written=10000000000
    MAPRFS: Number of read operations=20
    MAPRFS: Number of large read operations=0
    MAPRFS: Number of write operations=201171874
    Job Counters 
    Killed map tasks=1
    Launched map tasks=2
    Other local map tasks=2
    Total time spent by all maps in occupied slots (ms)=495359
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=495359
    Total vcore-seconds taken by all map tasks=495359
    Total megabyte-seconds taken by all map tasks=507247616
    DISK_MILLIS_MAPS=247680
    Map-Reduce Framework
    Map input records=100000000
    Map output records=100000000
    Input split bytes=170
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=961
    CPU time spent (ms)=195000
    Physical memory (bytes) snapshot=474079232
    Virtual memory (bytes) snapshot=3744894976
    Total committed heap usage (bytes)=317718528
    org.apache.hadoop.examples.terasort.TeraGen$Counters
    CHECKSUM=214760662691937609
    File Input Format Counters 
    Bytes Read=0
    File Output Format Counters 
    Bytes Written=10000000000
    [mapr@vm52 root]$ 

VERIFY:

We can see that containers are running only on the fast node (vm52-1):
    [mapr@vm52 hadoop]$ yarn node -list
    17/10/04 01:00:30 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to vm52/10.10.70.99:8032
    Total Nodes:2         
    Node-Id      Node-State Node-Http-Address Number-of-Running-Containers      
    vm52:56287         RUNNING         vm52:8042                            0    
    vm52-1:35535         RUNNING       vm52-1:8042                            2
    [mapr@vm52 hadoop]$
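
As an optional cross-check (a sketch reusing the same example jar and queue option as above, with a hypothetical output path), submit the same teragen job to the abizer queue, which has no <label> in fair-scheduler.xml; its containers should be free to run on either node:

hadoop jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0-mapr-1607.jar teragen -Dmapreduce.job.queue=abizer 100000000 /teragen22
yarn node -list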