Friday, October 27, 2017

Tuning for Fast failover



I recently worked on a very interesting case where the requirement was to test the reliability of a MapR cluster, i.e., how core services like ZooKeeper (ZK) and CLDB handle a node failure followed by a data-node failure in quick succession.

Below are the tunings done to improve recovery time after node failures, in addition to isolating CLDB and ZK on dedicated nodes with their own topology.

1) Check network settings (all cluster and client nodes)


i) Verify that tcp_syn_retries is set to 4. tcp_syn_retries controls the number of TCP-layer retries for a new connection before it fails. The first retry occurs after the initial timeout (1 second), and the system then waits to see whether the retry succeeded before retrying again or failing once the configured number of retries is exhausted. It is an exponential back-off algorithm: going from 2 to 3 retries doubles the wait for the 3rd retry, 3 to 4 doubles it again, and so on. Thus going from 2 to, say, 4 retries is not just 2x slower, it is more than 4x slower (1+2+4 = 7 seconds vs. 1+2+4+8+16 = 31 seconds). For optimal failover behavior we recommend a low value of 4 (the Linux default is usually 5) for all nodes involved (client nodes, NFS gateways, POSIX clients, and all cluster nodes).

cat /proc/sys/net/ipv4/tcp_syn_retries
4

To set the TCP retry count, set the value of tcp_syn_retries to 4 in the /proc/sys/net/ipv4/ directory. 

echo 4 > /proc/sys/net/ipv4/tcp_syn_retries

ii) Verify that tcp_retries2 is set to 5. If not, set it to 5, since MapR also relies on TCP timeouts to detect transmission timeouts on active connections. Like tcp_syn_retries, this value controls the number of retries, and again the back-off is exponential.


cat /proc/sys/net/ipv4/tcp_retries2
5
echo 5 > /proc/sys/net/ipv4/tcp_retries2
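
The echo commands above take effect immediately but do not survive a reboot. A minimal sketch for persisting both values with sysctl (assuming a standard /etc/sysctl.conf; on some distros a file under /etc/sysctl.d/ is preferred):

echo "net.ipv4.tcp_syn_retries = 4" >> /etc/sysctl.conf
echo "net.ipv4.tcp_retries2 = 5" >> /etc/sysctl.conf
sysctl -p    # apply without a reboot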


2) Check MapR-FS timeout settings (FUSE and Hadoop client nodes)


To reduce the time it takes for hadoop and FUSE-based POSIX clients to detect CLDB and data-node failure, define the property fs.mapr.connect.timeout in the core-site.xml file. The minimum value for this property is 100 milliseconds.
Your entry in the core-site.xml file should look similar to the following:
<property>
 <name>fs.mapr.connect.timeout</name>
 <value>100</value>
 <description>file client wait time of 100 milliseconds</description>
</property>


3) Verify fast failover is enabled (cluster-wide parameter)


[root@node106rhel72 ~]# maprcli config load -json | grep fastfailover
"mfs.feature.fastfailover":"1",
[root@node106rhel72 ~]#

If the value is not set to 1, run the command below to enable fast failover.


maprcli config save -values {mfs.feature.fastfailover:1}


4) Default RPC request configuration (all NFS gateway, loopback NFS, and client nodes)

The default RPC request configuration can negatively impact performance and memory. To avoid these issues, configure the number of outstanding RPC requests to the NFS server to be 128. The kernel tunable sunrpc.tcp_slot_table_entries represents the number of simultaneous Remote Procedure Call (RPC) requests.


echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
echo 128 > /proc/sys/sunrpc/tcp_max_slot_table_entries

Remount the NFS client to the NFS gateway for the above values to take effect.
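
These sunrpc values also revert after a reboot. One common way to persist them (an assumption about your environment; the exact mechanism is distro-dependent) is a module options file such as /etc/modprobe.d/sunrpc.conf containing:

options sunrpc tcp_slot_table_entries=128 tcp_max_slot_table_entries=128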

Check data failover timing (recovered in ~5 seconds)

1) Create volume   
maprcli volume create -name vol1 -path /vol1

2) List the node/IP on which the name container master resides
maprcli dump volumeinfo -volumename vol1 -json


3) Run dd to write a huge file to the volume and note the time for the write to complete:
time dd if=/dev/zero of=/mapr/<Cluster>/vol1/storagefile bs=1M count=2000000

Now run the command again, and after a few seconds stop Warden on the IP you got from step 2. Again note the time for the write to complete.


The difference in time between the two dd runs in step 3 is the time recovery took. In my test recovery took ~5 seconds. Ideally we expect all recovery to happen well within 90 seconds.
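
A rough sketch of the same comparison in one place (cluster path and file names are placeholders, and it assumes GNU time is installed; Warden is stopped by hand on the node from step 2 during the second run):

/usr/bin/time -f "baseline: %e s" dd if=/dev/zero of=/mapr/<Cluster>/vol1/file1 bs=1M count=2000000
# Start the second run, then a few seconds later run this on the node from step 2:
#   service mapr-warden stop
/usr/bin/time -f "with failover: %e s" dd if=/dev/zero of=/mapr/<Cluster>/vol1/file2 bs=1M count=2000000
# recovery time ~= "with failover" elapsed time - "baseline" elapsed time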

Tuesday, October 17, 2017

Designing Subnets



By default, MapR automatically uses all available network interface cards (NICs) on each node in a network. MFS will try to use any interface that appears to be up to communicate with other machines, and other machines will use any of the IP addresses that MFS used to register with the CLDB.

The MAPR_SUBNETS variable can be set to restrict the set of interfaces that MFS advertises, but it is only read when MFS starts up. So if someone purposefully takes down the link to an interface on an MFS node, other nodes will continue to try to contact it on that interface, and MFS on that node will continue to try to use it. There is no way around this: there is no way to tell a running MFS process to stop using one of its interfaces. When MFS starts, it sends the list of its IP addresses to the CLDB; when something needs to talk to MFS, it retrieves that list of IP addresses from the CLDB. In some scenarios you might want MapR to use a restricted subset of NICs. For example, if you use multiple NICs of mixed speeds (such as 1GbE and 10GbE) on each node, you might want to separate them into two subnets. That way, you can use the faster NICs for MapR and the slower NICs for other functions.

https://maprdocs.mapr.com/home/AdministratorGuide/Designating-Subnets-for-MapR.html?hl=mapr_subnets

Let's consider a host with multiple NICs, each on a different subnet.

[root@vm52 ~]# ip a
1: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:bf:39:96 brd ff:ff:ff:ff:ff:ff
    inet 10.10.80.91/24 brd 10.10.80.255 scope global eth1
    inet6 fe80::20c:29ff:febf:3996/64 scope link 
       valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:39:5c:31 brd ff:ff:ff:ff:ff:ff
    inet 10.10.70.99/24 brd 10.10.70.255 scope global eth0
    inet6 fe80::20c:29ff:fe39:5c31/64 scope link 
       valid_lft forever preferred_lft forever


We can use a subnet/CIDR calculator to work out the CIDR range (IP range) for the interface we want, which can then be set in env.sh so that MapR uses only that particular subnet for communication.

[root@vm52 ~]# grep "SUBNETS=" /opt/mapr/conf/env.sh 
#export MAPR_SUBNETS=
[root@vm52 ~]#
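
For the host above, restricting MapR to the 10.10.70.0/24 network would look like the line below (a sketch based on this example host; MAPR_SUBNETS takes a comma-separated list of CIDR ranges, e.g. 10.10.70.0/24,10.10.80.0/24, and because it is only read at startup, MapR services must be restarted for it to take effect):

export MAPR_SUBNETS=10.10.70.0/24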








Friday, October 6, 2017

Label based Scheduling (YARN)




This blog assumes you already have at least a 2-node MapR 5.2.1 cluster installed. Below are the steps we will go through.


 To set up node labels for the purpose of scheduling YARN applications (including MapReduce applications) on a specific node or group of nodes, follow the steps below:
       1.   Create a text file and specify the labels you want to use for the nodes in your cluster. In this example, the file is named node.labels and covers 2 nodes.

    [mapr@vm52 root]$ cat node.labels
    vm52-1 fast
    vm52 slow
    [mapr@vm52 root]$
  2. Copy the file to a location on MapR-FS where it will not be modified or deleted, such as /var/mapr.
    hadoop fs -put ~/node.labels /var/mapr
  3. Edit yarn-site.xml on all ResourceManager nodes and set the node.labels.file parameter and the optional node.labels.monitor.interval parameter as shown:
    <property>
       <name>node.labels.file</name>
       <value>/var/mapr/node.labels</value>
       <description>The path to the node labels file.</description>
    </property>
    
    <property>
       <name>node.labels.monitor.interval</name>
       <value>120000</value>
       <description>Interval for checking the labels file for updates (default is 2 min)</description>
    </property>

     
  4.  Modify fair-scheduler.xml to add the "fast" label to the mapr queue:

    [mapr@vm52 hadoop]$ cat fair-scheduler.xml
    <allocations>
      <queue name="root">
        <aclSubmitApps>mapr</aclSubmitApps>
        <aclAdministerApps>mapr</aclAdministerApps>
        <queue name="mapr">
          <minResources>20000 mb,1 vcores,0 disks</minResources>
          <maxResources>30000 mb,4 vcores,2 disks</maxResources>
          <maxRunningApps>10</maxRunningApps>
          <weight>1.0</weight>
          <label>fast</label>
          <schedulingPolicy>fair</schedulingPolicy>
          <aclSubmitApps>mapr</aclSubmitApps>
        </queue>
        <queue name="abizer">
          <minResources>20000 mb,40 vcores,5 disks</minResources>
          <maxResources>30000 mb,50 vcores,50 disks</maxResources>
          <maxRunningApps>10</maxRunningApps>
          <weight>1.0</weight>
          <schedulingPolicy>fair</schedulingPolicy>
          <aclSubmitApps>abizer</aclSubmitApps>
        </queue>
      </queue>
    </allocations>
    [mapr@vm52 hadoop]$
  5.  Restart the ResourceManager to pick up the labels from the node labels file for the first time. For subsequent changes to take effect, manually tell the ResourceManager to reload the node labels file:
  • For any YARN applications, including MapReduce jobs, run "yarn rmadmin -refreshLabels" to refresh the labels after any changes.
  6. Verify that the labels are implemented correctly by running the following commands:

    [mapr@vm52 hadoop]$ yarn rmadmin -showLabels
    17/10/04 00:33:58 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to vm52/10.10.70.99:8033
                      Nodes     Labels
                     vm52-1     [fast]
                       vm52     [slow]
    [mapr@vm52 hadoop]$ 


  7.  Run a sample teragen job in the mapr queue so that it runs only on nodes with the "fast" label.

    [mapr@vm52 root]$ hadoop jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0-mapr-1607.jar teragen -Dmapreduce.job.queue=mapr 100000000 /teragen21
    17/10/04 00:58:58 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to vm52/10.10.70.99:8032
    17/10/04 00:59:00 INFO terasort.TeraSort: Generating 100000000 using 2
    17/10/04 00:59:00 INFO mapreduce.JobSubmitter: number of splits:2
    17/10/04 00:59:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1506992471684_0015
    17/10/04 00:59:03 INFO security.ExternalTokenManagerFactory: Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
    17/10/04 00:59:03 INFO impl.YarnClientImpl: Submitted application application_1506992471684_0015
    17/10/04 00:59:03 INFO mapreduce.Job: The url to track the job: http://vm52:8088/proxy/application_1506992471684_0015/
    17/10/04 00:59:03 INFO mapreduce.Job: Running job: job_1506992471684_0015
    17/10/04 00:59:12 INFO mapreduce.Job: Job job_1506992471684_0015 running in uber mode : false
    17/10/04 00:59:12 INFO mapreduce.Job:  map 0% reduce 0%
    17/10/04 00:59:23 INFO mapreduce.Job:  map 33% reduce 0%
    17/10/04 01:07:30 INFO mapreduce.Job:  map 66% reduce 0%
    17/10/04 01:07:31 INFO mapreduce.Job:  map 100% reduce 0%
    17/10/04 01:07:32 INFO mapreduce.Job: Job job_1506992471684_0015 completed successfully
    17/10/04 01:07:32 INFO mapreduce.Job: Counters: 33
    File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=192826
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    MAPRFS: Number of bytes read=170
    MAPRFS: Number of bytes written=10000000000
    MAPRFS: Number of read operations=20
    MAPRFS: Number of large read operations=0
    MAPRFS: Number of write operations=201171874
    Job Counters 
    Killed map tasks=1
    Launched map tasks=2
    Other local map tasks=2
    Total time spent by all maps in occupied slots (ms)=495359
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=495359
    Total vcore-seconds taken by all map tasks=495359
    Total megabyte-seconds taken by all map tasks=507247616
    DISK_MILLIS_MAPS=247680
    Map-Reduce Framework
    Map input records=100000000
    Map output records=100000000
    Input split bytes=170
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=961
    CPU time spent (ms)=195000
    Physical memory (bytes) snapshot=474079232
    Virtual memory (bytes) snapshot=3744894976
    Total committed heap usage (bytes)=317718528
    org.apache.hadoop.examples.terasort.TeraGen$Counters
    CHECKSUM=214760662691937609
    File Input Format Counters 
    Bytes Read=0
    File Output Format Counters 
    Bytes Written=10000000000
    [mapr@vm52 root]$ 

VERIFY:

We can see that containers are running only on the fast node (vm52-1):
    [mapr@vm52 hadoop]$ yarn node -list
    17/10/04 01:00:30 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to vm52/10.10.70.99:8032
    Total Nodes:2         
    Node-Id      Node-State Node-Http-Address Number-of-Running-Containers      
    vm52:56287         RUNNING         vm52:8042                            0    
    vm52-1:35535         RUNNING       vm52-1:8042                            2
    [mapr@vm52 hadoop]$
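
As an optional cross-check (a sketch reusing the same example jar and queue option as above, with a hypothetical output path), submit the same teragen job to the abizer queue, which has no <label> in fair-scheduler.xml; its containers should be free to run on either node:

hadoop jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0-mapr-1607.jar teragen -Dmapreduce.job.queue=abizer 100000000 /teragen22
yarn node -list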