Thursday, November 16, 2017

NFS4 Setup


MapR lets you mount the cluster via NFS so that your applications can read and write data directly. We use NFS Ganesha to support NFSv4 features. NFS Ganesha is an open-source user-space implementation of the NFS server. The MapR NFSv4 server, running as a user-space process, registers callbacks with NFS Ganesha through the File System Abstraction Layer (FSAL), which is a shared library (libfsalmapr.so). NFS Ganesha loads and uses this library whenever MapR-FS is exported/mounted. The FSAL, in turn, uses the FileClient (libMapRClient.so) to connect to the cluster.

Read and write operations reach the MapR cluster through NFSv4 as follows: when a user enters a command (such as ls), the NFS client submits the request over TCP to the MapR NFSv4 server. The NFS server uses the MapR FileClient to perform the requested operation on the cluster and returns the response to the NFS client over TCP.


Pre-req:
This blog assumes you already have a cluster installed. Run the command below to secure the cluster.

[root@node2rhel73 initscripts]# /opt/mapr/server/configure.sh -secure -genkeys -Z 10.10.70.117 -C 10.10.70.117 -N dsemapr
Configuring Hadoop-2.7.0 at /opt/mapr/hadoop/hadoop-2.7.0
Done configuring Hadoop
CLDB node list: 10.10.70.117:7222
Zookeeper node list: 10.10.70.117:5181
Node setup configuration:  cldb fileserver webserver zookeeper
Log can be found at:  /opt/mapr/logs/configure.log
Creating 10 year self signed certificate with subjectDN='CN=node2rhel73'
Zookeeper found on this node, and it is not running. Starting Zookeeper
Warden is not running. Starting mapr-warden. Warden will then start all other configured services on this node
... Starting cldb
... Starting fileserver
... Starting webserver
To further manage the system, use "maprcli", or connect browser to https://node2rhel73:8443/
To stop and start this node, use "systemctl start/stop mapr-warden "
[root@node2rhel73 initscripts]# 
Verify that ZooKeeper is up:

[root@node2rhel73 initscripts]# /opt/mapr/initscripts/zookeeper qstatus
JMX enabled by default
Using config: /opt/mapr/zookeeper/zookeeper-3.4.5/conf/zoo.cfg
Mode: standalone
[root@node2rhel73 initscripts]#

Installation:


1) Install nfs-utils on the host where the NFSv4 server will be installed.

yum install nfs-utils -y

2) rpc.statd and rpcbind should be running.

rpc.statd (formerly started by the nfs-lock service):

[root@node2rhel73 initscripts]# /sbin/rpc.statd
[root@node2rhel73 initscripts]# ps -ef| grep rpc.st
rpcuser  29159     1  2 03:35 ?        00:00:00 /sbin/rpc.statd
root     29164  5553  0 03:35 pts/0    00:00:00 grep --color=auto rpc.st
[root@node2rhel73 initscripts]# 

  • On Red Hat and CentOS v6.0 and higher, the rpcbind service (formerly portmapper)  must be running. You can use the command ps ax | grep rpcbind to check.



[root@node2rhel73 conf]# service rpcbind start
Redirecting to /bin/systemctl start  rpcbind.service

[root@node2rhel73 conf]# ps -ef| grep rpcbind
rpc       7980     1  0 20:17 ?        00:00:00 /sbin/rpcbind -w
root      8359  1168  0 20:19 pts/0    00:00:00 grep --color=auto rpcbind
[root@node2rhel73 conf]# 
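To make sure both services also come back after a reboot on a systemd-based OS such as RHEL/CentOS 7, you can additionally do the following (a minimal sketch, assuming the stock rpcbind and rpc-statd units shipped with the OS and nfs-utils are present):

systemctl enable rpcbind
systemctl start rpcbind
systemctl start rpc-statd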


3) Install the "mapr-nfs4server" package; the mapr-nfsganesha package will also be installed as a dependency.

yum install mapr-nfs4server -y

[root@node2rhel73 initscripts]# rpm -qa | grep nfs
mapr-nfsganesha-2.3.0.201704102214-1.noarch
nfs-utils-1.3.0-0.33.el7.x86_64
mapr-nfs4server-5.2.0.42783.GA-1.x86_64
libnfsidmap-0.25-15.el7.x86_64
[root@node2rhel73 initscripts]# 
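As described at the top of this post, the FSAL shared library (libfsalmapr.so) that NFS Ganesha loads comes from the mapr-nfsganesha package; a quick way to confirm it landed on disk is to list the package contents (the exact install path may vary by release):

rpm -ql mapr-nfsganesha | grep -i fsal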

4) Generate a service ticket for the NFSv4 service to use while starting.

[root@node2rhel73 initscripts]# maprlogin generateticket -type servicewithimpersonation -user mapr -out /tmp/mapr_impersonation_ticket -duration 30:0:0 -renewal 90:0:0
MapR credentials of user 'mapr' for cluster 'dsemapr' are written to '/tmp/mapr_impersonation_ticket'
[root@node2rhel73 initscripts]# cat /tmp/mapr_impersonation_ticket
dsemapr 94UewDZHYdAIewP1EGjlSmfVF0f9YPEUmY5AMnzVJGvncnlZs85cM8JP5GXGCVTh+FMv48MtmLFhxJihnIN20e7xE09/ZZAiZI8YfwJIfB56FSUWAY73iIvO+Uw8i/g+DpO4g7Hj4vVGcOxYW3LXJAOAYP5qJvXgrM/TJh2SqBBfIQQS+M+7fIz+r7RZZkEorpsnj/RPC2okXwSWk6Vwg+c1MAQkKBkBsT6wBTQ6d25yW55qECB6UILFMduJTvHgS0YQNUQIjbM4/ZZE8xuvjGEk3Sn6AvSAz5hfhOllRM7LimE=
[root@node2rhel73 initscripts]#
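Optionally, sanity-check the generated ticket before pointing the NFSv4 server at it (a quick check, assuming the maprlogin client is available on this node); it should print the cluster name, the user, and the expiry time:

maprlogin print -ticketfile /tmp/mapr_impersonation_ticket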


5) Update the ticket location (tkt_location):

[root@node2rhel73 initscripts]# grep tkt /opt/mapr/conf/nfs4server.conf 
  tkt_location = /tmp/mapr_impersonation_ticket; 
[root@node2rhel73 initscripts]# 

Update the export path (Path) and pseudo path (Pseudo):

[root@node2rhel73 initscripts]# grep -w "\/mapr" /opt/mapr/conf/nfs4server.conf 
  Path = /mapr;
  Pseudo = /mapr;
[root@node2rhel73 initscripts]# 

Comment out the krb5 SecType, since we are not using Kerberos with the NFSv4 server:

[root@node2rhel73 initscripts]# grep SecType /opt/mapr/conf/nfs4server.conf 
  # SecType = krb5;
[root@node2rhel73 initscripts]# 
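Putting the edits together, the relevant EXPORT section of nfs4server.conf ends up looking roughly like the sketch below. This is illustrative only: Path, Pseudo, and SecType are the fields touched above, Export_Id matches the export id shown later by nfs4mgr list-exports, and Access_Type is a standard NFS Ganesha export option; your shipped file may contain additional fields.

EXPORT
{
  Export_Id = 30;
  Path = /mapr;
  Pseudo = /mapr;
  Access_Type = RW;
  # SecType = krb5;
}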

6) Start the NFSv4 server.

[root@node2rhel73 initscripts]# service mapr-nfs4server start
Redirecting to /bin/systemctl start  mapr-nfs4server.service
[root@node2rhel73 initscripts]# service mapr-nfs4server status
Redirecting to /bin/systemctl status  mapr-nfs4server.service
mapr-nfs4server.service - MapR Technologies, Inc. NFSv4 Server
   Loaded: loaded (/etc/systemd/system/mapr-nfs4server.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-11-13 03:48:53 EST; 8s ago
  Process: 1622 ExecStart=/opt/mapr/initscripts/mapr-nfs4server start (code=exited, status=0/SUCCESS)
 Main PID: 1682 (nfs4server)
   CGroup: /system.slice/mapr-nfs4server.service
           ├─1682 /opt/mapr/bin/nfs4server -L /opt/mapr/logs/nfs4/nfs4server.log -f /opt/mapr/conf/nfs4server.conf -d
           └─1683 /opt/mapr/bin/ganesha.nfsd -F -L /opt/mapr/logs/nfs4/nfs4server.log -f /opt/mapr/conf/nfs4server.conf

Nov 13 03:48:50 node2rhel73 systemd[1]: Starting MapR Technologies, Inc. NFSv4 Server...
Nov 13 03:48:53 node2rhel73 systemd[1]: Started MapR Technologies, Inc. NFSv4 Server.
[root@node2rhel73 initscripts]# 

7) Show what is exported, then mount the NFSv4 export to access the cluster.

[root@node2rhel73 initscripts]# showmount -e
Export list for node2rhel73:
/mapr (everyone)
[root@node2rhel73 conf]# /opt/mapr/server/nfs4mgr list-exports
ExportId      Path
30            /mapr
0             /


[root@node2rhel73 conf]# 

[root@node2rhel73 initscripts]# mount -t nfs4 node2rhel73:/mapr /mapr
[root@node2rhel73 initscripts]# df -hP /mapr
Filesystem         Size  Used Avail Use% Mounted on
node2rhel73:/mapr  119G  493M  118G   1% /mapr
[root@node2rhel73 initscripts]# mount | grep mapr
node2rhel73:/mapr on /mapr type nfs4 (rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.70.117,local_lock=none,addr=10.10.70.117)
[root@node2rhel73 initscripts]#

[root@node2rhel73 initscripts]# cd /mapr/dsemapr/
[root@node2rhel73 dsemapr]# ls
a  abizer  apps  hbase  mysql  opt  softlink  test  tmp  user  var

[root@node2rhel73 dsemapr]# hadoop fs -ls /
Found 11 items
-rwxrwxrwx   3 mapr mapr          0 2017-04-27 01:20 /a
drwxrwxrwx   - 26   26            1 2017-08-03 18:30 /abizer
drwxrwxrwx   - 26   26            0 2017-04-20 16:08 /apps
drwxrwxrwx   - 26   26            0 2017-04-20 16:08 /hbase
drwxrwxrwx   - 26   26            0 2017-08-25 23:21 /mysql
drwxrwxrwx   - 26   26            0 2017-04-20 16:09 /opt
-rwxrwxrwx   - 26   26            6 2017-08-07 18:11 /softlink
-rwxrwxrwx   3 26   26            0 2017-04-27 21:15 /test
drwxrwxrwx   - 26   26            2 2017-09-22 01:39 /tmp
drwxrwxrwx   - 26   26            1 2017-08-03 18:43 /user
drwxrwxrwx   - 26   26            1 2017-04-20 16:08 /var
[root@node2rhel73 dsemapr]# 
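To have the mount survive reboots, an /etc/fstab entry along these lines can be added (a sketch built from the options visible in the mount output above; adjust the host and paths for your cluster):

node2rhel73:/mapr  /mapr  nfs4  hard,proto=tcp,timeo=600,_netdev  0 0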








Monday, November 13, 2017

Fair Scheduler resource updates




Recently I had an interesting case where some resources in a Fair Scheduler queue were updated, yet the scheduler page did not show the updated values. The main concern was whether the application team would get the extra resources they were paying for and, if so, whether there was a bug in the scheduler UI.

1) On checking the RM logs it was clear the file was indeed being read, but the question was why the values were not updated.

2017-11-13 16:18:18,124 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:28,127 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:38,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:48,133 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:58,137 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml

2017-11-13 16:19:08,141 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml

2) I turned on scheduler debug logging to narrow down whether it was just a scheduler UI issue or whether the values were genuinely not updated in the scheduler. Reviewing the debug logs showed that the scheduler only knew about the old values.

yarn daemonlog -setlevel <RM Hostname>:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler DEBUG

One step in the right direction: now we knew it was not the scheduler UI but something wrong with the scheduler itself.

3) Assuming the user was hitting some kind of bug when updating multi-level queues, I decided to reproduce the issue in-house, but everything worked fine and all the updates showed up on the scheduler page.

This confirmed it was an environment issue or something wrong with the customer's fair-scheduler.xml file.

4) I tried to load the customer's fair-scheduler.xml in my local repro to check whether the file was readable or whether there was some issue with its format. My logs also showed the file being read with no errors reported, but the scheduler page did not show the new queues.

5) Finally I restarted the RM, hoping it would read the file and display the queues in the scheduler.


Bingo!!! This time the RM failed to come up and logged the messages below, which gave me the clue for the root cause of the issue.

Caused by: java.io.IOException: Failed to initialize FairScheduler
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1441)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1458)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 7 more
Caused by: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: Bad fair scheduler config file: queue name (mapr.general) shouldn't contain period.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:437)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:516)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:516)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:355)

at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1439)


As seen from the stack trace, the fair-scheduler.xml had an issue: while updating the file, someone had configured a queue named (mapr.general), and a queue name can never contain a period. This caused the allocation file to fail to load, so the scheduler was never updated.
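For reference, the Fair Scheduler expresses a hierarchical queue by nesting <queue> elements rather than by putting a period in the name. A corrected fair-scheduler.xml fragment would look roughly like the sketch below (the resource values are illustrative, not the customer's actual settings):

<allocations>
  <queue name="mapr">
    <queue name="general">
      <minResources>10000 mb,10 vcores</minResources>
      <maxResources>50000 mb,50 vcores</maxResources>
    </queue>
  </queue>
</allocations>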


Key takeaway:

After updating your configs, always validate that the updated queue resources show up. It is much easier to catch an issue right after a change than to backtrack the problem later with no clue about what someone else has done.
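Two cheap checks that would have caught this earlier (a sketch, assuming the default RM port and the allocation file path from the logs above): verify the XML is at least well formed, and ask the RM REST API which queues the scheduler actually loaded.

xmllint --noout /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
curl -s http://<RM Hostname>:8088/ws/v1/cluster/scheduler | grep -o '"queueName":"[^"]*"'

Note that xmllint only catches XML syntax errors; it will not flag a semantic problem like a period in a queue name, which is why checking the scheduler's own view of the queues matters.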




Thursday, November 9, 2017

Space not getting reclaimed for MapR DB table




When data is deleted based on TTL, it is removed from disk the next time the segment is packed, and only then is the space reclaimed. Unlike HBase, MapR-DB does not routinely rewrite all data; this approach reduces write amplification. The side effect is that storage which should be released will not be released if, due to the write pattern, the same segment is not being actively updated. To reclaim the space almost immediately you can force MapR-DB to pack segments using the "maprcli table region pack -path <Table Name> -fid all" command.

The steps below show an example where I created a table with a very short TTL but the space was not reclaimed, along with the solution to reclaim the space quickly.


1) Description of the table, which was created with a TTL of 1 minute.

hbase(main):001:0> describe '/srctable'
Table /srctable is ENABLED                                                                                                                                                         
/srctable, {TABLE_ATTRIBUTES => {MAX_FILESIZE => '4294967296', METADATA => {'AUTOSPLIT' => 'true', 'MAPR_UUID' => '5de6339a-a352-bc68-3844-0ad8b6f85900', 'MAX_VALUE_SIZE_IN_MEM' =>
'100'}}                                                                                                                                                                            
COLUMN FAMILIES DESCRIPTION                                                                                                                                                        
{NAME => 'fam0', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => '60 SECONDS (1 MINUTE)', COMPRESSI
ON => 'LZ4', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '8192', REPLICATION_SCOPE => '0', METADATA => {'compression_raw' => '2'}}    
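For anyone reproducing this, a table like the one above can be created from the HBase shell with a one-minute TTL on the column family; a minimal sketch (options other than TTL are left at their defaults):

create '/srctable', {NAME => 'fam0', TTL => 60}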

2)  Added 50 rows and recorded the size of the table.

[root@node107rhel72 ~]# /opt/mapr/server/tools/loadtest -table /srctable -numrows 50
23:01:49    0 secs        50 rows       50 rows/s    0ms latency    1ms maxLatency
Overall Rate 2941.18 rows/s, Latency 0ms

[root@node107rhel72 ~]# maprcli table info -path  /srctable -json
{
"timestamp":1509602514944,
"timeofday":"2017-11-01 11:01:54.944 GMT-0700",
"status":"OK",
"total":1,
"data":[
{
"path":"/srctable",
"numregions":1,
"totallogicalsize":90112,
"totalphysicalsize":81920,

"totalcopypendingsize":0,
"totalrows":50,
"totalnumberofspills":1,


}
]
}
3) The HBase shell does see the 50 rows that were inserted by the tool.

hbase(main):001:0> scan '/srctable'
ROW                                            COLUMN+CELL                                                                                                                         
 user1000385178204227360                       column=fam0:col_00, timestamp=1509602508622, value=ec-laeotropic-laeotropism-laeotropous-Laertes-laertes-Laertiades-Laestrygon-Laestry
                                               gones-Laestrygonians-laet-laetation-laeti-laetic-Laetitia-laetrile-laevigate-Laevigrada-laevo-laevo--laevoduction-laevogyrate-laevogyr
                                               e-laevogyrous-laevolactic-laevorotation-laevorotatory-laevotartaric-laevoversion-laevulin-laevulose-LaF-Lafarge-Lafargeville-Lafayette
                                               -lafayette-Lafca\x00                                                                                                                
 user1000385178204227360                       column=fam0:col_01, timestamp=1509602508622, value=ness-cathartics-Cathartidae-Cathartides-cathartin-Cathartolinum-Cathay-Cathayan-Cat
                                               he-cat-head-cathead-catheads-cathect-cathected-cathectic-cathecting-cathection-cathects-cathedra-cathedrae-cathedral-cathedraled-cathe
                                               dralesque-cathedralic-cathedral-like-cathedrallike-cathedrals-cathedralwise-cathedras-cathedrated-cathedratic-cathedratica-cathedratic
                                               al-cath\x00                                                                                                                         


 user4640687271668624146                       column=fam0:col_00, timestamp=1509602508622, value=ometer-urobenzoic-urobilin-urobilinemia-urobilinogen-urobilinogenuria-urobilinuria-

                                               urocanic-urocele-Urocerata-urocerid-Uroceridae-urochloralic-urochord-Urochorda-urochordal-urochordate-urochords-urochrome-urochromogen
                                               -urochs-Urocoptidae-Urocoptis-urocyanogen-Urocyon-urocyst-urocystic-Urocystis-urocystitis-urodaeum-Urodela-uro\x00                  
 user4640687271668624146                       column=fam0:col_01, timestamp=1509602508622, value=ability-inheritable-inheritableness-inheritably-inheritage-inheritance-inheritances
                                               -inherited-inheriting-inheritor-inheritors-inheritress-inheritresses-inheritrice-inheritrices-inheritrix-inherits-inherle-inhesion-inh
                                               esions-inhesive-inhiate-inhibit-inhibitable-inhibited-inhibiter-inhibiting-inhibition-inhibitionist-inhibitions-\x00                
 user4640687271668624146                       column=fam0:col_02, timestamp=1509602508622, value=d-vallancy-vallar-vallary-vallate-vallated-vallation-Valle-Valleau-Vallecito-Vallec
                                               itos-vallecula-valleculae-vallecular-valleculate-Vallejo-Vallenar-Vallery-Valletta-vallevarite-Valley-valley-valleyful-valleyite-valle
                                               ylet-valleylike-valleys-valleyward-valleywise-Valli-Valliant-vallicula-valliculae-vallicular-v\x00                                  
 user4876795174170569834                       column=fam0:col_00, timestamp=1509602508622, value=arp-biting-sharp-bottomed-sharp-breasted-sharp-clawed-sharp-cornered-sharp-cut-shar
                                               p-cutting-Sharpe-sharp-eared-sharped-sharp-edged-sharp-elbowed-sharpen-sharpened-sharpener-sharpeners-sharpening-sharpens-sharper-shar
                                               pers-Sharpes-sharpest-sharp-eye-sharp-eyed-sharp-eyes-sharp-faced-sharp-fanged-sharp-feat\x00                                       
  
 user764277275702281672                        column=fam0:col_00,          
.........................
.........................


 user9105318085603802964                       column=fam0:col_02, timestamp=1509602508622, value=er-Plutus-plutus-Pluvi-pluvial-pluvialiform-pluvialine-Pluvialis-pluvially-pluvials
                                               -pluvian-pluvine-pluviograph-pluviographic-pluviographical-pluviography-pluviometer-pluviometric-pluviometrical-pluviometrically-pluvi
                                               ometry-pluvioscope-pluvioscopic-Pluviose-pluviose-pluviosity-pluvious-Pluvius-ply-plyboard-plyer-plyers-plygain-plying-plyingly\x00 
50 row(s) in 0.3110 seconds

hbase(main):002:0> quit

4) Now wait a minute; the data will be marked for deletion because the table TTL is 1 minute, so you cannot access the data anymore, but the size of the table is not reduced.

hbase(main):001:0> scan '/srctable'
ROW                                            COLUMN+CELL                                                                                                                         
0 row(s) in 0.0450 seconds

[root@node107rhel72 ~]# maprcli table info -path  /srctable -json
{
"timestamp":1509602642299,
"timeofday":"2017-11-01 11:04:02.299 GMT-0700",
"status":"OK",
"total":1,
"data":[
{
"path":"/srctable",
"numregions":1,
"totallogicalsize":90112,
"totalphysicalsize":81920,
"totalcopypendingsize":0,
"totalrows":50,


}
]
}


5) Now run a region pack on the table, and the size of the table is reduced as expected.

[root@node107rhel72 ~]# maprcli table region pack -path  /srctable -fid all
[root@node107rhel72 ~]# maprcli table info -path  /srctable -json
{
"timestamp":1509602659308,
"timeofday":"2017-11-01 11:04:19.308 GMT-0700",
"status":"OK",
"total":1,
"data":[
{
"path":"/srctable",
"numregions":1,
"totallogicalsize":16384,
"totalphysicalsize":16384,

"totalcopypendingsize":0,
"totalrows":0,
 }
]
}
[root@node107rhel72 ~]#

Space reclaimed.
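If a table routinely expires data via TTL but is rarely rewritten, the pack can also be scheduled rather than run by hand; for example, a nightly cron entry along these lines (the schedule and the maprcli path are assumptions to adapt to your environment):

0 2 * * * /opt/mapr/bin/maprcli table region pack -path /srctable -fid all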



DEBUGGING:

There can be cases where you are told space is not reclaimed even after running a pack. The way to verify whether the data still exists on disk is to run a raw scan. If the data that was supposed to be reclaimed is still holding space, it will show up in the rawScan output.

[root@node107rhel72 ~]#  maprcli debugdb rawScan -fid 2445.32.131292 -startkey "--INFINITY" -dumpfile /tmp/rawScan.txt
[root@node107rhel72 ~]# cat /tmp/rawScan.txt
Row SpillFid FamilyDataLength EntryTime Column+Cell
user1000385178204227360 2445.54.131336 1084 1510286237058:0 column=fam0:col_00,type=put,timestamp=1510286237058,value=ec-laeotropic-laeotropism-laeotropous-Laertes-laertes-Laertiades-Laestrygon-Laestrygones-Laestrygonians-laet-laetation-laeti-laetic-Laetitia-laetrile-laevigate-Laevigrada-laevo-laevo--laevoduction-laevogyrate-laevogyre-laevogyrous-laevolactic-laevorotation-laevorotatory-laevotartaric-laevoversion-laevulin-laevulose-LaF-Lafarge-Lafargeville-Lafayette-lafayette-Lafca\x00
user1000385178204227360 2445.54.131336 1084 1510286237058:0 column=fam0:col_01,type=put,timestamp=1510286237058,value=ness-cathartics-Cathartidae-Cathartides-cathartin-Cathartolinum-Cathay-Cathayan-Cathe-cat-head-cathead-catheads-cathect-cathected-cathectic-cathecting-cathection-cathects-cathedra-cathedrae-cathedral-cathedraled-cathedralesque-cathedralic-cathedral-like-cathedrallike-cathedrals-cathedralwise-cathedras-cathedrated-cathedratic-cathedratica-cathedratical-cath\x00
user1000385178204227360 2445.54.131336 1084 1510286237058:0 column=fam0:col_02,type=put,timestamp=1510286237058,value=arian-unsectarianism-unsectarianize-unsectarianized-unsectarianizing-unsectional-unsectionalised-unsectionalized-unsectionally-unsectioned-unsecular-unsecularised-unsecularize-unsecularized-unsecularly-unsecurable-unsecurableness-unsecure-unsecured-unsecuredly-unsecuredness-unsecurely-unsecureness-unsecurity-\x00


Now, after I run the table pack, none of the previously existing data shows up.

[root@node107rhel72 ~]# maprcli table region pack -path  /srctable -fid all
[root@node107rhel72 ~]#  maprcli debugdb rawScan -fid 2445.32.131292 -startkey "--INFINITY" -dumpfile /tmp/rawScanlater.txt
[root@node107rhel72 ~]# cat /tmp/rawScanlater.txt
Row SpillFid FamilyDataLength EntryTime Column+Cell

[root@node107rhel72 ~]# 




Friday, October 27, 2017

Tuning for Fast failover



I recently worked on a very interesting case where the requirement was to test the reliability of a MapR cluster, i.e., core services like ZK and CLDB experience a node failure followed by a data-node failure in quick succession.

Below are the tunings done to improve recovery time after node failures, apart from isolating CLDB and ZK in their own topology on dedicated nodes.

1) Check network settings (all cluster and client nodes):


i) Verify the value of tcp_syn_retries is set to 4. tcp_syn_retries controls the number of TCP-layer retries for a new connection before failing. The retry occurs after the initial timeout (1 second), and the system then waits to see whether the retry succeeded before retrying again, or fails once syn_retries attempts have been made. It is an exponential back-off algorithm, so going from 2 to 3 doubles the time for the 3rd retry, 3 to 4 doubles it again, and so on. Thus going from 2 to, say, 4 is not just 2x slower, it is about 5x slower (1+2+4=7 vs. 1+2+4+8+16=31). For optimal failover behavior we recommend a low value of 4 (the Linux default is usually 5) for all nodes involved (client nodes, NFS gateway, POSIX client, and all cluster nodes).

cat /proc/sys/net/ipv4/tcp_syn_retries
4

To set the TCP retry count, set the value of tcp_syn_retries to 4 in the /proc/sys/net/ipv4/ directory. 

echo 4 > /proc/sys/net/ipv4/tcp_syn_retries

ii) Verify tcp_retries2 is set to 5. If not, set it to 5, since MapR also relies on TCP timeouts to detect transmission timeouts for active connections. Like tcp_syn_retries, this value controls the number of retries, and again it uses an exponential back-off algorithm.


cat /proc/sys/net/ipv4/tcp_retries2
5
echo 5 > /proc/sys/net/ipv4/tcp_retries2
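Values written under /proc do not survive a reboot, so it is worth persisting both settings; one way to do that (a sketch, assuming you manage kernel tunables via /etc/sysctl.conf):

cat >> /etc/sysctl.conf <<EOF
net.ipv4.tcp_syn_retries = 4
net.ipv4.tcp_retries2 = 5
EOF
sysctl -p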


2) Check MapR-FS timeout settings (FUSE and Hadoop client nodes).


To reduce the amount of time it takes for Hadoop and FUSE-based POSIX clients to detect CLDB and data-node failure, define the property fs.mapr.connect.timeout in the core-site.xml file. The minimum value for this property is 100 milliseconds.
Your entry in the core-site.xml file should look similar to the following:
<property>
 <name>fs.mapr.connect.timeout</name>
 <value>100</value>
 <description>file client wait time of 100 milliseconds</description>
</property>


3) Verify fast failover is enabled (cluster-wide parameter).


[root@node106rhel72 ~]# maprcli config load -json | grep fastfailover
"mfs.feature.fastfailover":"1",
[root@node106rhel72 ~]#

If the value is not set to 1, run the command below to enable fast failover.


maprcli config save -values {mfs.feature.fastfailover:1}


4) Default RPC request configuration (all NFS gateway, loopback NFS, and client nodes).

The default RPC request configuration can negatively impact performance and memory. To avoid performance and memory issues, configure the number of outstanding RPC requests to the NFS server to be 128. The kernel tunable sunrpc.tcp_slot_table_entries represents the number of simultaneous Remote Procedure Call (RPC) requests.


echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
echo 128 > /proc/sys/sunrpc/tcp_max_slot_table_entries

Remount the NFS client to the NFS gateway for the above values to take effect.
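Like the TCP settings above, these sunrpc values reset on reboot; they can be persisted as kernel module options (a sketch, assuming the sunrpc module picks up options from /etc/modprobe.d):

echo "options sunrpc tcp_slot_table_entries=128 tcp_max_slot_table_entries=128" > /etc/modprobe.d/sunrpc.conf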

Check data failover timing (recovered in ~5 seconds):

1) Create volume   
maprcli volume create -name vol1 -path /vol1

2) List the node/IP on which the name container master resides.
maprcli dump volumeinfo -volumename vol1 -json


3) Run dd to write a huge file to the data node.
time dd if=/dev/zero of=/mapr/<Cluster>/vol1/storagefile bs=1M count=2000000 ( Note the time for write to complete )

Now run the command again, and after a few seconds stop warden on the node whose IP you got from step 2. (Note the time for the write to complete.)


The difference in time between the two dd runs in step 3 is the time recovery took. In my test, recovery took ~5 seconds. Ideally we expect all recovery to happen well within 90 seconds.