Thursday, November 16, 2017

NFS4 Setup


MapR lets you mount the cluster via NFS so that your applications can read and write data directly. We use NFS Ganesha to support NFSv4 features. NFS Ganesha is an open-source user-space implementation of the NFS server. The MapR NFSv4 server, running as a user-space process, registers callbacks with NFS Ganesha through the File System Abstraction Layer (FSAL), which is a shared library (libfsalmapr.so). NFS Ganesha loads and uses this library whenever MapR-FS is exported/mounted. The FSAL, in turn, uses the FileClient (libMapRClient.so) to connect to the cluster.

Read and write operations reach the MapR cluster through NFSv4 as follows: when a user enters a command (such as ls), the NFS client submits the request over TCP to the MapR NFSv4 server. The NFS server uses the MapR FileClient to perform the requested operation on the cluster and returns the response to the NFS client over TCP.


Pre-req:
This blog assumes you already have a cluster installed. Run the command below to secure the cluster.

[root@node2rhel73 initscripts]# /opt/mapr/server/configure.sh -secure -genkeys -Z 10.10.70.117 -C 10.10.70.117 -N dsemapr
Configuring Hadoop-2.7.0 at /opt/mapr/hadoop/hadoop-2.7.0
Done configuring Hadoop
CLDB node list: 10.10.70.117:7222
Zookeeper node list: 10.10.70.117:5181
Node setup configuration:  cldb fileserver webserver zookeeper
Log can be found at:  /opt/mapr/logs/configure.log
Creating 10 year self signed certificate with subjectDN='CN=node2rhel73'
Zookeeper found on this node, and it is not running. Starting Zookeeper
Warden is not running. Starting mapr-warden. Warden will then start all other configured services on this node
... Starting cldb
... Starting fileserver
... Starting webserver
To further manage the system, use "maprcli", or connect browser to https://node2rhel73:8443/
To stop and start this node, use "systemctl start/stop mapr-warden "
[root@node2rhel73 initscripts]# 
Verify that ZooKeeper is up:

[root@node2rhel73 initscripts]# /opt/mapr/initscripts/zookeeper qstatus
JMX enabled by default
Using config: /opt/mapr/zookeeper/zookeeper-3.4.5/conf/zoo.cfg
Mode: standalone
[root@node2rhel73 initscripts]#

Installation:


1) Install nfs-utils on the host where the NFSv4 server will be installed.

yum install nfs-utils -y

2) rpc.statd and rpcbind should be running.

rpc.statd (formerly started by the nfs-lock service):

[root@node2rhel73 initscripts]# /sbin/rpc.statd
[root@node2rhel73 initscripts]# ps -ef| grep rpc.st
rpcuser  29159     1  2 03:35 ?        00:00:00 /sbin/rpc.statd
root     29164  5553  0 03:35 pts/0    00:00:00 grep --color=auto rpc.st
[root@node2rhel73 initscripts]# 

  • On Red Hat and CentOS v6.0 and higher, the rpcbind service (formerly portmapper)  must be running. You can use the command ps ax | grep rpcbind to check.



[root@node2rhel73 conf]# service rpcbind start
Redirecting to /bin/systemctl start  rpcbind.service

[root@node2rhel73 conf]# ps -ef| grep rpcbind
rpc       7980     1  0 20:17 ?        00:00:00 /sbin/rpcbind -w
root      8359  1168  0 20:19 pts/0    00:00:00 grep --color=auto rpcbind
[root@node2rhel73 conf]# 
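To make sure both services also come back after a reboot on a systemd-based OS such as RHEL/CentOS 7, you can additionally do the following (a minimal sketch, assuming the stock rpcbind and rpc-statd units shipped with the OS and nfs-utils are present):

systemctl enable rpcbind
systemctl start rpcbind
systemctl start rpc-statd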


3) Install the "mapr-nfs4server" package; the mapr-nfsganesha package will also be installed as a dependency.

yum install mapr-nfs4server -y

[root@node2rhel73 initscripts]# rpm -qa | grep nfs
mapr-nfsganesha-2.3.0.201704102214-1.noarch
nfs-utils-1.3.0-0.33.el7.x86_64
mapr-nfs4server-5.2.0.42783.GA-1.x86_64
libnfsidmap-0.25-15.el7.x86_64
[root@node2rhel73 initscripts]# 
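As described at the top of this post, the FSAL shared library (libfsalmapr.so) that NFS Ganesha loads comes from the mapr-nfsganesha package; a quick way to confirm it landed on disk is to list the package contents (the exact install path may vary by release):

rpm -ql mapr-nfsganesha | grep -i fsal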

4) Generate a service ticket for the NFSv4 service to use while starting.

[root@node2rhel73 initscripts]# maprlogin generateticket -type servicewithimpersonation -user mapr -out /tmp/mapr_impersonation_ticket -duration 30:0:0 -renewal 90:0:0
MapR credentials of user 'mapr' for cluster 'dsemapr' are written to '/tmp/mapr_impersonation_ticket'
[root@node2rhel73 initscripts]# cat /tmp/mapr_impersonation_ticket
dsemapr 94UewDZHYdAIewP1EGjlSmfVF0f9YPEUmY5AMnzVJGvncnlZs85cM8JP5GXGCVTh+FMv48MtmLFhxJihnIN20e7xE09/ZZAiZI8YfwJIfB56FSUWAY73iIvO+Uw8i/g+DpO4g7Hj4vVGcOxYW3LXJAOAYP5qJvXgrM/TJh2SqBBfIQQS+M+7fIz+r7RZZkEorpsnj/RPC2okXwSWk6Vwg+c1MAQkKBkBsT6wBTQ6d25yW55qECB6UILFMduJTvHgS0YQNUQIjbM4/ZZE8xuvjGEk3Sn6AvSAz5hfhOllRM7LimE=
[root@node2rhel73 initscripts]#
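Optionally, sanity-check the generated ticket before pointing the NFSv4 server at it (a quick check, assuming the maprlogin client is available on this node); it should print the cluster name, the user, and the expiry time:

maprlogin print -ticketfile /tmp/mapr_impersonation_ticket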


5) Update the ticket location (tkt_location):

[root@node2rhel73 initscripts]# grep tkt /opt/mapr/conf/nfs4server.conf 
  tkt_location = /tmp/mapr_impersonation_ticket; 
[root@node2rhel73 initscripts]# 

Update the export path (Path) and pseudo path (Pseudo):

[root@node2rhel73 initscripts]# grep -w "\/mapr" /opt/mapr/conf/nfs4server.conf 
  Path = /mapr;
  Pseudo = /mapr;
[root@node2rhel73 initscripts]# 

Comment out the krb5 SecType, since we are not using Kerberos with the NFSv4 server:

[root@node2rhel73 initscripts]# grep SecType /opt/mapr/conf/nfs4server.conf 
  # SecType = krb5;
[root@node2rhel73 initscripts]# 
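Putting the edits together, the relevant EXPORT section of nfs4server.conf ends up looking roughly like the sketch below. This is illustrative only: Path, Pseudo, and SecType are the fields touched above, Export_Id matches the export id shown later by nfs4mgr list-exports, and Access_Type is a standard NFS Ganesha export option; your shipped file may contain additional fields.

EXPORT
{
  Export_Id = 30;
  Path = /mapr;
  Pseudo = /mapr;
  Access_Type = RW;
  # SecType = krb5;
}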

6) Start the NFSv4 server.

[root@node2rhel73 initscripts]# service mapr-nfs4server start
Redirecting to /bin/systemctl start  mapr-nfs4server.service
[root@node2rhel73 initscripts]# service mapr-nfs4server status
Redirecting to /bin/systemctl status  mapr-nfs4server.service
mapr-nfs4server.service - MapR Technologies, Inc. NFSv4 Server
   Loaded: loaded (/etc/systemd/system/mapr-nfs4server.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-11-13 03:48:53 EST; 8s ago
  Process: 1622 ExecStart=/opt/mapr/initscripts/mapr-nfs4server start (code=exited, status=0/SUCCESS)
 Main PID: 1682 (nfs4server)
   CGroup: /system.slice/mapr-nfs4server.service
           ├─1682 /opt/mapr/bin/nfs4server -L /opt/mapr/logs/nfs4/nfs4server.log -f /opt/mapr/conf/nfs4server.conf -d
           └─1683 /opt/mapr/bin/ganesha.nfsd -F -L /opt/mapr/logs/nfs4/nfs4server.log -f /opt/mapr/conf/nfs4server.conf

Nov 13 03:48:50 node2rhel73 systemd[1]: Starting MapR Technologies, Inc. NFSv4 Server...
Nov 13 03:48:53 node2rhel73 systemd[1]: Started MapR Technologies, Inc. NFSv4 Server.
[root@node2rhel73 initscripts]# 

7) Show what is exported, then mount the NFSv4 export to access the cluster.

[root@node2rhel73 initscripts]# showmount -e
Export list for node2rhel73:
/mapr (everyone)
[root@node2rhel73 conf]# /opt/mapr/server/nfs4mgr list-exports
ExportId      Path
30            /mapr
0             /


[root@node2rhel73 conf]# 

[root@node2rhel73 initscripts]# mount -t nfs4 node2rhel73:/mapr /mapr
[root@node2rhel73 initscripts]# df -hP /mapr
Filesystem         Size  Used Avail Use% Mounted on
node2rhel73:/mapr  119G  493M  118G   1% /mapr
[root@node2rhel73 initscripts]# mount | grep mapr
node2rhel73:/mapr on /mapr type nfs4 (rw,relatime,vers=4.0,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.70.117,local_lock=none,addr=10.10.70.117)
[root@node2rhel73 initscripts]#

[root@node2rhel73 initscripts]# cd /mapr/dsemapr/
[root@node2rhel73 dsemapr]# ls
a  abizer  apps  hbase  mysql  opt  softlink  test  tmp  user  var

[root@node2rhel73 dsemapr]# hadoop fs -ls /
Found 11 items
-rwxrwxrwx   3 mapr mapr          0 2017-04-27 01:20 /a
drwxrwxrwx   - 26   26            1 2017-08-03 18:30 /abizer
drwxrwxrwx   - 26   26            0 2017-04-20 16:08 /apps
drwxrwxrwx   - 26   26            0 2017-04-20 16:08 /hbase
drwxrwxrwx   - 26   26            0 2017-08-25 23:21 /mysql
drwxrwxrwx   - 26   26            0 2017-04-20 16:09 /opt
-rwxrwxrwx   - 26   26            6 2017-08-07 18:11 /softlink
-rwxrwxrwx   3 26   26            0 2017-04-27 21:15 /test
drwxrwxrwx   - 26   26            2 2017-09-22 01:39 /tmp
drwxrwxrwx   - 26   26            1 2017-08-03 18:43 /user
drwxrwxrwx   - 26   26            1 2017-04-20 16:08 /var
[root@node2rhel73 dsemapr]# 
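To have the mount survive reboots, an /etc/fstab entry along these lines can be added (a sketch built from the options visible in the mount output above; adjust the host and paths for your cluster):

node2rhel73:/mapr  /mapr  nfs4  hard,proto=tcp,timeo=600,_netdev  0 0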








Monday, November 13, 2017

Fair Scheduler resource updates




Recently I had an interesting case where some resources in a Fair Scheduler queue were updated, yet the scheduler page did not show the updated values. The main concern was whether the application team would get the extra resources they were paying for and, if so, whether there was a bug in the scheduler UI.

1) On checking the RM logs it was clear the file was indeed being read, but the question was why the values were not updated.

2017-11-13 16:18:18,124 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:28,127 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:38,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:48,133 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
2017-11-13 16:18:58,137 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml

2017-11-13 16:19:08,141 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService: Loading allocation file /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml

2) I turned on scheduler debug logging to narrow down whether it was just a scheduler UI issue or whether the values were genuinely not updated in the scheduler. Reviewing the debug logs showed that the scheduler only knew about the old values.

yarn daemonlog -setlevel <RM Hostname>:8088 org.apache.hadoop.yarn.server.resourcemanager.scheduler DEBUG

One step in the right direction: now we knew it was not the scheduler UI but something wrong with the scheduler itself.

3) Assuming the user was hitting some kind of bug when updating multi-level queues, I decided to reproduce the issue in-house, but everything worked fine and all the updates showed up on the scheduler page.

This confirmed it was an environment issue or something wrong with the customer's fair-scheduler.xml file.

4) I tried to load the customer's fair-scheduler.xml in my local repro to check whether the file was readable or whether there was some issue with its format. My logs also showed the file being read with no errors reported, but the scheduler page did not show the new queues.

5) Finally I restarted the RM, hoping it would read the file and display the queues in the scheduler.


Bingo!!! This time the RM failed to come up and logged the messages below, which gave me the clue for the root cause of the issue.

Caused by: java.io.IOException: Failed to initialize FairScheduler
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1441)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1458)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 7 more
Caused by: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationConfigurationException: Bad fair scheduler config file: queue name (mapr.general) shouldn't contain period.
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:437)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:516)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.loadQueue(AllocationFileLoaderService.java:516)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:355)

at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1439)


As seen from the stack trace, the fair-scheduler.xml had an issue: while updating the file, someone had configured a queue named (mapr.general), and a queue name can never contain a period. This caused the allocation file to fail to load, so the scheduler was never updated.
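For reference, the Fair Scheduler expresses a hierarchical queue by nesting <queue> elements rather than by putting a period in the name. A corrected fair-scheduler.xml fragment would look roughly like the sketch below (the resource values are illustrative, not the customer's actual settings):

<allocations>
  <queue name="mapr">
    <queue name="general">
      <minResources>10000 mb,10 vcores</minResources>
      <maxResources>50000 mb,50 vcores</maxResources>
    </queue>
  </queue>
</allocations>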


Key takeaway:

After updating your configs, always validate that the updated queue resources show up. It is much easier to catch an issue right after a change than to backtrack the problem later with no clue about what someone else has done.
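Two cheap checks that would have caught this earlier (a sketch, assuming the default RM port and the allocation file path from the logs above): verify the XML is at least well formed, and ask the RM REST API which queues the scheduler actually loaded.

xmllint --noout /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/fair-scheduler.xml
curl -s http://<RM Hostname>:8088/ws/v1/cluster/scheduler | grep -o '"queueName":"[^"]*"'

Note that xmllint only catches XML syntax errors; it will not flag a semantic problem like a period in a queue name, which is why checking the scheduler's own view of the queues matters.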




Thursday, November 9, 2017

Space not getting reclaimed for MapR DB table




When data is deleted based on TTL, it is removed from disk the next time the segment is packed, and only then is the space reclaimed. Unlike HBase, MapR-DB does not routinely rewrite all data; this approach reduces write amplification. The side effect is that storage which should be released will not be released if, due to the write pattern, the same segment is not being actively updated. To reclaim the space almost immediately you can force MapR-DB to pack segments using the "maprcli table region pack -path <Table Name> -fid all" command.

The steps below show an example where I created a table with a very short TTL but the space was not reclaimed, along with the solution to reclaim the space quickly.


1) Description of the table, which was created with a TTL of 1 minute.

hbase(main):001:0> describe '/srctable'
Table /srctable is ENABLED                                                                                                                                                         
/srctable, {TABLE_ATTRIBUTES => {MAX_FILESIZE => '4294967296', METADATA => {'AUTOSPLIT' => 'true', 'MAPR_UUID' => '5de6339a-a352-bc68-3844-0ad8b6f85900', 'MAX_VALUE_SIZE_IN_MEM' =>
'100'}}                                                                                                                                                                            
COLUMN FAMILIES DESCRIPTION                                                                                                                                                        
{NAME => 'fam0', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => '60 SECONDS (1 MINUTE)', COMPRESSI
ON => 'LZ4', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '8192', REPLICATION_SCOPE => '0', METADATA => {'compression_raw' => '2'}}    
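For anyone reproducing this, a table like the one above can be created from the HBase shell with a one-minute TTL on the column family; a minimal sketch (options other than TTL are left at their defaults):

create '/srctable', {NAME => 'fam0', TTL => 60}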

2)  Added 50 rows and recorded the size of the table.

[root@node107rhel72 ~]# /opt/mapr/server/tools/loadtest -table /srctable -numrows 50
23:01:49    0 secs        50 rows       50 rows/s    0ms latency    1ms maxLatency
Overall Rate 2941.18 rows/s, Latency 0ms

[root@node107rhel72 ~]# maprcli table info -path  /srctable -json
{
"timestamp":1509602514944,
"timeofday":"2017-11-01 11:01:54.944 GMT-0700",
"status":"OK",
"total":1,
"data":[
{
"path":"/srctable",
"numregions":1,
"totallogicalsize":90112,
"totalphysicalsize":81920,

"totalcopypendingsize":0,
"totalrows":50,
"totalnumberofspills":1,


}
]
}
3) The HBase shell does see the 50 rows that were inserted by the tool.

hbase(main):001:0> scan '/srctable'
ROW                                            COLUMN+CELL                                                                                                                         
 user1000385178204227360                       column=fam0:col_00, timestamp=1509602508622, value=ec-laeotropic-laeotropism-laeotropous-Laertes-laertes-Laertiades-Laestrygon-Laestry
                                               gones-Laestrygonians-laet-laetation-laeti-laetic-Laetitia-laetrile-laevigate-Laevigrada-laevo-laevo--laevoduction-laevogyrate-laevogyr
                                               e-laevogyrous-laevolactic-laevorotation-laevorotatory-laevotartaric-laevoversion-laevulin-laevulose-LaF-Lafarge-Lafargeville-Lafayette
                                               -lafayette-Lafca\x00                                                                                                                
 user1000385178204227360                       column=fam0:col_01, timestamp=1509602508622, value=ness-cathartics-Cathartidae-Cathartides-cathartin-Cathartolinum-Cathay-Cathayan-Cat
                                               he-cat-head-cathead-catheads-cathect-cathected-cathectic-cathecting-cathection-cathects-cathedra-cathedrae-cathedral-cathedraled-cathe
                                               dralesque-cathedralic-cathedral-like-cathedrallike-cathedrals-cathedralwise-cathedras-cathedrated-cathedratic-cathedratica-cathedratic
                                               al-cath\x00                                                                                                                         


 user4640687271668624146                       column=fam0:col_00, timestamp=1509602508622, value=ometer-urobenzoic-urobilin-urobilinemia-urobilinogen-urobilinogenuria-urobilinuria-

                                               urocanic-urocele-Urocerata-urocerid-Uroceridae-urochloralic-urochord-Urochorda-urochordal-urochordate-urochords-urochrome-urochromogen
                                               -urochs-Urocoptidae-Urocoptis-urocyanogen-Urocyon-urocyst-urocystic-Urocystis-urocystitis-urodaeum-Urodela-uro\x00                  
 user4640687271668624146                       column=fam0:col_01, timestamp=1509602508622, value=ability-inheritable-inheritableness-inheritably-inheritage-inheritance-inheritances
                                               -inherited-inheriting-inheritor-inheritors-inheritress-inheritresses-inheritrice-inheritrices-inheritrix-inherits-inherle-inhesion-inh
                                               esions-inhesive-inhiate-inhibit-inhibitable-inhibited-inhibiter-inhibiting-inhibition-inhibitionist-inhibitions-\x00                
 user4640687271668624146                       column=fam0:col_02, timestamp=1509602508622, value=d-vallancy-vallar-vallary-vallate-vallated-vallation-Valle-Valleau-Vallecito-Vallec
                                               itos-vallecula-valleculae-vallecular-valleculate-Vallejo-Vallenar-Vallery-Valletta-vallevarite-Valley-valley-valleyful-valleyite-valle
                                               ylet-valleylike-valleys-valleyward-valleywise-Valli-Valliant-vallicula-valliculae-vallicular-v\x00                                  
 user4876795174170569834                       column=fam0:col_00, timestamp=1509602508622, value=arp-biting-sharp-bottomed-sharp-breasted-sharp-clawed-sharp-cornered-sharp-cut-shar
                                               p-cutting-Sharpe-sharp-eared-sharped-sharp-edged-sharp-elbowed-sharpen-sharpened-sharpener-sharpeners-sharpening-sharpens-sharper-shar
                                               pers-Sharpes-sharpest-sharp-eye-sharp-eyed-sharp-eyes-sharp-faced-sharp-fanged-sharp-feat\x00                                       
  
 user764277275702281672                        column=fam0:col_00,          
.........................
.........................


 user9105318085603802964                       column=fam0:col_02, timestamp=1509602508622, value=er-Plutus-plutus-Pluvi-pluvial-pluvialiform-pluvialine-Pluvialis-pluvially-pluvials
                                               -pluvian-pluvine-pluviograph-pluviographic-pluviographical-pluviography-pluviometer-pluviometric-pluviometrical-pluviometrically-pluvi
                                               ometry-pluvioscope-pluvioscopic-Pluviose-pluviose-pluviosity-pluvious-Pluvius-ply-plyboard-plyer-plyers-plygain-plying-plyingly\x00 
50 row(s) in 0.3110 seconds

hbase(main):002:0> quit

4) Now wait a minute; the data will be marked for deletion because the table TTL is 1 minute, so you cannot access the data anymore, but the size of the table is not reduced.

hbase(main):001:0> scan '/srctable'
ROW                                            COLUMN+CELL                                                                                                                         
0 row(s) in 0.0450 seconds

[root@node107rhel72 ~]# maprcli table info -path  /srctable -json
{
"timestamp":1509602642299,
"timeofday":"2017-11-01 11:04:02.299 GMT-0700",
"status":"OK",
"total":1,
"data":[
{
"path":"/srctable",
"numregions":1,
"totallogicalsize":90112,
"totalphysicalsize":81920,
"totalcopypendingsize":0,
"totalrows":50,


}
]
}


5) Now run a region pack on the table, and the size of the table is reduced as expected.

[root@node107rhel72 ~]# maprcli table region pack -path  /srctable -fid all
[root@node107rhel72 ~]# maprcli table info -path  /srctable -json
{
"timestamp":1509602659308,
"timeofday":"2017-11-01 11:04:19.308 GMT-0700",
"status":"OK",
"total":1,
"data":[
{
"path":"/srctable",
"numregions":1,
"totallogicalsize":16384,
"totalphysicalsize":16384,

"totalcopypendingsize":0,
"totalrows":0,
 }
]
}
[root@node107rhel72 ~]#

Space reclaimed.
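If a table routinely expires data via TTL but is rarely rewritten, the pack can also be scheduled rather than run by hand; for example, a nightly cron entry along these lines (the schedule and the maprcli path are assumptions to adapt to your environment):

0 2 * * * /opt/mapr/bin/maprcli table region pack -path /srctable -fid all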



DEBUGGING:

There can be cases where you are told space is not reclaimed even after running a pack. The way to verify whether the data still exists on disk is to run a raw scan. If the data that was supposed to be reclaimed is still holding space, it will show up in the rawScan output.

[root@node107rhel72 ~]#  maprcli debugdb rawScan -fid 2445.32.131292 -startkey "--INFINITY" -dumpfile /tmp/rawScan.txt
[root@node107rhel72 ~]# cat /tmp/rawScan.txt
Row SpillFid FamilyDataLength EntryTime Column+Cell
user1000385178204227360 2445.54.131336 1084 1510286237058:0 column=fam0:col_00,type=put,timestamp=1510286237058,value=ec-laeotropic-laeotropism-laeotropous-Laertes-laertes-Laertiades-Laestrygon-Laestrygones-Laestrygonians-laet-laetation-laeti-laetic-Laetitia-laetrile-laevigate-Laevigrada-laevo-laevo--laevoduction-laevogyrate-laevogyre-laevogyrous-laevolactic-laevorotation-laevorotatory-laevotartaric-laevoversion-laevulin-laevulose-LaF-Lafarge-Lafargeville-Lafayette-lafayette-Lafca\x00
user1000385178204227360 2445.54.131336 1084 1510286237058:0 column=fam0:col_01,type=put,timestamp=1510286237058,value=ness-cathartics-Cathartidae-Cathartides-cathartin-Cathartolinum-Cathay-Cathayan-Cathe-cat-head-cathead-catheads-cathect-cathected-cathectic-cathecting-cathection-cathects-cathedra-cathedrae-cathedral-cathedraled-cathedralesque-cathedralic-cathedral-like-cathedrallike-cathedrals-cathedralwise-cathedras-cathedrated-cathedratic-cathedratica-cathedratical-cath\x00
user1000385178204227360 2445.54.131336 1084 1510286237058:0 column=fam0:col_02,type=put,timestamp=1510286237058,value=arian-unsectarianism-unsectarianize-unsectarianized-unsectarianizing-unsectional-unsectionalised-unsectionalized-unsectionally-unsectioned-unsecular-unsecularised-unsecularize-unsecularized-unsecularly-unsecurable-unsecurableness-unsecure-unsecured-unsecuredly-unsecuredness-unsecurely-unsecureness-unsecurity-\x00


Now, after I run the table pack, none of the previously existing data shows up.

[root@node107rhel72 ~]# maprcli table region pack -path  /srctable -fid all
[root@node107rhel72 ~]#  maprcli debugdb rawScan -fid 2445.32.131292 -startkey "--INFINITY" -dumpfile /tmp/rawScanlater.txt
[root@node107rhel72 ~]# cat /tmp/rawScanlater.txt
Row SpillFid FamilyDataLength EntryTime Column+Cell

[root@node107rhel72 ~]# 




Friday, October 27, 2017

Tuning for Fast failover



I recently worked on a very interesting case where the requirement was to test the reliability of a MapR cluster, i.e., core services like ZK and CLDB experience a node failure followed by a data-node failure in quick succession.

Below are the tunings done to improve recovery time after node failures, apart from isolating CLDB and ZK in their own topology on dedicated nodes.

1) Check network settings (all cluster and client nodes):


i) Verify the value of tcp_syn_retries is set to 4. tcp_syn_retries controls the number of TCP-layer retries for a new connection before failing. The retry occurs after the initial timeout (1 second), and the system then waits to see whether the retry succeeded before retrying again, or fails once syn_retries attempts have been made. It is an exponential back-off algorithm, so going from 2 to 3 doubles the time for the 3rd retry, 3 to 4 doubles it again, and so on. Thus going from 2 to, say, 4 is not just 2x slower, it is about 5x slower (1+2+4=7 vs. 1+2+4+8+16=31). For optimal failover behavior we recommend a low value of 4 (the Linux default is usually 5) for all nodes involved (client nodes, NFS gateway, POSIX client, and all cluster nodes).

cat /proc/sys/net/ipv4/tcp_syn_retries
4

To set the TCP retry count, set the value of tcp_syn_retries to 4 in the /proc/sys/net/ipv4/ directory. 

echo 4 > /proc/sys/net/ipv4/tcp_syn_retries

ii) Verify tcp_retries2 is set to 5. If not, set it to 5, since MapR also relies on TCP timeouts to detect transmission timeouts for active connections. Like tcp_syn_retries, this value controls the number of retries, and again it uses an exponential back-off algorithm.


cat /proc/sys/net/ipv4/tcp_retries2
5
echo 5 > /proc/sys/net/ipv4/tcp_retries2
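Values written under /proc do not survive a reboot, so it is worth persisting both settings; one way to do that (a sketch, assuming you manage kernel tunables via /etc/sysctl.conf):

cat >> /etc/sysctl.conf <<EOF
net.ipv4.tcp_syn_retries = 4
net.ipv4.tcp_retries2 = 5
EOF
sysctl -p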


2) Check MapR-FS timeout settings (FUSE and Hadoop client nodes).


To reduce the amount of time it takes for Hadoop and FUSE-based POSIX clients to detect CLDB and data-node failure, define the property fs.mapr.connect.timeout in the core-site.xml file. The minimum value for this property is 100 milliseconds.
Your entry in the core-site.xml file should look similar to the following:
<property>
 <name>fs.mapr.connect.timeout</name>
 <value>100</value>
 <description>file client wait time of 100 milliseconds</description>
</property>


3) Verify fast failover is enabled (cluster-wide parameter).


[root@node106rhel72 ~]# maprcli config load -json | grep fastfailover
"mfs.feature.fastfailover":"1",
[root@node106rhel72 ~]#

If the value is not set to 1, run the command below to enable fast failover.


maprcli config save -values {mfs.feature.fastfailover:1}


4) Default RPC request configuration (all NFS gateway, loopback NFS, and client nodes).

The default RPC request configuration can negatively impact performance and memory. To avoid performance and memory issues, configure the number of outstanding RPC requests to the NFS server to be 128. The kernel tunable sunrpc.tcp_slot_table_entries represents the number of simultaneous Remote Procedure Call (RPC) requests.


echo 128 > /proc/sys/sunrpc/tcp_slot_table_entries
echo 128 > /proc/sys/sunrpc/tcp_max_slot_table_entries

Remount the NFS client to the NFS gateway for the above values to take effect.
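Like the TCP settings above, these sunrpc values reset on reboot; they can be persisted as kernel module options (a sketch, assuming the sunrpc module picks up options from /etc/modprobe.d):

echo "options sunrpc tcp_slot_table_entries=128 tcp_max_slot_table_entries=128" > /etc/modprobe.d/sunrpc.conf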

Check data failover timing (recovered in ~5 seconds):

1) Create volume   
maprcli volume create -name vol1 -path /vol1

2) List the node/IP on which the name container master resides.
maprcli dump volumeinfo -volumename vol1 -json


3) Run dd to write a huge file to the data node.
time dd if=/dev/zero of=/mapr/<Cluster>/vol1/storagefile bs=1M count=2000000 ( Note the time for write to complete )

Now run the command again, and after a few seconds stop warden on the node whose IP you got from step 2. (Note the time for the write to complete.)


The difference in time between the two dd runs in step 3 is the time recovery took. In my test, recovery took ~5 seconds. Ideally we expect all recovery to happen well within 90 seconds.