Saturday, December 31, 2016

Identify Network latency



Sometimes file client processes do not get timely responses from MFS processes. This could be a problem in the MFS server process itself, which may be busy (e.g. handling many requests or churning CPU), or it could be network latency. For this post, let's assume there is no problem in MFS and that MFS responds to requests as soon as they land in its work queue.

First, do a quick RPC test to check whether there is any obvious network issue where the connection between some nodes is simply failing. If the test in the blog below comes out clean, continue reading.

http://abizeradenwala.blogspot.com/2016/11/quick-network-test-for-mapr-cluster.html

To troubleshoot network latency, one thing to check is the send queue size for open TCP connections (the third column in the output of "netstat -pan"). On a normally operating network, where the source and destination machines have free memory/CPU and there is no bottleneck on their network interfaces, the send queue size should not be more than a few thousand bytes at most. If you see many connections with 10K+ bytes in the send queue, that generally indicates some sort of problem.
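For example, a quick one-liner to list such connections (a minimal sketch; the 10K-byte threshold is just the rule of thumb above, and the column positions assume standard Linux netstat output):

# Print TCP connections whose send queue (3rd column) exceeds ~10 KB
netstat -pan | awk '$1 ~ /^tcp/ && ($3 + 0) > 10000 {print $3, $4, $5, $7}'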

First check the send queue sizes for all open connections on all nodes; then you can identify whether there is a cluster-wide network issue, a node-specific network issue, or no network issue at all.

i) If you see connections on many different nodes with large send queue sizes, and the destinations of those connections are also to a wide variety of different nodes then that indicates a cluster wide issue such as a faulty switch.

ii) If you see lots of connections with large send queue sizes but only on one particular node, that would typically indicate that one node is having trouble sending data out onto the network.

iii) If you see connections on lots of different nodes with large send queue sizes and the destinations of those connections are all to one particular node then that indicates that one particular node is having trouble receiving data.

Further, after observing these large send queue sizes, take a packet capture for the specific ports of the connections with huge queue sizes. Analyzing the capture in Wireshark will typically show lots of TCP retransmits/connection resets, indicating packet loss in the network.
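For example, a capture limited to one suspect connection can be taken with tcpdump (a sketch; the interface, peer IP and port are placeholders to be replaced with the values seen in netstat, e.g. MFS port 5660):

# Capture traffic for the suspect peer/port into a file that can be opened in Wireshark
tcpdump -i eth0 host 10.10.70.110 and port 5660 -w /tmp/suspect-conn.pcap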

The moral of the story is that you can collect "netstat -pan" output from across the cluster a few times, say every 10 seconds, to identify whether there are persistent large send queue sizes for connections, in which case you likely have a network issue of some sort.
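A minimal collection sketch (run it on each node, or wrap it in ssh/clush if you prefer):

# Snapshot netstat output every 10 seconds for ~5 minutes on this node
for i in $(seq 1 30); do
  date >> /tmp/netstat-$(hostname).out
  netstat -pan >> /tmp/netstat-$(hostname).out
  sleep 10
done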

Wednesday, December 28, 2016

Best Practice for MapR cluster (100+ nodes)



In a large cluster (100 nodes or more), control services need to live on their own nodes and not share hardware, to avoid any resource contention.

CLDB service :


Create CLDB-only nodes to ensure high performance. Setting up CLDB-only nodes involves restricting the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Because both the CLDB-only path and the non-CLDB path are children of the root topology path, new non-CLDB volumes are not guaranteed to keep off the CLDB-only nodes. This configuration also provides additional control over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). 
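As a sketch (the /cldbonly and /data topology names and the server IDs are placeholders; get the server IDs from "maprcli node list"), the CLDB nodes and volume can be restricted to their own topology and other volumes kept off it with:

# Move the CLDB nodes into a dedicated topology
maprcli node move -serverids <cldb-node-serverids> -topology /cldbonly
# Pin the CLDB volume to that topology
maprcli volume move -name mapr.cldb.internal -topology /cldbonly
# Make new volumes default to a non-CLDB topology
maprcli config save -values '{"cldb.default.volume.topology":"/data"}'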


CLDB nodes will serve just the CLDB data (the mapr.cldb.internal volume). That data consists of a single container, CID 1, which in turn is stored on a single storage pool. Hence, all disks given to MapR on the CLDB nodes should be placed in a single storage pool; otherwise the disks in the storage pools not containing the CLDB volume data will always be idle. Activities like bringing all of MapR-FS online after a full cluster restart require non-trivial disk IO from the CLDB, so ensure there are enough disks to service the ops required based on cluster size (e.g. number of containers and number of nodes).
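A sketch for creating a single storage pool across all CLDB-node disks (assuming 4 data disks listed in /tmp/disks.txt; -W sets the number of disks per storage pool):

# One SP spanning all 4 disks, so none of them sits idle
/opt/mapr/server/disksetup -F -W 4 /tmp/disks.txt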


Using 3 or 4 flash SSDs in one SP (a 20 GB SP) on the CLDB-only nodes should provide good performance, while 128 GB of memory and 32 cores with a 10GbE network would be sufficient.



ZK service :

Isolate ZooKeeper on nodes that do not perform any other function. Isolating the ZooKeeper node enables it to perform its functions without competing for resources with other processes. Installing a ZooKeeper-only node is similar to any typical node installation, but with a specific subset of packages: do not install the FileServer package on ZK nodes, to prevent MapR from using them for data storage (see the install sketch below).
64 GB of memory and 32 cores with a 10GbE network would be sufficient, while no disks should be given to MapR-FS.
On ZooKeeper nodes, dedicate a 100 GB partition for the /opt/mapr/zkdata directory to avoid other processes filling that partition with writes and to reduce the possibility of errors due to a full /opt/mapr/zkdata directory. This directory is used to store a number of snapshots. Do not share the physical disk where /opt/mapr/zkdata resides with any MapR File System data partitions, to avoid I/O conflicts that might lead to ZooKeeper service failures.
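A minimal install sketch for a ZooKeeper-only node on a yum-based system (the package name assumes the standard MapR repository is configured; note that mapr-fileserver is deliberately not installed):

# Install only the ZooKeeper service package; dependencies such as mapr-core come with it
yum install -y mapr-zookeeper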

RM nodes :
  Isolate the RMs on nodes that do not perform any other function but resource management for the YARN cluster.
 6 x 1 TB SAS disks (two 3 TB SPs), 128 GB of memory and 32 cores with a 10GbE network would be sufficient.

Note :- Yarn behavior 
- RM volume is a standard volume under /data. 
- NM volumes are local volumes, they reside only on NM nodes.
- When a job is submitted, the job's dependencies are first copied over to the RM volume, then localized to each NM volume, i.e. the job client submits the jar to the RM volume, then each NM copies the jar from the RM volume.

RM data (mapr.resourcemanager.volume) volume topology:


There are two different best practices for a large cluster, depending on whether an RM-only topology is needed:

  1) If the cluster is homogeneous and each node has equivalent network capacity, do not move the RM volume to an RM-only topology. Keep the RM volume under /data so the load is distributed.
  2) If the cluster is heterogeneous, the RM volume needs to be on an RM-only topology to avoid job failures caused by the AM failing to write the commit file on the RM volume when MFS is busy; the RM topology nodes should have sufficient network and disk capacity to support the heavy data copying between RM and NM nodes (see the sketch after this list).
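For the heterogeneous case, a sketch of pinning the RM volume (the /rm-only topology name is a placeholder, and the RM nodes must already have been moved into that topology):

# Move the ResourceManager volume to the dedicated RM topology
maprcli volume move -name mapr.resourcemanager.volume -topology /rm-only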


Thursday, December 22, 2016

Fixing Spill inconsistency in MapRDB


While doing a consistency check we find that one tablet, "2236.32.131230", has problems and its consistency check is failing. Below are the steps to identify and fix the issue.

Caution :- Please do not run step 7 from this blog without consulting MapR Support, since the command is a very low-level command and needs to be executed with extreme caution.


1) Run the consistency check as described in http://abizeradenwala.blogspot.com/2016/12/mapr-db-tablet-consistency-check.html ; it fails with the error below.

maprcli debugdb checkTablet -fid 2236.32.131230 -startkey user5137029094653148843 -endkey user7197423315127484105 -tracefile t1
ERROR (10009) -  fs rpc failed

2) Get the master node of the container which holds the tablet.

 maprcli dump containerinfo -ids 2236 -json
{
"timestamp":1482450014994,
"timeofday":"2016-12-22 03:40:14.994 GMT-0800",
"status":"OK",
"total":1,
"data":[
{
"ContainerId":2236,
"Epoch":5,
"Master":"10.10.70.177:5660--5-VALID",
"ActiveServers":{
"IP:Port":[
"10.10.70.110:5660--5-VALID",
"10.10.70.109:5660--5-VALID",
"10.10.70.111:5660--5-VALID"

3) Now look at /opt/mapr/logs/mfs.log-3 on the master node of the container. The errors below are logged, which point at the issue, i.e. a failure reading spill 2236.2319.147898.

2016-12-22 15:38:15,2490 ERROR DB db/mfsread.cc:240 ***********FileRead RPC 2236.2319.147898 failed: 116      ----->   RPC error occurred since read RPC failed
2016-12-22 15:38:15,2491 ERROR DB tabletrangecheck.cc:3222 CheckSpillHeaderReadSME : read of spill 2236.2319.147898 failed 116
2016-12-22 15:38:15,2491 ERROR DB tabletrangecheck.cc:1708 TabletRangeCheckSpillmapProcess : child error 116 for spillmap 2236.2299.147866
2016-12-22 15:38:15,2491 ERROR DB tabletrangecheck.cc:1417 TabletRangeCheckSegmapProcess : child error 116 for segmap 2236.1747.134662

4) From the tablet fid we need to get the segment fid 2236.1747.134662, which is also printed in mfs.log-3 above.

maprcli debugdb dump -fid 2236.32.131230
value                                                                                                                                                                                                                                                                                              key                               
{"value":{}}                                                                                                                                                                                                                                                                                       endkey.user7197423315127484105    
{"value":{"segfid":"<parentCID>.1747.134662","isFrozen":false,"inSplit":false,"useBucketDesc":true,"lastFlushedBucketFid":"2236.968.145294","numLogicalBlocks":174220,"numPhysicalBlocks":91527,"numRows":76195,"numRowsWithDelete":0,"numRemoteBlocks":0,"numSpills":868,"numSegments":323}}      pmap.user5137029094653148843      
{"value":{"segfid":"<parentCID>.1748.134664","isFrozen":false,"inSplit":false,"useBucketDesc":true,"lastFlushedBucketFid":"2236.687.144732","numLogicalBlocks":179179,"numPhysicalBlocks":94082,"numRows":78300,"numRowsWithDelete":0,"numRemoteBlocks":0,"numSpills":873,"numSegments":342}}      pmap.user5666253514688897233      
{"value":{"segfid":"<parentCID>.1176.133520","isFrozen":false,"inSplit":false,"useBucketDesc":true,"lastFlushedBucketFid":"2236.1516.146388","numLogicalBlocks":390865,"numPhysicalBlocks":179574,"numRows":142771,"numRowsWithDelete":0,"numRemoteBlocks":0,"numSpills":2421,"numSegments":677}}  pmap.user6214000520416211177      
{"value":{}}                                                                                                                                                                                                                                                                                       startkey.user5137029094653148843  

5) Now from segment fid get the spill map for the spill.

maprcli debugdb dump -fid 2236.1747.134662

value                                        key                      
{"value":{"fid":"<parentCID>.988.145334"}}   user5137029094653148843  
{"value":{"fid":"<parentCID>.989.145336"}}   user5139360914320437622  
{"value":{"fid":"<parentCID>.1264.145884"}}  user5646406973934510610  
{"value":{"fid":"<parentCID>.4804.140744"}}  user5647105917867305914  
{"value":{"fid":"<parentCID>.1267.145890"}}  user5648419690992190947  
{"value":{"fid":"<parentCID>.1268.145892"}}  user5651953154952706923  
{"value":{"fid":"<parentCID>.2299.147866"}}  user5652099992370546090  
{"value":{"fid":"<parentCID>.2300.147868"}}  user5654138608241368186  
{"value":{"fid":"<parentCID>.4809.140754"}}  user5654624228442534993  
{"value":{"fid":"<parentCID>.3398.137948"}}  user5656079376459834156  
{"value":{"fid":"<parentCID>.1269.145894"}}  user5658158163442262620  
{"value":{"fid":"<parentCID>.1270.145896"}}  user5660771976780382135  
{"value":{"fid":"<parentCID>.3924.139000"}}  user5661016661850577428  
{"value":{"fid":"<parentCID>.1275.145906"}}  user5662697508043691134  
{"value":{"fid":"<parentCID>.1276.145908"}}  user566617831443475188   

6) Now from the spill map we get the spill fid and the key "0" associated with it, which has the problem.

maprcli debugdb dump -fid 2236.2299.147866 -json
{
"timestamp":1482450468536,
"timeofday":"2016-12-22 03:47:48.536 GMT-0800",
"status":"OK",
"total":1,
"data":[
{
"key":0,
"numRemoteBlocks":0,
"numSpills":0,
"numSegments":0,
"value":{
"fid":"<parentCID>.2319.147898",
"smeSize":342,
"keyIdxOffset":12,
"keyIdxLength":45806,
"ldbIdxLength":126,
"bloomBitsPerKey":26,
"numLogicalBlocks":606,
"numPhysicalBlocks":360,
"numRows":305,
"numRowsWithDelete":0,
"families":{
"id":[
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11
],
"offset":[
524288,
983040,
1441792,
1900544,
2359296,
2818048,
3276800,
3735552,
4194304,
4653056,
5111808
],
"length":[
433957,
433561,
433523,
433152,
432884,
430810,
433202,
433185,
433931,
432497,
271649
],
"minTimeStamp":[
1482285384222,
1482285384222,
1482285384222,
1482285384222,
1482285384222,
1482285384222,
1482285384222,
1482285384222,
1482285384222,
1482285384222,
1482287036566
],
"maxTimeStamp":[
1482287240397,
1482287240397,
1482287240397,
1482287240397,
1482287240397,
1482287240397,
1482287240397,
1482287240397,
1482287240397,
1482287240397,
1482287240397
]
}
}
}
]
}

7) Now, to delete the specific spill, we use the spill map fid, the spill fid and the key from steps 5 and 6 to execute the command below.

maprcli debugdb multiOp -kvfid 2236.2299.147866 -delkeys 0 -delfids 2236.2319.147898



Monday, December 19, 2016

Breaking up Huge volumes



When volumes become too large, their name container (NC) also becomes huge. We need to move the larger sub-directories into their own volumes to optimize performance (if the number of inodes is high, lookups take significantly more time and create more random I/O), and because if a disk or node crashes it will take significant time to re-replicate due to the high number of inodes to be resynced.

Here's how to move a sub-directory (test) to its own volume and not be part of root volume.

1) Create a volume to hold the new content and mount it at any convenient location (/test1).

[mapr@n2d logs]$ maprcli volume create -name test -path /test1

2) Now the original directory is /test and the new volume is mounted at /test1.

[mapr@n2d logs]$ hadoop mfs -ls /

drwxr-xr-x Z   - mapr mapr          1 2014-10-24 12:59  268435456 /test
               p 2049.52.262576  n2d:5660 n1d:5660 n3d:5660
vrwxr-xr-x Z   - mapr mapr          0 2014-10-24 13:00  268435456 /test1

3) Move the data to volume.

[mapr@n2d logs]$ hadoop fs -mv /test/* /test1/
14/10/24 13:04:32 INFO fs.MapRFileSystem: Cannot rename across volumes, falling back on copy/delete semantics
[mapr@n2d logs]$ hadoop fs -ls /test1
Found 1 items
-rwxr-xr-x   3 mapr mapr          0 2014-10-24 13:04 /test1/abizer

Note :- For huge volumes, instead of using move, please use distcp to take advantage of the data being copied in parallel.

hadoop distcp -p  <source> <destination>

4) Now delete the original directory or move it to a directory, say "testold".

[mapr@n2d logs]$ hadoop fs -rmr /test
Deleted maprfs:/test

              or

[mapr@n2d logs]$ hadoop fs -mv /test  /testold

5) Unmount the test volume, mount it with the original directory name, and verify.

[mapr@n2d logs]$ maprcli volume unmount -name test
[mapr@n2d logs]$ maprcli volume mount -name test -path /test 
[mapr@n2d logs]$ hadoop fs -ls /
Found 5 items
drwxr-xr-x   - mapr mapr          0 2014-07-11 13:45 /hbase
drwxr-xr-x   - mapr mapr          2 2014-10-15 15:10 /mirr_users
drwxr-xr-x   - mapr mapr          1 2014-10-24 13:04 /test
drwxr-xr-x   - mapr mapr          2 2014-10-15 15:10 /user
drwxr-xr-x   - mapr mapr          1 2014-07-11 13:45 /var
[mapr@n2d logs]$ hadoop fs -ls /test
Found 1 items
-rwxr-xr-x   3 mapr mapr          0 2014-10-24 13:04 /test/abizer

Saturday, December 10, 2016

MapR-DB Tablet consistency Check



This post assumes you have created the /abizer table and it is empty, so we can use the utility below to load the table.

1)  /opt/mapr/server/tools/loadtest -mode put  -numfamilies 1 -numcols 8 -numrows 115450  -table /abizer
23:44:31    0 secs     47391 rows    47391 rows/s    0ms latency   74ms maxLatency
23:44:32    1 secs     69275 rows    21884 rows/s    0ms latency   62ms maxLatency
23:44:33    2 secs     85854 rows    16579 rows/s    1ms latency  332ms maxLatency
23:44:34    3 secs     91451 rows     5597 rows/s    3ms latency  326ms maxLatency
23:44:35    4 secs     99718 rows     8267 rows/s    2ms latency  370ms maxLatency
23:44:36    5 secs    104277 rows     4559 rows/s    4ms latency  665ms maxLatency
23:44:37    6 secs    106115 rows     1838 rows/s   11ms latency  161ms maxLatency
23:44:38    7 secs    115450 rows     9335 rows/s    5ms latency  343ms maxLatency
Overall Rate 14425.84 rows/s, Latency 1ms



2) Now check the size and other attributes of the table.

 maprcli table info -path /abizer -json
{
"timestamp":1481356005057,
"timeofday":"2016-12-09 11:46:45.057 GMT-0800",
"status":"OK",
"total":4,
"data":[
{
"path":"/abizer",
"numregions":4,
"totallogicalsize":866525184,
"totalphysicalsize":538484736,
"totalcopypendingsize":261562368,
"totalrows":173733,
"totalnumberofspills":522,
"totalnumberofsegments":219,
"autosplit":true,
"bulkload":false,
"insertionorder":true,
"tabletype":"binary",
"regionsizemb":4096,
"audit":false,
"maxvalueszinmemindex":100,
"adminaccessperm":"u:root",
"createrenamefamilyperm":"u:root",
"bulkloadperm":"u:root",
"packperm":"u:root",
"deletefamilyperm":"u:root",
"replperm":"u:root",
"splitmergeperm":"u:root",
"defaultappendperm":"u:root",
"defaultcompressionperm":"u:root",
"defaultmemoryperm":"u:root",
"defaultreadperm":"u:root",
"defaultversionperm":"u:root",
"defaultwriteperm":"u:root",
"uuid":"315a20f5-ad00-3638-3229-03afb04b5800"
}
]
}


3) Since we know there are 4 regions, let's check the details of the 4 regions (regions are also called tablets) in the table.

maprcli table region list -path /abizer -json
{
"timestamp":1481356069293,
"timeofday":"2016-12-09 11:47:49.293 GMT-0800",
"status":"OK",
"total":4,
"data":[
{
"primarymfs":"node10.maprlab.local:5660",
"secondarymfs":"",
"startkey":"-INFINITY",
"endkey":"user3051486366437214287",
"lastheartbeat":0,
"fid":"2248.32.131418",
"logicalsize":215515136,
"physicalsize":133914624,
"copypendingsize":127434752,
"numberofrows":42943,
"numberofrowswithdelete":0,
"numberofspills":147,
"numberofsegments":52
},
{
"primarymfs":"node10.maprlab.local:5660",
"secondarymfs":"",
"startkey":"user3051486366437214287",
"endkey":"user5105987298073629634",
"lastheartbeat":0,
"fid":"2247.32.131310",
"logicalsize":216645632,
"physicalsize":134545408,
"copypendingsize":0,
"numberofrows":43089,
"numberofrowswithdelete":0,
"numberofspills":149,
"numberofsegments":53
},
{
"primarymfs":"node10.maprlab.local:5660",
"secondarymfs":"",
"startkey":"user5105987298073629634",
"endkey":"user720538676648464684",
"lastheartbeat":0,
"fid":"2251.32.131262",
"logicalsize":217849856,
"physicalsize":135897088,
"copypendingsize":0,
"numberofrows":44198,
"numberofrowswithdelete":0,
"numberofspills":112,
"numberofsegments":56
},
{
"primarymfs":"node10.maprlab.local:5660",
"secondarymfs":"",
"startkey":"user720538676648464684",
"endkey":"INFINITY",
"lastheartbeat":0,
"fid":"2250.32.131186",
"logicalsize":216514560,
"physicalsize":134127616,
"copypendingsize":134127616,
"numberofrows":43503,
"numberofrowswithdelete":0,
"numberofspills":114,
"numberofsegments":58
}
]
}



4) Info/details of the local tablets on a node can alternatively be viewed with mrconfig commands.


/opt/mapr/server/mrconfig dbinfo tablets
----------------------
|From Instance 5660::|
----------------------
tablet 2250.32.131186 nref 0 npartitions 1 logicalMB 206 physicalMB 127 rows 43503 splitState None attrAutoSplit 1 tabletSplitThreshSizeMB 6144 partitionSplitThreshSizeMB 2048 isReadOnly 0 error 0 updateError 0
tablet 2251.32.131262 nref 0 npartitions 1 logicalMB 207 physicalMB 129 rows 44198 splitState None attrAutoSplit 1 tabletSplitThreshSizeMB 6144 partitionSplitThreshSizeMB 2048 isReadOnly 0 error 0 updateError 0
tablet 2247.32.131310 nref 0 npartitions 2 logicalMB 206 physicalMB 128 rows 43089 splitState None attrAutoSplit 1 tabletSplitThreshSizeMB 6144 partitionSplitThreshSizeMB 2048 isReadOnly 0 error 0 updateError 0
tablet 2248.32.131418 nref 0 npartitions 2 logicalMB 205 physicalMB 127 rows 42943 splitState None attrAutoSplit 1 tabletSplitThreshSizeMB 6144 partitionSplitThreshSizeMB 2048 isReadOnly 0 error 0 updateError 0


5) Let's say we believe fid "2251.32.131262" has some issues and we want to review the stats of the fid in question; the command below gives the info needed.

[root@node10 ~]# maprcli debugdb statTablet -fid 2251.32.131262 -json
{
"timestamp":1481356575116,
"timeofday":"2016-12-09 11:56:15.116 GMT-0800",
"status":"OK",
"total":1,
"data":[
{
"numPhysicalBlocks":16589,
"numLogicalBlocks":26593,
"numRows":44198,
"numRowsWithDelete":0,
"numRemoteBlocks":0,
"numSpills":112,
"numSegments":56
}
]

}



6) To check the consistency of the tablet we can run the command below, which checks and reports details related to the tablet. The startkey and endkey need to be taken from the output of "maprcli table region list -path /abizer -json" for the region/tablet in question.

If this fails for any reason, check /opt/mapr/logs/mfs.log-5 on the master node for the tablet to see whether an error is reported and debug further; normally, if there are no issues with the tablet, you will see clean output as below.

[root@node10 ~]# maprcli debugdb checkTablet -fid 2251.32.131262 -tracefile /tmp/tablet-2251.32.131262 -startkey 'user5105987298073629634' -endkey 'user720538676648464684'

TabletRangeCheck done

[root@node10 ~]# cat /tmp/tablet-2251.32.131262 
Tablet 2251.32.131262
Segmap 2251.607.132412
spill 2251.266.131730 parent 2251.265.131728 idx 0 size 2048000
keymap 2251.266.131730 parent 2251.265.131728 idx 0 offset 12 len 31471
cf 2251.266.131730 parent 2251.265.131728 idx 0 family 1 offset 524288 len 2110718
cf 2251.266.131730 parent 2251.265.131728 idx 0 family 2 offset 2686976 len 1048712
Spillmap 2251.265.131728 parent 2251.607.132412
spill 2251.390.131978 parent 2251.383.131964 idx 0 size 2088960
keymap 2251.390.131978 parent 2251.383.131964 idx 0 offset 12 len 32489
cf 2251.390.131978 parent 2251.383.131964 idx 0 family 1 offset 524288 len 2185207
cf 2251.390.131978 parent 2251.383.131964 idx 0 family 2 offset 2752512 len 1050456
Spillmap 2251.383.131964 parent 2251.607.132412
spill 2251.546.132290 parent 2251.383.131964 idx 1 size 1474560
keymap 2251.546.132290 parent 2251.383.131964 idx 1 offset 12 len 19507
cf 2251.546.132290 parent 2251.383.131964 idx 1 family 1 offset 524288 len 1138369
cf 2251.546.132290 parent 2251.383.131964 idx 1 family 2 offset 1703936 len 1138798
[root@node10 ~]# 


The trace file above shows that everything is fine with the tablet and no inconsistencies were found.