Monday, September 12, 2016

Repairing Offline Volume


If the alarm below appears in the output of "maprcli alarm list", the cluster is in a potentially dangerous state.

The alarm means that the HBase volume (mapr.hbase) is unavailable because at least one container in the volume has no master copy.

1472843464042  1  Volume data unavailable  VOLUME_ALARM_DATA_UNAVAILABLE  mapr.hbase

Run "maprcli dump volumeinfo -volumename mapr.hbase -json" and look for a container in the state shown below. Here container 2051 has no valid master: CLDB expects the latest copy of this container (Epoch 42) to be on node 10.10.70.116, but for some reason that copy is offline.
The ideal fix is to log in to the node that holds the latest copy and bring MFS and the SP on which the container resides back online, so that the other replicas can resync from it. This may involve running fsck on the storage pools (or disks) if MFS took them offline after an error.

 {
                        "ContainerId":2051,
                        "Epoch":42,
                        "Master":"unknown ip (0)-0-VALID",
                        "ActiveServers":{

                        },
                        "InactiveServers":{
                                "IP:Port":[
                                        "10.10.70.109:5660--41",
                                        "10.10.70.117:5660--40"
                                ]
                        },
                        "UnusedServers":{
                                "IP:Port":"10.10.70.116:5660--42"
                        },
                        "OwnedSizeMB":"367 MB",
                        "SharedSizeMB":"11.88 GB",
                        "LogicalSizeMB":"28.17 GB",
                        "TotalSizeMB":"13.29 GB",
                        "NumInodesInUse":11503,
                        "Mtime":"Wed Jun 01 09:46:32 CDT 2016",
                        "NameContainer":"false",
                        "CreatorContainerId":0,
                        "CreatorVolumeUuid":"",
                        "UseActualCreatorId":false
                },
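
The check described above can also be scripted. The following is a minimal Python sketch, not a MapR tool: it takes one container entry from the "maprcli dump volumeinfo ... -json" output (already loaded as a dict) and reports whether a master is present and which replica carries which epoch. The helper names parse_servers and summarize are hypothetical. Two things are assumed from the sample output rather than from documentation: that a Master string starting with "unknown ip" means there is no master, and that the number after "--" is the epoch of that replica.

```python
import re

# Assumed entry format, taken from the sample output: "<ip>:<port>--<epoch>"
_SERVER_RE = re.compile(r"^(?P<ip>[\d.]+):(?P<port>\d+)--(?P<epoch>\d+)$")


def parse_servers(section):
    """Return (ip, port, epoch) tuples from one *Servers section."""
    entries = section.get("IP:Port", [])
    if isinstance(entries, str):  # a single server is printed as a plain string
        entries = [entries]
    servers = []
    for entry in entries:
        match = _SERVER_RE.match(entry)
        if match:
            servers.append((match["ip"], int(match["port"]), int(match["epoch"])))
    return servers


def summarize(container):
    """Report master presence and all replicas, highest epoch first."""
    # Assumption: "unknown ip ..." in the Master field means no master.
    has_master = not container["Master"].startswith("unknown ip")
    replicas = []
    for key in ("ActiveServers", "InactiveServers", "UnusedServers"):
        replicas.extend(parse_servers(container.get(key, {})))
    replicas.sort(key=lambda r: r[2], reverse=True)
    latest = [r for r in replicas if r[2] == container["Epoch"]]
    return {"has_master": has_master, "replicas": replicas, "latest": latest}


# Relevant fields of the container 2051 entry shown above.
sample = {
    "ContainerId": 2051,
    "Epoch": 42,
    "Master": "unknown ip (0)-0-VALID",
    "ActiveServers": {},
    "InactiveServers": {"IP:Port": ["10.10.70.109:5660--41", "10.10.70.117:5660--40"]},
    "UnusedServers": {"IP:Port": "10.10.70.116:5660--42"},
}
summary = summarize(sample)
print(summary)
```

The sorted replica list makes it easy to see which surviving copy is closest to the latest epoch if the newest one cannot be recovered.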

If for some reason you cannot bring the container with the latest Epoch online, and you are certain it can never be brought online (for example because of a disk failure or a formatted SP), only then follow the steps below to get the volume online.

Note :- Promoting a replica container to master can cause data loss. Use this method only in a disaster, when you want the volume back online and accept the data loss.

Promoting Container 

1) From the earlier output, the copy on 10.10.70.109 (Epoch 41) is closest to the latest Epoch, so we choose to promote container 2051 on this node.
2) First find the SP on which container 2051 resides. The output below shows that it is on SP1 (/dev/sdb) with SPID "077bf85e6e8423410057abfe210ced0f".

[root@node9 ~]# /opt/mapr/server/mrconfig info dumpcontainers | grep 2051
cid:2051 volid:41208203 sp:SP1:/dev/sdb spid:077bf85e6e8423410057abfe210ced0f prev:0 next:0 issnap:0 isclone:0 deleteinprog:0 fixedbyfsck:0 stale:1 querycldb:0 resyncinprog:0
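
A plain "grep 2051" also matches any other line that happens to contain those digits (for example a container ID such as 12051). As a hedged sketch, the Python below splits each dumpcontainers line into key:value fields and selects the container by exact cid. The function names are hypothetical, the field layout is assumed from the single sample line above, and the second line in the demo input is made up purely to show the exact match.

```python
def parse_dumpcontainers_line(line):
    """Split one 'key:value key:value ...' line into a dict of strings."""
    fields = {}
    for token in line.split():
        # Split on the first ':' only, so "sp:SP1:/dev/sdb" keeps "SP1:/dev/sdb".
        key, _, value = token.partition(":")
        fields[key] = value
    return fields


def find_container(output, cid):
    """Return the parsed line whose cid equals `cid` exactly, or None."""
    for line in output.splitlines():
        fields = parse_dumpcontainers_line(line)
        if fields.get("cid") == str(cid):
            return fields
    return None


output = """\
cid:12051 volid:1 sp:SP2:/dev/sdc spid:made-up-for-illustration stale:0
cid:2051 volid:41208203 sp:SP1:/dev/sdb spid:077bf85e6e8423410057abfe210ced0f prev:0 next:0 issnap:0 isclone:0 deleteinprog:0 fixedbyfsck:0 stale:1 querycldb:0 resyncinprog:0
"""

info = find_container(output, 2051)
sp_name, _, device = info["sp"].partition(":")
print(sp_name, device, info["spid"])
```

The spid value extracted this way is the second argument passed to "mrconfig cntr forcemaster" in the next step.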

3) Now promote this container with the command below.

[root@node9 ~]# /opt/mapr/server/mrconfig cntr forcemaster 2051 077bf85e6e8423410057abfe210ced0f
Container Force master for container 2051 on spid 077bf85e6e8423410057abfe210ced0f

4) The container that was missing a master now has one, but to make the volume consistent and to make the name container (NC) aware of the change, run GFSCK on the HBase volume with the repair option.

/opt/mapr/bin/gfsck rwvolume=mapr.hbase -d -r -y &

( -d for debug, -r for repair, -y to assume yes, & to run in the background )

Once it completes, check the report. If everything went well and the needed fixes were applied, you will see the message "GlobalFsck completed successfully", as below.

=== End of GlobalFsck Report ===

  remove volume mapr.hbase from global-fsck mode (ret = 0) ...

GlobalFsck completed successfully (7264 ms); Result: repair succeeded
=== === === === === === === === === === === === === === === === === 
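
When gfsck runs in the background with its output saved to a file, the final status can be checked programmatically. The sketch below is an illustration only: the function name gfsck_result is hypothetical, and the pattern is derived solely from the sample report above, so other gfsck versions or failure messages may look different.

```python
import re

# Pattern assumed from the sample line:
# "GlobalFsck completed successfully (7264 ms); Result: repair succeeded"
_RESULT_RE = re.compile(
    r"GlobalFsck completed successfully \((\d+) ms\); Result: (.+)"
)


def gfsck_result(report):
    """Return (elapsed_ms, result_text) if the success line is present, else None."""
    for line in report.splitlines():
        match = _RESULT_RE.search(line)
        if match:
            return int(match.group(1)), match.group(2).strip()
    return None


report = """\
=== End of GlobalFsck Report ===

  remove volume mapr.hbase from global-fsck mode (ret = 0) ...

GlobalFsck completed successfully (7264 ms); Result: repair succeeded
"""

status = gfsck_result(report)
print(status)
```

A return value of None only means the success line was not found; the full report still needs to be read to learn why.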



Sunday, September 4, 2016

Mrconfig




The mrconfig commands let you create, remove, and manage containers, storage pools, disk groups, and disks; and provide various useful information.

Note :- The mrconfig commands provide direct control and access to MapR-FS at a low level. If you are not careful, or do not know what you are doing, you can irrevocably destroy valuable data.

1) List all disks assigned to MFS on the node

/opt/mapr/server/mrconfig disk list

2) List all SPs and their details, with disk names

/opt/mapr/server/mrconfig sp list -v

3) Display information about containers on a local node

/opt/mapr/server/mrconfig info dumpcontainers

4) Display information about containers on a remote node with an IP address of xx.xx.xx.xx

/opt/mapr/server/mrconfig -h xx.xx.xx.xx info dumpcontainers

5) Show the number of inodes already resynced and still pending resync for a container

mrconfig cntr resyncprogress --cids <cid>

6) List all the running threads in MFS

/opt/mapr/server/mrconfig info threads

7) List detailed memory usage by each work area

/opt/mapr/server/mrconfig info slabs

8) List containers in a volume (here, the volume named users)

/opt/mapr/server/mrconfig info containerlist users
Volume containers

2074

9) Send an FCR (full container report) to inform CLDB that the file system has changed

/opt/mapr/server/mrconfig set config send.fcr 1

10) Disable throttling for a particular CID, or reset it to the default

/opt/mapr/server/mrconfig cntr disablethrottle <cid>

/opt/mapr/server/mrconfig cntr resetthrottle <cid>

11) To disable network throttling, set the FS Resync Network Throttle Factor to a high number, say 200

/opt/mapr/server/mrconfig resync <command> <params>

where the command is:

setresyncnetworkthrottlefactor <non zero integer>

Note: Default value of FS Resync Network Throttle Factor = 20