Repairing an Offline Volume
If you see the alarm below on the cluster while running the "maprcli alarm list" command, the cluster is potentially in a dangerous state.
This alarm means the HBase volume (mapr.hbase) is unavailable because at least one of its container copies does not have a valid master.
1472843464042 1 Volume data unavailable VOLUME_ALARM_DATA_UNAVAILABLE mapr.hbase
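For reference, a quick way to pull just this alarm out of a long listing is to filter on the alarm name (the grep filter is only a convenience; the plain "maprcli alarm list" output shown above works just as well):
maprcli alarm list | grep VOLUME_ALARM_DATA_UNAVAILABLE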
Run "maprcli dump volumeinfo -volumename mapr.hbase -json" and check whether any container is in the state shown below. Here container 2051 does not have a valid master: the CLDB expects the latest copy of this container (epoch 42) to be on node 10.10.70.116, but for some reason that copy is offline.
{
"ContainerId":2051,
"Epoch":42,
"Master":"unknown ip (0)-0-VALID",
"ActiveServers":{
},
"InactiveServers":{
"IP:Port":[
"10.10.70.109:5660--41",
"10.10.70.117:5660--40"
]
},
"UnusedServers":{
"IP:Port":"10.10.70.116:5660--42"
},
"OwnedSizeMB":"367 MB",
"SharedSizeMB":"11.88 GB",
"LogicalSizeMB":"28.17 GB",
"TotalSizeMB":"13.29 GB",
"NumInodesInUse":11503,
"Mtime":"Wed Jun 01 09:46:32 CDT 2016",
"NameContainer":"false",
"CreatorContainerId":0,
"CreatorVolumeUuid":"",
"UseActualCreatorId":false
},
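On a volume with many containers the dump can be long. A rough way to spot containers that lack a valid master is to grep the JSON for the "unknown ip" master string; the 2 lines of leading context are an arbitrary choice that pulls in the ContainerId and Epoch fields shown above (adjust as needed for your release's output format):
maprcli dump volumeinfo -volumename mapr.hbase -json | grep -B 2 '"Master":"unknown'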
Ideally, you should log in to the node that holds the latest copy (10.10.70.116 in this example) and bring MFS and the storage pool (SP) on which the container resides back online, so that the other replicas can resync from this container. This may involve running the fsck command on the storage pool (or its disks) if MFS took it offline after an error signal, as sketched below.
"ContainerId":2051,
"Epoch":42,
"Master":"unknown ip (0)-0-VALID",
"ActiveServers":{
},
"InactiveServers":{
"IP:Port":[
"10.10.70.109:5660--41",
"10.10.70.117:5660--40"
]
},
"UnusedServers":{
"IP:Port":"10.10.70.116:5660--42"
},
"OwnedSizeMB":"367 MB",
"SharedSizeMB":"11.88 GB",
"LogicalSizeMB":"28.17 GB",
"TotalSizeMB":"13.29 GB",
"NumInodesInUse":11503,
"Mtime":"Wed Jun 01 09:46:32 CDT 2016",
"NameContainer":"false",
"CreatorContainerId":0,
"CreatorVolumeUuid":"",
"UseActualCreatorId":false
},
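A minimal sketch of that recovery path, assuming for illustration that the affected storage pool on that node is named SP1 and sits on /dev/sdb (find the real values with "/opt/mapr/server/mrconfig info dumpcontainers" on that node, as in step 2 below); verify the fsck flags and the sp online subcommand against your MapR release's documentation before running them:
/opt/mapr/server/fsck -n SP1 -r ( -n names the storage pool , -r enables repair mode )
/opt/mapr/server/mrconfig sp online /dev/sdb ( bring the repaired SP back online so the other replicas can resync )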
If for some reason you cannot bring the container with the latest epoch online and are certain it can never come back (e.g. due to disk failure, a reformatted SP, etc.), only then follow the steps below to bring the volume online.
Note: Promoting a replica container to master can cause data loss. This method should only be used in a disaster scenario where you want to bring the volume back online and are willing to accept the data loss.
Promoting Container
1) From the earlier output we can see that the copy of container 2051 on 10.10.70.109 (epoch 41) is the closest to the latest epoch, so we choose to promote the container on this node.
2) First we need to find the SP on which container 2051 resides. From the output below we can see this container is on SP1 ( /dev/sdb ) with SPID "077bf85e6e8423410057abfe210ced0f".
[root@node9 ~]# /opt/mapr/server/mrconfig info dumpcontainers | grep 2051
cid:2051 volid:41208203 sp:SP1:/dev/sdb spid:077bf85e6e8423410057abfe210ced0f prev:0 next:0 issnap:0 isclone:0 deleteinprog:0 fixedbyfsck:0 stale:1 querycldb:0 resyncinprog:0
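Optionally, before promoting, you can confirm that the storage pool itself is online on this node. The exact output format varies by release, but the subcommand below lists the node's SPs and their state:
[root@node9 ~]# /opt/mapr/server/mrconfig sp list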
3) Now promote this container with the command below:
[root@node9 ~]# /opt/mapr/server/mrconfig cntr forcemaster 2051 077bf85e6e8423410057abfe210ced0f
Container Force master for container 2051 on spid 077bf85e6e8423410057abfe210ced0f
4) The container that was missing a master now has one, but to make the volume consistent and to make the name container (NC) aware of this change, we need to run gfsck on the mapr.hbase volume with the repair option.
/opt/mapr/bin/gfsck rwvolume=mapr.hbase -d -r -y & ( -d for debug , -r for repair , -y to assume yes , & to run in the background )
Once the above completes, check the report. If everything went well and the necessary fixes were made, you will see the message "GlobalFsck completed successfully" as shown below:
=== End of GlobalFsck Report ===
remove volume mapr.hbase from global-fsck mode (ret = 0) ...
GlobalFsck completed successfully (7264 ms); Result: repair succeeded
=== === === === === === === === === === === === === === === === ===
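As a final check, using only the commands shown earlier, confirm that container 2051 now reports a valid master and that the alarm clears (it may take a short while for the CLDB to update):
maprcli dump volumeinfo -volumename mapr.hbase -json
maprcli alarm list | grep VOLUME_ALARM_DATA_UNAVAILABLE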