Big Data / Systems / Cloud / AI : February 2017

Tuesday, February 21, 2017

Speed up FileOutputCommitter

If Job generates many files to commit then the commitJob method call at the end of the job can take minutes or hours depending on number of mapper and reducer in the job. This is a performance regression from MR1, as in MR1 the tasks commit directly to the final output directory as they were completing and commitJob had very little to do. In Yarn (Hadoop 2.x), the commit is single-threaded and waits until all tasks have completed before commencing which can cause huge delay in job finishing successfully even though all Map and Reduce tasks have finished.

Below Jira has details of the problem and hence community came with new algorithm to overcome this performance issue.

https://issues.apache.org/jira/browse/MAPREDUCE-4815

In algorithm version 1:

1. commitTask will rename directory

$joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/_temporary/$appAttemptID/$taskID/

2. recoverTask will also do a rename

$joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/_temporary/($appAttemptID + 1)/$taskID/

3. commitJob will merge every task output file ---> This does nothing but re-arranging and copying the output to joboutput Dir ( So job is not Success unless this stage is done)

$joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/, then it will delete $joboutput/_temporary/ and write $joboutput/_SUCCESS

So if a job generates many files then the commitJob method call at the end of the job can take Hrs. The commit is single-threaded and has to waits until all tasks have completed before commencing. Algorithm version 2 will change the behavior and make it more efficient.

In algorithm version 2:

1. commitTask will rename all files in $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/

2. recoverTask actually doesn't require to do anything

3 commitJob can simply delete $joboutput/_temporary and write $joboutput/_SUCCESS This algorithm will reduce the output commit time for large jobs by having the tasks commit directly to the final output directory as they were completing and commitJob had very little to do.

Implementing :

To enable the new algorithm, set to 2 in the mapred-site.xml on all the nodes or during job submission use

-Dmapreduce.fileoutputcommitter.algorithm.version=2 .

Verify :

We can verify job is using Algorithm 1 or 2 from AM container logs as below.

2017-02-14 00:46:06,817 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1

Saturday, February 18, 2017

Procedure to remove Data nodes

Use the following procedure to remove a node and uninstall the MapR software from the cluster .

Note : - If the node you are decommissioning runs a critical service such as CLDB or ZooKeeper please stop since this doc only accounts for Data nodes i.e Fileserver/Nodemanager.

1) Yarn Containers :

First remove the node from the Label for no new task attempts to be scheduled on this node.

ps -ef | grep container ( wait for the output to be zero i.e all existing container finish execution)

2) Decommission the node :

Complete the following steps to decommission a node i.e drain data out from the node:

Drain the data by moving the node to the /decommissioned physical topology. All the data on a node in the /decommissioned topology is migrated to volumes topology i.e nodes part of respective volume topo.

Run the following command to check if a given volume is present on the node:

maprcli dump volumenodes -volumename <volume> -json | grep <ip>

Note: Run this command for each non-local volume in your cluster to verify that the node being decommissioned is not storing any volume data. Alternatively you can also see from MCS id MFS disks occupying no space.

3) Node Removal : Removing the node from CLDB list .

Note :- Only after Node is completely decommissioned continue next steps .

1. Stop warden on the decommissioned : service maps-warden stop
2. Remove any local volumes on the node eg mapr.node9.maprlab.local.local.mapred.

maprcli volume remove -name < vol name >

3. After warden is stopped for 5 minutes issue node remove command to remove the node completely.
4. Determine which MapR packages are installed on the node

dpkg --list | grep mapr (Ubuntu)
rpm -qa | grep mapr (Red Hat or CentOS)

5. Remove the packages by issuing the appropriate command for the operating system, followed by the list of services. Examples:

apt-get purge mapr-core mapr-cldb mapr-fileserver (Ubuntu)
yum erase mapr-core mapr-cldb mapr-fileserver (Red Hat or CentOS)

6. Remove the /opt/mapr and /opt/cores directory to remove any instances of hostid, hostname and cores left behind by the package manager.

Wednesday, February 15, 2017

Core Dumps

In most Linux systems core dump files are generated after an uncaught signal in a process (as a SIGSEGV or SIGQUIT). Core is state of the process at particular instance and is very helpful for debugging issues.

File where core generation limit is specified at user level.

cat /etc/security/limits.conf | grep core

#        - core - limits the core file size (KB)

#*               soft    core            0

mapr - core unlimited

Current core limit for user "MapR"

[mapr@node9 root]$ ulimit -c

unlimited

[mapr@node9 root]$ ulimit -a

core file size          (blocks, -c) unlimited

data seg size           (kbytes, -d) unlimited

scheduling priority             (-e) 30

file size               (blocks, -f) unlimited

pending signals                 (-i) 95148

max locked memory       (kbytes, -l) unlimited

max memory size         (kbytes, -m) unlimited

open files                      (-n) 65535

pipe size            (512 bytes, -p) 8

POSIX message queues     (bytes, -q) 819200

real-time priority              (-r) 0

stack size              (kbytes, -s) 10240

cpu time               (seconds, -t) unlimited

max user processes              (-u) unlimited

virtual memory          (kbytes, -v) unlimited

file locks                      (-x) unlimited

The Core Pattern in Kernel :

The core location and pattern is determined by the value stored in "/proc/sys/kernel/core_pattern"

[root@node9 ~]# cat /proc/sys/kernel/core_pattern

/opt/cores/%e.core.%p.%h

[root@node9 ~]#

This pattern shows how the core file will be generated and where. Two things that can be understood from the output: 1) The filename of the core dump file generated will be “core” and Binary name in the start 2) the current directory used to store it

You can use the following pattern elements in the core_pattern file:


%p: pid
%: '%' is dropped
%%: output one '%'
%u: uid
%g: gid
%s: signal number
%t: UNIX time of dump
%h: hostname
%e: executable filename
%: both are dropped

You can detect which node generated the core file (with the hostname), which program generated it (with executable filename ), and also when did it happen (with the unix time).

Below command can be used to re-direct core to customer location.

echo "/opt/cores/%e.core.%p.%h" > /proc/sys/kernel/core_pattern

Configure core dump forever

The changes done above are only applicable until the next reboot. In order to make the change persist reboots, we will need to add the following in “/etc/sysctl.conf“:

i)  tail -2 /etc/sysctl.conf 

# Custom core file pattern and location.

kernel.core_pattern=/opt/cores/%e.core.%p.%h

Now to Load in sysctl settings from the file /etc/sysctl.con

ii) sysctl -p

iii) sysctl -a | grep kernel.core_pattern

kernel.core_pattern = /opt/cores/%e.core.%p.%h (sysctl.conf is the file controlling every configuration under /proc/sys)

Test Core generated in correct Location:

Below command will trigger a segmentation fault in current shell and generate a core file in location we are looking to collect core dump.

[mapr@node9 root]$ kill -s SIGSEGV $$

Segmentation fault (core dumped)

[root@node9 ~]# ls /opt/cores/

bash.core.25469.node9.maprlab.local

[root@node9 ~]# 

Saturday, February 11, 2017

Cleaning up removed services from MCS

Sometimes even though we have removed a service which existed once in the cluster and all the packages related to service are removed it still shows up on MCS . This is usually caused due to following unclean way of removing services from the node and services still lingering around in ZK. To clear this disturbance and remove a service from MCS follow below steps.

Of course you will have to replace the drill bits section with the service you are trying to remove.

1) On any ZK node run below command to connect to the quorum :

/opt/mapr/zookeeper/zookeeper-3.4.5/bin/zkCli.sh -server <ZK-IP>:5181

2) Remove services and service_config for drill-bits ( i.e Any services you are trying to remove ) .

rmr /services/drill-bits

rm /services_config/drill-bits

3) Restart web server:

maprcli node services -name webserver -filter '[csvc==webserver]' -action restart

Now the service shouldn’t show-up once you login into MCS.

Disable disk calculation/allocation in MapR

By default map uses 0.5 disk while reducer uses 1.33 disk .

[root@node9 ~]# hadoop conf | grep mapreduce.*.disk

<property><name>mapreduce.map.disk</name><value>0.5</value></property>

<property><name>mapreduce.reduce.disk</name><value>1.33</value></property>

[root@node9 ~]#

How do I disable disk calculation/allocation in MapR ?

Job Level:
You can set the following value when you run a job:

-Dmapreduce.map.disk=0 or/and -Dmapreduce.reduce.disk=0

Cluster level :
You can set the following properties in /opt/mapr/hadoop/hadoop-*/etc/hadoop/mapred-site.xml on all the nodes ( No restarts needed ) :

<name>mapreduce.map.disk </name>

<description>Number of “disks” allocated per map task </description>

</property>

<name>mapreduce.reduce.disk </name>

<description>Number of “disks” allocated per reduce task </description>

</property>

After the change below would be the values as expected.

[root@node9 ~]# hadoop conf | grep mapreduce.*.disk

<property><name>mapreduce.map.disk</name><value>0</value><source>mapred-site.xml</source></property>

<property><name>mapreduce.reduce.disk</name><value>0</value><source>mapred-site.xml</source></property>

[root@node9 ~]#

Note :- The cluster is fully functional without using any 'disks' resources but the applications will be hanging in SCHEDULED state because Application Master container can not be assigned to app_attempt due to AM context can not be built if any of the resource is Zero.

So the configuration in fair-scheduler should contain as least some value for every resource ( Non Zero and non factional ).

Lets say for this reason in fair-site.xml we keep value for disk as 1 even though our jobs will never ask for disks.

<minResources>143155 mb,58 vCores,1 disks</minResources>

<maxResources>386310 mb,118 vcores,1 disks</maxResources>