Monday, April 17, 2017

FileMigrate


This blog assumes you already have a secure MapR 5.2 cluster up and running, and that your requirement is to continuously scan a specific location in MapR-FS for new files and move them to a bucket in AWS S3 according to the policies you set up.

Note: I am using a single-node cluster for this blog, but in a real scenario you would install the FileMigrate service on one of the nodes that runs fewer services.

1) Stop Warden.

service mapr-warden stop

2) Install the FileMigrate service and run configure.sh.

rpm -ivh mapr-filemigrate-1.0.0.201704071106-1.x86_64.rpm 

/opt/mapr/server/configure.sh -R

3) Now start Warden.

service mapr-warden start
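Once Warden is back up, a quick way to confirm the new service registered on the node is to list the services Warden knows about (node9 is my test node; substitute your own hostname):

maprcli service list -node node9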

When the rpm is installed, the file below is installed as well; it is essentially how MCS tracks this service as part of the pluggable services framework.

[root@node9 conf]# cat /opt/mapr/filemigrate/filemigrate-1.0.0/conf/warden.filemigrate.conf 
services=filemigrate:1:cldb
service.displayname=FileMigrate
service.command.start=/opt/mapr/filemigrate/filemigrate-1.0.0/bin/mapr-filemigrate.sh start
service.command.stop=/opt/mapr/filemigrate/filemigrate-1.0.0/bin/mapr-filemigrate.sh stop
service.command.monitorcommand=/opt/mapr/filemigrate/filemigrate-1.0.0/bin/mapr-filemigrate.sh status
service.command.type=BACKGROUND
service.ui.port=9444
service.uri=/api/login
service.baseservice=0
service.logs.location=/opt/mapr/filemigrate/filemigrate-1.0.0/logs/filemigrate.log
service.process.type=BINARY
service.alarm.tersename=nafmsd
service.alarm.label=FileMigrateServerDown
#The items here need to be set consistently to accurately reflect memory available for the service.
#Default is 100MB of memory.
service.env=JAVA_HEAP=100m,JETTY_PORT=9444
service.heapsize.min=100
service.heapsize.max=100
service.heapsize.percent=2
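As the comment in the file says, the heap-related items need to be changed consistently. For example, if you wanted to give the service 256 MB instead of the default 100 MB, the edit would look something like the sketch below (the value is just my illustration, not a tuning recommendation), followed by a service restart:

service.env=JAVA_HEAP=256m,JETTY_PORT=9444
service.heapsize.min=256
service.heapsize.max=256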

4) Once the cluster and the FileMigrate service are up, we can see them running via jps.

[root@node9 conf]# jps
15829 WardenMain
16321 Jps
16660 CLDB
22119 ResourceManager
21463 CommandServer
22324 FileMigrateApplication
29649 QuorumPeerMain
27632 NodeManager

5) By default, MapR trusts only its own self-signed certificates. To configure MapR to trust the certificates used by AWS S3 for HTTPS uploads, you need to configure additional trusted certificates by adding them to the /opt/mapr/conf/ssl_truststore file on every node in the cluster. As of this writing, the root certificate used by AWS S3 is the Baltimore CyberTrust root certificate provided by DigiCert.

Warning: Most Baltimore CyberTrust root certificates expire in 2025, and expired certificates cannot be used for connecting to AWS S3. When Amazon replaces its certificates with ones issued by new certificate authorities, update the truststore to hold both the old and new root certificates for a smooth transition.

Download the certificate.

[root@node9 tmp]# wget https://www.digicert.com/CACerts/BaltimoreCyberTrustRoot.crt
--2017-04-11 18:19:11--  https://www.digicert.com/CACerts/BaltimoreCyberTrustRoot.crt
Resolving www.digicert.com... 64.78.193.234
Connecting to www.digicert.com|64.78.193.234|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 891 [application/x-x509-ca-cert]
Saving to: “BaltimoreCyberTrustRoot.crt”

100%[=========================================================================================================================================>] 891         --.-K/s   in 0s      

2017-04-11 18:19:20 (175 MB/s) - “BaltimoreCyberTrustRoot.crt” saved [891/891]
[root@node9 conf]# cd /opt/mapr/conf


Run the following command to add the certificate, and enter the keystore password when prompted. The default is mapr123.

[root@node9 conf]# keytool -importcert -file /tmp/BaltimoreCyberTrustRoot.crt -keystore ssl_truststore
Enter keystore password:  
Owner: CN=Baltimore CyberTrust Root, OU=CyberTrust, O=Baltimore, C=IE
Issuer: CN=Baltimore CyberTrust Root, OU=CyberTrust, O=Baltimore, C=IE
Serial number: 20000b9
Valid from: Fri May 12 11:46:00 PDT 2000 until: Mon May 12 16:59:00 PDT 2025
Certificate fingerprints:
  MD5:  AC:B6:94:A5:9C:17:E0:D7:91:52:9B:B1:97:06:A6:E4
  SHA1: D4:DE:20:D0:5E:66:FC:53:FE:1A:50:88:2C:78:DB:28:52:CA:E4:74
  SHA256: 16:AF:57:A9:F6:76:B0:AB:12:60:95:AA:5E:BA:DE:F2:2A:B3:11:19:D6:44:AC:95:CD:4B:93:DB:F3:F2:6A:EB
  Signature algorithm name: SHA1withRSA
  Version: 3

Extensions: 

#1: ObjectId: 2.5.29.19 Criticality=true
BasicConstraints:[
  CA:true
  PathLen:3
]

#2: ObjectId: 2.5.29.15 Criticality=true
KeyUsage [
  Key_CertSign
  Crl_Sign
]

#3: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: E5 9D 59 30 82 47 58 CC   AC FA 08 54 36 86 7B 3A  ..Y0.GX....T6..:
0010: B5 04 4D F0                                        ..M.
]
]
Trust this certificate? [no]:  y
Certificate was added to keystore                     <-- Means the cert was added successfully.
[root@node9 conf]# 
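If you want to double-check that the certificate really landed in the truststore, you can grep a verbose listing for it (using the default mapr123 password mentioned above):

keytool -list -v -keystore /opt/mapr/conf/ssl_truststore -storepass mapr123 | grep -i baltimore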

Note: In a multi-node cluster, copy the ssl_truststore file to the same location (/opt/mapr/conf/) on all other MapR nodes, as shown below.
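A simple loop takes care of this; the hostnames below are hypothetical, so substitute your own node list:

# node10/node11/node12 are hypothetical; replace with your cluster's nodes
for h in node10 node11 node12; do
  scp /opt/mapr/conf/ssl_truststore root@$h:/opt/mapr/conf/ssl_truststore
done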

6) Now restart the FileMigrate service.

 maprcli node services -name filemigrate -nodes 10.10.70.109 -action restart


Now log in to the service UI directly.

https://10.10.70.109:9444      (log in as mapr with the <mapr user password>)
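If the page does not come up in the browser, a quick reachability check from the node itself helps rule out network issues (-k is needed because the service presents a self-signed certificate):

curl -k https://10.10.70.109:9444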





7) Now that the service is up, the next step is to configure FileMigrate so it can connect to AWS using your access key and secret key along with the other settings, followed by a restart, so that files can be copied from the MapR cluster to AWS.

Adding properties is straightforward and can be done via the command line or the UI.

a) Command line: copy the edited FileMigrate.properties file to the /var/mapr/filemigrate/ directory on MapR-FS, for example as shown below.
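Assuming you edited a local copy under /tmp (the local path here is just for illustration), the copy could look like this; the -f flag overwrites any existing copy on MapR-FS:

hadoop fs -put -f /tmp/FileMigrate.properties /var/mapr/filemigrate/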


b) UI: click Settings in the top right corner and fill out the required information.





8) Now, to start moving the files to S3:

i. Add a new policy by clicking "New File Migration Policy". Alternatively, select Policy from the dropdown menu.
ii. Set the following in the Add Data Migration Policy page and click OK.

(Required) Directory Path
(Required) Target Bucket
Purge Interval
Delete Empty Directories
Ignore Files Regex (see the example below)
X-Attributes 
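For instance, if you wanted the scanner to skip temporary files, the Ignore Files Regex field could take a pattern along these lines (illustrative only, assuming standard Java-style regular expressions):

Ignore Files Regex:  .*\.tmp$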

Test I: To test data migration, I created four files, as listed below, under the "srcvol" mount point.

[root@node9 ~]# hadoop fs -ls /srcvol
Found 5 items
-rwxr-xr-x   3 root root          0 2017-04-11 19:33 /srcvol/a
-rwxr-xr-x   3 root root          0 2017-04-11 19:33 /srcvol/b
-rwxr-xr-x   3 root root          0 2017-04-11 19:33 /srcvol/c
-rwxr-xr-x   3 root root       1359 2017-04-11 19:32 /srcvol/filemigrate.out
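For reference, the three empty files can be created with touchz, and the fourth copied in from the local filesystem (the local source path for filemigrate.out is my assumption):

# create three empty test files on MapR-FS
hadoop fs -touchz /srcvol/a /srcvol/b /srcvol/c
# copy a small local file in as well (source path is illustrative)
hadoop fs -put /tmp/filemigrate.out /srcvol/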

From the "filemigrate.log" we can see when "com.mapr.filemigrate.FileMigrateServer" starts the scan it finds this new 4 files and now uploads the files to S3 .

2017-04-11 19:48:03,648 INFO  com.mapr.filemigrate.FileMigrateServer [FileMigrateServer:mainthread]: Starting new file scan...
2017-04-11 19:48:03,648 INFO  com.mapr.filemigrate.ScanDirectoryTree [FileMigrateServer:mainthread]: Starting incremental scan of /srcvol
2017-04-11 19:48:03,651 INFO  com.mapr.filemigrate.FileMigrateServer [FileMigrateServer:mainthread]: Scan for new files completed after 0.00 seconds.
2017-04-11 19:48:03,651 INFO  com.mapr.filemigrate.FileMigrateServer [FileMigrateServer:mainthread]: Pausing for 60.00 seconds.
2017-04-11 19:48:21,663 INFO  com.mapr.filemigrate.S3UploadManager [S3UploadManager:monitorActiveUploads]: Stats summary: UploadStats [activeUploadsCounter=0, waitingUploadsCounter=4, bytesUploadedLastHour=0, uploadsLastHour=0]
2017-04-11 19:48:30,658 INFO  com.mapr.filemigrate.S3UploadManager [S3UploadManager:lookForWork]: starting upload for maprfs:///srcvol/filemigrate.out
2017-04-11 19:48:36,664 INFO  com.mapr.filemigrate.S3UploadManager [S3UploadManager:clearCompletedUploads]: upload completed successfully for ActiveUpload [path=maprfs:///srcvol/filemigrate.out, state=Completed, bucket=filemigratertest, enqueueTime=Tue Apr 11 19:33:03 PDT 2017, uploadStartTime=Tue Apr 11 19:48:30 PDT 2017, uploadCompleteTime=Tue Apr 11 19:48:35 PDT 2017, size=1359, modificationTime=Tue Apr 11 19:32:57 PDT 2017, key=srcvol/filemigrate.out, errors=3, percentSent=100.0, lastStateUpdate=Tue Apr 11 19:48:35 PDT 2017]
2017-04-11 19:49:03,649 INFO  com.mapr.filemigrate.FileMigrateServer [FileMigrateServer:mainthread]: Starting new file scan...
2017-04-11 19:49:03,649 INFO  com.mapr.filemigrate.ScanDirectoryTree [FileMigrateServer:mainthread]: Starting incremental scan of /srcvol
2017-04-11 19:49:03,651 INFO  com.mapr.filemigrate.FileMigrateServer [FileMigrateServer:mainthread]: Scan for new files completed after 0.00 seconds.
2017-04-11 19:49:03,651 INFO  com.mapr.filemigrate.FileMigrateServer [FileMigrateServer:mainthread]: Pausing for 60.00 seconds.
2017-04-11 19:49:21,661 INFO  com.mapr.filemigrate.S3UploadManager [S3UploadManager:lookForWork]: starting upload for maprfs:///srcvol/a
2017-04-11 19:49:21,672 INFO  com.mapr.filemigrate.S3UploadManager [S3UploadManager:lookForWork]: starting upload for maprfs:///srcvol/c
2017-04-11 19:49:21,680 INFO  com.mapr.filemigrate.S3UploadManager [S3UploadManager:lookForWork]: starting upload for maprfs:///srcvol/b
2017-04-11 19:49:27,666 INFO  com.mapr.filemigrate.S3UploadManager [S3UploadManager:clearCompletedUploads]: upload completed successfully for ActiveUpload [path=maprfs:///srcvol/a, state=Completed, bucket=filemigratertest, enqueueTime=Tue Apr 11 19:34:03 PDT 2017, uploadStartTime=Tue Apr 11 19:49:21 PDT 2017, uploadCompleteTime=Tue Apr 11 19:49:26 PDT 2017, size=0, modificationTime=Tue Apr 11 19:33:22 PDT 2017, key=srcvol/a, errors=3, percentSent=-0.0, lastStateUpdate=Tue Apr 11 19:49:26 PDT 2017]
2017-04-11 19:49:27,671 INFO  com.mapr.filemigrate.S3UploadManager [S3UploadManager:clearCompletedUploads]: upload completed successfully for ActiveUpload [path=maprfs:///srcvol/c, state=Completed, bucket=filemigratertest, enqueueTime=Tue Apr 11 19:34:03 PDT 2017, uploadStartTime=Tue Apr 11 19:49:21 PDT 2017, uploadCompleteTime=Tue Apr 11 19:49:26 PDT 2017, size=0, modificationTime=Tue Apr 11 19:33:33 PDT 2017, key=srcvol/c, errors=3, percentSent=-0.0, lastStateUpdate=Tue Apr 11 19:49:26 PDT 2017]
2017-04-11 19:49:27,676 INFO  com.mapr.filemigrate.S3UploadManager [S3UploadManager:clearCompletedUploads]: upload completed successfully for ActiveUpload [path=maprfs:///srcvol/b, state=Completed, bucket=filemigratertest, enqueueTime=Tue Apr 11 19:34:03 PDT 2017, uploadStartTime=Tue Apr 11 19:49:21 PDT 2017, uploadCompleteTime=Tue Apr 11 19:49:26 PDT 2017, size=0, modificationTime=Tue Apr 11 19:33:28 PDT 2017, key=srcvol/b, errors=3, percentSent=-0.0, lastStateUpdate=Tue Apr 11 19:49:26 PDT 2017]

Verify Test I:

[root@node9 logs]# yum install s3cmd

Installed:
  s3cmd.noarch 0:1.6.1-1.el6                                                                                                                                                         

Dependency Installed:
  python-magic.x86_64 0:5.04-30.el6                                                                                                                                                  

Dependency Updated:
  file.x86_64 0:5.04-30.el6
  file-libs.x86_64 0:5.04-30.el6

Complete!
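Note that on a fresh install s3cmd does not yet know your AWS credentials; the interactive setup walks you through entering the access key and secret key:

s3cmd --configure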


Note: the bucket filemigratertest must already exist for the service to perform its tasks. If it does not exist, you can create the bucket via the AWS console or with the command below.

[root@node9 logs]# s3cmd mb s3://filemigratertest

Bucket 's3://filemigratertest/' created

Verified: the files do indeed exist in the bucket.


[root@node9 logs]# s3cmd ls s3://filemigratertest/srcvol/

2017-04-12 02:49         0   s3://filemigratertest/srcvol/a
2017-04-12 02:49         0   s3://filemigratertest/srcvol/b
2017-04-12 02:49         0   s3://filemigratertest/srcvol/c
2017-04-12 02:48      1359   s3://filemigratertest/srcvol/filemigrate.out

The same can be verified from the AWS console as well.





Test II: I created a new file "/srcvol/d" to test whether the service picks up this new file and uploads it.

[root@node9 logs]# hadoop fs -ls /srcvol
Found 5 items
-rwxr-xr-x   3 root root          0 2017-04-11 19:33 /srcvol/a
-rwxr-xr-x   3 root root          0 2017-04-11 19:33 /srcvol/b
-rwxr-xr-x   3 root root          0 2017-04-11 19:33 /srcvol/c
-rwxr-xr-x   3 root root          0 2017-04-11 20:11 /srcvol/d
-rwxr-xr-x   3 root root       1359 2017-04-11 19:32 /srcvol/filemigrate.out

As seen below, after a minute, when the scan runs, the new file is queued for upload.

2017-04-11 20:12:03,676 INFO  com.mapr.filemigrate.FileMigrateServer [FileMigrateServer:mainthread]: Starting new file scan...
2017-04-11 20:12:03,676 INFO  com.mapr.filemigrate.ScanDirectoryTree [FileMigrateServer:mainthread]: Starting incremental scan of /srcvol
2017-04-11 20:12:03,702 INFO  com.mapr.filemigrate.S3UploadManager [S3UploadManager:lookForWork]: starting upload for maprfs:///srcvol/d
2017-04-11 20:12:03,706 INFO  com.mapr.filemigrate.FileMigrateServer [FileMigrateServer:mainthread]: Scan for new files completed after 0.00 seconds. 
2017-04-11 20:12:03,706 INFO  com.mapr.filemigrate.FileMigrateServer [FileMigrateServer:mainthread]: Pausing for 59.97 seconds.
2017-04-11 20:12:12,695 INFO  com.mapr.filemigrate.S3UploadManager [S3UploadManager:clearCompletedUploads]: upload completed successfully for ActiveUpload [path=maprfs:///srcvol/d, state=Completed, bucket=filemigratertest, enqueueTime=Tue Apr 11 20:12:03 PDT 2017, uploadStartTime=Tue Apr 11 20:12:03 PDT 2017, uploadCompleteTime=Tue Apr 11 20:12:08 PDT 2017, size=0, modificationTime=Tue Apr 11 20:11:20 PDT 2017, key=srcvol/d, errors=0, percentSent=-0.0, lastStateUpdate=Tue Apr 11 20:12:08 PDT 2017]



Verified:

[root@node9 logs]# s3cmd ls s3://filemigratertest/srcvol/
2017-04-12 02:49         0   s3://filemigratertest/srcvol/a
2017-04-12 02:49         0   s3://filemigratertest/srcvol/b
2017-04-12 02:49         0   s3://filemigratertest/srcvol/c
2017-04-12 03:12         0   s3://filemigratertest/srcvol/d
2017-04-12 02:48      1359   s3://filemigratertest/srcvol/filemigrate.out