Saturday, November 26, 2016

Interacting with Hadoop Cluster (File Client) Part 1




We interact with a Hadoop cluster for many reasons: loading data into it, reading data from it, running jobs, and so on. Often without realizing it, we rely on FileClient in the back end for all of these operations. So what is FileClient? FileClient is the interface that an application or client uses to communicate with the server (the Hadoop cluster). When an application or client needs to talk to the cluster, it links against the library provided by the FileClient package, which opens an interface between the client and the servers for communication.

Architecture:

FileClient accepts requests from Java or C applications and hands each request to the appropriate layer. For example, when a user (client) tries to read a file in the Hadoop cluster, the file operation reaches the C layer, is forwarded to the common layer, and is then passed to the RPC layer, which makes the remote calls needed to process the request. Once the requested information is available, the result is returned to the user.






Scope of this doc (FileClient): FileClient is a huge topic. This post only covers which file services FileClient provides and the components that let it serve them quickly and efficiently.
   As noted above, an application or client links against the library provided by the FileClient package to open an interface to the cluster. Let's look at why this open interface (libMapRClient) is needed between clients and the server for file operations.






The figure above shows the flow of control when a client wants to identify or list a file in the cluster. There are three parts:

Client: The client takes a file name as input for any operation on it, i.e. read, write, open, etc. The client only understands file names.

libMapRClient: This library mainly maps file names to FIDs and vice versa.

Server: The server identifies files only by FID (file identifier).

We know the file server identifies files only by FID, but what is an FID, and what is it made of?
FID is short for file identifier: the number by which the file server identifies each file. It is made up of three parts separated by “.”: the container number, the inode number, and a unique number identifier.
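To make the three-part layout concrete, here is a minimal Python sketch that splits an FID string into its fields and puts it back together. The names `Fid`, `parse_fid` and `format_fid` are made up for this post; they are not part of any MapR library, and the sketch only assumes the dotted format described above.

```python
from typing import NamedTuple


class Fid(NamedTuple):
    """Hypothetical holder for the three dot-separated parts of an FID."""
    container_id: int  # container number
    inode: int         # inode number
    unique: int        # unique number identifier


def parse_fid(text: str) -> Fid:
    """Split a string such as '2049.51.1180684' into its three parts."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected 3 dot-separated fields, got {len(parts)}: {text!r}")
    container_id, inode, unique = (int(p) for p in parts)
    return Fid(container_id, inode, unique)


def format_fid(fid: Fid) -> str:
    """Render the parts back into the dotted form."""
    return f"{fid.container_id}.{fid.inode}.{fid.unique}"
```

For the FID used in this post, `parse_fid("2049.51.1180684")` gives container 2049, inode 51 and unique identifier 1180684.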

The command below shows how the file name /Test maps to the FID “2049.51.1180684” as its primary chunk.





Conversely, to find the file name for the FID “2049.51.1180684”, we can run the commands below.




When the commands above run, FileClient goes through a number of steps to get the FID for a file and vice versa. To see detailed log messages for every step, we can run the same command with debug tracing enabled, as below, and follow FileClient until it obtains the FID for the /Test file.

hadoop mfs -Dfs.mapr.trace=debug  -ls /Test
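The debug output is long, so it can help to look at one component at a time. Below is a small Python sketch that runs the same command and keeps only the lines for a given component (for example Cidcache). The helper names are hypothetical, and because this post does not say whether the trace goes to stdout or stderr, the sketch simply merges both streams; treat that as an assumption.

```python
import subprocess
from typing import Iterable, List


def build_trace_command(path: str) -> List[str]:
    """Argument list for the debug-enabled listing command shown above."""
    return ["hadoop", "mfs", "-Dfs.mapr.trace=debug", "-ls", path]


def filter_component(lines: Iterable[str], component: str) -> List[str]:
    """Keep only DEBUG lines logged by one component, e.g. 'Cidcache'."""
    marker = f" DEBUG {component} "
    return [line for line in lines if marker in line]


def trace_lines(path: str, component: str) -> List[str]:
    """Run the command and return the trace lines for one component.

    Assumption: stdout and stderr are merged, since the post does not
    state which stream carries the trace messages.
    """
    proc = subprocess.run(
        build_trace_command(path),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return filter_component(proc.stdout.splitlines(), component)
```

On a node with the client installed, `trace_lines("/Test", "Cidcache")` would return only the Cidcache lines from that run.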


Explaining every step is out of scope for this post, but I plan to write future posts that list and explain all the steps FileClient takes when writing (loading) files to the cluster or reading files from it.

FileClient keeps three types of cache in memory to speed up operations during the life of the process. They are:

1) CID cache: Every CID (container ID) is cached along with the nodes on which the container resides and the node on which its master copy resides.

For example, in the log lines below from the earlier debug command, Cidcache creates a new entry for container 2049 when it comes across it, along with the IP “172.16.122.159” on which this CID resides.

2015-05-28 15:08:56,7661 DEBUG Cidcache fs/client/fileclient/cc/cidcache.cc:353 Thread: 12608 Created new entry for cid 2049
2015-05-28 15:08:56,7661 DEBUG Cidcache fs/client/fileclient/cc/cidcache.cc:811 Thread: 12608 Setting srcCluster my.cluster.com, Cid 2049
2015-05-28 15:08:56,7662 DEBUG Cidcache fs/client/fileclient/cc/cidcache.cc:119 Thread: 12608 PopulateEntry: For CID 2049 received host IP 172.16.122.159
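To pull the CID-to-host mapping out of a longer trace, the lines can be parsed with a regular expression. The sketch below is plain Python and assumes only the line layout visible in the three lines above (timestamp, level, component, source file and line, thread, message); the function names are hypothetical and the pattern may need adjusting for other message types.

```python
import re
from typing import Dict, Iterable, Optional

# Layout assumed from the sample lines in this post.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+) "
    r"(?P<level>\w+) (?P<component>\w+) "
    r"(?P<source>\S+):(?P<line>\d+) "
    r"Thread: (?P<thread>\d+)\s+(?P<message>.*)$"
)
HOST_RE = re.compile(r"PopulateEntry: For CID (?P<cid>\d+) received host IP (?P<ip>\S+)")

SAMPLE = [
    "2015-05-28 15:08:56,7661 DEBUG Cidcache fs/client/fileclient/cc/cidcache.cc:353 Thread: 12608 Created new entry for cid 2049",
    "2015-05-28 15:08:56,7661 DEBUG Cidcache fs/client/fileclient/cc/cidcache.cc:811 Thread: 12608 Setting srcCluster my.cluster.com, Cid 2049",
    "2015-05-28 15:08:56,7662 DEBUG Cidcache fs/client/fileclient/cc/cidcache.cc:119 Thread: 12608 PopulateEntry: For CID 2049 received host IP 172.16.122.159",
]


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of one trace line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def cid_hosts(lines: Iterable[str]) -> Dict[int, str]:
    """Collect the CID -> host IP pairs reported by Cidcache PopulateEntry lines."""
    hosts: Dict[int, str] = {}
    for line in lines:
        record = parse_line(line)
        if record is None or record["component"] != "Cidcache":
            continue
        match = HOST_RE.search(record["message"])
        if match:
            hosts[int(match.group("cid"))] = match.group("ip")
    return hosts
```

Calling `cid_hosts(SAMPLE)` on the three lines above yields a single entry that maps container 2049 to 172.16.122.159.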


2) FID cache: FileClient caches every FID with its corresponding file so that it can look it up quickly later.

The log lines below, from the earlier debug command, confirm that FID “2049.573.1051710” is added to Fidcache for the /Test file.

2015-05-28 15:08:56,7677 DEBUG Client fs/client/fileclient/cc/client.cc:2630 Thread: 12608 PathWalk: Adding /Test Fid 2049.573.1051710 to Fidcache
2015-05-28 15:08:56,7677 DEBUG Client fs/client/fileclient/cc/client.cc:2634 Thread: 12608 PathWalk: WalkDone File /Test, resp fid 2049.573.1051710
2015-05-28 15:08:56,7677 DEBUG JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:1049 Thread: 12608  -- Exit JNI getattr -- /Test

A few lines later, the log confirms that when /Test is opened, FileClient looks it up in the FID cache and gets a cache hit.

2015-05-28 15:08:56,8119 DEBUG Client fs/client/fileclient/cc/client.cc:1321 Thread: 12608 >Open: file /Test
2015-05-28 15:08:56,8119 DEBUG Client fs/client/fileclient/cc/client.cc:2514 Thread: 12608 Lookupfid : start = /Test, end = (nil)
2015-05-28 15:08:56,8119 DEBUG Client fs/client/fileclient/cc/client.cc:2564 Thread: 12608 Path /Test fid:2049.573.1051710 found in fidcache
2015-05-28 15:08:56,8120 DEBUG Inode fs/client/fileclient/cc/inode.cc:225 Thread: 12608 itab:fill  2049.573.1051710 cache hit, copied fattrs from 0x7f993c940ab0
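The add-then-hit pattern in these log lines can be illustrated with a toy cache. This is only a Python sketch of the general idea (look in the cache first, fall back to a path walk on a miss, remember the answer); it is not the FileClient implementation, and `FidCache` and the `resolve` callback are hypothetical names standing in for the server round trip.

```python
from typing import Callable, Dict


class FidCache:
    """Toy path -> FID cache that counts hits and misses."""

    def __init__(self, resolve: Callable[[str], str]) -> None:
        # resolve stands in for the path walk against the server (assumption).
        self._resolve = resolve
        self._by_path: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, path: str) -> str:
        """Return the FID for path, asking the resolver only on a miss."""
        fid = self._by_path.get(path)
        if fid is not None:
            self.hits += 1
            return fid
        self.misses += 1
        fid = self._resolve(path)
        self._by_path[path] = fid  # later lookups become cache hits
        return fid


# Usage: a dict plays the role of the server in this sketch.
fake_server = {"/Test": "2049.573.1051710"}
cache = FidCache(fake_server.__getitem__)
first = cache.lookup("/Test")   # miss: resolved and added to the cache
second = cache.lookup("/Test")  # hit: served from the cache
```

The first lookup corresponds to the "Adding /Test Fid ... to Fidcache" line and the second to the "found in fidcache" line. Details such as eviction or invalidation are not covered in this post and are left out of the sketch.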

3) FID map: This stores the map for the FID.

