Thursday, May 3, 2018

Databricks Architecture on AWS (Single Tenant)



In the previous blog we deployed a Databricks instance that allowed us to spin up clusters on demand, create notebooks, attach/detach them from clusters as needed, and run distributed computation via Spark on the cluster. Behind the scenes, however, there are quite a few services stitching together the different pieces: cluster spin-up, notebook creation, allocating space in the workspace, and so on. This blog explains a very simple Single-Tenant setup and sheds light on the backend services, the responsibility of each service, and the tasks they perform so that customers can spin up clusters and run jobs.

http://abizeradenwala.blogspot.com/2018/05/setting-up-databricks-instanceshard.html

Below is a screenshot of me spinning up a test Spark cluster and attaching a notebook to it to perform the computation.



User Access and Cluster Deployment:

The figure below shows the Databricks deployment architecture in detail. Once the customer logs in with their credentials, the request is forwarded from the Web App/Web frontend to the Databricks central service to verify that the user and credentials are legitimate. Once verified, the user is authorized and sees the Welcome to Databricks screen, after which they can perform all tasks such as cluster spin-up, running jobs, etc.


Databricks is a hosted end-to-end platform deployed within each customer’s Amazon Web Services (AWS) account. At a high level the platform consists of a Control Plane and a Data Plane; the Control Plane communicates with services in the Data Plane by issuing API calls through a Cross-Account IAM role.

Data Plane:
In the majority of customer deployment models, Spark clusters are deployed within the customer's account (inside their own VPC).
Control Plane:
The front-end and Cluster Manager services are deployed within an isolated VPC in a Databricks-controlled account.
AWS Cross-Account IAM Roles:
In order to make API calls to a customer’s AWS account, Databricks uses cross-account identity and access management (IAM) roles. We set this up in the earlier blog (configuring the AWS Cross-Account role), since the Cluster Manager has to fire API calls to spin up EC2 nodes, scale the cluster up/down, etc.; a sketch of such a call is shown below.
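For illustration only, the snippet below sketches the kind of cross-account call the Cluster Manager could make: assume the customer's IAM role via STS, then launch EC2 instances with the temporary credentials. The role ARN, AMI, and instance type are hypothetical placeholders, not actual Databricks values.

import boto3

# Assume the customer's cross-account IAM role (placeholder ARN).
sts = boto3.client("sts")
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/databricks-cross-account",  # hypothetical role
    RoleSessionName="cluster-manager-session",
)
creds = assumed["Credentials"]

# Use the temporary credentials to call EC2 in the customer's account.
ec2 = boto3.client(
    "ec2",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Launch worker nodes for a Spark cluster (AMI and instance type are placeholders).
response = ec2.run_instances(
    ImageId="ami-0abcdef1234567890",
    InstanceType="i3.xlarge",
    MinCount=2,
    MaxCount=2,
)
print([i["InstanceId"] for i in response["Instances"]])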



Spark Cluster Deployment (Processing):

The diagram below shows a typical Spark cluster deployed in a customer's account. The Databricks services (Spark Driver/Executors) run within the customer’s VPC, and the Cluster Manager in the Control Plane uses the previously configured IAM role to make the AWS calls necessary to deploy the Spark cluster. A sketch of how such a cluster could be requested programmatically follows the diagram.



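As a rough illustration, a cluster like the one in the screenshot earlier can also be requested through the Databricks Clusters REST API; the Cluster Manager then turns such a request into the EC2 calls described above. The workspace URL, token, runtime version, and node type below are placeholders, not values from this deployment.

import requests

DOMAIN = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                          # placeholder token

# Ask the Cluster Manager (via the REST API) to create a small test cluster.
resp = requests.post(
    f"{DOMAIN}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "test-cluster",
        "spark_version": "<runtime-version>",  # example runtime version string
        "node_type_id": "i3.xlarge",           # example AWS instance type
        "num_workers": 2,
    },
)
resp.raise_for_status()
print(resp.json())  # the response contains the new cluster_id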

Data Access and Processing:

To run jobs, customers usually need to read from and write to persistent storage. S3 buckets are used for this purpose, and the Databricks services do not inherently gain access to data sources through the IAM role delegation. To access data, the user provides the credentials required by the data source in question (in our case, when we created the bucket we associated it with a policy/user that can read and write). The figure below gives a good overview of how Spark clusters (Data Plane) access data located in S3 buckets, and a short notebook-style sketch of that access follows it.


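As a minimal sketch of this pattern (the bucket name and keys are placeholders, not values from this deployment), a notebook can pass the user's own S3 credentials to Spark and read and write data in the bucket directly:

from pyspark.sql import SparkSession

# In a Databricks notebook a SparkSession is already available as `spark`.
spark = SparkSession.builder.getOrCreate()

# Supply the user's own credentials for the data source (placeholder values).
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hconf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")

# Read a dataset from the S3 bucket created in the earlier blog (placeholder path).
df = spark.read.csv("s3a://my-test-bucket/data/input.csv", header=True)
df.show(5)

# Write results back to the same bucket.
df.write.mode("overwrite").parquet("s3a://my-test-bucket/data/output/")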







