Saturday, September 15, 2018

Connecting Databricks Spark cluster to Amazon Redshift



The use of the Redshift connector involves several network connections, as illustrated in the following diagram:



This library reads and writes data to S3 when transferring data to/from Redshift. As a result, it requires an IAM role with read and write access to an S3 bucket (specified using the tempdir configuration parameter) to be attached to the Spark cluster.

Setup Details:

Redshift:

1) Create an IAM role to attach to the Redshift cluster when we bring it up. This role should allow the Redshift service to write to the S3 bucket.



2) Create a Redshift cluster from the AWS console after populating the basic info.

Spark Cluster:

3) Create a role with full S3 access, which will be used to write to the temp bucket during parallel writes and reads. Below is a screenshot of the role created.

4) Create a Spark cluster with an IAM role that has write access to the temp S3 bucket and a DBR version greater than 3.5.
5) Download and attach the Redshift JDBC jar to the cluster as shown below.

Jar download: https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html
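
Once the jar is attached, one quick sanity check is to load the driver class from a Scala cell. This is just a sketch, and it assumes the JDBC 4.2 driver from the link above (older jars expose com.amazon.redshift.jdbc41.Driver instead).

// Verify the Redshift JDBC driver is on the cluster classpath
Class.forName("com.amazon.redshift.jdbc42.Driver")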



6) Import the below notebook into your Databricks workspace to enable reading from and writing to Redshift from Databricks clusters.

https://docs.databricks.com/_static/notebooks/redshift.html     

6.1) Set the temp S3 location that the Databricks cluster can use.
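
For example, in a Scala cell we can point the connector at a prefix in the temp bucket created in step 3; the bucket name below is a placeholder, not the actual bucket used in this setup.

// Placeholder temp bucket; replace with the bucket the cluster's IAM role can write to
val tempDir = "s3a://<your-temp-bucket>/redshift-temp/"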

                
6.2) Pass the credentials for connecting to Redshift.
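
A minimal sketch of building the JDBC URL, assuming username/password authentication; the host, database, user, and password values below are placeholders.

// Placeholder connection details for the Redshift cluster created in step 2
val redshiftHost = "<cluster-endpoint>.<region>.redshift.amazonaws.com"
val redshiftDb   = "dev"
val redshiftUser = "<redshift_user>"
val redshiftPass = "<redshift_password>"
val jdbcUrl = s"jdbc:redshift://$redshiftHost:5439/$redshiftDb?user=$redshiftUser&password=$redshiftPass"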




Now that the connection is established, we will do a write test and a read test to ensure everything works as expected.


Write Test:

1a) We have a pre-populated table "diamonds" in the cluster, as seen below.
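
(If the table is not already present, it can be created from the diamonds sample CSV that ships with Databricks; the path below is the usual /databricks-datasets location, used here as an assumption.)

// One-time setup only if "diamonds" does not already exist:
// read the sample CSV and register it as a metastore table
val diamondsDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
diamondsDF.write.saveAsTable("diamonds")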



1b) Running the below code in a Scala cell appends data from the "diamonds" table in the Spark cluster to Redshift.

import org.apache.spark.sql.SaveMode
sqlContext.sql("select * from diamonds limit 10")
  .write
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("tempdir", tempDir)
  .option("dbtable", "diamonds")
  .option("aws_iam_role","arn:aws:iam::826763667205:role/Redshift_cust")
  .mode(SaveMode.Append) // <--- Append to the existing table
  .save()
         
Once the cell executes successfully, we can connect to the Redshift cluster via SQL Workbench and verify that the data was populated.


Read Test:

2a) We'll load data from the Redshift table that we created in the previous write test, i.e. we'll create a DataFrame from an entire Redshift table:

Run the below code to create the DataFrame:

val diamonds_from_redshift = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl) // <--- JDBC URL that we configured earlier
  .option("tempdir", tempDir) // <--- temporary bucket that we created earlier
  .option("dbtable", "diamonds") // <--- name of the table in Redshift
  .option("aws_iam_role","arn:aws:iam::826763667205:role/Redshift_cust")
  .load()

Once the DataFrame is created, we can register a temp view and query it to confirm that reads from Redshift also work as expected.
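
For example (the view name here is just for illustration):

// Register the DataFrame as a temporary view and query it with Spark SQL
diamonds_from_redshift.createOrReplaceTempView("diamonds_from_redshift_view")
sqlContext.sql("select cut, count(*) as cnt from diamonds_from_redshift_view group by cut").show()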



Bingo! We are able to read from and write to Redshift from the Databricks cluster!


