Saturday, March 21, 2020

Connecting to Databricks from PyCharm (On Mac) using Databricks Connect


Databricks Connect allows you to connect your favorite IDE (IntelliJ, Eclipse, PyCharm, RStudio, Visual Studio), notebook server (Zeppelin, Jupyter), and other custom applications to Azure Databricks clusters and run Apache Spark code.


I will walk you through the steps to connect PyCharm installed on your MacBook to Databricks clusters, run jobs against them, and get the results back in PyCharm's stdout.

Note : Here I will be connecting to a cluster with Databricks Runtime 6.3 and Python 3.7. It is assumed you already have PyCharm and Python 3.7 set up on your Mac.
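Optional sanity check: Databricks Connect requires your local Python minor version to match the cluster's. A minimal sketch (assuming Python 3.7 on both sides, as noted above) that you can run locally before going further:

import sys

# The local (minor) Python version must match the cluster's Python -- 3.7 here,
# since the cluster runs DBR 6.3 with Python 3.7.
print(sys.version_info[:2])  # expect (3, 7)
assert sys.version_info[:2] == (3, 7), "Local Python must match the cluster's Python version"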




      Step 1 : Uninstall PySpark if installed (in my case it was not installed)

    C02WG59KHTD5:Downloads abizeradenwala$ pip uninstall pyspark
    Skipping pyspark as it is not installed.
    C02WG59KHTD5:Downloads abizeradenwala$

    Step 2 : Install the Databricks Connect client (I had an older client, which was removed automatically)

C02WG59KHTD5:Downloads abizeradenwala$ /Library/Frameworks/Python.framework/Versions/3.7/bin/pip3 install -U databricks-connect==6.3.*
Collecting databricks-connect==6.3.*
  Downloading https://files.pythonhosted.org/packages/fd/b4/3a1a1e45f24bde2a2986bb6e8096d545a5b24374f2cfe2b36ac5c7f30f4b/databricks-connect-6.3.1.tar.gz (246.4MB)
    100% |████████████████████████████████| 246.4MB 174kB/s 
Requirement already satisfied, skipping upgrade: py4j==0.10.7 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from databricks-connect==6.3.*) (0.10.7)
Requirement already satisfied, skipping upgrade: six in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from databricks-connect==6.3.*) (1.14.0)
Installing collected packages: databricks-connect
  Found existing installation: databricks-connect 5.5.3
    Uninstalling databricks-connect-5.5.3:
      Successfully uninstalled databricks-connect-5.5.3
  Running setup.py install for databricks-connect ... done
Successfully installed databricks-connect-6.3.1
C02WG59KHTD5:Downloads abizeradenwala$ 


        Step 3 : Gather connection properties

- Azure Databricks workspace URL (contains the org ID)
- Personal access token (PAT)
- Cluster ID
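For reference, the org ID is simply the value of the "o" query parameter in the Azure workspace URL you see in the browser. A small illustrative snippet (the URL below is the one used later in this post) for pulling it out:

from urllib.parse import urlparse, parse_qs

# Example Azure Databricks workspace URL copied from the browser address bar.
workspace_url = "https://westus2.azuredatabricks.net/?o=6935536957980197"

# The org ID is the value of the "o" query parameter.
org_id = parse_qs(urlparse(workspace_url).query)["o"][0]
print(org_id)  # 6935536957980197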

       Step 4 : Configure the connection. You will be prompted interactively for the details collected above.

C02WG59KHTD5:Downloads abizeradenwala$ databricks-connect configure
Copyright (2018) Databricks, Inc.

...
...

Databricks Platform Services: the Databricks services or the Databricks
Community Edition services, according to where the Software is used.

Licensee: the user of the Software, or, if the Software is being used on
behalf of a company, the company.

Do you accept the above agreement? [y/N] y
Set new config values (leave input empty to accept default):
Databricks Host [no current value, must start with https://]: https://westus2.azuredatabricks.net
Databricks Token [no current value]: XYZZZZZZZZZZZ

IMPORTANT: please ensure that your cluster has:
- Databricks Runtime version of DBR 5.1+
- Python version same as your local Python (i.e., 2.7 or 3.5)
- the Spark conf `spark.databricks.service.server.enabled true` set

Cluster ID (e.g., 0921-001415-jelly628) [no current value]: 0317-213025-tarry631
Org ID (Azure-only, see ?o=orgId in URL) [0]: 6935536957980197
Port [15001]: 

Updated configuration in /Users/abizeradenwala/.databricks-connect
* Spark jar dir: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark/jars
* Spark home: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark
* Run `pip install -U databricks-connect` to install updates
* Run `pyspark` to launch a Python shell
* Run `spark-shell` to launch a Scala shell
* Run `databricks-connect test` to test connectivity

Databricks Connect User Survey: https://forms.gle/V2indnHHfrjGWyQ4A

C02WG59KHTD5:Downloads abizeradenwala$
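The configure step writes these values to ~/.databricks-connect. As a quick sanity check you can print the file with the token masked; this is just a sketch and assumes the file is JSON with keys such as host, token, cluster_id, org_id and port (key names may differ between client versions):

import json
from pathlib import Path

# Assumption: ~/.databricks-connect is a JSON file written by `databricks-connect configure`.
config = json.loads(Path.home().joinpath(".databricks-connect").read_text())
if "token" in config:
    config["token"] = "****"  # never print the PAT
print(json.dumps(config, indent=2))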

          Step 5 : Set SPARK_HOME by running the command below on the command line, or add it to ~/.bash_profile

export SPARK_HOME=/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark
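If you are not sure what the path is on your machine, you can locate the pyspark directory that databricks-connect installed from Python itself (a small sketch; SPARK_HOME should simply point at this directory):

import os
import pyspark

# databricks-connect ships its own patched pyspark; SPARK_HOME points at its install directory.
print(os.path.dirname(pyspark.__file__))
# e.g. /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark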

         Step 6 : Test connectivity to Azure Databricks.

C02WG59KHTD5:bin abizeradenwala$ databricks-connect test
* PySpark is installed at /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark
* Checking SPARK_HOME
* Checking java version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
* Testing scala command
20/03/21 00:55:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/21 00:55:45 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5-SNAPSHOT
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(100).reduce(_ + _)
Spark context Web UI available at http://c02wg59khtd5.attlocal.net:4040
Spark context available as 'sc' (master = local[*], app id = local-1584770145654).
Spark session available as 'spark'.
View job details at https://westus2.azuredatabricks.net/?o=6935536957980197#/setting/clusters/0317-213025-tarry631/sparkUi
View job details at https://westus2.azuredatabricks.net/?o=6935536957980197#/setting/clusters/0317-213025-tarry631/sparkUi
res0: Long = 4950

scala> :quit

* Testing python command
20/03/21 00:56:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/21 00:56:07 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
View job details at https://westus2.azuredatabricks.net/?o=6935536957980197#/setting/clusters/0317-213025-tarry631/sparkUi
[Stage 6:>                                                          (0 + 4) / 8]
* Testing dbutils.fs
[FileInfo(path='dbfs:/FileStore/', name='FileStore/', size=0), FileInfo(path='dbfs:/Knox/', name='Knox/', size=0), FileInfo(path='dbfs:/PradeepKumar/', name='PradeepKumar/', size=0), FileInfo(path='dbfs:/Users/', name='Users/', size=0), FileInfo(path='dbfs:/abc.sh', name='abc.sh', size=20), FileInfo(path='dbfs:/bank-full.csv', name='bank-full.csv', size=4610348), FileInfo(path='dbfs:/bogdan/', name='bogdan/', size=0), FileInfo(path='dbfs:/checkpoint/', name='checkpoint/', size=0), FileInfo(path='dbfs:/cluster-logs/', name='cluster-logs/', size=0), FileInfo(path='dbfs:/databricks/', name='databricks/', size=0), FileInfo(path='dbfs:/databricks-datasets/', name='databricks-datasets/', size=0), FileInfo(path='dbfs:/databricks-results/', name='databricks-results/', size=0), FileInfo(path='dbfs:/dbfs/', name='dbfs/', size=0), FileInfo(path='dbfs:/delta/', name='delta/', size=0), FileInfo(path='dbfs:/foobar', name='foobar', size=876), FileInfo(path='dbfs:/gauarav/', name='gauarav/', size=0), FileInfo(path='dbfs:/gaurav_poc/', name='gaurav_poc/', size=0), FileInfo(path='dbfs:/gaurav_rupnar/', name='gaurav_rupnar/', size=0), FileInfo(path='dbfs:/glm_data.csv', name='glm_data.csv', size=9836), FileInfo(path='dbfs:/glm_model/', name='glm_model/', size=0), FileInfo(path='dbfs:/jordan/', name='jordan/', size=0), FileInfo(path='dbfs:/jose/', name='jose/', size=0), FileInfo(path='dbfs:/jose.gonzalezmunoz@databricks.com/', name='jose.gonzalezmunoz@databricks.com/', size=0), FileInfo(path='dbfs:/knox/', name='knox/', size=0), FileInfo(path='dbfs:/local_disk0/', name='local_disk0/', size=0), FileInfo(path='dbfs:/matt/', name='matt/', size=0), FileInfo(path='dbfs:/ml/', name='ml/', size=0), FileInfo(path='dbfs:/mlflow/', name='mlflow/', size=0), FileInfo(path='dbfs:/mnt/', name='mnt/', size=0), FileInfo(path='dbfs:/piyushmnt/', name='piyushmnt/', size=0), FileInfo(path='dbfs:/pradeepkumar/', name='pradeepkumar/', size=0), FileInfo(path='dbfs:/rdd1-1562366996207/', name='rdd1-1562366996207/', size=0), FileInfo(path='dbfs:/scripts/', name='scripts/', size=0), FileInfo(path='dbfs:/takeshi/', name='takeshi/', size=0), FileInfo(path='dbfs:/te', name='te', size=36), FileInfo(path='dbfs:/test/', name='test/', size=0), FileInfo(path='dbfs:/test1/', name='test1/', size=0), FileInfo(path='dbfs:/testing/', name='testing/', size=0), FileInfo(path='dbfs:/testing1/', name='testing1/', size=0), FileInfo(path='dbfs:/testing2', name='testing2', size=3717), FileInfo(path='dbfs:/tmp/', name='tmp/', size=0), FileInfo(path='dbfs:/tmp1/', name='tmp1/', size=0), FileInfo(path='dbfs:/user/', name='user/', size=0), FileInfo(path='dbfs:/xin/', name='xin/', size=0), FileInfo(path='dbfs:/xyz.sh', name='xyz.sh', size=20), FileInfo(path='dbfs:/{workingDir}/', name='{workingDir}/', size=0)]

* All tests passed.

C02WG59KHTD5:bin abizeradenwala$


This confirms the Mac can connect to the Databricks cluster remotely.
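As an extra (optional) check, the same range/sum job that the Scala shell ran can be reproduced from a plain Python script; with databricks-connect configured, getOrCreate() returns a session that submits work to the remote cluster rather than a local Spark. A minimal sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sum of 0..99, executed on the Databricks cluster.
total = spark.range(100).agg(F.sum("id")).collect()[0][0]
print(total)  # 4950, matching the Scala test above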


Configuring PyCharm 



  • Create New Project → give it a name (dbconnectabizer)
  • Specify the interpreter → File → Preferences for New Projects → expand your user folder, select python3.7 → OK → Create

- Also install the databricks-connect package for this project interpreter and click OK.



  • Select your project → New → Python File
    • Create dbctest (this creates a .py file)
    • Run → Edit Configurations
      • Click the “+” icon (top left) → Python → Script path (point it to the .py file created earlier) → Open
      • Add Environment variables → Add new → PYSPARK_PYTHON = python3 → Apply & OK




      • Apply
    • Type your code (something to execute from PyCharm against Databricks) → Run

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# getOrCreate() picks up the databricks-connect configuration and runs against
# the remote cluster instead of a local Spark.
spark = SparkSession.builder.getOrCreate()

# The songs dataset is tab-separated with no header, so columns come in as _c0, _c1, ...
song_df = spark.read \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .csv("/databricks-datasets/songs/data-001/part-0000*")

tempo_df = song_df.select(
    col("_c4").alias("artist_name"),
    col("_c14").alias("tempo"),
)

avg_tempo_df = tempo_df \
    .groupBy("artist_name") \
    .avg("tempo") \
    .orderBy("avg(tempo)", ascending=False)

print("Calling show command which will trigger Spark processing")
avg_tempo_df.show(truncate=False)

    • You can see it’s executing against the cluster
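The dbutils.fs listing from Step 6 can also be reproduced from the same PyCharm script. This is only a sketch based on the pyspark.dbutils module that ships with databricks-connect; depending on the client version the DBUtils constructor expects either the SparkSession or the SparkContext:

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils  # bundled with databricks-connect

spark = SparkSession.builder.getOrCreate()

# Assumption: this client version accepts the SparkSession; older clients may
# expect spark.sparkContext instead.
dbutils = DBUtils(spark)
for f in dbutils.fs.ls("dbfs:/databricks-datasets/songs/data-001/")[:5]:
    print(f.path)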



Bingo !!! We can now execute Spark code from PyCharm against Databricks and view the results in stdout.

