Saturday, March 21, 2020

Connecting to Databricks from PyCharm (On Mac) using Databricks Connect


Databricks Connect allows you to connect your favorite IDE (IntelliJ, Eclipse, PyCharm, RStudio, Visual Studio), notebook server (Zeppelin, Jupyter), and other custom applications to Azure Databricks clusters and run Apache Spark code.


I will walk you through the steps to connect PyCharm installed on your MacBook to Databricks clusters, run jobs against them, and get the results back in PyCharm's stdout.

Note : Here I will be connecting to a cluster with Databricks Runtime 6.3 and Python 3.7. It is assumed you already have PyCharm and Python 3.7 set up on your Mac.
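Optional sanity check: Databricks Connect requires your local Python minor version to match the cluster's. A minimal sketch (assuming Python 3.7 on both sides, as noted above) that you can run locally before going further:

import sys

# The local (minor) Python version must match the cluster's Python -- 3.7 here,
# since the cluster runs DBR 6.3 with Python 3.7.
print(sys.version_info[:2])  # expect (3, 7)
assert sys.version_info[:2] == (3, 7), "Local Python must match the cluster's Python version"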




      Step 1 : Uninstall PySpark if installed (in my case it was not installed)

    C02WG59KHTD5:Downloads abizeradenwala$ pip uninstall pyspark
    Skipping pyspark as it is not installed.
    C02WG59KHTD5:Downloads abizeradenwala$

    Step 2 : Install the Databricks Connect client (I had an older client, which was removed automatically)

C02WG59KHTD5:Downloads abizeradenwala$ /Library/Frameworks/Python.framework/Versions/3.7/bin/pip3 install -U databricks-connect==6.3.*
Collecting databricks-connect==6.3.*
  Downloading https://files.pythonhosted.org/packages/fd/b4/3a1a1e45f24bde2a2986bb6e8096d545a5b24374f2cfe2b36ac5c7f30f4b/databricks-connect-6.3.1.tar.gz (246.4MB)
    100% |████████████████████████████████| 246.4MB 174kB/s 
Requirement already satisfied, skipping upgrade: py4j==0.10.7 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from databricks-connect==6.3.*) (0.10.7)
Requirement already satisfied, skipping upgrade: six in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from databricks-connect==6.3.*) (1.14.0)
Installing collected packages: databricks-connect
  Found existing installation: databricks-connect 5.5.3
    Uninstalling databricks-connect-5.5.3:
      Successfully uninstalled databricks-connect-5.5.3
  Running setup.py install for databricks-connect ... done
Successfully installed databricks-connect-6.3.1
C02WG59KHTD5:Downloads abizeradenwala$ 


        Step 3 : Gather connection properties

- Azure Databricks workspace URL (contains the org ID)
- Personal access token (PAT)
- Cluster ID
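For reference, the org ID is simply the value of the "o" query parameter in the Azure workspace URL you see in the browser. A small illustrative snippet (the URL below is the one used later in this post) for pulling it out:

from urllib.parse import urlparse, parse_qs

# Example Azure Databricks workspace URL copied from the browser address bar.
workspace_url = "https://westus2.azuredatabricks.net/?o=6935536957980197"

# The org ID is the value of the "o" query parameter.
org_id = parse_qs(urlparse(workspace_url).query)["o"][0]
print(org_id)  # 6935536957980197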

       Step 4 : Configure the connection. You will be prompted interactively for the details collected above.

C02WG59KHTD5:Downloads abizeradenwala$ databricks-connect configure
Copyright (2018) Databricks, Inc.

...
...

Databricks Platform Services: the Databricks services or the Databricks
Community Edition services, according to where the Software is used.

Licensee: the user of the Software, or, if the Software is being used on
behalf of a company, the company.

Do you accept the above agreement? [y/N] y
Set new config values (leave input empty to accept default):
Databricks Host [no current value, must start with https://]: https://westus2.azuredatabricks.net
Databricks Token [no current value]: XYZZZZZZZZZZZ

IMPORTANT: please ensure that your cluster has:
- Databricks Runtime version of DBR 5.1+
- Python version same as your local Python (i.e., 2.7 or 3.5)
- the Spark conf `spark.databricks.service.server.enabled true` set

Cluster ID (e.g., 0921-001415-jelly628) [no current value]: 0317-213025-tarry631
Org ID (Azure-only, see ?o=orgId in URL) [0]: 6935536957980197
Port [15001]: 

Updated configuration in /Users/abizeradenwala/.databricks-connect
* Spark jar dir: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark/jars
* Spark home: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark
* Run `pip install -U databricks-connect` to install updates
* Run `pyspark` to launch a Python shell
* Run `spark-shell` to launch a Scala shell
* Run `databricks-connect test` to test connectivity

Databricks Connect User Survey: https://forms.gle/V2indnHHfrjGWyQ4A

C02WG59KHTD5:Downloads abizeradenwala$
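The configure step writes these values to ~/.databricks-connect. As a quick sanity check you can print the file with the token masked; this is just a sketch and assumes the file is JSON with keys such as host, token, cluster_id, org_id and port (key names may differ between client versions):

import json
from pathlib import Path

# Assumption: ~/.databricks-connect is a JSON file written by `databricks-connect configure`.
config = json.loads(Path.home().joinpath(".databricks-connect").read_text())
if "token" in config:
    config["token"] = "****"  # never print the PAT
print(json.dumps(config, indent=2))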

          Step 5 : Set SPARK_HOME by running the command below on the command line, or add it to ~/.bash_profile

export SPARK_HOME=/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark
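If you are not sure what the path is on your machine, you can locate the pyspark directory that databricks-connect installed from Python itself (a small sketch; SPARK_HOME should simply point at this directory):

import os
import pyspark

# databricks-connect ships its own patched pyspark; SPARK_HOME points at its install directory.
print(os.path.dirname(pyspark.__file__))
# e.g. /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark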

         Step 6 : Test connectivity to Azure Databricks.

C02WG59KHTD5:bin abizeradenwala$ databricks-connect test
* PySpark is installed at /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark
* Checking SPARK_HOME
* Checking java version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
* Testing scala command
20/03/21 00:55:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/21 00:55:45 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5-SNAPSHOT
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(100).reduce(_ + _)
Spark context Web UI available at http://c02wg59khtd5.attlocal.net:4040
Spark context available as 'sc' (master = local[*], app id = local-1584770145654).
Spark session available as 'spark'.
View job details at https://westus2.azuredatabricks.net/?o=6935536957980197#/setting/clusters/0317-213025-tarry631/sparkUi
View job details at https://westus2.azuredatabricks.net/?o=6935536957980197#/setting/clusters/0317-213025-tarry631/sparkUi
res0: Long = 4950

scala> :quit

* Testing python command
20/03/21 00:56:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/21 00:56:07 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
View job details at https://westus2.azuredatabricks.net/?o=6935536957980197#/setting/clusters/0317-213025-tarry631/sparkUi
[Stage 6:>                                                          (0 + 4) / 8]
* Testing dbutils.fs
[FileInfo(path='dbfs:/FileStore/', name='FileStore/', size=0), FileInfo(path='dbfs:/Knox/', name='Knox/', size=0), FileInfo(path='dbfs:/PradeepKumar/', name='PradeepKumar/', size=0), FileInfo(path='dbfs:/Users/', name='Users/', size=0), FileInfo(path='dbfs:/abc.sh', name='abc.sh', size=20), FileInfo(path='dbfs:/bank-full.csv', name='bank-full.csv', size=4610348), FileInfo(path='dbfs:/bogdan/', name='bogdan/', size=0), FileInfo(path='dbfs:/checkpoint/', name='checkpoint/', size=0), FileInfo(path='dbfs:/cluster-logs/', name='cluster-logs/', size=0), FileInfo(path='dbfs:/databricks/', name='databricks/', size=0), FileInfo(path='dbfs:/databricks-datasets/', name='databricks-datasets/', size=0), FileInfo(path='dbfs:/databricks-results/', name='databricks-results/', size=0), FileInfo(path='dbfs:/dbfs/', name='dbfs/', size=0), FileInfo(path='dbfs:/delta/', name='delta/', size=0), FileInfo(path='dbfs:/foobar', name='foobar', size=876), FileInfo(path='dbfs:/gauarav/', name='gauarav/', size=0), FileInfo(path='dbfs:/gaurav_poc/', name='gaurav_poc/', size=0), FileInfo(path='dbfs:/gaurav_rupnar/', name='gaurav_rupnar/', size=0), FileInfo(path='dbfs:/glm_data.csv', name='glm_data.csv', size=9836), FileInfo(path='dbfs:/glm_model/', name='glm_model/', size=0), FileInfo(path='dbfs:/jordan/', name='jordan/', size=0), FileInfo(path='dbfs:/jose/', name='jose/', size=0), FileInfo(path='dbfs:/jose.gonzalezmunoz@databricks.com/', name='jose.gonzalezmunoz@databricks.com/', size=0), FileInfo(path='dbfs:/knox/', name='knox/', size=0), FileInfo(path='dbfs:/local_disk0/', name='local_disk0/', size=0), FileInfo(path='dbfs:/matt/', name='matt/', size=0), FileInfo(path='dbfs:/ml/', name='ml/', size=0), FileInfo(path='dbfs:/mlflow/', name='mlflow/', size=0), FileInfo(path='dbfs:/mnt/', name='mnt/', size=0), FileInfo(path='dbfs:/piyushmnt/', name='piyushmnt/', size=0), FileInfo(path='dbfs:/pradeepkumar/', name='pradeepkumar/', size=0), FileInfo(path='dbfs:/rdd1-1562366996207/', name='rdd1-1562366996207/', size=0), FileInfo(path='dbfs:/scripts/', name='scripts/', size=0), FileInfo(path='dbfs:/takeshi/', name='takeshi/', size=0), FileInfo(path='dbfs:/te', name='te', size=36), FileInfo(path='dbfs:/test/', name='test/', size=0), FileInfo(path='dbfs:/test1/', name='test1/', size=0), FileInfo(path='dbfs:/testing/', name='testing/', size=0), FileInfo(path='dbfs:/testing1/', name='testing1/', size=0), FileInfo(path='dbfs:/testing2', name='testing2', size=3717), FileInfo(path='dbfs:/tmp/', name='tmp/', size=0), FileInfo(path='dbfs:/tmp1/', name='tmp1/', size=0), FileInfo(path='dbfs:/user/', name='user/', size=0), FileInfo(path='dbfs:/xin/', name='xin/', size=0), FileInfo(path='dbfs:/xyz.sh', name='xyz.sh', size=20), FileInfo(path='dbfs:/{workingDir}/', name='{workingDir}/', size=0)]

* All tests passed.

C02WG59KHTD5:bin abizeradenwala$


This confirms the Mac can connect to the Databricks cluster remotely.
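As an extra (optional) check, the same range/sum job that the Scala shell ran can be reproduced from a plain Python script; with databricks-connect configured, getOrCreate() returns a session that submits work to the remote cluster rather than a local Spark. A minimal sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sum of 0..99, executed on the Databricks cluster.
total = spark.range(100).agg(F.sum("id")).collect()[0][0]
print(total)  # 4950, matching the Scala test above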


Configuring PyCharm 



  • Create New Project → give it a name (dbconnectabizer)
  • Specify the interpreter → File → Preferences for New Projects → expand your user folder, select python3.7 → OK → Create

- Also install the databricks-connect package for this project interpreter and click OK.



  • Select your project → New → Python File
    • Create dbctest (this creates a .py file)
    • Run → Edit Configurations
      • Click the “+” icon (top left) → Python → Script path (point it to the .py file created earlier) → Open
      • Add Environment variables → Add new → PYSPARK_PYTHON = python3 → Apply & OK




      • Apply
    • Type your code (something to execute from PyCharm against Databricks) → Run

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# getOrCreate() picks up the databricks-connect configuration and runs against
# the remote cluster instead of a local Spark.
spark = SparkSession.builder.getOrCreate()

# The songs dataset is tab-separated with no header, so columns come in as _c0, _c1, ...
song_df = spark.read \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .csv("/databricks-datasets/songs/data-001/part-0000*")

tempo_df = song_df.select(
    col("_c4").alias("artist_name"),
    col("_c14").alias("tempo"),
)

avg_tempo_df = tempo_df \
    .groupBy("artist_name") \
    .avg("tempo") \
    .orderBy("avg(tempo)", ascending=False)

print("Calling show command which will trigger Spark processing")
avg_tempo_df.show(truncate=False)

    • You can see it’s executing against the cluster
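The dbutils.fs listing from Step 6 can also be reproduced from the same PyCharm script. This is only a sketch based on the pyspark.dbutils module that ships with databricks-connect; depending on the client version the DBUtils constructor expects either the SparkSession or the SparkContext:

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils  # bundled with databricks-connect

spark = SparkSession.builder.getOrCreate()

# Assumption: this client version accepts the SparkSession; older clients may
# expect spark.sparkContext instead.
dbutils = DBUtils(spark)
for f in dbutils.fs.ls("dbfs:/databricks-datasets/songs/data-001/")[:5]:
    print(f.path)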



Bingo !!! We can now execute Spark code from PyCharm against Databricks and view the results in stdout.

