Sunday, May 27, 2018

Upgrading python version for Databricks Notebook


In my last blog post I showed how to use init scripts to install custom packages by creating a bash script in a sub-directory of the init scripts directory named after the cluster. For example, to specify init scripts for a cluster named testabizer-python3.6, create the directory dbfs:/databricks/init/testabizer-python3.6 and put all shell scripts that should run on cluster testabizer-python3.6 in that directory.

http://abizeradenwala.blogspot.com/2018/05/upgrade-python-version-before-cluster.html

This works in most cases, but sometimes a Databricks Notebook has to use the new version of a package/library, and since some paths are not set and the cluster/containers have already started, the Notebook may still pick up the older version.

In this blog I will cover how to upgrade to Python 3.6 and make sure the Databricks Notebook uses it as well.

Step 1

Download the Anaconda Python distribution from https://www.continuum.io/downloads:

%sh curl -o /dbfs/tmp/Anaconda3-5.1.0-Linux-x86_64.sh https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
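Before running the installer it is worth sanity-checking that the download actually landed in DBFS. The helper below is a minimal sketch; the `installer_ready` name and the 100 MB size threshold are my own assumptions (the full installer is several hundred MB), not part of the original post:

```python
import os

def installer_ready(path, min_bytes=100 * 1024 * 1024):
    """Return True if the installer exists at path and looks complete.

    min_bytes is a rough lower bound to catch truncated downloads.
    """
    return os.path.exists(path) and os.path.getsize(path) >= min_bytes

# In a notebook cell you could then check the FUSE-mounted DBFS path:
# installer_ready("/dbfs/tmp/Anaconda3-5.1.0-Linux-x86_64.sh")
```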

Step 2

We want to make sure all of the packages Databricks includes by default in the Databricks Runtime and PySpark are replicated in Anaconda, so create a list of the installed Python packages and save it to a DBFS location:
%sh  /databricks/python/bin/pip freeze > /tmp/python_packages.txt
%fs cp file:/tmp/python_packages.txt /tmp/python_packages.txt
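The file produced by `pip freeze` is just `name==version` lines, which is what the init script's `pip install -r` consumes later. If you want to inspect it from a notebook, a small parser like the sketch below works; `parse_freeze` is a hypothetical helper, not a Databricks API:

```python
def parse_freeze(text):
    """Parse pip freeze output (name==version per line) into a dict."""
    pkgs = {}
    for line in text.strip().splitlines():
        if "==" in line:
            name, version = line.split("==", 1)
            pkgs[name] = version
    return pkgs

# Example with two typical frozen entries:
sample = "numpy==1.14.0\npandas==0.22.0"
print(parse_freeze(sample))
```

In a notebook you could feed it the saved file, e.g. `parse_freeze(open("/dbfs/tmp/python_packages.txt").read())`.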

Step 3 

Use an init script to change the default Python distribution.

dbutils.fs.put("dbfs:/databricks/init/testabizer-python3.6/install_conda1.sh", """#!/bin/bash
sudo bash /dbfs/tmp/Anaconda3-5.1.0-Linux-x86_64.sh -b -p /anaconda3
mv /databricks/python /databricks/python_old
ln -s /anaconda3 /databricks/python
/databricks/python/bin/pip install -r /dbfs/tmp/python_packages.txt
""")

After the cluster restarts, verify that Python was upgraded and is available from the Notebook as needed.
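A quick way to verify from a notebook cell is to print the interpreter version and assert on it; this is a minimal check, and the assertion message is my own wording:

```python
import sys

# The notebook kernel should now be running the Anaconda 3.6 interpreter
print(sys.version)
assert sys.version_info[:2] >= (3, 6), "Notebook is still using the old Python"
```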


Bingo!!! Now you can use Python 3.6.
