Sunday, November 15, 2020

Kubernetes 101

                                                                 Kubernetes 101 


Architecture :


Master :

- The master node runs 4 processes: API server (the entry point for clients, e.g. the kubectl command line), Scheduler (decides which worker node a pod will be scheduled on), Controller Manager (detects state changes of pods and recovers them) and etcd (the cluster state, kept as a key-value store)



Worker Nodes (Multiple):

- Application pods run here; this is where the actual work is done.

- Every worker node runs 3 processes: Container Runtime, Kubelet (starts and tracks pods on the local node) and Kube-Proxy (handles networking/forwarding between pods and services)


 

Resources in K8S :


Pod - the smallest unit of K8s; each pod gets its own IP (not static across restarts)

Service - a permanent IP address in front of pods (the service IP stays the same even when the pod behind it is recreated)

   Internal Service - reached as Hostname:Port inside the cluster (not exposed externally); usually of type ClusterIP

   External Service - exposed outside the cluster, e.g. a LoadBalancer service

Ingress - the entry point that receives traffic from the external world and routes it to internal services ( https://your-app.com )




ConfigMap - keeps configuration outside the application, so config changes don't require rebuilding the image.

Secrets - stores credentials (passwords/certificates), base64-encoded.



Volume - local or external storage used to persist data across pod restarts.

Deployments - the blueprint for most (stateless) application deployments; an abstraction layer above pods.

StatefulSet - for database apps, to make sure writes are synchronized and data is not corrupted. (A small client-side sketch follows below.)
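As a small illustration of the Pod vs. Service distinction above (pod IPs change across restarts, service cluster IPs stay stable), here is a sketch using the official Kubernetes Python client. The client install (pip install kubernetes) and a working kubeconfig are assumptions, not part of the original notes.

# Sketch: list pods (ephemeral IPs) and services (stable cluster IPs).
# Assumes the official client is installed: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()      # reads ~/.kube/config for cluster credentials
v1 = client.CoreV1Api()        # talks to the API server on the master node

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    print("pod", pod.metadata.namespace, pod.metadata.name, pod.status.pod_ip)

for svc in v1.list_service_for_all_namespaces(watch=False).items:
    print("svc", svc.metadata.namespace, svc.metadata.name, svc.spec.cluster_ip)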





Saturday, August 8, 2020

Writing Functions and Importing Custom/System Functions

 

                    Writing Functions and Importing Custom/System Functions


# writing custom function to find index 

print('Imported testp...')


def find_index(to_search, target):
    '''Find the index of a value in a sequence'''
    for i, value in enumerate(to_search):
        if value == target:
            return i

    return -1



# Import the sys module and print the module search paths (sys.path).

import sys
print(sys.path)

# Import the os module and print the location of the file where the os module is defined.

import os
print(os.__file__)
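Assuming the function above is saved as testp.py (matching the 'Imported testp...' print statement), it can be imported and called from another script. A minimal sketch:

# main.py - assumes testp.py (containing find_index) sits in the same
# directory or somewhere on sys.path
import testp

courses = ['History', 'Math', 'Physics', 'CompSci']

print(testp.find_index(courses, 'Math'))   # 1
print(testp.find_index(courses, 'Art'))    # -1 (not found)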

Comparisons and Loops with break/Continue


# Comparisons:


# Equal: ==
# Not Equal: !=
# Greater Than: >
# Less Than: <
# Greater or Equal: >=
# Less or Equal: <=
# Object Identity: is
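A quick sketch (with made-up lists) showing the difference between value equality (==) and object identity (is):

a = [1, 2, 3]
b = [1, 2, 3]
c = a

print(a == b)   # True  - same values
print(a is b)   # False - two distinct list objects
print(a is c)   # True  - both names point to the same object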



Break statement :


nums = [1, 2, 3, 4, 5]

for num in nums:
    if num == 2:
        # Break out of the for loop when num equals 2, so the output is only 1
        break
    print(num)



Continue statement :

nums = [1, 2, 3, 4, 5]

for num in nums:
    if num == 2:
        # Skip printing when num equals 2, so the output is 1, 3, 4, 5
        continue
    print(num)



Tuesday, August 4, 2020

List, Tuples, Set and Dictionary

                                          List, Tuples, Set and Dictionary


# Empty Lists - Mutable
empty_list = []
empty_list = list()

# Empty Tuples - Immutable
empty_tuple = ()
empty_tuple = tuple()

# Empty Sets - throw away duplicates or membership test
empty_set = {}  # This isn't right! It's a dict, not a set
empty_set = set()

# Empty Dict - Key Value pairs
empty_dict = {}
empty_dict = dict()
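A quick way to confirm the {} gotcha is to check the types directly:

print(type(()))      # <class 'tuple'>
print(type({}))      # <class 'dict'> - not a set!
print(type(set()))   # <class 'set'>
print(type([]))      # <class 'list'>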



Examples
list_1 = ['History', 'Math', 'Physics', 'CompSci']
print(list_1)
# Update list
list_1[0] = 'Art'


# Tuple
tuple_1 = ('History', 'Math', 'Physics', 'CompSci')

print(tuple_1)
# Sets
cs_courses = {'History', 'Math', 'Physics', 'CompSci'}
print(cs_courses)
# Dictionary
student = {'name': 'John', 'age': 25, 'courses': ['Math', 'CompSci']}

for key, value in student.items():
    print(key, value)
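To round this off, a short sketch (the values are just illustrative) of how mutability differs between these types:

# Lists are mutable - item assignment works
list_1 = ['History', 'Math', 'Physics', 'CompSci']
list_1[0] = 'Art'

# Tuples are immutable - item assignment raises a TypeError
tuple_1 = ('History', 'Math', 'Physics', 'CompSci')
try:
    tuple_1[0] = 'Art'
except TypeError as err:
    print(err)                 # 'tuple' object does not support item assignment

# Sets throw away duplicates and give fast membership tests
cs_set = {'Math', 'Physics', 'Math'}
print(cs_set)                  # {'Math', 'Physics'} (order not guaranteed)
print('Math' in cs_set)        # True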

Sunday, August 2, 2020

Python String formatting


                                       Python String formatting

                                             

  Strings in Python can be formatted in several different ways, as shown below.


Declaring variables :


greeting = 'Hello'

name = 'Abizer'

morninggreeting = 'Morning !'



Concatenating strings (the default approach, easy when only a few strings need to be concatenated):


message =  greeting + ", " + name + "." + morninggreeting

Using str.format() makes the same thing more readable:



message =  '{}, {}.{}'.format(greeting, name, morninggreeting)


If you are using Python 3.6+, you can use f-strings to achieve the same thing:


message = f'{greeting}, {name}.{morninggreeting}'





Finally print the output.


print(message)

Output :

Hello, Abizer.Morning !
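f-strings (and str.format()) also accept format specifiers after a colon inside the braces. A small sketch with made-up values:

pi = 3.14159
count = 7

print(f'pi rounded to 2 places: {pi:.2f}')       # pi rounded to 2 places: 3.14
print(f'count zero-padded: {count:03d}')         # count zero-padded: 007
print('same with format(): {:.2f}'.format(pi))   # same with format(): 3.14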


Wednesday, June 17, 2020

Git Basics and commands Part 1


                                 Git basics: add, delete, modify and commit 


First, create a directory containing the files you are working on and will eventually sync with a Git repo. All commands below are run on the local Mac until the final modifications are ready to be committed to the repo.

1) Initialize Git to track files under that directory.

C02WG59KHTD5:GitProjects abizeradenwala$ git init
Initialized empty Git repository in /Users/abizeradenwala/Desktop/Learning/Git/GitProjects/.git/
The .git directory indicates that this directory is being tracked by Git.
C02WG59KHTD5:GitProjects abizeradenwala$ ls -lta
total 8
drwxr-xr-x  9 abizeradenwala  staff  288 Jun 13 15:44 .git
drwxr-xr-x  5 abizeradenwala  staff  160 Jun 13 15:44 .
-rw-r--r--  1 abizeradenwala  staff    0 Jun 13 15:43 Git_cheatsheet
-rw-r--r--  1 abizeradenwala  staff   13 Jun 13 15:43 git_test
drwxr-xr-x  5 abizeradenwala  staff  160 Jun 13 15:41 ..


2) Make an update to the "git_test" file and add the change to staging before it can be committed.

C02WG59KHTD5:GitProjects abizeradenwala$ git add git_test
C02WG59KHTD5:GitProjects abizeradenwala$ git status
On branch master
No commits yet
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
new file:   git_test
Untracked files:
  (use "git add <file>..." to include in what will be committed)
Git_cheatsheet
Add everything in the present directory to staging.

git add .
C02WG59KHTD5:GitProjects abizeradenwala$ git status
On branch master
No commits yet
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
new file:   Git_cheatsheet
new file:   git_test

3) Finally, commit the changes.

C02WG59KHTD5:GitProjects abizeradenwala$ git commit -m "revision 1"
[master (root-commit) 493ff9a] revision 1
 2 files changed, 1 insertion(+)
 create mode 100644 Git_cheatsheet
 create mode 100644 git_test
C02WG59KHTD5:GitProjects abizeradenwala$ git status
On branch master
nothing to commit, working tree clean
Check all the past commit logs.
C02WG59KHTD5:GitProjects abizeradenwala$ git log 
commit 493ff9a38d285fb9dc0582c92091aaf1369526d2 (HEAD -> master)
Author: abizeradenwala <abizer.adenwala@databricks.com>
Date:   Sat Jun 13 15:47:17 2020 -0500
    revision 1
C02WG59KHTD5:GitProjects abizeradenwala$
4) Modify the file and discard the changes.
C02WG59KHTD5:GitProjects abizeradenwala$ echo "Appending some data" >> git_test
C02WG59KHTD5:GitProjects abizeradenwala$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
modified:   git_test
no changes added to commit (use "git add" and/or "git commit -a")
Discard changes
C02WG59KHTD5:GitProjects abizeradenwala$ cat git_test
this is test
Appending some data
C02WG59KHTD5:GitProjects abizeradenwala$ git checkout git_test
C02WG59KHTD5:GitProjects abizeradenwala$ cat git_test
this is test
5) Delete a file tracked by Git and then retrieve it back.
C02WG59KHTD5:GitProjects abizeradenwala$ rm git_test
C02WG59KHTD5:GitProjects abizeradenwala$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
modified:   git_test
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
deleted:    git_test
C02WG59KHTD5:GitProjects abizeradenwala$ git checkout git_test
C02WG59KHTD5:GitProjects abizeradenwala$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
modified:   git_test
C02WG59KHTD5:GitProjects abizeradenwala$ ls
Git_cheatsheet git_test
C02WG59KHTD5:GitProjects abizeradenwala$
                                Or

C02WG59KHTD5:GitProjects abizeradenwala$ git rm git_test
rm 'git_test'
C02WG59KHTD5:GitProjects abizeradenwala$ git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
deleted:    git_test
C02WG59KHTD5:GitProjects abizeradenwala$ ls
Git_cheatsheet
C02WG59KHTD5:GitProjects abizeradenwala$ 
Undo the delete:


C02WG59KHTD5:GitProjects abizeradenwala$ git reset HEAD git_test 
Unstaged changes after reset:
D git_test
C02WG59KHTD5:GitProjects abizeradenwala$ git checkout -- git_test 
C02WG59KHTD5:GitProjects abizeradenwala$ git status
On branch master
nothing to commit, working tree clean
C02WG59KHTD5:GitProjects abizeradenwala$ ls
Git_cheatsheet git_test
C02WG59KHTD5:GitProjects abizeradenwala$ 
6) Make multiple commits and track them.


C02WG59KHTD5:GitProjects abizeradenwala$ echo "Appending data 2st time" >> git_test C02WG59KHTD5:GitProjects abizeradenwala$ cat git_test this is testAppending data 1st timeAppending data 2st timeC02WG59KHTD5:GitProjects abizeradenwala$ git add git_test C02WG59KHTD5:GitProjects abizeradenwala$  git commit -m "revision 3"[master f096b1b] revision 3 1 file changed, 1 insertion(+)C02WG59KHTD5:GitProjects abizeradenwala$ git statusOn branch masternothing to commit, working tree cleanC02WG59KHTD5:GitProjects abizeradenwala$ git logcommit f096b1ba79695110f36693e048d20c7164893eed (HEAD -> master)Author: abizeradenwala <abizer.adenwala@databricks.com>Date:   Wed Jun 17 15:41:52 2020 -0500
    revision 3
commit e2c4c00104c751f8de592d28c4dac0805762fa0d
Author: abizeradenwala <abizer.adenwala@databricks.com>
Date:   Wed Jun 17 15:40:21 2020 -0500
    revision 2
commit 493ff9a38d285fb9dc0582c92091aaf1369526d2
Author: abizeradenwala <abizer.adenwala@databricks.com>
Date:   Sat Jun 13 15:47:17 2020 -0500
    revision 1
C02WG59KHTD5:GitProjects abizeradenwala$ 
Roll back to the first committed version:

C02WG59KHTD5:GitProjects abizeradenwala$ cat git_test
this is test
Appending data 1st time
Appending data 2st time
C02WG59KHTD5:GitProjects abizeradenwala$ git checkout 493ff9a38d285fb9dc0582c92091aaf1369526d2
Note: checking out '493ff9a38d285fb9dc0582c92091aaf1369526d2'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:
  git checkout -b <new-branch-name>
HEAD is now at 493ff9a revision 1
C02WG59KHTD5:GitProjects abizeradenwala$ cat git_test 
this is test
C02WG59KHTD5:GitProjects abizeradenwala$ git status
HEAD detached at 493ff9a
nothing to commit, working tree clean
C02WG59KHTD5:GitProjects abizeradenwala$ 













Saturday, March 21, 2020

Connecting to Databricks from PyCharm (On Mac) using Databricks Connect


Databricks Connect allows you to connect your favorite IDE (IntelliJ, Eclipse, PyCharm, RStudio, Visual Studio), notebook server (Zeppelin, Jupyter), and other custom applications to Azure Databricks clusters and run Apache Spark code.


I will walk you through the steps to connect PyCharm installed on a MacBook to Databricks clusters, run jobs against them, and get the results back in PyCharm's stdout.

Note: Here I will be connecting to a cluster with Databricks Runtime 6.3 and Python 3.7. It is assumed you already have PyCharm and Python 3.7 set up on your Mac.




      Step 1 : Install the client

  1. Uninstall PySpark if it is installed (in my case it was not installed).

    C02WG59KHTD5:Downloads abizeradenwala$ pip uninstall pyspark
    Skipping pyspark as it is not installed.
    C02WG59KHTD5:Downloads abizeradenwala$

    Step 2 : Install the Databricks Connect client (I had an older client, which was removed automatically)

C02WG59KHTD5:Downloads abizeradenwala$ /Library/Frameworks/Python.framework/Versions/3.7/bin/pip3 install -U databricks-connect==6.3.*
Collecting databricks-connect==6.3.*
  Downloading https://files.pythonhosted.org/packages/fd/b4/3a1a1e45f24bde2a2986bb6e8096d545a5b24374f2cfe2b36ac5c7f30f4b/databricks-connect-6.3.1.tar.gz (246.4MB)
    100% |████████████████████████████████| 246.4MB 174kB/s 
Requirement already satisfied, skipping upgrade: py4j==0.10.7 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from databricks-connect==6.3.*) (0.10.7)
Requirement already satisfied, skipping upgrade: six in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from databricks-connect==6.3.*) (1.14.0)
Installing collected packages: databricks-connect
  Found existing installation: databricks-connect 5.5.3
    Uninstalling databricks-connect-5.5.3:
      Successfully uninstalled databricks-connect-5.5.3
  Running setup.py install for databricks-connect ... done
Successfully installed databricks-connect-6.3.1
C02WG59KHTD5:Downloads abizeradenwala$ 


        Step 3 : Gather connection properties

- Azure workspace URL (contains the org ID)
- Personal access token (PAT)
- Cluster ID

       Step 4 : Configure the connection. You will be prompted interactively for the details collected above.

C02WG59KHTD5:Downloads abizeradenwala$ databricks-connect configure
Copyright (2018) Databricks, Inc.

...
...

Databricks Platform Services: the Databricks services or the Databricks
Community Edition services, according to where the Software is used.

Licensee: the user of the Software, or, if the Software is being used on
behalf of a company, the company.

Do you accept the above agreement? [y/N] y
Set new config values (leave input empty to accept default):
Databricks Host [no current value, must start with https://]: https://westus2.azuredatabricks.net
Databricks Token [no current value]: XYZZZZZZZZZZZ

IMPORTANT: please ensure that your cluster has:
- Databricks Runtime version of DBR 5.1+
- Python version same as your local Python (i.e., 2.7 or 3.5)
- the Spark conf `spark.databricks.service.server.enabled true` set

Cluster ID (e.g., 0921-001415-jelly628) [no current value]: 0317-213025-tarry631
Org ID (Azure-only, see ?o=orgId in URL) [0]: 6935536957980197
Port [15001]: 

Updated configuration in /Users/abizeradenwala/.databricks-connect
* Spark jar dir: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark/jars
* Spark home: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark
* Run `pip install -U databricks-connect` to install updates
* Run `pyspark` to launch a Python shell
* Run `spark-shell` to launch a Scala shell
* Run `databricks-connect test` to test connectivity

Databricks Connect User Survey: https://forms.gle/V2indnHHfrjGWyQ4A

C02WG59KHTD5:Downloads abizeradenwala$

          Step 5 : Set SPARK_HOME by running the command below on the command line, or save it in ~/.bash_profile

export SPARK_HOME=/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark

         Step 6 : Test connectivity to Azure Databricks.

C02WG59KHTD5:bin abizeradenwala$ databricks-connect test
* PySpark is installed at /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark
* Checking SPARK_HOME
* Checking java version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
* Testing scala command
20/03/21 00:55:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/21 00:55:45 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5-SNAPSHOT
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(100).reduce(_ + _)
Spark context Web UI available at http://c02wg59khtd5.attlocal.net:4040
Spark context available as 'sc' (master = local[*], app id = local-1584770145654).
Spark session available as 'spark'.
View job details at https://westus2.azuredatabricks.net/?o=6935536957980197#/setting/clusters/0317-213025-tarry631/sparkUi
View job details at https://westus2.azuredatabricks.net/?o=6935536957980197#/setting/clusters/0317-213025-tarry631/sparkUi
res0: Long = 4950

scala> :quit

* Testing python command
20/03/21 00:56:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/21 00:56:07 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
View job details at https://westus2.azuredatabricks.net/?o=6935536957980197#/setting/clusters/0317-213025-tarry631/sparkUi
[Stage 6:>                                                          (0 + 4) / 8]
* Testing dbutils.fs
[FileInfo(path='dbfs:/FileStore/', name='FileStore/', size=0), FileInfo(path='dbfs:/Knox/', name='Knox/', size=0), FileInfo(path='dbfs:/PradeepKumar/', name='PradeepKumar/', size=0), FileInfo(path='dbfs:/Users/', name='Users/', size=0), FileInfo(path='dbfs:/abc.sh', name='abc.sh', size=20), FileInfo(path='dbfs:/bank-full.csv', name='bank-full.csv', size=4610348), FileInfo(path='dbfs:/bogdan/', name='bogdan/', size=0), FileInfo(path='dbfs:/checkpoint/', name='checkpoint/', size=0), FileInfo(path='dbfs:/cluster-logs/', name='cluster-logs/', size=0), FileInfo(path='dbfs:/databricks/', name='databricks/', size=0), FileInfo(path='dbfs:/databricks-datasets/', name='databricks-datasets/', size=0), FileInfo(path='dbfs:/databricks-results/', name='databricks-results/', size=0), FileInfo(path='dbfs:/dbfs/', name='dbfs/', size=0), FileInfo(path='dbfs:/delta/', name='delta/', size=0), FileInfo(path='dbfs:/foobar', name='foobar', size=876), FileInfo(path='dbfs:/gauarav/', name='gauarav/', size=0), FileInfo(path='dbfs:/gaurav_poc/', name='gaurav_poc/', size=0), FileInfo(path='dbfs:/gaurav_rupnar/', name='gaurav_rupnar/', size=0), FileInfo(path='dbfs:/glm_data.csv', name='glm_data.csv', size=9836), FileInfo(path='dbfs:/glm_model/', name='glm_model/', size=0), FileInfo(path='dbfs:/jordan/', name='jordan/', size=0), FileInfo(path='dbfs:/jose/', name='jose/', size=0), FileInfo(path='dbfs:/jose.gonzalezmunoz@databricks.com/', name='jose.gonzalezmunoz@databricks.com/', size=0), FileInfo(path='dbfs:/knox/', name='knox/', size=0), FileInfo(path='dbfs:/local_disk0/', name='local_disk0/', size=0), FileInfo(path='dbfs:/matt/', name='matt/', size=0), FileInfo(path='dbfs:/ml/', name='ml/', size=0), FileInfo(path='dbfs:/mlflow/', name='mlflow/', size=0), FileInfo(path='dbfs:/mnt/', name='mnt/', size=0), FileInfo(path='dbfs:/piyushmnt/', name='piyushmnt/', size=0), FileInfo(path='dbfs:/pradeepkumar/', name='pradeepkumar/', size=0), FileInfo(path='dbfs:/rdd1-1562366996207/', name='rdd1-1562366996207/', size=0), FileInfo(path='dbfs:/scripts/', name='scripts/', size=0), FileInfo(path='dbfs:/takeshi/', name='takeshi/', size=0), FileInfo(path='dbfs:/te', name='te', size=36), FileInfo(path='dbfs:/test/', name='test/', size=0), FileInfo(path='dbfs:/test1/', name='test1/', size=0), FileInfo(path='dbfs:/testing/', name='testing/', size=0), FileInfo(path='dbfs:/testing1/', name='testing1/', size=0), FileInfo(path='dbfs:/testing2', name='testing2', size=3717), FileInfo(path='dbfs:/tmp/', name='tmp/', size=0), FileInfo(path='dbfs:/tmp1/', name='tmp1/', size=0), FileInfo(path='dbfs:/user/', name='user/', size=0), FileInfo(path='dbfs:/xin/', name='xin/', size=0), FileInfo(path='dbfs:/xyz.sh', name='xyz.sh', size=20), FileInfo(path='dbfs:/{workingDir}/', name='{workingDir}/', size=0)]

* All tests passed.

C02WG59KHTD5:bin abizeradenwala$


This confirms the Mac can connect to the Databricks cluster remotely.  
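Before wiring up PyCharm, you can also sanity-check the configuration from a plain python3 shell. A minimal sketch (the range/count job is just an arbitrary example, not from the original post):

from pyspark.sql import SparkSession

# With databricks-connect configured, getOrCreate() returns a session that
# submits work to the remote Databricks cluster instead of a local Spark.
spark = SparkSession.builder.getOrCreate()

print(spark.range(100).count())   # expected output: 100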


Configuring PyCharm 



  • Create New Project → give it a name (dbconnectabizer)
  • Specify the interpreter → File → Preferences for New Projects → expand your user folder and pick python3.7 → OK → Create

- Also install the databricks-connect package for the project interpreter and click OK.



  • Select your project → New → Python File
    • Create dbctest (will create a .py file)
    • Run → Edit Configurations 
      • Click “+” icon (top left) → Python → Script path (.py file created earlier) → Open 
      • Add Environment variables → Add new → PYSPARK_PYTHON, python3 → Apply & OK




      • Apply
    • Type your code (execute something from PyCharm to Databricks) → Run

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql.functions import col

song_df = spark.read \
    .option('sep','\t') \
    .option("inferSchema","true") \
    .csv("/databricks-datasets/songs/data-001/part-0000*")

tempo_df = song_df.select(
                    col('_c4').alias('artist_name'),
                    col('_c14').alias('tempo'),
                   )

avg_tempo_df = tempo_df \
    .groupBy('artist_name') \
    .avg('tempo') \
    .orderBy('avg(tempo)',ascending=False)

print("Calling show command which will trigger Spark processing")
avg_tempo_df.show(truncate=False)

    • You can see it’s executing against the cluster



Bingo!!! We can now execute Spark code from PyCharm against Databricks and view the results in stdout.