Sunday, May 27, 2018

Upgrading python version for Databricks Notebook


In my last post I showed how to use init scripts to install custom packages: you create a bash script in a sub-directory of the init scripts directory that has the same name as the cluster. For example, to specify init scripts for the cluster named testabizer-python3.6, create the directory dbfs:/databricks/init/testabizer-python3.6 and put all shell scripts that should run on cluster testabizer-python3.6 in that directory.

http://abizeradenwala.blogspot.com/2018/05/upgrade-python-version-before-cluster.html

This works in most cases, but sometimes a Databricks Notebook has to use the new version of a package or library. Because some paths are not set and the cluster containers have already started, the Notebook might still pick up the older version.

In this post I describe how to upgrade to Python 3.6 and make sure the Databricks Notebook uses it as well.

Step 1

Download the Anaconda Python distribution from https://www.continuum.io/downloads:

%sh curl -o /dbfs/tmp/Anaconda3-5.1.0-Linux-x86_64.sh https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh

Step 2

We want every package that Databricks and PySpark include by default to be replicated in Anaconda, so create a list of the installed Python packages and save it to a DBFS location:
%sh  /databricks/python/bin/pip freeze > /tmp/python_packages.txt
%fs cp file:/tmp/python_packages.txt /tmp/python_packages.txt
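
Once the new distribution is in place, the saved list can be compared against a fresh `pip freeze` to spot packages that did not carry over. The following is only a minimal sketch: the helper names `parse_freeze` and `missing_packages` and the sample package lists are hypothetical, and only simple `name==version` lines are handled.

```python
def parse_freeze(text):
    """Map lower-cased package name to pinned version for 'name==version' lines."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, option lines (e.g. "-e ..."), and non-pinned entries.
        if not line or line.startswith(("#", "-")) or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.strip().lower()] = version.strip()
    return packages


def missing_packages(expected_text, installed_text):
    """Return sorted names that are in the saved list but not in the new environment."""
    expected = parse_freeze(expected_text)
    installed = parse_freeze(installed_text)
    return sorted(set(expected) - set(installed))


# Hypothetical sample data standing in for the two `pip freeze` outputs.
saved = "numpy==1.14.0\npandas==0.22.0\nsix==1.11.0\n"
current = "numpy==1.14.2\nsix==1.11.0\n"
print(missing_packages(saved, current))  # prints ['pandas']
```

In practice the first argument would be the contents of the saved python_packages.txt and the second the output of `pip freeze` from the new interpreter.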

Step 3 

Use an init script to change the default Python distribution.

dbutils.fs.put("dbfs:/databricks/init/testabizer-python3.6/install_conda1.sh", """#!/bin/bash
sudo bash /dbfs/tmp/Anaconda3-5.1.0-Linux-x86_64.sh -b -p /anaconda3
mv /databricks/python /databricks/python_old
ln -s /anaconda3 /databricks/python
/databricks/python/bin/pip install -r /dbfs/tmp/python_packages.txt
""")

After the cluster restarts, verify that Python was upgraded and is available from the Notebook.
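
One way to do this check from a notebook cell is to inspect the interpreter the notebook is actually running. This is a minimal sketch rather than part of the original steps; the helper name `check_python` and the (3, 6) minimum are assumptions for illustration.

```python
import sys


def check_python(minimum=(3, 6)):
    """Return (ok, executable, version) for the interpreter running this cell."""
    version = tuple(sys.version_info[:3])
    # Compare only major.minor against the required minimum.
    return version[:2] >= tuple(minimum), sys.executable, version


ok, executable, version = check_python()
print("interpreter:", executable)
print("version:", ".".join(str(part) for part in version))
print("meets minimum:", ok)
```

If the init script ran as intended, the interpreter path should sit under /databricks/python, which is now a symlink to /anaconda3.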


Bingo! Now you can use Python 3.6.

Tuesday, May 15, 2018

Setting up and accessing Public/Private(NAT G/W) EC2 instances

In the AWS console, look for the EC2 service and click it to open the EC2 dashboard. The figure below shows the EC2 dashboard.



Public Instance :


Click Launch Instance, choose an AMI and instance type, and then configure the instance into the VPC we just created. Under Subnet, select the public subnet for one instance and the private subnet for the other. All other fields can keep their defaults.




In tab 6 (Configure Security Group), we create a new security group and allow traffic on specific ports.


Finally, review all the details and click Launch once everything looks accurate.


Once the public instance comes up you will see it listed as below, showing its public IP, the key that can be used to access the instance, and all the other details.





From my desktop I can reach the EC2 instance I just spun up.


Desktop abizeradenwala$ ssh -i Abizerpem.pem ec2-user@54.213.230.44
Last login: Wed May  9 06:02:43 2018 from 50.225.159.163
[ec2-user@ip-10-0-1-13 ~]$ sudo su -
[root@ip-10-0-1-13 ~]# uptime

 18:17:34 up 6 days, 23:05,  1 user,  load average: 0.00, 0.01, 0.05
[root@ip-10-0-1-13 ~]#


Private Instance :

A similar process is followed for the private instance, but once it comes up it has no public IP. This is because auto-assignment of public IPs was disabled for this subnet.


Also, this instance is not ping-able and cannot be reached from the desktop or from other private instances.

So the question is how a private instance can send and receive requests. The answer is a NAT instance or a NAT gateway.


The rest of this post walks through creating a NAT gateway inside the public subnet and updating the route tables so that the private EC2 instance's outbound traffic goes through the NAT gateway.
Step 1: Create a NAT gateway and associate a route table with the gateway subnet
1) Go to the existing subnets and copy the public subnet ID ("subnet-e8d55e91" in our case).
2) Create the NAT gateway.

3) Once the NAT gateway is available, its status turns green.


4) Now select the Routes tab for the newly created route table and add another route.

Enter 0.0.0.0/0 in Destination, choose AbizerVPC’s NAT gateway (starts with “nat-xxxxxx”) from Target, and then click Save.



Let's make sure the added NAT route becomes Active.

5) Go to the Subnet Associations tab, click Edit, check the private subnet, and then click Save. Internet-bound traffic from that subnet is now routed through the NAT gateway, so its instances get outbound internet access without having public IPs.
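
To see how the new route is chosen, here is a rough sketch of most-specific-prefix route selection using Python's standard `ipaddress` module. The route table contents are assumptions: a VPC-local route for 10.0.0.0/16 (the AbizerVPC CIDR) plus the default route from step 4, with "nat-xxxxxx" kept as the placeholder used above. The helper name `pick_target` is hypothetical.

```python
import ipaddress

# Assumed route table for the private subnet: VPC-local route plus the
# default route added in step 4 ("nat-xxxxxx" is a placeholder ID).
PRIVATE_ROUTES = [
    ("10.0.0.0/16", "local"),
    ("0.0.0.0/0", "nat-xxxxxx"),
]


def pick_target(routes, destination_ip):
    """Return the target of the most specific route containing destination_ip."""
    address = ipaddress.ip_address(destination_ip)
    best = None
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if address in network and (best is None or network.prefixlen > best[0]):
            best = (network.prefixlen, target)
    return best[1] if best else None


print(pick_target(PRIVATE_ROUTES, "10.0.1.13"))  # prints local
print(pick_target(PRIVATE_ROUTES, "8.8.8.8"))    # prints nat-xxxxxx
```

Traffic to addresses inside the VPC stays on the local route, while everything else falls through to the default route and leaves via the NAT gateway.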



Yay! The private instance can now be reached from the public instance, and it has an outbound path through the NAT gateway.

[root@ip-10-0-1-13 ~]# ping 10.0.2.77
PING 10.0.2.77 (10.0.2.77) 56(84) bytes of data.
64 bytes from 10.0.2.77: icmp_seq=1 ttl=64 time=0.964 ms
64 bytes from 10.0.2.77: icmp_seq=2 ttl=64 time=0.845 ms
^C
--- 10.0.2.77 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.845/0.904/0.964/0.066 ms



[root@ip-10-0-1-13 tmp]# ssh -i Abizerpem.pem ec2-user@10.0.2.77
Last login: Tue May 15 18:42:07 2018 from ip-10-0-1-13.us-west-2.compute.internal
[ec2-user@ip-10-0-2-77 ~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9001
        inet 10.0.2.77  netmask 255.255.255.0  broadcast 10.0.2.255
        inet6 fe80::4a8:57ff:fe41:915a  prefixlen 64  scopeid 0x20<link>
        ether 06:a8:57:41:91:5a  txqueuelen 1000  (Ethernet)
        RX packets 3598  bytes 200871 (196.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3935  bytes 226415 (221.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1  (Local Loopback)
        RX packets 64  bytes 5920 (5.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 64  bytes 5920 (5.7 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[ec2-user@ip-10-0-2-77 ~]$ 


We have reached our end goal: a VPC with two subnets (public and private), both of which can reach the outside world, via the IGW and the NAT gateway respectively.



















Monday, May 14, 2018

VPC Peering across accounts

                                           



A VPC peering connection is a networking connection between two VPCs that enables you to route traffic between them using private IPv4 addresses or IPv6 addresses. Instances in either VPC can communicate with each other as if they were within the same network. You can create a VPC peering connection between your own VPCs, or with a VPC in another AWS account (in the same AWS Region). The following diagram illustrates the components involved in peering your Databricks deployment (Account A) with your other AWS infrastructure residing in another account (Account B).



For example, Databricks is deployed in one AWS account and the RDS database (or another instance) is deployed in another. A peering connection is established to link the two VPCs across both AWS accounts. This lets the EC2 instances talk to the RDS database or instance without going over the internet.


As we move through this process it helps to keep a table of information to refer back to.

  1. ID and CIDR range of your AbizerVPC (VPC A).
  2. ID and CIDR range of your other infrastructure, AbizerDatabricksVPC (VPC B).
  3. ID of the main route table of your AbizerVPC (rtb-ec023794).

AWS Service    Name                   ID              CIDR Range
VPC            AbizerVPC              vpc-29e2ca50    10.0.0.0/16
VPC            AbizerDatabricksVPC    vpc-bba395c2    172.78.0.0/16
Route Table    Custtestmainroute      rtb-ec023794
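
VPC peering requires that the two VPCs' CIDR blocks do not overlap, so it is worth checking the ranges in the table before creating the connection. A minimal sketch using Python's standard `ipaddress` module (the helper name `cidrs_overlap` is hypothetical):

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR blocks share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


# CIDR ranges of AbizerVPC and AbizerDatabricksVPC from the table above.
print(cidrs_overlap("10.0.0.0/16", "172.78.0.0/16"))  # prints False
```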


Step 1: Create a peering connection


  1. Navigate to the VPC Dashboard.
  2. Select Peering Connections.
  3. Click Create Peering Connection
  4. Set the VPC Requester to the AbizerVPC ID (vpc-29e2ca50).
  5. Set the VPC Acceptor to the AbizerDatabricksVPC ID (vpc-bba395c2).
  6. Click Create Peering Connection.

Once all the correct information is entered, the peering request is created and the message below is shown.



The peering connection ID is pcx-bb31d0d3.

Step 2: Accept the peering connection request

The owner of the VPC in Account B (account 997819012307) needs to approve the request. Until this is done, the status under Peering Connections shows Pending Acceptance, as seen in the figure below. To approve, select Actions > Accept Request.


Step 3: Add DNS resolution to peering connection


  1. Log into the AWS Account that hosts the AbizerVPC.
  2. Navigate to the VPC Dashboard.
  3. Select Peering Connections.
  4. From the Actions menu, select Edit DNS Settings.
  5. Click to enable DNS resolution.

Note: If enabling DNS resolution fails with the error below:

"Public Hostnames are disabled for: vpc-29e2ca50"

go to the VPC tab, select the VPC in question, and click Actions --> Edit DNS Hostnames to enable public hostnames.



Similarly, enable DNS resolution for Account B.


Step 4: Add destination to AbizerVPC main route table (This is the route the instance will use to communicate with account B)


  1. Select Route Tables in the VPC Dashboard.
  2. Search for the AbizerVPC ID.
  3. Click the Edit button under the Routes tab.
  4. Click Add another route.
  5. Enter the CIDR range of the AbizerDatabricksVPC (172.78.0.0/16) for the Destination.
  6. Enter the ID of the peering connection (pcx-bb31d0d3) for the Target.


Step 5: Add destination to AbizerDatabricksVPC main route table (This is the route the instance will use to communicate with account A).


  1. Select Route Tables in the VPC Dashboard.
  2. Search for the AbizerDatabricksVPC ID.
  3. Click the Edit button under the Routes tab.
  4. Click Add another route.
  5. Enter the CIDR range of the AbizerVPC (10.0.0.0/16) for the Destination.
  6. Enter the ID of the peering connection for the Target.
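
Steps 4 and 5 mirror each other: each VPC's main route table gets a route whose destination is the other VPC's CIDR range and whose target is the peering connection. As a rough sketch of that symmetry, using the IDs and ranges from this post (the function name `peering_routes` and the dictionary layout are hypothetical, and nothing here calls AWS):

```python
PEERING_ID = "pcx-bb31d0d3"

# VPC names and CIDR ranges as listed in this post.
VPCS = {
    "AbizerVPC": "10.0.0.0/16",
    "AbizerDatabricksVPC": "172.78.0.0/16",
}


def peering_routes(vpcs, peering_id):
    """For each VPC, list the routes to every other VPC via the peering connection."""
    routes = {}
    for name in vpcs:
        routes[name] = [
            {"destination": cidr, "target": peering_id}
            for other, cidr in vpcs.items()
            if other != name
        ]
    return routes


for name, entries in peering_routes(VPCS, PEERING_ID).items():
    print(name, entries)
```

Each VPC ends up with exactly one new route, pointing at the opposite VPC's range through pcx-bb31d0d3.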


Once saved, VPC peering should work, and resources reaching from subnet A to subnet B should not need internet access (traffic goes over the private Amazon network through the peering connection).

NOTE: This post assumes the security groups for both VPCs are wide open and allow traffic between the VPCs. If that is not the case, we have to add security group rules to allow it.
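
If the security groups are not wide open, an inbound rule has to cover the peer VPC's CIDR range and the port in use. The sketch below only models that check locally; the rule (TCP port 3306 from 10.0.0.0/16, as might be used for a MySQL-style database) and the helper name `is_allowed` are hypothetical examples, not settings from this setup.

```python
import ipaddress

# Hypothetical inbound rules: (source CIDR, first port, last port).
INBOUND_RULES = [
    ("10.0.0.0/16", 3306, 3306),
]


def is_allowed(rules, source_ip, port):
    """Return True if any rule covers both the source address and the port."""
    source = ipaddress.ip_address(source_ip)
    return any(
        source in ipaddress.ip_network(cidr) and low <= port <= high
        for cidr, low, high in rules
    )


print(is_allowed(INBOUND_RULES, "10.0.1.13", 3306))    # prints True
print(is_allowed(INBOUND_RULES, "10.0.1.13", 22))      # prints False
print(is_allowed(INBOUND_RULES, "192.168.1.5", 3306))  # prints False
```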

The table below lists the peered VPCs with their IDs, along with the main route table and the peering connection ID.


AWS Service           Name                                 ID              CIDR Range
VPC                   AbizerVPC                            vpc-29e2ca50    10.0.0.0/16
VPC                   AbizerDatabricksVPC                  vpc-bba395c2    172.78.0.0/16
Route Table           Custtestmainroute                    rtb-ec023794
Peering Connection    AbizerVPC --> AbizerDatabricksVPC    pcx-bb31d0d3
Security Group        Default Group                        sg-df2d17a1