Sunday, September 9, 2018

Set up an external metastore for Databricks deployment (via Init script)

In my earlier blog we saw how to set up a MySQL DB and configure VPC peering so the Databricks VPC can communicate with the VPC the MySQL instance is in.
http://abizeradenwala.blogspot.com/2018/09/vpc-peering-between-two-vpcs-in-same.html
Now we will use this MySQL instance as an external metastore for our Databricks Spark clusters. When you want your clusters to connect to your existing Hive metastore without explicitly setting the required configurations each time, an init script is an easy way to have the cluster connect to the external metastore every time it starts.
To set up an external metastore in local mode using an init script, open a notebook and execute the following snippet (connecting to the ExternalMetastoreTesting cluster, which is not yet using the external metastore). The snippet adds an init script external-metastore.sh (this name is not mandatory) to /databricks/init/<cluster-name>/ in the Databricks File System (DBFS).
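Here is a minimal sketch of that notebook cell (Python, modeled on the Databricks documentation; the JDBC URL and user are the RDS instance used later in this post, and <password> is a placeholder you must replace):

%python
# Write a named init script for the cluster "ExternalMetastoreTesting".
# The directory name must match the cluster name exactly.
cluster_name = "ExternalMetastoreTesting"

init_script = """#!/bin/sh
# Quoting 'EOF' disables variable interpolation inside the heredoc.
cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
[driver] {
    # Hive-specific options; the spark.hadoop prefix makes sure they
    # propagate to the metastore client.
    "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:mariadb://externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com:3306/abihive?createDatabaseIfNotExist=true"
    "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "org.mariadb.jdbc.Driver"
    "spark.hadoop.javax.jdo.option.ConnectionUserName" = "root"
    "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<password>"
}
EOF
"""

dbutils.fs.put("/databricks/init/%s/external-metastore.sh" % cluster_name,
               init_script, overwrite=True)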

Note:
We use the "mariadb" driver because Databricks ships it as its out-of-the-box JDBC library for MySQL. You can use any other driver as long as its JDBC jar is explicitly placed on the classpath.
  • Databricks Runtime 3.4 and above include the org.mariadb.jdbc driver.
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#connecting-to-sql-databases-using-jdbc

This init script writes the required configuration options to a configuration file named 00-custom-spark.conf, in a JSON-like format, under /databricks/driver/conf/ on every node of the cluster whenever a cluster with the specified <cluster-name> starts.

Note: Databricks provides default Spark configurations in the /databricks/driver/conf/spark-branch.conf file. Configuration files in the /databricks/driver/conf directory apply in reverse alphabetical order, so if you change the name of the 00-custom-spark.conf file, make sure that it continues to apply before the spark-branch.conf file.
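To double-check this on a running cluster, you can list that directory from a notebook shell cell (a quick check, assuming the default file names):

%sh
# Files here apply in reverse alphabetical order; 00-custom-spark.conf
# must continue to apply before spark-branch.conf.
ls /databricks/driver/conf/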
Now edit the cluster and set the properties below in the Spark config:
spark.hadoop.hive.metastore.schema.verification false
datanucleus.fixedDatastore false
datanucleus.schema.autoCreateAll true
datanucleus.autoCreateSchema true



Verification: we can now see that the metadata is coming from the external metastore we configured.
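For example, a quick notebook sanity check along these lines (a sketch; the databases and tables listed will depend on what you have created) is now served by the external metastore:

%python
# Both calls go through the Hive metastore client, which now points at MySQL.
spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES IN default").show()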


Connecting directly to MySQL shows that the metastore database "abihive" exists and contains the required tables.

C02WG59KHTD5:a2df71c3-a02a-11e8-821f-000d3a04560d abizeradenwala$ mysql -h externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com -P 3306 -u root -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3403
Server version: 5.7.22-log Source distribution

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| ExternalMetastore  |
| abihive            |
| innodb             |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
11 rows in set (0.08 sec)

mysql> use abihive;

Database changed
mysql> show tables;
+---------------------------+
| Tables_in_abihive         |
+---------------------------+
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| DATABASE_PARAMS           |
| DBS                       |
| FUNCS                     |
| FUNC_RU                   |
| GLOBAL_PRIVS              |
| PARTITIONS                |
| PARTITION_KEYS            |
| PARTITION_KEY_VALS        |
| PARTITION_PARAMS          |
| PART_COL_STATS            |
| ROLES                     |
| SDS                       |
| SD_PARAMS                 |
| SEQUENCE_TABLE            |
| SERDES                    |
| SERDE_PARAMS              |
| SKEWED_COL_NAMES          |
| SKEWED_COL_VALUE_LOC_MAP  |
| SKEWED_STRING_LIST        |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES             |
| SORT_COLS                 |
| TABLE_PARAMS              |
| TAB_COL_STATS             |
| TBLS                      |
| VERSION                   |
+---------------------------+
29 rows in set (0.08 sec)

mysql> 

We can also review the driver logs when the cluster starts to verify that all the required configs were picked up correctly.

18/09/09 20:07:13 INFO MetastoreMonitor$: Generic external metastore configurd (config=jdbc:mariadb://externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com:3306/abihive?createDatabaseIfNotExist=true) 
18/09/09 20:07:14 INFO DriverCorral: Creating the driver context 
18/09/09 20:07:14 INFO DatabricksILoop$: Class Server Dir: /tmp/spark-6467922a-76bc-45ff-885f-d1c1879c95a9 
18/09/09 20:07:14 INFO SparkConfUtils$: Customize spark config according to file /tmp/custom-spark.conf 
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: datanucleus.autoCreateSchema -> true 
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: spark.executor.tempDirectory -> /local_disk0/tmp 
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: spark.hadoop.hive.metastore.schema.verification -> false 
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: datanucleus.fixedDatastore -> false 
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: datanucleus.schema.autoCreateAll -> true 
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: spark.driver.tempDirectory -> /local_disk0/tmp 
18/09/09 20:07:14 WARN SparkConf: The configuration key 'spark.scheduler.listenerbus.eventqueue.size' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.scheduler.listenerbus.eventqueue.capacity' instead. 
18/09/09 20:07:14 INFO SparkContext: Running Spark version 2.3.1 
18/09/09 20:07:15 INFO SparkContext: Submitted application: Databricks Shell 
18/09/09 20:07:15 INFO SparkContext: Spark configuration: datanucleus.autoCreateSchema=true datanucleus.fixedDatastore=false datanucleus.schema.autoCreateAll=true eventLog.rolloverIntervalSeconds=3600 spark.akka.frameSize=256 spark.app.name=Databricks Shell spark.cleaner.referenceTracking.blocking=false spark.databricks.acl.client=com.databricks.spark.sql.acl.client.SparkSqlAclClient spark.databricks.acl.provider=com.databricks.sql.acl.ReflectionBackedAclProvider spark.databricks.cloudProvider=AWS spark.databricks.clusterSource=UI spark.databricks.clusterUsageTags.autoTerminationMinutes=120 spark.databricks.clusterUsageTags.clusterAllTags=[{"key":"Vendor","value":"Databricks"},{"key":"Creator","value":"abizer.adenwala@databricks.com"},{"key":"ClusterName","value":"ExternalMetastoreTesting"},{"key":"ClusterId","value":"0909-180303-tall49"},{"key":"Name","value":"cust-success-worker"}]

Yay, the external metastore is now configured.

Note:
The Google Doc link below has my notebook in HTML and DBC formats for reviewers to review and easily use.
Troubleshooting:

If the Databricks UI shows database tables failing to load, review the driver logs and check for any errors.

1) In the case below, Spark is looking for the "abihive" database in the external metastore, but that database is missing, hence the failure. To fix this, either create the database in the external metastore manually, or add "?createDatabaseIfNotExist=true" to the metastore connection string, as below.

"spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:mariadb://externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com:3306/abihive?createDatabaseIfNotExist=true"

================================================================
18/09/09 19:16:56 ERROR Schema: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:mariadb://externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com:3306/abihive, username = root. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLSyntaxErrorException: Unknown database 'abihive'
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:163)
------
org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:mariadb://externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com:3306/abihive, username = root. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLSyntaxErrorException: Unknown database 'abihive'
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:163)
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getException(ExceptionMapper.java:106)
================================================================
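If you go with the first fix instead (creating the database manually), connect to MySQL as shown earlier and run:

mysql> CREATE DATABASE abihive;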


