Set up an external metastore for Databricks deployment (via Init script)
In my earlier blog post we saw how to set up a MySQL DB and configure VPC peering so the Databricks VPC can communicate with the VPC that MySQL is in.
http://abizeradenwala.blogspot.com/2018/09/vpc-peering-between-two-vpcs-in-same.html
Now we will use this MySQL instance as an external metastore for our Databricks Spark clusters. When you want your clusters to connect to your existing Hive metastore without explicitly setting the required configurations each time, doing it via init scripts is an easy way to have the cluster connect to the external metastore every time it starts.
To set up an external metastore in local mode using an init script, open a notebook and execute a snippet like the one below (connecting to the ExternalMetastoreTesting cluster, which is not yet using the external metastore). The snippet adds an init script external-metastore.sh (this name is not mandatory) to /databricks/init/<cluster-name>/ in the Databricks File System (DBFS).
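A minimal Python sketch of that snippet, assuming the ExternalMetastoreTesting cluster and the RDS endpoint used throughout this post (the password is a placeholder; depending on your Hive version you may also need to set spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars):

dbutils.fs.put(
    "/databricks/init/ExternalMetastoreTesting/external-metastore.sh",
    """#!/bin/sh
# Quoting the label (EOF) with single quotes disables variable interpolation.
cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
[driver] {
    # Hive-specific options; the spark.hadoop prefix ensures they
    # propagate to the metastore client.
    "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:mariadb://externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com:3306/abihive?createDatabaseIfNotExist=true"
    "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "org.mariadb.jdbc.Driver"
    "spark.hadoop.javax.jdo.option.ConnectionUserName" = "root"
    "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<password>"
}
EOF
""",
    True)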
Note: We use the "mariadb" driver because Databricks Runtime 3.4 and above include the org.mariadb.jdbc JDBC driver out of the box. You can use any other driver as long as its JDBC driver jar is explicitly placed on the classpath.
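To confirm the driver is on the classpath, here is a quick check from a Python notebook (a sketch using the py4j gateway exposed by spark._jvm):

# Raises a ClassNotFoundException if the MariaDB JDBC driver is missing.
spark._jvm.java.lang.Class.forName("org.mariadb.jdbc.Driver")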
Note: Databricks provides default Spark configurations in the /databricks/driver/conf/spark-branch.conf file. Configuration files in the /databricks/driver/conf directory apply in reverse alphabetical order. If you want to change the name of the 00-custom-spark.conf file, make sure that it continues to apply before the spark-branch.conf file.
Now edit the cluster and set the properties below in its Spark config. These disable metastore schema verification and allow DataNucleus to create the metastore schema automatically on first use:
spark.hadoop.hive.metastore.schema.verification false
datanucleus.fixedDatastore false
datanucleus.schema.autoCreateAll true
datanucleus.autoCreateSchema true
Verification: We can see that the metadata is now coming from the external metastore we configured.
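For example, a quick check from a notebook (a minimal sketch; the table name test_metastore_check is hypothetical):

# Create a table through Spark; it should get registered in the external metastore.
spark.sql("CREATE TABLE IF NOT EXISTS test_metastore_check (id INT)")
display(spark.sql("SHOW TABLES"))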
Directly connecting to MySQL also shows that the metastore database "abihive" exists and has the required tables.
C02WG59KHTD5:a2df71c3-a02a-11e8-821f-000d3a04560d abizeradenwala$ mysql -h externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com -P 3306 -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3403
Server version: 5.7.22-log Source distribution
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| ExternalMetastore |
| abihive            |
| innodb |
| mysql |
| performance_schema |
| sys |
+--------------------+
11 rows in set (0.08 sec)
mysql> use abihive;
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_abihive |
+---------------------------+
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| DATABASE_PARAMS |
| DBS |
| FUNCS |
| FUNC_RU |
| GLOBAL_PRIVS |
| PARTITIONS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_STATS |
| ROLES |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SKEWED_COL_NAMES |
| SKEWED_COL_VALUE_LOC_MAP |
| SKEWED_STRING_LIST |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES |
| SORT_COLS |
| TABLE_PARAMS |
| TAB_COL_STATS |
| TBLS |
| VERSION |
+---------------------------+
29 rows in set (0.08 sec)
mysql>
We can also review the driver logs when the cluster starts to verify that all the required configs are picked up correctly.
18/09/09 20:07:13 INFO MetastoreMonitor$: Generic external metastore configurd (config=jdbc:mariadb://externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com:3306/abihive?createDatabaseIfNotExist=true)
18/09/09 20:07:14 INFO DriverCorral: Creating the driver context
18/09/09 20:07:14 INFO DatabricksILoop$: Class Server Dir: /tmp/spark-6467922a-76bc-45ff-885f-d1c1879c95a9
18/09/09 20:07:14 INFO SparkConfUtils$: Customize spark config according to file /tmp/custom-spark.conf
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: datanucleus.autoCreateSchema -> true
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: spark.executor.tempDirectory -> /local_disk0/tmp
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: spark.hadoop.hive.metastore.schema.verification -> false
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: datanucleus.fixedDatastore -> false
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: datanucleus.schema.autoCreateAll -> true
18/09/09 20:07:14 INFO SparkConfUtils$: Set spark config: spark.driver.tempDirectory -> /local_disk0/tmp
18/09/09 20:07:14 WARN SparkConf: The configuration key 'spark.scheduler.listenerbus.eventqueue.size' has been deprecated as of Spark 2.3 and may be removed in the future. Please use the new key 'spark.scheduler.listenerbus.eventqueue.capacity' instead.
18/09/09 20:07:14 INFO SparkContext: Running Spark version 2.3.1
18/09/09 20:07:15 INFO SparkContext: Submitted application: Databricks Shell
18/09/09 20:07:15 INFO SparkContext: Spark configuration:
datanucleus.autoCreateSchema=true
datanucleus.fixedDatastore=false
datanucleus.schema.autoCreateAll=true
eventLog.rolloverIntervalSeconds=3600
spark.akka.frameSize=256
spark.app.name=Databricks Shell
spark.cleaner.referenceTracking.blocking=false
spark.databricks.acl.client=com.databricks.spark.sql.acl.client.SparkSqlAclClient
spark.databricks.acl.provider=com.databricks.sql.acl.ReflectionBackedAclProvider
spark.databricks.cloudProvider=AWS
spark.databricks.clusterSource=UI
spark.databricks.clusterUsageTags.autoTerminationMinutes=120
spark.databricks.clusterUsageTags.clusterAllTags=[{"key":"Vendor","value":"Databricks"},{"key":"Creator","value":"abizer.adenwala@databricks.com"},{"key":"ClusterName","value":"ExternalMetastoreTesting"},{"key":"ClusterId","value":"0909-180303-tall49"},{"key":"Name","value":"cust-success-worker"}]
Yay, the external metastore is now configured.
Note: The Google Doc link below has my notebook in HTML and DBC formats for reviewers to review and easily use.
Troubleshooting:
In case the Databricks UI shows database tables failing to load, review the driver logs and check for errors, if any.
1) In the case below, the metastore client is looking for the "abihive" database in the external metastore, but that database is missing, hence the problem. To fix this, either create the database in the external metastore manually (for example, run CREATE DATABASE abihive; at the MySQL prompt) or add "?createDatabaseIfNotExist=true" to the metastore connection string, as below.
"spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:mariadb://externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com:3306/abihive?createDatabaseIfNotExist=true"
================================================================
18/09/09 19:16:56 ERROR Schema: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:mariadb://externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com:3306/abihive, username = root. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLSyntaxErrorException: Unknown database 'abihive'
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:163)
------
org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:mariadb://externalmetastore.cj11tymkwz5w.us-west-2.rds.amazonaws.com:3306/abihive, username = root. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLSyntaxErrorException: Unknown database 'abihive'
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:163)
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getException(ExceptionMapper.java:106)
================================================================