Azure Databricks external Hive Metastore
Problem description
I checked the [documentation][1] about usage of an Azure Databricks external Hive Metastore (Azure SQL database).
I was able to download the jars and place them into /dbfs/hive_metastore_jar
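(For reference, the documentation's way of obtaining those jars is to start a throwaway cluster whose Spark config tells the metastore client to fetch them from Maven; the version below is an assumption matching the Hive 1.2 setup used later in this post:)

```
spark.sql.hive.metastore.jars maven
spark.sql.hive.metastore.version 1.2.1
```

Once such a cluster starts, the driver log reports the directory the jars were downloaded to (typically under /local_disk0/tmp), from where they can be copied into DBFS.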
My next step was to run a cluster with this init file:
# Hive-specific configuration options.
# spark.hadoop prefix is added to make sure these Hive specific options propagate to the metastore client.
# JDBC connect string for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<host>.database.windows.net:1433;database=<database> #should I add more parameters?
# Username to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionUserName admin
# Password to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionPassword p@ssword
# Driver class name for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
# Spark specific configuration options
spark.sql.hive.metastore.version 2.7.3 #I am not sure about this
# Skip this one if <hive-version> is 0.13.x.
spark.sql.hive.metastore.jars /dbfs/hive_metastore_jar
I've uploaded the ini file to DBFS and launched the cluster. It failed to read the ini file; something is wrong.

[1]: https://docs.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore
Answer
I solved this for now. The problems I faced:
- I didn't copy the Hive jars to the local cluster. This is important: spark.sql.hive.metastore.jars cannot point at DBFS and should refer to a local copy of the Hive jars, which the init script copies over.
- The connection was good. I also used the Azure template with a VNet, which is preferable. I then allowed traffic to Azure SQL from my VNet with Databricks.
- Last issue: I had to create the Hive schema before starting Databricks, by copying the DDL for Hive version 1.2 from Git and running it. I deployed it into the Azure SQL database and then I was good to go.
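As a sketch of that last step: the Hive 1.2 source ships a SQL Server DDL file (hive-schema-1.2.0.mssql.sql, under scripts/metastore/upgrade/mssql), which could be applied to the Azure SQL database with the sqlcmd CLI. The command is only printed here rather than executed, and the server/database/user values are the same placeholders used elsewhere in this answer:

```shell
# Print (rather than run) a sketch of the schema-creation command.
# Server, database, and user are placeholder values; the DDL file name
# comes from the Hive 1.2 distribution.
echo "sqlcmd -S host--name.database.windows.net -d tcdatabricksmetastore_dev -U admin -i hive-schema-1.2.0.mssql.sql"
```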
There is a useful notebook with steps to download the jars. It downloads the jars to tmp; then we should copy them into our own folder. Finally, during cluster creation we should reference an init script that holds all the parameters. It includes the step of copying the jars from DBFS to the local file system of the cluster.
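A small local dry run of that copy step (all paths are stand-ins created with mktemp; on a real cluster the source would be the DBFS folder and the target the node-local directory). One easy-to-miss detail: the target directory must exist before cp copies multiple jars into it:

```shell
# Stand-ins for the real locations (assumptions, not cluster paths):
SRC=$(mktemp -d)                             # plays /dbfs/metastore_jars/hive-v1_2
DST=$(mktemp -d)/hive_1_2_1_metastore_jars   # plays the node-local folder
# Pretend a metastore jar was downloaded earlier.
touch "$SRC/hive-metastore-1.2.1.jar"
# cp into a non-existent directory fails with multiple sources,
# so create the target first.
mkdir -p "$DST"
cp -r "$SRC"/. "$DST"/
ls "$DST"
```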
// This example is for an init script named `external-metastore_hive121.sh`.
dbutils.fs.put(
"dbfs:/databricks/scripts/external-metastore_hive121.sh",
"""#!/bin/sh
|# A temporary workaround to make sure /dbfs is available.
|sleep 10
|# Copy metastore jars from DBFS to the local FileSystem of every node.
|# The target directory must exist before copying multiple jars into it.
|mkdir -p /databricks/hive_1_2_1_metastore_jars
|cp -r /dbfs/metastore_jars/hive-v1_2/* /databricks/hive_1_2_1_metastore_jars
|# Loads environment variables to determine the correct JDBC driver to use.
|source /etc/environment
|# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
|[driver] {
| # Hive specific configuration options.
| # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
| # JDBC connect string for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:sqlserver://host--name.database.windows.net:1433;database=tcdatabricksmetastore_dev;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net"
|
| # Username to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionUserName" = "admin"
|
| # Password to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionPassword" = "P@ssword"
|
| # Driver class name for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
|
| # Spark specific configuration options
| "spark.sql.hive.metastore.version" = "1.2.1"
| # Skip this one if ${hive-version} is 0.13.x.
| "spark.sql.hive.metastore.jars" = "/databricks/hive_1_2_1_metastore_jars/*"
|}
|EOF
|""".stripMargin,
overwrite = true)
The command will create a file in DBFS, and we will reference it during cluster creation.
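For illustration, referencing the script at cluster creation time amounts to adding an init_scripts entry to the cluster spec (a sketch; everything other than the script path is an assumption):

```
{
  "init_scripts": [
    { "dbfs": { "destination": "dbfs:/databricks/scripts/external-metastore_hive121.sh" } }
  ]
}
```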
According to the documentation, we should use the config:
datanucleus.autoCreateSchema true
datanucleus.fixedDatastore false
in order to have the Hive schema created automatically. That didn't work for me, which is why I took the DDL from Git and created the schema and tables myself.
You can test that everything works with the command:
%sql show databases