Create External Table in Azure Databricks

Problem Description

I am new to Azure Databricks and am trying to create an external table pointing to an Azure Data Lake Storage (ADLS) Gen-2 location.

From a Databricks notebook I have tried to set the Spark configuration for ADLS access, but I am still unable to execute the DDL I created.

Note: One solution that works for me is mounting the ADLS account to the cluster and then using the mount location in the external table's DDL. But I wanted to check whether it is possible to create the external table DDL with an ADLS path, without a mount location.
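
For reference, a minimal sketch of that mount-based workaround; the mount point /mnt/adls and the credential values are placeholders, not from the original post:

configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "client_id",
  "fs.azure.account.oauth2.client.secret": "client_secret",
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/tenant_id/oauth2/token"
}

# Mount the container once; the external table DDL can then use
# location '/mnt/adls/dev/data/employee'
dbutils.fs.mount(
  source = "abfss://container@account_name.dfs.core.windows.net/",
  mount_point = "/mnt/adls",
  extra_configs = configs
)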

# Using Principal credentials
spark.conf.set("dfs.azure.account.auth.type", "OAuth")
spark.conf.set("dfs.azure.account.oauth.provider.type", "ClientCredential")
spark.conf.set("dfs.azure.account.oauth2.client.id", "client_id")
spark.conf.set("dfs.azure.account.oauth2.client.secret", "client_secret")
spark.conf.set("dfs.azure.account.oauth2.client.endpoint", 
"https://login.microsoftonline.com/tenant_id/oauth2/token")

DDL

create external table test(
id string,
name string
)
partitioned by (pt_batch_id bigint, pt_file_id integer)
STORED as parquet
location 'abfss://container@account_name.dfs.core.windows.net/dev/data/employee'

Error received

Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.);

I need help in knowing whether it is possible to refer to the ADLS location directly in the DDL.

Thanks.

Recommended Answer

Sort of, if you can use Python (or Scala).

Start by establishing the connection:

TenantID = "blah"

def connectLake():
  # Session-scoped OAuth (service principal) settings for ADLS Gen2.
  # Note the fs.azure.* key prefix here, versus dfs.azure.* in the question.
  spark.conf.set("fs.azure.account.auth.type", "OAuth")
  spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
  # The client id and secret are read from a Databricks secret scope
  spark.conf.set("fs.azure.account.oauth2.client.id", dbutils.secrets.get(scope = "LIQUIX", key = "lake-sp"))
  spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "LIQUIX", key = "lake-key"))
  spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/"+TenantID+"/oauth2/token")

connectLake()
lakePath = "abfss://liquix@mystorageaccount.dfs.core.windows.net/"

Using Python, you can register a table with:

spark.sql("CREATE TABLE DimDate USING PARQUET LOCATION '"+lakePath+"/PRESENTED/DIMDATE/V1'")

You can now query that table if you have executed the connectLake() function, which is fine in your current session/notebook.

The problem is that if a new session comes in and tries to select * from that table, it will fail unless the connectLake() function is run first. There is no way around that limitation, as you have to present credentials to access the lake.

You may want to consider ADLS Gen2 credential passthrough: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html

Note that this requires using a High Concurrency cluster.
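
With passthrough enabled, storage access uses each user's own Azure AD identity, so no spark.conf.set calls or connectLake() are needed. A minimal sketch, reusing the question's path:

# On a High Concurrency cluster with ADLS Gen2 credential passthrough
# enabled, the lake can be read directly with the caller's AAD identity.
df = spark.read.parquet("abfss://container@account_name.dfs.core.windows.net/dev/data/employee")
df.show()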
