Azure Databricks to Azure SQL DW: Long text columns

Problem Description

I would like to populate an Azure SQL DW from an Azure Databricks notebook environment. I am using the built-in connector with pyspark:

sdf.write \
  .format("com.databricks.spark.sqldw") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "test_table") \
  .option("url", url) \
  .option("tempDir", temp_dir) \
  .save()

This works fine, but I get an error when I include a string column with sufficiently long content. I get the following error:

Py4JJavaError: An error occurred while calling o1252.save. : com.databricks.spark.sqldw.SqlDWSideException: SQL DW failed to execute the JDBC query produced by the connector.

Underlying SQLException(s): - com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopSqlException: String or binary data would be truncated. [ErrorCode = 107090] [SQLState = S0001]

As I understand it, this is because the default string type is NVARCHAR(256). This is configurable (reference), but the maximum NVARCHAR length is 4k characters. My strings occasionally reach 10k characters. Therefore, I am curious how I can export certain columns as text/longtext instead.

I would guess that the following would work, if only the preActions were executed after the table was created. They are not, and therefore it fails.

sdf.write \
  .format("com.databricks.spark.sqldw") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "test_table") \
  .option("url", url) \
  .option("tempDir", temp_dir) \
  .option("preActions", "ALTER TABLE test_table ALTER COLUMN value NVARCHAR(MAX);") \
  .save()

Also, postActions are executed after data is inserted, and therefore this will also fail.

Any ideas?

Recommended Answer

I had a similar problem and was able to resolve it using the option:

.option("maxStrLength",4000)

Thus in your example this would be:

sdf.write \
  .format("com.databricks.spark.sqldw") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "test_table") \
  .option("maxStrLength",4000)\
  .option("url", url) \
  .option("tempDir", temp_dir) \
  .save()

This is documented here:

" StringType映射到Azure Synapse中的NVARCHAR(maxStrLength)类型.您可以使用maxStrLength来设置Azure Synapse中名称为dbTable的表中所有NVARCHAR(maxStrLength)类型列的字符串长度."

If your strings go over 4k then you should:

Pre-define your table column with NVARCHAR(MAX) and then write in append mode to the table. In this case you can't use the default columnstore index so either use a HEAP or set proper indexes. A lazy heap would be:

CREATE TABLE example.table
(
    NormalColumn NVARCHAR(256),
    LongColumn NVARCHAR(4000),
    VeryLongColumn NVARCHAR(MAX)
) 
WITH (HEAP)

Then you can write to it as usual, without the maxStrLength option. This also means you don't overspecify all other string columns.
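
For illustration, a minimal sketch of that append-mode write against the pre-created table (reusing url and temp_dir from the question; the dbTable value matches the example.table created above):

# Append to the existing example.table instead of letting the connector create it
sdf.write \
  .format("com.databricks.spark.sqldw") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "example.table") \
  .option("url", url) \
  .option("tempDir", temp_dir) \
  .mode("append") \
  .save()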

Other options include:

  1. Use split to convert the one long column into several string columns.
  2. Save as Parquet and then load it from within Synapse (see the sketch below).
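
As a rough sketch of option 2 (the storage path below is a placeholder), the DataFrame could first be staged as Parquet in cloud storage, with the load then happening on the Synapse side, for example via COPY INTO or an external table:

# Stage the DataFrame as Parquet in cloud storage (placeholder path);
# Parquet preserves the full content of the long string columns.
sdf.write \
  .mode("overwrite") \
  .parquet("abfss://container@account.dfs.core.windows.net/staging/test_table")

# The staged files can then be loaded from within Synapse,
# e.g. with COPY INTO or by defining an external table over them.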
