Value too long failure when attempting to convert column data


Problem Description

Scenario

I have a source file that contains blocks of JSON on each new line.

I then have a simple U-SQL extract as follows where [RawString] represents each new line in the file and the [FileName] is defined as a variable from the @SourceFile path.

@BaseExtract = 
    EXTRACT 
        [RawString] string, 
        [FileName] string
    FROM
        @SourceFile 
    USING 
        Extractors.Text(delimiter:'\b', quoting : false);

This executes without failure for the majority of my data and I'm able to parse the [RawString] as JSON further down in my script without any problems.

However, I seem to have an extra long row of data in a recent file that cannot be extracted.

Error

Executing this both locally in Visual Studio and against my Data Lake Analytics service in Azure I get the following.

E_RUNTIME_USER_EXTRACT_COLUMN_CONVERSION_TOO_LONG

Value too long failure when attempting to convert column data.

Can not convert string to proper type. The resulting data length is too long.


Having checked this with other tools I can confirm the length of the longest line in the source file is 189,943 characters.

Questions

So my questions for you, my friends...

  1. Has anyone else hit this limit?
  2. What is the defined character limit per line?
  3. What is the best way to work around this issue?
  4. Do I need a custom extractor?

Other Considerations

Some other thoughts...

  • Because each line in the file is a self-contained block of JSON data, I cannot split the rows.
  • If I manually copy a single long row into a separate file and format the JSON, U-SQL can process it as expected using the Newtonsoft.Json library.
  • I'm currently using VS2015 with Data Lake Tools version 2.2.7.

Thanks in advance for your support.

Answer

The limit for a U-SQL string value in a column is currently 128kB (see https://msdn.microsoft.com/en-us/library/azure/mt764129.aspx).
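As a back-of-the-envelope check (a sketch added here, not part of the original answer): 128 kB means 131,072 bytes of UTF-8, so the 189,943-character line from the question exceeds the limit even if every character encodes to a single byte.

```python
# Hypothetical sanity check, not from the original answer:
# U-SQL string columns hold at most 128 kB of UTF-8 encoded data.
LIMIT_BYTES = 128 * 1024                  # 131072 bytes

longest_line_chars = 189_943              # longest line reported in the question
# ASCII encodes to 1 byte per character in UTF-8, so the character
# count is a lower bound on the encoded size of the line.
print(longest_line_chars > LIMIT_BYTES)   # True: the row cannot fit in a string column
```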

In my experience a lot of people are running into it (especially when processing JSON). There are a few ways to work around it:

  1. Find a way to rewrite the extractor to return byte[] and avoid generating a string value until you really have to. That should give you more data (up to 4MB).

  2. Write a custom extractor that does all the navigation and decomposition of your specific JSON format down to the leaf nodes, thus avoiding intermediate long string values.

  3. Return SqlArray instead of string data type values and chunk the string into 128kB pieces (in UTF-8 encoding, not C#'s default UTF-16 encoding!).
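The chunking in option 3 has one subtlety worth showing: the 128 kB boundary must fall between UTF-8 characters, not inside one. Below is a minimal sketch of that idea in Python (purely illustrative; in a real solution this logic would live in C# code-behind and feed a SqlArray):

```python
def chunk_utf8(s: str, limit: int = 128 * 1024) -> list[bytes]:
    """Split s into UTF-8 encoded chunks of at most `limit` bytes,
    never cutting through the middle of a multi-byte character."""
    data = s.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + limit, len(data))
        # Back up while `end` points into a multi-byte character:
        # UTF-8 continuation bytes have the bit pattern 0b10xxxxxx.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end])
        start = end
    return chunks
```

Because every chunk ends on a character boundary, each one decodes back to a valid string, which is what lets the consumer reassemble the original value from the array elements.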

We are looking into increasing the string size, but if you could file/vote up a request on http://aka.ms/adlfeedback that would be helpful.
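To illustrate the "decompose down to the leaf nodes" idea from option 2 above (a hypothetical sketch in Python, not the actual C# extractor you would register with ADLA): walk the parsed JSON and emit one (path, value) row per leaf, so no single intermediate string ever has to hold the whole payload.

```python
import json

def leaf_rows(json_line: str):
    """Yield (path, value) pairs for every leaf in one JSON document."""
    def walk(node, path):
        if isinstance(node, dict):
            for key, child in node.items():
                yield from walk(child, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            for i, child in enumerate(node):
                yield from walk(child, f"{path}[{i}]")
        else:  # leaf: string, number, bool, or null
            yield path, node
    yield from walk(json.loads(json_line), "")

rows = list(leaf_rows('{"id": 1, "tags": ["a", "b"], "meta": {"ok": true}}'))
# rows == [("id", 1), ("tags[0]", "a"), ("tags[1]", "b"), ("meta.ok", True)]
```

In the U-SQL case the extractor would emit these rows directly from the byte stream, so the only strings materialized are the individual leaf values rather than the 189k-character line.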
