在USQL中编写自定义提取器以跳过存在编码问题的行 [英] Writing custom extractor in USQL to skip rows with encoding problems

查看:68
本文介绍了在USQL中编写自定义提取器以跳过存在编码问题的行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有大量的数据,涵盖了数百个文件.显然,其中存在一些编码问题(主要是UTF-8,但显然有些字符无效).根据 https://msdn.microsoft.com/en-us/library/azure/mt764098.aspx 如果存在编码错误,则无论将silent标志设置为true(只是跳过错误行的目的),都会发生运行时错误.

I have a large set of data that spans a couple hundred files. Apparently, it's got a few encoding issues in it (it's mostly UTF-8, but apparently some characters just aren't valid). According to https://msdn.microsoft.com/en-us/library/azure/mt764098.aspx if there is an encoding error, a runtime error will occur regardless of setting the silent flag to true (with the aim of just skipping erroring rows).

因此,我需要编写一个自定义提取器.我在 https://blogs.msdn.microsoft.com/data_otaku/2016/10/27/a-fixed-width-extractor-for-azure-data- lake-analytics/只需一行,然后用定界符将其拆分,然后在try块内返回值.如果有任何例外,我将处理它们并继续前进.

As a result, I need to write a custom extractor. I've written one that largely does a simplified version of the example at https://blogs.msdn.microsoft.com/data_otaku/2016/10/27/a-fixed-width-extractor-for-azure-data-lake-analytics/ in that it just takes a row, splits it by the delimiter and just returns the values within a try block. If there are any exceptions, I just handle them and move on.

不幸的是,我在USQL脚本本身中实际引用此提取器时遇到问题.当我按照上述链接的指导进行操作时,它建议在另一个程序集中编写逻辑,进行构建,然后将其注册到ADLS数据库/程序集中,然后通过脚本顶部的REFERENCE ASSEMBLY MyExtractors;将其包括进来(因为这是名称空间)用过的).在下面的Using语句中,我用USING new SimpleExtractor();调用它.如果这样做,在针对type or namespace cannot be found的ADLS服务运行脚本时遇到错误.另外,如果我尝试更精确一些,并在using语句中使用USING new MyExtractors.SimpleExtractor();,则引用上面的USING语句也会产生相同的错误.

Unfortunately, I'm having an issue actually referencing this extractor in the USQL script itself. When I follow the guidance on the above link, it suggests writing the logic in another assembly, building that, registering it in the ADLS database/assemblies and then including it via REFERENCE ASSEMBLY MyExtractors; at the top of the script (as that is the namespace used). In the below Using statement, I call it with USING new SimpleExtractor(); If I do so, I get an error when running the script against the ADLS service that the type or namespace cannot be found. Additionally, if I attempt to be more precise and use USING new MyExtractors.SimpleExtractor(); in the using statement, it yields the same error, citing the USING statement above.

然后,我在 https://azure.microsoft.com/zh-cn/documentation/articles/data-lake-analytics-u-sql-develop-user-defined-operators/描述了在代码隐藏文件中做同样的事情.我删除了单独的程序集,并将逻辑复制到该文件中的类中.步骤#6中的示例没有显示任何REFERENCE ASSEMBLY语句,但是再次,当我运行它时,出现了type or namespace name cannot be found错误.

I then found additional documentation in an older source at https://azure.microsoft.com/en-us/documentation/articles/data-lake-analytics-u-sql-develop-user-defined-operators/ that describes doing the same thing but in the code-behind file. I deleted the separate assembly and copied the logic into a class in that file. The example in step #6 doesn't show any REFERENCE ASSEMBLY statements, but again, when I run it, I get an error that the type or namespace name cannot be found.

查看最新的发行说明,希望此处有些过时,我唯一看到的是,如果使用USING语句,则需要引用自定义代码的程序集(如第一次尝试),然后才真正使用它.

Looking at the most recent release notes in hopes that something is just out of date here, the only thing I see is that if I use a USING statement , I need a reference to the custom code's assembly (as in the first attempt) prior to actually using it, which I am.

任何人都可以提供有关如何在USQL中正确引用UDO的指导,或者以其他方式指示如何让运行时以静默方式处理编码异常(并跳过它们)吗?

Can anyone please provide some guidance on how to properly reference UDOs in USQL, or otherwise indicate how to have the runtime handle Encoding exceptions silently (and just skip them)?

这是我的逻辑在提取器本身中的样子:

Here's what my logic is looking like in the extractor itself:

using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

namespace Utilities
{
    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
    public class ModifiedTextExtractor : IExtractor
    {
        //Contains the row
        private readonly Encoding _encoding;
        private readonly byte[] _row_delim;
        private readonly char _col_delim;

        public ModifiedTextExtractor()
        {
            _encoding = Encoding.UTF8;
            _row_delim = _encoding.GetBytes("\r\n");
            _col_delim = '\t';
        }

        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            //Read the input line by line
            foreach (var current in input.Split(_row_delim))
            {
                using (var reader = new StreamReader(current, this._encoding))
                {
                    var line = reader.ReadToEnd().Trim();

                    //If there are any single or double quotes in the line, escape them
                    line = line.Replace(@"""", @"\""");

                    var count = 0;

                    //Split the input by the column delimiter
                    var parts = line.Split(_col_delim);

                    foreach (var part in parts)
                    {
                        output.Set<string>(count, part);
                        count += 1;
                    }
                }
                yield return output.AsReadOnly();
            }
        }
    }
}

以及如何在USQL语句中使用它的代码段(将其注册为程序集之后):

And a snippet of how I'm trying to use it in the USQL statement (after registering it as an assembly):

REFERENCE ASSEMBLY [Utilities];

CREATE VIEW MyView AS ...
USING new Utilities.ModifiedTextExtractor();

谢谢!

推荐答案

您遇到了VIEW无法引用自定义代码的问题.在U-SQL中,所有对象都需要包含其上下文规范(例如,其主体中的引用程序集(这使对象更加独立),并且避免了拉长潜在依赖关系的问题,而这种依赖关系是对象用户所不知道的).

You are running into the problem that VIEWs cannot refer to custom code. In U-SQL, all objects need to contain their context specification (such as the referenced assemblies in their body (this makes the objects more self-contained and avoids the problem of pulling a potentially long line of dependencies unbeknownst to the user of the object).

您需要做的是将VIEW转换为表值函数:

What you need to do is turning the VIEW into a Table-Valued Function:

CREATE FUNCTION MyFunct(/* optional parameters */) RETURNS @res AS
BEGIN
  REFERENCE ASSEMBLY [Utilities];
  @res = EXTRACT ... USING new Utilities.ModifiedTextExtractor();
END;

然后按如下所示调用函数(注意,您需要在SELECT语句中提供行集别名):

And then call the function as follows (note you need to provide the rowset alias in the SELECT statement):

@data = SELECT ... FROM MyFunct() AS f WHERE ...;

或者如果您不想应用投影或滤镜:

or if you don't want to apply a projection or filter:

@data = MyFunct();

就像视图一样,将内联一个表值函数.

Like a view a table-valued function will be inlined.

这篇关于在USQL中编写自定义提取器以跳过存在编码问题的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆