How to control data failures in Azure Data Factory Pipelines?


Question

I receive an error from time to time due to incompatible data in my source data set compared to my target data set. I would like to control the action the pipeline takes based on the error type, perhaps outputting or dropping those particular rows while still completing everything else. Is that possible? Furthermore, is it possible to get hold of the actual failing line(s) from Data Factory in some simple way, without accessing and searching the actual source data set?

The error message:

    Copy activity encountered a user error at Sink side: ErrorCode=UserErrorInvalidDataValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Column 'Timestamp' contains an invalid value '11667'. Cannot convert '11667' to type 'DateTimeOffset'.,Source=Microsoft.DataTransfer.Common,''Type=System.FormatException,Message=String was not recognized as a valid DateTime.,Source=mscorlib,'.
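The root cause is visible in the message: the sink column is `DateTimeOffset`, and the value `11667` cannot be converted to a date. A minimal console sketch of the conversion being attempted (this is illustrative only, not ADF code):

```csharp
using System;
using System.Globalization;

class ParseCheck
{
    static void Main()
    {
        // A bare number like "11667" is not a recognized date/time
        // format, so the conversion fails with a FormatException
        // inside the copy activity.
        bool ok = DateTimeOffset.TryParse(
            "11667", CultureInfo.InvariantCulture,
            DateTimeStyles.None, out DateTimeOffset parsed);
        Console.WriteLine(ok); // False

        // A well-formed timestamp converts without error.
        ok = DateTimeOffset.TryParse(
            "2017-05-01T10:30:00+00:00", CultureInfo.InvariantCulture,
            DateTimeStyles.None, out parsed);
        Console.WriteLine(ok); // True
    }
}
```

A `TryParse`-style guard like this is the kind of check a cleaning step can apply per row before the copy runs.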

Thanks

Answer

I think you've hit a fairly common problem and limitation within ADF. Although the datasets you define with your JSON allow ADF to understand the structure of the data, that is all it understands: just the structure. The orchestration tool can't do anything to transform or manipulate the data as part of the activity processing.

To answer your question directly: it's certainly possible, but you'll need to break out the C# and use ADF's extensibility functionality to deal with your bad rows before passing them to the final destination.

I suggest you expand your data factory to include a custom activity, where you can build some lower-level cleaning processes to divert the bad rows as described.

This is an approach we often take, as not all data is perfect (I wish) and plain ETL or ELT doesn't work. I prefer the acronym ECLT, where the 'C' stands for clean (or cleanse, prepare, etc.). This certainly applies to ADF because the service doesn't have its own compute or an SSIS-style data flow engine.

So...

In terms of how to do this: first, I recommend you check out this blog post on creating ADF custom activities. Link:

https://www.purplefrogsystems.com/paul/2016/11/creating-azure-data-factory-custom-activities/

Then, within your C# class that inherits from IDotNetActivity, do something like the below.

    public IDictionary<string, string> Execute(
        IEnumerable<LinkedService> linkedServices,
        IEnumerable<Dataset> datasets,
        Activity activity,
        IActivityLogger logger)
    {
        // ... resolve connection strings and file paths from the
        // linked services and datasets, etc.

        using (StreamReader vReader = new StreamReader(YourSource))
        {
            using (StreamWriter vWriter = new StreamWriter(YourDestination))
            {
                while (!vReader.EndOfStream)
                {
                    string vLine = vReader.ReadLine();

                    // data transform logic: validate the row here and
                    // divert it to a reject output if it's bad,
                    // otherwise pass it through
                    vWriter.WriteLine(vLine);
                }
            }
        }

        return new Dictionary<string, string>();
    }

You get the idea: build your own SSIS data flow!

Then write out your clean rows as an output dataset, which can be the input for your next ADF activity, either with multiple pipelines or as chained activities within a single pipeline.
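The chaining happens through the datasets: in ADF v1 pipeline JSON, the custom activity's output dataset is listed as the input of the next activity, and the service runs them in dependency order. A rough sketch of what that shape could look like (all dataset, linked service, assembly, and container names here are hypothetical placeholders):

```json
{
  "name": "CleanThenCopyPipeline",
  "properties": {
    "activities": [
      {
        "name": "CleanBadRows",
        "type": "DotNetActivity",
        "inputs": [ { "name": "RawBlobDataset" } ],
        "outputs": [ { "name": "CleanBlobDataset" } ],
        "typeProperties": {
          "assemblyName": "MyCleaningActivity.dll",
          "entryPoint": "MyCleaningActivity.CleanRows",
          "packageLinkedService": "StorageLinkedService",
          "packageFile": "customactivitycontainer/MyCleaningActivity.zip"
        }
      },
      {
        "name": "CopyCleanRows",
        "type": "Copy",
        "inputs": [ { "name": "CleanBlobDataset" } ],
        "outputs": [ { "name": "TargetSqlDataset" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlSink" }
        }
      }
    ]
  }
}
```

Because `CleanBlobDataset` is both the output of the custom activity and the input of the copy, the copy only runs once a clean slice has been produced.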

This is the only way you'll get ADF to deal with your bad data in the current service offering.

Hope this helps.

