How can I exclude rows in a Copy Data Activity in Azure Data Factory?


Problem description

I have built a pipeline with one Copy Data activity that copies data from an Azure Data Lake and outputs it to Azure Blob Storage.

In the output, I can see that some of my rows contain no data, and I would like to exclude them from the copy. In the following example, the second row has no useful data:

{"TenantId":"qa","Timestamp":"2019-03-06T10:53:51.634Z","PrincipalId":2,"ControlId":"729c3b6e-0442-4884-936c-c36c9b466e9d","ZoneInternalId":0,"IsAuthorized":true,"PrincipalName":"John","StreetName":"Rue 1","ExemptionId":8}
{"TenantId":"qa","Timestamp":"2019-03-06T10:59:09.74Z","PrincipalId":null,"ControlId":null,"ZoneInternalId":null,"IsAuthorized":null,"PrincipalName":null,"StreetName":null,"ExemptionId":null}

Question

In the Copy Data activity, how can I add a rule to exclude rows that are missing certain values?

Here is my pipeline code:

{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Copy from Data Lake to Blob",
                "type": "Copy",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [
                    {
                        "name": "Source",
                        "value": "tenantdata/events/"
                    },
                    {
                        "name": "Destination",
                        "value": "controls/"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "AzureDataLakeStoreSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "BlobSink",
                        "copyBehavior": "MergeFiles"
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": {
                            "Body.TenantId": "TenantId",
                            "Timestamp": "Timestamp",
                            "Body.PrincipalId": "PrincipalId",
                            "Body.ControlId": "ControlId",
                            "Body.ZoneId": "ZoneInternalId",
                            "Body.IsAuthorized": "IsAuthorized",
                            "Body.PrincipalName": "PrincipalName",
                            "Body.StreetName": "StreetName",
                            "Body.Exemption.Kind": "ExemptionId"
                        }
                    }
                },
                "inputs": [
                    {
                        "referenceName": "qadl",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "datalakestaging",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    }
}

Recommended answer

This is a very good question (+1 for that). I had the same question a few months back, and I was surprised that I could not find anything within the Copy Activity to handle this (I even tried the fault tolerance feature, but no luck).

Given that I already had other transformations in my pipelines running on U-SQL, I ended up using it for this as well. So, instead of a Copy Activity, I have a U-SQL Activity in ADF that filters with the IS NOT NULL operator. The exact condition depends on your data, so you may need to adjust it; for example, your strings may contain the literal text "NULL" or be empty (""). This is how it looks:

DECLARE @file_set_path string = "adl://myadl.azuredatalake.net/Samples/Data/{date_utc:yyyy}{date_utc:MM}{date_utc:dd}T{date_utc:HH}{date_utc:mm}{date_utc:ss}Z.txt";

@data =
    EXTRACT 
            [id] string,
            date_utc DateTime
    FROM @file_set_path
    USING Extractors.Text(delimiter: '\u0001', skipFirstNRows : 1, quoting:false);

@result =
    SELECT 

            [id] ,
            date_utc.ToString("yyyy-MM-ddTHH:mm:ss") AS SourceExtractDateUTC
    FROM @data
    WHERE id IS NOT NULL; // depending on your data, you could also filter with id != "" or id != "NULL"

OUTPUT @result TO "wasb://samples@mywasb/Samples/Data/searchlog.tsv" USING Outputters.Text(delimiter: '\u0001', outputHeader:true);

Note: both ADLS and Blob storage are supported as input and output file locations.
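
To make the filtering rule itself concrete, here is a minimal Python sketch that applies the same idea to the two sample rows from the question. It is not an ADF or U-SQL feature, only a local illustration of the logic. The names `REQUIRED_FIELDS`, `MISSING_VALUES`, `has_required_values` and `filter_rows` are made up for this example, and the choice of `PrincipalId` and `ControlId` as required columns is an assumption you would adapt to your data.

```python
import json

# The two sample rows from the question, one JSON document per line.
SAMPLE_LINES = [
    '{"TenantId":"qa","Timestamp":"2019-03-06T10:53:51.634Z","PrincipalId":2,'
    '"ControlId":"729c3b6e-0442-4884-936c-c36c9b466e9d","ZoneInternalId":0,'
    '"IsAuthorized":true,"PrincipalName":"John","StreetName":"Rue 1","ExemptionId":8}',
    '{"TenantId":"qa","Timestamp":"2019-03-06T10:59:09.74Z","PrincipalId":null,'
    '"ControlId":null,"ZoneInternalId":null,"IsAuthorized":null,'
    '"PrincipalName":null,"StreetName":null,"ExemptionId":null}',
]

# Assumption: a row is only useful when these columns are populated.
REQUIRED_FIELDS = ("PrincipalId", "ControlId")

# Values treated as missing: real nulls, empty strings, and the literal text "NULL".
MISSING_VALUES = (None, "", "NULL")


def has_required_values(row, required=REQUIRED_FIELDS):
    """Return True when every required column holds a non-missing value."""
    return all(row.get(field) not in MISSING_VALUES for field in required)


def filter_rows(lines, required=REQUIRED_FIELDS):
    """Parse JSON lines and keep only the rows that pass the check."""
    rows = (json.loads(line) for line in lines if line.strip())
    return [row for row in rows if has_required_values(row, required)]


kept = filter_rows(SAMPLE_LINES)
print(len(kept))  # prints 1: only the row for "John" is kept
```

The check covers the same three cases mentioned above (null, empty string, and the text "NULL"), so it can serve as a quick way to decide which condition your own data needs before writing the U-SQL filter.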

Let me know if that helps, or if the example above does not work for your data. Hopefully somebody will post an answer that uses the Copy Activity itself, which would be great, but this is one workable option so far.
