Azure Data Factory - extracting information from Data Lake Gen 2 JSON files


Question

I have an ADF pipeline loading raw log data as JSON files into a Data Lake Gen 2 container.

We now want to extract information from those JSON files, and I am trying to find the best way to do so. I found that Azure Data Lake Analytics and U-SQL scripts are powerful and cheap, but they have a steep learning curve.

Is there a recommended way to parse JSON files and extract information from them? Would Data Lake tables be adequate storage for this extracted information, and could they then act as a source for downstream reporting processes?

And finally, will Azure Data Factory ever be able to parse JSON files with nested arrays?

Solution

We can parse JSON files and extract information via a mapping data flow, and we can parse JSON with nested arrays via the Flatten transformation in the data flow.

JSON example:

    {
        "count": 1,
        "value": [{
            "obj": 123,
            "lists": [{
                "employees": [{
                        "name": "",
                        "id": "001",
                        "tt_1": 0,
                        "tt_2": 4,
                        "tt3_": 1
                    },
                    {
                        "name": "",
                        "id": "002",
                        "tt_1": 10,
                        "tt_2": 8,
                        "tt3_": 1
                    }
                ]
            }]
        }]
    }

Flatten transformation settings and output preview:
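The settings were shown as screenshots in the original answer. As a rough equivalent, here is a minimal mapping data flow script sketch for the JSON above; the stream names RawLogs and FlattenEmployees are placeholders, and the Flatten transformation appears as foldDown in script form:

    source(output(
            count as integer,
            value as (obj as integer, lists as (employees as (name as string, id as string, tt_1 as integer, tt_2 as integer, tt3_ as integer)[])[])[]
        ),
        allowSchemaDrift: true,
        validateSchema: false) ~> RawLogs

    RawLogs foldDown(unroll(value.lists.employees),
        mapColumn(
            name = value.lists.employees.name,
            id = value.lists.employees.id,
            tt_1 = value.lists.employees.tt_1,
            tt_2 = value.lists.employees.tt_2,
            tt3_ = value.lists.employees.tt3_
        ),
        skipDuplicateMapInputs: false,
        skipDuplicateMapOutputs: false) ~> FlattenEmployees

Each output row of FlattenEmployees corresponds to one element of the nested employees array, with the columns name, id, tt_1, tt_2, and tt3_.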

Mapping data flows follow an extract, load, and transform (ELT) approach and work with staging datasets that are all in Azure; JSON datasets in Azure Data Lake Storage Gen2 are among those supported in a source transformation.

So I think using a data flow in ADF is the easiest way to extract the information, and its output can then act as a source for a downstream reporting process, as sketched below.
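For completeness, the flattened stream could be written out through a sink transformation whose dataset (for example, Parquet files or an Azure SQL table, configured on the sink itself) feeds the reporting layer. A minimal sketch, with EmployeeReportSink as a placeholder name:

    FlattenEmployees sink(allowSchemaDrift: true,
        validateSchema: false,
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> EmployeeReportSink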
