azure data factory: how to merge all files of a folder into one file


Question

I need to create one big file by merging multiple files scattered across several subfolders in Azure Blob Storage. A transformation is also needed: each source file contains a JSON array with a single element, so the final file will contain an array of JSON elements.

The final purpose is to process that big file in a Hadoop MapReduce job.

The layout of the original files is similar to this:

folder
- month-01
    - day-01
        - files...
- month-02
    - day-02
        - files...
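To make the goal concrete, here is a minimal Python sketch of the merge the pipeline should logically produce: walk every subfolder, read each file's single-element JSON array, and concatenate them into one array. This is an illustration of the expected result, not how Azure Data Factory implements it internally.

```python
import json
from pathlib import Path

def merge_json_files(root: str) -> list:
    """Concatenate the JSON arrays found in every *.json file
    under root (recursively) into one combined array."""
    merged = []
    # Sort for a deterministic order; ADF does not guarantee any particular order.
    for path in sorted(Path(root).rglob("*.json")):
        merged.extend(json.loads(path.read_text()))
    return merged
```

For example, if `month-01/day-01/a.json` contains `[{"id": 1}]` and `month-02/day-02/b.json` contains `[{"id": 2}]`, the merged result is `[{"id": 1}, {"id": 2}]`.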

Answer

I did a test based on your description; please follow my steps.

My simulated data:

test1.json resides in the folder date/day1

test2.json resides in the folder date/day2

Source DataSet: set the file format to Array of Objects and the file path to the root path.
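As a sketch, the source dataset definition might look like the following (using the legacy `JsonFormat` with the `arrayOfObjects` file pattern; the linked service name `AzureBlobLS` is a placeholder):

```json
{
  "name": "SourceDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": { "referenceName": "AzureBlobLS", "type": "LinkedServiceReference" },
    "typeProperties": {
      "folderPath": "date",
      "format": { "type": "JsonFormat", "filePattern": "arrayOfObjects" }
    }
  }
}
```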

Sink DataSet: set the file format to Array of Objects and the file path to the file where you want to store the final data.
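The sink dataset is similar, but also names the output file. A hedged sketch (again with a placeholder linked service, and `merged.json` as a hypothetical output name):

```json
{
  "name": "SinkDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": { "referenceName": "AzureBlobLS", "type": "LinkedServiceReference" },
    "typeProperties": {
      "folderPath": "output",
      "fileName": "merged.json",
      "format": { "type": "JsonFormat", "filePattern": "arrayOfObjects" }
    }
  }
}
```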

Create a Copy Activity and set the Copy behavior to Merge Files.
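In pipeline JSON, that step corresponds roughly to a copy activity with `copyBehavior` set to `MergeFiles` on the blob sink and `recursive` enabled on the source, so files in all subfolders are picked up. A sketch, assuming the dataset names above:

```json
{
  "name": "MergeJsonFiles",
  "type": "Copy",
  "inputs": [ { "referenceName": "SourceDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "SinkDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "BlobSource", "recursive": true },
    "sink": { "type": "BlobSink", "copyBehavior": "MergeFiles" }
  }
}
```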

Execution result:

The destination in my test is still Azure Blob Storage; you could refer to this link to learn how Hadoop supports Azure Blob Storage.

