azure data factory: how to merge all files of a folder into one file
Question
I need to create one big file by merging multiple files scattered across several subfolders in Azure Blob Storage. A transformation is also needed: each source file contains a JSON array with a single element, and the final file should contain a single JSON array with all of those elements.
The final purpose is to process that big file in a Hadoop & MapReduce job.
The layout of the original files is similar to this:
folder
- month-01
- day-01
- files...
- month-02
- day-02
- files...
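For reference, the transformation being asked for can be sketched in plain Python (a minimal local illustration, not how ADF does it internally; the function name and file layout are made up for the example):

```python
import json
from pathlib import Path

def merge_json_arrays(root: Path, out_file: Path) -> int:
    """Merge every *.json file under root (recursively) into one JSON array.

    Each source file is assumed to hold a JSON array with a single element,
    as described above; the output file holds one array combining all the
    elements. Returns the number of merged elements.
    """
    merged = []
    # Sort for a deterministic order across the month/day subfolders.
    for f in sorted(root.rglob("*.json")):
        merged.extend(json.loads(f.read_text()))
    out_file.write_text(json.dumps(merged))
    return len(merged)
```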
Answer
I did a test based on your description; please follow my steps.
My simulated data:
test1.json resides in the folder date/day1.
test2.json resides in the folder date/day2.
Source DataSet: set the file format to Array of Objects and the file path to the root path.
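In ADF v2 JSON terms, the source dataset would look roughly like this (a sketch: the names `SourceDataset` and `AzureBlobStorageLS` are placeholders, and `"filePattern": "arrayOfObjects"` is what the UI's Array of Objects setting maps to in the legacy `JsonFormat`):

```json
{
  "name": "SourceDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLS",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "folderPath": "date",
      "format": {
        "type": "JsonFormat",
        "filePattern": "arrayOfObjects"
      }
    }
  }
}
```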
Sink DataSet: set the file format to Array of Objects and the file path to the file where you want to store the final data.
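A matching sink dataset sketch, under the same assumptions (dataset, linked-service, folder, and file names are placeholders):

```json
{
  "name": "SinkDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": {
      "referenceName": "AzureBlobStorageLS",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "folderPath": "output",
      "fileName": "merged.json",
      "format": {
        "type": "JsonFormat",
        "filePattern": "arrayOfObjects"
      }
    }
  }
}
```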
Create a Copy Activity and set the Copy behavior to Merge Files.
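The Copy activity itself can be sketched as follows; `recursive: true` on the source makes it walk the subfolders, and `copyBehavior: "MergeFiles"` on the sink does the merge (the activity and dataset names here are placeholders):

```json
{
  "name": "MergeJsonFiles",
  "type": "Copy",
  "inputs": [
    { "referenceName": "SourceDataset", "type": "DatasetReference" }
  ],
  "outputs": [
    { "referenceName": "SinkDataset", "type": "DatasetReference" }
  ],
  "typeProperties": {
    "source": { "type": "BlobSource", "recursive": true },
    "sink": { "type": "BlobSink", "copyBehavior": "MergeFiles" }
  }
}
```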
Execution result:
The destination of my test is still Azure Blob Storage; you could refer to this link to learn how Hadoop supports Azure Blob Storage.