U-SQL根据文件中的不同值将CSV文件拆分为多个文件 [英] U-SQL Split a CSV file to multiple files based on Distinct values in file

查看:126
本文介绍了U-SQL根据文件中的不同值将CSV文件拆分为多个文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在Azure Data Lake Store中存储了数据,并且正在使用带有U-SQL的Azure Data Analytic Job处理那里存在的数据.我有几个包含空间数据的CSV文件,类似于此:

I have the Data in Azure Data Lake Store and I am processing the data present there with Azure Data Analytic Job with U-SQL. I have several CSV files which contain spatial data, similar to this:

File_20170301.csv

File_20170301.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-01   | 01   | 20
    45.121    | 21.123    | 2017-03-01   | 02   | 10
    45.121    | 21.123    | 2017-03-01   | 03   | 50
    48.121    | 35.123    | 2017-03-01   | 01   | 60
    48.121    | 35.123    | 2017-03-01   | 02   | 15
    48.121    | 35.123    | 2017-03-01   | 03   | 80

File_20170302.csv

File_20170302.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-02   | 01   | 20
    45.121    | 21.123    | 2017-03-02   | 02   | 10
    45.121    | 21.123    | 2017-03-02   | 03   | 50
    48.121    | 35.123    | 2017-03-02   | 01   | 60
    48.121    | 35.123    | 2017-03-02   | 02   | 15
    48.121    | 35.123    | 2017-03-02   | 03   | 80

每个文件都包含不同日期和所有经纬度组合的数据.

Each file contains data for a different day and for all longtitude-lattitude combinations.

我想合并我拥有的所有文件并拆分数据,这样对于每种经纬度组合我都可以得到一个文件.

I would like to merge all files I have and split the data so I can end up with one file for each longtitude-lattitude combination.

因此,在遍历文件夹中的所有文件并附加所有数据后,我将得到以下结果:

So, after looping through all files in my folder and appending all data for all days, i would end up with the following:

File_45_21.csv

File_45_21.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-01   | 01   | 20
    45.121    | 21.123    | 2017-03-01   | 02   | 10
    45.121    | 21.123    | 2017-03-01   | 03   | 50
    45.121    | 21.123    | 2017-03-02   | 01   | 20
    45.121    | 21.123    | 2017-03-02   | 02   | 10
    45.121    | 21.123    | 2017-03-02   | 03   | 50

File_48_35.csv

File_48_35.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    48.121    | 35.123    | 2017-03-01   | 01   | 60
    48.121    | 35.123    | 2017-03-01   | 02   | 15
    48.121    | 35.123    | 2017-03-01   | 03   | 80
    48.121    | 35.123    | 2017-03-02   | 01   | 60
    48.121    | 35.123    | 2017-03-02   | 02   | 15
    48.121    | 35.123    | 2017-03-02   | 03   | 80

理论上应该发生以下情况:

In theory the following should happen:

  1. 在数据中查找经度和纬度组合的不同值
  2. 采用上述不同值数组,为每个组合创建一个文件,并基于两个参数(经度和纬度)从源文件中提取数据

我正在努力的是如何基于源中的两个参数开始循环和提取数据,以及如何通过参数组合的不同值来划分"数据源.

What I am struggling with is how to get started on the looping and the extraction of data based on two parameters in the source and how to "partition" the data source by the distinct values of the combination of parameters.

推荐答案

我建议查看文件集的概念以对许多文件进行操作(请参阅如何仅使用U-SQL和文件中的某些字段将大文件划分为文件/目录?).

I would recommend to look at the notion of File Sets to operate on many files (see https://msdn.microsoft.com/en-us/library/azure/mt771650.aspx) and some of the suggestions to do dynamic outputs based on values until that feature becomes available (see How do I partition a large file into files/directories using only U-SQL and certain fields in the file? as an example).

这篇关于U-SQL根据文件中的不同值将CSV文件拆分为多个文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆