Azure Data Lake Analytics:使用U-SQL合并重叠的持续时间 [英] Azure Data Lake Analytics: Combine overlapping time duration using U-SQL

查看:152
本文介绍了Azure Data Lake Analytics:使用U-SQL合并重叠的持续时间的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用U-SQL从放置在Azure Data Lake Store中的CSV数据中删除重叠的持续时间,并合并这些行.数据集包含开始时间和结束时间,以及每个记录的其他几个属性.这是一个示例:

I want to remove overlapping time duration from CSV data placed in Azure Data Lake Store using U-SQL and combine those rows. Data set contains start time and end time with several other attributes for each record. Here is an example:

Start Time - End Time - Usar Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
8:00 AM - 9:00 AM - ABC
8:00 AM - 10:00 AM - ABC
10:00 AM - 2:00 PM - ABC
7:00 AM - 11:00 AM - ABC
9:00 AM - 11:00 AM - ABC
11:00 AM - 11:30 AM - ABC

去除重叠后,输出数据集将如下所示:

After removing overlap, output data set will look like:

Start Time - End Time - Usar Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
7:00 AM - 2:00 PM - ABC

请注意,CSV包含大量数据,并且包含数GB的大小.我正在尝试解决这个问题,但是没有运气.我想避免为Azure Data Lake Analytics作业使用U-SQL用户定义的运算符,而是从U-SQL中寻找一些有效的解决方案.

Note that CSV contains enormous amount of data and consists of several GB of size. I am trying to solve this problem but no luck. I want to avoid U-SQL User defined operators for Azure Data Lake Analytics jobs and looking for some efficient solution from within U-SQL.

推荐答案

您似乎想聚合提供重叠时间范围的行的所有数据?还是要对其他列中的数据进行处理?

It looks like you want to aggregate all the data for the rows that provide overlapping timeframes? Or what do you want to do with the data in the other columns?

乍一看,我建议您使用用户定义的REDUCER或用户定义的聚合器,具体取决于您要使用其他数据来实现什么.

At first glance, I would suggest that you use a user-defined REDUCER or a user-defined aggregator, depending on what you want to achieve with the other data.

但是,我看到的一个问题是您可能需要一个固定点递归来创建公共的重叠范围.不幸的是,由于无法有效地进行向外扩展的递归处理,因此在U-SQL(或Hive)中没有固定点递归.

However, a problem I see is that you may need a fix point recursion to create the common overlapping ranges. Unfortunately, there is no fix point recursion in U-SQL (nor Hive) because scale out processing of recursion can't be done efficiently.

澄清后更新:

我认为那更容易.您只需将开头的最小值和结尾的最大值与键值分组即可:

That is easier I think. You just take the min of the beginning and the max of the end and group by the key value:

@r = EXTRACT begin DateTime, end DateTime,
             data string
     FROM "/temp/ranges.txt"
     USING Extractors.Text(delimiter:'-');

@r = SELECT MIN(begin) AS begin,
            MAX(end) AS end,
            data
     FROM @r
     GROUP BY data;

OUTPUT @r
TO "/temp/result.csv"
USING Outputters.Csv();

请注意,仅当您的范围在同一天且不跨越午夜时,此方法才有效.

Note this works only if your ranges are on the same day and do not span over midnight.

已更新,该解决方案为用户处理了脱离范围的问题 您可以使用用户定义的reducer来解决它.以下博客文章解释了该解决方案的详细信息,并提供了指向GitHub代码的链接:

UPDATED WITH A SOLUTION THAT HANDLES DISJOINT RANGES FOR A USER You can solve it with a user-defined reducer. The following blog post explains the details of the solution and provides links to the GitHub code: https://blogs.msdn.microsoft.com/mrys/2016/06/08/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos/

这篇关于Azure Data Lake Analytics:使用U-SQL合并重叠的持续时间的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆