Subtract one row's value from another row in Pig

Problem description

I'm trying to develop a sample program using Pig to analyse some log files. I want to analyze the running time of different jobs. When I read in the log file of the job, I get the start time and the end time of the job, like this:

(Wed,03/20/13,01:03:37,EDT)
(Wed,03/20/13,01:05:00,EDT)

Now, to calculate the elapsed time, I need to subtract these two timestamps, but since both are in the same bag, I'm not sure how to compare them. I'm looking for ideas on how to do this. Thanks!

Recommended answer

Is there a unique ID for the job that appears in both log lines? Also, is there something to indicate which event is the start and which is the end?

If so, you could read the dataset twice, once for start events and once for end events, and join the two together. Then you'll have one record with both events in it.

Like this:

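-- Assumes logline has the fields id, type, and a numeric timestamp.
A = FOREACH logline GENERATE id, type, timestamp;

-- Aliases renamed from START/END/DIFF, which may clash with Pig keywords or built-ins.
starts = FILTER A BY (type == 'start');
ends = FILTER A BY (type == 'end');

-- Field names are case-sensitive, so join on id rather than ID.
joined = JOIN starts BY id, ends BY id;

-- After a JOIN, fields are disambiguated with :: and elapsed time is end minus start.
durations = FOREACH joined GENERATE starts::id, (ends::timestamp - starts::timestamp) AS elapsed; -- or whatever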
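The timestamps in the question are text such as 03/20/13 and 01:03:37, so a plain subtraction would not work on them directly. Here is a rough sketch of the same join with the text parsed into datetimes first. It makes several assumptions that are not in the original post: a hypothetical comma-separated input file `job_events.csv`, the field names, the `start`/`end` values in `type`, Pig 0.11 or later for the built-in `ToDate` and `SecondsBetween`, and both timestamps being in the same time zone (the zone field is ignored).

```pig
-- Hypothetical input, one event per line, e.g.:
--   job42,start,Wed,03/20/13,01:03:37,EDT
--   job42,end,Wed,03/20/13,01:05:00,EDT
events = LOAD 'job_events.csv' USING PigStorage(',')
         AS (id:chararray, type:chararray, dow:chararray,
             day:chararray, time:chararray, tz:chararray);

-- Glue date and time together and parse them into a datetime.
-- The pattern follows Joda-Time syntax; the tz field is ignored here.
stamped = FOREACH events GENERATE id, type,
          ToDate(CONCAT(day, CONCAT(' ', time)), 'MM/dd/yy HH:mm:ss') AS ts;

-- Split into start and end events, then pair them up by job id.
starts = FILTER stamped BY type == 'start';
ends   = FILTER stamped BY type == 'end';
paired = JOIN starts BY id, ends BY id;

-- Assumption: SecondsBetween expects the later datetime first.
durations = FOREACH paired GENERATE starts::id AS id,
            SecondsBetween(ends::ts, starts::ts) AS elapsed_seconds;

DUMP durations;
```

The two sample timestamps from the question are 83 seconds apart, which is a handy value to check the result against. If a job can log more than one start or end event, the join will produce one row per start/end pair, so the input may need deduplicating first.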
