Time differences in Apache Pig?

Question

In a Big Data context I have a time series S1=(t1, t2, t3 ...) sorted in ascending order. I would like to produce a series of time differences: S2=(t2-t1, t3-t2 ...)

  1. Is there a way to do this in Apache Pig? Short of a very inefficient self-join, I do not see one.

If not, what would be a good way to do this that is suitable for large amounts of data?

Recommended answer

  1. S1 = generate Id, Timestamp, i.e. from t1...tn
  2. S2 = generate Id, Timestamp, i.e. from t2...tn
  3. S3 = join S1 by Id, S2 by Id
  4. S4 = extract S1.Timestamp, S2.Timestamp, (S2.Timestamp - S1.Timestamp)

Edit

Sample data

2014-02-19T01:03:37
2014-02-26T01:03:39
2014-02-28T01:03:45
2014-04-01T01:04:22
2014-05-11T01:06:02
2014-06-30T01:08:56

Script

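-- Load the sorted timestamps; RANK with no BY clause numbers the rows sequentially
s1 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s11 = foreach s1 generate ToDate(t) as t1;
s1_new = rank s11;

-- Load the same file a second time to build the shifted copy of the series
s2 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s22 = foreach s2 generate ToDate(t) as t1;
s2_new = rank s22;

-- Drop the rank-1 row and re-rank, so that rank k in ss_new holds element k+1
ss = FILTER s2_new by (rank_s22 > 1);
ss_new = rank ss;

-- Join on rank to pair each element with its successor, then take the difference in days
s3 = join s1_new by rank_s11,ss_new by rank_ss;
s4 = foreach s3 generate DaysBetween(ss_new::t1,s1_new::t1) as time_diff;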

DUMP s4;
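
To sanity-check the pairing logic without a cluster, the same rank, shift and join steps can be mimicked locally. The following is only a rough Python sketch, not Pig: the function name day_differences and the dictionaries standing in for the ranked relations are made up for illustration, and it assumes that DaysBetween reports whole elapsed days.

```python
from datetime import datetime

# Sample timestamps from the answer, already sorted in ascending order.
SAMPLE = [
    "2014-02-19T01:03:37",
    "2014-02-26T01:03:39",
    "2014-02-28T01:03:45",
    "2014-04-01T01:04:22",
    "2014-05-11T01:06:02",
    "2014-06-30T01:08:56",
]


def day_differences(timestamps):
    """Pair each timestamp with its successor and return the whole-day gaps.

    Hypothetical helper that mirrors the script's steps; it is not a Pig API.
    """
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    # Stand-in for s1_new: rank -> timestamp, with ranks starting at 1.
    s1 = {rank: ts for rank, ts in enumerate(parsed, start=1)}
    # Stand-in for ss_new: drop the first row and re-rank,
    # so rank k now holds element k+1 of the series.
    s2 = {rank: ts for rank, ts in enumerate(parsed[1:], start=1)}
    # Inner join on rank; the last element of s1 has no partner and drops out.
    shared_ranks = sorted(s1.keys() & s2.keys())
    # timedelta.days is the whole-day part (assumed to match DaysBetween here).
    return [(s2[r] - s1[r]).days for r in shared_ranks]


if __name__ == "__main__":
    print(day_differences(SAMPLE))
```

For the sample data this prints [7, 2, 32, 40, 50], which is what the Pig script would be expected to produce under the same assumption. If day granularity is too coarse, Pig also has a SecondsBetween built-in that could be used in place of DaysBetween.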
