Time differences in Apache Pig?


Question

In a Big Data context, I have a time series S1 = (t1, t2, t3, ...) sorted in ascending order. I would like to produce the series of time differences S2 = (t2 - t1, t3 - t2, ...).

  1. Is there a way to do this in Apache Pig? Short of a very inefficient self-join, I do not see one.
  2. If not, what would be a good way to do this that is suitable for large amounts of data?

Recommended answer

  1. S1 = generate (ID, Timestamp) for t1 ... tn
  2. S2 = generate (ID, Timestamp) for t2 ... tn
  3. S3 = join S1 by ID with S2 by ID
  4. S4 = from S3, extract S1.Timestamp, S2.Timestamp, (S2.Timestamp - S1.Timestamp)
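The four steps above can be sketched outside Pig to make the pairing explicit. The following is only an illustration in Python, not the answer's code: the function name `differences_by_rank_join` is hypothetical, and it assumes the input is already sorted in ascending order and that IDs are sequential ranks starting at 1.

```python
def differences_by_rank_join(timestamps):
    """Pair each element with its successor via a join on a rank ID.

    Assumes `timestamps` is already sorted in ascending order.
    """
    # S1: every element keyed by its rank, starting at 1 (t1 ... tn).
    s1 = {rank: t for rank, t in enumerate(timestamps, start=1)}
    # S2: drop the first element, then re-rank from 1 (t2 ... tn).
    s2 = {rank: t for rank, t in enumerate(timestamps[1:], start=1)}
    # S3: inner join on the rank key; the last rank of S1 has no partner.
    s3 = [(s1[rank], s2[rank]) for rank in sorted(s1.keys() & s2.keys())]
    # S4: subtract the earlier value from the later one.
    return [later - earlier for earlier, later in s3]


print(differences_by_rank_join([10, 13, 19, 20]))  # [3, 6, 1]
```

Because S2 is re-ranked after its first element is dropped, rank k in S1 lines up with t(k+1) in S2, so an equality join on the rank yields exactly the adjacent pairs.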

Edit

Sample data

2014-02-19T01:03:37
2014-02-26T01:03:39
2014-02-28T01:03:45
2014-04-01T01:04:22
2014-05-11T01:06:02
2014-06-30T01:08:56

Script

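-- First copy: parse each line as a datetime and number the rows.
-- RANK without BY prepends a sequential value in the existing row order.
s1 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s11 = foreach s1 generate ToDate(t) as t1;
s1_new = rank s11;

-- Second copy of the same data, prepared the same way.
s2 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s22 = foreach s2 generate ToDate(t) as t1;
s2_new = rank s22;

-- Filter records by excluding the 1 ranked row and rank the new data
ss = FILTER s2_new by (rank_s22 > 1);
ss_new = rank ss;

-- Rank k in s1_new now lines up with the following timestamp in ss_new.
s3 = join s1_new by rank_s11,ss_new by rank_ss;
-- Difference in days between each timestamp and the one before it.
s4 = foreach s3 generate DaysBetween(ss_new::t1,s1_new::t1) as time_diff;

DUMP s4;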
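As a cross-check on what the script should produce for the sample data, the adjacent differences can be computed directly in Python. This is a sketch under stated assumptions, not part of the Pig answer: it assumes the file lines are in ascending order and that a "days between" value counts whole elapsed days. The names `SAMPLE` and `whole_day_differences` are hypothetical.

```python
from datetime import datetime

# The six sample timestamps from the question, in file order.
SAMPLE = [
    "2014-02-19T01:03:37",
    "2014-02-26T01:03:39",
    "2014-02-28T01:03:45",
    "2014-04-01T01:04:22",
    "2014-05-11T01:06:02",
    "2014-06-30T01:08:56",
]


def whole_day_differences(lines):
    """Return whole elapsed days between each timestamp and the next one."""
    times = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S") for s in lines]
    # zip(times, times[1:]) yields the adjacent pairs (t1, t2), (t2, t3), ...
    return [(later - earlier).days for earlier, later in zip(times, times[1:])]


print(whole_day_differences(SAMPLE))  # [7, 2, 32, 40, 50]
```

Six input rows give five differences. A join-based result may come back in a different row order, so compare the values as a set rather than line by line.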
