Time differences in Apache Pig?
Question
In a Big Data context I have a time series S1 = (t1, t2, t3, ...) sorted in ascending order. I would like to produce the series of time differences S2 = (t2-t1, t3-t2, ...).
Is there a way to do this in Apache Pig? Short of a very inefficient self-join, I do not see one.
If not, what would be a good way to do this that is suitable for large amounts of data?
Recommended answer
- S1 = generate (Id, Timestamp) for t1 ... tn
- S2 = generate (Id, Timestamp) for t2 ... tn
- S3 = join S1 and S2 by Id
- S4 = extract S1.Timestamp, S2.Timestamp, (S2.Timestamp - S1.Timestamp)
Edit
Sample data
2014-02-19T01:03:37
2014-02-26T01:03:39
2014-02-28T01:03:45
2014-04-01T01:04:22
2014-05-11T01:06:02
2014-06-30T01:08:56
Script
-- Load the timestamps and rank them: s1_new holds (rank_s11, t1) for t1 ... tn
s1 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s11 = foreach s1 generate ToDate(t) as t1;
s1_new = rank s11;
-- Load the same file again under a second set of aliases for the self-join
s2 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s22 = foreach s2 generate ToDate(t) as t1;
s2_new = rank s22;
-- Drop the rank-1 row and re-rank, so that rank i in ss_new holds t(i+1)
ss = FILTER s2_new by (rank_s22 > 1);
ss_new = rank ss;
-- Join each timestamp with its successor and take the difference in days
s3 = join s1_new by rank_s11,ss_new by rank_ss;
s4 = foreach s3 generate DaysBetween(ss_new::t1,s1_new::t1) as time_diff;
DUMP s4;
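To check what the script should produce on the sample data, the same rank, shift, and join logic can be mirrored in plain Python. This is only an illustrative sketch, not Pig: the function name `consecutive_day_diffs` and the dictionary-based "join" are hypothetical, and it assumes that `DaysBetween` counts whole days between its two arguments.

```python
from datetime import datetime

# Sample timestamps from the post, already sorted in ascending order.
SAMPLE = [
    "2014-02-19T01:03:37",
    "2014-02-26T01:03:39",
    "2014-02-28T01:03:45",
    "2014-04-01T01:04:22",
    "2014-05-11T01:06:02",
    "2014-06-30T01:08:56",
]


def consecutive_day_diffs(timestamps):
    """Mimic the rank / shift / join steps and return whole-day gaps."""
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    # Analogue of s1_new: rank -> timestamp, ranks start at 1 (t1 ... tn).
    s1_new = {rank: t for rank, t in enumerate(parsed, start=1)}
    # Analogue of ss_new: drop the first row and re-rank, so rank i holds t(i+1).
    ss_new = {rank: t for rank, t in enumerate(parsed[1:], start=1)}
    # Inner join on rank; the last row of s1_new has no partner and drops out.
    common = sorted(s1_new.keys() & ss_new.keys())
    return [(ss_new[r] - s1_new[r]).days for r in common]


print(consecutive_day_diffs(SAMPLE))  # [7, 2, 32, 40, 50]
```

A series of n timestamps yields n-1 differences, because the inner join drops the last timestamp, which has no successor.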