Hadoop Pig Ordered Analytical Functions


Problem description

I am new to Pig and would like to use an ordered analytical function, similar to what is possible in SQL.

My data looks something like this:

(stock_symbol,date,stock_price_open,stock_price_close)
(TAC,2001-08-06,16.39,16.36)
(TAC,2001-08-07,16.3,16.54)
(TAC,2001-08-08,16.55,16.44)
(TAC,2001-08-09,16.45,16.48)
(TAC,2001-08-10,16.5,15.8)

What I want to do is find the change in opening stock price from day to day. So, my output would look something like this:

(stock_symbol,date,stock_price_open,stock_price_close,stock_price_change)
(TAC,2001-08-06,16.39,16.36,NULL)
(TAC,2001-08-07,16.3,16.54,-0.09)
(TAC,2001-08-08,16.55,16.44,0.25)
(TAC,2001-08-09,16.45,16.48,-0.1)
(TAC,2001-08-10,16.5,15.8,0.05)

I want Pig to be able to look at the row ahead of or behind the current row. Is this possible, or does Pig not allow this type of analysis?

Solution

You can use the script below to get the expected output, although some fine-tuning may be required.

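-- Load the comma-separated stock rows.
A = LOAD '/tmp/pig/test/test' USING PigStorage(',');
-- Parse the date; 'MM' is the month, whereas lowercase 'mm' means minutes.
B = FOREACH A GENERATE $0 AS stock_symbol, ToDate($1,'yyyy-MM-dd') AS dt, (double)$2 AS stock_price_open, (double)$3 AS stock_price_close, 'PT24H' AS dthour;
-- dt_old is the current date minus the 24-hour duration.
C = FOREACH B GENERATE $0 AS stock_symbol, $1 AS dt_curr, SubtractDuration($1,$4) AS dt_old, $2 AS stock_price_open, $3 AS stock_price_close;
-- A trivially true filter gives a second alias of C for the self-join.
START = FILTER C BY ($1 == $1);
D = JOIN C BY $0, START BY $0;
-- Keep only pairs where the C row is exactly one day after the START row.
Filter_D = FILTER D BY ((DaysBetween($1,$6) == 1) AND (DaysBetween($2,$7) == 1));
-- $3 is the current day's opening price, $8 the previous day's.
E = FOREACH Filter_D GENERATE $0 AS stock_symbol, $1 AS dt, $3 AS stock_price_open, $4 AS stock_price_close, $3-$8 AS stock_price_change;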

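The output is shown below. It was produced with the date pattern 'yyyy-mm-dd', in which mm means minutes, which is why every date reads as January with a time of 00:08: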

(TAC,2001-01-07T00:08:00.000-08:00,16.3,16.54,-0.08999999999999986)
(TAC,2001-01-08T00:08:00.000-08:00,16.55,16.44,0.25)
(TAC,2001-01-09T00:08:00.000-08:00,16.45,16.48,-0.10000000000000142)
(TAC,2001-01-10T00:08:00.000-08:00,16.5,15.8,0.05000000000000071)

Because we need the opening price from one day earlier, the script defines the constant "PT24H", which represents a duration of 24 hours in Pig. ToDate() and SubtractDuration() use it to compute the previous day's date for each row. A join followed by a DaysBetween() filter then pairs each row with the row from the day before, so the difference in opening price can be taken.
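
To check the pairing logic without running Pig, the same idea can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `day_over_day_change` is hypothetical, the rows are the sample data from the question, and the `round(..., 2)` call is an addition that the Pig script does not perform.

```python
from datetime import date, timedelta

# Sample rows from the question: (stock_symbol, date, open, close).
rows = [
    ("TAC", date(2001, 8, 6), 16.39, 16.36),
    ("TAC", date(2001, 8, 7), 16.3, 16.54),
    ("TAC", date(2001, 8, 8), 16.55, 16.44),
    ("TAC", date(2001, 8, 9), 16.45, 16.48),
    ("TAC", date(2001, 8, 10), 16.5, 15.8),
]

def day_over_day_change(rows):
    """Pair each row with the row dated exactly one day earlier (inner join)."""
    # Index opening prices by (symbol, date); this plays the role of the START alias.
    open_by_key = {(sym, d): open_ for sym, d, open_, _ in rows}
    result = []
    for sym, d, open_, close in rows:
        # Analogue of SubtractDuration(dt, 'PT24H').
        prev_key = (sym, d - timedelta(days=1))
        # Analogue of the DaysBetween(...) == 1 filter: keep matched pairs only.
        if prev_key in open_by_key:
            # Rounding is an addition here; it hides binary floating-point noise.
            change = round(open_ - open_by_key[prev_key], 2)
            result.append((sym, d, open_, close, change))
    return result

changes = day_over_day_change(rows)
for row in changes:
    print(row)
```

As with the inner join in the script, a row that has no row dated one day earlier (here 2001-08-06, and any day that follows a gap in the data) produces no output row.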

ToDate(), SubtractDuration(), and DaysBetween() are built-in Pig functions. You can also write a suitable UDF to fine-tune the script and handle the comparison more precisely.
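
As a rough idea of what such a UDF might compute, here is a minimal plain-Python sketch of "lag" logic. It compares each row with the previous row in date order rather than with the previous calendar day. The function name `with_open_change` is hypothetical, the rounding is an added assumption, and how the function would be registered and called from Pig is not shown; check that against the Pig documentation.

```python
def with_open_change(rows):
    """Append the change in opening price versus the previous row of the same symbol.

    The first row of each symbol has no predecessor and gets None.
    """
    result = []
    prev_open = {}  # last opening price seen for each symbol
    # ISO-formatted date strings sort chronologically, so a plain sort suffices.
    for sym, day, open_, close in sorted(rows, key=lambda r: (r[0], r[1])):
        previous = prev_open.get(sym)
        change = None if previous is None else round(open_ - previous, 2)
        result.append((sym, day, open_, close, change))
        prev_open[sym] = open_
    return result

# Sample rows from the question: (stock_symbol, date, open, close).
sample = [
    ("TAC", "2001-08-06", 16.39, 16.36),
    ("TAC", "2001-08-07", 16.3, 16.54),
    ("TAC", "2001-08-08", 16.55, 16.44),
    ("TAC", "2001-08-09", 16.45, 16.48),
    ("TAC", "2001-08-10", 16.5, 15.8),
]

lagged = with_open_change(sample)
for row in lagged:
    print(row)
```

Unlike the join-based script, this keeps the first row with an empty change, which matches the NULL in the desired output, and it still produces a value after a gap such as a weekend.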
