Hadoop Pig Ordered Analytical Functions


Question

I am new to Pig and would like to use an ordered analytical function, similar to what is possible in SQL.

My data looks like this:

(stock_symbol,date,stock_price_open,stock_price_close)
(TAC,2001-08-06,16.39,16.36)
(TAC,2001-08-07,16.3,16.54)
(TAC,2001-08-08,16.55,16.44)
(TAC,2001-08-09,16.45,16.48)
(TAC,2001-08-10,16.5,15.8)

What I want to do is find the change in the opening stock price from day to day, so my output would look something like this:

(stock_symbol,date,stock_price_open,stock_price_close,stock_price_change)
(TAC,2001-08-06,16.39,16.36,NULL)
(TAC,2001-08-07,16.3,16.54,-0.09)
(TAC,2001-08-08,16.55,16.44,0.25)
(TAC,2001-08-09,16.45,16.48,-0.1)
(TAC,2001-08-10,16.5,15.8,0.05)

I want Pig to be able to look at the row ahead of or behind the current row. Is this possible, or does Pig not allow this type of analysis?

Recommended answer

You can use the script below to get the expected output, though some fine-tuning may be required.

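-- Load the comma-separated price data.
A = LOAD '/tmp/pig/test/test' USING PigStorage(',');
-- Parse the date ('MM' is month-of-year; lowercase 'mm' would mean minutes) and cast the prices.
B = FOREACH A GENERATE $0 AS stock_symbol, ToDate($1, 'yyyy-MM-dd') AS dt, (double)$2 AS stock_price_open, (double)$3 AS stock_price_close, 'PT24H' AS dthour;
-- dt_old is the date 24 hours before dt_curr.
C = FOREACH B GENERATE $0 AS stock_symbol, $1 AS dt_curr, SubtractDuration($1, $4) AS dt_old, $2 AS stock_price_open, $3 AS stock_price_close;
-- Make a second copy of C so it can be joined with itself.
START = FILTER C BY ($1 == $1);
-- Self-join on stock symbol: fields $0-$4 come from C, $5-$9 from START.
D = JOIN C BY $0, START BY $0;
-- Keep only pairs where the C row is exactly one day after the START row.
Filter_D = FILTER D BY ((DaysBetween($1, $6) == 1) AND (DaysBetween($2, $7) == 1));
-- Change in opening price = current open minus the previous day's open.
E = FOREACH Filter_D GENERATE $0 AS stock_symbol, $1 AS dt, $3 AS stock_price_open, $4 AS stock_price_close, $3 - $8 AS stock_price_change;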

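Output (as posted; the dates show month 01 and time 00:08 because the script that produced it used the pattern 'yyyy-mm-dd', in which lowercase mm means minutes rather than months):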

(TAC,2001-01-07T00:08:00.000-08:00,16.3,16.54,-0.08999999999999986)
(TAC,2001-01-08T00:08:00.000-08:00,16.55,16.44,0.25)
(TAC,2001-01-09T00:08:00.000-08:00,16.45,16.48,-0.10000000000000142)
(TAC,2001-01-10T00:08:00.000-08:00,16.5,15.8,0.05000000000000071)

Because we need the date one day earlier than each row's date, the script defines the duration string 'PT24H', which represents 24 hours in Pig. ToDate() parses the date and SubtractDuration() subtracts the duration from it to give the previous day's date. A self-join followed by DaysBetween() then pairs each row with the row from one day earlier, so the difference in opening price can be computed. The first day has no earlier row to join with, so it does not appear in the output.

ToDate(), SubtractDuration(), and DaysBetween() are built-in Pig functions. You can write a suitable UDF to fine-tune this script further.
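
The self-join in the script drops the first day of each stock, whereas the question asks for a NULL there. One possible variation, sketched here under some assumptions, uses a LEFT OUTER join keyed on the symbol and the next day's date string, so that unmatched rows are kept with a null change. The relation and field names (prices, prev, trade_date, prev_symbol, next_date, prev_open) are illustrative and do not come from the original answer, and the input is assumed to be plain comma-separated text with zero-padded yyyy-MM-dd dates. Like the original script, it compares calendar days, so a Monday row would not be matched with the preceding Friday.

```pig
-- Load with an explicit schema (assumed layout: symbol, date, open, close).
prices = LOAD '/tmp/pig/test/test' USING PigStorage(',')
         AS (stock_symbol:chararray, trade_date:chararray,
             stock_price_open:double, stock_price_close:double);

-- For every row, compute the following day's date as a string,
-- so the row can act as the "previous day" of that date.
prev = FOREACH prices GENERATE
           stock_symbol AS prev_symbol,
           ToString(AddDuration(ToDate(trade_date, 'yyyy-MM-dd'), 'P1D'), 'yyyy-MM-dd') AS next_date,
           stock_price_open AS prev_open;

-- LEFT OUTER keeps rows of prices that have no previous-day match.
joined = JOIN prices BY (stock_symbol, trade_date) LEFT OUTER,
              prev BY (prev_symbol, next_date);

-- Subtracting a null prev_open yields null, matching the NULL in the question.
result = FOREACH joined GENERATE
             prices::stock_symbol AS stock_symbol,
             prices::trade_date AS trade_date,
             prices::stock_price_open AS stock_price_open,
             prices::stock_price_close AS stock_price_close,
             prices::stock_price_open - prev::prev_open AS stock_price_change;

DUMP result;
```

Joining on a string key rather than filtering a full per-symbol cross product with DaysBetween() should also keep the intermediate relation smaller, although this has not been measured here.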
