PIG Error 1066 after iterating through a joined set


Problem description


I am trying to join a set that holds the number of days in each month with a data set, using (year, month) as the key. After the join, when I try to do a FOREACH over the joined set, I get ERROR 1066: ... Backend error : Scalar has more than one row in the output.

Here is an abbreviated set with the same problem:

$ hadoop fs -cat DIM/*
2011,01,31
2011,02,28
2011,03,31
2011,04,30
2011,05,31
2011,06,30
2011,07,31
2011,08,31
2011,09,30
2011,10,31
2011,11,30
2011,12,31

$ hadoop fs -cat ACCT/*
2011,7,26,key1,23.25,2470.0
2011,7,26,key2,10.416666666666668,232274.08333333334
2011,7,26,key3,82.83333333333333,541377.25
2011,7,26,key4,78.5,492823.33333333326
2011,7,26,key5,110.83333333333334,729811.9166666667
2011,7,26,key6,102.16666666666666,675941.25
2011,7,26,key7,118.91666666666666,770896.75

Then in grunt:

grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int, days:int);
grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int, day: int, account:chararray, metric1:double, metric2:double);
grunt> AjD = JOIN ACCT BY (year,month), DIM  BY (year,month) USING 'replicated';
grunt> dump AjD;
...
(2011,7,26,key1,23.25,2470.0,2011,7,31)
(2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
(2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
(2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
(2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
(2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
(2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
grunt> describe AjD;
AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account: chararray,ACCT::metric1: double,ACCT::metric2: double,DIM::year: int,DIM::month: int,DIM::days: int}

grunt> FINAL = FOREACH AjD
>> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
grunt> dump FINAL;
...
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias FINAL. Backend error : Scalar has more than one row in the output. 1st : (2011,7,26,key1,23.25,2470.0), 2nd :(2011,7,26,key2,10.416666666666668,232274.08333333334)

However, if I store it and reload it to shed the "join" schema, it works:

grunt> STORE AjD INTO 'AjD' using PigStorage(',');
grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int, day:int, account:chararray, metric1:double, metric2:double, year2:int, month2:int, days:int);

grunt> FINAL = FOREACH AjD2
>> GENERATE year, month, account, (metric2 / days);

grunt> dump FINAL;
...
(2011,7,key1,79.6774193548387)
(2011,7,key2,7492.712365591398)
(2011,7,key3,17463.782258064515)
(2011,7,key4,15897.526881720427)
(2011,7,key5,23542.319892473122)
(2011,7,key6,21804.5564516129)
(2011,7,key7,24867.637096774193)

Is there a way to iterate (FOREACH) over the joined set without storing and reloading?

Solution

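Have you tried the :: operator, which specifies which relation a column comes from?

After a JOIN, the fields are named ACCT::metric2, DIM::days, and so on, as the describe output shows. Inside a FOREACH over AjD, ACCT.metric2 is instead read as a scalar projection of the whole ACCT relation, which is only valid when that relation has exactly one row; hence "Scalar has more than one row in the output".

Replace (ACCT.metric2 / DIM.days) with (ACCT::metric2 / DIM::days), and use :: for the other ACCT columns as well.

e.g.

...
FINAL = FOREACH AjD
        GENERATE
             ACCT::year, ACCT::month, ACCT::account, (ACCT::metric2 / DIM::days);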
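
A possible variant, offered as a sketch rather than the definitive fix: rename the joined fields with AS in the same FOREACH, so that later statements can use short names without storing and reloading. It assumes the AjD relation and the schema shown in the question; the output name daily_metric2 is a hypothetical name introduced here for illustration.

```pig
-- Assumes AjD = JOIN ACCT BY (year,month), DIM BY (year,month) as in the question.
-- '::' picks a field of the joined tuple; 'ACCT.year' would instead be a
-- scalar projection of the whole ACCT relation and trigger ERROR 1066.
FINAL = FOREACH AjD GENERATE
            ACCT::year    AS year,
            ACCT::month   AS month,
            ACCT::account AS account,
            -- daily_metric2 is a hypothetical output name
            (ACCT::metric2 / DIM::days) AS daily_metric2;

DESCRIBE FINAL;
DUMP FINAL;
```

Renaming once at this point keeps the :: prefixes out of every later statement, which is the effect the store-and-reload workaround was achieving.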
