Unable to pass pig tuple to python UDF
Problem description
I have a file, master.txt, with 10K records. Each line of it becomes a tuple, and the whole set of tuples needs to be passed to a Python UDF. Because the relation has multiple records, storing p2preportmap fails with the error below. Please help.
Error is as follows:
Unable to open iterator for alias p2preportmap. Backend error : org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (010301,MTS,MM), 2nd :(010B06,MTS,TN) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
The Pig script is as follows:
REGISTER 'smsiuc_udf.py' using streaming_python as smsiuc_udfs;
cdrs = load '2016040111*' USING PigStorage('|','-tagFile');
mastergtrec = load 'master.txt' USING PigStorage(',','-tagFile');
mastergt = FOREACH mastergtrec GENERATE (chararray) UPPER($1) as opcdpc, (chararray) UPPER($2) as gtoptname, (chararray) UPPER($3) as gtoptcircle;
mastergttup = FOREACH mastergt generate TOTUPLE(opcdpc,gtoptname,gtoptcircle) as mstgttup;
cdrrecord = FOREACH cdrs GENERATE (chararray) UPPER($1) as aparty, (chararray) UPPER($2) as bparty, $3 as smssentdate, $4 as smssenttime, ($29=='6' ? 'S' : 'F') as status, (chararray) UPPER($26) as srcgt, (chararray) UPPER($27) as destgt, ($12=='405899136999995' ? 'MTSDEL-CDMA' : ($12=='919875089998' ? 'MTSRAJ-GSM' : ($12=='405899150999995' ? 'MTSCHN-CDMA' : $12))) as smscgt, (chararray) $0 as cdrfname, (chararray) $13 as prepost;
filteredp2pcdrs = FILTER cdrrecord by smsiuc_udfs.pullp2pcdrs(aparty,bparty,srcgt,destgt) and status == 'S' and SUBSTRING(smssentdate,4,6) == '$MON';
groupp2pcdrs = GROUP filteredp2pcdrs by (srcgt,destgt,aparty,bparty,smscgt,status,prepost);
distinctp2pcdrs = FOREACH groupp2pcdrs {
    uniq = DISTINCT filteredp2pcdrs.(srcgt,destgt,aparty,bparty,smscgt,status,prepost);
    GENERATE FLATTEN(group), COUNT(uniq) as cnt;
};
p2preportmap = FOREACH distinctp2pcdrs GENERATE smsiuc_udfs.p2preport(srcgt,destgt,aparty,bparty,mastergttup),smscgt,status,prepost,cnt
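Solution
The error occurs because mastergttup holds one row per line of master.txt, while a relation referenced inside FOREACH ... GENERATE is treated as a scalar and must contain exactly one row. This can be solved by adding a dummy constant column and then grouping on it, which collapses all rows into a single row holding a bag.
dummy = foreach p2preportmap generate 1, $0,$1 ....
grouped = group dummy by $0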
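Once the master records arrive as a single bag, the Python side has to iterate over it. The following is only a rough sketch of what p2preport in smsiuc_udf.py might look like. The original UDF body is not shown in the question, so the lookup logic, the output schema, and the helper build_master_index are all assumptions. It also assumes each bag element is a flat (opcdpc, gtoptname, gtoptcircle) tuple and that the bag reaches the UDF as a Python list of tuples.

```python
# Hypothetical sketch of smsiuc_udf.py; the lookup logic is assumed,
# not taken from the original script.
try:
    from pig_util import outputSchema  # supplied by Pig for streaming_python UDFs
except ImportError:
    # Fallback so the module can be imported and tested outside Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


def build_master_index(master_rows):
    """Map opcdpc -> (gtoptname, gtoptcircle) from a bag of master tuples."""
    index = {}
    for row in master_rows or []:
        # Assumed layout: (opcdpc, gtoptname, gtoptcircle)
        index[row[0]] = (row[1], row[2])
    return index


@outputSchema("report:tuple(srcopt:chararray, srccircle:chararray, "
              "destopt:chararray, destcircle:chararray)")
def p2preport(srcgt, destgt, aparty, bparty, master_rows):
    """Return operator name and circle for the source and destination GT."""
    index = build_master_index(master_rows)
    src = index.get(srcgt, (None, None))
    dest = index.get(destgt, (None, None))
    return (src[0], src[1], dest[0], dest[1])
```

Rebuilding the index on every call is simple but repeats the work for each input row; with 10K master records it may be worth caching the dictionary in a module-level variable instead.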