Apache Pig - Not able to read the bag

Problem description

I am trying to read comma-separated data using Pig, as shown below:

grunt> cat script/pig/emp_tuple1.txt
1,kirti,250000,{(100),(200)}
2,kk,240000,{(100),(300)}
3,kumar,200000,{(200),(400)}
4,shinde,290000,{(200),(500),(300),(100)}
5,shinde k y,260000,{(100),(300),(200)}
6,amol,255000,{(300)}
grunt> emp_t1 = load 'script/pig/emp_tuple1.txt' using PigStorage(',') as (empno:int, ename:chararray, salary:int, dlist:bag{});
grunt> dump emp_t1;
2015-11-23 12:26:44,450 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,kirti,250000,)   
(2,kk,240000,)
(3,kumar,200000,)
(4,shinde,290000,)
(5,shinde k y,260000,)
(6,amol,255000,{(300)})

In between, it shows this warning:

2015-11-23 12:26:44,173 [LocalJobRunner Map Task Executor #0] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.builtin.Utf8StorageConverter(FIELD_DISCARDED_TYPE_CONVERSION_FAILED): Unable to interpret value [123, 40, 49, 48, 48, 41] in field being converted to type bag, caught ParseException <Unexpect end of bag> field discarded

The warning seems to appear when it encounters a comma (,) inside the bag.

What I did next was change the field delimiter from a comma to another separator (a pipe here; a tab would also do), and it worked:

grunt> cat script/pig/emp_tuple2.txt;
1|kirti|250000|{(100),(200)}
2|kk|240000|{(100),(300)}
3|kumar|200000|{(200),(400)}
4|shinde|290000|{(200),(500),(300),(100)}
5|shinde k y|260000|{(100),(300),(200)}
6|amol|255000|{(300)}
grunt> emp_t2 = load 'script/pig/emp_tuple2.txt' using PigStorage('|') as (empno:int, ename:chararray, salary:int, dlist:bag{});
grunt> dump emp_t2;
2015-11-23 12:31:33,408 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,kirti,250000,{(100),(200)})
(2,kk,240000,{(100),(300)})
(3,kumar,200000,{(200),(400)})
(4,shinde,290000,{(200),(500),(300),(100)})
(5,shinde k y,260000,{(100),(300),(200)})
(6,amol,255000,{(300)})

So I am wondering: if the data is comma-separated and the tuples inside the bags are also separated by commas, will it not work?

Solution

Let's go into the details:
 1. The data is read using TextInputFormat.
 2. A line record reader is used to read each line.
 3. "," is used to separate columns.

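Because "," occurs inside the bag and is also the column delimiter, the bag is split across multiple columns. Row 6 loads correctly only because its bag holds a single tuple and therefore contains no comma.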
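The effect can be illustrated outside Pig with a plain split on the delimiter. This is only a sketch of the behaviour described above, not PigStorage's actual implementation; `split_fields` is a hypothetical helper, and the two rows are taken from the question.

```python
# Illustration only: split each line on every "," with no awareness of braces,
# which is the behaviour the answer attributes to the comma-delimited load.
rows = [
    "1,kirti,250000,{(100),(200)}",
    "6,amol,255000,{(300)}",
]


def split_fields(line, delimiter=","):
    """Split a line on every occurrence of the delimiter (hypothetical helper)."""
    return line.split(delimiter)


for row in rows:
    fields = split_fields(row)
    # Show how many columns the naive split produces and what lands in column 4.
    print(len(fields), repr(fields[3]), list(fields[3].encode("utf-8")))
```

For row 1 the fourth column is the fragment `{(100)`, whose bytes are exactly the `[123, 40, 49, 48, 48, 41]` reported in the warning, so the bag parser sees an unterminated bag. Row 6 yields four columns with an intact `{(300)}`.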

There are various ways to overcome this.

 1. Pre-process the input and replace the first three "," in each row with some other delimiter.
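A minimal sketch of that pre-processing step, written in Python as one possible approach. The function names and the choice of "|" as the new delimiter are assumptions for illustration, and the sketch assumes the first three fields never contain a comma themselves.

```python
def replace_leading_delimiters(line, old=",", new="|", count=3):
    """Replace only the first `count` occurrences of `old`, leaving the bag untouched."""
    return line.replace(old, new, count)


def convert_file(src_path, dst_path):
    """Rewrite a file line by line (hypothetical helper; paths are supplied by the caller)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(replace_leading_delimiters(line))


if __name__ == "__main__":
    # The commas inside the bag survive because only three replacements are made.
    print(replace_leading_delimiters("4,shinde,290000,{(200),(500),(300),(100)}"))
```

The converted file can then be loaded with PigStorage('|'), as in the pipe-delimited example from the question.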
