How do I read in a list of bags in Pig?
Question
How do I read in a list of bags in Pig?
I tried:
grunt> cat sample.txt
{a,b},{},{c,d}
grunt> data = LOAD 'sample.txt' AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data
({},,)
Recommended answer
The default method for reading data into Pig is PigStorage('\t'); that is, it assumes your data is tab-separated. Yours is comma-separated, so you should write LOAD 'sample.txt' USING PigStorage(',') AS...
However, your data is not in proper Pig bag format. Remember that a bag is a collection of tuples, so each element must be wrapped in parentheses: {(a),(b)} rather than {a,b}. If you cannot pre-process your input, you will have to write a UDF to parse input of the form you have given. With the tuples added, you might expect this to work:
grunt> cat tmp/data.txt
{(a),(b)},{},{(c),(d)}
grunt> data = LOAD 'tmp/data.txt' USING PigStorage(',') AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data;
(,,{})
What went wrong? Your input field separator (,) is the same as the separator between records inside a bag, and that confuses Pig. It parses your input into the fields '{(a)', '(b)}', and '{}', which is why only the third field ends up being a bag. It is also why you will see a warning message like Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s).
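To make the splitting problem concrete, here is a small Python sketch that mimics a delimiter-only split of that line. It is only an illustration of the behavior described above, not Pig's actual implementation. The helper looks_like_bag is a made-up name, and keeping just the first three pieces (to match the three declared fields) is an assumption of this sketch.

```python
line = "{(a),(b)},{},{(c),(d)}"
declared_fields = 3  # the schema in the LOAD statement names three fields: a, b, c

# A delimiter-only split knows nothing about braces or parentheses,
# so it also cuts at the commas inside each bag.
pieces = line.split(",")

# Assumption for this sketch: only the first three pieces are matched
# against the three declared fields.
kept = pieces[:declared_fields]


def looks_like_bag(text):
    """Rough check (hypothetical helper): a complete bag literal is wrapped in braces."""
    return text.startswith("{") and text.endswith("}")


print(pieces)
print([(piece, looks_like_bag(piece)) for piece in kept])
```

Of the three pieces that are kept, only the last is a complete bag literal, which matches the (,,{}) output and the two discarded fields in the warning.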
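If you can, try to use tabs or spaces (or semicolons, or...) instead of commas as the field separator. The example below has no USING clause, so it relies on the default tab separator: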
grunt> cat tmp/data.txt
{(a),(b)} {} {(c),(d)}
grunt> data = LOAD 'tmp/data.txt' AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data;
({(a),(b)},{},{(c),(d)})
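If pre-processing is an option, a short script can rewrite the original sample.txt format into the tab-separated form shown above. The following is a minimal Python sketch, not a general parser: convert_line is a hypothetical helper, and it assumes flat bags with no nested braces and values that contain no commas, braces, or tabs.

```python
import re


def convert_line(line):
    """Turn '{a,b},{},{c,d}' into tab-separated bags of one-field tuples.

    Assumes flat bags only: no nested braces, and no commas inside values.
    """
    # Grab the text between each pair of braces, e.g. ['a,b', '', 'c,d'].
    bodies = re.findall(r"\{([^{}]*)\}", line)
    fields = []
    for body in bodies:
        # Split the bag contents and drop empty entries (covers the empty bag).
        items = [item.strip() for item in body.split(",") if item.strip()]
        # Wrap every value in parentheses so it becomes a one-field tuple.
        fields.append("{" + ",".join("(" + item + ")" for item in items) + "}")
    # Join the bags with tabs, the default PigStorage separator.
    return "\t".join(fields)


print(convert_line("{a,b},{},{c,d}"))
```

Running each line of the input through such a function before loading would let the plain LOAD ... AS (a:bag{}, b:bag{}, c:bag{}) statement above read the result without a custom delimiter.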