How do I read in a list of bags in Pig?
Question
How do I read in a list of bags in Pig?
I tried:
grunt> cat sample.txt
{a,b},{},{c,d}
grunt> data = LOAD 'sample.txt' AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data
({},,)
Recommended answer
The default method for reading data into Pig is PigStorage('\t'); that is, it assumes your data is tab-separated. Yours is comma-separated, so you should write LOAD 'sample.txt' USING PigStorage(',') AS...
However, your data is not in proper Pig bag format. Remember that a bag is a collection of tuples. If you cannot pre-process your input, you'll have to write a UDF to parse input of the form you have given. So this ought to work:
grunt> cat tmp/data.txt
{(a),(b)},{},{(c),(d)}
grunt> data = LOAD 'tmp/data.txt' USING PigStorage(',') AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data;
(,,{})
What went wrong? The fact that your input field separator (,) is the same as the bag-record separator is confusing Pig. It parses your input into the fields {(a) and (b)} and {}, which is why only the third field ends up being a bag. It's also why you'll see a warning message like Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s).
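To make the failure concrete, here is a small Python sketch that models the behavior described above. It only mimics splitting a line on one delimiter with no awareness of braces; it is not Pig's actual parser, and the helper names naive_split and looks_like_bag are made up for illustration.

```python
# Illustration only: a brace-unaware split, modelling the behavior
# described in the answer. This is NOT Pig's real parsing code.
line = "{(a),(b)},{},{(c),(d)}"


def naive_split(text, delimiter):
    """Split on every occurrence of the delimiter, ignoring braces."""
    return text.split(delimiter)


def looks_like_bag(field):
    """Rough check (an assumption): a bag literal is wrapped in { }."""
    return field.startswith("{") and field.endswith("}")


# Comma as field separator: the commas inside the bags also split.
comma_fields = naive_split(line, ",")
first_three = comma_fields[:3]  # the schema declares three fields
print(first_three)
print([looks_like_bag(f) for f in first_three])

# Tab as field separator: the commas inside the bags are left alone.
tab_line = "{(a),(b)}\t{}\t{(c),(d)}"
tab_fields = naive_split(tab_line, "\t")
print(tab_fields)
print([looks_like_bag(f) for f in tab_fields])
```

With the comma delimiter, only the third of the first three pieces is wrapped in braces, which matches the (,,{}) result shown earlier.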
If you can, try to use tabs or spaces (or semicolons, or...) instead of commas as the field separator. For anything other than a tab, pass that character to PigStorage:
grunt> cat tmp/data.txt
{(a),(b)} {} {(c),(d)}
grunt> data = LOAD 'tmp/data.txt' AS (a:bag{}, b:bag{}, c:bag{});
grunt> DUMP data;
({(a),(b)},{},{(c),(d)})
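If pre-processing the input outside Pig is an option, a script along these lines could rewrite the original sample.txt format into the tab-separated bag-of-tuples form. This is a minimal sketch under stated assumptions: bags are never nested, items contain no commas or braces, and the function names to_pig_bag and convert_line are hypothetical.

```python
import re


def to_pig_bag(field):
    """Turn '{a,b}' into '{(a),(b)}'; an empty bag '{}' stays '{}'."""
    inner = field.strip()[1:-1]  # drop the surrounding braces
    if not inner:
        return "{}"
    items = (item.strip() for item in inner.split(","))
    return "{" + ",".join("(" + item + ")" for item in items) + "}"


def convert_line(line):
    """Rewrite one comma-separated line of flat bags as tab-separated bags.

    Assumes no nested braces, so each top-level {...} group is one field.
    """
    bags = re.findall(r"\{[^{}]*\}", line)
    return "\t".join(to_pig_bag(bag) for bag in bags)


if __name__ == "__main__":
    # Tabs are shown escaped by repr() so they are visible.
    print(repr(convert_line("{a,b},{},{c,d}")))
```

The converted lines can then be loaded with the default tab-delimited PigStorage, as in the example above.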