Pig: efficient filtering by loaded list
Problem description
In Apache Pig (version 0.16.x), what are some of the most efficient methods to filter a dataset by an existing list of values for one of the dataset's fields?
For example (updated per @inquisitive_mind's tip):
Input: my_codes.txt, a line-separated file with one value per line
'110'
'100'
'000'
sample_data.txt
'110', 2
'110', 3
'001', 3
'000', 1
Desired output
'110', 2
'110', 3
'000', 1
Sample script
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'
my_codes = LOAD '$my_codes_file' as (code:chararray);
sample_data = LOAD '$sample_data_file' as (code: chararray, point: float);
desired_data = FILTER sample_data BY code IN (my_codes.code);
Error:
Scalar has more than one row in the output. 1st : ('110'), 2nd :('100')
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )
I had also tried FILTER sample_data BY code IN my_codes;
but the "IN" clause seems to require parentheses.
I also tried FILTER sample_data BY code IN (my_codes);
but got the error:
A column needs to be projected from a relation for it to be used as a scalar
Recommended answer
The my_codes.txt file has the codes in a row instead of a column. Since you are loading it into a single field, the file should list one code per line, like this:
'110'
'100'
'000'
Alternatively, you can use JOIN:
joined_data = JOIN sample_data BY code, my_codes BY code;
desired_data = FOREACH joined_data GENERATE $0, $1;
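Some background on why the join works where IN did not: when a relation's field is used inside an expression (as in my_codes.code), Pig treats it as a scalar, which is only valid if the relation has exactly one row. That is what the "Scalar has more than one row" error reports. Since the question asks about efficiency, here is a hedged sketch of the same join written as a replicated (map-side) join. It assumes the code list is small enough to fit in memory and that sample_data.txt is comma-delimited; the alias unique_codes and the PigStorage(',') loader are my additions, not part of the original script.

```pig
-- Sketch only: filter sample_data by the codes in my_codes via a replicated join.
-- Assumptions: my_codes fits in memory; sample_data.txt is comma-delimited.
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'

my_codes    = LOAD '$my_codes_file' AS (code:chararray);
sample_data = LOAD '$sample_data_file' USING PigStorage(',') AS (code:chararray, point:float);

-- Remove duplicate codes so the join cannot multiply rows of sample_data.
unique_codes = DISTINCT my_codes;

-- The large relation is listed first; the relation after it is loaded into memory.
joined_data = JOIN sample_data BY code, unique_codes BY code USING 'replicated';

-- Keep only the columns that came from sample_data.
desired_data = FOREACH joined_data GENERATE sample_data::code AS code, sample_data::point AS point;

DUMP desired_data;
```

If the code list is too large to hold in memory, dropping USING 'replicated' falls back to the default join shown above. The DISTINCT step only matters when my_codes.txt may contain repeated values.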