Pig: efficient filtering by loaded list

Question

In Apache Pig (version 0.16.x), what are the most efficient ways to filter a dataset by an existing list of values for one of the dataset's fields?

For example (updated per @inquisitive_mind's tip):

Input: a line-separated file with one value per line, my_codes.txt:

'110'
'100'
'000'

sample_data.txt:

'110', 2
'110', 3
'001', 3
'000', 1

Desired output:

'110', 2
'110', 3
'000', 1

Sample script:

%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'
my_codes = LOAD '$my_codes_file' as (code:chararray);
sample_data = LOAD '$sample_data_file' as (code: chararray, point: float);
desired_data = FILTER sample_data BY code IN (my_codes.code);

Error:

Scalar has more than one row in the output. 1st : ('110'), 2nd :('100') 
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )

I also tried FILTER sample_data BY code IN my_codes; but the "IN" clause seems to require parentheses. I then tried FILTER sample_data BY code IN (my_codes); but got the error: A column needs to be projected from a relation for it to be used as a scalar

Recommended answer

The my_codes.txt file has the codes laid out as a single row instead of a column. Since you are loading it into a single field, the file should have one code per line, like this:

'110'
'100'
'000'

Alternatively, you can use JOIN:

joined_data = JOIN sample_data BY code, my_codes BY code;
desired_data = FOREACH joined_data GENERATE $0, $1;
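
Since the question asks about efficiency, one possible variation is a fragment-replicate (map-side) join, which Pig selects with USING 'replicated'. This is only a sketch under assumptions not stated in the answer: that the code list is small enough to fit in memory, and that it may contain duplicates (hence the DISTINCT; the name unique_codes is made up for illustration). The LOAD statements mirror the question's script.

```pig
-- Load both relations as in the question's script. If the data file is
-- comma-delimited, USING PigStorage(',') may be needed (assumption about the input).
my_codes = LOAD 'my_codes.txt' AS (code:chararray);
sample_data = LOAD 'sample_data.txt' AS (code:chararray, point:float);

-- Drop duplicate codes so the join cannot multiply matching rows.
unique_codes = DISTINCT my_codes;

-- Replicated join: the large relation comes first, the small one last;
-- the small relation is held in memory on each mapper.
joined_data = JOIN sample_data BY code, unique_codes BY code USING 'replicated';

-- Keep only the columns that came from sample_data.
desired_data = FOREACH joined_data GENERATE sample_data::code AS code, sample_data::point AS point;
```

Referring to the fields as sample_data::code and sample_data::point rather than $0 and $1 keeps the projection readable and avoids the ambiguity hinted at in the error message above.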
