Pig: efficient filtering by loaded list

Views: 102 | Published: 2020/9/3 20:11:39 | Tag: apache-pig


Question

In Apache Pig (version 0.16.x), what are some of the most efficient ways to filter a dataset by an existing list of values for one of the dataset's fields?

For example (updated per @inquisitive_mind's tip):

Input: a line-separated file with one value per line, my_codes.txt

'110'
'100'
'000'

sample_data.txt

'110', 2
'110', 3
'001', 3
'000', 1

Desired output

'110', 2
'110', 3
'000', 1

Sample script

%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'
my_codes = LOAD '$my_codes_file' as (code:chararray);
sample_data = LOAD '$sample_data_file' as (code: chararray, point: float);
desired_data = FILTER sample_data BY code IN (my_codes.code);

Error:

Scalar has more than one row in the output. 1st : ('110'), 2nd :('100') 
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )

I had also tried FILTER sample_data BY code IN my_codes; but the "IN" clause seems to require parentheses. I also tried FILTER sample_data BY code IN (my_codes); but got the error: A column needs to be projected from a relation for it to be used as a scalar

Recommended answer

The my_codes.txt file has the codes as a row instead of a column. Since you are loading it into a single field, the codes should be laid out one per line, like this:

'110'
'100'
'000'

Alternatively, you can use JOIN:

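-- Inner join keeps only the rows whose code also appears in my_codes
joined_data = JOIN sample_data BY code, my_codes BY code;
-- Keep the sample_data columns (code, point) and drop the duplicate join key
desired_data = FOREACH joined_data GENERATE $0, $1;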
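
Since the question asks about efficiency, one variation that may be worth trying is a fragment-replicate join, which Pig selects with USING 'replicated'. This is a sketch under assumptions, not part of the original answer: it assumes the code list is small enough to fit in memory, that both files are comma-delimited (hence PigStorage(','), which the original script does not use), and that duplicate codes should not multiply output rows. The alias unique_codes is introduced here for illustration only.

```pig
-- Sketch only: assumes my_codes is small enough to be held in memory
-- and that both input files are comma-delimited (an assumption).
%default my_codes_file 'my_codes.txt'
%default sample_data_file 'sample_data.txt'

my_codes = LOAD '$my_codes_file' USING PigStorage(',') AS (code:chararray);
sample_data = LOAD '$sample_data_file' USING PigStorage(',') AS (code:chararray, point:float);

-- Remove duplicate codes so the join cannot multiply rows of sample_data
unique_codes = DISTINCT my_codes;

-- The large relation is listed first; the relation after it is the one replicated
joined_data = JOIN sample_data BY code, unique_codes BY code USING 'replicated';

-- Keep only the original sample_data columns
desired_data = FOREACH joined_data GENERATE sample_data::code AS code, sample_data::point AS point;

DUMP desired_data;
```

With a replicated join the small relation is loaded into memory on the map side, so no reduce phase is needed for the join itself. If the code list is too large for memory, the plain JOIN above is the safer choice.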
