How to perform a DISTINCT in Pig Latin on a subset of columns?

Question

I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested foreach:

You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the fields and then apply DISTINCT (see Example: Nested Block).

It is simple to perform a DISTINCT operation on all of the columns:

A = LOAD 'data' AS (a1,a2,a3,a4);
A_unique = DISTINCT A;

Let's say that I am interested in performing the DISTINCT across a1, a2, and a3 only. Can anyone provide an example showing how to perform this operation with a nested FOREACH, as suggested in the documentation?

Here's an example of input and expected output:

A = LOAD 'data' AS (a1,a2,a3,a4);
DUMP A;

(1 2 3 4)
(1 2 3 4)
(1 2 3 5)
(1 2 4 4)

-- insert DISTINCT operation on a1,a2,a3 here:
-- ...

DUMP A_unique;

(1 2 3 4)
(1 2 4 4)

Recommended answer

Group on all the other columns, project just the columns of interest into a bag, apply DISTINCT to that bag, and then use FLATTEN to expand the result out again:

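-- One group per value of a4, the column outside the subset.
A_unique =
    FOREACH (GROUP A BY a4) {
        b = A.(a1,a2,a3);   -- project the subset columns from the group's bag
        s = DISTINCT b;     -- deduplicate (a1,a2,a3) within this group
        GENERATE FLATTEN(s), group AS a4;  -- one row per distinct tuple
    };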
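
One thing worth checking against the sample input: because a4 is the only column outside the subset, grouping by it keeps (1,2,3,5) as a row of its own, so the result has three rows and not the two shown in the expected output. If the goal is exactly one row per distinct (a1,a2,a3), a possible alternative is to group by the subset itself and keep a single row from each group. This is a sketch and not part of the original answer; the typed schema, the relation name A_grouped, and the choice to keep the row with the smallest a4 are assumptions made for illustration.

```pig
-- Assumption: columns are declared as int so ORDER compares numerically.
A = LOAD 'data' AS (a1:int, a2:int, a3:int, a4:int);

-- One group per distinct (a1,a2,a3) combination.
A_grouped = GROUP A BY (a1, a2, a3);

A_unique =
    FOREACH A_grouped {
        by_a4   = ORDER A BY a4;    -- assumption: prefer the smallest a4
        one_row = LIMIT by_a4 1;    -- keep a single row from the group
        GENERATE FLATTEN(one_row);  -- expand back to (a1,a2,a3,a4)
    };
```

On the sample input this should leave (1,2,3,4) and (1,2,4,4). Dropping the ORDER step is also an option when any representative row is acceptable, but then which a4 value survives is not defined.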
