How to perform a DISTINCT in Pig Latin on a subset of columns?

Views: 78 Posted: 2020/9/3 20:01:52 apache-pig

Question

I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested foreach:

You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the fields and then apply DISTINCT (see Example: Nested Block).

It is simple to perform a DISTINCT operation on all of the columns:

A = LOAD 'data' AS (a1,a2,a3,a4);
A_unique = DISTINCT A;

Let's say that I am interested in performing the DISTINCT across a1, a2, and a3 only. Can anyone provide an example showing how to perform this operation with a nested FOREACH, as suggested in the documentation?

Here's an example of input and expected output:

A = LOAD 'data' AS (a1,a2,a3,a4);
DUMP A;

(1 2 3 4)
(1 2 3 4)
(1 2 3 5)
(1 2 4 4)

-- insert DISTINCT operation on a1,a2,a3 here:
-- ...

DUMP A_unique;

(1 2 3 4)
(1 2 4 4)

Recommended answer

Group on all the other columns (here, a4), project just the columns of interest into a bag, apply DISTINCT to that bag, and then use FLATTEN to expand it out again:

A_unique =
    FOREACH (GROUP A BY a4) {
        b = A.(a1,a2,a3);
        s = DISTINCT b;
        GENERATE FLATTEN(s), group AS a4;
    };
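
Note that because the statement above groups by a4, the DISTINCT is applied to (a1,a2,a3) separately within each a4 value; on the sample input from the question it would keep (1,2,3,5) alongside (1,2,3,4) and (1,2,4,4). If the goal is exactly one row per (a1,a2,a3) combination, as in the expected output, one possible variant groups on the subset itself and keeps a single row per group. This is a sketch rather than part of the original answer: the int field types, the choice to keep the row with the smallest a4, and the alias names G, ordered, and top_row are all assumptions made for illustration.

```pig
-- Assumption: fields are declared as int so that ORDER BY compares numerically.
A = LOAD 'data' AS (a1:int, a2:int, a3:int, a4:int);

-- One group per distinct (a1,a2,a3) combination.
G = GROUP A BY (a1, a2, a3);

A_unique = FOREACH G {
    -- Assumption: within each group, keep the row with the smallest a4.
    ordered = ORDER A BY a4;
    top_row = LIMIT ordered 1;
    -- Expand the one-row bag back into a plain (a1,a2,a3,a4) tuple.
    GENERATE FLATTEN(top_row);
};

DUMP A_unique;
```

Under these assumptions, the question's sample input should reduce to (1,2,3,4) and (1,2,4,4). Without the ORDER step, which row survives in each group is not guaranteed, so the tie-break is worth stating explicitly whenever the remaining columns differ within a group.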
