How can I do this inner join properly in Apache PIG?

Question

I have two files. The first is called a-records:

123^record1
222^record2
333^record3

The other file is called b-records:

123^jim
123^jim
222^mike
333^joe

As you can see, token 123 appears once in file A but twice in file B. Is there a way to join the data in Apache PIG so that I get only ONE joined record per record in file A?

Here is my current script:

arecords = LOAD '$a' USING PigStorage('^') AS (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') AS (token:chararray, name:chararray);

x = JOIN arecords BY token, brecords BY token;

DUMP x;

This yields:

(123,record1,123,jim)
(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)

What I REALLY want is the following (note that token 123 appears only once after the join):

(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)

Any ideas? Thanks so much.

Recommended answer

I would do something like this, removing the duplicate rows from b-records with DISTINCT before the join:

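arecords = LOAD '$a' USING PigStorage('^') AS (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') AS (token:chararray, name:chararray);

-- Drop exact duplicate rows, so (123,jim) is kept only once.
bdistinct = DISTINCT brecords;

-- Each a-record now matches one b-record per distinct (token, name) pair.
x = JOIN arecords BY token, bdistinct BY token;

DUMP x;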
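
Note that DISTINCT only collapses rows that are identical in every field. If b-records could contain the same token with different names (say 123^jim and 123^bob, a hypothetical case not shown in the question), the join would still return two rows for token 123. A minimal sketch of one way to handle that, assuming it does not matter which b-record is kept for each token, is to group by token and keep a single row per group. The aliases bgrouped, bfirst, and one are illustrative names only:

```pig
arecords = LOAD '$a' USING PigStorage('^') AS (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') AS (token:chararray, name:chararray);

-- Collect all b-records that share a token into one bag per token.
bgrouped = GROUP brecords BY token;

-- Keep a single (arbitrary) row from each bag and restore the flat schema.
bfirst = FOREACH bgrouped {
    one = LIMIT brecords 1;
    GENERATE FLATTEN(one) AS (token:chararray, name:chararray);
};

-- Every token now appears at most once on the b side of the join.
x = JOIN arecords BY token, bfirst BY token;

DUMP x;
```

Because LIMIT without an ORDER inside the nested block picks an arbitrary row, add an ORDER step before it if a specific record should win.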
