Hadoop Pig UDF invocation issue


Problem description

The following code works well. However, I already have two existing bags (with aliases, say S1 and S2, each representing one set), and I am wondering how to call the UDF setDifference on them to get the set difference. If I manually construct an additional bag from my existing input bags (S1 and S2), would that add overhead?

register datafu-1.2.0.jar;
define setDifference datafu.pig.sets.SetDifference();

-- ({(3),(4),(1),(2),(7),(5),(6)} \t {(1),(3),(5),(12)})
A = load 'input.txt' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});

F1 = foreach A generate B1;
F2 = foreach A generate B2;

differenced = FOREACH A {
  -- input bags must be sorted
  sorted_b1 = ORDER B1 by val;
  sorted_b2 = ORDER B2 by val;
  GENERATE setDifference(sorted_b1,sorted_b2);
}

-- produces: ({(2),(4),(6),(7)})
DUMP differenced;

Update:

The question is: given two bags that already exist, how do I call the UDF setDifference to get their set difference? Do I need to build another "super bag" that contains the two separate bags? Thanks.

Thanks in advance, Lin

Solution

I don't see any overhead issue with the UDF invocation.

For reference, http://datafu.incubator.apache.org/docs/datafu/guide/set-operations.html has an example of using the SetDifference method.

As per the API documentation (http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/sets/SetDifference.html), the SetDifference method takes bags as input and emits the difference between them.

N.B. The input bags have to be sorted.

In the example snippet shared, I don't see the need for the following lines, since F1 and F2 are never used afterwards:

F1 = foreach A generate B1;
F2 = foreach A generate B2;
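
If S1 and S2 are separate relations rather than two bag fields of one relation, the UDF cannot see both at once, so they first have to end up in the same tuple. One possible way to do that is COGROUP, sketched below. This is only a sketch under assumptions that are not in the original question: S1 and S2 are assumed to hold (id, val) rows, the file names 's1.txt' and 's2.txt' are hypothetical, and the difference is computed per id.

```pig
register datafu-1.2.0.jar;
define setDifference datafu.pig.sets.SetDifference();

-- Assumption: S1 and S2 are relations of (id, val) rows; file names are hypothetical.
S1 = LOAD 's1.txt' AS (id:chararray, val:int);
S2 = LOAD 's2.txt' AS (id:chararray, val:int);

-- COGROUP yields one tuple per id: (group, S1:bag, S2:bag)
grouped = COGROUP S1 BY id, S2 BY id;

differenced = FOREACH grouped {
  -- keep only the compared field, then sort, since the UDF requires sorted input bags
  vals1 = S1.val;
  vals2 = S2.val;
  sorted_1 = ORDER vals1 BY val;
  sorted_2 = ORDER vals2 BY val;
  GENERATE group AS id, setDifference(sorted_1, sorted_2) AS only_in_s1;
}

DUMP differenced;
```

COGROUP already produces the two bags side by side, so there is no separate "super bag" to build by hand. Whether the extra grouping step is acceptable overhead depends on the data size and on how S1 and S2 were produced.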
