embedded hadoop-pig: what's the correct way to use the automatic addContainingJar for UDFs?
Problem description
when you use pigServer.registerFunction, you're not supposed to explicitly call pigServer.registerJar, but rather have pig automatically detect the jar using jarManager.findContainingJar.
However, we have a complex UDF whose class depends on classes from multiple other jars. So we created a jar-with-dependencies with the maven-assembly plugin. But this causes the entire jar to land in pigContext.skipJars (as it contains pig.jar itself), so it never gets sent to the hadoop server :(
What's the correct approach here? Must we manually call registerJar for every jar we depend on?
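The question mentions jarManager.findContainingJar; you can run that same lookup yourself to see which jar pig would auto-ship for your UDF. A minimal sketch, assuming pig is on the classpath; the FindJarCheck and MyUpper names are made up for illustration, and findContainingJar may return null for a class that wasn't loaded from a jar:

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.JarManager;

public class FindJarCheck {
    // stand-in UDF so the snippet is self-contained (hypothetical)
    public static class MyUpper extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return input.get(0).toString().toUpperCase();
        }
    }

    public static void main(String[] args) {
        // the same lookup pig performs when you call registerFunction:
        // which jar was this class loaded from?
        String udfJar = JarManager.findContainingJar(MyUpper.class);
        String pigJar = JarManager.findContainingJar(PigServer.class);

        System.out.println("UDF jar: " + udfJar);
        System.out.println("pig jar: " + pigJar);
        // with a jar-with-dependencies both paths are identical, which is
        // exactly the situation where the jar lands in pigContext.skipJars
    }
}
```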
not sure what's the certified way, but here's some pointers:

- when you use pigServer.registerFunction, pig automatically detects the jar that contains the UDFs and sends it to the jobTracker
- pig also automatically detects the jar that contains the PigMapReduce class (JarManager.createJar), extracts from it only the classes that start with org/apache/pig, org/antlr/runtime, etc., and sends those to the jobTracker as well
- so, if your UDF sits in the same jar as PigMapReduce, you're screwed, because it won't get sent
- our conclusion: don't use jar-with-dependencies (a sketch of the manual alternative follows below)
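Not part of the original answer, but here's a minimal sketch of the manual route the question asks about: keep the UDF in its own thin jar and call registerJar once per dependency. The jar paths and the com.example.MyUpper class below are hypothetical placeholders:

```java
import org.apache.pig.ExecType;
import org.apache.pig.FuncSpec;
import org.apache.pig.PigServer;

public class RegisterUdfJars {
    public static void main(String[] args) throws Exception {
        PigServer pigServer = new PigServer(ExecType.MAPREDUCE);

        // register each dependency jar explicitly instead of bundling
        // everything into one jar-with-dependencies (paths are
        // hypothetical -- adjust to your build layout)
        pigServer.registerJar("lib/my-udfs.jar");          // thin jar: UDF classes only
        pigServer.registerJar("lib/commons-lang-2.6.jar"); // UDF dependency
        pigServer.registerJar("lib/guava-14.0.1.jar");     // UDF dependency

        // alias the UDF; pig still auto-detects the containing jar via
        // JarManager.findContainingJar, which now resolves to the thin jar
        pigServer.registerFunction("MY_UPPER",
                new FuncSpec("com.example.MyUpper"));

        pigServer.registerQuery("A = LOAD 'input.txt' AS (line:chararray);");
        pigServer.registerQuery("B = FOREACH A GENERATE MY_UPPER(line);");
        pigServer.store("B", "output");
    }
}
```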
HTH