How do I make Hadoop find imported Python modules when using Python UDFs in Pig?


Problem description


I am using Pig (0.9.1) with UDFs written in Python. The Python scripts import modules from the standard Python library. I have been able to run the Pig scripts that call the Python UDFs successfully in local mode, but when I run them on the cluster, it appears that the Hadoop job Pig generates is unable to find the imported modules. What needs to be done?

For example:

  • Does python (or jython) need to be installed on each task tracker node?
  • Do the python (or jython) modules need to be installed on each task tracker node?
  • Do the task tracker nodes need to know how to find the modules?
  • If so, how do you specify the path (via an environment variable - how is that done for the task tracker)?

Solution

Does python (or jython) need to be installed on each task tracker node?

Yes, because the UDF is executed on the task tracker nodes.

Do the python (or jython) modules need to be installed on each task tracker node?

If you are using a third-party module (geoip, for example), it must be installed on the task tracker nodes as well.

Do the task tracker nodes need to know how to find the modules? If so, how do you specify the path (via an environment variable - how is that done for the task tracker)?

The book "Programming Pig" answers this:

register is also used to locate resources for Python UDFs that you use in your Pig Latin scripts. In this case you do not register a jar, but rather a Python script that contains your UDF. The Python script must be in your current directory.
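As a minimal sketch of what such a registered script might look like, assume a hypothetical file `myudfs.py` in the current directory that imports only from the standard library. The file name, the function `url_host`, and the relation and field names in the comments are illustrative assumptions, not taken from the question. As far as I know, Pig's Jython engine makes the `outputSchema` decorator available when it loads the script; the fallback definition below exists only so that the file can also be imported and tested outside Pig.

```python
# myudfs.py (hypothetical file name), registered from Pig Latin with:
#   REGISTER 'myudfs.py' USING jython AS myfuncs;
#   hosts = FOREACH urls GENERATE myfuncs.url_host(url);
# ("urls" and "url" are placeholder relation and field names.)
import re  # standard-library import; it must be resolvable on the task nodes

try:
    outputSchema  # expected to be supplied by Pig when it loads the script
except NameError:
    # Assumption for local testing only: a stand-in decorator that just
    # records the schema string on the function.
    def outputSchema(schema):
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap

# Scheme, then "://", then the host up to the first "/", ":", "?" or "#".
_HOST = re.compile(r'^[a-z][a-z0-9+.-]*://([^/:?#]+)', re.IGNORECASE)


@outputSchema('host:chararray')
def url_host(url):
    """Return the lowercased host part of a URL, or None if it cannot be parsed."""
    if url is None:
        return None
    match = _HOST.match(url)
    return match.group(1).lower() if match else None
```

Running this in local mode exercises the same code path as on the cluster; the difference on the cluster is only whether `import re` can be resolved on each task node.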

This caveat is also important:

A caveat, Pig does not trace dependencies inside your Python scripts and send the needed Python modules to your Hadoop cluster. You are required to make sure the modules you need reside on the task nodes in your cluster and that the PYTHONPATH environment variable is set on those nodes such that your UDFs will be able to find them for import. This issue has been fixed after 0.9, but as of this writing not yet released.
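To check whether a node can actually resolve the modules a UDF needs, a small diagnostic along the following lines may help. It is only a sketch: the function names `missing_modules` and `add_module_dir` and the directory `/opt/pylibs` are hypothetical. Extending `sys.path` from inside the UDF script is a workaround I am suggesting as an alternative to the environment variable, not something the book describes, and the directory still has to exist on every task node.

```python
import sys


def missing_modules(names):
    """Return the module names from `names` that cannot be imported here."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


def add_module_dir(path):
    """Append a directory to sys.path (only once) so later imports search it."""
    if path not in sys.path:
        sys.path.append(path)
    return sys.path
```

For example, calling `add_module_dir('/opt/pylibs')` at the top of the UDF script, before the imports that depend on it, and logging the result of `missing_modules([...])` shows which imports would fail on that node.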

And if you are using Jython:

Pig does not know where on your system the Jython interpreter is, so you must include jython.jar in your classpath when invoking Pig. This can be done by setting the PIG_CLASSPATH environment variable.
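One way to set that variable when launching Pig from a wrapper script is sketched below. The helper name `pig_environment` and the jar path are assumptions for illustration; the only thing taken from the quote is that `jython.jar` has to appear in `PIG_CLASSPATH`.

```python
import os


def pig_environment(jython_jar, base_env=None):
    """Return a copy of the environment with jython_jar on PIG_CLASSPATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get('PIG_CLASSPATH', '')
    # Keep any entries that are already there and drop empty ones.
    entries = [entry for entry in existing.split(os.pathsep) if entry]
    if jython_jar not in entries:
        entries.insert(0, jython_jar)
    env['PIG_CLASSPATH'] = os.pathsep.join(entries)
    return env
```

The resulting dictionary can be passed to a launcher such as `subprocess.call(['pig', 'myscript.pig'], env=pig_environment('/path/to/jython.jar'))`, where both the script name and the jar path are placeholders.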

In summary: if you are using streaming, you can use the SHIP command in Pig, which sends your executable files to the cluster. If you are using a UDF, then as long as it can be compiled (see the note about Jython) and has no third-party dependencies that are not already on PYTHONPATH or installed on the cluster, the UDF is shipped to the cluster when the script executes. (As a tip, life is much easier if you keep your simple UDF dependencies in the same folder as the Pig script when registering.)

Hope this clears things up.
