How do I make Hadoop find imported Python modules when using Python UDFs in Pig?


Problem description

I am using Pig (0.9.1) with UDFs written in Python. The Python scripts import modules from the standard Python library. I have been able to run the Pig scripts that call the Python UDFs successfully in local mode, but when I run on the cluster it appears Pig's generated Hadoop job is unable to find the imported modules. What needs to be done?

For example:

  • Does python (or jython) need to be installed on each task tracker node?
  • Do the python (or jython) modules need to be installed on each task tracker node?
  • Do the task tracker nodes need to know how to find the modules?
  • If so, how do you specify the path (via an environment variable - how is that done for the task tracker)?

Recommended answer

Does python (or jython) need to be installed on each task tracker node?

Yes, since it's executed in task trackers.

Do the python (or jython) modules need to be installed on each task tracker node?

If you are using a 3rd party module, it should be installed on the task trackers as well (like geoip, etc.).

Do the task tracker nodes need to know how to find the modules? If so, how do you specify the path (via an environment variable - how is that done for the task tracker)?

As answered in the "Programming Pig" book:

register is also used to locate resources for Python UDFs that you use in your Pig Latin scripts. In this case you do not register a jar, but rather a Python script that contains your UDF. The Python script must be in your current directory.
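
For example, a minimal sketch of a Pig Latin script that registers and calls a Python UDF (the file name myfuncs.py, the function url_length, and the input path are hypothetical names used only for illustration):

    -- myfuncs.py must sit in the directory pig is launched from;
    -- 'myfuncs' becomes the namespace used to call its functions
    register 'myfuncs.py' using jython as myfuncs;

    logs   = LOAD 'access_log' AS (ip:chararray, url:chararray);
    parsed = FOREACH logs GENERATE ip, myfuncs.url_length(url);
    DUMP parsed;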

And this is also important:

A caveat, Pig does not trace dependencies inside your Python scripts and send the needed Python modules to your Hadoop cluster. You are required to make sure the modules you need reside on the task nodes in your cluster and that the PYTHONPATH environment variable is set on those nodes such that your UDFs will be able to find them for import. This issue has been fixed after 0.9, but as of this writing not yet released.
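
Until that fix is released, one workaround sketch, assuming an MRv1 cluster (task trackers) and a hypothetical directory /usr/lib/python-udf-deps that you have already created and populated on every task node, is to pass PYTHONPATH to the task JVMs from the Pig script itself:

    -- mapred.child.env hands environment variables to the child task JVMs (MRv1);
    -- Pig does not ship /usr/lib/python-udf-deps for you -- it must already
    -- exist, with the needed modules, on every task node
    set mapred.child.env 'PYTHONPATH=/usr/lib/python-udf-deps';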

And if you are using jython:

Pig does not know where on your system the Jython interpreter is, so you must include jython.jar in your classpath when invoking Pig. This can be done by setting the PIG_CLASSPATH environment variable.

To summarize: if you are using streaming, you can use the "SHIP" command in Pig, which sends your executable files to the cluster. If you are using a UDF, then as long as it compiles (see the note about jython) and has no 3rd party dependencies in it (that you have not already put on the PYTHONPATH or installed on the cluster), the UDF is shipped to the cluster when the script executes. (As a tip, it will make your life much easier if you put your simple UDF dependencies in the same folder as the pig script when registering.)
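
A minimal streaming sketch using SHIP (the script names process.py and helpers.py are hypothetical; SHIP copies the listed files into each task's working directory so the command can find them):

    -- ship both the streaming script and the local module it imports
    DEFINE my_stream `python process.py` SHIP('process.py', 'helpers.py');

    raw      = LOAD 'input_data' AS (line:chararray);
    streamed = STREAM raw THROUGH my_stream AS (result:chararray);
    DUMP streamed;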

Hope this solves the problem.
