nltk dependencies in Dataflow


Problem description

I know that external Python dependencies can be fed into Dataflow via the requirements.txt file. I can successfully load nltk in my Dataflow script. However, nltk often needs further files to be downloaded (e.g. stopwords or punkt). Usually, on a local run of the script, I can just run

nltk.download('stopwords')
nltk.download('punkt')

and these files will be available to the script. How do I do this so the files are also available to the worker scripts? It seems like it would be extremely inefficient to place those commands into a DoFn/CombineFn if they only have to happen once per worker. What part of the script is guaranteed to run once on every worker? That would probably be the place to put the download commands.
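
For reference, newer Apache Beam Python SDKs expose a DoFn.setup() method that runs once per DoFn instance when a worker initializes it, which is a natural hook for this kind of one-time work. A minimal sketch, assuming an SDK new enough to have that method (the DoFn name and tokenizing logic are illustrative, not from the original question):

import apache_beam as beam
import nltk


class TokenizeDoFn(beam.DoFn):
    """Illustrative DoFn; the name and processing logic are made up."""

    def setup(self):
        # Called once per DoFn instance on worker start-up, not once
        # per element; nltk.download() is also a no-op when the data
        # is already present, so repeated calls stay cheap.
        nltk.download('stopwords')
        nltk.download('punkt')

    def process(self, element):
        # punkt is needed by word_tokenize and was fetched in setup().
        yield nltk.word_tokenize(element)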

According to this, Java allows the staging of resources via the classpath. That's not quite what I'm looking for in Python. I'm also not looking for a way to load additional Python resources; I just need nltk to find its files.

Recommended answer

You can probably use '--setup_file setup.py' to run these custom commands. See https://cloud.google.com/dataflow/pipelines/dependencies-python#pypi-dependencies-with-non-python-dependencies . Does this work in your case?
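
For concreteness, here is a minimal sketch of such a setup.py, following the custom-commands pattern that page describes (as in Beam's juliaset example). The package name, version, and nltk data directory below are illustrative, not from the original answer:

# setup.py -- a minimal sketch of the custom-commands pattern;
# package name, version, and the nltk data directory are illustrative.
import subprocess
from distutils.command.build import build as _build

import setuptools


# Commands run on each worker while this package is installed.
# "python -m nltk.downloader" is nltk's command-line downloader;
# -d points it at a directory that nltk searches by default.
CUSTOM_COMMANDS = [
    ['python', '-m', 'nltk.downloader',
     '-d', '/usr/local/share/nltk_data',
     'stopwords', 'punkt'],
]


class build(_build):
    """Chain the custom command into the standard build step."""
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


class CustomCommands(setuptools.Command):
    """Run the shell commands above during installation."""

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            # Fail the install loudly if a download does not work.
            subprocess.check_call(command)


setuptools.setup(
    name='my-dataflow-job',  # illustrative
    version='0.0.1',
    install_requires=['nltk'],
    packages=setuptools.find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    },
)

The pipeline would then be launched with something like 'python my_pipeline.py --runner DataflowRunner --setup_file ./setup.py' (the script name and other flags are placeholders); each worker installs the package and runs the downloads as part of that install.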

