Passing Python objects between Tasks in Luigi?


Problem description

I was coding my first project in Python 3.6 using Spotify's Luigi to arrange some Natural Language Processing Tasks in a pipeline.

I noticed that the output() function of a Task class always returns some kind of Target object, which is just some file somewhere, be it local or remote. Because my Tasks produce more complex data structures like parse trees, it's pretty awkward for me to write them into files as strings and read them back afterwards.

Therefore I would like to ask if there is any possibility to pass Python objects between the tasks within a pipeline?

Solution

Short answer: No.

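Luigi parameters are meant for simple values: the built-in types cover dates/datetimes, strings, ints and floats, plus a few others such as booleans, lists and dicts. Each one has to be representable as a string, so none of them will carry an arbitrary Python object. See the docs for reference.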

That means that you need to serialize your complex data structure as a string (using json, msgpack, whatever serializer you like, and even compress it) and pass it as a string parameter.
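As a rough illustration of that serialize-to-string step, here is a minimal sketch using only the standard library. The helper names `encode_tree` and `decode_tree` and the nested-list parse tree are made up for this example; it also assumes the structure is JSON-compatible (tuples, for instance, would come back as lists).

```python
import base64
import json
import zlib


def encode_tree(tree):
    """Serialize a JSON-compatible structure to a compact ASCII string."""
    raw = json.dumps(tree, separators=(",", ":")).encode("utf-8")
    # Compress, then base64-encode so the result is safe to pass as plain text.
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")


def decode_tree(text):
    """Reverse encode_tree: base64-decode, decompress, parse the JSON."""
    raw = zlib.decompress(base64.urlsafe_b64decode(text.encode("ascii")))
    return json.loads(raw.decode("utf-8"))


# Hypothetical parse tree as nested lists: [label, child, child, ...]
parse_tree = ["S", ["NP", "Luigi"], ["VP", ["V", "runs"], ["NP", "tasks"]]]

encoded = encode_tree(parse_tree)    # a str, usable as a string parameter value
restored = decode_tree(encoded)      # the original nested-list structure
```

The encoded string can get long for large structures, which is one more reason to prefer a target for anything sizeable.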

Of course, you may write a custom Parameter subclass, but then you'll basically need to implement its serialize and parse methods yourself.

But take into account: if you use parameters instead of saving your calculated data to a target, you will lose one key advantage of using Luigi: if the parent task in the tree fails more than the number of retries you specify, you'll need to run the task that calculates that complex data structure again. If your task calculates complex data, takes a considerable amount of time or consumes a lot of resources, you should save the output as a target so that you don't have to repeat all that expensive computation.

And looking beyond: another task may need that data too, so why not save it?

Also, note that targets are not only files: you may save your data to a database table, Redis, Hadoop, an Elasticsearch index, and many more: http://luigi.readthedocs.io/en/stable/api/luigi.contrib.html#submodules
