How to use tf.data.Dataset with kedro?

Question

I am using tf.data.Dataset to prepare a streaming dataset which is used to train a tf.keras model. With kedro, is there a way to create a node and return the created tf.data.Dataset to use it in the next training node?

The MemoryDataset will probably not work, because a tf.data.Dataset cannot be pickled (so deepcopy isn't possible); see also this SO question. According to issue #91, the deep copy in MemoryDataset is done to prevent the data from being modified by some other node. Can someone please elaborate a bit more on why/how this concurrent modification could happen?
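To make the pickling limitation concrete, here is a minimal sketch (my own illustration, not part of the original question) of the deepcopy that MemoryDataset performs by default and how it fails on a tf.data.Dataset:

# deepcopy_demo.py (illustrative)
import copy
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])
try:
  # MemoryDataset's default copy_mode attempts a deepcopy like this one.
  copy.deepcopy(ds)
except TypeError as err:
  print(f"deepcopy failed: {err}")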

From the docs, there seems to be a copy_mode = "assign". Would it be possible to use this option in case the data is not picklable?
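For reference, copy_mode can also be exercised directly in Python. A hedged sketch, assuming a kedro version where the class is spelled MemoryDataSet (newer releases rename it to MemoryDataset):

# memory_dataset_demo.py (illustrative)
import tensorflow as tf
from kedro.io import MemoryDataSet

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])
mem = MemoryDataSet(data=ds, copy_mode="assign")
# "assign" stores and returns the very same object, so nothing is pickled.
assert mem.load() is ds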

Another solution (also mentioned in issue #91) is to use just a function to generate the streaming tf.data.Dataset inside the training node, without the preceding dataset-generation node. However, I am not sure what the drawbacks of this approach would be (if any). It would be great if someone could give some examples.

Also, I would like to avoid storing the complete output of the streaming dataset, for example using tfrecords or tf.data.experimental.save, as these options would use a lot of disk storage.
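For context, the disk-based route being avoided looks roughly like the sketch below (tf.data.experimental.save is the TF 2.x experimental API; recent TensorFlow versions expose it as tf.data.Dataset.save):

# save_demo.py (illustrative; materializes every element on disk)
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])
tf.data.experimental.save(ds, "/tmp/tf_data_snapshot")
# element_spec may be required by load() on older TF versions.
reloaded = tf.data.experimental.load("/tmp/tf_data_snapshot")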

Is there a way to pass just the created tf.data.Dataset object to use it in the training node?

Answer

Providing the workaround here for the benefit of the community, though it was originally presented on kedro.community by @DataEngineerOne.

According to @DataEngineerOne:

With kedro, is there a way to create a node and return the created tf.data.Dataset to use it in the next training node?

Yes, absolutely!

Can someone please elaborate a bit more on why/how this concurrent modification could happen?

From the docs, there seems to be a copy_mode = "assign". Would it be possible to use this option in case the data is not picklable?

I have yet to try this option, but it should theoretically work. All you would need to do is create a new dataset entry in the catalog.yml file that includes the copy_mode option.

For example:

# catalog.yml
tf_data:
  type: MemoryDataSet
  copy_mode: assign

# pipeline.py
node(
  tf_generator,
  inputs=...,
  outputs="tf_data",
)
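For completeness, a sketch of what the tf_generator node referenced above could return; the function body is my own illustration:

# node.py (illustrative sketch of tf_generator)
import tensorflow as tf

def tf_generator():
  # With copy_mode: assign, kedro hands this exact object to the next node,
  # so the un-picklable tf.data.Dataset never needs to be deep-copied.
  return tf.data.Dataset.from_tensor_slices([1, 2, 3])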

I can't vouch for this solution, but give it a go and let me know if it works for you.

Another solution (also mentioned in issue 91) is to use just a function to generate the streaming tf.data.Dataset inside the training node, without the preceding dataset-generation node. However, I am not sure what the drawbacks of this approach would be (if any). It would be great if someone could give some examples.

This is also a great alternative solution, and I think (guess) that the MemoryDataSet will automatically use assign in this case, rather than its normal deepcopy, so you should be alright.

# node.py
import tensorflow as tf

def generate_tf_data(...):
  # Return a zero-argument factory instead of the dataset itself, so the
  # un-picklable tf.data.Dataset never has to cross the node boundary.
  tensor_slices = [1, 2, 3]
  def _tf_data():
    dataset = tf.data.Dataset.from_tensor_slices(tensor_slices)
    return dataset
  return _tf_data

def use_tf_data(tf_data_func):
  # Build the dataset only here, inside the consuming node.
  dataset = tf_data_func()

# pipeline.py
from kedro.pipeline import Pipeline, node

Pipeline([
  node(
    generate_tf_data,
    inputs=...,
    outputs='tf_data_func',
  ),
  node(
    use_tf_data,
    inputs='tf_data_func',
    outputs=...,
  ),
])
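A fuller training node built on this pattern might look like the following sketch; the model, the feature/target shapes, and the name train_on_tf_data are illustrative assumptions, not part of the original answer:

# train_node.py (illustrative)
import tensorflow as tf

def train_on_tf_data(tf_data_func):
  # Turn the scalar stream into (feature, target) pairs that fit() accepts.
  dataset = tf_data_func().map(
      lambda x: (tf.reshape(tf.cast(x, tf.float32), [1]), tf.constant([1.0]))
  ).batch(2)
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
  model.compile(optimizer="adam", loss="mse")
  model.fit(dataset, epochs=1)
  return model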

The only drawback here is the additional complexity. For more details you can refer here.
