Implement data generator in federated training


Problem Description

(I have posted the question on https://github.com/tensorflow/federated/issues/793 and maybe also here!)

I have customized my own data and model for the federated interfaces and the training converged. But I am confused about an issue: in an image classification task, the whole dataset is extremely large, so it can't be stored in a single federated_train_data nor imported into memory all at once. I therefore need to load the dataset from the hard disk into memory in batches, in real time, during training, and use Keras model.fit_generator instead of model.fit, which is the approach people usually take to deal with large data.

I suppose that in the iterative_process shown in the image classification tutorial, the model is fitted on a fixed set of data. Is there any way to adjust the code to make it fit a data generator? I have looked into the source code but am still quite confused. I would be incredibly grateful for any hints.

Solution

Generally, TFF considers the feeding of data to be part of the "Python driver loop", which is a helpful distinction to make when writing TFF code.

In fact, when writing TFF, there are generally three levels at which one may be writing:

  1. TensorFlow defining local processing (i.e., processing that will happen on the clients, or on the server, or in the aggregators, or at any other placement one may want, but only at a single placement).
  2. Native TFF defining the way data is communicated across placements. For example, writing tff.federated_sum inside of a tff.federated_computation decorator; writing this line declares "this data is moved from clients to server, and aggregated via the sum operator". A minimal sketch of these first two levels appears after this list.
  3. Python "driving" the TFF loop, e.g. running a single round. It is the job of this final level to do what a "real" federated learning runtime would do; one example here would be selecting the clients for a given round.

If this breakdown is kept in mind, using a generator or some other lazy-evaluation-style construct to feed data into a federated computation becomes relatively simple; it is just done at the Python level.

One way this could be done is via the create_tf_dataset_for_client method on the ClientData object; as you loop over rounds, your Python code can select from the list of client_ids, then you can instantiate a new list of tf.data.Datasets and pass them in as your new set of client data. An example of this relatively simple usage would be here, and a more advanced usage (involving defining a custom client_datasets_fn which takes client_id as a parameter, and passing it to a separately-defined training loop) would be here, in the code associated with this paper (https://arxiv.org/abs/1911.06679).
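As a minimal sketch of that driver-loop pattern, assuming the EMNIST simulation data and a model_fn defined as in the image classification tutorial (both are assumptions here, not part of the answer above), each round samples a handful of client ids and only then builds their datasets:

```python
import random
import tensorflow_federated as tff

# Simulation data shipped with TFF; emnist_train is a ClientData object.
emnist_train, _ = tff.simulation.datasets.emnist.load_data()

def preprocess(dataset):
  # Placeholder: apply the same flattening/batching as in the tutorial
  # so the element structure matches what model_fn expects.
  return dataset.batch(20)

# model_fn is assumed to be defined as in the image classification
# tutorial; depending on the TFF version, build_federated_averaging_process
# may also require a client_optimizer_fn argument.
iterative_process = tff.learning.build_federated_averaging_process(model_fn)
state = iterative_process.initialize()

for round_num in range(1, 11):
  # Level 3: Python "drives" the loop and picks this round's clients.
  sampled_ids = random.sample(emnist_train.client_ids, 10)
  # The datasets are constructed lazily here; nothing is read from disk
  # until the runtime iterates over them inside next().
  federated_train_data = [
      preprocess(emnist_train.create_tf_dataset_for_client(cid))
      for cid in sampled_ids
  ]
  state, metrics = iterative_process.next(state, federated_train_data)
  print('round {}, metrics={}'.format(round_num, metrics))
```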

One final note: instantiating a tf.data.Dataset does not actually load the dataset into memory; the dataset is only loaded when it is iterated over. One helpful tip I have received from the lead author of tf.data.Dataset is to think of tf.data.Dataset more as a "dataset recipe" than a literal instantiation of the dataset itself. It has been suggested that perhaps a better name for this construct would have been DataSource; hopefully that helps build a mental model of what is actually happening. Similarly, using the tff.simulation.ClientData object generally shouldn't really load anything into memory until it is iterated over in training on the clients; this should make some nuances around managing dataset memory simpler.
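Tying this back to the original question about generators: the per-client tf.data.Dataset handed to the iterative process can itself be such a lazy "recipe" over files on disk, so nothing like model.fit_generator is needed. Below is a minimal sketch under assumed conventions (the file format, image shape, and the idea that each client's image paths and labels are already known are all placeholders, not part of the original answer):

```python
import tensorflow as tf

def client_dataset_from_generator(image_paths, labels, batch_size=32):
  """Builds a lazy per-client dataset backed by a Python generator.

  Files are read and decoded only while the dataset is iterated during
  a training round, so a client's data never has to fit in memory at once.
  """
  def gen():
    for path, label in zip(image_paths, labels):
      image = tf.io.decode_png(tf.io.read_file(path), channels=1)
      image = tf.image.convert_image_dtype(image, tf.float32)
      yield image, label

  return tf.data.Dataset.from_generator(
      gen,
      output_types=(tf.float32, tf.int32),
      output_shapes=((28, 28, 1), ()),  # placeholder image shape
  ).batch(batch_size)
```

Each element of the federated_train_data list in the earlier sketch could be produced this way instead of through a ClientData object; the iterative process only needs a list of tf.data.Datasets whose element structure matches what model_fn expects.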

