How to handle frequent changes in a dataset in Azure Machine Learning Studio?


Problem Description

How do I handle frequent changes to a dataset in Azure Machine Learning Studio? My dataset may change over time and I need to add more rows to it. How can I refresh the dataset I currently use to train the model so that it uses the newly updated data? I need to do this programmatically (in C# or Python) rather than manually in the studio.

Recommended Answer

When registering an AzureML Dataset, no data is moved; only some information is stored, such as where the data lives and how it should be loaded. The purpose is to make accessing the data as simple as calling dataset = Dataset.get(name="my dataset").

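As a rough sketch (the workspace config file and the dataset name "my dataset" are assumptions), retrieving a registered dataset by name with the Python SDK looks like this; it simply re-reads whatever data currently sits behind the stored paths:

from azureml.core import Workspace, Dataset

ws = Workspace.from_config()                           # assumes a local config.json for the workspace
dataset = Dataset.get_by_name(ws, name="my dataset")   # registered dataset name (assumed)
df = dataset.to_pandas_dataframe()                     # loads the data as it exists right now
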
In the snippet below (full example), if I register the dataset, I could technically overwrite weather/2018/11.csv with a new version after registering; my Dataset definition would stay the same, but the new data would be available if you use the dataset in training after the overwrite.

from azureml.core import Workspace, Dataset

ws = Workspace.from_config()              # load the Azure ML workspace
datastore = ws.get_default_datastore()    # assumed: the default datastore holds the CSV files

# create a TabularDataset from 3 paths in the datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]
weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)

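To make the overwrite-and-reuse point concrete, here is a hedged sketch (the register call, the dataset name "weather", and the local file path are assumptions, not part of the original answer): after registering, uploading a replacement file to the same datastore path changes what the dataset returns the next time it is loaded, without touching the registration.

# register the dataset once; the registration only records the paths above
weather_ds.register(workspace=ws, name="weather")

# later: overwrite weather/2018/11.csv in the datastore with a refreshed local copy
datastore.upload_files(files=["./11.csv"],        # hypothetical refreshed local file
                       target_path="weather/2018/",
                       overwrite=True)

# anyone loading the registered dataset now sees the new rows
refreshed_df = Dataset.get_by_name(ws, name="weather").to_pandas_dataframe()
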
However, there are two more recommended approaches (my team uses both):


  1. Isolate your data and register a new version of the Dataset, so that you can always roll back to a previous version of the Dataset (see Dataset Versioning Best Practices; a sketch follows this list).
  2. Use a wildcard/glob datapath to refer to a folder that has new data loaded into it on a regular basis. In this way you can have a Dataset that grows in size over time without having to re-register it.

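A minimal sketch of approach 1 (the dataset name and description are assumptions): registering with create_new_version=True keeps every prior version retrievable, so training code can take the latest data or pin an earlier version to roll back.

# register an updated snapshot as a new version of the same dataset
weather_ds.register(workspace=ws,
                    name="weather",                              # hypothetical dataset name
                    description="weather data, refreshed nightly",
                    create_new_version=True)

# training code can take the latest version, or pin an older one to roll back
latest = Dataset.get_by_name(ws, name="weather")                 # version defaults to 'latest'
pinned = Dataset.get_by_name(ws, name="weather", version=1)      # explicitly use version 1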