By how much can I approximately reduce disk volume by using DVC?


Question

I want to classify ~1M+ documents and have a version control system for the input and output of the corresponding model.

The data changes over time:

  • The sample size increases over time
  • New features might appear
  • The anonymization procedure might change over time

So basically "everything" might change: the number of observations, the features, and the values. We are interested in making the ML model building reproducible without using 10/100+ GB of disk volume, because we save all updated versions of the input data. Currently the data is ~700 MB in size.

The most promising tool I found is https://github.com/iterative/dvc. Currently the data is stored in a database and loaded into R/Python from there.

Question:

How much disk volume can be saved (very approximately) by using DVC?

That is, if one can roughly estimate it at all. I tried to find out whether only the "diffs" of the data are saved. I didn't find much information by reading through https://github.com/iterative/dvc#how-dvc-works or other documentation.

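I am aware that this is a very vague question and that the answer will depend heavily on the dataset. However, I would still be interested in getting a very approximate idea.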

Recommended answer

Let me try to summarize how DVC stores data. I hope you'll be able to figure out from this how much space will be saved or consumed in your specific scenario.

DVC stores and deduplicates data at the individual file level. So what does that usually mean from a practical perspective?

I will use dvc add as an example, but the same logic applies to all commands that save data files or directories into the DVC cache: dvc add, dvc run, etc.

Let's imagine I have a single 1 GB XML file. I start tracking it with DVC:

$ dvc add data.xml

On a modern file system (or if hardlinks or symlinks are enabled; see the DVC documentation for details), we still consume 1 GB after this command, even though the file is moved into the DVC cache and is still present in the workspace.

Now, let's change it a bit and save it again:

$ echo "<test/>" >> data.xml
$ dvc add data.xml

In this case we will have 2 GB consumed. DVC does not compute a diff between two versions of the same file, nor does it split files into chunks or blocks to detect that only a small portion of the data has changed.

To be precise, it calculates the md5 of each file and saves the file in a content-addressable key-value store. The md5 of the file serves as the key (the path of the file in the cache) and the value is the file itself:

(.env) [ivan@ivan ~/Projects/test]$ md5 data.xml
0c12dce03223117e423606e92650192c

(.env) [ivan@ivan ~/Projects/test]$ tree .dvc/cache
.dvc/cache
└── 0c
   └── 12dce03223117e423606e92650192c

1 directory, 1 file

(.env) [ivan@ivan ~/Projects/test]$ ls -lh data.xml
data.xml ----> .dvc/cache/0c/12dce03223117e423606e92650192c (some type of link)
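The storage scheme described above can be imitated in a few lines of Python. This is only a rough sketch of the idea, not DVC's actual code: the helper names (`cache_path`, `save_to_cache`, `cache_size`, `demo`) are made up for illustration, and the data is a small in-memory stand-in for the 1 GB file.

```python
import hashlib
import os
import tempfile


def cache_path(cache_dir, data):
    """Map content to a path: first two md5 hex digits form the directory, the rest the file name."""
    digest = hashlib.md5(data).hexdigest()
    return os.path.join(cache_dir, digest[:2], digest[2:])


def save_to_cache(cache_dir, data):
    """Store `data` under its md5-derived path; identical content is stored only once."""
    path = cache_path(cache_dir, data)
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as fh:
            fh.write(data)
    return path


def cache_size(cache_dir):
    """Sum the sizes of all files below `cache_dir`."""
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def demo():
    """Save a file, save it again unchanged, then save a slightly modified version."""
    with tempfile.TemporaryDirectory() as cache_dir:
        v1 = b"<row/>\n" * 1000        # stand-in for data.xml
        v2 = v1 + b"<test/>\n"         # the same file with one line appended
        save_to_cache(cache_dir, v1)
        save_to_cache(cache_dir, v1)   # unchanged content: no new object
        size_after_v1 = cache_size(cache_dir)
        save_to_cache(cache_dir, v2)   # different md5: a full second copy
        size_after_v2 = cache_size(cache_dir)
        return len(v1), len(v2), size_after_v1, size_after_v2
```

Calling `demo()` shows the point of scenario 1: re-adding identical content costs nothing, while appending a single line yields a different md5 and therefore a complete second copy in the cache.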

Scenario 2: Modifying directory

Let's now imagine we have a single large 1 GB directory images with a lot of files:

$ du -hs images
1GB

$ ls -l images | wc -l
1001

$ dvc add images

At this point we still consume 1 GB. Nothing has changed. But suppose we modify the directory by adding more files (or removing some of them):

$ cp /tmp/new-image.png images

$ ls -l images | wc -l
1002

$ dvc add images

In this case, after saving the new version, we are still close to 1 GB of consumption. DVC calculates the diff at the directory level. It won't save again all the files that already existed in the directory.
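To get a feel for the numbers, file-level deduplication across directory versions can be estimated with a small Python sketch. It assumes each distinct file content is stored exactly once, keyed by its md5, as described above. The function names and the toy data (1000 files of 1000 bytes each) are hypothetical, and the sketch ignores any small per-version directory listing that the real tool may keep.

```python
import hashlib


def object_store_size(versions):
    """Total bytes stored when every distinct file content is kept exactly once.

    `versions` is a list of directory snapshots; each snapshot maps a
    file name to that file's bytes.
    """
    stored = {}
    for snapshot in versions:
        for content in snapshot.values():
            stored[hashlib.md5(content).hexdigest()] = len(content)
    return sum(stored.values())


def naive_size(versions):
    """Total bytes stored if every snapshot were copied in full."""
    return sum(len(content) for snapshot in versions for content in snapshot.values())


def demo_directory():
    """Compare deduplicated and naive storage for two versions of a directory."""
    # Version 1: 1000 distinct files of 1000 bytes each (hypothetical toy data).
    v1 = {f"image-{i}.png": f"image-{i}".encode().ljust(1000, b"\0") for i in range(1000)}
    # Version 2: the same files plus one new 1000-byte file.
    v2 = dict(v1)
    v2["new-image.png"] = b"new-image".ljust(1000, b"\0")
    return object_store_size([v1, v2]), naive_size([v1, v2])
```

With this toy data, `demo_directory()` returns 1,001,000 bytes for the deduplicated store versus 2,001,000 bytes for two full copies. The same function gives a rough estimate for real data if you feed it your own snapshots: a dataset kept as one big file pays the full size for every version, while a dataset split into many files pays only for the files that changed.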

The same logic applies to all commands that save data files or directories into the DVC cache: dvc add, dvc run, etc.

Please let me know if this is clear or whether we need to add more details or clarifications.
