具有版本控制的HDF5文件(h5py)-每次保存时哈希更改 [英] HDF5 file (h5py) with version control - hash changes on every save

查看:71
本文介绍了具有版本控制的HDF5文件(h5py)-每次保存时哈希更改的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用h5py将来自数字工作的中间数据存储在HDF5文件中.我的项目受版本控制,但是对于HDF5文件来说效果不佳,因为每次重新运行生成HDF5文件的脚本时,即使其中的数据不正确,二进制文件也会更改.

这里有一个小例子来说明这一点:

 在[1]中:导入h5py,numpy作为np在[2]中:A = np.arange(5)在[3]中:f = h5py.File('test.h5','w');f ['A'] = A;f.close()在[4]中:!md5sum test.h57d27c258d94ed5d06736f6d2ba7c9433 test.h5在[5]中:f = h5py.File('test.h5','w');f ['A'] = A;f.close()在[6]中:!md5sum test.h5c1db5806f1393f2095c88dbb7efeb7d3 test.h5在[7]中:#文件已更改,但仍包含相同的数据! 

我查看了HDF5文件格式的文档和h5py文档,但没有找到任何有助于我的内容.我的问题是:

  1. 即使我保存相同的数据,为什么文件也会更改?

  2. 如何停止更改,因此版本控制仅在实际数字内容更改后才能看到文件的新版本?

谢谢

解决方案

HDF5文件同时使用抽象数据模型和抽象存储模型.这意味着文件在磁盘上的存储方式可能(通常)与程序中的表示方式完全不同.可以用一种以上的方式存储完全相同的数据,而这对于您的程序而言并不明显.

HDF5文件格式存储规范允许多个时间戳记在数据对象标题中.这些没有存储为属性,因此高级API通常无法访问它们.可以关闭使用低级HDF5 API编写这些时间戳的方法,但是尚不清楚h5py中是否包含相关功能.此github问题似乎正是您想要的,但不幸的是它仍然是公开的./p>

I am using h5py to store intermediate data from numerical work in an HDF5 file. I have the project under version control, but this doesn't work well with the HDF5 files because every time a script is re-run which generates a HDF5 file, the binary file changes even if the data within does not.

Here is a small example to illustrate this:

In [1]: import h5py, numpy as np

In [2]: A = np.arange(5)

In [3]: f = h5py.File('test.h5', 'w'); f['A'] = A; f.close()

In [4]: !md5sum test.h5
7d27c258d94ed5d06736f6d2ba7c9433  test.h5

In [5]: f = h5py.File('test.h5', 'w'); f['A'] = A; f.close()

In [6]: !md5sum test.h5
c1db5806f1393f2095c88dbb7efeb7d3  test.h5

In [7]: # the file has changed but still contains the same data!

I have looked in the HDF5 file format documents and at the h5py documentation but haven't found anything which helps me with this. My questions are:

  1. Why does the file change even though I'm saving the same data?

  2. How can I stop it changing, so version control only sees a new version of the file when the actual numerical content has changed?

Thanks

解决方案

The HDF5 file uses both an abstract data model as well as an abstract storage model. What this means is that how a file is stored on disk may be (and usually is) completely different to how it is represented in your program. It's possible to store exactly the same data in more than one way, and for this not to be apparent to your program.

The HDF5 file format storage specification allows for several timestamps in the data object headers. These are not stored as attributes, and so aren't usually accessible by the high level APIs. It's possible to turn off writing these timestamps using the low level HDF5 APIs, but it's not clear if the relevant features are in h5py. This github issue appears to be exactly what you want, but unfortunately it is still open.

这篇关于具有版本控制的HDF5文件(h5py)-每次保存时哈希更改的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆