Uncompressing zipped data files before committing to repository


Question


Does it make any sense to somehow store an "uncompressed" version of normally-compressed files in the repository?

If so, is there a standard way to implement this?
(Perhaps a standard pre-commit hook that uncompresses each such file into a specially named folder, and a post-checkout hook that compresses those specially named folders back into the compressed files that LibreOffice knows how to read and write? Something like the process described in "Should I decompress zips before I archive?")
(Or perhaps hacking the version control software itself to automatically decompress the old and new versions and store the diff between the decompressed files, and, if that fails or offers no significant improvement, to fall back on the original behavior of storing the direct diff between the original files, or simply storing the file directly?)

I have a collection of OpenOffice / LibreOffice files that are frequently edited. I am storing them in a version-control repository, as recommended by "Should images be stored in a git repository?", although I happen to use TortoiseHg or SourceTree to access my repositories rather than git.

I happen to know that OpenOffice files are actually zip-compressed containers with a few XML files inside. (I hear that many other popular application "binary file formats" are also some form of zip-compressed file.)
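
To make the container structure concrete, here is a minimal Python sketch that builds a tiny stand-in archive in memory and lists its members. The member contents are hypothetical and far simpler than a real document; only the general layout (a zip holding a `mimetype` entry plus XML files) reflects what the question describes.

```python
import io
import zipfile


def build_sample_document() -> bytes:
    """Build a tiny stand-in for an .odt file (hypothetical contents, not a valid document)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", "<doc><p>Hello world</p></doc>")
        zf.writestr("styles.xml", "<styles/>")
    return buf.getvalue()


def list_members(data: bytes) -> list[str]:
    """Return the member names of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


sample = build_sample_document()
members = list_members(sample)
print(members)
```

Pointing `zipfile.ZipFile` at a real document on disk instead of the in-memory sample should list its actual members in the same way.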

My understanding is that even the smallest change to such a "binary" file causes the entire new file to be stored in the repository, whereas a small change to a "text" file causes only the change to be stored and transmitted.

In theory, storing the uncompressed version would have the following advantages:

  • Where the change is only a few words, I could see the exact words that changed in the "diff" view in the change log (rather than the uninformative "binary file changed" message).
  • When several different people independently edit version 14 of a file, it is much easier to merge all of their improvements into version 16 of the file without regression.
  • Faster synchronization with the remote repository: only short "changes" need to be transmitted, rather than the entire (compressed) file.
  • Possibly a smaller repository in terms of disk space: after a few hundred changes, I expect a relatively small repository that contains only a few hundred small changes, rather than a relatively large repository that contains a few hundred complete copies of these files. (I list this advantage last because it is nearly irrelevant in these days of cheap disk space.)
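
As an illustration of the first advantage, the following Python sketch diffs two versions of an uncompressed XML member with the standard `difflib` module. The XML strings and the version labels are made up for the example. Real `content.xml` files may contain few line breaks, in which case a line-based diff would be coarser than shown here.

```python
import difflib

# Hypothetical uncompressed XML for two versions of the same document.
old_xml = "<doc>\n<p>The meeting is on Monday.</p>\n<p>Bring your notes.</p>\n</doc>\n"
new_xml = old_xml.replace("Monday", "Tuesday")

# A line-based unified diff, similar to what a version control diff view shows.
diff = list(
    difflib.unified_diff(
        old_xml.splitlines(keepends=True),
        new_xml.splitlines(keepends=True),
        fromfile="content.xml (version 14)",
        tofile="content.xml (version 15)",
    )
)
print("".join(diff))
```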

Solution

Does it make any sense to somehow store an "uncompressed" version of normally-compressed files in the repository?

It makes sense, especially if you need branching and diffing.

This old thread summarizes the situation.

  1. For OpenOffice documents whose size is dominated by embedded images and other large objects, the git delta mechanism already performs reasonably well, since OO files are zip archives in which each file is compressed separately.
    If you do not change an image, that image remains stored in the same way and the delta can be computed.
  2. For OO documents whose size is dominated by plain content, the git delta mechanism cannot work well, since the zip compression introduces "mixing": a small change in the document is converted into a very large change in the zip file.
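
The "mixing" effect in the second point can be explored with a short Python sketch. It uses `zlib`, which implements DEFLATE, the method zip archives commonly use. The sample text is hypothetical, and the position-by-position comparison is a crude measure rather than a real delta algorithm, so treat the printed numbers as indicative only.

```python
import zlib

# Hypothetical document text: 500 similar but not identical paragraphs.
original = " ".join(
    f"Paragraph {i}: the quick brown fox jumps over the lazy dog {i * 7919 % 1000}."
    for i in range(500)
).encode("utf-8")

# A five-character edit near the start of the document.
edited = original.replace(b"quick", b"QUICK", 1)


def count_differing(a: bytes, b: bytes) -> int:
    """Count positions where two byte strings differ, plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


packed_original = zlib.compress(original)
packed_edited = zlib.compress(edited)

print("plain bytes changed:     ", count_differing(original, edited), "of", len(original))
print("compressed bytes changed:", count_differing(packed_original, packed_edited), "of", len(packed_original))
```

The plain texts differ in only the five edited bytes. For the compressed streams, the changed fraction is typically much larger, because an edit alters the encoding of the data that follows it.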

It should be possible to write a clean filter that uncompresses the file before commit.
However, there is a catch with the complementary smudge filter used at checkout. If you do not smudge properly, git always shows the file as changed with respect to the index.
Smudging correctly would mean using the very same compression ratio and compression method that OO uses, which can be a little tricky. I have tried using the zip binary in both the clean and the smudge phases, and it does not work nicely: the smudged file is always different from the original one.
One should probably work at a lower level to have finer control over what is happening (for example with libzip), and prepend to the uncompressed file the compression parameters to be restored on smudging.
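
A minimal Python sketch of one possible clean step follows. It is not the approach the answer tested (the zip binary) nor the libzip idea: instead it repacks the archive with every member stored uncompressed and with a fixed timestamp, so the result no longer depends on how the original was compressed. The function names `clean` and `run_as_filter` and the sample archive are hypothetical, and wiring the script into git's `filter.<name>.clean` setting and `.gitattributes` is not shown.

```python
import io
import sys
import zipfile

# Constant timestamp so the output does not depend on when the file was saved.
FIXED_DATE = (1980, 1, 1, 0, 0, 0)


def clean(data: bytes) -> bytes:
    """Repack a zip archive with every member stored uncompressed, in the original order."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, zipfile.ZipFile(
        out, "w", zipfile.ZIP_STORED
    ) as dst:
        for info in src.infolist():
            member = zipfile.ZipInfo(info.filename, date_time=FIXED_DATE)
            member.compress_type = zipfile.ZIP_STORED
            member.external_attr = info.external_attr
            dst.writestr(member, src.read(info))
    return out.getvalue()


def run_as_filter() -> None:
    """Filter-style entry point: read an archive on stdin, write the cleaned one to stdout."""
    sys.stdout.buffer.write(clean(sys.stdin.buffer.read()))


def build_archive(compresslevel: int) -> bytes:
    """Build a hypothetical sample archive at the given DEFLATE level."""
    buf = io.BytesIO()
    with zipfile.ZipFile(
        buf, "w", zipfile.ZIP_DEFLATED, compresslevel=compresslevel
    ) as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", "<doc>" + "<p>Hello world</p>" * 200 + "</doc>")
    return buf.getvalue()


fast = clean(build_archive(1))
best = clean(build_archive(9))
print(fast == best)
```

Because the members are stored rather than deflated, the XML appears verbatim in the cleaned bytes, which is what lets a delta mechanism see small changes. Whether the application accepts an archive whose members are all stored, and how much metadata (comments, extra fields) is acceptable to drop, are assumptions to verify before relying on this.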

The bigger issue, however, is that the clean/smudge step can be really slow when dealing with large OO files.
