Algorithm for efficient diffing of huge files


Question

I have to store two files A and B which are both very large (like 100 GB). However, B is likely to be similar in large parts to A, so I could store A and diff(A, B). There are two interesting aspects to this problem:


  1. The files are too big to be analyzed by any diff library I know of, because those libraries work in-memory.

  2. I don't actually need a diff. A conventional diff has inserts, edits, and deletes because it is meant to be read by humans. I can get away with less information: I only need "new range of bytes" and "copy bytes from the old file at an arbitrary offset".

I am currently at a loss as to how to compute the delta from A to B under these conditions. Does anyone know of an algorithm for this?

Again, the problem is simple: write an algorithm that can store files A and B in as few bytes as possible, given that the two are quite similar.

Additional info: Although big parts might be identical, they are likely to be at different offsets and out of order. This last fact is why a conventional diff might not save much.
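To make the target format concrete, here is a minimal sketch of such a delta, assuming just two instruction kinds (the class and function names are illustrative, not taken from any particular library): copy a byte range from the old file A, or insert literal bytes that only exist in B.

```python
from dataclasses import dataclass

@dataclass
class Copy:
    offset: int   # byte offset into the old file A
    length: int   # number of bytes to copy from A

@dataclass
class Insert:
    data: bytes   # literal bytes that appear only in B

def apply_delta(old_path: str, delta: list, new_path: str) -> None:
    """Rebuild B by replaying the instruction list against A."""
    with open(old_path, "rb") as old, open(new_path, "wb") as out:
        for op in delta:
            if isinstance(op, Copy):
                old.seek(op.offset)
                out.write(old.read(op.length))
            else:
                out.write(op.data)
```

Whatever produces the delta only has to emit these two instructions; the consumer rebuilds B from A plus the delta, so storing A and the delta is enough.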

Answer

Take a look at rsync's algorithm: it is designed to do pretty much exactly this, so it can copy deltas efficiently. And the algorithm is pretty well documented, as I recall.
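For reference, the core of rsync's approach is: split A into fixed-size blocks, index each block by a cheap "weak" rolling checksum plus a strong hash, then slide a byte-at-a-time window over B; wherever a window matches a block of A, emit a copy instruction, and everything in between becomes literal inserts. A simplified in-memory sketch follows (the block size, the weak checksum, and the in-memory signature dict are illustrative assumptions; real rsync/librsync use a rolling Adler-32-style checksum plus MD5 and stream the data, so 100 GB files never need to fit in memory):

```python
import hashlib

BLOCK = 4096  # assumed block size; rsync chooses this adaptively

def weak_sum(data: bytes) -> tuple:
    """Cheap checksum; in real rsync it is updated in O(1) as the window slides."""
    a = sum(data) % 65536
    b = sum((len(data) - i) * x for i, x in enumerate(data)) % 65536
    return a, b

def make_signature(old: bytes) -> dict:
    """Index every fixed-size block of A: weak checksum -> [(strong hash, offset)]."""
    sig = {}
    for off in range(0, len(old), BLOCK):
        block = old[off:off + BLOCK]
        sig.setdefault(weak_sum(block), []).append((hashlib.md5(block).digest(), off))
    return sig

def make_delta(old: bytes, new: bytes) -> list:
    """Scan B; emit ('copy', offset_in_A, length) and ('insert', literal_bytes)."""
    sig = make_signature(old)
    delta, literal, i = [], bytearray(), 0
    while i + BLOCK <= len(new):
        window = new[i:i + BLOCK]
        # Recomputed here for clarity; the rolling property is what makes
        # checking every byte offset of B affordable in practice.
        key = weak_sum(window)
        match = None
        if key in sig:
            strong = hashlib.md5(window).digest()
            match = next((off for s, off in sig[key] if s == strong), None)
        if match is not None:
            if literal:
                delta.append(("insert", bytes(literal)))
                literal.clear()
            delta.append(("copy", match, BLOCK))
            i += BLOCK
        else:
            literal.append(new[i])
            i += 1
    literal.extend(new[i:])  # trailing bytes that never formed a full block
    if literal:
        delta.append(("insert", bytes(literal)))
    return delta
```

Because matching is done block-by-block against a hash table of A's blocks, identical regions are found even when they sit at different offsets in B or appear in a different order, which is exactly why a conventional line-based diff falls short here.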
