Is there a checksum algorithm that also supports "subtracting" data from it?


Problem description


I have a system with roughly 100 million documents, and I'd like to keep track of their modifications between mirrors. To exchange information about modifications efficiently, I want to send one summary of modified documents per day rather than one per document. Something like this:

[ 2012/03/26, cs26],
[ 2012/03/25, cs25],
[ 2012/03/24, cs24],
...

where each cs is the checksum of timestamps of all documents created on a particular day.

Now, the problem I'm running into is that I don't know of an algorithm that could "subtract" data from the checksum when a document is being deleted. None of the cryptographic hashes fit the need, for obvious reasons, and I couldn't find any algorithms for CRC that would do this.

One option I considered was to have deletes add extra information to the hash, but this would lead to even more problems: nodes can receive delete requests in different orders, and when a node restarts it re-reads all the timestamps from the documents, so the information about the deletes would be lost.

I also wouldn't like using a hash tree with all document hashes in-memory, as that would use roughly 8 gigs of memory, and I think it's a bit of overkill for just this need.

For now the best option seems to be regenerating these hashes completely from time to time in the background, but that is also a lot of needless overhead, and it wouldn't provide immediate information on changes.

So, do you guys know of a checksum algorithm that would let me "remove" some data from the checksum? I need the algorithm to be reasonably fast, and the checksum to strongly indicate even the smallest changes (that's why I can't really use plain XOR).

Or maybe you have better ideas about the whole design?

Solution

How about

hash = X(documents, 0, function(document) { ... })

where X is an aggregate XOR (javascript-y pseudocode follows):

function X(documents, x, f)
{
   // Fold every per-document hash into the running value with XOR.
   for (const document of documents)
   {
      x ^= f(document);
   }
   return x;
}

and f() is a hash of individual document information? (whether timestamp or filename or ID or whatever)

XOR is its own inverse, so XORing a document's hash into the aggregate a second time "subtracts" that document again, while hashing each document individually preserves the hash-like quality of detecting small changes.
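
To make the idea concrete, here is a minimal Node.js sketch of a per-day XOR aggregate with incremental add and remove. It is not from the original answer: `docHash`, `DayChecksum`, the `id`/`timestamp` fields, and the choice of truncating SHA-256 to 64 bits are all assumptions made for illustration, and any well-mixing per-document hash could stand in for f().

```javascript
const crypto = require('crypto');

// Hypothetical per-document hash f(): the first 8 bytes of a SHA-256 digest
// over the document's id and timestamp, read as an unsigned 64-bit BigInt.
function docHash(doc) {
  const digest = crypto
    .createHash('sha256')
    .update(`${doc.id}:${doc.timestamp}`)
    .digest();
  return digest.readBigUInt64BE(0);
}

// Running XOR aggregate for one day (hypothetical helper class).
class DayChecksum {
  constructor() {
    this.value = 0n;
  }

  // Fold a document's hash into the aggregate.
  add(doc) {
    this.value ^= docHash(doc);
  }

  // XOR is its own inverse, so removal is the same operation as addition.
  remove(doc) {
    this.value ^= docHash(doc);
  }
}

const docA = { id: 'doc-a', timestamp: '2012-03-26T08:00:00Z' };
const docB = { id: 'doc-b', timestamp: '2012-03-26T09:30:00Z' };
const docC = { id: 'doc-c', timestamp: '2012-03-26T17:45:00Z' };

// One mirror sees three creations followed by a delete of docB.
const mirror1 = new DayChecksum();
mirror1.add(docA);
mirror1.add(docB);
mirror1.add(docC);
mirror1.remove(docB);

// Another mirror only ever saw docC and docA, in a different order.
const mirror2 = new DayChecksum();
mirror2.add(docC);
mirror2.add(docA);

console.log(mirror1.value === mirror2.value); // true
```

Because XOR is commutative and associative, the order in which mirrors apply adds and removes does not matter, and each day needs only one 64-bit value in memory. One caveat of this sketch: applying the same add or remove twice cancels itself out, so the caller has to ensure each operation is applied exactly once per document.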
