Bloomfilter和Cassandra =为什么使用和为什么散列几次? [英] Bloomfilter and Cassandra = Why used and why hashed several times?

查看:211
本文介绍了Bloomfilter和Cassandra =为什么使用和为什么散列几次?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我阅读:
http:/ /spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html



我的问题:



1。)是正确的,Cassandra只使用bloom过滤器,找出最可能包含key的SST(Sorted String Table)因为可能有几个SST,Cassandra不知道在哪个SST可能是一个密钥?所以要加快这一点,在所有的SST使用bloomfilters。它是否正确? (我试图理解cassandra的工作原理...)



2。)为什么(如上面的链接中所解释的)键散列几次?是正确的,密钥需要用不同的哈希函数哈希多次,以获得更好的随机分布的位?如果这是错误的,为什么一个密钥需要散列几次?这将花费CPU周期?如果我有几个哈希函数的输出,对结果做的是,它们被AND或XOR。这是否有什么区别?



3.)使用MD5,与SHA1相比,Fales positives by using Bloomfilter的区别有多大随机分布)?为什么MD5不是随机分布的?



非常感谢!
Jens

解决方案

1)是的,请参阅 在cassandra维基中,


<每个SSTable都有一个与Cassandra在进行任何磁盘寻道之前检查的Bloom过滤器相关联,对几乎不存在的密钥进行查询。


键的列可以分布在几个sstables中。如果不是用于bloom过滤器,每次读取一个键将不得不读取每个sstable,这是非常昂贵的。通过使用bloom过滤器,cassandra几乎总是只需要查找包含该键的数据的sstables。



2)可能会让您更好地了解bloom过滤器。你哈希k次给一个大小为m的数组中的独立位置。例如,如果A和B是集合中的元素,并且你有k = 2,你的散列函数是md5和sha1,m = 16,你可以做

  md5(A)%m = 7 
sha1(A)%m = 12

md5(B)%m = 15 $ b $这个给你m [7],m [12],m [12]



要查看C是否在集合中,请执行

  md5(C)%m = 8 
sha1(C)%m = 12

由于m [8]为假,你知道C不在集合中,但是对于D

 code> md5(D)%m = 7 
sha1(D)%m = 15


$ b b

m [7]和m [15]都为真,但是D不在集合中,所以D是假阳性。



cpu周期,但你正在交易cpu周期减少io,这是有意义的cassandra。



3)文章没有提到md5。 md5是随机分布的,我猜测md5和sha-1之间的差异对于布隆过滤器不大。


I Read this: http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html

My Questions:

1.) Is it correct, that Cassandra only uses the bloom filter, to find out the SST (Sorted String Table) which most likely contains the key? As there might be several SSTs and Cassandra does not know in Which SST a key might be? So to speed this up looking in all SSTs bloomfilters are used. Is this correct? (I am trying to understand how cassandra works...)

2.) Why are (as explained in the link above) keys hashed several times? Is it correct that the keys need to be hashed with different Hash functions several times, to get a better "random distribution of the" Bits? If this is wrong, why does a key need to be hashed several times? This will cost CPU cycles? If I have the output of several Hash functions, what is done with the results, are they ANDed or XORded. Does this make any difference?

3.)Using MD5 how big is the difference of "Fales positives by using the Bloomfilter" compared to SHA1 (which according to the articles is random distributed)? Why is MD5 not random distributed?

Thanks very much!! Jens

解决方案

1) Yes, see this in the cassandra wiki,

Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free

The columns of a key may be spread out in several sstables. If it wasn't for bloom filters, every read of a key would have to read every sstable, which is prohibitively expensive. By using bloom filters, cassandra almost always only has to look in the sstables which contain data for that key.

2) This might give you a better understanding of bloom filters. You hash k times to give independent positions in an array of size m. For example, if A and B are the elements in the set, and you have k = 2, your hash functions are md5 and sha1, and m = 16, you can do

md5(A) % m = 7
sha1(A) % m = 12

md5(B)  % m = 15
sha1(B)  % m = 12

This gives you m[7], m[12] and m[15] are true for the filter.

To see if C is in the set, you do

md5(C)  % m = 8
sha1(C) % m = 12

Since m[8] is false, you know C is not in the set, however, for D

md5(D)  % m = 7
sha1(D)  % m = 15

Both m[7] and m[15] is true, but D is not in the set, so D is a false positive.

This does cost cpu cycles, but you are trading cpu cycles for reduced io, which makes sense for cassandra.

3) The article doesn't mention md5. md5 is randomly distributed, and I would guess the difference between md5 and sha-1 for bloom filters is not large.

这篇关于Bloomfilter和Cassandra =为什么使用和为什么散列几次?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆