hash() yields different results for different platforms


Problem description


I'm writing a spider. I have millions of URLs in a MySQL table, which I use to
check whether a URL has already been fetched. To make the check fast, I am
considering adding a "hash" column to the table, making it a unique key,
and using the following SQL statement to add a new URL:

    insert ignore into urls (url, hash) values (newurl, hash_of_newurl)

I believe this will be faster than making the "url" column a unique key
and doing string comparisons. Right?

However, when I tried Python's builtin hash() function, I found that it
produces different values on my two computers! On a Pentium 4,
hash('a') -> -468864544; on an amd64, hash('a') -> 12416037344. Does the
hash function depend on the machine's word length?

If it does, I must consider another hash algorithm, because the spider
will run concurrently on several computers, some 32-bit and some
64-bit. Is md5 a good choice? Or will it be so slow that I gain nothing
over using the "url" column directly as the unique key?

I will do some benchmarking to find out. But while getting my hands
dirty, I would like to hear some advice from the experts here. :)

Solutions

"Qiangning Hong" <ho****@gmail.com> writes:

> However, when I tried Python's builtin hash() function, I found that it
> produces different values on my two computers! On a Pentium 4,
> hash('a') -> -468864544; on an amd64, hash('a') -> 12416037344. Does the
> hash function depend on the machine's word length?

The hash function is unspecified and can depend on anything the
implementers feel like. It may(?) even be permitted to differ from
one run of the interpreter to another (I haven't checked the spec for
this). Don't count on it being consistent from machine to machine.
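
A minimal sketch of a digest that does not depend on the machine, assuming Python 3 and the standard hashlib module. The helper name url_digest and the choice of UTF-8 encoding are illustrative assumptions, not something from the thread. Unlike the builtin hash(), an MD5 digest is defined purely by the input bytes, so 32-bit and 64-bit hosts should agree on it.

```python
import hashlib


def url_digest(url: str) -> str:
    """Return the MD5 hex digest of a URL (hypothetical helper name).

    The URL is encoded as UTF-8 first, so every machine hashes the
    same byte sequence regardless of word size.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# The digest of the one-character string from the thread.
print(url_digest("a"))  # 0cc175b9c0f1b6a831c399e269772661
```

On current CPython releases, string hashes from the builtin hash() are additionally randomized per process unless PYTHONHASHSEED is set, which is one more reason not to store them.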

> If it does, I must consider another hash algorithm, because the spider
> will run concurrently on several computers, some 32-bit and some
> 64-bit. Is md5 a good choice? Or will it be so slow that I gain nothing
> over using the "url" column directly as the unique key?

If you're going to accept the overhead of an SQL database, you might as
well enjoy the abstraction it gives you, rather than implementing what
amounts to your own form of indexing when the db can take care of it.
But md5(url) is certainly very fast compared with processing the
outgoing http connection that you presumably plan to open for each url.
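
The "let the db take care of it" approach could look roughly like the sketch below. It uses the standard sqlite3 module with an in-memory database purely as a runnable stand-in for MySQL. The table layout and the helper name add_url are assumptions made for illustration. SQLite spells the statement INSERT OR IGNORE, where MySQL uses the INSERT IGNORE form quoted in the question.

```python
import sqlite3


def add_url(conn: sqlite3.Connection, url: str) -> bool:
    """Insert url unless it is already present (hypothetical helper).

    Returns True if a row was added, False if the UNIQUE constraint
    caused the insert to be ignored.
    """
    cur = conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    return cur.rowcount == 1


# In-memory SQLite database standing in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT NOT NULL UNIQUE)")

print(add_url(conn, "http://example.com/"))  # True: first time seen
print(add_url(conn, "http://example.com/"))  # False: already fetched
```

The UNIQUE constraint makes the database build and consult its own index, so no hand-rolled hash column is involved.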

> I will do some benchmarking to find out.

That's the right way to answer questions like this.


On 2006-07-11, Qiangning Hong <ho****@gmail.com> wrote:

> I'm writing a spider. I have millions of URLs in a MySQL table, which I use to
> check whether a URL has already been fetched. To make the check fast, I am
> considering adding a "hash" column to the table, making it a unique key,
> and using the following SQL statement to add a new URL:
>
>     insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
>
> I believe this will be faster than making the "url" column a unique key
> and doing string comparisons. Right?

I doubt it will be significantly faster. Comparing two strings
and hashing a string are both O(N).

> However, when I tried Python's builtin hash() function, I
> found that it produces different values on my two computers! On a
> Pentium 4, hash('a') -> -468864544; on an amd64, hash('a') ->
> 12416037344. Does the hash function depend on the machine's word
> length?

Apparently. :)

The low 32 bits match, so perhaps you should just use that
portion of the returned hash?

>>> hex(12416037344)
'0x2E40DB1E0L'
>>> hex(-468864544 & 0xffffffffffffffff)
'0xFFFFFFFFE40DB1E0L'
>>> hex(12416037344 & 0xffffffff)
'0xE40DB1E0L'
>>> hex(-468864544 & 0xffffffff)
'0xE40DB1E0L'
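
The masking arithmetic from the session above can be reproduced in Python 3 as a small sketch. The two integers are the hash('a') values reported in the question. The names MASK32 and low32 are illustrative assumptions. This only demonstrates the bit manipulation: on current CPython, string hashes are randomized per process by default, so the low bits of hash() are not a stable key either.

```python
MASK32 = 0xFFFFFFFF  # keeps only the low 32 bits of an integer


def low32(value: int) -> int:
    """Return the low 32 bits of value as a non-negative integer."""
    return value & MASK32


hash_on_32bit = -468864544   # hash('a') reported on the Pentium 4
hash_on_64bit = 12416037344  # hash('a') reported on the amd64

print(hex(low32(hash_on_32bit)))  # 0xe40db1e0
print(hex(low32(hash_on_64bit)))  # 0xe40db1e0
```

Python 3 prints lowercase hex digits and has no trailing L, which is why the output differs cosmetically from the 2006 session.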

-- 
Grant Edwards                   grante             Yow! Uh-oh!! I forgot
                                  at               to submit to COMPULSORY
                               visi.com            URINALYSIS!


Grant Edwards wrote:

> On 2006-07-11, Qiangning Hong <ho****@gmail.com> wrote:
>
>> I'm writing a spider. I have millions of URLs in a MySQL table, which I use to
>> check whether a URL has already been fetched. To make the check fast, I am
>> considering adding a "hash" column to the table, making it a unique key,
>> and using the following SQL statement to add a new URL:
>>
>>     insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
>>
>> I believe this will be faster than making the "url" column a unique key
>> and doing string comparisons. Right?
>
> I doubt it will be significantly faster. Comparing two strings
> and hashing a string are both O(N).

Playing Devil's Advocate: The hash would be a one-time operation during
database insertion, whereas string comparison would happen on every
search. Conceivably, comparing hash values (which is O(1)) could
result in big savings compared to comparing regular strings; but I
expect most decent SQL implementations already hash data internally, so
rolling your own hash would be useless at best.

If the OP's database is lacking, md5 is probably fine. Perhaps using a
subset of the md5 (the low 32 bits, say) could speed up comparisons at
the risk of more collisions. Probably a good trade-off unless the DB is
humungous.

Carl Banks
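
The "low 32 bits of the md5" idea might be sketched as follows, assuming Python 3 and hashlib. The helper names md5_low32 and expected_colliding_pairs are hypothetical. Reading the digest as a big-endian 128-bit integer is an arbitrary convention chosen here. The second function is the usual birthday-bound approximation, included only to put a rough number on "risk of more collisions".

```python
import hashlib


def md5_low32(url: str) -> int:
    """Return the low 32 bits of the MD5 digest of url (hypothetical helper).

    The 16-byte digest is read as a big-endian integer, so the low
    32 bits are simply its last four bytes.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[-4:], "big")


def expected_colliding_pairs(n: int, bits: int = 32) -> float:
    """Approximate expected number of colliding pairs among n random keys."""
    return n * (n - 1) / 2 / 2**bits


print(hex(md5_low32("a")))                  # 0x69772661
print(expected_colliding_pairs(1_000_000))  # roughly 116 pairs
```

Under these assumptions, a million URLs would already be expected to produce on the order of a hundred collisions, so a 32-bit value seems better suited to a non-unique index checked against the full URL than to a unique key combined with insert ignore.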

