Getting a unique hash for two different string URLs that are actually the same


Problem description

I'm indexing some URLs by their hash codes and using those hashes to retrieve them. I have two questions about this:

  1. Do you think this is a good approach? Two different URLs can sometimes produce the same hash, but I don't seem to have any other choice, since URLs can be very long and I need to produce a file name for each of them.
  2. [More important] Sometimes two different URLs actually refer to the same page (e.g. http://www.stackoverflow.com and http://stackoverflow.com, and sometimes URLs with % characters), but I need to produce the same hash code for these URLs. What do you suggest?

Thanks.

Solution

After a lot of discussion and thought, and since no answer completely addresses my questions, I'm going to answer my own question. The comment posted by Morten Mertner is the closest thing to my answer, but I cannot select a comment as the answer.

  1. I have no option other than a hash algorithm. But to reduce the risk of collisions, I should use a stronger algorithm, such as one from the SHA-2 family.
  2. As Morten Mertner said, in some cases the URLs mentioned are NOT actually the same, and I cannot assume that the website is configured correctly. The only thing I can do is remove the bookmarks and consistently use either the encoded or the decoded version of the URL (the version with or without % characters).
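The two points above could be sketched roughly as follows in Python. This is only an illustration under a few assumptions, not the code I actually used: "bookmarks" is taken to mean the `#fragment` part of the URL, the encoded form is chosen as the canonical one, and the helper names `normalize_url` and `url_to_filename` are made up for this example. The `www.` prefix is deliberately left alone, since the two hosts are not guaranteed to serve the same page.

```python
import hashlib
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop the fragment and bring the path to a single percent-encoded form.

    Hypothetical helper. The query string is left untouched, and the
    "www." prefix is NOT stripped, because those changes could alter
    which page the URL refers to.
    """
    parts = urlsplit(url.strip())
    # Decode, then re-encode, so "/a%20b" and "/a b" become the same string.
    # Caveat: an encoded slash (%2F) in the path ends up as a literal "/".
    path = quote(unquote(parts.path), safe="/")
    # The empty last element removes the fragment (the "bookmark").
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def url_to_filename(url: str) -> str:
    """Return a fixed-length SHA-256 hex digest usable as a file name."""
    canonical = normalize_url(url)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    for u in ("http://example.com/a%20b#top", "http://example.com/a b"):
        print(u, "->", url_to_filename(u))
```

With this sketch, the two example URLs in the `__main__` block map to the same 64-character name, while http://www.stackoverflow.com and http://stackoverflow.com still map to different names.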

Thanks for all of the help guys.
