index on url or hashing considering RAM


Problem Description

I am working on a project which needs to add or update around 1 million URLs daily. Some days are mostly updates, some days are mostly additions, and some days are a mix.

So, every query needs to check the URL's uniqueness against the url table.

How can the URL lookup be made really fast? At the moment an index is set on the url column and it works well, but in the coming weeks RAM will not be enough if the index stays on that column while new records are added by the millions.

That's why I am looking for a solution so that lookups stay fast when there are 150+ million URLs in total. I am thinking of indexing an MD5 of the URL, but then I worry about collision chances. A friend tipped me to also calculate a CRC32 hash and concatenate it with the MD5 to make the collision possibility near zero, and to store it in a BINARY(20) column; that way the index would take only 20 bytes per row instead of the 255 of the current VARCHAR(255) url column.
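A minimal sketch of how such a 20-byte key could be built at the application level (the table and column names in the comment are illustrative, not from the question):

```python
import hashlib
import struct
import zlib

def url_key(url: str) -> bytes:
    """Build a 20-byte key: 16-byte MD5 digest + 4-byte big-endian CRC32."""
    data = url.encode("utf-8")
    md5_digest = hashlib.md5(data).digest()     # 16 bytes
    crc = zlib.crc32(data) & 0xFFFFFFFF         # unsigned 32-bit value
    return md5_digest + struct.pack(">I", crc)  # 20 bytes total

# The key would go into a BINARY(20) column, e.g. (hypothetical schema):
#   CREATE TABLE urls (
#       url_hash BINARY(20) NOT NULL,
#       url      VARCHAR(255) NOT NULL,
#       PRIMARY KEY (url_hash)
#   );
key = url_key("http://www.somesite.com.tv/images/picture01.jpg")
assert len(key) == 20
```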

Currently there are around 50 million URLs in total, and with 8 GB of RAM it works fine.

Yesterday, I asked a related question, url text compression (not shortening) and storing in mysql, about the same project.

I have thought of another solution: store only the CRC32 hash, in decimal form, to speed up the lookup, and add a check at the application level on how many records are returned. If more than one record is returned, the exact URL should also be matched. That way collisions would be avoided while keeping the load on RAM and disk space low, since each row stores 4 bytes instead of 20 bytes (MD5 + CRC32). What do you say?
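A hedged sketch of that lookup strategy, assuming a hypothetical urls table with an indexed INT UNSIGNED url_crc column (names and schema are illustrative):

```python
import zlib

def url_exists(cursor, url: str) -> bool:
    """Uniqueness check via an indexed 4-byte CRC32 column, falling back to
    an exact string comparison only for the (rare) rows whose CRC collides.
    `cursor` is any DB-API cursor, e.g. from pymysql or mysqlclient."""
    crc = zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF
    cursor.execute("SELECT url FROM urls WHERE url_crc = %s", (crc,))
    return any(row[0] == url for row in cursor.fetchall())
```

The same effect can also be pushed into a single query (`WHERE url_crc = %s AND url = %s`): MySQL still uses only the small url_crc index and compares the full string just for the matching rows.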

Recommended Answer

After reading all your questions (unique constraint makes hashes useless?, 512 bit hash vs 4 128bit hash, and url text compression (not shortening) and storing in mysql), I understood that your problem is more or less the following:

"I need to store 150M+ URLs in MySQL, using 8 GB of RAM, and still have good performance writing and retrieving them all, because I'll update them daily, so I'll retrieve a lot of URLs and check them against the database. Right now it has 50M URLs, and it will grow by about 1M each day over the following 3 months."

Is that right?

The following points are important: What is the format of the URLs you'll save? Will you need to read the URLs back, or will you just update information about them and never search based on partial URLs, etc.?

I'll assume URL = "http://www.somesite.com.tv/images/picture01.jpg" and that you want to store everything, including the filename. If it's different, please provide more details or correct my answer's assumptions.

1. You can save space by replacing some groups of characters in the URL. Not all ASCII characters are valid in a URL, as you can see in RFC 1738, so you can use the invalid ones to represent (and compress) the URL. For example: using character 0x81 to represent "http://" saves you 6 characters, 0x82 to represent ".jpg" saves another 3 bytes, etc.

Some words might be very common (like "image", "picture", "video", "user"). If you choose to use characters 0x90 through 0x9F plus any other character (so 0x90 0x01, 0x90 0x02, ..., 0x90 0xFA) to encode such words, you can have 16 * 256 = 4,096 "dictionary entries" to encode the most-used words, using 2 bytes to represent 4-8 characters.

Edit: as you can read in the RFC mentioned above, a URL may only contain printable ASCII characters. This means only characters 0x20 to 0x7F should be used, with some further restrictions noted in the RFC. So any character from 0x80 upward (hexadecimal notation; 128 in decimal in the ASCII table) shouldn't appear in a URL and is free for this purpose. You can choose one character (say, 0x90) as a flag meaning "the following byte is an index into the dictionary": one flag character (0x90) * 256 possible following bytes (0x00 to 0xFF) = 256 dictionary entries. But you can also choose to use characters 0x90 through 0x9F (144 to 159 in decimal) as dictionary flags, giving you 16 * 256 = 4,096 possibilities...

These two methods can save a lot of space in your database and are reversible, with no need to worry about collisions. You simply create a dictionary in your application and encode/decode URLs with it, very fast, making your database much lighter, as sketched below.
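A minimal sketch of such a reversible substitution codec, assuming a hand-picked token table (the entries below are illustrative, not from the answer):

```python
# Illustrative token table: each common substring maps to a single byte
# in the 0x80-0x8F range, which RFC 1738 guarantees never appears in a URL.
# For real use, list longer tokens before their shorter prefixes.
TOKENS = {
    "https://": 0x81,
    "http://": 0x82,
    "www.": 0x83,
    ".com": 0x84,
    ".jpg": 0x85,
    "/images/": 0x86,
}
DECODE = {code: text for text, code in TOKENS.items()}

def encode_url(url: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(url):
        for text, code in TOKENS.items():
            if url.startswith(text, i):
                out.append(code)
                i += len(text)
                break
        else:
            out.append(ord(url[i]))  # plain ASCII passes through unchanged
            i += 1
    return bytes(out)

def decode_url(blob: bytes) -> str:
    return "".join(DECODE.get(b, chr(b)) for b in blob)

url = "http://www.somesite.com.tv/images/picture01.jpg"
assert decode_url(encode_url(url)) == url
```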

Since you already have 50M+ URLs, you can generate statistics based on them to build a better dictionary.
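One hedged way to derive such a dictionary from the existing corpus (a sketch; the tokenization heuristic is my own, assuming the URLs can be streamed from a file or query):

```python
from collections import Counter
import re

def top_tokens(urls, n=4096):
    """Rank path words and extensions by the bytes saved if each
    occurrence were replaced with a 2-byte dictionary code."""
    counts = Counter()
    for url in urls:
        counts.update(re.findall(r"\.?[a-z0-9]+", url.lower()))
    # Tokens of 2 characters or fewer save nothing with a 2-byte code.
    candidates = [(tok, cnt) for tok, cnt in counts.items() if len(tok) > 2]
    candidates.sort(key=lambda tc: tc[1] * (len(tc[0]) - 2), reverse=True)
    return [tok for tok, _ in candidates[:n]]

sample = ["http://www.somesite.com.tv/images/picture01.jpg"]
print(top_tokens(sample, n=10))
```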

Using hashes: hashes, in this case, are a tradeoff between size and safety. How bad would it be if you got a collision? To answer that, you can use the birthday paradox to help you.

Read the article to understand the problem: if all inputs (possible characters in the URL) were equally likely, you could estimate the probability of a collision. And you can calculate the opposite: given your acceptable collision probability and your number of files, how broad should your range be? And your range is directly related to the number of bits produced by the hash function...

Edit: if you have a hash function that gives you 128 bits, there are 2^128 possible outcomes. So your "range" in the birthday paradox is 2^128: it's as if your year had 2^128 days instead of 365, and you calculate the probability of a collision ("two files being born on the same day") over that year. If you choose a hash that gives you 512 bits, your range goes from 0 to 2^512...
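The standard birthday-paradox approximation, p ≈ 1 - e^(-n²/2N) for n items in a range of size N, makes this concrete. A quick sketch for the numbers in this question (150M URLs):

```python
import math

def collision_probability(n_items: float, hash_bits: int) -> float:
    """Birthday-paradox approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    space = 2.0 ** hash_bits
    return 1.0 - math.exp(-(n_items ** 2) / (2.0 * space))

n = 150e6  # 150 million URLs
print(collision_probability(n, 32))   # CRC32 alone: ~1.0, collisions certain
print(collision_probability(n, 128))  # MD5: ~3.3e-23, negligible
```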

And, again, keep the RFC in mind: not all bytes (256 characters) are valid in the internet/URL world, so the probability of collisions decreases. Better for you :).
