How to store URLs in MySQL


Question

I need to store potentially hundreds of millions of URLs in a database. Every URL should be unique, so I plan to use ON DUPLICATE KEY UPDATE and keep a count of the duplicate URLs.

However, I cannot create an index on the URL field because my varchar field is 400 characters long. MySQL complains: "#1071 - Specified key was too long; max key length is 767 bytes". (A varchar(400) column takes up to 1200 bytes at 3 bytes per character.)

What is the best way to do this if you need to process at least 500,000 URLs per day on a single server?

We are already considering MongoDB for the same application, so we could simply query MongoDB, find the duplicate URL, and update the row. However, I am not in favor of solving this problem with MongoDB. I'd like to use just MySQL at this stage, to stay as lean as possible in the beginning and finish this part of the project faster. (We haven't played with MongoDB yet and don't want to spend time on it at this stage.)

Is there any other way to do this that uses fewer resources and less time? I was thinking of computing the MD5 hash of each URL, storing it as well, and making that field UNIQUE instead. I know there will be collisions, but it is OK to have 5, 10, or 20 duplicates in 100 million URLs, if that's the only problem.

Do you have any suggestions? I also don't want to spend 10 seconds inserting a single URL, since the system will process 500k URLs per day.

Edit: As requested, this is the table definition. (I am not using the md5 column at the moment; it is there for testing.)

mysql> DESC url;
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
| Field       | Type                  | Null | Key | Default           | Extra                       |
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
| url_id      | int(11) unsigned      | NO   | PRI | NULL              | auto_increment              |
| url_text    | varchar(400)          | NO   |     |                   |                             |
| md5         | varchar(32)           | NO   | UNI |                   |                             |
| insert_date | timestamp             | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| count       | mediumint(9) unsigned | NO   |     | 0                 |                             |
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
5 rows in set (0.00 sec)
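
The MD5 idea from the question, applied to the `url` table above, can be sketched as follows. This is a minimal sketch, not a tested production setup: the helper names `md5_hex` and `upsert_params` are hypothetical, and the `%s` placeholders assume a Python MySQL driver that uses that parameter style.

```python
import hashlib

# Assumption: "%s" placeholders, as used by common Python MySQL drivers.
# The table and column names (url, url_text, md5, count) come from the
# DESC output above; `count` starts at its default of 0 and is bumped
# each time the same md5 value is inserted again.
UPSERT_SQL = (
    "INSERT INTO url (url_text, md5) VALUES (%s, %s) "
    "ON DUPLICATE KEY UPDATE count = count + 1"
)


def md5_hex(url: str) -> str:
    """Return the 32-character hex MD5 digest that fits the varchar(32) column."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def upsert_params(url: str) -> tuple:
    """Build the parameter tuple for UPSERT_SQL (hypothetical helper)."""
    return (url, md5_hex(url))
```

With a DB-API cursor this would be run as `cursor.execute(UPSERT_SQL, upsert_params(url))`. For non-adversarial input, the birthday estimate n²/2¹²⁹ for a 128-bit hash suggests that a collision among 10⁸ URLs is far less likely than the 5 to 20 duplicates the question allows for.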

Solution

According to the DNS spec, the maximum length of a domain name is:

The DNS itself places only one restriction on the particular labels
that can be used to identify resource records. That one restriction
relates to the length of the label and the full name. The length of
any one label is limited to between 1 and 63 octets. A full domain
name is limited to 255 octets (including the separators).

255 * 3 = 765 < 767 (Just barely :-) )
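
The arithmetic behind that comparison can be checked with a few lines. This is only a sketch: it assumes 3 bytes per character, which is what the 400 -> 1200 and 255 -> 765 figures imply, and the function names are made up for illustration.

```python
# Assumption: 3 bytes per character, as implied by the figures in the
# question (400 chars -> 1200 bytes) and the answer (255 chars -> 765 bytes).
BYTES_PER_CHAR = 3
MAX_KEY_BYTES = 767  # limit quoted in MySQL error #1071


def index_key_bytes(n_chars: int, bytes_per_char: int = BYTES_PER_CHAR) -> int:
    """Worst-case number of bytes an index on an n_chars-wide column needs."""
    return n_chars * bytes_per_char


def fits_in_index(n_chars: int) -> bool:
    """True if a column of n_chars characters stays within the key limit."""
    return index_key_bytes(n_chars) <= MAX_KEY_BYTES


print(index_key_bytes(400), fits_in_index(400))  # 1200 False
print(index_key_bytes(255), fits_in_index(255))  # 765 True
```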

However, notice that each component (label) can only be 63 characters long.

So I would suggest chopping the URL into its component parts.

Using http://foo.example.com/a/really/long/path?with=lots&of=query&parameters=that&goes=on&forever&and=ever

This would probably be adequate:

  • protocol flag ["http" -> 0] (store "http" as 0, "https" as 1, etc.)
  • subdomain ["foo"] (255 - 63 = 192 characters; I could subtract 2 more because the minimum TLD is 2 characters)
  • domain ["example"] (63 characters)
  • tld ["com"] (4 characters to handle the "info" TLD; longer TLDs such as "museum" exist, so allowing up to 63 characters is safer)
  • path ["a/really/long/path"] (as long as you want; store in a separate table)
  • queryparameters ["with=lots&of=query&parameters=that&goes=on&forever&and=ever"] (store in a separate key/value table)
  • port number / authentication details that are rarely used can go in a separate keyed table if actually needed.
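
The split described in the list above could look roughly like this in application code. Treat it as a sketch under assumptions: `UrlParts`, `split_url`, and `PROTOCOL_FLAGS` are hypothetical names, and the host is split naively (last label = tld, second to last = domain, the rest = subdomain), so multi-label suffixes such as "co.uk" are not handled.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlsplit

# Hypothetical mapping for the protocol flag described in the list above.
PROTOCOL_FLAGS = {"http": 0, "https": 1}


@dataclass
class UrlParts:
    protocol: int
    subdomain: str
    domain: str
    tld: str
    path: str
    query: list  # (key, value) pairs, destined for a key/value table


def split_url(url: str) -> UrlParts:
    """Naively split a URL into the components listed above."""
    parts = urlsplit(url)
    if parts.scheme not in PROTOCOL_FLAGS:
        raise ValueError("unsupported protocol: %r" % parts.scheme)
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    labels = host.split(".")
    # DNS limits quoted above: 1 to 63 octets per label, 255 for the full name.
    if len(labels) < 2 or len(host) > 255:
        raise ValueError("bogus host: %r" % host)
    if any(not 1 <= len(label) <= 63 for label in labels):
        raise ValueError("bogus label in host: %r" % host)
    return UrlParts(
        protocol=PROTOCOL_FLAGS[parts.scheme],
        subdomain=".".join(labels[:-2]),
        domain=labels[-2],
        tld=labels[-1],
        path=parts.path.lstrip("/"),
        # keep_blank_values keeps bare keys such as "forever"
        query=parse_qsl(parts.query, keep_blank_values=True),
    )


example = split_url(
    "http://foo.example.com/a/really/long/path"
    "?with=lots&of=query&parameters=that&goes=on&forever&and=ever"
)
print(example.subdomain, example.domain, example.tld)  # foo example com
```

Dropping the separators here is what removes the "://", ".", "/", "?", "&" and "=" characters from the stored data; they can be re-added when the URL is reassembled.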

This gives you some nice advantages:

  • The index is only on the parts of the URL that you need to search on (smaller index!)
  • Queries can be limited to the various URL parts (find every URL in the facebook domain, for example)
  • Any URL that has too long a subdomain/domain is bogus
  • Easy to discard query parameters
  • Easy to do case-insensitive domain name/TLD searching
  • Discards the syntactic sugar ("://" after the protocol, "." between subdomain/domain and domain/tld, "/" between tld and path, "?" before the query, "&" and "=" in the query)
  • Avoids the major sparse table problem. Most URLs will have neither query parameters nor long paths. If these fields are in separate tables, your main table will not take the size hit. More records will then fit into memory when querying, which means faster query performance.
  • (more advantages here).
