Best way to store URLs in MySQL for a read- and write-intensive application

Question

What is the most effective way to store URLs in MySQL for a read- and write-intensive application?

I will be storing over 500,000 web addresses (all starting with either http:// or https://; no other protocols). Saving the whole URL (http://example.com/path/?variable=a) in one column seems largely redundant, because the same domain name and path will be saved to MySQL many times.

So, initially, I thought of breaking them down (i.e. domain, path, variables, etc.) to get rid of the redundancy. But I saw some posts saying that this is not recommended. Any ideas on this?

Also, the application often has to retrieve URLs without primary keys, meaning it has to search by text to find a URL. The URL column can be indexed, but I'm wondering how much of a performance difference there would be between storing the whole URL and storing a broken-down URL if both are indexed under InnoDB (no full-text indexing).

A broken-down URL has to go through the extra step of recombining the parts. It would also mean retrieving data four times from different tables (protocol, domain, path, variable), but it makes the data stored in each row shorter, and there would be fewer rows in each table. Could this speed up the process?

Solution

I have dealt with this extensively, and my general philosophy is to use the frequency-of-use method. It's cumbersome, but it lets you run some great analytics on the data:

CREATE TABLE URL (
   ID            integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
   DomainPath    integer unsigned NOT NULL,
   QueryString   text
) Engine=MyISAM;

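CREATE TABLE DomainPath (
   ID            integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
   Domain        integer unsigned NOT NULL,
   Path          text,
   -- Note: MySQL rejects a TEXT column in a key unless a prefix length is given, e.g. Path(255);
   -- the hash column described further down avoids needing one.
   UNIQUE (Domain,Path)
) Engine=MyISAM;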

CREATE TABLE Domain (
   ID            integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
   Protocol      tinyint NOT NULL,            -- 0 = http, anything else = https
   Domain        varchar(64),
   Port          smallint unsigned NULL,      -- unsigned so that ports up to 65535 fit
   UNIQUE (Protocol,Domain,Port)
) Engine=MyISAM;

As a general rule, you'll have similar Paths on a single Domain, but different QueryStrings for each path.

I originally designed this with all parts indexed in a single table (Protocol, Domain, Path, Query String), but I think the layout above is less space-intensive and lends itself better to getting useful data out.

TEXT tends to be slow, so you can change "Path" to a varchar once you have seen some real data. Most servers give up on URLs longer than about 1K, but I've seen some large ones and would err on the side of not losing data.

Your retrieval query is cumbersome, but if you abstract it away in your code, no issue:

SELECT CONCAT(
    IF(D.Protocol=0,'http://','https://'),
    D.Domain,
    IF(D.Port IS NULL,'',CONCAT(':',D.Port)), 
    '/', DP.Path, 
    IF(U.QueryString IS NULL,'',CONCAT('?',U.QueryString))
)
FROM URL U
INNER JOIN DomainPath DP ON U.DomainPath=DP.ID
INNER JOIN Domain D on DP.Domain=D.ID
WHERE U.ID=$DesiredID;

Store a port number if it's non standard (non-80 for http, non-443 for https), otherwise store it as NULL to signify it shouldn't be included. (You can add the logic to the MySQL but it gets much uglier.)
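If you abstract the retrieval away in application code, the port rule and the reassembly might look roughly like this in Python. This is only a sketch: the function names `stored_port` and `build_url` are made up for illustration, and it assumes the leading "/" has been stripped from the stored path, as the CONCAT query above implies.

```python
from typing import Optional

# Default ports that should be stored as NULL (None) rather than as a number.
DEFAULT_PORTS = {"http": 80, "https": 443}


def stored_port(scheme: str, port: Optional[int]) -> Optional[int]:
    """Return the value to store in Domain.Port: None when the port is the scheme's default."""
    if port is None or port == DEFAULT_PORTS[scheme]:
        return None
    return port


def build_url(protocol: int, domain: str, port: Optional[int],
              path: str, query: Optional[str]) -> str:
    """Reassemble a URL from its stored parts, mirroring the CONCAT query above."""
    scheme = "http://" if protocol == 0 else "https://"   # 0 = http, otherwise https
    port_part = "" if port is None else f":{port}"
    query_part = "" if query is None else f"?{query}"
    # The path is assumed to be stored without its leading slash.
    return f"{scheme}{domain}{port_part}/{path}{query_part}"
```

For example, `build_url(0, "www.example.com", None, "my/path", "id=1")` gives `http://www.example.com/my/path?id=1`.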

I would always (or never) strip the "/" from the Path as well as the "?" from the QueryString to save space. The only loss would be the ability to distinguish between

http://www.example.com/
http://www.example.com/?

If that distinction is important, I would change tack and never strip it, just include it. Technically,

http://www.example.com 
http://www.example.com/

are the same, so stripping the Path slash is always OK.

So, to parse:

http://www.example.com/my/path/to/my/file.php?id=412&crsource=google+adwords

We would use something like parse_url in PHP to produce:

array(
    [scheme] => 'http',
    [host] => 'www.example.com',
    [path] => '/my/path/to/my/file.php',
    [query] => 'id=412&crsource=google+adwords',
)
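The same split can be done in other languages. As a rough Python equivalent, `urllib.parse.urlsplit` from the standard library yields the same pieces; the wrapper below (the name `split_url` and the returned dictionary keys are my own choices, not part of the original answer) also applies the conventions discussed above: default ports become NULL, the leading "/" is stripped from the path, and an empty query string is stored as NULL.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split an http(s) URL into the parts stored in the Domain, DomainPath and URL tables."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {scheme!r}")

    default_port = 80 if scheme == "http" else 443
    port = parts.port  # None when the URL carries no explicit port

    path = parts.path
    if path.startswith("/"):
        path = path[1:]  # strip exactly one leading slash

    return {
        "protocol": 0 if scheme == "http" else 1,
        "domain": parts.hostname,  # urlsplit lower-cases the host name
        "port": None if port in (None, default_port) else port,
        "path": path,
        # "" covers both "no query" and a bare "?", so that distinction is lost here.
        "query": parts.query or None,
    }
```

Note that with this convention `http://www.example.com/` and `http://www.example.com/?` produce identical rows, which is exactly the loss described earlier.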

You would then check/insert (with appropriate locks, not shown):

SELECT D.ID FROM Domain D 
WHERE 
    D.Protocol=0 
    AND D.Domain='www.example.com' 
    AND D.Port IS NULL

(if doesn't exist)

INSERT INTO Domain ( 
    Protocol, Domain, Port 
) VALUES ( 
    0, 'www.example.com', NULL 
);

We then have our $DomainID going forward...

Then insert into DomainPath:

SELECT DP.ID FROM DomainPath DP WHERE
DP.Domain=$DomainID AND DP.Path='/my/path/to/my/file.php';

(if it doesn't exist, insert it similarly)

We then have our $DomainPathID going forward...

SELECT U.ID FROM URL U
WHERE
    U.DomainPath=$DomainPathID
    AND U.QueryString='id=412&crsource=google+adwords'

and insert if necessary.
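Put together, the check/insert steps above form a three-level "get or create" chain. The sketch below shows that flow in Python, using the standard-library `sqlite3` module with an in-memory database purely as a stand-in for MySQL so that it runs anywhere. The helper names `get_or_create` and `store_url` are hypothetical, the schema is a simplified SQLite version of the tables above, and locking for concurrent writers (which the answer says is needed) is not shown.

```python
import sqlite3

# Simplified SQLite versions of the three tables from the answer (assumption: demo only).
SCHEMA = """
CREATE TABLE Domain (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Protocol INTEGER NOT NULL,
    Domain TEXT NOT NULL,
    Port INTEGER
);
CREATE TABLE DomainPath (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Domain INTEGER NOT NULL,
    Path TEXT,
    UNIQUE (Domain, Path)
);
CREATE TABLE URL (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    DomainPath INTEGER NOT NULL,
    QueryString TEXT
);
"""


def get_or_create(conn, table, values):
    """Return the ID of the row matching `values`, inserting the row first if it is missing."""
    # Table and column names come from fixed strings in this file, never from user input.
    # `IS` is SQLite's NULL-safe comparison; in MySQL the equivalent operator is `<=>`.
    where = " AND ".join(f"{col} IS ?" for col in values)
    params = tuple(values.values())
    row = conn.execute(f"SELECT ID FROM {table} WHERE {where}", params).fetchone()
    if row is not None:
        return row[0]
    cols = ", ".join(values)
    marks = ", ".join("?" for _ in values)
    cur = conn.execute(f"INSERT INTO {table} ({cols}) VALUES ({marks})", params)
    return cur.lastrowid


def store_url(conn, protocol, domain, port, path, query):
    """Walk Domain -> DomainPath -> URL and return the URL row's ID."""
    with conn:  # commit all three steps together, roll back on error
        domain_id = get_or_create(
            conn, "Domain", {"Protocol": protocol, "Domain": domain, "Port": port})
        path_id = get_or_create(
            conn, "DomainPath", {"Domain": domain_id, "Path": path})
        return get_or_create(
            conn, "URL", {"DomainPath": path_id, "QueryString": query})
```

Storing the same URL twice returns the same ID, and a second URL on the same path only adds a row to the URL table, which is where the space saving comes from.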

Now, importantly, note that the above scheme will be slow for high-performance sites. You should modify everything to use a hash of some sort to speed up SELECTs. In short, the technique looks like this:

CREATE TABLE Foo (
     ID integer unsigned PRIMARY KEY NOT NULL AUTO_INCREMENT,
     Hash varbinary(16) NOT NULL,
     Content text,
     INDEX (Hash)   -- the lookup below is only fast if Hash is indexed
) Engine=MyISAM;

SELECT ID FROM Foo WHERE Hash=UNHEX(MD5('id=412&crsource=google+adwords'));

I deliberately left it out of the above to keep things simple, but comparing one TEXT value to another in a SELECT is slow, and it breaks for really long query strings. Don't use a fixed-length index either, because that will also break. For arbitrary-length strings where accuracy matters, the failure rate of a hash is acceptable.
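To make the hash technique concrete, here is a small Python sketch, again using the standard-library `sqlite3` module as a stand-in for MySQL. The `Foo` table follows the one above; the function names are hypothetical. Comparing the stored Content after the hash match is an optional extra of mine, for cases where even a rare hash collision is not tolerable; the approach described above simply accepts that risk.

```python
import hashlib
import sqlite3

# SQLite stand-in for the Foo table above, with an index on the hash column.
FOO_SCHEMA = """
CREATE TABLE Foo (
    ID INTEGER PRIMARY KEY AUTOINCREMENT,
    Hash BLOB NOT NULL,
    Content TEXT
);
CREATE INDEX Foo_Hash ON Foo (Hash);
"""


def md5_bytes(text: str) -> bytes:
    """16 raw MD5 bytes of the UTF-8 encoded text (what UNHEX(MD5(...)) yields in MySQL)."""
    return hashlib.md5(text.encode("utf-8")).digest()


def insert_content(conn, content: str) -> int:
    """Store the content together with its hash and return the new row ID."""
    cur = conn.execute(
        "INSERT INTO Foo (Hash, Content) VALUES (?, ?)",
        (md5_bytes(content), content),
    )
    return cur.lastrowid


def find_id(conn, content: str):
    """Look up a row by hash; return its ID, or None if the content is not stored."""
    rows = conn.execute(
        "SELECT ID, Content FROM Foo WHERE Hash = ?", (md5_bytes(content),)
    )
    for row_id, stored in rows:
        if stored == content:  # optional guard against a hash collision
            return row_id
    return None
```

The index only ever sees 16-byte values, however long the query string is, which is why this stays fast where a TEXT comparison does not.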

Finally, if you can, compute the MD5 hash on the client side, so that you don't have to send large blobs to the server just to run the MD5 operation there. Most modern languages support MD5 out of the box:

SELECT ID FROM Foo WHERE Hash=UNHEX('82fd4bcf8b686cffe81e937c43b5bfeb');
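In Python, for instance, the digest comes from the standard-library `hashlib` module. The sketch below only prepares the hex string and the parameterized statement; the `%s` placeholder style is an assumption about the MySQL driver in use (it is common, but check your driver), and `md5_hex` is a name chosen for illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex MD5 digest of the UTF-8 encoded text, suitable for UNHEX(...) in MySQL."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


query_string = "id=412&crsource=google+adwords"
digest_hex = md5_hex(query_string)  # 32 hex characters, i.e. 16 bytes once UNHEX'ed

# Only the short digest travels to the server, not the full query string.
# Assumption: the driver uses %s placeholders; pass (sql, params) to its execute().
sql = "SELECT ID FROM Foo WHERE Hash = UNHEX(%s)"
params = (digest_hex,)
```

Make sure the client hashes exactly the same bytes the server would (same character encoding, same stripping of "?"), otherwise lookups will silently miss.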

But I digress.
