Best way to store URLs in MySQL for a read- and write-intensive application
Question
What is the best way to store URLs in MySQL efficiently for a read- and write-intensive application?
I will be storing over 500,000 web addresses (all starting with either http:// or https://; no other protocols), and saving the whole URL (http://example.com/path/?variable=a) into one column seems largely redundant, because the same domain name and path will be saved to MySQL multiple times.
So, initially, I thought of breaking them down (i.e. domain, path, variables, etc.) to get rid of the redundancy. But I saw some posts saying that this is not recommended. Any ideas on this?
Also, the application often has to retrieve URLs without primary keys, meaning it has to search text to retrieve a URL. The URL column can be indexed, but I'm wondering how much performance difference there would be between storing the whole URL and storing a broken-down URL if both are indexed under InnoDB (no full-text indexing).
A broken-down URL has to go through the extra step of being recombined. It would also mean retrieving data four times from different tables (protocol, domain, path, variable), but it makes the stored data in each row shorter, and there would be fewer rows in each table. Could this speed up the process?
Solution
I have dealt with this extensively, and my general philosophy is to use the frequency-of-use method. It's cumbersome, but it lets you run some great analytics on the data:
CREATE TABLE URL (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
DomainPath integer unsigned NOT NULL,
QueryString text
) Engine=MyISAM;
CREATE TABLE DomainPath (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
Domain integer unsigned NOT NULL,
Path text,
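UNIQUE (Domain,Path(191)) -- MySQL needs a prefix length to index a TEXT column; 191 is an assumed value, and uniqueness then applies to that prefix only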
) Engine=MyISAM;
CREATE TABLE Domain (
ID integer unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
Protocol tinyint NOT NULL,
Domain varchar(64),
Port smallint unsigned NULL, -- unsigned so ports up to 65535 fit; NULL means the default port
UNIQUE (Protocol,Domain,Port)
) Engine=MyISAM;
As a general rule, you'll have similar Paths on a single Domain, but different QueryStrings for each Path.
I originally designed this to have all parts indexed in a single table (Protocol, Domain, Path, QueryString), but I think the above is less space-intensive and lends itself better to getting useful data out of it.
text columns tend to be slow, so you can change "Path" to a varchar after some use. Most servers die after about 1K for a URL, but I've seen some large ones and would err on the side of not losing data.
Your retrieval query is cumbersome, but if you abstract it away in your code, no issue:
SELECT CONCAT(
IF(D.Protocol=0,'http://','https://'),
D.Domain,
IF(D.Port IS NULL,'',CONCAT(':',D.Port)),
'/', DP.Path,
IF(U.QueryString IS NULL,'',CONCAT('?',U.QueryString))
)
FROM URL U
INNER JOIN DomainPath DP ON U.DomainPath=DP.ID
INNER JOIN Domain D on DP.Domain=D.ID
WHERE U.ID=$DesiredID;
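If you would rather reassemble the URL in application code than in SQL, the CONCAT above can be mirrored roughly as follows. This is only a sketch in Python: the function name build_url and its parameters are hypothetical, and it assumes, as the query does, that Protocol 0 means http and that Path is stored without its leading slash.

```python
from typing import Optional


def build_url(protocol: int, domain: str, port: Optional[int],
              path: str, query: Optional[str]) -> str:
    """Reassemble a URL from the stored parts, mirroring the CONCAT query."""
    scheme = "http://" if protocol == 0 else "https://"
    port_part = "" if port is None else f":{port}"      # NULL port: omit it
    query_part = "" if query is None else f"?{query}"   # NULL query: omit "?"
    # The path is assumed to be stored without its leading "/".
    return f"{scheme}{domain}{port_part}/{path}{query_part}"


print(build_url(0, "www.example.com", None,
                "my/path/to/my/file.php", "id=412&crsource=google+adwords"))
```

Doing this in code keeps the SQL down to a plain three-table join that returns the raw columns.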
Store a port number only if it's non-standard (not 80 for http, not 443 for https); otherwise store NULL to signify that it shouldn't be included. (You can add that logic to the MySQL query, but it gets much uglier.)
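The port rule is easy to apply before the insert. A minimal sketch in Python, where stored_port and DEFAULT_PORTS are hypothetical names and None stands for SQL NULL:

```python
from typing import Optional

# Default ports for the two schemes the question allows.
DEFAULT_PORTS = {"http": 80, "https": 443}


def stored_port(scheme: str, port: Optional[int]) -> Optional[int]:
    """Return None (SQL NULL) for a missing or default port, else the port."""
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        return None
    return port


print(stored_port("http", 80), stored_port("http", 8080))
```

Note that port 80 on an https URL is non-standard for that scheme, so it is kept.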
I would always (or never) strip the leading "/" from the Path, as well as the "?" from the QueryString, for space savings. The only loss would be the ability to distinguish between
http://www.example.com/
http://www.example.com/?
If that distinction is important, I would change tack and never strip it, just include it. Technically,
http://www.example.com
http://www.example.com/
are the same, so stripping the Path slash is always OK.
So, to parse:
http://www.example.com/my/path/to/my/file.php?id=412&crsource=google+adwords
We would use something like parse_url in PHP to produce:
array(
[scheme] => 'http',
[host] => 'www.example.com',
[path] => '/my/path/to/my/file.php',
[query] => 'id=412&crsource=google+adwords',
)
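For readers not using PHP, roughly the same split can be sketched in Python with urllib.parse.urlsplit from the standard library. The function name split_url and the dictionary keys are illustrative choices, not part of the answer above; an empty query string is mapped to None here, which matches the "strip the ?" convention but loses the distinction discussed earlier.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into the pieces the three tables need."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,        # lower-cased by urlsplit
        "port": parts.port,            # None when no explicit port is given
        "path": parts.path,
        "query": parts.query or None,  # "" (no query) becomes None
    }


pieces = split_url(
    "http://www.example.com/my/path/to/my/file.php?id=412&crsource=google+adwords")
print(pieces)
```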
You would then check/insert (with appropriate locks, not shown):
SELECT D.ID FROM Domain D
WHERE
D.Protocol=0
AND D.Domain='www.example.com'
AND D.Port IS NULL
(if it doesn't exist)
INSERT INTO Domain (
Protocol, Domain, Port
) VALUES (
0, 'www.example.com', NULL
);
We then have our $DomainID going forward...
Then insert into DomainPath:
SELECT DP.ID FROM DomainPath DP WHERE
DP.Domain=$DomainID AND DP.Path='/my/path/to/my/file.php';
(if it doesn't exist, insert it similarly)
We then have our $DomainPathID going forward...
SELECT U.ID FROM URL U
WHERE
U.DomainPath=$DomainPathID
AND U.QueryString='id=412&crsource=google+adwords';
and insert if necessary.
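The three check-then-insert steps follow one pattern, so they can share a helper. The sketch below is an illustration only: it runs against an in-memory SQLite database from Python's standard library as a stand-in for MySQL, the names get_or_create and store_url are hypothetical, and the locking mentioned above is still not shown.

```python
import sqlite3


def get_or_create(conn, table, columns, values):
    """Return the ID of the row matching every column, inserting it if absent.

    Table and column names are fixed strings from this file; only values are bound.
    """
    # "IS" is SQLite's NULL-safe comparison; the MySQL equivalent is "<=>".
    where = " AND ".join(f"{col} IS ?" for col in columns)
    row = conn.execute(f"SELECT ID FROM {table} WHERE {where}", values).fetchone()
    if row is not None:
        return row[0]
    placeholders = ", ".join("?" for _ in columns)
    cursor = conn.execute(
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
        values,
    )
    return cursor.lastrowid


def store_url(conn, protocol, domain, port, path, query):
    """Walk Domain -> DomainPath -> URL, creating rows as needed; return URL.ID."""
    with conn:  # commit on success, roll back on error
        domain_id = get_or_create(
            conn, "Domain", ["Protocol", "Domain", "Port"], (protocol, domain, port))
        path_id = get_or_create(
            conn, "DomainPath", ["Domain", "Path"], (domain_id, path))
        return get_or_create(
            conn, "URL", ["DomainPath", "QueryString"], (path_id, query))


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Domain (ID INTEGER PRIMARY KEY, Protocol INTEGER NOT NULL,
                         Domain TEXT, Port INTEGER);
    CREATE TABLE DomainPath (ID INTEGER PRIMARY KEY, Domain INTEGER NOT NULL,
                             Path TEXT);
    CREATE TABLE URL (ID INTEGER PRIMARY KEY, DomainPath INTEGER NOT NULL,
                      QueryString TEXT);
""")

first = store_url(conn, 0, "www.example.com", None,
                  "/my/path/to/my/file.php", "id=412&crsource=google+adwords")
again = store_url(conn, 0, "www.example.com", None,
                  "/my/path/to/my/file.php", "id=412&crsource=google+adwords")
print(first == again)
```

Storing the same URL twice returns the same ID, and a URL that differs only in its query string reuses the existing Domain and DomainPath rows.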
Now, let me note, importantly, that the above scheme will be slow for high-performance sites. You should modify everything to use a hash of some sort to speed up SELECTs. In short, the technique is like:
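CREATE TABLE Foo (
ID integer unsigned PRIMARY KEY NOT NULL AUTO_INCREMENT,
Hash varbinary(16) NOT NULL, -- raw 16-byte MD5 digest
Content text,
KEY (Hash) -- index so that lookups by hash avoid a full table scan
) Engine=MyISAM; -- the older Type= spelling is no longer accepted by current MySQL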
SELECT ID FROM Foo WHERE Hash=UNHEX(MD5('id=412&crsource=google+adwords'));
I deliberately left it out of the above to keep things simple, but comparing a TEXT to another TEXT in a SELECT is slow, and it breaks for really long query strings. Don't use a fixed-length prefix index either, because that will also break. For arbitrary-length strings where accuracy matters, the small collision rate of a hash is acceptable.
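To make the lookup concrete, here is a small sketch of the hash-column technique, again using Python with an in-memory SQLite database as a stand-in for MySQL. The helper names are hypothetical. The extra comparison of Content after the hash match is my own addition for the case where even a rare collision is unacceptable; the scheme described above simply accepts that risk.

```python
import hashlib
import sqlite3
from typing import Optional


def md5_digest(text: str) -> bytes:
    """Raw 16-byte MD5 digest, the value kept in the Hash column."""
    return hashlib.md5(text.encode("utf-8")).digest()


def insert_content(conn: sqlite3.Connection, content: str) -> int:
    cursor = conn.execute(
        "INSERT INTO Foo (Hash, Content) VALUES (?, ?)",
        (md5_digest(content), content),
    )
    return cursor.lastrowid


def find_id(conn: sqlite3.Connection, content: str) -> Optional[int]:
    """Look up by the short indexed hash, then confirm the full text matches."""
    rows = conn.execute(
        "SELECT ID, Content FROM Foo WHERE Hash = ?", (md5_digest(content),))
    for row_id, stored in rows:
        if stored == content:  # optional guard against a hash collision
            return row_id
    return None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Foo (ID INTEGER PRIMARY KEY, Hash BLOB NOT NULL, Content TEXT)")
conn.execute("CREATE INDEX FooHash ON Foo (Hash)")
new_id = insert_content(conn, "id=412&crsource=google+adwords")
print(find_id(conn, "id=412&crsource=google+adwords") == new_id)
```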
Finally, if you can, compute the MD5 hash on the client side to avoid sending large blobs to the server just for the MD5 operation. Most modern languages support MD5 built in:
SELECT ID FROM Foo WHERE Hash=UNHEX('82fd4bcf8b686cffe81e937c43b5bfeb');
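As one example of doing the hash client-side, Python's standard hashlib module produces the hex digest directly. The sketch below only builds the statement and its parameter; the %s placeholder is the style used by common Python MySQL drivers, and no driver is imported or assumed here. The helper name md5_hex is hypothetical.

```python
import hashlib


def md5_hex(text: str) -> str:
    """32-character hex MD5 digest, suitable as the argument to UNHEX()."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


query_string = "id=412&crsource=google+adwords"

# Only the 32-character digest travels to the server, not the full string.
sql = "SELECT ID FROM Foo WHERE Hash = UNHEX(%s)"
params = (md5_hex(query_string),)
print(sql, params)
```

Hash the same byte encoding that was hashed when the row was stored (UTF-8 in this sketch), otherwise the lookups will miss.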
But I digress.