Best way to store "tags" for speed in enormous table

Problem description

I'm developing a big content site with a table "contents" that holds more than 50 million records. Here's the table structure:

contain id(INT11 INDEX), 
name(varchar150 FULLTEXT), 
description (text FULLTEXT), 
date(INT11 INDEX)

I want to add tags to this content.

I can think of 2 methods:

  1. Add a varchar(255) "tags" column with a FULLTEXT index to the contents table, store all tags separated by commas, and search it with MATCH ... AGAINST (which I think will be slow).

  2. Make 2 tables: "tags" with columns id and tag (varchar(30), INDEX or FULLTEXT?), and "contents_tags" with id, tag_id (int11 INDEX) and content_id (int11 INDEX). Then search by a JOIN of 3 tables (contents - contents_tags - tags) to retrieve all contents with the given tag(s).

I think this is slow and a memory killer because of the enormous JOIN of the 50M-row table * contents_tags * tags.

What is the best method to store tags to make this as efficient as possible? What is the fastest way to locate contents when searching by text (for example "movie 3d 2011") and by a simple tag (for example "video")?

The table is approximately 5 GB now, without tags. It is a MyISAM table because I need FULLTEXT indexes on name and description for string search (users can already search by these fields), and I need the best possible speed when searching by tags.

Does anyone have experience with this?

Thanks!

Solution

FULLTEXT indexes are really not as fast as you may think they are.

Use a separate table to store your tags:

Table tags
----------
id integer PK
tag varchar(20)

Table tag_link
--------------
tag_id integer foreign key references tags(id)
content_id integer foreign key references content(id)
/* this table has a PK consisting of tag_id + content_id */

Table content
--------------
id integer PK
......
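
The outline above can be turned into runnable DDL. The following is a minimal sketch, not the answer's exact schema: it assumes SQLite (through Python's sqlite3 module) as a stand-in for MySQL so that it runs anywhere, and the `name` column, the UNIQUE constraint on `tag`, and the `tag_link_content` index are illustrative additions that do not appear in the original.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE content (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL              -- remaining columns elided
);
CREATE TABLE tags (
    id   INTEGER PRIMARY KEY,
    tag  VARCHAR(20) NOT NULL UNIQUE -- UNIQUE is an assumption
);
CREATE TABLE tag_link (
    tag_id      INTEGER NOT NULL REFERENCES tags(id),
    content_id  INTEGER NOT NULL REFERENCES content(id),
    PRIMARY KEY (tag_id, content_id)
);
-- Assumed extra index so lookups by content_id do not scan tag_link.
CREATE INDEX tag_link_content ON tag_link (content_id);
""")

conn.execute("INSERT INTO content (id, name) VALUES (1, 'demo article')")
conn.execute("INSERT INTO tags (id, tag) VALUES (1, 'video')")
conn.execute("INSERT INTO tag_link (tag_id, content_id) VALUES (1, 1)")

# The composite primary key rejects a duplicate (tag, content) pair.
try:
    conn.execute("INSERT INTO tag_link (tag_id, content_id) VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

link_count = conn.execute("SELECT COUNT(*) FROM tag_link").fetchone()[0]
```

The composite primary key on (tag_id, content_id) both prevents the same tag from being attached to a content row twice and serves as the index for "all content for this tag" lookups.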

You SELECT all content with tag x by using:

SELECT c.* FROM tags t
INNER JOIN tag_link tl ON (t.id = tl.tag_id)
INNER JOIN content c ON (c.id = tl.content_id)
WHERE t.tag = 'test'
ORDER BY tl.content_id DESC /*latest content first*/
LIMIT 10;

Because of the foreign keys, the fields in tag_link are each indexed individually (this holds on InnoDB; MyISAM ignores foreign key definitions, so there the indexes have to be created explicitly).
The `WHERE tag = 'test'` selects 1 (!) record.
That record is equi-joined with, say, 10,000 tag_link rows.
Each of those is equi-joined with exactly 1 content record (each tag_link row only ever points to 1 content row).
Because of the LIMIT 10, MySQL stops looking as soon as it has 10 items, so it really only looks at 10 tag_link rows.
content.id is auto-incrementing, so higher numbers are a very fast proxy for newer articles.
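
To make the walk-through concrete, here is a small sketch of the same query run against sample data. It again assumes SQLite via Python's sqlite3 module in place of MySQL. The sample rows and the helper name `latest_with_tag` are hypothetical, and the sketch demonstrates the result of the query, not MySQL's execution plan.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tags (id INTEGER PRIMARY KEY, tag TEXT NOT NULL UNIQUE);
CREATE TABLE tag_link (
    tag_id     INTEGER NOT NULL REFERENCES tags(id),
    content_id INTEGER NOT NULL REFERENCES content(id),
    PRIMARY KEY (tag_id, content_id)
);
""")

# Hypothetical sample data: 50 content rows; even ids are tagged 'test',
# odd ids are tagged 'video'.
conn.executemany("INSERT INTO content (id, name) VALUES (?, ?)",
                 [(i, f"article {i}") for i in range(1, 51)])
conn.executemany("INSERT INTO tags (id, tag) VALUES (?, ?)",
                 [(1, "test"), (2, "video")])
conn.executemany("INSERT INTO tag_link (tag_id, content_id) VALUES (?, ?)",
                 [(1 if i % 2 == 0 else 2, i) for i in range(1, 51)])


def latest_with_tag(tag, limit=10):
    """Return the ids of the newest `limit` content rows carrying `tag`."""
    rows = conn.execute(
        """
        SELECT c.id
        FROM tags t
        INNER JOIN tag_link tl ON t.id = tl.tag_id
        INNER JOIN content c ON c.id = tl.content_id
        WHERE t.tag = ?
        ORDER BY tl.content_id DESC
        LIMIT ?
        """,
        (tag, limit),
    ).fetchall()
    return [row[0] for row in rows]


newest_test_ids = latest_with_tag("test")
```

Passing the tag as a bound parameter instead of interpolating it into the SQL string keeps the lookup safe when the tag comes from user input.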

In this case you never need to look for anything other than equality and you start out with 1 tag that you equi-join using integer keys (the fastest join possible).

There are no ifs, ands, or buts about it: this is the fastest way.

Note that because there are at most a few thousand tags, any search will be much faster than delving into the full contents table.

Finally, CSV fields are a very bad idea; never use them in a database.
