如何使用mysql找到最相关的行? [英] How to find the most relevant rows using mysql?
问题描述
我有一个包含数千行的文章的 MySQL 表.每个文章行最多可以有 3 个 VARCHAR 标签字段:tag1
、tag2
、tag3
,每个字段都可以为空(用 NA
在数据库中)
I have a MySQL table of articles with several thousands of rows. Each article row can have up to 3 VARCHAR tag fields: tag1
, tag2
, tag3
, each of which can be empty (filled with NA
in the database)
article
表架构是这样的:
+------------+--------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| title | varchar(100) | NO | | NULL | |
| body | longtext | NO | | NULL | |
| created_at | datetime | NO | MUL | CURRENT_TIMESTAMP | |
| visits | int(11) | NO | | 1 | |
| slug | varchar(100) | NO | | NULL | |
| lasthit | datetime | YES | | CURRENT_TIMESTAMP | |
| tag1 | varchar(100) | NO | | NA | |
| tag2 | varchar(100) | NO | | NA | |
| tag3 | varchar(100) | NO | | NA | |
+------------+--------------+------+-----+-------------------+----------------+
我想为每篇文章找到一个包含 10 篇相关文章
的列表,因此与该文章具有更多相似标签的行应该排在第一位.例如,如果 ArticleA 有:
I'd like to find a list of 10 Relevant Articles
for each article, so the rows which have higher number similar tags with that article should come first. So for example, if ArticleA has:
tag1 = "cat"
tag2 = "dog"
tag3 = "fish"
查询应该返回10篇具有相同标签的文章,如果只有4行有这3个标签,剩下的结果必须来自有2个公共标签的行,如果没有足够的行有2个匹配标签,应返回 1 个匹配的标签.
The query should return 10 articles which have the very same tags, and if there are only 4 row having these 3 tags,the remaining results must be from rows have 2 common tags, and if there are not enough rows with 2 matching tags, 1 matching tag should be returned.
我想知道实现这一目标的最佳方法是什么?
I'm wondering what is the best way to achieve this?
推荐答案
所以 newA.* 是包含一些匹配的 10 篇文章的列表.取决于为文章 A 输入的前一篇文章 ID.
So newA.* is a list of 10 articles with some match. The depends on the input of a previous articleID for articleA.
SELECT newA.*
FROM articles articleA
JOIN articles newA
ON newA.tag1 IN (articleA.tag1, articleA.tag2, articleA.tag3) OR
newA.tag2 IN (articleA.tag1, articleA.tag2, articleA.tag3) OR
newA.tag3 IN (articleA.tag1, articleA.tag2, articleA.tag3)
WHERE articleA.id = {some explict number}
AND newA.id != articleA.id
ORDER BY
(newA.tag1 IN (articleA.tag1, articleA.tag2, articleA.tag3) IS NOT NULL) +
(newA.tag2 IN (articleA.tag1, articleA.tag2, articleA.tag3) IS NOT NULL) +
(newA.tag3 IN (articleA.tag1, articleA.tag2, articleA.tag3) IS NOT NULL)
DESC
LIMIT 10
ORDER BY
接受每个标签,X IN (...)
是:
The ORDER BY
takes each tag, and X IN (...)
is either:
- 1 - 真,值存在
- 0 - 错误,值不存在
- NULL - newA.tagX 为空.
'IS NOT NULL' 将 NULL 变为 0,因此可以添加.
The 'IS NOT NULL' turns NULL into 0, so it can be added.
这个查询效率很低,因为它需要扫描整篇文章来确定匹配.
This query is rather inefficient as it requires to scan the entire articles to determine a match.
为了更高效,article_tags
表将 id
映射到具有 id,tag
的 tag
> 主键将导致此查询:
To be more efficient a article_tags
table that maps the id
to a tag
that has a id,tag
primary key would result in this query:
SELECT newA.*
FROM articles articleA
JOIN article_tags tagsA ON articleA.id=tagsA.id
JOIN article_tags newAtags ON tagsA.tag=newA.tag
JOIN articles newA ON newAtags.id=newA.id
WHERE articleA.id = {some explicit number}
AND newA.id != articleA.id
GROUP BY newA.id
ORDER BY COUNT(*) DESC
LIMIT 10
这篇关于如何使用mysql找到最相关的行?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!