MySQL查找相关文章 [英] MySQL Find Related Articles

查看:56
本文介绍了MySQL查找相关文章的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正尝试选择最多10条相关文章,其中相关文章是与其他文章具有3个或更多相同关键字的文章.

I'm trying to select a maximum of 10 related articles, where a related article is an article that has 3 or more of the same keywords as the other article.

我的表结构如下:

articles[id, title, content, time]
tags[id, tag]
articles_tags[article_id, tag_id]

我可以在一个查询中选择所有相关的文章ID和标题吗?

Can I select the related articles id and title all in one query?

非常感谢您的帮助.

推荐答案

假定标题也是唯一的

SELECT fA.ID, fA.Title
from
   Articles bA,
   articles_tags bAT,
   articles_tags fAT,
   Articles fA
where 
   bA.title = 'some name'   AND
   bA.id = bAT.Article_Id   AND
   bAT.Tag_ID = fAT.Tag_ID    AND
   fAT.Article_ID = fA.ID  AND
   fA.title != 'some name'
GROUP BY 
    fA.ID, fA.Title
HAVING
   count(*) >= 3

在何处排除种子"文章

因为我并不在乎我匹配哪个标签,而只是在3个标签上匹配,所以我只需要tag_id并完全避免连接到标签表.因此,现在我将多对多表与其自身相连,以查找重叠的文章.

Where to exclude the 'seed' article

Because I don't care exactly WHICH tags I match on, just THAT I match on 3 tags, I only need tag_id and avoid the join to the tags table completely. So now I join the many-to-many table to itself to find the articles which have an overlap.

问题在于该文章将100%匹配到自己,因此我们需要从结果中消除它.

The problem is that the article will match itself 100% so we need to eliminate that from the results.

您可以通过3种方式排除该记录.您可以将其从表中筛选到联接之前,可以使其脱离联接,也可以在完成后对其进行筛选.

You can exclude that record in 3 ways. You can filter it from the table to before joining, you can have it fall out of the join, or you can filter it when you're finished.

如果在开始加入之前将其消除,则不会获得太多优势.您有成千上万的文章,而您只消除了1.我还认为,基于article_tag映射表的最佳索引,这将无用.

If you eliminate it before you begin the join, you're not gaining much of an advantage. You've got thousands or millions of articles and you're only eliminating 1. I also believe this will not be useful based on the best index for the article_tag mapping table.

如果将其作为连接的一部分进行,则不等式将阻止该子句成为索引扫描的一部分,并在索引扫描之后用作过滤器.

If you do it as part of the join, the inequality will prevent that clause from being part of the index scan and be applied as a filter after the index scan.

将article_tags上的索引视为(Tag_ID,Article_ID).如果我在tag_id = tag_id上将索引与其自身相连,那么我将通过将索引移至我的种子"文章所具有的每个tag_id来立即定义要处理的索引切片.如果添加子句article_id!= article_id,则无法使用索引来定义要处理的切片.这意味着它将被用作过滤器.例如假设我的第一个标签是蓝色".我浏览索引以获取所有具有蓝色"的文章. (当然是通过ID).假设有50行.我们知道1是我的种子文章,49是匹配项.如果我不包括不等式,那么我将包括所有50条记录并继续前进.如果确实包含不等式,则必须检查50条记录中的每条记录,以查看哪些是我的种子,哪些不是我的种子.下一个标签是"Jupiter",它匹配20,000个文章.再次,我必须检查索引的该切片中的每一行以排除我的种子文章.在经历了2,5,20次(取决于该种子文章的标签)之后,我现在有了一组完全干净的文章来进行COUNT(*)和HAVING处理.如果我不将不等式作为连接的一部分,而只是将SEED ID过滤掉,然后在分组之后将其过滤,那么我只会在很短的列表中执行一次该过滤器.

Consider the index on article_tags as (Tag_ID, Article_ID). If I join the index to itself on tag_id = tag_id then I'll immediately define the slice of the index to process by walking the index to each tag_id my 'seed' article has. If I add the clause article_id != article_id, that can't use the index to define the slice to be processed. That means it will be applied as a filter. e.g. Say my first tag is "BLUE". I walk the index to get all the articles which have "BLUE". (by ID of course). Say there are 50 rows. We know that 1 is my seed article and 49 are matches. If I don't include the inequality, I include all 50 records and move on. If I do include the inequality, I then have to check each of the 50 records to see which is my seed and which isn't. The next tag is "Jupiter" and it matches 20,000 articles. Again I have to check each row in that slice of the index to exclude my seed article. After I go through this 2,5,20 times (depends on tags for that seed article), I now have a completely clean set of articles to do the COUNT(*) and HAVING on. If I don't include the inequality as part of my join but instead just filter the SEED ID out after the group by and having then I only do that filer once on a very short list.

这篇关于MySQL查找相关文章的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆