Optimizing a simple query on two large tables

Problem description

I'm trying to offer a feature that shows the pages most viewed by a user's friends. My friends table has 5.7M rows and the views table has 5.3M rows. At the moment I just want to run a query on these two tables and find the 20 most-viewed page IDs among a person's friends.

Here's the query as I have it now:

SELECT page_id 
FROM `views` INNER JOIN `friendships` ON friendships.receiver_id = views.user_id 
WHERE (`friendships`.`creator_id` = 143416) 
GROUP BY page_id 
ORDER BY count(views.user_id) desc 
LIMIT 20

Here's what the EXPLAIN output looks like:

+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| id | select_type | table       | type | possible_keys                           | key                             | key_len | ref                                     | rows | Extra                                        |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | friendships | ref  | PRIMARY,index_friendships_on_creator_id | index_friendships_on_creator_id | 4       | const                                   |  271 | Using index; Using temporary; Using filesort | 
|  1 | SIMPLE      | views       | ref  | PRIMARY                                 | PRIMARY                         | 4       | friendships.receiver_id                 |   11 | Using index                                  | 
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+

The views table has a primary key of (user_id, page_id), and you can see this is being used. The friendships table has a primary key of (receiver_id, creator_id), and a secondary index of (creator_id).

If I run this query without the GROUP BY and LIMIT, there are about 25,000 rows for this particular user, which is typical.

On the most recent real run, this query took 7 seconds to execute, which is way too long for a decent response in a web app.

One thing I'm wondering is if I should adjust the secondary index to be (creator_id, receiver_id). I'm not sure that will give much of a performance gain though. I'll likely try it today depending on answers to this question.

Can you see any way the query can be rewritten to make it lightning fast?

Update: I need to do more testing on it, but it appears my nasty query works out better if I don't do the grouping and sorting in the db, but do it in Ruby afterwards. The overall time is much shorter, by about 80% it seems. Perhaps my early testing was flawed, but this definitely warrants more investigation. If it's true, then wtf is MySQL doing?

Recommended answer

As far as I know, the best way to make a query like that "lightning fast" is to create a summary table that tracks friend page views per page per creator.

You would probably want to keep it up-to-date with triggers. Then your aggregation is already done for you, and it is a simple query to get the most viewed pages. You can make sure you have proper indexes on the summary table, so that the database doesn't even have to sort to get the most viewed.
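
As a rough sketch of the summary-table idea, assuming MySQL and 4-byte integer IDs (the key_len of 4 in the EXPLAIN output suggests this): the table name friend_page_views, its view_count column, the index idx_creator_count, and the trigger name are all hypothetical and do not come from the question. The trigger only handles new rows in views. Deleted views and added or removed friendships would need their own triggers, and existing data would need a one-time backfill.

```sql
-- Hypothetical summary table: one row per (creator, page), holding the
-- number of that creator's friends who have viewed the page.
CREATE TABLE friend_page_views (
  creator_id INT NOT NULL,
  page_id    INT NOT NULL,
  view_count INT NOT NULL DEFAULT 0,
  PRIMARY KEY (creator_id, page_id),
  -- Lets the top-N query walk the index in count order for one creator.
  KEY idx_creator_count (creator_id, view_count)
);

-- When a user views a page, credit that view to every creator who has
-- the viewer as a friend (friendships.receiver_id = viewer).
CREATE TRIGGER views_after_insert
AFTER INSERT ON views
FOR EACH ROW
  INSERT INTO friend_page_views (creator_id, page_id, view_count)
  SELECT f.creator_id, NEW.page_id, 1
  FROM friendships AS f
  WHERE f.receiver_id = NEW.user_id
  ON DUPLICATE KEY UPDATE
    friend_page_views.view_count = friend_page_views.view_count + 1;

-- The original GROUP BY query then reduces to a single-table lookup.
SELECT page_id
FROM friend_page_views
WHERE creator_id = 143416
ORDER BY view_count DESC
LIMIT 20;
```

Whether MySQL actually reads idx_creator_count in order and skips the filesort is worth confirming with EXPLAIN on real data. Note also that the trigger moves the cost to write time: each new view touches one summary row per creator who lists the viewer as a friend.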

Summary tables are the key to maintaining good performance for aggregation-type queries in read-mostly environments. You do the work up-front, when the updates occur (infrequent) and then the queries (frequent) don't have to do any work.

If your stats don't have to be perfect, and your writes are actually fairly frequent (which is probably the case for something like page views), you can batch up views in memory and process them in the background, so that the friends don't have to take the hit of keeping the summary table up-to-date, as they view pages. That solution also reduces contention on the database (fewer processes updating the summary table).
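
A minimal sketch of what the background flush could look like on the database side, assuming MySQL and a hypothetical summary table friend_page_views(creator_id, page_id, view_count) with a unique key on (creator_id, page_id). The batching itself happens in application code, which is not shown. The rows below are made-up deltas that a worker might have accumulated in memory.

```sql
-- Assumed schema (hypothetical, not from the question):
--   friend_page_views(creator_id, page_id, view_count),
--   PRIMARY KEY (creator_id, page_id)
--
-- One statement applies a whole batch of accumulated view deltas.
-- Each tuple is (creator_id, page_id, views seen since the last flush);
-- the page IDs and counts here are illustrative only.
INSERT INTO friend_page_views (creator_id, page_id, view_count)
VALUES
  (143416, 1001, 3),
  (143416, 1002, 1),
  (200001, 1001, 2)
ON DUPLICATE KEY UPDATE
  view_count = view_count + VALUES(view_count);
```

Rows that already exist have the delta added, and new (creator, page) pairs are inserted with the delta as their starting count. The VALUES() function in this position is deprecated in recent MySQL 8.0 releases in favor of row aliases, so the exact syntax may need adjusting for the server version in use.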
