mysql - Optimizing ORDER BY COALESCE on joined table column


Problem description

EDITED: added full query by request.

In essence I have a table of posts linked one-to-many to a table of reposts, akin to Twitter. I want to load the posts ordered by the time of the repost (if present) or the time of the original post. However, the ordering process is very slow using a single query (probably due to the fact that COALESCE(x, y) doesn't make full use of MySQL indexes). The time column on both relevant tables is indexed.

My query looks something like this.

SELECT * FROM Post p LEFT JOIN p.reposts r ON ... WHERE ... 
ORDER BY COALESCE(r.time, p.time) LIMIT 0, 10

More precisely (pseudo-ish) since I'm using a DAL:

SELECT * FROM Post p LEFT JOIN p.reposts repost ON (p.id = repost.post_id AND    
repost.time = (
  SELECT MIN(r.time) FROM Repost r WHERE p.id = r.post_id
  AND r.user_id IN (1, 2, 3...) AND r.user_id NOT IN (4, 5, 6...)
))
WHERE (repost IS NOT NULL OR p.author_id IN (1, 2, 3...)) 
AND p.author_id NOT IN (4, 5, 6...)
ORDER BY COALESCE(repost.time, p.time) LIMIT 0, 10

In the above, the ON clause ensures that at most one repost (the one I want) is joined. COALESCE is necessary because repost.time is NULL when a post has no matching repost. The query behaves as expected: it is fast when the ORDER BY clause is omitted or refers only to an indexed column such as p.time. This is to be expected since the Post table is large (100k+ rows).

Query Explanation

EDIT: a better explanation of what the query should do. It's worth noting that the logic here works: I get the data I want. The problem is that applying the ORDER BY clause makes the query run about 50x slower, because MySQL can't use the indexes with COALESCE on a joined table.

  • Load a list of 10 posts that are either authored by a set of users (followed) or reposted by the same set (followed), ordered by most recent.
  • Posts should be ordered by either the time of the post or the first repost.
  • Ignore posts and reposts by users in a different set (blocked)

  • Get posts: SELECT from posts

  • Get the earliest repost by a user in the followed set: LEFT JOIN ON... r.time = (SELECT MIN(r.time)...)
  • Filter out posts not authored or reposted by users in the followed set: WHERE (repost IS NOT NULL...)
  • Order by the first repost (if it exists) or the publication time: ORDER BY COALESCE(repost.time, p.time)
  • Load at most 10 posts: LIMIT 0, 10
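
The steps above can be sketched end to end on a toy dataset. This is only an illustration under stated assumptions: it runs the SQL through Python's built-in sqlite3 module rather than MySQL so that no server is needed, the Post/Repost columns and the sample rows are hypothetical, and users 1-3 stand in for the followed set while 4-6 stand in for the blocked set. It shows the logic of the query, not its MySQL performance.

```python
import sqlite3


def build_db():
    """Create a tiny in-memory dataset (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Post (id INTEGER PRIMARY KEY, author_id INTEGER, time INTEGER);
        CREATE TABLE Repost (id INTEGER PRIMARY KEY, post_id INTEGER,
                             user_id INTEGER, time INTEGER);
        CREATE INDEX idx_post_time ON Post (time);
        CREATE INDEX idx_repost_time ON Repost (time);
    """)
    conn.executemany("INSERT INTO Post VALUES (?, ?, ?)", [
        (1, 1, 100),  # followed author, never reposted
        (2, 9, 50),   # unknown author, reposted by followed user 2
        (3, 4, 400),  # blocked author
        (4, 9, 500),  # unknown author, reposted only by a blocked user
        (5, 3, 200),  # followed author, reposted twice by followed user 1
    ])
    conn.executemany("INSERT INTO Repost VALUES (?, ?, ?, ?)", [
        (1, 2, 2, 300),
        (2, 4, 5, 600),
        (3, 5, 1, 350),
        (4, 5, 1, 250),
    ])
    return conn


def load_feed(conn):
    """Return (post id, sort time) pairs, most recent first."""
    return conn.execute("""
        SELECT p.id, COALESCE(r.time, p.time) AS sort_time
        FROM Post p
        LEFT JOIN Repost r
               ON r.post_id = p.id
              AND r.time = (SELECT MIN(r2.time) FROM Repost r2
                            WHERE r2.post_id = p.id
                              AND r2.user_id IN (1, 2, 3)        -- followed
                              AND r2.user_id NOT IN (4, 5, 6))   -- blocked
        WHERE (r.id IS NOT NULL OR p.author_id IN (1, 2, 3))
          AND p.author_id NOT IN (4, 5, 6)
        ORDER BY sort_time DESC
        LIMIT 10
    """).fetchall()
```

On this data, load_feed(build_db()) returns posts 2, 5 and 1 in that order: post 2 sorts by its repost time (300), post 5 by its earliest followed repost (250), and post 1 by its own publication time (100). Posts 3 and 4 are filtered out.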

UPDATE

I found that:

...ORDER BY repost.time DESC

Produces slow results as well unless I also add:

...WHERE repost.id IS NOT NULL...

In which case the query is fast. This leads me to believe that the real problem is sorting on nullable column indexes. I also tried:

... ORDER BY CASE WHEN repost.id IS NULL THEN p.time ELSE repost.time END DESC

Which didn't help.
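
A likely reason the CASE variant made no difference is that it computes the same per-row sort key as COALESCE, so it is still an expression over columns from two tables. The sketch below checks that equivalence on three hypothetical rows, using Python's sqlite3 module as a stand-in for MySQL; the column names repost_id, repost_time and post_time are made up for the illustration.

```python
import sqlite3


def sort_keys():
    """Return (COALESCE key, CASE key) for each sample row."""
    conn = sqlite3.connect(":memory:")
    return conn.execute("""
        WITH sample(repost_id, repost_time, post_time) AS (
            VALUES (1, 300, 50),       -- post with a joined repost
                   (NULL, NULL, 100),  -- post without a repost
                   (2, 250, 200)       -- post with a joined repost
        )
        SELECT COALESCE(repost_time, post_time),
               CASE WHEN repost_id IS NULL THEN post_time
                    ELSE repost_time END
        FROM sample
        ORDER BY post_time
    """).fetchall()
```

Both columns come out identical for every row, which is consistent with the two ORDER BY forms behaving the same way.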

UPDATE 2

Because MySQL uses B-trees for its indexes, it seems impossible to leverage the indexes in the way I want. My current best idea is therefore to treat each original post as a "repost" by its author, then perform my select and order on the repost table, e.g.

SELECT * FROM Repost r LEFT JOIN r.post ON ... WHERE ... ORDER BY r.time DESC

Solution

The problem here was as I described in update 2 of my question. MySQL uses indexes to perform ORDER BY operations quickly. More specifically, MySQL uses B-trees to index columns (such as timestamps - p.time/r.time), which use up a bit more space but allow for faster sorting.

The issue with my query was that it was sorting by the time column in two tables, using the timestamp from the repost table if available, and the post table otherwise. Since MySQL can't combine the B-trees from both tables, it can't perform fast index sorting on columns from two different tables.
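
One way to observe this is to ask the query planner whether it needs a separate sort step. The sketch below is a stand-in under stated assumptions: it uses SQLite through Python's sqlite3 module, not MySQL, and a hypothetical two-table schema. SQLite reports USE TEMP B-TREE FOR ORDER BY when it cannot read rows in index order; in MySQL the comparable signal would be "Using filesort" in the Extra column of EXPLAIN.

```python
import sqlite3

# Sort key comes from one indexed column of one table.
SINGLE_TABLE = "SELECT * FROM Post p ORDER BY p.time DESC LIMIT 10"

# Sort key is an expression over columns of two tables.
TWO_TABLES = """
    SELECT * FROM Post p
    LEFT JOIN Repost r ON r.post_id = p.id
    ORDER BY COALESCE(r.time, p.time) DESC LIMIT 10
"""


def make_conn():
    """Create the empty hypothetical schema with the relevant indexes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Post (id INTEGER PRIMARY KEY, author_id INTEGER, time INTEGER);
        CREATE TABLE Repost (id INTEGER PRIMARY KEY, post_id INTEGER,
                             user_id INTEGER, time INTEGER);
        CREATE INDEX idx_post_time ON Post (time);
        CREATE INDEX idx_repost_post ON Repost (post_id);
        CREATE INDEX idx_repost_time ON Repost (time);
    """)
    return conn


def plan_details(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


def needs_sort_step(conn, sql):
    """True if the planner sorts rows itself instead of walking an index."""
    return any("TEMP B-TREE FOR ORDER BY" in d for d in plan_details(conn, sql))
```

With this schema, needs_sort_step should be False for SINGLE_TABLE, where the index on Post.time already supplies the order, and True for TWO_TABLES, where no single index holds the COALESCE value.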

I modified my query and table structure in two ways to solve this.

1) Perform filtering based on blocked users first, so ordering only has to be done on posts that are accessible to the current user. This was not the root of the problem, but it is a practical optimization, e.g.

SELECT * FROM (SELECT * FROM Post p WHERE p.author_id NOT IN (4, 5, 6...))...

2) Treat every post as a repost by its author, so every post is guaranteed to have a joinable repost and repost.time on which to index and sort. e.g.

SELECT * FROM (...) p LEFT JOIN p.reposts repost ON (p.id = repost.post_id AND 
repost.time = (
  SELECT MIN(r.time) FROM Repost r WHERE p.id = r.post_id
  AND r.user_id IN (1, 2, 3...) AND r.user_id NOT IN (4, 5, 6...)
))
WHERE (repost.id IS NOT NULL) ORDER BY repost.time DESC LIMIT 0, 10

At the end of the day the issue came down to ORDER BY - this approach reduced the query time from about 8 seconds to 20 ms.
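
The "every post is a repost by its author" idea can be sketched on a toy dataset as follows. The assumptions are the same kind as before: Python's sqlite3 module stands in for MySQL, the schema and rows are hypothetical, users 1-3 are the followed set and 4-6 the blocked set, and no two reposts of one post share a timestamp. The only structural change is the extra INSERT ... SELECT that copies each post into Repost, after which the sort key is a single column, repost.time.

```python
import sqlite3


def build_db():
    """Create sample data, then add one self-repost per post."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Post (id INTEGER PRIMARY KEY, author_id INTEGER, time INTEGER);
        CREATE TABLE Repost (id INTEGER PRIMARY KEY, post_id INTEGER,
                             user_id INTEGER, time INTEGER);
        CREATE INDEX idx_repost_time ON Repost (time);
    """)
    conn.executemany("INSERT INTO Post VALUES (?, ?, ?)", [
        (1, 1, 100), (2, 9, 50), (3, 4, 400), (4, 9, 500), (5, 3, 200),
    ])
    conn.executemany(
        "INSERT INTO Repost (post_id, user_id, time) VALUES (?, ?, ?)",
        [(2, 2, 300), (4, 5, 600), (5, 1, 350), (5, 1, 250)],
    )
    # Treat every post as a repost by its own author, at publication time.
    conn.execute("""
        INSERT INTO Repost (post_id, user_id, time)
        SELECT id, author_id, time FROM Post
    """)
    return conn


def load_feed(conn):
    """Return (post id, repost time) pairs ordered by repost.time only."""
    return conn.execute("""
        SELECT p.id, repost.time
        FROM (SELECT * FROM Post WHERE author_id NOT IN (4, 5, 6)) p
        LEFT JOIN Repost repost
               ON repost.post_id = p.id
              AND repost.time = (SELECT MIN(r.time) FROM Repost r
                                 WHERE r.post_id = p.id
                                   AND r.user_id IN (1, 2, 3)
                                   AND r.user_id NOT IN (4, 5, 6))
        WHERE repost.id IS NOT NULL
        ORDER BY repost.time DESC
        LIMIT 10
    """).fetchall()
```

On this data, load_feed(build_db()) returns posts 2, 5 and 1 with times 300, 200 and 100. One consequence worth noting: for a post written by a followed user, the earliest matching row is the self-repost, so post 5 now sorts by its publication time (200) rather than by its first real repost (250).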
