无法优化MySQL查询 [英] Trouble optimizing MySQL query
问题描述
我正在处理以下查询,但不确定如何继续进一步优化:
I'm working on the following query but not sure how to proceed with further optimizations:
SELECT u.id AS userId, firstName, profilePhotoId, preferredActivityId, preferredSubActivityId, availabilityType,
3959 * ACOS(COS(radians(requestingUserLat)) * COS(radians(u.latitude)) * COS(radians(u.longitude) - radians(requestingUserLon)) + SIN(radians(requestingUserLat)) * SIN(radians(u.latitude))) AS distanceInMiles
FROM users u
WHERE u.id IN (
SELECT uu.id
FROM users uu
WHERE uu.latitude between lat1 and lat2 -- MySQL 5.7 supports Point data type, but it is not indexed in innoDB. We store latitude and longitude as DOUBLE for now
AND uu.longitude between lon1 and lon2
AND uu.dateOfBirth between maxAge and minAge -- dates are in millis, therefore maxAge will have a smaller value than minAge and so it needs to go first
)
AND IF(gender is null, TRUE, u.gender = gender)
AND IF(activityType is null, TRUE, u.preferredActivityType = activityType)
AND u.accountState = 'A'
AND u.id != userId
HAVING distanceInMiles < searchRadius ORDER BY distanceInMiles LIMIT pagingStart, pagingLength;
CREATE INDEX `findMatches` ON `users` (`latitude` ASC, `longitude` ASC, `dateOfBirth` ASC) USING BTREE;
这里的想法是使用上面指定的覆盖索引,使用内部查询根据用户位置和年龄标识符合条件的行。在具有几百万行的表中,将它们缩小到几千,而不需要全表扫描。然后针对更细粒度的条件(例如性别,可用性等)测试结果行。 - 这次对缩减结果集的完全扫描是不可避免的。
The idea here is to have an inner query that identifies qualifying rows based on user location and age, using the covering index specified above. In a table with a few million rows, that narrows them down to a few thousand without requiring a full table scan. Then the resulting rows are tested against more fine-grain conditions, such as gender, availability, etc. - this time a full scan of the reduced resultset is inevitable.
这按预期差不多运行, EXPLAIN
显示内部查询确实利用覆盖索引的完整密钥长度(3列),然后外部查询看起来通过PK返回的行。
This runs "almost" as expected, EXPLAIN
reveals that the inner query does indeed make use of the full key length (3 columns) of the covering index, and the outer query then looks up the rows returned, by PK.
< br>
问题:
当搜索范围在几百英里之内时性能令人满意,但当我到达千里之后,它开始降级因为指定边界内的用户数量增加。如果搜索范围保持不变,但用户数量增加了几个订单,问题也会变得明显。以下是我到目前为止所发现的问题:
The problem:
The performance is satisfactory while the search range is within a few hundred miles, but when I get to a thousand miles, it starts degrading because the number of users within the specified boundary increases. The problem would also become apparent if the search range remained the same, but the number of users increased by a few orders. Here are the problems that I have identified so far:
- MySQL目前不支持
LIMIT
在内部查询中,因此内部查询将返回所有符合条件的userIDs
(即数千),即使外部查询将它们限制为只有十几个左右。 - 启用
optimizer_trace
并查看幕后工作表明只有列纬度$ c $我的覆盖索引的c>用作
范围
。我不确定为什么会这样,特别是因为EXPLAIN
表明使用了完整的索引键长度。
- MySQL does not currently support
LIMIT
in inner queries, therefore the inner query will return ALL qualifyinguserIDs
(i.e., thousands), even if the outer query will then limit them to just a dozen or so. - Enabling
optimizer_trace
and looking at the work behind the scenes suggests that only columnlatitude
of my covering index is used as arange
. I'm not sure why that is, especially sinceEXPLAIN
suggests that the full index key length is used.
问题:
我如何解决上述(1)和(2)?在有人建议使用lat和long的空间数据类型之前,请注意最新的InnoDB引擎(MySQL v5.7)不支持空间索引,只支持空间数据类型。
推荐答案
我认为其他答案已经涵盖了这一点。在索引中使用查询中的数据与使用索引中的数据来寻找正确的解决方案之间存在差异。后者是指数的最有效使用。前者很有帮助,但效率只是不读数据页。
I think this has already been covered in other answers. There is a difference between using the data in the index for a query, and using the data in the index to seek to the right solution. The latter is the most efficient use of an index. The former is helpful, but the efficiency is simply in not reading in the data pages.
我认为你可以通过使用 exists <来改善你的查询/ code>而不是
中的。这应该允许在外层进行过滤以提高查询的性能:
I think you can improve your query by using exists
instead of in
. This should allow filtering at the outer level to improve performance of the query:
SELECT u.id AS userId, firstName, profilePhotoId, preferredActivityId, preferredSubActivityId, availabilityType,
3959 * ACOS(COS(radians(requestingUserLat)) * COS(radians(u.latitude)) * COS(radians(u.longitude) - radians(requestingUserLon)) + SIN(radians(requestingUserLat)) * SIN(radians(u.latitude))) AS distanceInMiles
FROM users u
WHERE EXISTS (SELECT 1
FROM users uu
WHERE uu.latitude between lat1 and lat2 AND
uu.longitude between lon1 and lon2 AND
uu.dateOfBirth between maxAge and minAge AND
uu.id = u.id
) AND
IF(gender is null, TRUE, u.gender = gender) AND
IF(activityType is null, TRUE, u.preferredActivityType = activityType) AND
u.accountState = 'A' AND
u.id <> userId
HAVING distanceInMiles < searchRadius
ORDER BY distanceInMiles LIMIT pagingStart, pagingLength;
注意,表达式 IF(性别为空,TRUE,u .gender =性别)
是非常不敏感的,因为它总是评估为真。如果您有一个名为 gender
的变量,则不会在此表达式中使用它。 gender
将根据MySQL范围规则进行解释,并且是表中的一列。您应该始终使用类似 var _
或 p _
之类的前缀来区分参数与表中的列。
As a note, the expression IF(gender is null, TRUE, u.gender = gender)
is rather non-sensical because it always evaluates to true. If you have a variable called gender
, it will not be used in this expression. The gender
will be interpreted according to MySQL scoping rules and be a column in a table. You should always use a prefix like var_
or p_
or something to distinguish parameters from columns in the table.
编辑:
我应该提到索引需要包含 id
作为使用的第一列
。
I should have mentioned that the index needs to include id
as the first column to use with exists
.
这篇关于无法优化MySQL查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!