Should SQL ranking functionality be considered "use with caution"?
Question
This question originates from a discussion on whether to use SQL ranking functionality or not in a particular case.
Any common RDBMS includes some ranking functionality, i.e. its query language has elements like TOP n ... ORDER BY key, ROW_NUMBER() OVER (ORDER BY key), or ORDER BY key LIMIT n (overview).
They do a great job of improving performance if you want to present only a small chunk out of a huge number of records. But they also introduce a major pitfall: if key is not unique, the results are non-deterministic. Consider the following example:
users
user_id  name
1        John
2        Paul
3        George
4        Ringo

logins
login_id  user_id  login_date
1         4        2009-08-17
2         1        2009-08-18
3         2        2009-08-19
4         3        2009-08-20
A query is supposed to return the person who logged in last:
SELECT TOP 1 users.*
FROM
logins JOIN
users ON logins.user_id = users.user_id
ORDER BY logins.login_date DESC
Just as expected, George is returned and everything looks fine. But then a new record is inserted into the logins table:
logins
login_id  user_id  login_date
1         4        2009-08-17
2         1        2009-08-18
3         2        2009-08-19
4         3        2009-08-20
5         4        2009-08-20
What does the query above return now? Ringo? George? You can't tell. As far as I remember, MySQL 4.1, for example, returns the first physically created record that matches the criteria, i.e. the result would be George. But this may vary from version to version and from DBMS to DBMS. What should have been returned? One might say Ringo, since he apparently logged in last, but this is pure interpretation. In my opinion both should have been returned, because you can't decide unambiguously from the available data.
So this query meets the requirement:
SELECT users.*
FROM
logins JOIN
users ON
logins.user_id = users.user_id AND
logins.login_date = (
SELECT max(logins.login_date)
FROM
logins JOIN
users ON logins.user_id = users.user_id)
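The difference between the two queries can be checked with a small sketch. It is not from the original post: it assumes SQLite through Python's standard sqlite3 module, uses LIMIT 1 in place of TOP 1, and writes the aggregate query in an equivalent WHERE form with a simplified subquery. The helper name make_db is hypothetical.

```python
import sqlite3


def make_db():
    """Build the example tables in an in-memory SQLite database (assumed setup)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE logins (login_id INTEGER PRIMARY KEY,
                             user_id INTEGER, login_date TEXT);
        INSERT INTO users VALUES (1, 'John'), (2, 'Paul'), (3, 'George'), (4, 'Ringo');
        INSERT INTO logins VALUES
            (1, 4, '2009-08-17'), (2, 1, '2009-08-18'), (3, 2, '2009-08-19'),
            (4, 3, '2009-08-20'), (5, 4, '2009-08-20');
    """)
    return conn


conn = make_db()

# Ranking-style query: always exactly one row, even though two logins tie.
top1 = conn.execute("""
    SELECT users.name
    FROM logins JOIN users ON logins.user_id = users.user_id
    ORDER BY logins.login_date DESC
    LIMIT 1
""").fetchall()

# Aggregate-style query: every user whose login date equals the maximum.
all_latest = conn.execute("""
    SELECT users.name
    FROM logins JOIN users ON logins.user_id = users.user_id
    WHERE logins.login_date = (SELECT MAX(login_date) FROM logins)
    ORDER BY users.name
""").fetchall()

print(top1)        # one row; which of the tied users it names is not guaranteed
print(all_latest)  # [('George',), ('Ringo',)]
```

The first result silently drops one of the tied users, while the second makes the tie visible to the caller.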
As an alternative, some DBMSs provide special functions for this very purpose (e.g. Microsoft SQL Server 2005 introduces TOP n WITH TIES ... ORDER BY key (suggested by gbn), RANK, and DENSE_RANK).
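To illustrate how these functions treat ties, here is a hedged sketch that compares ROW_NUMBER, RANK, and DENSE_RANK on the example data. It assumes SQLite 3.25 or later (the first version with window functions) via Python's sqlite3 module; the column aliases row_num, rnk, and dense_rnk are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE logins (login_id INTEGER PRIMARY KEY,
                         user_id INTEGER, login_date TEXT);
    INSERT INTO users VALUES (1, 'John'), (2, 'Paul'), (3, 'George'), (4, 'Ringo');
    INSERT INTO logins VALUES
        (1, 4, '2009-08-17'), (2, 1, '2009-08-18'), (3, 2, '2009-08-19'),
        (4, 3, '2009-08-20'), (5, 4, '2009-08-20');
""")

# Side-by-side comparison of the three ranking functions over the same ordering.
rows = conn.execute("""
    SELECT users.name, logins.login_id,
           ROW_NUMBER() OVER (ORDER BY logins.login_date DESC) AS row_num,
           RANK()       OVER (ORDER BY logins.login_date DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY logins.login_date DESC) AS dense_rnk
    FROM logins JOIN users ON logins.user_id = users.user_id
    ORDER BY logins.login_id
""").fetchall()

for name, login_id, row_num, rnk, dense_rnk in rows:
    # Logins 4 and 5 share rank 1; their row numbers are 1 and 2 in arbitrary order.
    print(login_id, name, row_num, rnk, dense_rnk)

# Filtering on RANK = 1 keeps every tied row instead of picking one of them.
latest = conn.execute("""
    SELECT name FROM (
        SELECT users.name AS name,
               RANK() OVER (ORDER BY logins.login_date DESC) AS rnk
        FROM logins JOIN users ON logins.user_id = users.user_id
    ) AS ranked
    WHERE rnk = 1
    ORDER BY name
""").fetchall()

print(latest)  # [('George',), ('Ringo',)]
```

RANK leaves a gap after the tie (the next rank is 3), DENSE_RANK does not (the next rank is 2), and ROW_NUMBER hides the tie altogether.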
If you search SO for e.g. ROW_NUMBER, you'll find numerous solutions that suggest using ranking functionality but fail to point out the possible problems.
Question: What advice should be given when a solution containing ranking functionality is proposed?
Recommended answer
Here is the summary:
- Use your head first. It should be obvious, but it is always a good place to start. Do you expect exactly n rows, or do you expect a possibly varying number of rows that fulfill a constraint? Reconsider your design. If you're expecting exactly n rows, your model might be poorly designed if it's impossible to identify a row unambiguously. If you expect a possibly varying number of rows, you might need to adjust your UI in order to present your query results.
- Add columns to key that make it unique (e.g. the PK). You at least regain control over the returned result. There is almost always a way to do this, as Quassnoi pointed out.
- Consider using possibly more suitable functions like RANK, DENSE_RANK and TOP n WITH TIES. The window functions RANK and DENSE_RANK are available in Microsoft SQL Server from the 2005 version and in PostgreSQL from 8.4 onwards; TOP n WITH TIES is SQL Server syntax. If these functions are not available, consider using nested queries with aggregation instead of ranking functions.
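The second point, making the sort key unique, might look like the following minimal sketch. It again assumes SQLite via Python's sqlite3 module. It also assumes that a higher login_id means a later login, which the example data does not actually state; that is a modelling decision you would have to justify yourself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE logins (login_id INTEGER PRIMARY KEY,
                         user_id INTEGER, login_date TEXT);
    INSERT INTO users VALUES (1, 'John'), (2, 'Paul'), (3, 'George'), (4, 'Ringo');
    INSERT INTO logins VALUES
        (1, 4, '2009-08-17'), (2, 1, '2009-08-18'), (3, 2, '2009-08-19'),
        (4, 3, '2009-08-20'), (5, 4, '2009-08-20');
""")

# login_id is the primary key, so (login_date, login_id) is a unique sort key
# and the "top 1" row no longer depends on physical row order.
result = conn.execute("""
    SELECT users.name
    FROM logins JOIN users ON logins.user_id = users.user_id
    ORDER BY logins.login_date DESC, logins.login_id DESC
    LIMIT 1
""").fetchall()

print(result)  # [('Ringo',)]
```

The result is now reproducible, but only because the tiebreaker encodes an explicit rule for resolving ties. Whether that rule is the right one is still a design question, as the first point says.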