Postgres DISTINCT 与 DISTINCT ON 之间有什么区别? [英] What is the difference between Postgres DISTINCT vs DISTINCT ON?

查看:56
本文介绍了Postgres DISTINCT 与 DISTINCT ON 之间有什么区别?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个使用以下语句创建的 Postgres 表.此表由其他服务的数据转储填充.

I have a Postgres table created with the following statement. This table is filled by as dump of data from another service.

CREATE TABLE data_table (
    date date DEFAULT NULL,
    dimension1 varchar(64) DEFAULT NULL,
    dimension2 varchar(128) DEFAULT NULL
) TABLESPACE pg_default;

我正在构建的 ETL 中的一个步骤是提取 dimension1 的唯一值并将它们插入另一个中间表中.但是,在一些测试中,我发现下面的 2 个命令不会返回相同的结果.我希望两者都返回相同的金额.与第二个命令相比,第一个命令返回的结果更多(1466 行 vs. 1504.

One of the steps in a ETL I'm building is extracting the unique values of dimension1 and inserting them in another intermediary table. However, during some tests I found out that the 2 commands below do not return the same results. I would expect for both to return the same sum. The first command returns more results compared with the second (1466 rows vs. 1504.

-- command 1
SELECT DISTINCT count(dimension1)
FROM data_table;

-- command 2    
SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;

对此有任何明显的解释吗?除了解释之外,是否有任何关于我应该对数据进行检查的建议?

Any obvious explanations for this? Alternatively to an explanation, is there any suggestion of any check on the data I should do?

以下查询都返回 1504(与简单"DISTINCT 相同)

The following queries both return 1504 (same as the "simple" DISTINCT)

SELECT count(*)
FROM data_table WHERE dimension1 IS NOT NULL;

SELECT count(dimension1)
FROM data_table;

谢谢!

推荐答案

DISTINCT 和 DISTINCT ON 的语义完全不同.

先说理论

DISTINCT 适用于整个元组.计算出查询结果后,DISTINCT 会从结果中删除任何重复的元组.

DISTINCT applies to an entire tuple. Once the result of the query is computed, DISTINCT removes any duplicate tuples from the result.

例如,假设表 R 具有以下内容:

For example, assume a table R with the following contents:

#table r;
a | b 
---+---
1 | a
2 | b
3 | c
3 | d
2 | e
1 | a

(6 行)

SELECT distinct * from R 将导致:

SELECT distinct * from R will result:

# select distinct * from r;
 a | b 
---+---
 1 | a
 3 | d
 2 | e
 2 | b
 3 | c
(5 rows)

请注意,distinct 适用于整个投影属性列表:因此

Note that distinct applies to the entire list of projected attributes: thus

select distinct * from R

在语义上等同于

select distinct a,b from R

你不能发出

select a, distinct b From R

DISTINCT 必须跟在 SELECT 之后.它适用于整个元组,而不适用于结果的一个属性.

DISTINCT must follow SELECT. It applies to the entire tuple, not to an attribute of the result.

DISTINCT ON 是该语言的 postgresql 补充.与 group by 类似,但不完全相同.

DISTINCT ON is a postgresql addition to the language. It is similar, but not identical, to group by.

它的语法是:

 SELECT DISTINCT ON (attributeList) <rest as any query>

例如:

 SELECT DISTINCT ON (a) * from R

它的语义可以描述如下.像往常一样计算查询,但在投影结果之前,对当前结果进行排序并根据 DISTINCT ON 中的属性列表对其进行分组(类似于 group by).现在,使用每组中的第一个元组进行投影并忽略其他元组.

It semantics can be described as follows. Compute the query as usual, but before the projection of the result, sort the current result and group it according to the attribute list in DISTINCT ON (similar to group by). Now, do the projection using the first tuple in each group and ignore the other tuples.

示例:

select distinct * from r order by a;
     a | b 
    ---+---
     1 | a
     2 | e
     2 | b
     3 | c
     3 | d
    (5 rows)

然后对于 a 的每个不同值,取第一个元组.与以下相同:

Then for every different value of a, take the first tuple. Which is the same as:

 SELECT DISTINCT on (a) * from r;
  a | b 
 ---+---
 1 | a
 2 | b
 3 | c
 (3 rows)

某些 DBMS(尤其是 mySQL)将允许您运行此查询:

Some DBMS (most notably mySQL) will allow you to run this query:

 SELECT a,b from R group by a;

这会给你类似的结果.

Postgresql 将允许这个查询,当且仅当存在从 a 到 b 的函数依赖.换句话说,如果对于关系 R 的任何实例,每个值或 a 只有一个元组,则此查询将有效(因此选择第一个元组是确定性的:只有一个元组).

Postgresql will allow this query, if and only if there is a functional dependency from a to b. In other words, this query will be valid if for any instance of the relation R, there is only one tuple for every value or a (thus selecting the first tuple is deterministic: there is only one tuple).

例如,如果 R 的主键是 a,则 a->b 和:

For instance, if the primary key of R is a, then a->b and:

SELECT a,b FROM R group by a

等同于:

  SELECT DISTINCT on (a) a, b from r;

现在,回到你的问题:

第一次查询:

SELECT DISTINCT count(dimension1)
FROM data_table;

计算维度 1 的计数(data_table 中维度 1 不为空的元组数).这个查询返回一个元组,它总是唯一的(因此 DISTINCT是多余的).

computes the count of dimension1 (number of tuples in data_table that where dimension1 is not null). This query returns one tuple, which is always unique (hence DISTINCT is redundant).

查询 2:

SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;

这是查询中的查询.为了清楚起见,让我重写一下:

This is query in a query. Let me rewrite it for clarity:

WITH tmp_table AS (
   SELECT DISTINCT ON (dimension1) 
     dimension1 FROM data_table
     GROUP by dimension1) 
SELECT count(*) from tmp_table

让我们先计算 tmp_table.正如我上面提到的,让我们首先忽略 DISTINCT ON 并执行其余的操作询问.这是一个按维度 1 分组.因此这部分查询将导致每个维度 1 的不同值产生一个元组.

Let us compute first tmp_table. As I mentioned above, let us first ignore the DISTINCT ON and do the rest of the query. This is a group by by dimension1. Hence this part of the query will result in one tuple per different value of dimension1.

现在,DISTINCT ON.它再次使用维度 1.但是维度 1 已经是唯一的(由于 group by).因此这使得 DISTINCT ON superflouos(它什么都不做).最终计数只是对 group by 中所有元组的计数.

Now, the DISTINCT ON. It uses dimension1 again. But dimension1 is unique already (due to the group by). Hence this makes the DISTINCT ON superflouos (it does nothing). The final count is simply a count of all the tuples in the group by.

如您所见,以下查询中存在等价关系(适用于具有属性 a 的任何关系):

As you can see, there is an equivalence in the following query (it applies to any relation with an attribute a):

SELECT (DISTINCT ON a) a
FROM R

SELECT a FROM R group by a

SELECT DISTINCT a FROM R

警告

在查询中使用 DISTINCT ON 结果对于数据库的任何给定实例可能是不确定的.换句话说,对于相同的表,查询可能会返回不同的结果.

Using DISTINCT ON results in a query might be non-deterministic for the any given instance of the database. In other words, the query might return different results for the same tables.

一个有趣的方面

Distinct ON 以更简洁的方式模拟 sqlite 和 mysql 的不良行为.假设R有两个属性a和b:

Distinct ON emulates a bad behaviour of sqlite and mysql in a much cleaner way. Assume that R has two attributes a and b:

SELECT a, b FROM R group by a

是 SQL 中的非法语句.然而,它运行在 mysql 和 sqlite 上.它只是从 a 的相同值组中的任何元组中获取 b 的随机值.在 Postgresql 中,此语句是非法的.相反,您必须使用 DISTINCT ON 并写入:

is an illegal statement in SQL. Yet, it runs on mysql and sqlite. It simply takes a random value of b from any of the tuples in the group of same values of a. In Postgresql this statement is illegal. Instead, you must use DISTINCT ON and write:

SELECT DISTINCT ON (a) a,b from R

推论

DISTINCT ON 在 group by 中很有用,当您要访问在功能上依赖于 group by 属性的值时.换句话说,如果您知道对于每组属性,它们的第三个属性始终具有相同的值,则在该组属性上使用 DISTINCT.否则,您将不得不进行 JOIN 以检索该第三个属性.

DISTINCT ON is useful in a group by when you want to access a value that is functionally dependent on the group by attributes. In other words, if you know that for every group of attributes they always have the same value of the third attribute, then use DISTINCT ON that group of attributes. Otherwise you would have to make a JOIN to retrieve that third attribute.

这篇关于Postgres DISTINCT 与 DISTINCT ON 之间有什么区别?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆