JOIN 后的 GROUP 或 DISTINCT 返回重复项 [英] GROUP or DISTINCT after JOIN returns duplicates
问题描述
我有两个表,products
和 meta
.它们是 1:N 关系,其中每个产品行通过外键至少有一个元行.
I have two tables, products
and meta
. They are in relation 1:N where each product row has at least one meta row via foreign key.
(即 SQLfiddle:http://sqlfiddle.com/#!15/c8f34/1)
(viz. SQLfiddle: http://sqlfiddle.com/#!15/c8f34/1)
我需要加入这两个表,但我只需要过滤独特的产品.当我尝试此查询时,一切正常(返回 4 行):
I need to join these two tables but i need to filter only unique products. When I try this query, everything is ok (4 rows returned):
SELECT DISTINCT(product_id)
FROM meta JOIN products ON products.id = meta.product_id
但是当我尝试选择所有列时,DISTINCT 规则不再适用于结果,因为返回的是 8 行而不是 4 行.
but when I try to select all columns the DISTINCT rule no longer applies to results, as 8 rows instead of 4 is returned.
SELECT DISTINCT(product_id), *
FROM meta JOIN products ON products.id = meta.product_id
我尝试了很多方法,比如尝试在子查询上使用 DISTINCT
或 GROUP BY
但总是得到相同的结果.
I have tried many approaches like trying to DISTINCT
or GROUP BY
on sub-query but always with same result.
推荐答案
虽然从表中检索所有或大部分行,但此类查询的最快方法通常是首先聚合/消除歧义稍后加入:
While retrieving all or most rows from a table, the fastest way for this type of query typically is to aggregate / disambiguate first and join later:
SELECT *
FROM products p
JOIN (
SELECT DISTINCT ON (product_id) *
FROM meta
ORDER BY product_id, id DESC
) m ON m.product_id = p.id;
products
中每行 meta
中的行数越多,对性能的影响就越大.
The more rows in meta
per row in products
, the bigger the impact on performance.
当然,您需要在子查询中添加一个 ORDER BY
子句来定义 哪 行以从子查询中的每个集合中选择.@Craig 和 @Clodoaldo 已经告诉过你了.我正在返回具有最高 id
的 meta
行.
Of course, you'll want to add an ORDER BY
clause in the subquery do define which row to pick form each set in the subquery. @Craig and @Clodoaldo already told you about that. I am returning the meta
row with the highest id
.
DISTINCT ON
的详细信息:
不过,这并不总是最快的解决方案.根据数据分布,有各种其他查询样式.对于这个涉及另一个连接的简单案例,这个在大表测试中的运行速度要快得多:
Still, this is not always the fastest solution. Depending on data distribution there are various other query styles. For this simple case involving another join, this one ran considerably faster in a test with big tables:
SELECT p.*, sub.meta_id, m.product_id, m.price, m.flag
FROM (
SELECT product_id, max(id) AS meta_id
FROM meta
GROUP BY 1
) sub
JOIN meta m ON m.id = sub.meta_id
JOIN products p ON p.id = sub.product_id;
如果您不使用非描述性的 id
作为列名,我们就不会遇到命名冲突,只需编写 SELECT p.*, m.*
.(我从不使用 id
作为列名.)
If you wouldn't use the non-descriptive id
as column names, we would not run into naming collisions and could simply write SELECT p.*, m.*
. (I never use id
as column name.)
如果性能是您的首要要求,请考虑更多选择:
If performance is your paramount requirement, consider more options:
- a
实体化视图
使用来自meta
的预聚合数据,如果您的数据没有变化(很多). - 一个递归 CTE 模拟 松散索引扫描big
meta
表,每个产品有 许多 行(相对较少的不同product_id
).
这是我所知道的对整个表的 DISTINCT 查询使用索引的唯一方法.
- a
MATERIALIZED VIEW
with pre-aggregated data frommeta
, if your data does not change (much). - a recursive CTE emulating a loose index scan for a big
meta
table with many rows per product (relatively few distinctproduct_id
).
This is the only way I know to use an index for a DISTINCT query over the whole table.
这篇关于JOIN 后的 GROUP 或 DISTINCT 返回重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!