JOIN 后的 GROUP 或 DISTINCT 返回重复项 [英] GROUP or DISTINCT after JOIN returns duplicates

查看:21
本文介绍了JOIN 后的 GROUP 或 DISTINCT 返回重复项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个表,productsmeta.它们是 1:N 关系,其中每个产品行通过外键至少有一个元行.

I have two tables, products and meta. They are in relation 1:N where each product row has at least one meta row via foreign key.

(即 SQLfiddle:http://sqlfiddle.com/#!15/c8f34/1)

(viz. SQLfiddle: http://sqlfiddle.com/#!15/c8f34/1)

我需要加入这两个表,但我只需要过滤独特的产品.当我尝试此查询时,一切正常(返回 4 行):

I need to join these two tables but i need to filter only unique products. When I try this query, everything is ok (4 rows returned):

SELECT DISTINCT(product_id)
FROM meta JOIN products ON products.id = meta.product_id

但是当我尝试选择所有列时,DISTINCT 规则不再适用于结果,因为返回的是 8 行而不是 4 行.

but when I try to select all columns the DISTINCT rule no longer applies to results, as 8 rows instead of 4 is returned.

SELECT DISTINCT(product_id), *
FROM meta JOIN products ON products.id = meta.product_id

我尝试了很多方法,比如尝试在子查询上使用 DISTINCTGROUP BY 但总是得到相同的结果.

I have tried many approaches like trying to DISTINCT or GROUP BY on sub-query but always with same result.

推荐答案

虽然从表中检索所有或大部分行,但此类查询的最快方法通常是首先聚合/消除歧义稍后加入:

While retrieving all or most rows from a table, the fastest way for this type of query typically is to aggregate / disambiguate first and join later:

SELECT *
FROM   products p
JOIN  (
   SELECT DISTINCT ON (product_id) *
   FROM   meta
   ORDER  BY product_id, id DESC
   ) m ON m.product_id = p.id;

products 中每行 meta 中的行数越多,对性能的影响就越大.

The more rows in meta per row in products, the bigger the impact on performance.

当然,您需要在子查询中添加一个 ORDER BY 子句来定义 行以从子查询中的每个集合中选择.@Craig 和 @Clodoaldo 已经告诉过你了.我正在返回具有最高 idmeta 行.

Of course, you'll want to add an ORDER BY clause in the subquery do define which row to pick form each set in the subquery. @Craig and @Clodoaldo already told you about that. I am returning the meta row with the highest id.

SQL 小提琴.

DISTINCT ON 的详细信息:

不过,这并不总是最快的解决方案.根据数据分布,有各种其他查询样式.对于这个涉及另一个连接的简单案例,这个在大表测试中的运行速度要快得多:

Still, this is not always the fastest solution. Depending on data distribution there are various other query styles. For this simple case involving another join, this one ran considerably faster in a test with big tables:

SELECT p.*, sub.meta_id, m.product_id, m.price, m.flag
FROM  (
   SELECT product_id, max(id) AS meta_id
   FROM   meta
   GROUP  BY 1
   ) sub
JOIN meta     m ON m.id = sub.meta_id
JOIN products p ON p.id = sub.product_id;

如果您不使用非描述性的 id 作为列名,我们就不会遇到命名冲突,只需编写 SELECT p.*, m.*.(我从不使用 id 作为列名.)

If you wouldn't use the non-descriptive id as column names, we would not run into naming collisions and could simply write SELECT p.*, m.*. (I never use id as column name.)

如果性能是您的首要要求,请考虑更多选择:

If performance is your paramount requirement, consider more options:

  • a 实体化视图 使用来自 meta 的预聚合数据,如果您的数据没有变化(很多).
  • 一个递归 CTE 模拟 松散索引扫描big meta 表,每个产品有 许多 行(相对较少的不同 product_id).
    这是我所知道的对整个表的 DISTINCT 查询使用索引的唯一方法.
  • a MATERIALIZED VIEW with pre-aggregated data from meta, if your data does not change (much).
  • a recursive CTE emulating a loose index scan for a big meta table with many rows per product (relatively few distinct product_id).
    This is the only way I know to use an index for a DISTINCT query over the whole table.

这篇关于JOIN 后的 GROUP 或 DISTINCT 返回重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆