Redshift SELECT DISTINCT returns repeated values

Question

I have a database where each object property is stored in a separate row. The query below does not return distinct values in a Redshift database, but works as expected when tested in any MySQL-compatible database.

SELECT DISTINCT distinct_value 
FROM
( 
  SELECT
    uri,
    ( SELECT DISTINCT value_string 
      FROM `test_organization__app__testsegment` AS X 
      WHERE X.uri = parent.uri AND name = 'hasTestString' AND parent.value_string IS NOT NULL ) AS distinct_value 
  FROM `test_organization__app__testsegment` AS parent 
  WHERE     
    uri IN ( SELECT uri 
             FROM `test_organization__app__testsegment` 
             WHERE name = 'types' AND value_uri_multivalue = 'Document'
           )
) AS T 
WHERE distinct_value IS NOT NULL
ORDER BY distinct_value ASC
LIMIT 10000 OFFSET 0

Recommended answer

This is not a bug. The behavior is intentional, though not obvious.

In Redshift you can declare constraints on tables, but Redshift does not enforce them: duplicate values are accepted if you insert them. The declaration still affects how queries run. When you run a SELECT DISTINCT query against a column with no primary key declared, Redshift scans the whole column and returns the unique values. When you run the same query against a column that has a primary key constraint, it returns the values as they are, without filtering them down to a unique list. So if duplicates were inserted into such a column, SELECT DISTINCT can return them.
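
The behavior described above can be sketched as follows. This is a minimal illustration, not taken from the original question: the table `demo_items` and its columns are hypothetical, and whether the final query actually returns duplicates depends on the plan Redshift chooses.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE demo_items (
    id    INTEGER PRIMARY KEY,  -- declared, but not enforced by Redshift
    label VARCHAR(32)
);

-- Redshift accepts both rows even though they share the same id.
INSERT INTO demo_items VALUES (1, 'first'), (1, 'second');

-- The planner may trust the declared key and skip de-duplication,
-- in which case id = 1 can appear more than once in the result.
SELECT DISTINCT id FROM demo_items;
```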

Why is it done this way? Redshift is optimized for large datasets, and copying data is much faster when constraint validity does not have to be checked for every row that is copied or inserted. You can still declare a primary key constraint as part of your data model, but you must uphold it yourself, either by removing duplicates or by designing your ETL so that none are produced.
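
One possible way to remove duplicates yourself is sketched below. It assumes a hypothetical table `demo_items` with columns `id` (the declared key) and `label`. The choice of which row to keep (here, the first by `label`) is arbitrary and would depend on your data.

```sql
-- Find key values that appear more than once.
SELECT id, COUNT(*) AS copies
FROM demo_items
GROUP BY id
HAVING COUNT(*) > 1;

-- Build a de-duplicated copy, keeping one row per id.
-- demo_items_dedup is a hypothetical name for the cleaned table.
CREATE TABLE demo_items_dedup AS
SELECT id, label
FROM (
    SELECT id,
           label,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY label) AS rn
    FROM demo_items
) AS ranked
WHERE rn = 1;
```

After checking the result, the cleaned table could replace the original, or the same ROW_NUMBER filter could be applied in the load step so that duplicates never reach the target table.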

More information, with specific examples, is in the Heap blog post "Redshift Pitfalls And How To Avoid Them".
