Distributed by multiple columns
Question
Is there a benefit to distributing by multiple columns when creating a table? For instance:
CREATE TABLE data_facts (
    data_id int primary key,
    channel_id smallint,
    chart_id smallint,
    demo_id smallint,
    value numeric)
DISTRIBUTED BY (
    channel_id,
    chart_id,
    demo_id)
as there is a chance I will need to join data_facts with three different tables, channel, chart and demo, using channel_id, chart_id and demo_id respectively.
Specifically,
- Should I always add a distribution clause and include all the id(s) that I'm using for joins, in terms of efficiency?
- If so, does the order of these id(s) matter?
- How does this work on an architecture level? (optional)
Thanks!
Recommended answer
It depends on how finely you want to shard the database and on how few records you want in each partition: if you add more than one column to the distribution, the data is fragmented much more finely across partitions.
It also depends on whether you shard by modulo or by hash.
However, in my opinion, if you have a multi-column primary key and you want to shard by that primary key, distributing by multiple columns (all the columns in the primary key) can make sense. Otherwise, you should shard by a single column, which is enough in most cases.
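
The single-column advice might look like the following. This is only a sketch under assumptions the question does not confirm: Greenplum-style `DISTRIBUTED BY` / `DISTRIBUTED REPLICATED` syntax with hash distribution. The `name` columns are hypothetical, `channel_id` is an arbitrary pick among the three join keys, and the primary key is left off `data_facts` because, in Greenplum at least, a primary key has to contain every distribution column.

```sql
-- Assumption: Greenplum-style MPP database with hash distribution.
-- Fact table distributed on ONE join key (the choice of channel_id is arbitrary).
-- The primary key on data_id is omitted: in Greenplum it would have to
-- include the distribution column.
CREATE TABLE data_facts (
    data_id    int,
    channel_id smallint,
    chart_id   smallint,
    demo_id    smallint,
    value      numeric
)
DISTRIBUTED BY (channel_id);

-- Dimension table on the same key and type, so matching rows of
-- data_facts and channel are stored on the same segment.
CREATE TABLE channel (
    channel_id smallint,
    name       text        -- hypothetical column
)
DISTRIBUTED BY (channel_id);

-- Small dimension tables can instead be copied to every segment
-- (DISTRIBUTED REPLICATED is Greenplum 6+ syntax; an assumption here).
CREATE TABLE chart (
    chart_id smallint,
    name     text          -- hypothetical column
)
DISTRIBUTED REPLICATED;

-- Check how evenly the rows are spread; gp_segment_id is a
-- Greenplum system column.
SELECT gp_segment_id, count(*) AS row_count
FROM data_facts
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```

With hash distribution the segment is chosen from the combined value of all listed columns, so a three-column key would not place rows that share a `channel_id` together, and a join on `channel_id` alone would still have to move data between segments. If the row counts per segment come out badly skewed, a higher-cardinality column may be a better distribution key.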