Distributed by multiple columns

Question

Is there any benefit to distributing a table by multiple columns when creating it? For instance:

CREATE TABLE data_facts (
    data_id int primary key,
    channel_id smallint,
    chart_id smallint,
    demo_id smallint,
    value numeric)
DISTRIBUTED BY (
    channel_id,
    chart_id,
    demo_id)

I ask because I may need to join data_facts with three different tables, channel, chart and demo, using channel_id, chart_id and demo_id respectively.

Specifically:

  1. For efficiency, should I always add a distribution clause and include every id that I use for joining?
  2. If so, does the order of these ids matter?
  3. How does this work at the architecture level? (optional)

Thanks!

Recommended answer

It depends on how heavily you want to shard the database and how few records you want in each partition. If you add more than one column to the distribution, you fragment the data much more finely, into more partitions.

It also depends on whether you shard by modulo or by hash.
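To make the modulo versus hash distinction concrete, here is a minimal Python sketch. It is only an illustration: the function names, the four-shard setup and the use of MD5 as a stand-in hash are assumptions, not how any particular database implements distribution.

```python
import hashlib


def shard_by_modulo(key: int, num_shards: int) -> int:
    """Pick a shard directly from the integer key."""
    return key % num_shards


def shard_by_hash(key: int, num_shards: int) -> int:
    """Pick a shard from a stable hash of the key (MD5 is only a stand-in)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def count_per_shard(keys, shard_fn, num_shards):
    """Count how many keys land on each shard."""
    counts = [0] * num_shards
    for key in keys:
        counts[shard_fn(key, num_shards)] += 1
    return counts


NUM_SHARDS = 4
# Hypothetical ids that all happen to be multiples of 4.
keys = list(range(0, 4000, 4))

modulo_counts = count_per_shard(keys, shard_by_modulo, NUM_SHARDS)
hash_counts = count_per_shard(keys, shard_by_hash, NUM_SHARDS)
print("modulo:", modulo_counts)
print("hash:  ", hash_counts)
```

With these made-up ids, modulo places every row on shard 0 because each key is a multiple of the shard count, while the hash spreads the same keys across all four shards. Real key patterns are rarely this extreme, but the sketch shows why the sharding function matters as much as the choice of columns.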

However, in my opinion, if you have a multi-column primary key and you want to shard by that primary key, distributing by multiple columns (all the columns in the primary key) can make sense. Otherwise you should shard by a single column, which is enough in most cases.
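As a rough illustration of the architecture question, the sketch below assumes the database picks a segment by hashing the combined values of the distribution columns. That is an assumption: the thread does not name the system or its hash function, and `segment_for`, the four segments and the generated rows are all hypothetical.

```python
import hashlib
from itertools import product


def segment_for(values, num_segments):
    """Map a tuple of distribution-key values to a segment number.

    MD5 is a stand-in; a real database uses its own internal hash.
    """
    raw = "|".join(str(v) for v in values).encode("utf-8")
    return int.from_bytes(hashlib.md5(raw).digest()[:8], "big") % num_segments


NUM_SEGMENTS = 4
# Hypothetical fact rows: (channel_id, chart_id, demo_id).
rows = list(product(range(1, 4), range(1, 11), range(1, 11)))

# Segments that hold rows for channel_id = 1 under each distribution key.
by_three_columns = {segment_for(r, NUM_SEGMENTS) for r in rows if r[0] == 1}
by_channel_only = {segment_for((r[0],), NUM_SEGMENTS) for r in rows if r[0] == 1}

print("DISTRIBUTED BY (channel_id, chart_id, demo_id):", sorted(by_three_columns))
print("DISTRIBUTED BY (channel_id):", sorted(by_channel_only))
```

Under this model, all rows for one channel_id sit on a single segment only when channel_id alone is the distribution key. If the channel table were distributed the same way, a join on channel_id could match rows segment by segment. With the three-column key, rows for the same channel_id are spread over several segments, so a join on just one of the three columns does not get that locality.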
