Fact table design in Azure SQL Data Warehouse


Question


What is the best index and distribution design for relatively small fact tables (on average 30 million rows per table)? The structure of each table is similar to the following:

CREATE TABLE FactTable (
    TimeDimensionID INT NOT NULL,
    DimensionID1 VARCHAR(10) NOT NULL,
    DimensionID2 VARCHAR(10) NOT NULL,
    DimensionID3 VARCHAR(10) NOT NULL,
    DimensionID4 VARCHAR(10) NOT NULL,
    Measure1 INT,
    Measure2 FLOAT,
    Measure3 DECIMAL(10,2),
    Measure4 DECIMAL(10,2)
)

The combination of TimeDimensionID, DimensionID1, DimensionID2, DimensionID3 and DimensionID4 is unique in the fact table. Currently we have a clustered, unique primary key on these 5 columns.


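  • What is the best index and distribution for migrating these tables to Azure SQL Data Warehouse? We are considering a CLUSTERED INDEX (DimensionID1, DimensionID2, DimensionID3 and DimensionID4) for the index, with hash distribution on the TimeDimensionID field.
  • Must the CLUSTERED INDEX also include the TimeDimensionID field, even though that field is the hash distribution column?
  • Is this design correct, or should we use a COLUMNSTORE INDEX even though these tables actually have fewer than 100 million rows?
  • Should we consider using replicated tables for the fact tables?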

Recommended answer

Hi Diego,

The document "Migrate your solution to Azure SQL Data Warehouse" walks you through the migration process. Its "Migrate schema" portion is the part relevant to where you currently are, in particular the section "Specify the distribution option".

That section states the following about selecting a distribution:


  • Round-robin is the default. It is the simplest to use, and loads the data as fast as possible, but joins will require data movement which slows query performance.
  • Replicated stores a copy of the table on each Compute node. Replicated tables are performant because they do not require data movement for joins and aggregations. They do require extra storage, and therefore work best for smaller tables.
  • Hash distributed distributes the rows across all the nodes via a hash function. Hash distributed tables are the heart of SQL Data Warehouse since they are designed to provide high query performance on large tables. This option requires some planning to select the best column on which to distribute the data. However, if you don't choose the best column the first time, you can easily re-distribute the data on a different column.
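The three options above map onto the DISTRIBUTION setting in the table's WITH clause. The following is a minimal sketch rather than a recommendation for this workload: it assumes the FactTable from the question already exists in the dbo schema, and the _RoundRobin, _Replicated, _Hash and _old names, as well as the choice of DimensionID1 as the hash column, are placeholders for illustration only.

```sql
-- Round-robin: rows are spread evenly with no distribution column.
-- HEAP is one possible index choice here, not a recommendation.
CREATE TABLE dbo.FactTable_RoundRobin
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS SELECT * FROM dbo.FactTable;

-- Replicated: a full copy of the table on each Compute node.
CREATE TABLE dbo.FactTable_Replicated
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.FactTable;

-- Hash distributed: rows are assigned by hashing a single column.
-- DimensionID1 is an arbitrary placeholder; choose the column after
-- reviewing join patterns and data skew.
CREATE TABLE dbo.FactTable_Hash
WITH (DISTRIBUTION = HASH(DimensionID1), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.FactTable;

-- Re-distributing on a different column later follows the same pattern:
-- build a new copy with CTAS, then swap the names.
RENAME OBJECT dbo.FactTable TO FactTable_old;
RENAME OBJECT dbo.FactTable_Hash TO FactTable;
```

Because CREATE TABLE AS SELECT (CTAS) writes a complete new copy of the data, each statement temporarily needs storage for a second copy of the table; with roughly 30 million rows per table that is usually a manageable cost for trying more than one layout.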

The following reference goes into more detail on the points above.

To choose the best distribution option for each table, see Distributed tables (https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute).
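Once a table has been loaded, one way to judge whether a distribution choice suits the data is to look at how evenly the rows are spread. The sketch below assumes the FactTable from the question lives in the dbo schema; it is a starting point for inspection, not a complete tuning procedure.

```sql
-- Show which distribution policy the table currently uses
-- (HASH, ROUND_ROBIN or REPLICATE).
SELECT t.name AS table_name,
       dp.distribution_policy_desc
FROM sys.tables AS t
JOIN sys.pdw_table_distribution_properties AS dp
    ON t.object_id = dp.object_id
WHERE t.name = 'FactTable';

-- Report rows and space used per distribution; a large difference in
-- row counts between distributions indicates data skew.
DBCC PDW_SHOWSPACEUSED('dbo.FactTable');
```

If the row counts differ widely between distributions, a different hash column (or round-robin) may be worth testing.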

I hope this information is useful. Please let us know if you have any additional questions. 

