Generating star schema in Hive


Question

I am from the SQL data warehouse world, where I generate dimension and fact tables from a flat feed. In general data warehouse projects we divide the feed into fact and dimension tables. Ex:

I am completely new to Hadoop, and I came to know that I can build a data warehouse in Hive. Now, I am familiar with using GUIDs, which I think are applicable as primary keys in Hive. So, is the strategy below the right way to load facts and dimensions in Hive?

  1. Load source data into a Hive table; let's say Sales_Data_Warehouse
  2. Generate dimensions from Sales_Data_Warehouse; ex:

SELECT New_Guid(), Customer_Name, Customer_Address FROM Sales_Data_Warehouse
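Note that Hive has no New_Guid() built-in; the usual workaround for generating a surrogate key is to call Java's UUID generator through the reflect() UDF (newer Hive releases also ship a built-in uuid() function). A minimal sketch of step 2 under that assumption, with Customer_Dimension as a hypothetical target table:

CREATE TABLE Customer_Dimension AS
SELECT
  reflect('java.util.UUID', 'randomUUID') AS Customer_Key,  -- surrogate key
  Customer_Name,
  Customer_Address
FROM (
  -- de-duplicate the feed so each customer gets exactly one key
  SELECT DISTINCT Customer_Name, Customer_Address
  FROM Sales_Data_Warehouse
) src;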

When all dimensions are done, then load the fact table like:

SELECT New_Guid() AS Fact_Key, Customer.Customer_Key, Store.Store_Key...
FROM Sales_Data_Warehouse AS source
JOIN Customer_Dimension AS Customer
  ON source.Customer_Name = Customer.Customer_Name
  AND source.Customer_Address = Customer.Customer_Address
JOIN Store_Dimension AS Store
  ON Store.Store_Name = source.Store_Name
JOIN Product_Dimension AS Product
  ON .....

Is this the way I should load my fact and dimension tables in Hive?

Also, in general warehouse projects we need to update dimension attributes (ex: Customer_Address is changed to something else) or have to update fact table foreign keys (rarely, but it does happen). So, how can I have an INSERT-UPDATE load in Hive? (Like we do a Lookup in SSIS or a MERGE statement in T-SQL.)
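For reference, Hive itself gained a MERGE statement (from Hive 2.2, and only on ACID transactional tables), similar to the T-SQL one. A minimal sketch, assuming a hypothetical transactional Customer_Dimension and a Customer_Staging table holding changed rows:

MERGE INTO Customer_Dimension AS dim
USING Customer_Staging AS stg
ON dim.Customer_Name = stg.Customer_Name
WHEN MATCHED THEN
  -- update the attribute in place (requires a transactional table)
  UPDATE SET Customer_Address = stg.Customer_Address
WHEN NOT MATCHED THEN
  INSERT VALUES (stg.Customer_Key, stg.Customer_Name, stg.Customer_Address);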

Recommended Answer

We still get the benefits of dimensional models on Hadoop and Hive. However, some features of Hadoop require us to slightly adapt the standard approach to dimensional modelling.

The Hadoop File System is immutable. We can only add but not update data. As a result, we can only append records to dimension tables (while Hive has added an update feature and transactions, this seems to be rather buggy). Slowly Changing Dimensions on Hadoop become the default behaviour. In order to get the latest and most up-to-date record in a dimension table, we have three options. First, we can create a view that retrieves the latest record using windowing functions. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase, and federate queries across the two types of storage.
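A minimal sketch of the first option, assuming the Customer_Dimension above also carries a load_ts column recording when each version of a row was appended, with Customer_Name standing in as the natural key:

CREATE VIEW Customer_Dimension_Current AS
SELECT Customer_Key, Customer_Name, Customer_Address
FROM (
  SELECT
    Customer_Key,
    Customer_Name,
    Customer_Address,
    -- rank each customer's versions, newest first
    ROW_NUMBER() OVER (PARTITION BY Customer_Name
                       ORDER BY load_ts DESC) AS rn
  FROM Customer_Dimension
) versions
WHERE rn = 1;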

The way data is distributed across HDFS makes it expensive to join data. In a distributed relational database (MPP) we can co-locate records with the same primary and foreign keys on the same node in a cluster. This makes it relatively cheap to join very large tables: no data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS. On HDFS, tables are split into big chunks and distributed across the nodes of our cluster. We don't have any control over how individual records and their keys are spread across the cluster. As a result, joins on Hadoop between two very large tables are quite expensive, as data has to travel across the network. We should avoid joins where possible. For a large fact and dimension table, we can de-normalise the dimension table directly into the fact table. For two very large transaction tables, we can nest the records of the child table inside the parent table and flatten out the data at run time. We can use SQL extensions such as array_agg in BigQuery/Postgres etc. to handle multiple grains in a fact table.
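In HiveQL the counterpart of array_agg is collect_list (complex element types need a reasonably recent Hive version). A minimal sketch of the nesting idea, using hypothetical Orders (parent) and Order_Lines (child) tables:

-- nest each order's lines into an array of structs
CREATE TABLE Orders_Nested AS
SELECT
  o.Order_Id,
  o.Customer_Key,
  collect_list(named_struct('Product_Key', l.Product_Key,
                            'Quantity',    l.Quantity)) AS lines
FROM Orders o
JOIN Order_Lines l ON o.Order_Id = l.Order_Id
GROUP BY o.Order_Id, o.Customer_Key;

-- flatten back out at run time
SELECT n.Order_Id, line.Product_Key, line.Quantity
FROM Orders_Nested n
LATERAL VIEW explode(n.lines) t AS line;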

I would also question the usefulness of surrogate keys. Why not use the natural key? Performance for complex compound keys may be an issue, but otherwise surrogate keys are not really useful, and I never use them.
