在Datawarehouse中处理null [英] Handling nulls in Datawarehouse

查看:113
本文介绍了在Datawarehouse中处理null的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想问您有关与数据仓库和SSIS / SSAS有关的处理空或空数据值的最佳做法是什么。

I'd like to ask your input on what the best practice is for handling null or empty data values when it pertains to data warehousing and SSIS/SSAS.

我有几个事实和维度表,这些表在不同的行中包含空值。

I have several fact and dimension tables that contain null values in different rows.

特殊情况:

1)什么是处理空日期/时间值的最佳方法?我应该在时间或日期维度中打上默认行,并在找到空值时将SSIS指向默认行吗?

1) What is the best way to handle null date/times values? Should I make a 'default' row in my time or date dimensions and point SSIS to the default row when there is a null found?

2)处理维数据中的空值/空值的最佳方法是什么。例如:帐户维度中有一些行,在帐户名称列中具有空值(非NULL)。我应该将列内的这些空值或空值转换为特定的默认值吗?

2) What is the best way to handle nulls/empty values inside of dimension data. Ex: I have some rows in an 'Accounts' dimensions that have empty (not NULL) values in the Account Name column. Should I convert these empty or null values inside the column to a specific default value?

3)与上面的第1点类似,我应该做什么?如果我最终出现在维度列之一中没有记录的Facttable行中,该怎么办?如果发生这种情况,是否需要每个维度的默认维度记录?

3) Similar to point 1 above - What should I do if I end up with a Facttable row that has no record in one of the dimension columns? Do I need default dimension records for each dimension in case this happens?

4)关于如何处理这些操作的任何建议或提示在Sql服务器集成服务(SSIS)中?最好使用最佳数据流配置或最佳转换对象。

4) Any suggestion or tips in regards to how to handle these operation in Sql server integration services (SSIS)? Best data flow configurations or best transformation objects to use would be helpful.

谢谢:-)

推荐答案

如先前的回答所述,维度的Null值可以附加许多不同的含义,未知,不适用,未知等。如果可以在应用程序中区分它们,这很有用。

As the previous answer states there can be many different meanings attached to Null values for a dimension, unknown, not applicable, unknown etc. If it is useful to be able to distinguish between them in your application adding "pseudo" dimension entries can help.

无论如何,我都避免使用Null事实外键或维度字段,即使只有一个未知维度值也将有所帮助您的用户定义的查询包含数据质量不是100%(而且永远不是)的包罗万象的分组。

In any case I would avoid having either Null fact foreign keys or dimension fields, having even a single 'unknown' dimension value will help your users define queries that include a catch-all grouping where the data quality isn't 100% (and it never is).

我曾经做过的一个非常简单的技巧为此使用并且尚未咬我的方法是使用T-sql中的int IDENTITY(1,1)定义我的尺寸替代键(从1开始,每行递增1)。伪密钥(不可用,未分配,不适用)定义为负整数,并由ETL流程开始时运行的存储过程填充。

One very simple trick I've been using for this and hasn't bitten me yet is to define my dimensions surrogate keys using int IDENTITY(1,1) in T-sql (start at 1 and increment by 1 per row). Pseudo keys ("Unavailable", "Unassigned", "Not applicable") are defined as negative ints and populated by a stored procedure ran at the beginning of the ETL process.

例如,以


    CREATE TABLE [dbo].[Location]
    (
        [LocationSK] [int] IDENTITY(1,1) NOT NULL,
        [Name] [varchar](50) NOT NULL,
        [Abbreviation] [varchar](4) NOT NULL,
        [LocationBK] [int] NOT NULL,
        [EffectiveFromDate] [datetime] NOT NULL,
        [EffectiveToDate] [datetime] NULL,
        [Type1Checksum] [int] NOT NULL,
        [Type2Checksum] [int] NOT NULL,
    ) ON [PRIMARY]

然后用表b

And a stored procedure populating the table with


Insert Into dbo.Location (LocationSK, Name, Abbreviation, LocationBK, 
                      EffectiveFromDate,  Type1Checksum, Type2Checksum)
            Values (-1, 'Unknown location', 'Unk', -1, '1900-01-01', 0,0)



<我已经制定了一个规则,即每个维度至少要有一个这样的伪行,以用于维度查找失败的情况,并建立异常报告以跟踪分配给此类行的事实数量。

I have made it a rule to have at least one such pseudo row per dimension which is used in cases where the dimension lookup fails and to build exception reports to track the number of facts which are assigned to such rows.

这篇关于在Datawarehouse中处理null的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆