Making a table with fixed columns versus key-value pairs of metadata?


Question

I was asked to create a table to store paid-hours data from multiple attendance systems, across multiple geographies and multiple sub-companies. The table would be used for high-level reporting, so it skips the step of creating tables for each system (which might already exist) and goes directly to what the final product would be.

The request was to have a dimension for each type of hours or pay, like this:

date       | employee_id | type          | hours | amount
2016-04-22 | abc123      | regular       | 80    | 3500
2016-04-22 | abc123      | overtime      | 6     | 200
2016-04-22 | abc123      | adjustment    | 1     | 13
2016-04-22 | abc123      | paid time off | 24    | 100
2016-04-22 | abc123      | commission    |       | 600
2016-04-22 | abc123      | gross total   |       | 4413

There are multiple rows per employee, but the thought process is that this will allow us to capture new dimensions if they are added.

The data comes from several sources, and I was told not to worry about the ETL, but just to design the ultimate table and make it work for any system. We would provide this format to other people for them to fill in.

I have only seen the raw data from one system, and it looks like this:

date | employee_id | gross_total_amount | regular_hours | regular_amount | OT_hours  | OT_amount | classification | amount | hours 

It is pretty messy. There are multiple rows per employee, and values like gross_total_amount repeat on every row. There is a classification column with items like PTO (paid time off), adjustments, empty values, commission, etc. Because of the repeating values, it is impossible to simply sum the data to make it equal the gross_total_amount.

Anyway, I would prefer a column-based approach where each row describes an employee's paid hours for a cut-off. One problem is that I won't know all of the possible types of hours, so I can't necessarily make a table like:

date | employee_id | gross_total_amount | commission_amount | regular_hours | regular_amount | overtime_hours | overtime_amount | paid_time_off_hours | paid_time_off_amount | holiday_hours | holiday_amount

I am more used to data formatted that way, though. The concern is that you might not capture all of the necessary columns, or that something new is added later. (For example, I know there is maternity leave, paternity leave and bereavement leave, and in other geographies there are labor laws about working at night, etc.)
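One way to reconcile the two layouts, sketched here under assumptions, is to store the row-per-type format and derive the familiar column layout at reporting time with conditional aggregation. The table name `paid_hours` and the use of SQLite through Python's `sqlite3` module are illustrative choices, not something the question specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE paid_hours (          -- hypothetical name for the row-per-type table
        date        TEXT NOT NULL,
        employee_id TEXT NOT NULL,
        type        TEXT NOT NULL,
        hours       REAL,
        amount      REAL
    )
    """
)
conn.executemany(
    "INSERT INTO paid_hours VALUES (?, ?, ?, ?, ?)",
    [
        ("2016-04-22", "abc123", "regular", 80, 3500),
        ("2016-04-22", "abc123", "overtime", 6, 200),
        ("2016-04-22", "abc123", "paid time off", 24, 100),
    ],
)

# Conditional aggregation: one output row per employee and cut-off date.
# A type with no rows (here 'holiday') simply comes back as NULL.
wide = conn.execute(
    """
    SELECT date,
           employee_id,
           SUM(CASE WHEN type = 'regular'       THEN hours END) AS regular_hours,
           SUM(CASE WHEN type = 'overtime'      THEN hours END) AS overtime_hours,
           SUM(CASE WHEN type = 'paid time off' THEN hours END) AS paid_time_off_hours,
           SUM(CASE WHEN type = 'holiday'       THEN hours END) AS holiday_hours
    FROM paid_hours
    GROUP BY date, employee_id
    """
).fetchall()
print(wide)
```

The report query still has to name each type it wants as a column, so a new type shows up in reports only once someone adds it there.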

Any advice? Is the table that my superior suggested a viable solution?

Recommended answer

Let me recapitulate what I understand to be the basic task.

You get data from different sources with different structures. Your task is to consolidate them in a single database so that you can answer questions about all of this data. I understand the hint "not to worry about the ETL, but just design the ultimate table" to mean that your consolidated database doesn't need to contain every detail that might be present in the original data, only enough information to fulfill the specific requirements for the consolidated database.

This sounds sensible as long as your superior is certain enough about these requirements. In that case, you will reduce the information coming from each source to the consolidated structure.

In any case, you'll have to capture the domain semantics of the data coming in from each source. Lacking access to your domain semantics, I can't clarify the mess of repeating values etc. for you. For example, if there are detail records and gross total records, as in your example, it would be wrong to add up all records, as this would always yield twice the actual value. So someone will have to worry about ETL, namely interpreting each set of records (probably all entries for one employee and one working day), finding out what they mean, and transforming them to the consolidated structure.
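The double-counting problem can be reproduced with the rows from the question's example table. The following is a minimal sketch, assuming SQLite via Python's `sqlite3` module and a hypothetical table name `paid_hours`; it only shows that a total row stored next to its detail rows has to be excluded from any sum.

```python
import sqlite3

rows = [
    ("2016-04-22", "abc123", "regular", 80, 3500),
    ("2016-04-22", "abc123", "overtime", 6, 200),
    ("2016-04-22", "abc123", "adjustment", 1, 13),
    ("2016-04-22", "abc123", "paid time off", 24, 100),
    ("2016-04-22", "abc123", "commission", None, 600),
    ("2016-04-22", "abc123", "gross total", None, 4413),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE paid_hours (          -- hypothetical table name
        date        TEXT NOT NULL,
        employee_id TEXT NOT NULL,
        type        TEXT NOT NULL,
        hours       REAL,
        amount      REAL NOT NULL,
        PRIMARY KEY (date, employee_id, type)
    )
    """
)
conn.executemany("INSERT INTO paid_hours VALUES (?, ?, ?, ?, ?)", rows)

# Naive sum: counts the detail rows and the 'gross total' row together.
naive_total = conn.execute("SELECT SUM(amount) FROM paid_hours").fetchone()[0]

# Correct sum: detail rows only.
detail_total = conn.execute(
    "SELECT SUM(amount) FROM paid_hours WHERE type <> 'gross total'"
).fetchone()[0]

# The stored total, for comparison.
gross_total = conn.execute(
    "SELECT amount FROM paid_hours WHERE type = 'gross total'"
).fetchone()[0]

print(naive_total, detail_total, gross_total)
```

Whether the total should be stored at all, or always derived from the detail rows, is a design decision this sketch leaves open.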

I understand another part of the question to be about the use of metadata. You can have separate columns for notions like holiday leave and maternity leave, or you can have a metadata table containing these notions as key-value pairs and refer to the key from your main table. The metadata approach is sometimes praised as more flexible, because you can introduce a new type (like paternity leave) without redesigning your database. However, you will still need to change the software that fills your tables, and probably also the software that queries them, to make use of the new type. So you'll have to develop and deploy a new software release anyway, and adding a few columns to a table will just be part of that development effort.
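To make the comparison concrete, here is a minimal sketch of both variants, assuming SQLite via Python's `sqlite3` module. The table and column names (`hour_type`, `type_key`, `paid_hours_wide`) and the `night_shift` type are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE hour_type (           -- the metadata (key-value) table
        type_key    TEXT PRIMARY KEY,
        description TEXT NOT NULL
    );
    CREATE TABLE paid_hours (          -- the main table refers to the key
        date        TEXT NOT NULL,
        employee_id TEXT NOT NULL,
        type_key    TEXT NOT NULL REFERENCES hour_type (type_key),
        hours       REAL,
        amount      REAL,
        PRIMARY KEY (date, employee_id, type_key)
    );
    INSERT INTO hour_type VALUES ('regular', 'Regular hours'), ('overtime', 'Overtime hours');
    """
)

# Metadata approach: a new type is a data change, not a schema change.
conn.execute("INSERT INTO hour_type VALUES ('paternity_leave', 'Paternity leave')")
conn.execute(
    "INSERT INTO paid_hours VALUES ('2016-04-22', 'abc123', 'paternity_leave', 8, 300)"
)

# A type that was never registered is rejected by the foreign key.
try:
    conn.execute(
        "INSERT INTO paid_hours VALUES ('2016-04-22', 'abc123', 'night_shift', 8, 400)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Fixed-column approach: the same new type needs a schema change.
conn.execute(
    "CREATE TABLE paid_hours_wide (date TEXT, employee_id TEXT, regular_hours REAL)"
)
conn.execute("ALTER TABLE paid_hours_wide ADD COLUMN paternity_leave_hours REAL")
wide_columns = [row[1] for row in conn.execute("PRAGMA table_info(paid_hours_wide)")]

stored_rows = conn.execute("SELECT COUNT(*) FROM paid_hours").fetchone()[0]
print(rejected, wide_columns, stored_rows)
```

In both variants, the loading and reporting code still has to learn about the new type, which is the point made above.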

There is one major difference between a wide table containing all notions as attributes and the metadata approach. If you want to make sure that, for a time period, either all or none of the values are present, that's easy with the wide table: just declare all attributes `NOT NULL`, and you're done. Ensuring this for the metadata solution would require a rather complicated constraint that may or may not be available, depending on the database system you use.
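A minimal sketch of that difference, assuming SQLite via Python's `sqlite3` module and hypothetical table names (`paid_hours_wide`, `paid_hours_kv`). In the wide table the database rejects an incomplete row by itself. In the key-value table the sketch falls back on a check query that has to be run separately, which is one possible workaround and not a declarative constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Wide table: completeness is declared once, in the schema.
conn.execute(
    """
    CREATE TABLE paid_hours_wide (
        date            TEXT NOT NULL,
        employee_id     TEXT NOT NULL,
        regular_hours   REAL NOT NULL,
        regular_amount  REAL NOT NULL,
        overtime_hours  REAL NOT NULL,
        overtime_amount REAL NOT NULL,
        PRIMARY KEY (date, employee_id)
    )
    """
)
conn.execute(
    "INSERT INTO paid_hours_wide VALUES ('2016-04-22', 'abc123', 80, 3500, 6, 200)"
)
try:
    # Overtime values are missing, so the row is refused.
    conn.execute(
        "INSERT INTO paid_hours_wide VALUES ('2016-04-22', 'def456', 80, 3500, NULL, NULL)"
    )
    incomplete_rejected = False
except sqlite3.IntegrityError:
    incomplete_rejected = True

# Key-value table: a missing type is just an absent row, so nothing is refused.
conn.execute(
    "CREATE TABLE paid_hours_kv (date TEXT, employee_id TEXT, type TEXT, hours REAL)"
)
conn.executemany(
    "INSERT INTO paid_hours_kv VALUES (?, ?, ?, ?)",
    [
        ("2016-04-22", "abc123", "regular", 80),
        ("2016-04-22", "abc123", "overtime", 6),
        ("2016-04-22", "def456", "regular", 80),  # no overtime row for def456
    ],
)

# Completeness has to be checked after the fact.
incomplete = conn.execute(
    """
    SELECT date, employee_id
    FROM paid_hours_kv
    WHERE type IN ('regular', 'overtime')
    GROUP BY date, employee_id
    HAVING COUNT(DISTINCT type) < 2
    """
).fetchall()
print(incomplete_rejected, incomplete)
```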

If that's not a main requirement, I would be pragmatic: use separate columns if I expect only a handful of those types, and a separate key-value table otherwise.

All these considerations rely on your superior's assertion (as I understand it) that your consolidated table only needs to fulfill the requirements known today, so you are free to throw away original detail information that these requirements don't need. I'm wary of that kind of assertion. Let's assume some of your information sources deliver additional information. Then it's quite probable that someday someone will ask for a report that also contains this information, where present. This won't be possible if your data structure only contains what's needed today.

There are two ways to handle this, i.e. to provide for future needs. First, once you know the data coming from each additional source, you can extend your consolidated database to cover all data structures coming from there. This requires some effort, as different sources might express the same concept using different data, and you would have to consolidate those to make the data comparable. There is also some probability that part of this effort will be wasted, as not all of the detail information you get will actually be needed in your consolidated database. A second, more elegant way would therefore be to keep the original data that you import from each source and, only when a concrete new requirement arises, extend your database and reimport the data from the sources to cover the additional details. With storage as cheap as it is, this might yield an optimal cost-benefit ratio.
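The second approach could look roughly like the following sketch: keep each imported record unmodified and rebuild the consolidated table from it whenever the mapping grows. It assumes SQLite via Python's `sqlite3` module. The `raw_import` table, the `load` helper, the `system_a` source name and the `night_hours` field are all hypothetical. Only `regular_hours` and `OT_hours` come from the raw layout shown in the question.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_import (
        source_system TEXT NOT NULL,
        payload       TEXT NOT NULL    -- original record as JSON text, unmodified
    );
    CREATE TABLE paid_hours (
        date TEXT, employee_id TEXT, type TEXT, hours REAL
    );
    """
)

# One raw record; 'night_hours' is an invented extra detail not needed today.
raw_records = [
    {"date": "2016-04-22", "employee_id": "abc123",
     "regular_hours": 80, "OT_hours": 6, "night_hours": 4},
]
conn.executemany(
    "INSERT INTO raw_import VALUES (?, ?)",
    [("system_a", json.dumps(record)) for record in raw_records],
)


def load(column_to_type):
    """Rebuild paid_hours from the stored raw records using the given mapping."""
    conn.execute("DELETE FROM paid_hours")
    payloads = conn.execute("SELECT payload FROM raw_import").fetchall()
    for (payload,) in payloads:
        record = json.loads(payload)
        for column, hour_type in column_to_type.items():
            if record.get(column) is not None:
                conn.execute(
                    "INSERT INTO paid_hours VALUES (?, ?, ?, ?)",
                    (record["date"], record["employee_id"], hour_type, record[column]),
                )


# Today's requirement: regular and overtime hours only.
load({"regular_hours": "regular", "OT_hours": "overtime"})
initial_types = sorted(t for (t,) in conn.execute("SELECT type FROM paid_hours"))

# Later requirement: night work is needed too, so extend the mapping and reload.
load({"regular_hours": "regular", "OT_hours": "overtime", "night_hours": "night"})
extended_types = sorted(t for (t,) in conn.execute("SELECT type FROM paid_hours"))

print(initial_types, extended_types)
```

Because the raw records were never discarded, the later requirement is met by changing the mapping and reloading, without going back to the source system.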
