Using Pentaho Kettle, how do I load multiple tables from a single table while keeping referential integrity?


Problem description

I need to load data from a single file with 100,000+ records into multiple tables on MySQL while maintaining the relationships defined in the file/tables, meaning the relationships already match. The solution should work on the latest version of MySQL and needs to use the InnoDB engine; MyISAM does not support foreign keys.

I am completely new to using Pentaho Data Integration (aka Kettle), and any pointers would be appreciated.

I might add that it is a requirement that the foreign key constraints are NOT disabled. It's my understanding that if something is wrong with the database's referential integrity, MySQL will not check for it when the foreign key checks are turned back on. SOURCE: 5.1.4. Server System Variables -- foreign_key_checks
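To illustrate the concern, here is a minimal sketch of the same behaviour. It is an assumption-laden stand-in: it uses Python's built-in sqlite3 module rather than MySQL, with SQLite's `PRAGMA foreign_keys` playing the role of `foreign_key_checks`, and the function name `demo_fk_checks` is made up for this example. An orphan row inserted while checks are off survives after checks are switched back on; only new statements are validated.

```python
import sqlite3


def demo_fk_checks():
    """Show that re-enabling FK enforcement does not validate existing rows.

    Assumption: SQLite is used here as a stand-in for MySQL/InnoDB.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE Office (id INTEGER PRIMARY KEY, office_number INTEGER NOT NULL);
        CREATE TABLE Employee2Office (
            employee_id INTEGER NOT NULL REFERENCES Employee(id),
            office_id INTEGER NOT NULL REFERENCES Office(id),
            PRIMARY KEY (employee_id, office_id)
        );
    """)

    # Checks off: an orphan row (no such employee or office) is accepted.
    conn.execute("PRAGMA foreign_keys = OFF")
    conn.execute("INSERT INTO Employee2Office VALUES (99, 99)")
    conn.commit()  # the pragma below is ignored inside an open transaction

    # Checks back on: the orphan row is still there, nothing re-validates it.
    conn.execute("PRAGMA foreign_keys = ON")
    surviving = conn.execute("SELECT COUNT(*) FROM Employee2Office").fetchone()[0]

    # New orphan rows, however, are rejected from now on.
    try:
        conn.execute("INSERT INTO Employee2Office VALUES (98, 98)")
        new_orphan_rejected = False
    except sqlite3.IntegrityError:
        new_orphan_rejected = True

    # An explicit scan is needed to find the pre-existing violations.
    violations = conn.execute("PRAGMA foreign_key_check").fetchall()
    conn.close()
    return surviving, new_orphan_rejected, len(violations)
```

This is why the load has to run with the constraints enabled throughout: a bad row that slips in while they are off stays in the table until something scans for it explicitly.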

All approaches should include some form of validation and a rollback strategy should an insert fail, or fail to maintain referential integrity.

Again, I'm completely new to this and doing my best to provide as much information as possible. If you have any questions or requests for clarification, just let me know.

If you are able to post the XML from the kjb and ktr files (jobs/transformations), that would be SUPER. I might even hunt down every comment/answer you've ever made anywhere and upvote them... :-) ...really, it's important to me to find an answer for this.

Thanks!

SAMPLE DATA: To elaborate with an example, let's assume I am trying to load a tab-separated file containing the employee name, the offices they have occupied in the past, and their job title history.

File:

EmployeeName<tab>OfficeHistory<tab>JobLevelHistory
John Smith<tab>501<tab>Engineer
John Smith<tab>601<tab>Senior Engineer
John Smith<tab>701<tab>Manager
Alex Button<tab>601<tab>Senior Assistant
Alex Button<tab>454<tab>Manager

NOTE: The single-table database is completely normalized (as much as a single table may be). For example, in the case of "John Smith" there is only one John Smith, meaning there are no duplicates that would lead to conflicts in referential integrity.

The MyOffice database schema has the following tables:

Employee (nId, name)
Office (nId, number)
JobTitle (nId, titleName)
Employee2Office (nEmpID, nOfficeId)
Employee2JobTitle (nEmpId, nJobTitleID)

So in this case, the tables should look like:

Employee
1 John Smith
2 Alex Button

Office
1 501
2 601
3 701
4 454

JobTitle
1 Engineer
2 Senior Engineer
3 Manager
4 Senior Assistant

Employee2Office
1 1
1 2
1 3
2 2
2 4

Employee2JobTitle
1 1
1 2
1 3
2 4
2 3
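The mapping from the sample file to the tables above can be sketched in plain Python. This is only an illustration of the expected result, not part of any Kettle transformation; the `normalize` function and the `SAMPLE` constant are names made up for this example, and ids are assumed to be assigned in first-seen order starting at 1.

```python
import csv
import io

# The sample file from the question, tab-separated.
SAMPLE = (
    "EmployeeName\tOfficeHistory\tJobLevelHistory\n"
    "John Smith\t501\tEngineer\n"
    "John Smith\t601\tSenior Engineer\n"
    "John Smith\t701\tManager\n"
    "Alex Button\t601\tSenior Assistant\n"
    "Alex Button\t454\tManager\n"
)


def normalize(text):
    """Assign ids in first-seen order and build the two link tables."""
    employees, offices, titles = {}, {}, {}
    emp2office, emp2title = [], []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        # setdefault returns the existing id, or stores and returns the next one.
        emp_id = employees.setdefault(row["EmployeeName"], len(employees) + 1)
        office_id = offices.setdefault(row["OfficeHistory"], len(offices) + 1)
        title_id = titles.setdefault(row["JobLevelHistory"], len(titles) + 1)
        # Skip pairs already recorded, mirroring the composite primary keys.
        if (emp_id, office_id) not in emp2office:
            emp2office.append((emp_id, office_id))
        if (emp_id, title_id) not in emp2title:
            emp2title.append((emp_id, title_id))
    return employees, offices, titles, emp2office, emp2title
```

Calling `normalize(SAMPLE)` yields the same ids and pairs as the tables listed above, in the same order.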

Here is the MySQL DDL used to create the database and tables:

create database MyOffice2;

use MyOffice2;

CREATE TABLE Employee (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  name CHAR(50) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE Office (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  office_number INT NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE JobTitle (
  id MEDIUMINT NOT NULL AUTO_INCREMENT,
  title CHAR(30) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE Employee2JobTitle (
  employee_id MEDIUMINT NOT NULL,
  job_title_id MEDIUMINT NOT NULL,
  FOREIGN KEY (employee_id) REFERENCES Employee(id),
  FOREIGN KEY (job_title_id) REFERENCES JobTitle(id),
  PRIMARY KEY (employee_id, job_title_id)
) ENGINE=InnoDB;

CREATE TABLE Employee2Office (
  employee_id MEDIUMINT NOT NULL,
  office_id MEDIUMINT NOT NULL,
  FOREIGN KEY (employee_id) REFERENCES Employee(id),
  FOREIGN KEY (office_id) REFERENCES Office(id),
  PRIMARY KEY (employee_id, office_id)
) ENGINE=InnoDB;




My Notes in Response to Selected Answer:

PREP:

  1. (a) Using the sample data, create a CSV by changing <TAB> to commas.
  2. (b) Install MySQL and create the sample database using the MySQL DDL sample.
  3. (c) Install Kettle (it's Java based and will run on anything that runs Java).
  4. (d) Download the KTR file.

Step-by-step dataflow (my notes):

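  1. Open the KTR file in Kettle, double-click "CSV file input" and browse to the CSV file you created. The delimiter should already be set to comma. Then click OK.
  2. Double-click "Insert Employees", select the DB connector, then follow the instructions for creating a new database connection.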

Recommended answer

I put together a sample transformation (right click and choose save link) based on what you provided. The only step I feel a bit uncertain about is the last table inputs. I'm basically writing the join data to the table and letting it fail if a specific relationship already exists.

This solution doesn't really meet the "All approaches should include some form of validation and a rollback strategy should an insert fail, or fail to maintain referential integrity" criterion, though it probably won't fail. If you really want to set up something more complex we can, but this should definitely get you going with these transformations.
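For comparison, the validation-and-rollback requirement could be met along the following lines. This is a sketch only, not something the sample transformation does: it uses Python's sqlite3 module as a stand-in for MySQL/InnoDB, and `make_db` and `load_all_or_nothing` are hypothetical helper names. The whole batch runs in one transaction with foreign keys enforced, so a single bad row undoes every insert in the batch.

```python
import sqlite3


def make_db():
    """Create a small in-memory database with FK enforcement on (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE Office (id INTEGER PRIMARY KEY, office_number INTEGER NOT NULL);
        CREATE TABLE Employee2Office (
            employee_id INTEGER NOT NULL REFERENCES Employee(id),
            office_id INTEGER NOT NULL REFERENCES Office(id),
            PRIMARY KEY (employee_id, office_id)
        );
        INSERT INTO Employee (name) VALUES ('John Smith'), ('Alex Button');
        INSERT INTO Office (office_number) VALUES (501);
    """)
    return conn


def load_all_or_nothing(conn, employee_office_pairs):
    """Insert every pair in one transaction; roll back all of them on any violation."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO Employee2Office VALUES (?, ?)",
                employee_office_pairs,
            )
        return True
    except sqlite3.IntegrityError:
        # A foreign key or primary key violation: nothing from this batch is kept.
        return False
```

The trade-off against ignoring insert errors is that duplicate pairs also abort the batch here, so the input has to be deduplicated first.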

1. We start by reading in your file. In my case I converted it to CSV, but tab-delimited is fine too.

2. Now we insert the employee names into the Employee table using a combination lookup/update. After the insert we append the employee id to our data stream as id and remove EmployeeName from the data stream.

3. Here we just use a Select Values step to rename the id field to employee_id.

4. Insert the job titles just like we did the employees and append the title id to our data stream, also removing JobLevelHistory from the data stream.

5. Simple rename of the title id to title_id (see step 3).

6. Insert the offices, get the ids, and remove OfficeHistory from the stream.

7. Simple rename of the office id to office_id (see step 3).

8. Copy the data from the last step into two streams carrying the values employee_id, office_id and employee_id, title_id respectively.

9. Use a table insert to insert the join data. I've set it to ignore insert errors, as there could be duplicates and the PK constraints will make some rows fail.
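The nine steps can be approximated outside Kettle as follows. This is a rough sketch under stated assumptions, not the contents of the KTR file: SQLite (via Python's sqlite3) stands in for MySQL, `lookup_or_insert` is a hypothetical helper that imitates what a combination lookup/update does (return the existing key, or insert and return the new one), and SQLite's `INSERT OR IGNORE` imitates "ignore insert errors" on the link tables.

```python
import sqlite3

# Step 1: the rows as read from the input file.
ROWS = [
    ("John Smith", 501, "Engineer"),
    ("John Smith", 601, "Senior Engineer"),
    ("John Smith", 701, "Manager"),
    ("Alex Button", 601, "Senior Assistant"),
    ("Alex Button", 454, "Manager"),
]

SCHEMA = """
    CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Office (id INTEGER PRIMARY KEY, office_number INTEGER NOT NULL);
    CREATE TABLE JobTitle (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE Employee2Office (
        employee_id INTEGER NOT NULL REFERENCES Employee(id),
        office_id INTEGER NOT NULL REFERENCES Office(id),
        PRIMARY KEY (employee_id, office_id)
    );
    CREATE TABLE Employee2JobTitle (
        employee_id INTEGER NOT NULL REFERENCES Employee(id),
        job_title_id INTEGER NOT NULL REFERENCES JobTitle(id),
        PRIMARY KEY (employee_id, job_title_id)
    );
"""


def lookup_or_insert(conn, table, column, value):
    """Return the id for value, inserting a new row first if it is not there yet."""
    # table and column come from the fixed calls below, never from user input.
    row = conn.execute(
        f"SELECT id FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(f"INSERT INTO {table} ({column}) VALUES (?)", (value,))
    return cur.lastrowid


def load(conn, rows):
    for name, office, title in rows:
        # Steps 2-7: swap each natural value for its surrogate key.
        employee_id = lookup_or_insert(conn, "Employee", "name", name)
        title_id = lookup_or_insert(conn, "JobTitle", "title", title)
        office_id = lookup_or_insert(conn, "Office", "office_number", office)
        # Steps 8-9: write both link rows, skipping pairs that already exist.
        conn.execute(
            "INSERT OR IGNORE INTO Employee2Office VALUES (?, ?)",
            (employee_id, office_id),
        )
        conn.execute(
            "INSERT OR IGNORE INTO Employee2JobTitle VALUES (?, ?)",
            (employee_id, title_id),
        )
    conn.commit()


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # keep FK enforcement on throughout
    conn.executescript(SCHEMA)
    return conn
```

Because parents are always looked up or inserted before the link rows are written, the foreign keys never need to be disabled, and running `load` a second time on the same rows leaves the tables unchanged.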
