Sqoop Incremental Import


Question

I need advice on Sqoop incremental imports. Say a customer has Policy 1 on Day 1; I import those records into HDFS on Day 1 and see them in the part files.
On Day 2, the same customer adds Policy 2. After the incremental Sqoop import runs, will the part files contain only the new records? If so, how do I get both the old records and the incrementally appended or last-modified records using Sqoop?

Recommended answer

Consider a table with 3 records that you have already imported into HDFS using Sqoop:

+------+------------+----------+------+------------+
| sid  | city       | state    | rank | rDate      |
+------+------------+----------+------+------------+
|  101 | Chicago    | Illinois |    1 | 2014-01-25 |
|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |
|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |
+------+------------+----------+------+------------+

sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P

Now the table has additional records, but the existing records have not been updated:

+------+------------+----------+------+------------+
| sid  | city       | state    | rank | rDate      |
+------+------------+----------+------+------------+
|  101 | Chicago    | Illinois |    1 | 2014-01-25 |
|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |
|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |
|  103 | Charlotte  | NC       |    9 | 2013-04-22 |
|  103 | Greenville | SC       |    9 | 2013-05-12 |
|  103 | Atlanta    | GA       |   11 | 2013-08-21 |
+------+------------+----------+------+------------+

Here you should use --incremental append together with --check-column, which specifies the column to examine when determining which rows to import, and --last-value, which gives the largest value of that column already imported.

sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rank --incremental append --last-value 7

The command above imports every row whose rank is greater than the last value, 7. In this table those are exactly the three new rows.
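To make the selection rule concrete, here is a minimal sketch that mimics the append-mode filter (check column greater than the last value) on a local copy of the sample rows. It does not call Sqoop; the helper name select_appended and the comma-separated layout are assumptions made only for illustration.

```shell
#!/bin/sh
# Sketch only: reproduces the "check-column > last-value" rule locally.
# select_appended is a hypothetical helper, not part of Sqoop.
select_appended() {
  last_value="$1"
  # Field 4 is rank in the assumed sid,city,state,rank,rDate layout.
  # Adding 0 forces a numeric comparison.
  awk -F',' -v last="$last_value" '$4 + 0 > last + 0'
}

# Sample rows after the three new records were added.
rows='101,Chicago,Illinois,1,2014-01-25
101,Schaumburg,Illinois,3,2014-01-25
101,Columbus,Ohio,7,2014-01-25
103,Charlotte,NC,9,2013-04-22
103,Greenville,SC,9,2013-05-12
103,Atlanta,GA,11,2013-08-21'

# Print the rows an append import with --last-value 7 would pick up.
printf '%s\n' "$rows" | select_appended 7
```

With a last value of 7 this prints only the three sid 103 rows. A row added later with a rank of 7 or less would not be selected, so the check column should be one whose values only grow as rows are added.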

Now consider the second case, where existing rows have been updated:

+------+------------+----------+------+------------+
| sid  | city       | state    | rank | rDate      |
+------+------------+----------+------+------------+
|  101 | Chicago    | Illinois |    1 | 2015-01-01 |
|  101 | Schaumburg | Illinois |    3 | 2014-01-25 |
|  101 | Columbus   | Ohio     |    7 | 2014-01-25 |
|  103 | Charlotte  | NC       |    9 | 2013-04-22 |
|  103 | Greenville | SC       |    9 | 2013-05-12 |
|  103 | Atlanta    | GA       |   11 | 2013-08-21 |
|  104 | Dallas     | Texas    |    4 | 2015-02-02 |
|  105 | Phoenix    | Arizona  |   17 | 2015-02-24 |
+------+------------+----------+------+------------+

Here we use --incremental lastmodified, which fetches the rows whose date column is more recent than the given last value, covering both updated and newly added rows.

sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/loc
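The date-based selection can be sketched the same way. The snippet below mimics the lastmodified filter on a local copy of the sample rows, assuming a strictly-later comparison against the last value; the exact boundary handling may differ between Sqoop versions, so treat that as an assumption. The helper name select_modified and the comma-separated layout are hypothetical, and Sqoop itself is not called.

```shell
#!/bin/sh
# Sketch only: reproduces a "check-column later than last-value" rule locally.
# select_modified is a hypothetical helper, not part of Sqoop.
select_modified() {
  last_value="$1"
  # Field 5 is rDate in the assumed sid,city,state,rank,rDate layout.
  # Appending "" forces a string comparison; ISO dates sort lexically.
  awk -F',' -v last="$last_value" '($5 "") > (last "")'
}

# Sample rows after one update (Chicago) and two additions.
updated_rows='101,Chicago,Illinois,1,2015-01-01
101,Schaumburg,Illinois,3,2014-01-25
101,Columbus,Ohio,7,2014-01-25
103,Charlotte,NC,9,2013-04-22
103,Greenville,SC,9,2013-05-12
103,Atlanta,GA,11,2013-08-21
104,Dallas,Texas,4,2015-02-02
105,Phoenix,Arizona,17,2015-02-24'

# Print the rows dated after 2014-01-25.
printf '%s\n' "$updated_rows" | select_modified 2014-01-25
```

Under that assumption the filter prints the updated Chicago row and the new Dallas and Phoenix rows, while the five rows dated on or before 2014-01-25 are left out.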
