Sqoop Incremental Import
Question
I need advice on Sqoop incremental imports.
Say I have a customer with Policy 1 on Day 1. I imported those records into HDFS on Day 1, and I can see them in the part files.
On Day 2, the same customer adds Policy 2. After the incremental Sqoop import runs, will the part files contain only the new records?
If so, how do I get both the old records and the incrementally appended or last-modified records using Sqoop?
Recommended answer
Consider a table with 3 records that you have already imported into HDFS using Sqoop:
+------+------------+----------+------+------------+
| sid | city | state | rank | rDate |
+------+------------+----------+------+------------+
| 101 | Chicago | Illinois | 1 | 2014-01-25 |
| 101 | Schaumburg | Illinois | 3 | 2014-01-25 |
| 101 | Columbus | Ohio | 7 | 2014-01-25 |
+------+------------+----------+------+------------+
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P
Now the table has additional records, but the existing records have not been updated:
+------+------------+----------+------+------------+
| sid | city | state | rank | rDate |
+------+------------+----------+------+------------+
| 101 | Chicago | Illinois | 1 | 2014-01-25 |
| 101 | Schaumburg | Illinois | 3 | 2014-01-25 |
| 101 | Columbus | Ohio | 7 | 2014-01-25 |
| 103 | Charlotte | NC | 9 | 2013-04-22 |
| 103 | Greenville | SC | 9 | 2013-05-12 |
| 103 | Atlanta | GA | 11 | 2013-08-21 |
+------+------------+----------+------+------------+
Here you should use --incremental append together with --check-column, which specifies the column to be examined when determining which rows to import.
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rank --incremental append --last-value 7
The above command imports all new rows, that is, the rows whose rank is greater than the last value (7).
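To make the selection rule concrete, here is a minimal Python sketch of which rows an append-mode run would pick up from the table above. It only mimics the documented rule (check column greater than --last-value); the function name select_append and the tuple layout are made up for this illustration and are not part of Sqoop.

```python
# Sample table after the new records were added:
# (sid, city, state, rank, rDate)
rows = [
    (101, "Chicago", "Illinois", 1, "2014-01-25"),
    (101, "Schaumburg", "Illinois", 3, "2014-01-25"),
    (101, "Columbus", "Ohio", 7, "2014-01-25"),
    (103, "Charlotte", "NC", 9, "2013-04-22"),
    (103, "Greenville", "SC", 9, "2013-05-12"),
    (103, "Atlanta", "GA", 11, "2013-08-21"),
]

RANK = 3  # position of the check column (rank) in each tuple


def select_append(table, last_value):
    """Hypothetical helper: keep rows whose check column is strictly
    greater than last_value, as append mode is documented to do."""
    return [r for r in table if r[RANK] > last_value]


new_rows = select_append(rows, last_value=7)
for r in new_rows:
    print(r)

# The largest value seen becomes the --last-value for the next run.
next_last_value = max(r[RANK] for r in new_rows)
print("next --last-value:", next_last_value)
```

Only the three new rows are selected, so the part files written by this run hold just those rows. Note that this relies on the check column growing with every insert: a later row with a rank below the stored last value would be skipped, so an auto-incrementing id is usually a safer check column than something like rank.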
Now consider the second case, where existing rows have been updated:
+------+------------+----------+------+------------+
| sid | city | state | rank | rDate |
+------+------------+----------+------+------------+
| 101 | Chicago | Illinois | 1 | 2015-01-01 |
| 101 | Schaumburg | Illinois | 3 | 2014-01-25 |
| 101 | Columbus | Ohio | 7 | 2014-01-25 |
| 103 | Charlotte | NC | 9 | 2013-04-22 |
| 103 | Greenville | SC | 9 | 2013-05-12 |
| 103 | Atlanta | GA | 11 | 2013-08-21 |
| 104 | Dallas | Texas | 4 | 2015-02-02 |
| 105 | Phoenix | Arizona | 17 | 2015-02-24 |
+------+------------+----------+------+------------+
Here we use --incremental lastmodified, which fetches all rows updated since the given date, based on the rDate column.
sqoop import --connect jdbc:mysql://localhost:3306/ydb --table yloc --username root -P --check-column rDate --incremental lastmodified --last-value 2014-01-25 --target-dir yloc/loc
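As a rough illustration of what this means for the data, the Python sketch below picks the rows newer than the last value and then folds them into the previously imported rows to get "old plus updated" in one set. It is a sketch under assumptions, not Sqoop's implementation: the helper select_lastmodified is hypothetical, the composite key (sid, city) is chosen only because sid alone is not unique in the sample table, and rows dated exactly on the boundary are not modeled.

```python
from datetime import date

D = date.fromisoformat  # shorthand for parsing the rDate column

# Rows already in HDFS from the earlier imports: (sid, city, state, rank, rDate)
previous = [
    (101, "Chicago", "Illinois", 1, D("2014-01-25")),
    (101, "Schaumburg", "Illinois", 3, D("2014-01-25")),
    (101, "Columbus", "Ohio", 7, D("2014-01-25")),
    (103, "Charlotte", "NC", 9, D("2013-04-22")),
    (103, "Greenville", "SC", 9, D("2013-05-12")),
    (103, "Atlanta", "GA", 11, D("2013-08-21")),
]

# Current state of the source table (one update, two inserts)
current = [(101, "Chicago", "Illinois", 1, D("2015-01-01"))] + previous[1:] + [
    (104, "Dallas", "Texas", 4, D("2015-02-02")),
    (105, "Phoenix", "Arizona", 17, D("2015-02-24")),
]

RDATE = 4  # position of the check column (rDate) in each tuple


def select_lastmodified(table, last_value):
    """Hypothetical helper: keep rows modified more recently than last_value."""
    return [r for r in table if r[RDATE] > last_value]


changed = select_lastmodified(current, D("2014-01-25"))

# Merge step: a changed row replaces the old row with the same key.
# (sid, city) is an assumed key, used here for illustration only.
merged = {(r[0], r[1]): r for r in previous}
merged.update({(r[0], r[1]): r for r in changed})

for r in merged.values():
    print(r)
```

The incremental run on its own yields only the changed rows (Chicago, Dallas, Phoenix); the older copy of the Chicago row is still in the earlier part files until the two sets are reconciled. Sqoop provides a --merge-key option and a merge tool for that kind of reconciliation; check the documentation for your version before relying on either.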