Is it possible to write a Sqoop incremental import with filters on the new file before importing?
My doubt is: say I have a table on a SQL Server database with 2000 records, and I import this data into HDFS. Later that day, 3000 more records are added to the same table. Now I want to run an incremental import for that second chunk of data, but I do not want all 3000 records to be imported. I only need the subset that matches a certain condition, say 1000 records, to be imported as part of the incremental import.
Is there a way to do that using sqoop incremental import command?
Please Help, Thank you.
You need a unique key or a timestamp field to identify the deltas, which in your case are the new 1000 records. Using that field, you have two options to bring the data into Hadoop.
Option 1
Use Sqoop's incremental append mode. Below is an example:
sqoop import \
--connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
--username wzhou \
--password wzhou \
--table STUDENT \
--incremental append \
--check-column student_id \
-m 4 \
--split-by major
Arguments:
--check-column (col)   Specifies the column to be examined when determining which rows to import.
--incremental (mode)   Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value)   Specifies the maximum value of the check column from the previous import.
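On its own, incremental append brings in every row past the last value. To also apply your extra condition, Sqoop's --where argument can be combined with the incremental flags; Sqoop ANDs the row filter with the check-column predicate. A sketch under assumed values (the grade column, its cutoff, and --last-value 2000 are illustrative, not from the original question):

```shell
# Sketch: incremental append plus a row filter.
# --last-value resumes from the previous run's high-water mark (2000
# records already imported); --where applies the business condition.
# Column name and filter value are hypothetical.
sqoop import \
  --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
  --username wzhou \
  --password wzhou \
  --table STUDENT \
  --where "grade = 'A'" \
  --incremental append \
  --check-column student_id \
  --last-value 2000 \
  -m 4 \
  --split-by major
```

With --incremental append, Sqoop prints the new high-water mark at the end of the run, which you pass as --last-value on the next run (or let a saved job track it for you).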
Option 2
Use the --query argument, which lets you write native SQL for MySQL or whatever database you connect to.
Examples:
sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults
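The $CONDITIONS placeholder is where Sqoop injects each mapper's split-range predicate on the --split-by column, so your own filter simply sits beside it in the WHERE clause. A rough shell sketch of that substitution (the table, bounds, and filter are made-up; real Sqoop derives the ranges from MIN/MAX of the split column):

```shell
# Sketch of what Sqoop does with a free-form query: each mapper runs the
# same SQL with $CONDITIONS replaced by a range predicate on the
# --split-by column. Names and bounds here are illustrative.
QUERY="SELECT * FROM STUDENT WHERE student_id > 2000 AND grade = 'A' AND \$CONDITIONS"
LO=2001; HI=5000; MAPPERS=4
STEP=$(( (HI - LO + MAPPERS) / MAPPERS ))   # ceiling division

for i in $(seq 0 $(( MAPPERS - 1 ))); do
  START=$(( LO + i * STEP ))
  END=$(( START + STEP ))
  COND="student_id >= $START AND student_id < $END"
  SQL="${QUERY/\$CONDITIONS/$COND}"        # bash pattern substitution
  echo "mapper $i: $SQL"
done
```

So for your case, the query itself carries both the incremental bound (student_id > last value) and the business filter, and Sqoop only parallelizes it via $CONDITIONS.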