Observing duplicates using sqoop with Oozie


Problem Description


I've built a sqoop program in order to import data from MySQL to HDFS using a pre-built sqoop job:

    sqoop job -fs $driver_path -D mapreduce.map.java.opts="-Duser.timezone=Europe/Paris" \
        --create job_parquet_table -- import -m $nodes_number \
        --connect jdbc:mysql://$server:$port/$database --username $username --password-file $pass_file \
        --target-dir $destination_dir --table $table --as-parquetfile --append \
        --incremental append --check-column $id_column_names --last-value 1 \
        --fields-terminated-by , --escaped-by \\ --enclosed-by '\"'

I've also built a Spark program which allows me to monitor the sqoop import. In it, I measure the max_id (SELECT MAX(ID) of the table being imported), the sqoop incremental.last.value, the number of records (SELECT COUNT(*)), the number of distinct records (SELECT COUNT(DISTINCT ID)), and the percentage of duplicates (defined as (COUNT(*) - COUNT(DISTINCT ID)) / COUNT(*) * 100).
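The Spark monitoring program itself isn't shown in the question; as a minimal sketch of the metrics it computes (plain Python rather than Spark, and the `ids` list below is an invented example, not real table data):

```python
# Minimal stand-in for the Spark monitoring job: compute the same metrics
# over a list of IDs (here an invented example list, not real table data).

def duplicate_stats(ids):
    total = len(ids)                            # SELECT COUNT(*)
    distinct = len(set(ids))                    # SELECT COUNT(DISTINCT ID)
    max_id = max(ids)                           # SELECT MAX(ID)
    pct_dup = (total - distinct) / total * 100  # percentage of duplicates
    return max_id, total, distinct, pct_dup

# Example: IDs 1..8 imported once, then 5..8 re-imported (a duplicated batch)
ids = list(range(1, 9)) + list(range(5, 9))
max_id, total, distinct, pct = duplicate_stats(ids)
print(max_id, total, distinct, round(pct, 1))  # → 8 12 8 33.3
```

If the incremental import works correctly, `pct_dup` stays at 0 and `max_id` matches the stored last-value after each run; any divergence between the two signals the problem described below.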

When I import the data manually, by executing the sqoop job:

sqoop-job --exec job_parquet_$table -- --username edastatnew --password edastatnew

Everything is right: I've got no duplicates and the data is perfectly imported.

However, when I use Oozie to schedule the sqoop job, I have observed that sqoop incremental.last.value is wrongly updated, which causes duplicates in my data. Here is my monitoring:

As you can see, the max_ID doesn't match the sqoop_last_value at different times of the experiment. I have concluded that when using Oozie, the sqoop incremental.last.value is wrongly updated after executing a sqoop job. Has anyone faced this same issue? If yes, have you found any solution?

Thanks,

Solution

Ask yourself a question: where does Sqoop store that "last value" information?

The answer is: for Sqoop1, by default, in a file on the local filesystem. But Oozie runs your Sqoop job on random machines in the cluster, therefore the executions are not coordinated.
And Sqoop2 (which has a proper Metastore database) is more or less in limbo; at least it is not supported by Oozie.

The solution is to start a shared HSQLDB database service to store the "last value" information for all Sqoop1 jobs, whatever machine they are running on.
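As a sketch of that setup (the host name "metastore-host" is a placeholder; 16000 is Sqoop1's default metastore port): run `sqoop metastore` as a long-lived service on one dedicated host, then point every client at it in sqoop-site.xml:

```xml
<!-- sqoop-site.xml on every client node: auto-connect to the shared
     metastore ("metastore-host" is a placeholder for your own host) -->
<property>
  <name>sqoop.metastore.client.autoconnect.url</name>
  <value>jdbc:hsqldb:hsql://metastore-host:16000/sqoop</value>
</property>
```

Alternatively, pass the URL explicitly each time with `sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --exec ...`, so the incremental last-value is read and written in one shared place regardless of which node Oozie happens to pick.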

Please read the Sqoop1 documentation about its lame Metastore and about how to use it.
And for a more professional handling of that obsolete HSQLDB database, look at that post of mine.
