Sqoop - 数据拆分 [英] Sqoop - Data splitting

查看:38
本文介绍了Sqoop - 数据拆分的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Sqoop 能够使用 --query 子句从多个表中导入数据,但不清楚是否能够导入下面的查询.

Sqoop able to import data from multiple tables using --query clause but not clear whether it is able to import below query.

选择deptid, avg(salary) from emp group by deptid

Select deptid, avg(salary) from emp group by deptid

另一个问题是

sqoop import --connect jdbc:mysql://myserver:1202/--username=u1--password=p1 --query 'Select * from emp where empid<1123 和 $CONDITIONS' --split-by empid --target-dir/uname/emp/salary

sqoop import --connect jdbc:mysql://myserver:1202/ --username=u1 --password=p1 --query 'Select * from emp where empid< 1123 and $CONDITIONS' --split-by empid --target-dir /uname/emp/salary

$CONDITIONS 和 split-by 用于执行并行处理,或者我们可以说有效地导入数据.前者根据条件拆分行,然后在主键上使用 min 和 max 逻辑.这两个($CONDITIONS, split-by) 有什么区别.如果我们在同一个 sqoop 语句中使用两者,哪个子句会获得优先级?

$CONDITIONS and split-by are used to perform parallel processing or we can say efficiently importing data. Former split the rows based on condition and later use min and max logic on primary key. What is the difference between these two ($CONDITIONS, split-by). If we use both in same sqoop statement, which clause would got the priority?

谢谢....

推荐答案

你的理解有一些差距.

首先,并行度由-m --num-mappers 控制.--num-mappers 的默认值为 4.

First of all, the degree of parallelism is controlled by -m <n> or --num-mappers <n>. By default value of --num-mappers is 4.

其次,--split-by ,将根据 column-name 拆分您的任务.

Second, --split-by <column-name>, will split your task on the basis of column-name.

第三,$CONDITIONS,sqoop内部使用它来实现这个拆分任务.

Third, $CONDITIONS, it is used internally by sqoop to achieve this splitting task.

示例,您触发了查询:

sqoop import --connect jdbc:mysql://myserver:1202/--username u1 --password p1 --query 'select * from emp where $CONDITIONS' --split-by empId --target-dir/temp/emp -m 4

比如说,我的 empId 从 1 到 100 均匀分布.

现在,sqoop 将使用 --split-by 列并使用查询找到它的 ma​​xmin 值:

Now, sqoop will take --split-by column and find its max and min value using query:

SELECT MIN(empId), MAX(empId) FROM (Select * From emp WHERE (1 = 1) ) t1

看到它用 (1 = 1) 替换了 $CONDITIONS.

See it replaced $CONDITIONS with (1 = 1).

在我们的例子中,最小值、最大值是 1 和 100.

In our case, min, max values are 1 and 100.

由于映射器的数量为 4,sqoop 会将我的查询分为 4 部分.

As number of mappers are 4, sqoop will divide my query in 4 parts.

创建输入拆分,下限为 'empId >= 1',上限为 'empId <25'

Creating input split with lower bound 'empId >= 1' and upper bound 'empId < 25'

创建输入拆分下限'empId >= 25'和上限'empId <50'

Creating input split with lower bound 'empId >= 25' and upper bound 'empId < 50'

创建输入分割,下限'empId >= 50'和上限'empId <75'

Creating input split with lower bound 'empId >= 50' and upper bound 'empId < 75'

使用下限 'empId >= 75' 和上限 'empId <= 100' 创建输入拆分

Creating input split with lower bound 'empId >= 75' and upper bound 'empId <= 100'

现在 $CONDITIONS 将再次出现.它被上述范围的查询所取代.

Now $CONDITIONS will again come into the picture. It is replaced by above range queries.

第一个映射器将触发这样的查询:

First mapper will fire query like this:

Select * From emp WHERE empId >= 25' AND 'empId <;50

对其他 3 个映射器依此类推.

and so on for other 3 mappers.

所有映射器的结果汇总并写入最终的 HDFS 目录.

Results from all the mappers is aggregated and written to a final HDFS directory.

关于您的查询:

选择deptid, avg(salary) from emp group by deptid

您将指定

--query 'select deptid, avg(salary) from emp group by deptid where $CONDITIONS'

会先转换为

选择deptid, avg(salary) from emp group by deptid where (1 = 0)

获取列元数据.

我相信这个查询不会在 RDBMS 中运行.直接在 Mysql 中尝试上述查询(具有 Where (1 = 0)).

I believe this query won't run in RDBMS. Try above query(having Where (1 = 0)) directly in Mysql.

因此您将无法使用此查询通过 Sqoop 获取数据.

So you will not be able to use this query to fetch data using Sqoop.

Sqoop 用于更简单的 SQL 查询.

这篇关于Sqoop - 数据拆分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆