在 SQOOP 中使用 Where 子句的问题 [英] Issue in using Where Clause in SQOOP

查看:31
本文介绍了在 SQOOP 中使用 Where 子句的问题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用 --where 选项通过使用以下命令将订单表与 order_items 表连接来获取条件数据:

I am trying to use --where option to get conditional data by joining orders table with order_items table using below command :

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id where  \$CONDITIONS " \
--where "order_id between 10840 and 10850" \
--target-dir /user/cloudera/order_join_conditional \
--split-by order_id

<小时>

现在我不知道这有什么问题,因为当我在 MySQL 中运行相同的查询时,我得到 41 条记录,这是正确的但是当我在 sqoop 中运行此命令时,它将转储所有 172198 条记录.我不明白发生了什么,出了什么问题.


Now i don't know whats wrong with this because when i Run same Query in MySQL i get 41 records which is correct But when i run this command in sqoop it will Dump all the 172198 records. I don't understand whats happening and whats going wrong.

推荐答案

运行并行导入时,Sqoop 将使用 --split-by 中指定的参数值替换 $CONDITIONS 参数并生成不同的查询(将由不同的映射器执行).例如,Sqoop 将首先尝试找到 order_id 的最小值和最大值,并根据映射器的数量,尝试针对 order_id 的整个可能值范围的不同子集执行查询>order_id.

When you run a parallel import, Sqoop will use the value of the parameter specified in --split-by to substitute the $CONDITIONS parameter and generate different queries (which will be executed by different mappers). For instance, Sqoop will first try to find the minimum and maximum value of order_id and depending on the number of mappers, will try to execute your query against different subsets of the whole range of possible values of order_id.

这样,您的查询将在内部转换为不同的并行查询,例如:

That way, your query would be translated internally to different parallel queries like these ones:

SELECT * FROM orders o join order_items oi on o.order_id = oi.order_item_order_id
 WHERE (order_id >=0 AND order_id < 10000)

SELECT * FROM orders o join order_items oi on o.order_id = oi.order_item_order_id
 WHERE (order_id >=1000 AND order_id < 20000)

...

因此,在这种情况下,您单独指定的 --where 子句将不会被使用,您最终将拥有所有记录.但在您的特定情况下,您并不真正需要 --split-by 标志,因为您只对特定(且非常有限)的值范围感兴趣.所以你可以改用这个:

So in this case, the --where clause you specified separately will not be used and you'll end up having all the records. But in your particular case, you don't really need the --split-by flag, because you are only interested in a particular (and very limited) range of values. So you could use this instead:

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id WHERE (order_id BETWEEN 10840 AND 10850)" \
--target-dir /user/cloudera/order_join_conditional \
-m 1

还要注意末尾的 -m 1(正如 dev ツ 指出的那样)代表 --num-mappers 并允许您告诉 Sqoop只想为您的导入过程使用一个映射器(因此,没有并行性).

Note also the -m 1 at the end which (as pointed out by dev ツ) stands for --num-mappers and allows you to tell Sqoop that you want to use just one mapper for your import process (therefore, no parallelism).

如果值的范围更大,您可以使用 --split-by 并在您的自由格式查询中使用您的 where 条件,利用并行性:

If the range of values was bigger, you could use the --split-by and use your where condition in your free-form query, making use of the parallelism:

sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id WHERE (order_id BETWEEN 10840 AND 10850) AND \$CONDITIONS" \
--target-dir /user/cloudera/order_join_conditional \
--split-by order_id

这篇关于在 SQOOP 中使用 Where 子句的问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆