用于 Sqoop 拆分列的 Oracle ROWID [英] Oracle ROWID for Sqoop Split-By Column

查看:39
本文介绍了用于 Sqoop 拆分列的 Oracle ROWID的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个巨大的 oracle 表 (Transaction),我的 oracle 表中的数据在Customer id"列上的数据有偏差,因此少数映射器需要花费数小时的时间才能完成工作,而其他映射器则在分钟.我看不到任何其他选项来避免倾斜数据,因为这是唯一可以拆分的列.我们可以结合其他列,如客户 ID、批次 ID、SEQ NUM 来进行多列拆分,但我知道 sqoop 不支持多列拆分.

I have a huge oracle table (Transaction), the data in my oracle table has skew data on the column "Customer id" due to which the few mappers take time in hours to finish the job while other mappers finish the job in minutes. I couldn't see any other option to avoid the skewing data as this is the only column can be splited by. We can combine other columns like Customer ID, Batch ID, SEQ NUM to come with multi column split but I understood that sqoop doesn't support multi column in split by.

我的目标是提取特定时期的交易数据(即一个月数据的唯一批次日期).

My objective is to pull the transaction data for a specific period (i.e. batch date unique for a month of data).

我在带有 10 个映射器的 sqoop 中尝试了以下选项.

I tried the below options in sqoop with 10 mappers.

--split-by "my column name"           //for example customer id
--where "my query condition"         //for example batch date

现在我正在考虑使用 ROWID,它可能会在映射器之间均匀地分割行.我想到了使用边界查询来获取 MIN &最大行 ID.下面是我要使用的 Sqoop 命令.

Now I am thinking of using the ROWID which might split the rows evenly between the mappers. I thought of using the boundary query to get the MIN & MAX ROW ID. Below is Sqoop command I want to use.

sqoop import \
--table Transaction \
--split-by ROWID \
--where "BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY')" \
--boundary-query "SELECT MIN(ROWID) AS MIN, MAX(ROWID) AS MAXL FROM Transaction WHERE BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY') GROUP BY CUSTOMERID, BATCHNO,BATCHSEQNO " \
--num-mappers 10 \
--target-dir /user/trans

需要建议这是否是正确的选择或是否有其他方法.

Need advise if this would be right option or is there any other way.

我也想知道我们是否可以使用多列名称拆分.

Also I would like to know if we can use multi split-by column name by any chance.

推荐答案

提供 --boundary-query 只会节省您评估最小值和最大值的时间.所有映射器将具有相同的范围查询.

Providing --boundary-query will only save your time in evaluating minimun and maximun value. All mappers will have the same range query.

在您的情况下,sqoop 将生成边界查询,如 -

In your case, sqoop will generate boundary query like -

SELECT MIN(ROWID), MAX(ROWID) FROM (Select * From Transaction WHERE BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY') ) t1

您可以在 JDBC 客户端上尝试此查询和自定义边界查询,以检查哪个更快并使用那个.

You can try this query and your custom boundary query on your JDBC client to check which one is faster and use that one.

现在映射器负载不均匀.

Now coming to uneven mappers load.

是的,你说得对.目前,sqoop 不支持多列拆分.你必须选择一列.如果 ROWID 是均匀分布的(我假设是),你应该使用它.

Yes, you are right. Currently, sqoop doesn't support multi-column in split by. you have to choose one column. If ROWID is evenly distributed (I am assuming yes), you should use it.

所以,您的查询看起来不错.只需检查 compare--boundary-query.

So, you query looks good. Just check compare--boundary-query.

ROWID 类型的 Oracle 没有正确的 java 类型问题.

There is no proper java type issue with ROWID type of Oracle.

在导入命令中添加 --map-column-java ROWID=String 以将其映射到 Java 的字符串.

Add --map-column-java ROWID=String in your import command to map this to Java's String.

这篇关于用于 Sqoop 拆分列的 Oracle ROWID的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆