sqoop split by 选项在 split by 选项中使用派生列时出错 [英] sqoop split by option is giving error while using a derived column in the split by option

查看:71
本文介绍了sqoop split by 选项在 split by 选项中使用派生列时出错的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个 Oracle 查询,它正在获取 2500 万条记录,没有 pk 或没有正确分布的列以按列拆分.所以我想到了使用 ROW_number() over () as RANGEGROUP 来制作一个序列号.但是当我使用这个伪列时,它给了我一个错误说

I have an Oracle query which is fetching 25 million records, there is no pk or no columns which is distributed properly to make as a split by column. So I have thought of making a sequence number using ROW_number() over () as RANGEGROUP. But when I use this pseudo column its giving me an error saying

在 org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)引起:java.sql.SQLSyntaxErrorException:ORA-00904:P".RANGEGROUP":oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:91)处的标识符无效.

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164) Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: "P"."RANGEGROUP": invalid identifier at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:91).

我正确地给出了别名,即使我尝试不使用伪列的别名,它仍然给出相同的错误.我们可以在 Sqoop split by 中使用派生列,还是该列应该物理存在于表中?

I am properly giving the alias, even I tried with out alias to the pseudo column, its still giving the same error. Can we use derived columns in Sqoop split by, or the column should be physically present in table?

推荐答案

使用子查询包装 row_number 计算,然后在 split-by 中使用派生列.

Use subquery to wrap row_number calculation, then use derived column in the split-by.

   --query "select col1, ... colN, RANGEGROUP 
               from (select t.*, row_number() OVER (order by t.item_id ) AS RANGEGROUP
                      from table t ) s 
              where 1=1 and \$CONDITIONS"

row_number 应该是确定性的,这意味着当多次执行时,它应该为所有行分配完全相同的数字.如果 OVER 中的 ORDER BY 不包含唯一的列或组合,会发生什么情况:row_number 可以为相同的行返回不同的数字.如果你在 split-by 中使用它,你会得到重复,因为同一行可以在拆分范围 1 中,比如 1-100,在 mapper2 中 sqoop 将使用范围 2 的过滤器执行相同的查询,比如 (101-200) 同一行也可以出现在该范围内.Sqoop 在具有不同条件的不同容器(映射器)中运行相同的查询以并行获取分割范围.

row_number should be deterministic, it means when executed multiple times, it should assign exactly the same number to all rows. What can happen if ORDER BY in the OVER contains not unique column or combination: row_number can return different numbers for the same rows. And if you are using it in the split-by, you will get duplication because the same row can be in split range 1, say 1-100, in mapper2 sqoop will execute same query with filter for range 2, say (101-200) the same row can appear also in that range. Sqoop runs the same query in different containers(mappers) with different condition to get split ranges in parallel.

如果 Id 是 int(如果均匀分布会更好),使用该 ID.为什么您可能需要 row_number 是因为它是 STRING 列.阅读:https://stackoverflow.com/a/37389134/2700344,分列不一定是PK

If Id is int (and much better if it is evenly distributed), use that ID. Why you may need row_number is when it is STRING column. read this: https://stackoverflow.com/a/37389134/2700344, split-column is not necessarily a PK

这篇关于sqoop split by 选项在 split by 选项中使用派生列时出错的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆