Spark JDBC pseudocolumn isn't working


Problem description

For my use case, I am trying to read one big Oracle table using Spark JDBC. Since I do not have an integer-type column in my table, I am using rownum as the partitionColumn.

Here is what my Spark query looks like (for testing, I am using a table with only 22000 rows):

val df = spark.read.jdbc(url = url, table = "select * from table1",
                         columnName = "rownum", lowerBound = 0, upperBound = 22000,
                         numPartitions = 3, connectionProperties = oracleProperties)

Ideally, it should return 3 partitions with roughly 7000 rows in each. But when I ran a count on each partition of the dataframe, I saw that only one partition had rows while the others had 0.

df.rdd.mapPartitionsWithIndex{case(i, rows) => Iterator((i, rows.size))}.toDF().show()

Output:

+---+----+
| _1|  _2|
+---+----+
|  0|7332|
|  1|   0|
|  2|   0|
+---+----+

Can you please suggest why it's only returning rows in one partition?

My source is an Oracle database, using the Oracle JDBC driver oracle.jdbc.driver.OracleDriver (jar: ojdbc7.jar).

Answer

After some googling around, I was able to make it work. I made some changes to the query, described below.

I was trying to read from an Oracle database using Spark JDBC. I was able to use Oracle's pseudocolumn ROWNUM to parallelize the read from Spark with one hack: you have to alias the ROWNUM column and then use that aliased column as the partition column.

I wanted to query the entire "table1" and create multiple partitions in Spark for that table.

val df = spark.read.jdbc(url = url, table = "select * from table1",
                         columnName = "ROWNUM", lowerBound = 0, upperBound = 22000,
                         numPartitions = 3, connectionProperties = oracleProperties)
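
This version runs, but every row ends up in the first partition. The reason, which the original thread does not spell out, follows from how Spark JDBC splits the read and how Oracle evaluates ROWNUM; the boundary literals below are my own arithmetic from lowerBound = 0, upperBound = 22000, numPartitions = 3, so treat them as illustrative:

// Approximately the per-partition filters Spark generates
// (stride = (22000 - 0) / 3 = 7333):
//
//   partition 0:  ... where ROWNUM < 7333
//   partition 1:  ... where ROWNUM >= 7333 and ROWNUM < 14666
//   partition 2:  ... where ROWNUM >= 14666
//
// Oracle assigns ROWNUM to a row only as the row is actually returned,
// starting at 1. A filter like ROWNUM >= 7333 can therefore never match:
// the first candidate row would get ROWNUM = 1 and be rejected. Partitions
// 1 and 2 come back empty, and partition 0 keeps the 7332 rows with
// ROWNUM < 7333.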

In order to partition on the pseudocolumn, modify the query as follows:

val df = spark.read.jdbc(url = url,
                         table = "(select t1.*, ROWNUM as num_rows from (select * from table1) t1) oracle_table1",
                         columnName = "num_rows", lowerBound = 0, upperBound = 22000,
                         numPartitions = 3, connectionProperties = oracleProperties)
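
With the alias in place, num_rows is computed inside the subquery before Spark's range filters are applied, so the predicates behave as they would on an ordinary numeric column. To check the split, the count-per-partition snippet from the question can be re-run; the roughly even distribution is what I would expect, not output quoted from the original thread:

// Count rows per partition again; with num_rows as the partition column,
// each of the 3 partitions should now hold roughly a third of the 22000 rows.
df.rdd.mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .toDF("partition", "rows").show()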

This way we effectively turn the pseudocolumn into an actual column and use it for partitioning.
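
One practical note: the hard-coded upperBound = 22000 only fits this test table. Here is a minimal sketch, assuming url and oracleProperties (a java.util.Properties) are defined as in the snippets above, for deriving the bound from the actual row count; this wiring is my own illustration, not part of the original answer:

import java.sql.DriverManager

// Fetch the row count over plain JDBC so upperBound tracks the real table size.
val conn = DriverManager.getConnection(url, oracleProperties)
val rowCount = try {
  val rs = conn.createStatement().executeQuery("select count(*) from table1")
  rs.next()
  rs.getLong(1)
} finally {
  conn.close()
}

val df = spark.read.jdbc(url = url,
                         table = "(select t1.*, ROWNUM as num_rows from (select * from table1) t1) oracle_table1",
                         columnName = "num_rows", lowerBound = 0, upperBound = rowCount,
                         numPartitions = 3, connectionProperties = oracleProperties)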

