Partitioning incompletely specified error in my Spark application


Problem Description

Please take a look at the code below. I am getting an error when I pass a value for the number of partitions:

    def loadDataFromPostgress(sqlContext: SQLContext, tableName: String,
        columnName: String, dbURL: String, userName: String, pwd: String,
        partitions: String): DataFrame = {
      println("the no of partitions are : " + partitions)
      var dataDF = sqlContext.read.format("jdbc").options(
        scala.collection.Map("url" -> dbURL,
          "dbtable" -> tableName,
          "driver" -> "org.postgresql.Driver",
          "user" -> userName,
          "password" -> pwd,
          "partitionColumn" -> columnName,
          "numPartitions" -> "1000")).load()
      return dataDF
    }

Error:

    java.lang.RuntimeException: Partitioning incompletely specified
    App > at scala.sys.package$.error(package.scala:27)
    App > at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:38)
    App > at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315)
    App > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
    App > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
    App > at Test$.loadDataFromGreenPlum(script.scala:28)
    App > at Test$.loadDataFrame(script.scala:15)
    App > at Test$.main(script.scala:59)
    App > at Test.main(script.scala)
    App > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    App > at

Recommended Answer

Spark's JDBC source requires that partitionColumn, lowerBound, upperBound, and numPartitions be specified together. Your code sets partitionColumn and numPartitions but omits the two bounds, which is why createRelation throws "Partitioning incompletely specified". You can check the code below to see exactly how to use these options together:

    def loadDataFromPostgress(sqlContext: SQLContext, tableName: String,
        columnName: String, dbURL: String, userName: String,
        pwd: String, partitions: String): DataFrame = {
      println("the no of partitions are : " + partitions)
      val dataDF = sqlContext.read.format("jdbc").options(
        scala.collection.Map("url" -> dbURL,
          // expose a mod-based bucket column so we have something to partition on
          "dbtable" -> "(select mod(tmp.empid,10) as hash_code,tmp.* from employee as tmp) as t",
          "driver" -> "org.postgresql.Driver",
          "user" -> userName,
          "password" -> pwd,
          // all four partitioning options must be given together,
          // and options() takes Map[String, String], so the values are strings
          "partitionColumn" -> "hash_code",
          "lowerBound" -> "0",
          "upperBound" -> "10",
          "numPartitions" -> "10")).load()
      dataDF
    }
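
A hypothetical call of the function above, for reference; the JDBC URL, database name, and credentials are placeholders, not values from the original post. Note that this version of the function hardcodes the employee subquery and the partitioning options, so tableName, columnName, and partitions are only echoed:

    // In spark-shell, sqlContext is already provided; placeholders below are
    // assumptions for illustration only.
    val employeeDF = loadDataFromPostgress(
      sqlContext,
      tableName  = "employee",
      columnName = "hash_code",
      dbURL      = "jdbc:postgresql://dbhost:5432/mydb",
      userName   = "dbuser",
      pwd        = "secret",
      partitions = "10")
    println(employeeDF.rdd.getNumPartitions)  // expect 10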

The loader above will create 10 tasks with 10 queries, as shown below. Before that, the job computes the partition stride (offset):

    offset = (upperBound - lowerBound) / numPartitions

Here, offset = (10 - 0) / 10 = 1.
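
As a minimal sketch of how per-partition WHERE clauses can be derived from those bounds (a simplified illustration of the idea, not Spark's actual implementation):

    // Simplified sketch of JDBC partition predicate generation.
    object PartitionBounds {
      def predicates(column: String, lower: Long, upper: Long,
          numPartitions: Int): Seq[String] = {
        val stride = (upper - lower) / numPartitions  // here: (10 - 0) / 10 = 1
        (0 until numPartitions).map { i =>
          val lo = lower + i * stride
          val hi = lo + stride
          if (i == 0) s"$column < $hi or $column is null"     // first range also catches NULLs
          else if (i == numPartitions - 1) s"$column >= $lo"  // last range is open-ended
          else s"$column >= $lo and $column < $hi"
        }
      }

      def main(args: Array[String]): Unit =
        predicates("hash_code", 0L, 10L, 10).foreach(println)
    }

Applied to the employee subquery, the 10 resulting queries are: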

    select mod(tmp.empid,10) as hash_code,tmp.* from employee as tmp where hash_code < 1 or hash_code is null
    select mod(tmp.empid,10) as hash_code,tmp.* from employee as tmp where hash_code >= 1 and hash_code < 2
    .
    .
    select mod(tmp.empid,10) as hash_code,tmp.* from employee as tmp where hash_code >= 9

This will create 10 partitions:

An empid ending in 0 goes to one partition, because mod(empid, 10) always equals 0.

An empid ending in 1 goes to another partition, because mod(empid, 10) always equals 1.

In this way, all employee rows will be split into 10 partitions.
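
A quick check in the Scala REPL, using made-up empid values, shows this bucketing:

    // hypothetical empids; grouping by mod 10 reproduces the hash_code buckets
    val empids = Seq(100, 101, 110, 111, 205, 319)
    empids.groupBy(_ % 10).toSeq.sortBy(_._1).foreach { case (bucket, ids) =>
      println(s"partition $bucket -> ${ids.mkString(", ")}")
    }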

You have to change the partitionColumn, upperBound, lowerBound, and numPartitions values according to your requirements.

Hope my answer helps you.
