Spark SQL RDD loads in pyspark but not in spark-submit: "JDBCRDD: closed connection"


Problem description

I have the following simple code for loading a table from my Postgres database into an RDD.

# this setup is just for spark-submit, will be ignored in pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("GA")#.setMaster("localhost")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# func for loading table
def get_db_rdd(table):
    url = "jdbc:postgresql://localhost:5432/harvest?user=postgres"
    print(url)
    lower = 0
    upper = 1000
    ret = sqlContext \
      .read \
      .format("jdbc") \
      .option("url", url) \
      .option("dbtable", table) \
      .option("partitionColumn", "id") \
      .option("numPartitions", 1024) \
      .option("lowerBound", lower) \
      .option("upperBound", upper) \
      .option("password", "password") \
      .load()
    ret = ret.rdd
    return ret

# load table, and print results
print(get_db_rdd("mytable").collect())

I run ./bin/pyspark then paste that into the interpreter, and it prints out the data from my table as expected.

Now, if I save that code to a file named test.py then do ./bin/spark-submit test.py, it starts to run, but then I see these messages spam my console forever:

17/02/16 02:24:21 INFO Executor: Running task 45.0 in stage 0.0 (TID 45)
17/02/16 02:24:21 INFO JDBCRDD: closed connection
17/02/16 02:24:21 INFO Executor: Finished task 45.0 in stage 0.0 (TID 45). 1673 bytes result sent to driver

This is on a single machine. I haven't started any masters or slaves; spark-submit is the only command I run after system start. I tried with the master/slave setup with the same results. My spark-env.sh file looks like this:

export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=800m
export SPARK_EXECUTOR_MEMORY=800m
export SPARK_EXECUTOR_CORES=2
export SPARK_CLASSPATH=/home/ubuntu/spark/pg_driver.jar # Postgres driver I need for SQLContext
export PYTHONHASHSEED=1337 # have to make workers use same seed in Python3

It works if I spark-submit a Python file that just creates an RDD from a list or something. I only have problems when I try to use a JDBC RDD. What piece am I missing?

Recommended answer

When using spark-submit, you have to supply the JDBC driver jar to the executors.

As described in the Spark 2.1 JDBC documentation:

To get started you will need to include the JDBC driver for your particular database on the spark classpath. For example, to connect to postgres from the Spark Shell you would run the following command:

bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar

Note: The same applies to the spark-submit command.
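
For example, a spark-submit invocation for the script in the question could look like this (a sketch, assuming the Postgres driver jar sits at the SPARK_CLASSPATH path shown in the question's spark-env.sh; adjust the path to wherever your driver actually lives):

bin/spark-submit \
  --driver-class-path /home/ubuntu/spark/pg_driver.jar \
  --jars /home/ubuntu/spark/pg_driver.jar \
  test.py

Here --driver-class-path puts the driver class on the driver JVM's class path, while --jars ships the jar to the executors so each task can open its own JDBC connection.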

Troubleshooting

The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.
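
Rather than passing the flags on every invocation or editing scripts on each worker, the same effect can usually be achieved by setting the class path entries once in conf/spark-defaults.conf. A sketch, again assuming the jar path from the question's spark-env.sh:

spark.driver.extraClassPath     /home/ubuntu/spark/pg_driver.jar
spark.executor.extraClassPath   /home/ubuntu/spark/pg_driver.jar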
