Working with jdbc jar in pyspark

Question

I need to read from a PostgreSQL database in pyspark. I know this has been asked before, such as here, here, and in many other places; however, the solutions there either use a jar in the local running directory or copy it to all workers manually.

I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then proceeded to call pyspark with the --jars and --driver-class-path switches:

pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars/postgresql-9.4.1208.jar

Inside pyspark I did:

df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name").load()
df.count()

However, while using --jars and --driver-class-path worked fine for jars I created, it failed for the JDBC driver and I got an exception from the workers:

 java.lang.IllegalStateException: Did not find registered driver with class org.postgresql.Driver

If I copy the jar manually to all workers and add --conf spark.executor.extraClassPath and --conf spark.driver.extraClassPath, it does work (with the same jar). Incidentally, the documentation suggests using SPARK_CLASSPATH, which is deprecated but in effect adds these same two settings (and has the side effect of preventing other jars, which I need, from being added with the --jars option).
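
For reference, the manual workaround described above corresponds to an invocation along these lines (a sketch only, assuming the jar has already been copied to /tmp/jars on every worker node; the paths mirror the ones used earlier in the question):

pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar \
  --conf spark.executor.extraClassPath=/tmp/jars/postgresql-9.4.1208.jar \
  --conf spark.driver.extraClassPath=/tmp/jars/postgresql-9.4.1208.jar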

So my question is: what is special about the JDBC driver that makes it not work, and how can I add it without having to copy it manually to all workers?

Update:

I did some more looking and found this in the documentation: "The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java's DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs."

The problem is that I can't seem to find compute_classpath.sh, nor do I understand what the primordial class loader means.

I did find this, which basically explains that this needs to be done locally. I also found this, which basically says there is a fix, but it is not yet available in version 1.6.1.

Answer

I found a solution that works (I don't know whether it is the best one, so feel free to keep commenting). Apparently, if I add the option driver="org.postgresql.Driver", it works properly. That is, my full line (inside pyspark) is:

df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name",driver="org.postgresql.Driver").load()
df.count()
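
As a side note beyond the original answer, the driver class can also be supplied through the properties dict of the read.jdbc helper instead of options(); a minimal sketch, reusing the same placeholder host, port, credentials, and table name as above:

df = sqlContext.read.jdbc(
    url="jdbc:postgresql://ip_address:port/db_name",
    table="table_name",
    properties={
        "user": "myuser",
        "password": "mypasswd",
        "driver": "org.postgresql.Driver",  # same class named explicitly in the options() form
    },
)
df.count()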

Another thing: if you are already using a fat jar of your own (as I am in my full application), then all you need to do is add the JDBC driver to your pom file, like so:

    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>9.4.1208</version>
    </dependency>

Then you don't have to add the driver as a separate jar; just use the jar with dependencies.
