Apache Spark: JDBC connection not working
Problem Description
I have asked this question previously as well but did not get any answer (Not able to connect to postgres using jdbc in pyspark shell).
I have successfully installed Spark 1.3.0 on my local Windows machine and ran sample programs to test it using the pyspark shell.
Now I want to run Correlations from MLlib on data stored in PostgreSQL, but I am not able to connect to PostgreSQL.
I have successfully added the required jar (and tested this jar) to the classpath by running:
pyspark --jars "C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar"
I can see that the jar is successfully added in the Environment UI.
When I run the following in the pyspark shell:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
I get this error:
>>> df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\pyspark\sql\context.py", line 482, in load
df = self._ssql_ctx.load(source, joptions)
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://[host]/[dbname]
at java.sql.DriverManager.getConnection(DriverManager.java:602)
at java.sql.DriverManager.getConnection(DriverManager.java:207)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:94)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:125)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:114)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:619)
I had this exact problem with mysql/mariadb, and got a big clue from this question.
So your pyspark command should be:
pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
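Filling in the placeholders, a concrete invocation might look like the sketch below. The jar path and `local[2]` master are hypothetical; forward slashes are used to avoid backslash-escaping issues in the shell. Note that all three settings deliberately name the same jar: `--jars` ships it to the executors, `--driver-class-path` puts it on the driver's classpath, and `spark.executor.extraClassPath` puts it on each executor's classpath.

```shell
# Hypothetical jar location -- adjust to your setup.
PG_JAR="C:/path/to/jar/postgresql-9.2-1002.jdbc3.jar"

# The launch command; echoed here so the sketch can be checked without Spark.
echo pyspark \
  --conf spark.executor.extraClassPath="$PG_JAR" \
  --driver-class-path "$PG_JAR" \
  --jars "$PG_JAR" \
  --master "local[2]"
```

To actually launch the shell, drop the `echo`.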
Also watch for errors when pyspark starts, such as "Warning: Local jar ... does not exist, skipping." and "ERROR SparkContext: Jar not found at ..."; these probably mean you spelled the path wrong.
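If the classpath flags alone don't resolve "No suitable driver found", Spark 1.3's JDBC source also accepts a `driver` option naming the driver class explicitly, which sidesteps the `DriverManager` lookup that fails in the traceback above. This is a hedged sketch, not a verified fix: the `driver` option and the `org.postgresql.Driver` class name are assumptions based on the Spark 1.3 JDBC data source and the standard PostgreSQL driver.

```python
# Sketch: pass the JDBC driver class explicitly (assumes the Postgres jar is
# already on the classpath via --jars / --driver-class-path as shown above).
jdbc_options = {
    "source": "jdbc",
    "url": "jdbc:postgresql://[host]/[dbname]",  # placeholders as in the question
    "dbtable": "[schema.table]",
    "driver": "org.postgresql.Driver",           # explicit driver class (assumption)
}

# With a live SparkContext `sc`, the call would be:
# from pyspark.sql import SQLContext
# sqlContext = SQLContext(sc)
# df = sqlContext.load(**jdbc_options)
```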