How to connect to Presto JDBC in PySpark?


Problem description


I want to connect to Presto server using JDBC in PySpark. I followed a tutorial which is written in Java. I am trying to do the same in my Python3 code but getting an error:

: java.sql.SQLException: No suitable driver

I tried executing the following code:

jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:presto://my_machine_ip:8080/hive/default") \
    .option("user", "airflow") \
    .option("dbtable", "may30_1") \
    .load()


It should be noted that I am using Spark on EMR and so, spark is already provided to me.
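Since the SparkSession on EMR is created before any user code runs, the driver jar can also be supplied through Spark configuration instead of shell flags. A minimal sketch, assuming the jar has already been copied onto the node (the jar path below is a placeholder, not from the question):

```
# spark-defaults.conf (on EMR typically under /etc/spark/conf)
spark.jars                    /path/to/presto-jdbc.jar
spark.driver.extraClassPath   /path/to/presto-jdbc.jar
```

With this in place, the pre-created `spark` session can reach the driver class without restarting the shell with extra flags.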

The code in the tutorial mentioned above is:

final String JDBC_DRIVER = "com.facebook.presto.jdbc.PrestoDriver";
final String DB_URL = "jdbc:presto://localhost:9000/catalogName/schemaName";
// Database credentials
final String USER = "username";
final String PASS = "password";
Connection conn = null;
Statement stmt = null;
try {
    // Register JDBC driver
    Class.forName(JDBC_DRIVER);


Note the JDBC_DRIVER assignment in the code above; I have not been able to work out the corresponding assignment in Python3, i.e. in PySpark.


Nor have I added any dependency in any configuration whatsoever.


I expect to successfully connect to my Presto server. Right now, I am getting the following error; the complete stacktrace is:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 172, in load
    return self._df(self._jreader.load())
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o90.load.
: java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:315)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:104)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Answer


You need to perform the following steps to connect Presto to PySpark.


  1. Download the Presto JDBC driver jar from the official website and copy it onto the Spark master node.


  2. Start the pyspark shell with the following params:

bin/pyspark --driver-class-path /path/to/presto/jdbc/driver/jar/file --jars /path/to/presto/jdbc/driver/jar/file

  3. Try connecting to Presto using Spark:

# The "driver" option names the Presto JDBC driver class; this option was
# missing from the failing snippet in the question.
jdbcDF = spark.read \
    .format("jdbc") \
    .option("driver", "com.facebook.presto.jdbc.PrestoDriver") \
    .option("url", "jdbc:presto://<machine_ip>:8080/hive/default") \
    .option("user", "airflow") \
    .option("dbtable", "may30_1") \
    .load()
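The options above can also be collected in a plain dict, which makes it easy to check that the driver class is set before calling load. A minimal sketch using a hypothetical helper (`presto_read_options` is not part of any library; the defaults mirror the values used in this question):

```python
# Hypothetical helper: builds the option map for spark.read.format("jdbc").
# The "driver" entry is the one whose absence caused "No suitable driver".
def presto_read_options(host, port=8080, catalog="hive", schema="default",
                        user="airflow", table="may30_1"):
    return {
        "driver": "com.facebook.presto.jdbc.PrestoDriver",
        "url": f"jdbc:presto://{host}:{port}/{catalog}/{schema}",
        "user": user,
        "dbtable": table,
    }

# Usage requires a live cluster, so it is shown as a comment:
# jdbcDF = spark.read.format("jdbc") \
#     .options(**presto_read_options("my_machine_ip")).load()
```

This keeps the connection details in one place if several tables are read from the same catalog and schema.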
