How to remotely execute a Postgres SQL function on Postgres using PySpark JDBC connector?
Problem description
I want to execute the following query using the JDBC connector:
SELECT id, postgres_function(some_column) FROM my_database GROUP BY id
The problem is I can't execute this kind of query on PySpark using spark.sql(QUERY), obviously because postgres_function is not an ANSI SQL function supported since Spark 2.0.0.
I'm using Spark 2.0.1 and Postgres 9.4.
Answer
The only option you have is to use a subquery:
table = """
(SELECT id, postgres_function(some_column) FROM my_database GROUP BY id) AS t
"""
sqlContext.read.jdbc(url=url, table=table)
but this will execute the whole query, including the aggregation, on the database side and fetch the result.
In general it doesn't matter whether the function is an ANSI SQL function or whether it has an equivalent in the source database; all functions called in spark.sql are executed in Spark after the data is fetched.
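To make the pattern concrete, the wrapping step can be factored into a small helper that turns any pushdown query into the parenthesized, aliased form the JDBC reader expects. This is a sketch: the URL, database, and postgres_function names are placeholders from the question, and the actual read requires a running Postgres plus the JDBC driver on the classpath.

```python
def as_jdbc_subquery(query, alias="t"):
    """Wrap a SQL query so it can be passed as the `table` argument of
    DataFrameReader.jdbc — JDBC sources require a parenthesized
    subquery followed by an alias."""
    return "({}) AS {}".format(query.strip(), alias)

# The query from the question, executed entirely on the Postgres side:
query = "SELECT id, postgres_function(some_column) FROM my_database GROUP BY id"
table = as_jdbc_subquery(query)
# table == "(SELECT id, postgres_function(some_column) FROM my_database GROUP BY id) AS t"

# Illustrative only — needs a reachable Postgres and the postgresql JDBC jar:
# df = sqlContext.read.jdbc(url="jdbc:postgresql://host:5432/db", table=table)
```

Keeping the subquery construction separate also makes it easy to reuse the same helper for other pushdown queries against the same connection.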