Unsupported Array error when reading JDBC source in (Py)Spark?
Question
I am trying to load the tables of a PostgreSQL database into DataFrames. Here is my code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://XXXXXX"
connectionProperties = {
    "user" : " ",
    "password" : " ",
    "driver" : "org.postgresql.Driver"
}

query = "(SELECT table_name FROM information_schema.tables) XXX"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)

table_name_list = df.select("table_name").rdd.flatMap(lambda x: x).collect()
for table_name in table_name_list:
    df2 = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)
The error I get when generating df2 for a table name:
java.sql.SQLException: Unsupported type ARRAY
If I hard-code the table name, I do not get the error:
df2 = spark.read.jdbc(jdbcUrl,"conditions",properties=connectionProperties)
I checked the type of table_name and it is a string. Is this the correct approach?
Recommended answer
I guess you don't want the tables that belong to the internal workings of PostgreSQL, such as pg_type, pg_policies, etc., which live in the pg_catalog schema. These cause the error
py4j.protocol.Py4JJavaError: An error occurred while calling o34.jdbc. : java.sql.SQLException: Unsupported type ARRAY
when you try to read them as
spark.read.jdbc(url=jdbcUrl, table='pg_type', properties=connectionProperties)
There are also tables such as applicable_roles, view_table_usage, etc., which live in the information_schema schema. These cause
py4j.protocol.Py4JJavaError: An error occurred while calling o34.jdbc. : org.postgresql.util.PSQLException: ERROR: relation "view_table_usage" does not exist
when you try to read them as
spark.read.jdbc(url=jdbcUrl, table='view_table_usage', properties=connectionProperties)
Tables in the public schema can be read into DataFrames with the jdbc commands above.
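The "relation does not exist" error above most likely arises because an unqualified name is resolved through the connection's search_path, which by default does not include information_schema. A minimal sketch of a workaround, assuming you pass a schema-qualified, double-quoted name as the table argument; qualified_name is a hypothetical helper, not part of Spark or the PostgreSQL driver:

```python
def qualified_name(schema, table):
    """Return a double-quoted, schema-qualified PostgreSQL identifier."""
    def quote(identifier):
        # Double any embedded double quote, then wrap the identifier in quotes.
        return '"' + identifier.replace('"', '""') + '"'
    return quote(schema) + "." + quote(table)

# Usage (needs a live SparkSession and database, so it is left commented out):
# df = spark.read.jdbc(url=jdbcUrl,
#                      table=qualified_name("information_schema", "view_table_usage"),
#                      properties=connectionProperties)
```

Quoting also preserves mixed-case names, which PostgreSQL would otherwise fold to lower case.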
"I checked table_name type and it is String, is this the correct approach?"
So you need to filter out those table names and apply your logic as
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Connect to DB") \
    .getOrCreate()

jdbcUrl = "jdbc:postgresql://hostname:port/"
connectionProperties = {
    "user" : " ",
    "password" : " ",
    "driver" : "org.postgresql.Driver"
}

# Read the full table listing, including the table_schema column
query = "information_schema.tables"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)

# Keep only tables outside PostgreSQL's system schemas
table_name_list = df.filter(
    (df["table_schema"] != 'pg_catalog') & (df["table_schema"] != 'information_schema')
).select("table_name").rdd.flatMap(lambda x: x).collect()

for table_name in table_name_list:
    df2 = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)
That should work.
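As a variation on the filtering approach, the schema filter can be pushed into the subquery sent to PostgreSQL, and each DataFrame can be kept in a dictionary so that the loop does not overwrite df2 on every iteration. This is only a sketch: user_tables_query and load_user_tables are hypothetical helper names, and the code assumes that table and schema names contain no embedded double quotes.

```python
SYSTEM_SCHEMAS = ("pg_catalog", "information_schema")

def user_tables_query(excluded=SYSTEM_SCHEMAS, alias="t"):
    """Build a parenthesized subquery listing tables outside the excluded schemas."""
    schema_list = ", ".join("'{}'".format(s) for s in excluded)
    return (
        "(SELECT table_schema, table_name FROM information_schema.tables "
        "WHERE table_schema NOT IN ({})) {}".format(schema_list, alias)
    )

def load_user_tables(spark, url, properties):
    """Read every listed table into a dict keyed by (schema, table)."""
    listing = spark.read.jdbc(url=url, table=user_tables_query(), properties=properties)
    frames = {}
    for row in listing.collect():
        schema, table = row["table_schema"], row["table_name"]
        # Schema-qualified, quoted name (assumes no embedded double quotes).
        name = '"{}"."{}"'.format(schema, table)
        frames[(schema, table)] = spark.read.jdbc(url=url, table=name, properties=properties)
    return frames
```

With the session and settings from the answer, this would be called as load_user_tables(spark, jdbcUrl, connectionProperties).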