Assign SQL schema to Spark DataFrame

Question
I'm converting my team's legacy Redshift SQL code to Spark SQL code. All the Spark examples I've seen define the schema in a non-SQL way using StructType and StructField, and I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark.
This is the ugly workaround I'm doing now. Is there a more elegant way that doesn't require defining an empty table just so that I can pull the SQL schema?
create_table_sql = '''
CREATE TABLE public.example (
    id LONG,
    example VARCHAR(80)
)'''
spark.sql(create_table_sql)

# Pull the schema back off the (still empty) table
schema = spark.table("public.example").schema

s3_data = spark.read \
    .option("delimiter", "|") \
    .csv(path="s3a://" + s3_bucket_path, schema=schema)
s3_data.write.mode("overwrite").saveAsTable("public.example")
Yes, there is a way to create a schema from a string, although I'm not sure it really looks like SQL! You can use:
from pyspark.sql.types import _parse_datatype_string
_parse_datatype_string("id: long, example: string")
This will create the following schema:
StructType(List(StructField(id,LongType,true),StructField(example,StringType,true)))
Or you can have a complex schema as well:
schema = _parse_datatype_string("customers array<struct<id: long, name: string, address: string>>")
StructType(
List(StructField(
customers,ArrayType(
StructType(
List(
StructField(id,LongType,true),
StructField(name,StringType,true),
StructField(address,StringType,true)
)
),true),true)
)
)
You can check for more examples here