Assign SQL schema to Spark DataFrame


Question


I'm converting my team's legacy Redshift SQL code to Spark SQL code. All the Spark examples I've seen define the schema in a non-SQL way using StructType and StructField, and I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark.

This is the ugly workaround I'm doing now. Is there a more elegant way that doesn't require defining an empty table just so that I can pull the SQL schema?

create_table_sql = '''
CREATE TABLE public.example (
  id LONG,
  example VARCHAR(80)
)'''
spark.sql(create_table_sql)
# Pull the schema back out of the empty table
schema = spark.table("public.example").schema
s3_data = spark.read \
    .option("delimiter", "|") \
    .csv(path="s3a://" + s3_bucket_path, schema=schema)
s3_data.write.mode("append").saveAsTable("public.example")

Solution

Yes, there is a way to create a schema from a string, although I am not sure it really looks like SQL! You can use:

from pyspark.sql.types import _parse_datatype_string

_parse_datatype_string("id: long, example: string")

This will create the following schema:

  StructType(List(StructField(id,LongType,true),StructField(example,StringType,true)))

You can also parse a more complex schema:

schema = _parse_datatype_string("customers array<struct<id: long, name: string, address: string>>")

StructType(
  List(StructField(
    customers,ArrayType(
      StructType(
        List(
          StructField(id,LongType,true),
          StructField(name,StringType,true),
          StructField(address,StringType,true)
        )
      ),true),true)
  )
)

You can check here for more examples.
