How to extract column name and column type from SQL in pyspark


Question

The Spark SQL syntax for a CREATE query is like this -

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING datasource
  [OPTIONS (key1=val1, key2=val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

where [x] means x is optional. If a CREATE sql query is passed, I want the output as a tuple in the following order -

(db_name, table_name, [(col1 name, col1 type), (col2 name, col2 type), ...])
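
For instance (a hypothetical illustration, not part of the original statement), for

CREATE TABLE mydb.users (id INT, name STRING) USING parquet

the expected output would be ("mydb", "users", [("id", "int"), ("name", "string")]).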

So is there any way to do that with pyspark sql functions, or is help from a regex needed?

If regex is the way, could anyone please help with the regular expression?

Answer

It can be done by accessing the unofficial API through java_gateway:

import json
from pyspark.sql.types import StructType

# Parse the DDL via the (unofficial) Java session state; spark_session is an active SparkSession
plan = spark_session._jsparkSession.sessionState().sqlParser().parsePlan("CREATE TABLE foobar.test (foo INT, bar STRING) USING json")
print(f"database: {plan.tableDesc().identifier().database().get()}")
print(f"table: {plan.tableDesc().identifier().table()}")
# perhaps there is a better way to convert the schema, using the JSON string hack here
print(f"schema: {StructType.fromJson(json.loads(plan.tableDesc().schema().json()))}")

Output:

database: foobar
table: test
schema: StructType(List(StructField(foo,IntegerType,true),StructField(bar,StringType,true)))
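
From there it is straightforward to reassemble the exact tuple shape asked for in the question. A minimal sketch (continuing from the snippet above; dataType.simpleString() is one way to render the type names, and the database accessor is guarded because it returns a Scala Option):

import json
from pyspark.sql.types import StructType

ident = plan.tableDesc().identifier()
# database() returns a Scala Option; guard it instead of calling get() unconditionally
db_name = ident.database().get() if ident.database().isDefined() else None
schema = StructType.fromJson(json.loads(plan.tableDesc().schema().json()))
result = (db_name, ident.table(), [(f.name, f.dataType.simpleString()) for f in schema.fields])
print(result)  # ('foobar', 'test', [('foo', 'int'), ('bar', 'string')])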

Note that database().get() would fail if the database is not defined, so the Scala Option should be handled properly. Also, if you use CREATE TEMPORARY VIEW, the accessors are named differently. The commands can be found here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L38
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L58
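
The plan class can be inspected to tell the two cases apart; a minimal sketch (class names come from the ddl.scala file linked above; the tableIdent() accessor for the temp-view case is an assumption to verify against your Spark version):

kind = plan.getClass().getSimpleName()
if kind == "CreateTable":
    ident = plan.tableDesc().identifier()
elif kind == "CreateTempViewUsing":
    # assumed accessor from ddl.scala: the temp-view plan exposes tableIdent() directly
    ident = plan.tableIdent()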

