How to extract column name and column type from SQL in pyspark


Question

The Spark SQL syntax for a CREATE query looks like this:

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING datasource
  [OPTIONS (key1=val1, key2=val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
  [AS select_statement]

where [x] means x is optional. If a CREATE SQL query is passed in, I want the output as a tuple in the following order:

(db_name, table_name, [(col1 name, col1 type), (col2 name, col2 type), ...])
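
For example, for CREATE TABLE foobar.test (foo INT, bar STRING) USING json, the desired output would be (foobar, test, [(foo, INT), (bar, STRING)]).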

So is there any way to do this with pyspark sql functions, or is regex needed?

If regex is the way to go, could anyone please help with the regular expression?

Answer

It can be done by accessing the unofficial API through java_gateway:

import json
from pyspark.sql.types import StructType

# Parse the DDL statement into a logical plan via the internal Scala parser
plan = spark_session._jsparkSession.sessionState().sqlParser().parsePlan("CREATE TABLE foobar.test (foo INT, bar STRING) USING json")
print(f"database: {plan.tableDesc().identifier().database().get()}")
print(f"table: {plan.tableDesc().identifier().table()}")
# perhaps there is a better way to convert the schemas, using JSON string hack here
print(f"schema: {StructType.fromJson(json.loads(plan.tableDesc().schema().json()))}")

Output:

database: foobar
table: test
schema: StructType(List(StructField(foo,IntegerType,true),StructField(bar,StringType,true)))

Note that database().get() will fail if the database is not defined, so the Scala Option should be handled properly. Also, if you use CREATE TEMPORARY VIEW, the accessors are named differently. The commands can be found here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L38 and https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala#L58
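
Putting it together, here is a minimal sketch that returns the (db_name, table_name, [(col name, col type), ...]) tuple the question asks for, handling the Scala Option instead of calling get() unconditionally. The helper name parse_create_table and the simpleString() formatting of the column types are assumptions of this example, not part of the original answer:

import json
from pyspark.sql.types import StructType

def parse_create_table(spark_session, sql):
    # Parse the CREATE statement into a logical plan (unofficial API)
    plan = spark_session._jsparkSession.sessionState().sqlParser().parsePlan(sql)
    desc = plan.tableDesc()
    # identifier().database() is a Scala Option; check it before calling get()
    db_option = desc.identifier().database()
    db_name = db_option.get() if db_option.isDefined() else None
    table_name = desc.identifier().table()
    # Same JSON string hack as above to convert the Scala schema to a Python StructType
    schema = StructType.fromJson(json.loads(desc.schema().json()))
    columns = [(field.name, field.dataType.simpleString()) for field in schema.fields]
    return (db_name, table_name, columns)

print(parse_create_table(spark_session, "CREATE TABLE foobar.test (foo INT, bar STRING) USING json"))
# ('foobar', 'test', [('foo', 'int'), ('bar', 'string')])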
