Identify Partition Key Column from a table using PySpark


Problem Description

I need help finding the unique partition column names for a Hive table using PySpark. The table might have multiple partition columns, and preferably the output should return a list of the partition columns for the Hive table.

It would be great if the result also included the data types of the partition columns.

Any suggestions would be helpful.

Recommended Answer

It can be done using desc, as shown below:

>>> df = spark.sql("desc test_dev_db.partition_date_table")
>>> df.show(truncate=False)
+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|emp_id                 |int      |null   |
|emp_name               |string   |null   |
|emp_salary             |int      |null   |
|emp_date               |date     |null   |
|year                   |string   |null   |
|month                  |string   |null   |
|day                    |string   |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|year                   |string   |null   |
|month                  |string   |null   |
|day                    |string   |null   |
+-----------------------+---------+-------+

Since this table is partitioned, you can see the partition column information here, along with the data types.

It seems you are interested in just the partition column names and their respective data types, so I am creating a list of tuples.

partition_list = df.select(df.col_name, df.data_type).rdd.map(lambda x: (x[0], x[1])).collect()

>>> print partition_list
[(u'emp_id', u'int'), (u'emp_name', u'string'), (u'emp_salary', u'int'), (u'emp_date', u'date'), (u'year', u'string'), (u'month', u'string'), (u'day', u'string'), (u'# Partition Information', u''), (u'# col_name', u'data_type'), (u'year', u'string'), (u'month', u'string'), (u'day', u'string')]

partition_details = [partition_list[index + 1:] for index, item in enumerate(partition_list) if item[0] == '# col_name']

>>> print partition_details
[[(u'year', u'string'), (u'month', u'string'), (u'day', u'string')]]
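The slicing step can be sketched without a Spark session: the list below mimics the `(col_name, data_type)` tuples collected from the `desc` output for the example table, and everything after the `# col_name` marker row is a partition column.

```python
# Sample rows mimicking the collected `desc` output (no Spark needed).
rows = [
    ("emp_id", "int"), ("emp_name", "string"), ("emp_salary", "int"),
    ("emp_date", "date"), ("year", "string"), ("month", "string"),
    ("day", "string"), ("# Partition Information", ""),
    ("# col_name", "data_type"),
    ("year", "string"), ("month", "string"), ("day", "string"),
]

# Everything after the '# col_name' marker row describes a partition column.
partition_details = [rows[index + 1:] for index, item in enumerate(rows)
                     if item[0] == "# col_name"]

# Flatten to a single list (empty when the marker row is absent).
partition_columns = partition_details[0] if partition_details else []
print(partition_columns)
# [('year', 'string'), ('month', 'string'), ('day', 'string')]
```

From here, just the names are `[name for name, dtype in partition_columns]`.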

It will return an empty list if the table is not partitioned. Hope this helps.
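The non-partitioned case can be checked with the same Spark-free sketch: when the `desc` output contains no `# col_name` marker row, the comprehension yields an empty list.

```python
# `desc` output for an unpartitioned table has no '# col_name' marker row.
rows = [("emp_id", "int"), ("emp_name", "string")]

partition_details = [rows[index + 1:] for index, item in enumerate(rows)
                     if item[0] == "# col_name"]
print(partition_details)
# []
```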

