Get the schema of a Parquet file in Python without loading the file into a Spark DataFrame?
Problem description
Is there a Python library that can be used to get just the schema of a Parquet file?
Currently we load the Parquet file into a Spark DataFrame and take the schema from that DataFrame to display it in the application's UI. But initializing the Spark context, loading the DataFrame, and getting the schema from it is time-consuming. So we are looking for an alternative way to get just the schema.
Recommended answer
In addition to the answer by @mehdio, in case your Parquet is a directory (e.g. a Parquet dataset generated by Spark), you can read the schema / column names like this:
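import pyarrow.parquet as pq

# read_table accepts a single file or a directory of part files.
# Note: it loads the data into memory (no Spark needed), not only the schema.
pfile = pq.read_table("file.parquet")
print("Column names: {}".format(pfile.column_names))
print("Schema: {}".format(pfile.schema))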