How to read a Parquet file into Pandas DataFrame?
Question
How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.
Recommended answer
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
pd.read_parquet('example_fp.parquet', engine='fastparquet')
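
For a complete round trip, here is a minimal sketch that writes a small DataFrame to a local Parquet file and reads it back, assuming the pyarrow engine is installed (pip install pyarrow); the file name example.parquet is an arbitrary placeholder:

import pandas as pd

# Build a small DataFrame and write it to a local Parquet file.
df = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
df.to_parquet('example.parquet', engine='pyarrow')

# Read it back into an in-memory DataFrame; no Hadoop or Spark involved.
df2 = pd.read_parquet('example.parquet', engine='pyarrow')
print(df2)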
The link above explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
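
Since the question also mentions S3: pd.read_parquet can read directly from an s3:// URL when the s3fs package is installed, so no Hadoop or Spark services are needed there either. The bucket path below is a hypothetical placeholder:

import pandas as pd

# Requires the s3fs package; 's3://my-bucket/data.parquet' is a
# hypothetical path used for illustration, not a real object.
df = pd.read_parquet('s3://my-bucket/data.parquet', engine='pyarrow')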