How to read a Parquet file into Pandas DataFrame?


Question

How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.

I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.

Answer

pandas 0.21 introduces new functions for Parquet:

import pandas as pd

pd.read_parquet('example_pa.parquet', engine='pyarrow')
pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
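
For context, here is a minimal round-trip sketch, assuming pyarrow (or fastparquet) is installed and, for the S3 case, s3fs as well; the file names and the S3 path are hypothetical placeholders:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})

# Write the DataFrame to a local Parquet file (no Hadoop/Spark needed).
df.to_parquet('example.parquet', engine='pyarrow')

# Read it back into an in-memory DataFrame.
df2 = pd.read_parquet('example.parquet', engine='pyarrow')

# Reading straight from S3 also works when s3fs is installed
# (hypothetical bucket/key):
# df3 = pd.read_parquet('s3://my-bucket/path/to/data.parquet')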

