How to read a Parquet file into Pandas DataFrame?
Question
How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.
Recommended answer
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
pd.read_parquet('example_fp.parquet', engine='fastparquet')
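
For a complete round trip, here is a minimal sketch that writes a small DataFrame to a local Parquet file and reads it back, assuming the pyarrow engine is installed (pip install pyarrow); the file name example.parquet is an arbitrary placeholder:

import pandas as pd

# Build a small DataFrame and write it to a local Parquet file.
df = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
df.to_parquet('example.parquet', engine='pyarrow')

# Read it back into an in-memory DataFrame; no Hadoop or Spark involved.
df2 = pd.read_parquet('example.parquet', engine='pyarrow')
print(df2)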
The link above explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
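
Since the question also mentions S3: pd.read_parquet can read directly from an s3:// URL when the s3fs package is installed, so no Hadoop or Spark services are needed there either. The bucket path below is a hypothetical placeholder:

import pandas as pd

# Requires the s3fs package; 's3://my-bucket/data.parquet' is a
# hypothetical path used for illustration, not a real object.
df = pd.read_parquet('s3://my-bucket/data.parquet', engine='pyarrow')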