Streaming parquet file python and only downsampling

Problem description

I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and save to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.

Am I wrong to attempt to do this without using a spark framework?

I have tried using pyarrow and fastparquet but I get memory errors on trying to read the entire file in. Any tips or suggestions would be greatly appreciated!

Recommended answer

Spark is certainly a viable choice for this task.

We're planning to add streaming read logic in pyarrow this year (2019; see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory use issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.
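
Below is a minimal sketch of that approach, assuming pyarrow and pandas are installed; the file name data.parquet and the 10% sampling fraction are placeholders and not part of the original answer.

    import pandas as pd
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")  # placeholder path, adjust to your file

    sampled_chunks = []
    for i in range(pf.num_row_groups):
        # Read a single row group into memory as a pyarrow.Table
        table = pf.read_row_group(i)
        # Down-sample each chunk immediately so peak memory stays close to
        # one row group plus the accumulated sample (fraction is an example)
        chunk = table.to_pandas().sample(frac=0.1, random_state=0)
        sampled_chunks.append(chunk)

    # Combine the per-row-group samples into one dataframe to work with
    df = pd.concat(sampled_chunks, ignore_index=True)

Note that this only helps if the file was written with multiple row groups; a parquet file written as a single large row group still has to be materialised in full.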
