Streaming parquet file python and only downsampling
Question
I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and save to a dataframe? Ultimately, I would like to have the data in dataframe format to work with.
Am I wrong to attempt to do this without using a Spark framework?
I have tried using pyarrow and fastparquet, but I get memory errors when trying to read the entire file in.

Any tips or suggestions would be greatly appreciated!
Answer
Spark is certainly a viable choice for this task.
We're planning to add streaming read logic in pyarrow this year (2019; see https://issues.apache.org/jira/browse/ARROW-3771 and related issues). In the meantime, I would recommend reading one row group at a time to mitigate the memory use issues. You can do this with pyarrow.parquet.ParquetFile and its read_row_group method.