High memory usage when reading Parquet in Python

Question

I have a Parquet file of a little over 10 GB whose columns are mainly strings. While it loads, memory usage peaks at about 110 GB; once loading finishes, it drops back to around 40 GB.

I'm working on a high-performance computing (HPC) system where memory is allocated per request, so I do have access to large amounts of memory. Still, it seems wasteful to request 128 GB just to load the data when 64 GB is sufficient afterwards. Also, the 128 GB configuration is out of order more often.

My naive conjecture is that the Python interpreter mistakes the 512 GB of physical memory on the HPC machine for the total available memory, so it does not garbage-collect as often as it actually needs to. For example, when I load the data with a 64 GB allocation, I never get a MemoryError; instead, the kernel is killed outright and restarted.

Is this high memory usage during loading normal behavior for pyarrow, or is it caused by the particular setup of my environment? If the latter, is there a way to limit the memory available during loading?

Recommended answer

We fixed a memory-use bug that is present in pyarrow 0.14.0 and 0.14.1 (which is probably what you are using right now):

https://issues.apache.org/jira/browse/ARROW-6060

We are also introducing an option to read string columns as categorical (a DictionaryArray in Arrow parlance), which will also reduce memory usage. See https://issues.apache.org/jira/browse/ARROW-3325 and the discussion in

https://ursalabs.org/blog/2019-06-07月报/
