How do I loop through a large dataset in python without getting a MemoryError?

Question

I have a large series of raster datasets representing monthly rainfall over several decades. I've written a script in Python that loops over each raster and does the following:

  1. Converts the raster to a numpy masked array,
  2. Performs lots of array algebra to calculate a new water level,
  3. Writes the result to an output raster.
  4. Repeats.

The script is just a long list of array algebra equations enclosed by a loop statement.
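
For concreteness, here is a minimal sketch of the kind of loop described. The `read_raster`/`write_raster` helpers, file names, and coefficients are all hypothetical stand-ins (a real script would read and write rasters with a library such as GDAL or rasterio):

```python
import numpy as np

def read_raster(path):
    """Hypothetical helper: load a raster into a numpy masked array.
    A real script would use GDAL, rasterio, or similar here."""
    data = np.load(path)                # placeholder for a real raster read
    return np.ma.masked_invalid(data)   # mask NoData / NaN cells

def write_raster(path, array):
    """Hypothetical helper: write a masked array back out as a raster."""
    np.save(path, array.filled(np.nan))  # placeholder for a real raster write

water_level = read_raster("initial_level.npy")

for month in range(1, 481):                           # ~40 years of monthly data
    rainfall = read_raster(f"rain_{month:04d}.npy")   # 1. raster -> masked array
    recharge = rainfall * 0.25                        # 2. array algebra (example)
    water_level = water_level + recharge - 0.1 * water_level
    write_raster(f"level_{month:04d}.npy", water_level)  # 3. write output raster
    # 4. repeat: water_level carries over as the start point of the next month
```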

Everything works well if I just run the script on a small part of my data (say 20 years' worth), but if I try to process the whole lot I get a MemoryError. The error doesn't give any more information than that (except it highlights the line in the code at which Python gave up).

Unfortunately, I can't easily process my data in chunks - I really need to be able to do the whole lot at once. This is because, at the end of each iteration, the output (water level) is fed back into the next iteration as the start point.

My understanding of programming is very basic at present, but I thought that all of my objects would just be overwritten on each loop. I (stupidly?) assumed that if the code managed to loop successfully once then it should be able to loop indefinitely without using up more and more memory.
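
That assumption is essentially correct in CPython: plain reassignment rebinds the name, and once the old array's reference count drops to zero it is freed immediately, with no garbage-collector run required. One subtlety is worth knowing, though (the array shapes here are illustrative only):

```python
import numpy as np

a = np.zeros((10_000, 10_000))   # ~800 MB of float64
a = np.zeros((10_000, 10_000))   # the new array is allocated *before* the old
                                 # reference is dropped, so peak usage is briefly
                                 # two arrays; the old one is then freed at once,
                                 # but only if nothing else still references it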

I've tried reading various bits of documentation and have discovered something called the "Garbage Collector", but I feel like I'm getting out of my depth and my brain's melting! Can anyone offer some basic insight into what actually happens to objects in memory when my code loops? Is there a way of freeing up memory at the end of each loop, or is there some more "Pythonic" way of coding which avoids this problem altogether?

Answer

You don't need to concern yourself with memory management, and especially not with the garbage collector, which has a very specific task (reclaiming reference cycles) that you most likely don't even need here. Python will always collect the memory it can and reuse it.
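
That "very specific task" is reclaiming reference cycles, which plain array algebra rarely creates; everything else is freed by reference counting as soon as the last reference goes away. A quick illustration:

```python
import gc

class Node:
    pass

a, b = Node(), Node()
a.other, b.other = b, a   # a reference cycle: the refcounts can never reach zero
del a, b                  # the cycle is now unreachable, but not yet freed

print(gc.collect())       # the cycle collector reclaims it; prints a count > 0
```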

There are just two possible reasons for your problem: either the data you are trying to load is too large to fit into memory, or your calculations store data somewhere (a list, dict, or anything else that persists between iterations) and that storage grows and grows. A memory profiler can help you find it.
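
For instance, the standard library's tracemalloc module can show which source lines are accumulating memory between two snapshots; the leaky `history` list below is just an illustration of the kind of hidden growth it would reveal:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

history = []
for i in range(100):
    result = [0.0] * 10_000
    history.append(result)   # storage that persists between iterations and grows

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)              # biggest allocation growth, reported per source line
```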
