How do I fix a memory allocation problem with pandas?

Views: 536 Published: 2020/5/24 3:43:11 Tags: python pandas

Problem description

A little background: I've inherited a piece of Python code at my company, and I don't really know Python. The code combines a few pre-downloaded Excel reports into one using pandas. I keep running into a memory allocation error:

MemoryError: Unable to allocate 368. MiB for an array with shape (17, 5668350) and data type object

This is the code that gives me the error:

dfCC = dfVendNew.merge(dfVendOld[['SAP ID', 'Cost ctr']], on='SAP ID', how='left')

I am stuck at this point and unable to progress further. I've tried changing the paging file size on Windows, but it did not help. I suspect the problem is related to my computer's settings, since this script runs on other machines without a hitch.

Any help is much appreciated.

Recommended answer

The object dtype is the most memory-hungry way to store a column. To make the data smaller and faster, you need to know a bit about how each column is stored. Check the column types using the DataFrame's df.info().

Here is a toy example:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
people     3 non-null object
cost_ctr    3 non-null object
number     3 non-null int64
dtypes: int64(1), object(2)
memory usage: 152.0+ bytes

In this case people has the object dtype, and so does cost_ctr. The other thing to look at is the last line: memory usage. Once you change your data types, you can watch that number drop. So let's go through how to change some of these types.
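
A toy frame like the one above can be rebuilt to try this out. The sketch below is illustrative only: the values are made up, and the columns are forced to object and int64 to mirror the output shown. Passing memory_usage="deep" asks pandas to count the Python strings themselves, which the plain report only hints at with the trailing "+".

```python
import pandas as pd

# Made-up values; dtypes are set explicitly to match the toy output.
df = pd.DataFrame({
    "people": pd.Series(["Ann", "Bob", "Cid"], dtype=object),
    "cost_ctr": pd.Series(["A100", "A100", "B200"], dtype=object),
    "number": pd.Series([1, 2, 3], dtype="int64"),
})

# Summary report, including the real size of the object columns.
df.info(memory_usage="deep")

# Per-column byte counts, handy for spotting the heaviest columns.
per_column = df.memory_usage(deep=True)
print(per_column)
```

The object columns should come out noticeably larger than the int64 column, even with only three rows.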

Your 'SAP ID' column is probably an int by default. If it is not, and it contains only numeric data, you can use:

df['SAP ID']=df['SAP ID'].astype(int)

or

df['SAP ID']=pd.to_numeric(df['SAP ID'])
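
As a rough sketch of the effect, the conversion can be measured on a small made-up column. The sample IDs are hypothetical. The downcast="integer" argument is an optional extra that asks pandas for the smallest integer type that fits, and errors="coerce" turns unparseable cells into NaN (in which case the column ends up as float rather than int).

```python
import pandas as pd

# Hypothetical IDs stored as text, as often happens after an Excel import.
df = pd.DataFrame({"SAP ID": pd.Series(["1001", "1002", "1003"], dtype=object)})

before = df["SAP ID"].memory_usage(index=False, deep=True)

# Parse the strings as numbers and shrink to the smallest fitting integer type.
df["SAP ID"] = pd.to_numeric(df["SAP ID"], errors="coerce", downcast="integer")

after = df["SAP ID"].memory_usage(index=False, deep=True)
print(df["SAP ID"].dtype, before, after)
```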

Now that you have changed the type of one column, check the memory with df.info() again.

"Cost ctr" sounds like a short list of values that repeats a lot, but it is generally stored as a column of strings. You could change this column to the category dtype (pd.Categorical) and see how much memory you save with this command:

df['Cost ctr'] = df['Cost ctr'].astype('category')
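
To get a feel for the saving, here is a minimal sketch on made-up data: three hypothetical cost centre codes repeated many times. A categorical column stores each distinct string once plus a small integer code per row, so the gain depends on how repetitive the real column is.

```python
import pandas as pd

# Three made-up cost centre codes, repeated to imitate a long report.
cost_ctr = pd.Series(["A100", "B200", "C300"] * 10_000, dtype=object)

as_category = cost_ctr.astype("category")

object_bytes = cost_ctr.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)
print(object_bytes, category_bytes)
```

If nearly every value in the column is unique, the categorical form saves little or nothing, so it is worth measuring before and after.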

See the pandas documentation for astype for details. A next-level move would be importing the data with the right types in the first place: when you read the Excel file, use the converters argument of read_excel.
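
A sketch of what that import might look like follows. The function name and the file path are hypothetical, and the code assumes every 'SAP ID' cell holds a whole number and that an Excel engine such as openpyxl is installed. Both usecols and converters are regular read_excel parameters; usecols is added here only to avoid loading columns the merge does not need.

```python
import pandas as pd

def load_vendor_columns(path):
    """Read only the two columns used in the merge, converting them on load."""
    df = pd.read_excel(
        path,
        usecols=["SAP ID", "Cost ctr"],
        # Converters run on each cell of the named column.
        converters={"SAP ID": int, "Cost ctr": str},
    )
    # Repetitive text is stored more compactly as a categorical column.
    df["Cost ctr"] = df["Cost ctr"].astype("category")
    return df

# Example call with a hypothetical file name:
# dfVendOld = load_vendor_columns("vendors_old.xlsx")
```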

If memory usage is still a problem after that (it shouldn't be with only Excel records), there are distributed technologies you can use for this purpose, namely Dask.
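
Before reaching for heavier tools, the numbers in the error message itself are worth a quick check. The arithmetic below is only a sketch. It relies on the fact that a NumPy object array holds one pointer per element, so its size is the element count times the pointer width of the running Python build.

```python
rows, cols = 17, 5_668_350  # shape reported in the MemoryError
elements = rows * cols

def size_in_mib(bytes_per_element):
    """Size of the array in MiB for a given pointer width."""
    return elements * bytes_per_element / 2**20

mib_with_4_byte_pointers = size_in_mib(4)  # pointer width on a 32-bit build
mib_with_8_byte_pointers = size_in_mib(8)  # pointer width on a 64-bit build
print(round(mib_with_4_byte_pointers), round(mib_with_8_byte_pointers))
```

Only the 4-byte case lands on the reported 368 MiB, which is consistent with a 32-bit Python installation. That would fit the observation that the same script runs on other machines, but it is an inference from the message rather than something the question confirms. If it applies, trying a 64-bit Python would be a reasonable next step.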

Hope this helps.
