get_dummies python memory error

Problem description

I'm having a problem with a data set that has 400,000 rows and 300 variables. I have to get dummy variables for a categorical variable with 3,000+ different items. At the end I want to end up with a data set with 3,300 variables or features so that I can train a RandomForest model.

Here's what I tried to do:

 df = pd.concat([df, pd.get_dummies(df['itemID'],prefix = 'itemID_')], axis=1)

When I do that I always get a memory error. Is there a limit to the number of variables I can have?

If I do that with only the first 1,000 rows (which have 374 different categories) it works just fine.

Does anyone have a solution for my problem? The computer I'm using has 8 GB of memory.
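
As a rough sanity check on the numbers above (my arithmetic, not part of the original question): a dense float64 dummy matrix of 400,000 rows by 3,000 columns takes about 9.6 GB on its own, which already exceeds the 8 GB of RAM.

# Back-of-the-envelope estimate, assuming get_dummies produces dense
# float64 columns (the pre-0.19.0 default); figures from the question.
rows = 400000
dummy_cols = 3000
bytes_per_value = 8  # float64
print(rows * dummy_cols * bytes_per_value / 1e9)  # ~9.6 GB, before the other 300 columns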

Answer

Update: Starting with version 0.19.0, get_dummies returns an 8-bit integer rather than a 64-bit float, which will fix this problem in many cases and make the astype solution below unnecessary. See: get_dummies -- pandas 0.19.0
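
(A quick way to check what your own pandas version produces; this snippet is mine, not the original answer's. Note the default changed again in pandas 2.0, where the dummy columns are bool.)

import pandas as pd

# Default dummy dtype: float64 before 0.19.0, uint8 from 0.19.0 through 1.x,
# and bool starting with pandas 2.0.
print(pd.get_dummies(pd.Series(['a', 'b', 'c'])).dtypes)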

But in other cases, the sparse option described below may still be helpful.

Original Answer: Here are a couple of possibilities to try. Both will reduce the memory footprint of the dataframe substantially, but you could still run into memory issues later. It's hard to predict; you'll just have to try.

(note that I am simplifying the output of info() below)

import numpy as np
import pandas as pd

df = pd.DataFrame({'itemID': np.random.randint(1, 4, 100)})

pd.concat([df, pd.get_dummies(df['itemID'], prefix='itemID_')], axis=1).info()

itemID       100 non-null int32
itemID__1    100 non-null float64
itemID__2    100 non-null float64
itemID__3    100 non-null float64

memory usage: 3.5 KB

Here's our baseline. Each dummy column takes up 800 bytes because the sample data has 100 rows and get_dummies appears to default to float64 (8 bytes). This seems like an unnecessarily inefficient way to store dummies as you could use as little as a bit to do it, but there may be some reason for that which I'm not aware of.

So, first attempt: just change to a one-byte integer (this doesn't seem to be an option for get_dummies, so it has to be done as a conversion with astype(np.int8)).

pd.concat([df, pd.get_dummies(df['itemID'], prefix='itemID_').astype(np.int8)],
          axis=1).info()

itemID       100 non-null int32
itemID__1    100 non-null int8
itemID__2    100 non-null int8
itemID__3    100 non-null int8

memory usage: 1.5 KB

Each dummy column now takes up 1/8 of the memory it did before.

Alternatively, you can use the sparse option of get_dummies.

pd.concat([df, pd.get_dummies(df['itemID'], prefix='itemID_', sparse=True)],
          axis=1).info()

itemID       100 non-null int32
itemID__1    100 non-null float64
itemID__2    100 non-null float64
itemID__3    100 non-null float64

memory usage: 2.0 KB

Fairly comparable savings. The info() output somewhat hides the way the savings occur, but you can look at the memory usage value to see the total savings.
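
(If you want per-column numbers rather than the aggregate, memory_usage() reports them directly; a small sketch of mine, reusing df from above:)

dummies = pd.get_dummies(df['itemID'], prefix='itemID_', sparse=True)
print(dummies.memory_usage())        # bytes per dummy column
print(dummies.memory_usage().sum())  # total for all dummy columns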

Which of these will work better in practice will depend on your data, so you'll just need to give them each a try (or you could even combine them).
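
(A sketch of combining them, under the assumption of a newer pandas: since 0.23.0 get_dummies accepts a dtype argument, so sparse storage and a one-byte dtype can be requested in one call; on older versions you would chain .astype(np.int8) onto the sparse result instead.)

pd.concat([df, pd.get_dummies(df['itemID'], prefix='itemID_',
                              sparse=True, dtype=np.int8)],
          axis=1).info()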
