Pandas & Scikit: memory usage when slicing DataFrame


Problem description

I have a largeish DataFrame, loaded from a csv file (about 300MB).

From this, I'm extracting a few dozen features to use in a RandomForestClassifier: some of the features are simply derived from columns in the data, for example:

 feature1 = data["SomeColumn"].apply(len)
 feature2 = data["AnotherColumn"]

And others are created as new DataFrames from numpy arrays, using the index on the original dataframe:

feature3 = pandas.DataFrame(count_array, index=data.index)

All these features are then joined into one DataFrame:

features = feature1.join(feature2) # etc...

And I train a random forest classifier:

classifier = RandomForestClassifier(
    n_estimators=100,
    max_features=None,
    verbose=2,
    compute_importances=True,
    n_jobs=n_jobs,
    random_state=0,
)
classifier.fit(features, data["TargetColumn"])

The RandomForestClassifier works fine with these features, building a tree takes O(hundreds of megabytes of memory). However: if after loading my data, I take a small subset of it:

data_slice = data[data['somecolumn'] > value]

Then building a tree for my random forest suddenly takes many gigabytes of memory - even though the size of the features DataFrame is now O(10%) of the original.

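A quick way to sanity-check that claim (my own addition, not part of the original post) is to look at what the features frame actually holds: the per-column dtypes, the deep memory footprint, and the dtype of the array that scikit-learn will build from it, since an object or mixed-dtype frame forces a much larger conversion than a single float block. The helper below assumes a pandas version that has DataFrame.memory_usage(deep=True); the commented usage lines refer to the features variable from the question.

import numpy as np

def report(df):
    # per-column dtypes, deep size in MB, and what np.asarray() would hand to scikit-learn
    print(df.dtypes)
    print("deep size: %.1f MB" % (df.memory_usage(deep=True).sum() / 1e6))
    arr = np.asarray(df)
    print("as array:", arr.dtype, arr.shape)

# report(features)                          # hypothetical usage, before slicing
# report(features.loc[data_slice.index])    # and after slicing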

I can believe that this might be because a sliced view on the data doesn't permit further slices to be done efficiently (though I don't see how this could propagate into the features array), so I've tried:

data = pandas.DataFrame(data_slice, copy=True)

But this doesn't help.

  • Why would taking a subset of the data massively increase memory use?
  • Is there some other way to compact / rearrange a DataFrame which might make things more efficient again?

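Regarding the second bullet, here is a minimal sketch (my own suggestion, not something taken from the question or the answer below) of one way to compact a sliced frame before fitting: take a real copy with a single float dtype and a fresh 0..n-1 index, then hand scikit-learn a plain contiguous ndarray instead of the DataFrame. The helper name and the usage lines are made up for illustration and assume all feature columns are numeric.

import numpy as np

def compact(df):
    # astype already returns a copy; reset_index(drop=True) discards the
    # sparse row labels left over from the boolean slice
    return df.astype(np.float64).reset_index(drop=True)

# Hypothetical usage with the variables from the question:
# features_slice = compact(features.loc[data_slice.index])
# X = np.ascontiguousarray(features_slice.values)
# classifier.fit(X, data_slice["TargetColumn"].values)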
Recommended answer

The RandomForestClassifier is copying the dataset several times in memory, especially when n_jobs is large. We are aware of those issues and it's a priority to fix them:

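One cheap way to confirm that per-worker copies are the culprit (my suggestion, not part of the answer) is to re-run the fit from the question with n_jobs=1 and watch the process size: if peak memory drops roughly in proportion to n_jobs, the duplication described above is what you are seeing.

classifier = RandomForestClassifier(
    n_estimators=100,
    max_features=None,
    verbose=2,
    compute_importances=True,
    n_jobs=1,          # single process: no per-worker copy of the dataset
    random_state=0,
)
classifier.fit(features, data["TargetColumn"])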

  • I am currently working on a subclass of the standard library's multiprocessing.Pool class that will do no memory copy when numpy.memmap instances are passed to the subprocess workers. This will make it possible to share the memory of the source dataset + some precomputed data structures between the workers. Once this is fixed I will close this issue on the GitHub tracker.

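The Pool subclass mentioned above lives inside joblib / scikit-learn, so the snippet below is not that fix; it is only a small, self-contained illustration of the numpy.memmap idea: the parent process writes the array to disk once, and every worker re-opens it read-only, so the operating system can share the pages instead of pickling a copy for each process. The file name and helper function are made up for the example.

import numpy as np
from multiprocessing import Pool

def column_sum(args):
    # each worker re-opens the memmap read-only; only the small
    # (fname, shape, col) tuple travels through the pipe, never the data
    fname, shape, col = args
    X = np.memmap(fname, dtype=np.float64, mode="r", shape=shape)
    return float(X[:, col].sum())

if __name__ == "__main__":
    fname, shape = "features.dat", (100000, 20)
    mm = np.memmap(fname, dtype=np.float64, mode="w+", shape=shape)
    mm[:] = np.random.RandomState(0).rand(*shape)
    mm.flush()

    with Pool(4) as pool:
        print(pool.map(column_sum, [(fname, shape, c) for c in range(shape[1])]))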

There is an ongoing refactoring that will further decrease the memory usage of RandomForestClassifier by a factor of two. However, the current state of that refactoring is twice as slow as master, so further work is still required.

However, none of those fixes will make it into the 0.12 release that is scheduled for next week. Most probably they will be done for 0.13 (planned for release in 3 to 4 months), but of course they will be available in the master branch a lot sooner.
