pandas Scikit:切片DataFrame时的内存使用情况 [英] Pandas & Scikit: memory usage when slicing DataFrame

查看：141 发布时间：2020/5/24 1:54:22 python pandas scikit-learn

本文介绍了 pandas Scikit:切片DataFrame时的内存使用情况的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个很大的DataFrame，是从csv文件加载的(大约300MB).

I have a largeish DataFrame, loaded from a csv file (about 300MB).

由此，我提取了几十个要在RandomForestClassifier中使用的功能:其中一些功能只是从数据列中得出的，例如:

From this, I'm extracting a few dozen features to use in a RandomForestClassifier: some of the features are simply derived from columns in the data, for example:

 feature1 = data["SomeColumn"].apply(len)
 feature2 = data["AnotherColumn"]

并使用原始数据帧上的索引从numpy数组中将其他创建为新的DataFrame

And others are created as new DataFrames from numpy arrays, using the index on the original dataframe:

feature3 = pandas.DataFrame(count_array, index=data.index)

然后将所有这些功能合并为一个DataFrame:

All these features are then joined into one DataFrame:

features = feature1.join(feature2) # etc...

我训练了一个随机森林分类器:

And I train a random forest classifier:

classifier = RandomForestClassifier(
    n_estimators=100,
    max_features=None,
    verbose=2,
    compute_importances=True,
    n_jobs=n_jobs,
    random_state=0,
)
classifier.fit(features, data["TargetColumn"])

RandomForestClassifier与这些功能配合良好，构建一棵树需要O(数百兆内存). 但是:如果在加载数据后，我会从中提取一个小子集:

The RandomForestClassifier works fine with these features, building a tree takes O(hundreds of megabytes of memory). However: if after loading my data, I take a small subset of it:

data_slice = data[data['somecolumn'] > value]

然后为我的随机森林建造一棵树会突然占用许多GB 的内存-尽管功能DataFrame的大小现在是原始大小的O(10％).

Then building a tree for my random forest suddenly takes many gigabytes of memory - even though the size of the features DataFrame is now O(10%) of the original.

我可以相信这可能是因为对数据的切片视图不允许有效地进行进一步切片(尽管我看不到如何将其传播到features数组中)，所以我尝试了:

I can believe that this might be because a sliced view on the data doesn't permit further slices to be done efficiently (though I don't see how I this could propagate into the features array), so I've tried:

data = pandas.DataFrame(data_slice, copy=True)

但这没有帮助.

为什么要取一部分数据来大幅度增加内存使用?
是否还有其他方法来压缩/重新排列DataFrame，这可能会使事情再次变得更加高效?

Why would taking a subset of the data massively increase memory use?
Is there some other way to compact / rearrange a DataFrame which might make things more efficient again?

pandas Scikit:切片DataFrame时的内存使用情况 [英] Pandas & Scikit: memory usage when slicing DataFrame

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

pandas Scikit:切片DataFrame时的内存使用情况 [英] Pandas &amp; Scikit: memory usage when slicing DataFrame

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

pandas Scikit:切片DataFrame时的内存使用情况 [英] Pandas & Scikit: memory usage when slicing DataFrame

登录关闭