Memory-efficient way to split a large NumPy array into train and test

Problem description

I have a large NumPy array, and when I run scikit-learn's train_test_split to split it into training and test data, I always run into memory errors. What would be a more memory-efficient way to split into train and test, and why does train_test_split cause this?

The following code results in a memory error and causes a crash:

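import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

# 10000 x 70000 float64 values: about 5.6 GB for X alone
X = np.random.random((10000, 70000))
Y = np.random.random((10000,))
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)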
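
A back-of-the-envelope estimate helps explain the crash. The sketch below computes the array's footprint from its shape and dtype without allocating anything; `estimate_nbytes` is a hypothetical helper name, not part of NumPy or scikit-learn. X alone needs about 5.6 GB, and because a shuffled split has to copy the selected rows into new train and test arrays, peak memory is probably close to twice that.

```python
import numpy as np

def estimate_nbytes(shape, dtype=np.float64):
    """Bytes needed for an array of this shape and dtype (hypothetical helper)."""
    n_items = 1
    for dim in shape:
        n_items *= int(dim)
    return n_items * np.dtype(dtype).itemsize

x_bytes = estimate_nbytes((10000, 70000))
print(f"X alone: {x_bytes / 1e9:.1f} GB")
# Assumption: the shuffled train and test arrays together are a full copy of X.
print(f"X plus shuffled train/test copies: roughly {2 * x_bytes / 1e9:.1f} GB")
```

Storing X as float32 instead of float64 would halve both figures, if the reduced precision is acceptable for the model.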

Recommended answer

One method that I've tried and that works is to store X in a pandas DataFrame and shuffle its rows:

X = X.reindex(np.random.permutation(X.index))

since I get the same memory error when I try

np.random.shuffle(X)

Then I convert the pandas DataFrame back to a NumPy array, and with this function I can obtain a train/test split:

# test_proportion of 3 means 1/3, so 33% test and 67% train
def shuffle(matrix, target, test_proportion):
    # Despite its name, this only slices: the rows must already be shuffled,
    # with target kept in the same row order as matrix.
    ratio = matrix.shape[0] // test_proportion  # number of test rows
    X_train = matrix[ratio:, :]
    X_test = matrix[:ratio, :]
    Y_train = target[ratio:]  # target is 1-D here, so index rows only
    Y_test = target[:ratio]
    return X_train, X_test, Y_train, Y_test

X_train, X_test, Y_train, Y_test = shuffle(X, Y, 3)
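
One reason a slicing-based split like this is cheap: basic slices of a NumPy array are views onto the original buffer, while indexing with an array of shuffled row numbers produces a copy. The following is a small sketch of that difference, using tiny stand-in arrays (`X_small`, `Y_small` are illustrative names, not from the original post).

```python
import numpy as np

rng = np.random.default_rng(42)
X_small = rng.random((9, 4))  # small stand-in for the large matrix
Y_small = rng.random(9)

n_test = X_small.shape[0] // 3
X_test, X_train = X_small[:n_test], X_small[n_test:]
Y_test, Y_train = Y_small[:n_test], Y_small[n_test:]

# Basic slices are views: no row data is copied.
print(np.shares_memory(X_train, X_small))

# Indexing with an integer index array always allocates a new array.
idx = rng.permutation(X_small.shape[0])
print(np.shares_memory(X_small[idx], X_small))
```

Because the train and test arrays are views, writing to them also modifies the original array.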

This works for now, and when I want to do k-fold cross-validation, I can loop k times and reshuffle the pandas DataFrame each time. While this suffices for now, why do NumPy's and scikit-learn's implementations of shuffle and train_test_split result in memory errors for big arrays?
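
For k-fold cross-validation, one possible alternative to reshuffling the whole DataFrame on every pass is to shuffle the row indices once and cut them into k folds. The sketch below is only an illustration under that assumption; `kfold_indices` is a hypothetical helper, not a scikit-learn function. Shuffling indices costs one integer per row rather than a copy of the data.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k folds (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Shuffle row numbers once, then cut them into k nearly equal folds.
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

X_small = np.arange(20.0).reshape(10, 2)  # small stand-in for the large matrix
for train_idx, test_idx in kfold_indices(X_small.shape[0], k=5):
    X_test = X_small[test_idx]  # copies only the rows of this fold
    print(len(train_idx), len(test_idx))
```

Indexing with `train_idx` still copies the training rows, so for very large arrays it may be better to feed the model in batches of indices rather than materializing the full training set.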
