Remove multiple items from a numpy.ndarray without numpy.delete


Problem description

I am using a large numpy.ndarray (11,000 x 3180) to develop an active learning algorithm (text mining). In this algorithm, I have to delete 16 samples (row vectors) from my dataset at each iteration and then add them to the training set (which grows by 16 samples per iteration). After performing this process for about 60 iterations, the algorithm is reinitialized and the same process runs again from the beginning, for 100 runs.

To delete the set of 16 elements from my dataset, I use numpy.delete(dataset[ListifoIndex], axis=0), where ListifoIndex is the list of indices of the selected items to remove.

This method works for the first run (1 of 100), but when the algorithm is initialized again, I get the following error:

new = empty(newshape, arr.dtype, arr.flags.fnc)
MemoryError

Apparently the numpy.delete method creates a copy of my dataset for each of the indices (16 x 1.2 GB), which exceeds the amount of memory I have on my computer.

The question is: how can I remove items from a numpy.ndarray without using a lot of memory and without excessive execution time?

PD1: I've already done the reverse process, where I add the elements that are not in the list of indices to remove, but that process is very slow. PD2: Sometimes the error occurs before the algorithm is reinitialized (before iteration number 60).

Answer

It may help to understand exactly what np.delete does. In your case

newset = np.delete(dataset, ListifoIndex, axis = 0)  # corrected

in essence it does:

keep = np.ones(dataset.shape[0], dtype=bool)  # boolean array of True matching the 1st dim
keep[ListifoIndex] = False                    # mark the rows to be removed
newset = dataset[keep, :]                     # copy only the rows that are still True

In other words, it constructs a boolean index of the rows it wants to keep.

If I run

dataset = np.delete(dataset, ListifoIndex, axis = 0)

repeatedly in an interactive shell, there isn't any accumulation of intermediate arrays. While delete runs there is temporarily this keep array and a new copy of dataset, but after the assignment the old copy disappears.
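
For example, this repeated-delete pattern stays at roughly constant memory (a minimal sketch; the random array and random batch are stand-ins for the real dataset and selection step):

import numpy as np

dataset = np.random.rand(11000, 3180).astype(np.float32)  # stand-in for the real data

for iteration in range(60):
    # 16 row indices to remove this iteration (the question's ListifoIndex)
    ListifoIndex = np.random.choice(dataset.shape[0], 16, replace=False)
    # np.delete allocates one new array; rebinding dataset lets the old one be freed
    dataset = np.delete(dataset, ListifoIndex, axis=0)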

Are you sure it's the delete that's growing memory use, as opposed to the growing training set?

As for speed, you might improve that by maintaining a 'mask' of all 'deleted' rows, rather than actually deleting anything (a sketch follows below). But depending on how ListifoIndex overlaps with previous deletions, updating that mask might be more trouble than it's worth. It's also likely to be more error prone.
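
A minimal sketch of that mask-based approach (ListifoIndex comes from the question; select_batch and the random array are hypothetical stand-ins for the real selection step and dataset):

import numpy as np

def select_batch(dataset, active, k=16):
    # Hypothetical stand-in for the active-learning selection step:
    # pick k indices from among the rows that are still active.
    candidates = np.flatnonzero(active)
    return candidates[:k]

dataset = np.random.rand(11000, 3180).astype(np.float32)  # loaded once, never copied

# One boolean mask per run; True means the row has not been "deleted" yet.
active = np.ones(dataset.shape[0], dtype=bool)

for iteration in range(60):
    ListifoIndex = select_batch(dataset, active)
    active[ListifoIndex] = False      # flag the rows instead of rebuilding the array
    training = dataset[~active]       # copies only the training rows, when needed
    # ... fit on training, score dataset[active], etc.

# Reinitializing for the next run is just a mask reset, not a reload or a copy.
active[:] = True

Because each batch is drawn only from the still-active rows, the indices never overlap with earlier deletions, which sidesteps the bookkeeping problem mentioned above.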
