Testing whether a Numpy array contains a given row


Problem description

Is there a Pythonic and efficient way to check whether a Numpy array contains at least one instance of a given row? By "efficient" I mean it terminates upon finding the first matching row rather than iterating over the entire array even if a result has already been found.

With Python lists this can be accomplished very cleanly with if row in array:, but this does not work as I would expect for Numpy arrays, as illustrated below.

With Python lists:

>>> a = [[1,2],[10,20],[100,200]]
>>> [1,2] in a
True
>>> [1,20] in a
False

but Numpy arrays give different and rather odd-looking results. (The __contains__ method of ndarray seems to be undocumented.)

>>> a = np.array([[1,2],[10,20],[100,200]])
>>> np.array([1,2]) in a
True
>>> np.array([1,20]) in a
True
>>> np.array([1,42]) in a
True
>>> np.array([42,1]) in a
False

Solution

At the time of writing, Numpy's __contains__ is (a == b).any(), which is arguably only correct if b is a scalar. It is a bit hairy, but I believe the right general method would be (a == b).all(np.arange(a.ndim - b.ndim, a.ndim)).any(), which makes sense for all combinations of a and b dimensionality (reducing over several axes like this only works in 1.7 or later).

EDIT: Just to be clear, this is not necessarily the expected result when broadcasting is involved. Also someone might argue that it should handle the items in a separately as np.in1d does. I am not sure there is one clear way it should work.
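To make the difference concrete, here is a minimal sketch of the two expressions above. The function names elementwise_contains and contains_subarray are made up for this illustration, and I pass the axes as a tuple rather than np.arange(...), on the assumption that a tuple is what current Numpy versions expect for the axis argument.

```python
import numpy as np

def elementwise_contains(a, b):
    # What `b in a` effectively computes: True if ANY single element of the
    # broadcast comparison matches, which is not a row test.
    return bool((a == b).any())

def contains_subarray(a, b):
    # Hypothetical helper: require a full match over the trailing axes that
    # b occupies, then ask whether any remaining position matched.
    b = np.asarray(b)
    axes = tuple(range(a.ndim - b.ndim, a.ndim))
    return bool((a == b).all(axis=axes).any())

a = np.array([[1, 2], [10, 20], [100, 200]])
print(elementwise_contains(a, [1, 20]))  # True: 1 and 20 each match somewhere
print(contains_subarray(a, [1, 20]))     # False: no row equals [1, 20]
print(contains_subarray(a, [1, 2]))      # True
```

This still compares the whole array before reducing, so it fixes the semantics but not the early-termination part of the question.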

Now you want Numpy to stop when it finds the first occurrence. AFAIK this does not exist at this time. It is difficult because Numpy is based mostly on ufuncs, which do the same thing over the whole array. Numpy does optimize these kinds of reductions, but effectively that only works when the array being reduced is already a boolean array (e.g. np.ones(10, dtype=bool).any()).

Otherwise it would need a special function for __contains__, which does not exist. That may seem odd, but you have to remember that Numpy supports many data types and has a bigger machinery to select the correct type and the correct function to work on it. So in other words, the ufunc machinery cannot do it, and implementing __contains__ or such specially is not actually that trivial because of data types.

You can of course write it in Python, or, since you probably know your data type, writing it yourself in Cython/C is very simple.
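As a rough sketch of the pure-Python route, one option is to scan the array in chunks and return as soon as a chunk contains the row. The name contains_row and the chunk_size parameter are my own choices for illustration, and the sketch assumes a 2-D a and a 1-D row of matching length.

```python
import numpy as np

def contains_row(a, row, chunk_size=1024):
    """Return True as soon as some row of `a` equals `row` (hypothetical helper)."""
    a = np.asarray(a)
    row = np.asarray(row)
    for start in range(0, a.shape[0], chunk_size):
        chunk = a[start:start + chunk_size]
        # Vectorised comparison within the chunk; stop at the first chunk
        # that holds a complete match.
        if (chunk == row).all(axis=-1).any():
            return True
    return False

a = np.array([[1, 2], [10, 20], [100, 200]])
print(contains_row(a, [10, 20]))  # True
print(contains_row(a, [1, 20]))   # False
```

The chunk size trades Python loop overhead against wasted comparisons after the match: it only terminates early at chunk granularity, not at the exact matching row.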


That said, it is often much better anyway to use a sorting-based approach for these things. That is a little tedious, and there is no such thing as searchsorted for a lexsort, but it works (you could also abuse scipy.spatial.cKDTree if you like). This assumes you want to compare along the last axis only:

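import numpy as np

# Unfortunately you need to use structured arrays, so that each row is
# compared as a single element:
a_rows = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[-1]).ravel()

# Actually, at this point you can also use np.in1d; if you already have many b,
# then that is even better.

# Sort a copy: sorting the view in place could reorder the rows of a itself.
a_sorted = np.sort(a_rows)

b_comp = np.ascontiguousarray(b).view(a_sorted.dtype)
ind = a_sorted.searchsorted(b_comp)

# searchsorted returns len(a_sorted) for rows larger than every row of a,
# so clip the index before using it for the lookup.
ind = np.minimum(ind, len(a_sorted) - 1)
result = a_sorted[ind] == b_comp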

This also works for an array b. If you keep the sorted array around, it is also much better when you look up a single value (row) of b at a time while a stays the same (otherwise I would just use np.in1d after viewing it as a recarray). Important: you must do the np.ascontiguousarray for safety. It will typically do nothing, but if it does something, skipping it would be a big potential bug.
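For reference, the sorting approach can be wrapped up roughly as follows. This is only a sketch: the function name rows_in_sorted is made up here, it assumes a is non-empty and that b can be cast to the dtype of a, and it re-sorts a on every call, whereas in practice you would keep the sorted array around as described above.

```python
import numpy as np

def rows_in_sorted(a, b):
    """Boolean array: True where a row of `b` occurs in `a` (hypothetical helper)."""
    a = np.ascontiguousarray(a)
    # Assumption: b is cast to a's dtype so both views share one row dtype.
    b = np.ascontiguousarray(b, dtype=a.dtype)
    row_dtype = [('', a.dtype)] * a.shape[-1]
    # np.sort returns a sorted copy, so `a` itself is left untouched.
    a_sorted = np.sort(a.view(row_dtype).ravel())
    b_rows = b.view(row_dtype).ravel()
    idx = a_sorted.searchsorted(b_rows)
    # Rows larger than everything in `a` get index len(a_sorted); clip them.
    idx = np.minimum(idx, len(a_sorted) - 1)
    return a_sorted[idx] == b_rows

a = np.array([[1, 2], [10, 20], [100, 200]])
print(rows_in_sorted(a, [[1, 2], [1, 20], [100, 200], [500, 1]]))
```

Each lookup is a binary search over the sorted rows, so this pays off mainly when many rows are tested against the same a.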
