Efficiently finding rows in a numpy array that meet a specific condition

Views: 1205 Published: 2020/9/25 1:52:48 Tags: python arrays performance numpy numpy-ndarray

Question

I have two 2D numpy arrays. What I want to do is find specific rows of np_weight in np_sentence.

For example:

import numpy as np

# rows are features, columns are clusters (or any other grouping)
np_weight = np.random.uniform(1.0, 10.0, size=(7, 4))
print(np_weight)

[[9.96859395 8.65543961 6.07429382 4.58735497]
 [3.21776471 8.33560037 2.11424961 8.89739975]
 [9.74560314 5.94640798 6.10318198 7.33056421]
 [6.60986206 2.36877835 3.06143215 7.82384351]
 [9.49702267 9.98664568 3.89140374 5.42108704]
 [1.93551346 8.45768507 8.60233715 8.09610975]
 [5.21892795 4.18786508 5.82665674 8.28397111]]

# rows are sentences, columns are the words in that sentence,
# stored as row indices into np_weight (integer bounds, not floats)
np_sentence = np.random.randint(0, 7, size=(5, 3))
print(np_sentence)

[[2 5 1]
 [1 6 4]
 [0 0 0]
 [2 3 6]
 [4 2 4]]

If I sort each column of np_weight in descending order and take the top 5, I get the following (only the first column is shown here):
temp_sorted_result =
[9.96859395] ---> index=0
[9.74560314] ---> index=2
[9.49702267] ---> index=4
[6.60986206] ---> index=3
[5.21892795] ---> index=6
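
As a minimal sketch of the sorting step described above, the top-5 row indices of every column can be obtained with np.argsort. The array literal repeats the printed np_weight values; the names order, top5_idx and top5_first_col are introduced here for illustration only.

```python
import numpy as np

np_weight = np.array([
    [9.96859395, 8.65543961, 6.07429382, 4.58735497],
    [3.21776471, 8.33560037, 2.11424961, 8.89739975],
    [9.74560314, 5.94640798, 6.10318198, 7.33056421],
    [6.60986206, 2.36877835, 3.06143215, 7.82384351],
    [9.49702267, 9.98664568, 3.89140374, 5.42108704],
    [1.93551346, 8.45768507, 8.60233715, 8.09610975],
    [5.21892795, 4.18786508, 5.82665674, 8.28397111],
])

# Row indices of each column, ordered from largest to smallest value.
order = np.argsort(-np_weight, axis=0)

# The first five rows of that ordering are the top-5 indices per column.
top5_idx = order[:5]                      # shape (5, 4)
top5_first_col = top5_idx[:, 0].tolist()
print(top5_first_col)                     # [0, 2, 4, 3, 6]
```

A full sort is the simplest way to reproduce the listing; when only membership in the top 5 matters, a partial sort is enough.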

Now I want to search for these indexes, two at a time, in the second array np_sentence, to see whether any row contains two of them.

For example, for the column above the output has to be 1, 3, 4. These are the row indices of np_sentence that contain a combination of two of the indexes in temp_sorted_result. For instance, 4 and 6, which both appear in temp_sorted_result, occur together in row 1 of np_sentence, and so on.

I need to do this for each column of np_weight. Efficiency matters a great deal to me because the number of rows is very large.

What I have done so far only searches for a single item in the second array, which is not what I ultimately want.

One approach would be to form all pairwise combinations for each column. For example, for the first column shown above in temp_sorted_result, I would form

(0,2) (0,4) (0,3) (0,6)
(2,4) (2,3) (2,6)
(4,3) (4,6)
(3,6)

and then check which of these pairs occur in the rows of np_sentence. In my np_sentence, rows 1, 3 and 4 contain some of them.
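
The pair-enumeration idea above can be sketched in plain Python as a reference baseline. It is not meant to be fast, only to pin down the expected answer for the first column; the names pairs, matches and matches_set are illustrative.

```python
from itertools import combinations

top5 = [0, 2, 4, 3, 6]   # top-5 row indices of the first column
np_sentence = [[2, 5, 1], [1, 6, 4], [0, 0, 0], [2, 3, 6], [4, 2, 4]]

# All 10 unordered pairs of distinct top-5 indices.
pairs = list(combinations(top5, 2))

# A sentence row matches if both members of some pair occur in it.
matches = [
    i for i, row in enumerate(np_sentence)
    if any(a in row and b in row for a, b in pairs)
]
print(matches)           # [1, 3, 4]

# Equivalent test without enumerating pairs: at least two *distinct*
# values of the row belong to the top-5 set.
matches_set = [
    i for i, row in enumerate(np_sentence)
    if len(set(row) & set(top5)) >= 2
]
print(matches_set)       # [1, 3, 4]
```

The second form shows that the pairs never need to be materialised: counting distinct hits per row and comparing with 2 gives the same result.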
My question is: how can I implement this in the most efficient way?
Please let me know if anything is unclear.
Thanks for your help :)

Recommended answer

Here is one approach. The function f below creates a mask with the same shape as the weight array (plus one dummy row of False values) that marks the top five entries in each column with True.
It then uses np_sentence to index into the mask, counts the True values for each (sentence row, weight column) pair, and compares that count with the threshold of two.
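
To make the mask-and-index idea concrete, here is a small sketch on toy data. The arrays weight and sentences and the names thresh, hits and counts are made up for illustration, and np.partition is used here as a simpler stand-in for the argpartition call in the answer's code.

```python
import numpy as np

weight = np.array([[5.0, 1.0],
                   [3.0, 4.0],
                   [9.0, 2.0],
                   [7.0, 8.0]])
n_top = 2
N, M = weight.shape

# Per-column threshold: the n_top-th largest value. np.partition puts the
# element of sorted rank N - n_top at that position along axis 0.
thresh = np.partition(weight, N - n_top, axis=0)[N - n_top]

# True for the top n_top entries of each column.
mask = weight >= thresh

# Each sentence row holds row indices into weight (no duplicates here).
sentences = np.array([[2, 3],
                      [0, 1],
                      [1, 3]])

# Fancy indexing: hits[s, w, c] is mask[sentences[s, w], c].
hits = mask[sentences]          # shape (3, 2, 2)

# Number of top-n words per (sentence, column) pair.
counts = hits.sum(axis=1)       # shape (3, 2)
print(counts.tolist())          # [[2, 1], [0, 1], [1, 2]]
print((counts >= 2).tolist())   # [[True, False], [False, False], [False, True]]
```

A single fancy-indexing step handles all columns at once, which is what avoids a Python loop over columns or pairs.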
The only complication: we must suppress duplicate values within rows of np_sentence. To that end we sort each row and then redirect every index that equals its left neighbor to the dummy row in the mask.
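
The duplicate-suppression step can be looked at in isolation. This sketch applies it to the np_sentence values from the question, assuming N = 7 weight rows so that index 7 refers to the all-False dummy row; the names s and dup are illustrative.

```python
import numpy as np

N = 7   # number of rows in np_weight; index N is the dummy row of the mask
np_sentence = np.array([[2, 5, 1],
                        [1, 6, 4],
                        [0, 0, 0],
                        [2, 3, 6],
                        [4, 2, 4]])

s = np.sort(np_sentence, axis=1)   # sorting makes duplicates adjacent
dup = s[:, 1:] == s[:, :-1]        # True where a value repeats its left neighbor
s[:, 1:][dup] = N                  # s[:, 1:] is a view, so this edits s in place
print(s.tolist())
# [[1, 2, 5], [1, 4, 6], [0, 7, 7], [2, 3, 6], [2, 4, 7]]
```

After this step every distinct word index appears at most once per row, so a repeated word such as the 0 in row 2 cannot be counted as two hits.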
The function returns a mask. The last line of the script demonstrates how to convert that mask to indices.

import numpy as np

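def f(a1, a2, n_top, n_hit):
    N, M = a1.shape
    # One extra all-False row at index N absorbs duplicate indices.
    mask = np.zeros((N+1, M), dtype=bool)
    # Mark entries >= the n_top-th largest value of their column.
    np.greater_equal(
        a1, a1[a1.argpartition(N-n_top, axis=0)[N-n_top], np.arange(M)],
        out=mask[:N])
    # Sort each row so duplicates are adjacent, then point repeats at row N.
    a2 = np.sort(a2, axis=1)
    a2[:, 1:][a2[:, 1:] == a2[:, :-1]] = N
    # mask[a2] has shape (rows, words, M); count hits per row and column.
    return np.count_nonzero(mask[a2], axis=1) >= n_hit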

a1 = np.array([
    [9.96859395, 8.65543961, 6.07429382, 4.58735497],
    [3.21776471, 8.33560037, 2.11424961, 8.89739975],
    [9.74560314, 5.94640798, 6.10318198, 7.33056421],
    [6.60986206, 2.36877835, 3.06143215, 7.82384351],
    [9.49702267, 9.98664568, 3.89140374, 5.42108704],
    [1.93551346, 8.45768507, 8.60233715, 8.09610975],
    [5.21892795, 4.18786508, 5.82665674, 8.28397111],
])

a2 = np.array([
    [2, 5, 1],
    [1, 6, 4],
    [0, 0, 0],
    [2, 3, 6],
    [4, 2, 4],
])

print(f(a1,a2,5,2))

from itertools import groupby
from operator import itemgetter

# Group the (column, row) hits by column; .tolist() yields plain Python ints.
print([[*map(itemgetter(1),grp)] for k,grp in groupby(np.argwhere(f(a1,a2,5,2).T).tolist(),itemgetter(0))])

Output:
[[False  True  True  True]
 [ True  True  True  True]
 [False False False False]
 [ True False  True  True]
 [ True  True  True False]]
[[1, 3, 4], [0, 1, 4], [0, 1, 3, 4], [0, 1, 3]]
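
As an alternative sketch for the mask-to-indices conversion, np.flatnonzero can be applied to each column of the boolean result. The array result below repeats the printed mask; per_column is an illustrative name.

```python
import numpy as np

result = np.array([[False, True,  True,  True],
                   [True,  True,  True,  True],
                   [False, False, False, False],
                   [True,  False, True,  True],
                   [True,  True,  True,  False]])

# For each weight column, the sentence rows whose hit count met the threshold.
per_column = [np.flatnonzero(col).tolist() for col in result.T]
print(per_column)   # [[1, 3, 4], [0, 1, 4], [0, 1, 3, 4], [0, 1, 3]]
```

One difference from the groupby version: a column with no matching rows yields an empty list here, whereas groupby emits no group for it, so the position in per_column always corresponds to the column number.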

