What is a faster way to get the location of unique rows in numpy


Question

I have a list of unique rows and another, larger array of data (called test_rows in the example). I was wondering if there is a faster way to get the locations of each unique row in the data. The fastest way I could come up with is:

import numpy


uniq_rows = numpy.array([[0, 1, 0],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 1]])

test_rows = numpy.array([[0, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0],
                         [0, 1, 0],
                         [0, 1, 1],
                         [0, 1, 1],
                         [1, 1, 1],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0]])

# print the indices in test_rows at which each unique row occurs
for row in uniq_rows.tolist():
    print(row, numpy.where((test_rows == row).all(axis=1))[0])

This prints:

[0, 1, 0] [ 1  4 10]
[1, 1, 0] [ 3  8 12]
[1, 1, 1] [7 9]
[0, 1, 1] [0 5 6]

Is there a better or more numpythonic (not sure if that word exists) way to do this? I searched for a numpy group function but could not find one. Basically, for any incoming dataset I need the fastest way to get the locations of each unique row in that dataset. The incoming dataset will not always contain every unique row, nor the same number of occurrences of each.

This is just a simple example. In my application the numbers would not be just zeros and ones; they could be anywhere from 0 to 32000. uniq_rows could have between 4 and 128 rows, and test_rows could have hundreds of thousands of rows.

Recommended answer

There are a lot of solutions here, but I'm adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although the intermediate array created by broadcasting may make memory an issue if large arrays are used.

import numpy as np
np.where((uniq_rows[:, None, :] == test_rows).all(2))

Wonderfully simple, eh? This returns a tuple of two arrays: the indices of the unique rows and the indices of the matching rows in test_rows.

 (array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
  array([ 1,  4, 10,  3,  8, 12,  7,  9,  0,  5,  6]))
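
To get back the per-row grouping that the loop in the question prints, the two index arrays can be split wherever the unique-row index changes. The following is a minimal sketch rather than part of the original answer; the names counts and groups are illustrative, and it relies on np.where reporting matches in row-major order so that the first array is already sorted.

```python
import numpy as np

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0],
                      [0, 1, 1], [0, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 1],
                      [0, 1, 0], [0, 0, 0], [1, 1, 0]])

uniq_idx, test_idx = np.where((uniq_rows[:, None, :] == test_rows).all(2))

# Number of matches per unique row; minlength keeps a zero for rows that
# never occur in test_rows.
counts = np.bincount(uniq_idx, minlength=len(uniq_rows))

# Cut test_idx at the cumulative counts to get one index array per unique row.
groups = np.split(test_idx, np.cumsum(counts)[:-1])

for row, locations in zip(uniq_rows.tolist(), groups):
    print(row, locations)
```

A unique row that does not occur in the data simply gets an empty array in groups.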

How it works:

(uniq_rows[:, None, :] == test_rows)

This uses array broadcasting to compare every row of test_rows with every row of uniq_rows, element by element, which produces a 4x13x3 boolean array. all(2) then reduces over the last axis to a 4x13 array that is true where an entire row matched (all of its element comparisons returned true). Finally, where returns the indices of those matches.
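
The three steps can be traced by checking the intermediate shapes. Because the intermediate array grows with the product of both inputs, the sketch also includes a chunked variant that bounds its size. This is an illustration under assumptions, not part of the original answer: locate_rows_chunked and its chunk_size parameter are hypothetical names, and the default chunk size is arbitrary.

```python
import numpy as np

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0],
                      [0, 1, 1], [0, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 1],
                      [0, 1, 0], [0, 0, 0], [1, 1, 0]])

# Step 1: (4, 1, 3) broadcast against (13, 3) compares every pair of rows.
compared = uniq_rows[:, None, :] == test_rows
print(compared.shape)   # (4, 13, 3)

# Step 2: a pair of rows matches only if all of its elements are equal.
matches = compared.all(axis=2)
print(matches.shape)    # (4, 13)

# Step 3: indices of the True entries, as (unique row, test row) pairs.
uniq_idx, test_idx = np.where(matches)


def locate_rows_chunked(uniq_rows, test_rows, chunk_size=4096):
    """Hypothetical helper: same result as above, one slice of test_rows at a time."""
    uniq_parts = [np.empty(0, dtype=np.intp)]
    test_parts = [np.empty(0, dtype=np.intp)]
    for start in range(0, len(test_rows), chunk_size):
        chunk = test_rows[start:start + chunk_size]
        # The intermediate is at most len(uniq_rows) x chunk_size x row width.
        u, t = np.where((uniq_rows[:, None, :] == chunk).all(axis=2))
        uniq_parts.append(u)
        test_parts.append(t + start)  # shift back to positions in test_rows
    uniq_all = np.concatenate(uniq_parts)
    test_all = np.concatenate(test_parts)
    # A stable sort groups by unique row and keeps test indices ascending.
    order = np.argsort(uniq_all, kind="stable")
    return uniq_all[order], test_all[order]


chunk_uniq_idx, chunk_test_idx = locate_rows_chunked(uniq_rows, test_rows, chunk_size=5)
```

The chunked version trades a Python-level loop for a smaller peak memory footprint; whether that is worthwhile depends on the array sizes involved.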
