numpy:通过合并从关联中找到不同值的数量 [英] Numpy: Finding count of distinct values from associations through binning

查看:66
本文介绍了numpy:通过合并从关联中找到不同值的数量的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

先决条件

这是此帖子的扩展.因此,对该问题的一些介绍将类似于该帖子.

This is a question is an extension of this post. So, some of the introduction of the problem will be similar to that post.

问题

假设result是2D数组,而values是1D数组. values保留一些与result中的每个元素关联的值. values中的元素到result的映射存储在x_mappingy_mapping中. result中的位置可以与不同的值关联. x_mappingy_mapping中的(x,y)对与results[-y,x]相关联.我必须找到按关联分组的值的唯一计数.

Let's say result is a 2D array and values is a 1D array. values holds some values associated with each element in result. The mapping of an element in values to result is stored in x_mapping and y_mapping. A position in result can be associated with different values. (x,y) pair from x_mapping and y_mapping is associated with results[-y,x]. I have to find the unique count of the values grouped by associations.

一个更好地说明问题的例子.

An example for better clarification.

result数组:

[[ 0.,  0.],
[ 0.,  0.],
[ 0.,  0.],
[ 0.,  0.]]

values数组:

[ 1.,  2.,  1.,  1.,  5.,  6.,  7.,  1.]

注意:这里result数组和values具有相同数量的元素.但事实并非如此.大小之间根本没有关系.

Note: Here result arrays and values have the same number of elements. But it might not be the case. There is no relation between the sizes at all.

x_mappingy_mapping具有从1D values到2D result的映射. x_mappingy_mappingvalues的大小将相同.

x_mapping and y_mapping have mappings from 1D values to 2D result. The sizes of x_mapping, y_mapping and values will be the same.

x_mapping-[0, 1, 0, 0, 0, 0, 0, 0]

y_mapping-[0, 3, 2, 2, 0, 3, 2, 0]

此处,第一个值(值[0]),第五个值(值[4])和第八个值(值[7])的x为0,y为0(x_mapping [0]和y_mappping [0])因此与result [0,0]相关联.如果我们计算该组(1,5,1)中不同值的计数,则结果将为2. @WarrenWeckesser 让我们看看来自x_mappingy_mapping[1, 3](x,y)对如何对results做出贡献.由于只有一个值(即2)与该特定组关联,因此results[-3,1]将具有一个值,因为与该单元格关联的不同值的数量为1.

Here, 1st value(values[0]), 5th value(values[4]) and 8th value(values[7]) have x as 0 and y as 0 (x_mapping[0] and y_mappping[0]) and hence associated with result[0, 0]. If we compute the count of distinct values from this group- (1,5,1), we will have 2 as result. @WarrenWeckesser Let's see how [1, 3] (x,y) pair from x_mapping and y_mapping contribute to results. Since there is only one value, ie 2, associated with this particular group, the results[-3,1] will have one as the number of distinct values associated with that cell is one.

另一个例子.让我们计算results[-1,1]的值.从映射来看,由于没有与单元格关联的值,因此results[-1,1]的值将为零.

Another example. Let's compute the value of results[-1,1]. From mappings, since there is no value associated with the cell, the value of results[-1,1] will be zero.

类似地,results中的位置[-2, 0]将具有值2.

Similarly, the position [-2, 0] in results will have value 2.

请注意,如果根本没有关联,则result的默认值为零.

Note that if there is no association at all then the default value for result will be zero.

计算后的result

[[ 2.,  0.],
[ 1.,  1.],
[ 2.,  0.],
[ 0.,  0.]]

当前有效的解决方案

使用@Divakar的 answer ,我找到了可行的解决方案.

Using the answer from @Divakar, I was able to find a working solution.

x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([ 1.,  2.,  1.,  1.,  5.,  6.,  7.,  1.], dtype=np.float32)
result = np.zeros([4, 2], dtype=np.float32) 

m,n = result.shape
out_dtype = result.dtype
lidx = ((-y_mapping)%m)*n + x_mapping

sidx = lidx.argsort()
idx = lidx[sidx]
val = values[sidx]

m_idx = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])
unq_ids = idx[m_idx]

r_res = np.zeros(m_idx.size, dtype=np.float32)
for i in range(0, m_idx.shape[0]):
    _next = None
    arr = None
    if i == m_idx.shape[0]-1:
        _next = val.shape[0]
    else:
        _next = m_idx[i+1]
    _start = m_idx[i]

    if _start >= _next:
        arr = val[_start]
    else:
        arr = val[_start:_next]
    r_res[i] = np.unique(arr).size
result.flat[unq_ids] = r_res

问题

现在,以上解决方案需要15ms才能对19943值进行运算. 我正在寻找一种更快地计算结果的方法.还有其他更有效的方法吗?

Now, the above solution takes 15ms for operating on 19943 values. I'm looking for a way to compute the result faster. Is there any more performant way to do this?

旁注

我正在将Numpy版本1.14.3与Python 3.5.2结合使用

I'm using Numpy version 1.14.3 with Python 3.5.2

编辑

感谢@WarrenWeckesser,指出我尚未说明results中的元素如何与映射中的(x,y)关联.为了清楚起见,我已经更新了帖子并添加了示例.

Thanks to @WarrenWeckesser, pointing out that I haven't explained how an element in results is associated with (x,y) from mappings. I have updated the post and added examples for clarity.

推荐答案

这是一种解决方案

import numpy as np

x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([ 1.,  2.,  1.,  1.,  5.,  6.,  7.,  1.], dtype=np.float32)
result = np.zeros([4, 2], dtype=np.float32)

# Get flat indices
idx_mapping = np.ravel_multi_index((-y_mapping, x_mapping), result.shape, mode='wrap')
# Sort flat indices and reorders values accordingly
reorder = np.argsort(idx_mapping)
idx_mapping = idx_mapping[reorder]
values = values[reorder]
# Get unique values
val_uniq = np.unique(values)
# Find where each unique value appears
val_uniq_hit = values[:, np.newaxis] == val_uniq
# Find reduction indices (slices with the same flat index)
reduce_idx = np.concatenate([[0], np.nonzero(np.diff(idx_mapping))[0] + 1])
# Reduce slices
reduced = np.logical_or.reduceat(val_uniq_hit, reduce_idx)
# Count distinct values on each slice
counts = np.count_nonzero(reduced, axis=1)
# Put counts in result
result.flat[idx_mapping[reduce_idx]] = counts

print(result)
# [[2. 0.]
#  [1. 1.]
#  [2. 0.]
#  [0. 0.]]

此方法占用更多内存(O(len(values) * len(np.unique(values)))),但与原始解决方案相比,较小的基准测试显示出显着的加速效果(尽管这取决于问题的实际大小):

This method takes more memory (O(len(values) * len(np.unique(values)))), but a small benchmark comparing with your original solution shows a significant speedup (although that depends on the actual size of the problem):

import numpy as np

np.random.seed(100)
result = np.zeros([400, 200], dtype=np.float32)
values = np.random.randint(100, size=(20000,)).astype(np.float32)
x_mapping = np.random.randint(result.shape[1], size=values.shape)
y_mapping = np.random.randint(result.shape[0], size=values.shape)

res1 = solution_orig(x_mapping, y_mapping, values, result)
res2 = solution(x_mapping, y_mapping, values, result)
print(np.allclose(res1, res2))
# True

# Original solution
%timeit solution_orig(x_mapping, y_mapping, values, result)
# 76.2 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# This solution
%timeit solution(x_mapping, y_mapping, values, result)
# 13.8 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

基准功能的完整代码:

import numpy as np

def solution(x_mapping, y_mapping, values, result):
    result = np.array(result)
    idx_mapping = np.ravel_multi_index((-y_mapping, x_mapping), result.shape, mode='wrap')
    reorder = np.argsort(idx_mapping)
    idx_mapping = idx_mapping[reorder]
    values = values[reorder]
    val_uniq = np.unique(values)
    val_uniq_hit = values[:, np.newaxis] == val_uniq
    reduce_idx = np.concatenate([[0], np.nonzero(np.diff(idx_mapping))[0] + 1])
    reduced = np.logical_or.reduceat(val_uniq_hit, reduce_idx)
    counts = np.count_nonzero(reduced, axis=1)
    result.flat[idx_mapping[reduce_idx]] = counts
    return result

def solution_orig(x_mapping, y_mapping, values, result):
    result = np.array(result)
    m,n = result.shape
    out_dtype = result.dtype
    lidx = ((-y_mapping)%m)*n + x_mapping

    sidx = lidx.argsort()
    idx = lidx[sidx]
    val = values[sidx]

    m_idx = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])
    unq_ids = idx[m_idx]

    r_res = np.zeros(m_idx.size, dtype=np.float32)
    for i in range(0, m_idx.shape[0]):
        _next = None
        arr = None
        if i == m_idx.shape[0]-1:
            _next = val.shape[0]
        else:
            _next = m_idx[i+1]
        _start = m_idx[i]

        if _start >= _next:
            arr = val[_start]
        else:
            arr = val[_start:_next]
        r_res[i] = np.unique(arr).size
    result.flat[unq_ids] = r_res
    return result

这篇关于numpy:通过合并从关联中找到不同值的数量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆