Numpy: Finding count of distinct values from associations through binning
Problem Description
Prerequisites
This question is an extension of this post, so some of the introduction of the problem will be similar to that post.
Problem
Let's say `result` is a 2D array and `values` is a 1D array. `values` holds some values associated with each element in `result`. The mapping of an element in `values` to `result` is stored in `x_mapping` and `y_mapping`. A position in `result` can be associated with different values. An `(x, y)` pair from `x_mapping` and `y_mapping` is associated with `results[-y, x]`. I have to find the count of distinct values grouped by these associations.
An example for better clarification.
The `result` array:
[[ 0., 0.],
[ 0., 0.],
[ 0., 0.],
[ 0., 0.]]
The `values` array:
[ 1., 2., 1., 1., 5., 6., 7., 1.]
Note: here the `result` array and `values` happen to have the same number of elements, but that need not be the case; there is no relation between the sizes at all.
`x_mapping` and `y_mapping` hold the mappings from the 1D `values` to the 2D `result`. The sizes of `x_mapping`, `y_mapping` and `values` will be the same.
`x_mapping` - [0, 1, 0, 0, 0, 0, 0, 0]
`y_mapping` - [0, 3, 2, 2, 0, 3, 2, 0]
Here, the 1st value (`values[0]`), the 5th value (`values[4]`) and the 8th value (`values[7]`) have x as 0 and y as 0 (`x_mapping[0]` and `y_mapping[0]`), and are hence associated with `result[0, 0]`. If we compute the count of distinct values in this group, (1, 5, 1), we get 2.
@WarrenWeckesser Let's see how the (x, y) pair [1, 3] from `x_mapping` and `y_mapping` contributes to `results`. Since there is only one value, i.e. 2, associated with this particular group, `results[-3, 1]` will be one, as the number of distinct values associated with that cell is one.
Another example: let's compute the value of `results[-1, 1]`. From the mappings, since there is no value associated with that cell, the value of `results[-1, 1]` will be zero.
Similarly, the position `[-2, 0]` in `results` will have the value 2.
Note that if there is no association at all, then the default value in `result` is zero.
`result` after the computation:
[[ 2., 0.],
[ 1., 1.],
[ 2., 0.],
[ 0., 0.]]
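To make the grouping concrete, here is a minimal brute-force sketch (a plain Python dict of sets, not intended to be fast) that reproduces the expected array above directly from the problem statement:

```python
import numpy as np

x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([1., 2., 1., 1., 5., 6., 7., 1.], dtype=np.float32)
result = np.zeros([4, 2], dtype=np.float32)

# (row, col) -> set of distinct values mapped to result[-y, x]
groups = {}
for x, y, v in zip(x_mapping, y_mapping, values):
    groups.setdefault(((-y) % result.shape[0], x), set()).add(v)
# Each cell's count is the size of its set of distinct values
for (r, c), vals in groups.items():
    result[r, c] = len(vals)

print(result)
# [[2. 0.]
#  [1. 1.]
#  [2. 0.]
#  [0. 0.]]
```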
Current working solution
Using the answer from @Divakar, I was able to find a working solution:
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([ 1., 2., 1., 1., 5., 6., 7., 1.], dtype=np.float32)
result = np.zeros([4, 2], dtype=np.float32)
m,n = result.shape
out_dtype = result.dtype
lidx = ((-y_mapping)%m)*n + x_mapping
sidx = lidx.argsort()
idx = lidx[sidx]
val = values[sidx]
m_idx = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])
unq_ids = idx[m_idx]
r_res = np.zeros(m_idx.size, dtype=np.float32)
for i in range(0, m_idx.shape[0]):
    _next = None
    arr = None
    if i == m_idx.shape[0]-1:
        _next = val.shape[0]
    else:
        _next = m_idx[i+1]
    _start = m_idx[i]
    if _start >= _next:
        arr = val[_start]
    else:
        arr = val[_start:_next]
    r_res[i] = np.unique(arr).size
result.flat[unq_ids] = r_res
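As a sanity check on the binning step, the flat index `((-y_mapping) % m) * n + x_mapping` computed above should agree with `np.ravel_multi_index` applied to `(-y_mapping, x_mapping)` with `mode='wrap'` (a quick standalone verification on the example data, not part of the solution itself):

```python
import numpy as np

x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
m, n = 4, 2  # shape of result

# Manual flat index: negative rows wrap around via the modulo
lidx = ((-y_mapping) % m) * n + x_mapping
# Equivalent computation with ravel_multi_index in 'wrap' mode
ridx = np.ravel_multi_index((-y_mapping, x_mapping), (m, n), mode='wrap')
print(np.array_equal(lidx, ridx))  # True
```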
Question
The above solution takes 15 ms to operate on 19943 values. I'm looking for a way to compute the result faster: is there a more performant way to do this?
Side note
I'm using Numpy version 1.14.3 with Python 3.5.2.
Edit
Thanks to @WarrenWeckesser for pointing out that I hadn't explained how an element in `results` is associated with an `(x, y)` pair from the mappings. I have updated the post and added examples for clarity.
Answer
Here is one solution:
import numpy as np
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([ 1., 2., 1., 1., 5., 6., 7., 1.], dtype=np.float32)
result = np.zeros([4, 2], dtype=np.float32)
# Get flat indices
idx_mapping = np.ravel_multi_index((-y_mapping, x_mapping), result.shape, mode='wrap')
# Sort flat indices and reorders values accordingly
reorder = np.argsort(idx_mapping)
idx_mapping = idx_mapping[reorder]
values = values[reorder]
# Get unique values
val_uniq = np.unique(values)
# Find where each unique value appears
val_uniq_hit = values[:, np.newaxis] == val_uniq
# Find reduction indices (slices with the same flat index)
reduce_idx = np.concatenate([[0], np.nonzero(np.diff(idx_mapping))[0] + 1])
# Reduce slices
reduced = np.logical_or.reduceat(val_uniq_hit, reduce_idx)
# Count distinct values on each slice
counts = np.count_nonzero(reduced, axis=1)
# Put counts in result
result.flat[idx_mapping[reduce_idx]] = counts
print(result)
# [[2. 0.]
# [1. 1.]
# [2. 0.]
# [0. 0.]]
This method uses more memory (O(len(values) * len(np.unique(values)))), but a small benchmark against your original solution shows a significant speedup (although that depends on the actual size of the problem):
import numpy as np
np.random.seed(100)
result = np.zeros([400, 200], dtype=np.float32)
values = np.random.randint(100, size=(20000,)).astype(np.float32)
x_mapping = np.random.randint(result.shape[1], size=values.shape)
y_mapping = np.random.randint(result.shape[0], size=values.shape)
res1 = solution_orig(x_mapping, y_mapping, values, result)
res2 = solution(x_mapping, y_mapping, values, result)
print(np.allclose(res1, res2))
# True
# Original solution
%timeit solution_orig(x_mapping, y_mapping, values, result)
# 76.2 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# This solution
%timeit solution(x_mapping, y_mapping, values, result)
# 13.8 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Full code of the benchmarked functions:
import numpy as np

def solution(x_mapping, y_mapping, values, result):
    result = np.array(result)
    idx_mapping = np.ravel_multi_index((-y_mapping, x_mapping), result.shape, mode='wrap')
    reorder = np.argsort(idx_mapping)
    idx_mapping = idx_mapping[reorder]
    values = values[reorder]
    val_uniq = np.unique(values)
    val_uniq_hit = values[:, np.newaxis] == val_uniq
    reduce_idx = np.concatenate([[0], np.nonzero(np.diff(idx_mapping))[0] + 1])
    reduced = np.logical_or.reduceat(val_uniq_hit, reduce_idx)
    counts = np.count_nonzero(reduced, axis=1)
    result.flat[idx_mapping[reduce_idx]] = counts
    return result

def solution_orig(x_mapping, y_mapping, values, result):
    result = np.array(result)
    m, n = result.shape
    out_dtype = result.dtype
    lidx = ((-y_mapping) % m) * n + x_mapping
    sidx = lidx.argsort()
    idx = lidx[sidx]
    val = values[sidx]
    m_idx = np.flatnonzero(np.r_[True, idx[:-1] != idx[1:]])
    unq_ids = idx[m_idx]
    r_res = np.zeros(m_idx.size, dtype=np.float32)
    for i in range(0, m_idx.shape[0]):
        _next = None
        arr = None
        if i == m_idx.shape[0]-1:
            _next = val.shape[0]
        else:
            _next = m_idx[i+1]
        _start = m_idx[i]
        if _start >= _next:
            arr = val[_start]
        else:
            arr = val[_start:_next]
        r_res[i] = np.unique(arr).size
    result.flat[unq_ids] = r_res
    return result
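If the O(len(values) * len(np.unique(values))) buffer becomes a problem for larger value sets, one possible memory-light variant (a sketch under the same problem setup, not benchmarked here; `solution_lexsort` is a hypothetical name) sorts the entries by (cell, value) and counts the (cell, value) transitions instead of building the one-hot matrix:

```python
import numpy as np

def solution_lexsort(x_mapping, y_mapping, values, result):
    result = np.array(result)
    idx = np.ravel_multi_index((-y_mapping, x_mapping), result.shape, mode='wrap')
    # Sort by flat cell index first, then by value within each cell
    order = np.lexsort((values, idx))
    idx, vals = idx[order], values[order]
    # A distinct value within a cell starts exactly where either the cell
    # index or the value changes relative to the previous sorted entry
    new_pair = np.concatenate([[True], (idx[1:] != idx[:-1]) | (vals[1:] != vals[:-1])])
    # Count the run starts per cell
    cells, counts = np.unique(idx[new_pair], return_counts=True)
    result.flat[cells] = counts
    return result

# Check against the example from the question
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([1., 2., 1., 1., 5., 6., 7., 1.], dtype=np.float32)
out = solution_lexsort(x_mapping, y_mapping, values, np.zeros([4, 2], dtype=np.float32))
print(out)
# [[2. 0.]
#  [1. 1.]
#  [2. 0.]
#  [0. 0.]]
```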