Getting the median of particular rows of an array based on index
Question
I have two arrays of the same length: one contains indices and the other contains the corresponding values, so one index can have more than one value:
idx = [0,0,0,1,1,1,2,2,2,3,3,3,4,4,5,5...]
values = [1.2,3.1,3.1,3.1,3.3,1.2,3.3,4.1,5.4...]
I want to return an array that holds each unique index together with the median of the values that share that idx value.
For example:
result =
[0, np.median([1.2,3.1,3.1])
1, np.median([3.1,3.3,1.2])
2, etc. ]
My brute-force approach is simply:
result = []
for idxi in np.arange(np.max(idx) + 1):  # + 1 so the largest index is included
    mask = (idxi == idx)
    median = np.median(values[mask])
    result.append([idxi, median])
Unfortunately this is far too slow for my needs, and quite ugly in any case.
Recommended answer
If you don't mind a dependency on scipy, the function scipy.ndimage.labeled_comprehension can do this. Here's an example.
First set up the sample data:
In [570]: import numpy as np
In [571]: idx = np.array([0,0,0,1,1,1,2,2,2,3,3,3,4,4,5,5])
In [572]: values = np.array([1.2,3.1,3.1,3.1,3.3,1.2,3.3,4.1,5.4,6,6,6.2,6,7,7.2,7.2])
Get the unique "labels" in idx. (If you already know the maximum is, say, N, and you know that all the integers from 0 to N are used, you could use uniq = range(N+1) instead.)
In [573]: uniq = np.unique(idx) # Or range(idx.max()+1)
In [574]: uniq
Out[574]: array([0, 1, 2, 3, 4, 5])
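The shortcut with range(N+1) only matches np.unique when every integer up to the maximum actually occurs. A minimal sketch of the difference, assuming a hypothetical idx_gap array (not from the question) in which label 2 never appears:

```python
import numpy as np

# Hypothetical labels with a gap: the value 2 never appears.
idx_gap = np.array([0, 0, 1, 1, 3, 3])

uniq_from_data = np.unique(idx_gap)             # only the labels that occur
uniq_from_range = np.arange(idx_gap.max() + 1)  # every integer from 0 to max

print(uniq_from_data)   # [0 1 3]
print(uniq_from_range)  # [0 1 2 3]

# Labels that the range includes but the data does not contain.
missing = np.setdiff1d(uniq_from_range, uniq_from_data)
print(missing)          # [2]
```

If the range form is used with gaps like this, the groups for the missing labels are empty, so it is worth deciding up front how those entries should be treated.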
Use labeled_comprehension to compute the median of each labeled group:
In [575]: from scipy.ndimage import labeled_comprehension
In [576]: medians = labeled_comprehension(values, idx, uniq, np.median, np.float64, None)
In [577]: medians
Out[577]: array([ 3.1, 3.1, 4.1, 6. , 6.5, 7.2])
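The question asked for the unique index alongside each median. One possible way to assemble that from the pieces above, sketched as a standalone script; the use of np.column_stack and the cross-check loop are additions for illustration, not part of the original answer:

```python
import numpy as np
from scipy.ndimage import labeled_comprehension

idx = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5])
values = np.array([1.2, 3.1, 3.1, 3.1, 3.3, 1.2, 3.3, 4.1, 5.4,
                   6, 6, 6.2, 6, 7, 7.2, 7.2])

uniq = np.unique(idx)
medians = labeled_comprehension(values, idx, uniq, np.median, np.float64, None)

# One row per label: column 0 is the label, column 1 is its median.
# column_stack promotes the integer labels to float.
result = np.column_stack((uniq, medians))
print(result)

# Cross-check against the straightforward masking loop.
expected = np.array([np.median(values[idx == i]) for i in uniq])
assert np.allclose(medians, expected)
```

Because the two columns share one array, the labels come back as floats; keeping uniq and medians as separate arrays avoids that if integer labels matter.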
Another option, if you don't mind a dependency on pandas, is to use the groupby method of the pandas.DataFrame class.
Set up the DataFrame:
In [609]: import pandas as pd
In [610]: df = pd.DataFrame(dict(labels=idx, values=values))
In [611]: df
Out[611]:
labels values
0 0 1.2
1 0 3.1
2 0 3.1
3 1 3.1
4 1 3.3
5 1 1.2
6 2 3.3
7 2 4.1
8 2 5.4
9 3 6.0
10 3 6.0
11 3 6.2
12 4 6.0
13 4 7.0
14 5 7.2
15 5 7.2
Use groupby to group the data by the labels column, and then compute the medians of the groups:
In [612]: result = df.groupby('labels').median()
In [613]: result
Out[613]:
values
labels
0 3.1
1 3.1
2 4.1
3 6.0
4 6.5
5 7.2
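If a plain NumPy array of (label, median) rows is wanted rather than a DataFrame, the grouped result can be converted back. A minimal sketch, assuming the same sample data; selecting the 'values' column and calling reset_index and to_numpy are choices made here for illustration:

```python
import numpy as np
import pandas as pd

idx = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5])
values = np.array([1.2, 3.1, 3.1, 3.1, 3.3, 1.2, 3.3, 4.1, 5.4,
                   6, 6, 6.2, 6, 7, 7.2, 7.2])

df = pd.DataFrame(dict(labels=idx, values=values))

# Group by label and take the median of the 'values' column; the result
# is a Series indexed by label.
medians = df.groupby('labels')['values'].median()

# reset_index turns the labels back into an ordinary column, and to_numpy
# gives a 2-D array with one (label, median) row per group.
result = medians.reset_index().to_numpy()
print(result)
```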
Disclaimer: I haven't tried either of these suggestions on large arrays, so I don't know how their performance will compare with your brute-force solution or with @Ashwini's answer.
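For anyone who wants to check the performance on their own data, a rough timing harness could look like the sketch below. The array sizes, the random test data, and the three helper function names are all assumptions made for illustration; no timings are claimed here, and the numbers will depend on the machine and on the number and size of groups.

```python
import timeit

import numpy as np
import pandas as pd
from scipy.ndimage import labeled_comprehension

# Arbitrary, hypothetical problem size; adjust to resemble the real data.
rng = np.random.default_rng(0)
n_groups, n_values = 200, 20_000
idx = np.sort(rng.integers(0, n_groups, size=n_values))
values = rng.random(n_values)
uniq = np.unique(idx)

def brute_force():
    # The masking loop from the question, one pass over idx per label.
    return np.array([np.median(values[idx == i]) for i in uniq])

def with_scipy():
    return labeled_comprehension(values, idx, uniq, np.median, np.float64, None)

def with_pandas():
    # groupby sorts the group keys, so the order matches np.unique(idx).
    return pd.Series(values).groupby(idx).median().to_numpy()

# The three approaches should agree before their timings are compared.
assert np.allclose(brute_force(), with_scipy())
assert np.allclose(brute_force(), with_pandas())

for func in (brute_force, with_scipy, with_pandas):
    seconds = timeit.timeit(func, number=3) / 3
    print(f"{func.__name__}: {seconds:.4f} s per call")
```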