NumPy or SciPy to calculate weighted median


Problem description

I'm trying to automate a process that JMP does (Analyze -> Distribution, entering column A as the "Y value" and using subsequent columns as the "weight" value). In JMP you have to do this one column at a time. I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.

For example, if the mass array is [0, 10, 20, 30] and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.

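So far I have:

  1. Imported the csv showing the weights as an array, masking values of 0, and
  2. Created an array of "Y values" with the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but I thought it would be easier than a for loop for weighting purposes.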

I'm not sure exactly where to go from here. Basically, the "Y value" is a range of masses, and each column in the array gives the number of data points found for each mass. I need to find the median mass, based on the frequency with which the masses were reported.

I'm not an expert in Python or statistics, so if I've omitted any details that would be useful, let me know!

Update: here's some code for what I've done so far:

#Boilerplate & Import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt

inputFile = '/Users/cl/prov.csv'
origArray = genfromtxt(inputFile, delimiter = ",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)

#Mask values ==0
maTest = np.ma.masked_equal(nArray,0)

#Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]

massArr = []  # was never initialized in the original; holds one column of masses per weight column
for i in range(rowLength):
    createArr = np.arange(0, fieldLength*10, 10)
    massArr.append(createArr)
nmassArr = np.array(massArr).transpose()

Recommended answer

If I understood your problem correctly, what we can do is sum up the observations; dividing that total by 2 gives us the observation number corresponding to the median. From there we need to figure out which observation that number was.
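To make the idea of an "observation number" concrete, here is a minimal brute-force sketch using the masses and weights from the question. It assumes the weights are non-negative integer counts, so each mass can simply be repeated that many times and the ordinary median taken; the variable names are illustrative only.

```python
import numpy as np

masses = np.array([0, 10, 20, 30])
counts = np.array([30, 191, 9, 0])  # column 1 weights from the question

# Expand the frequency table into one entry per observation.
observations = np.repeat(masses, counts)

print(len(observations))        # total number of observations
print(np.median(observations))  # plain median of the expanded data
```

This agrees with the expected answer of 10, but it builds an array as long as the total count, which is why a cumulative-sum approach is more practical for large counts.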

One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.

Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previous elements and itself. We have 10 observations here, so the median would be the 5th observation (we get 5 by dividing the last element by 2).
Looking at the cumsum result, we can easily see that the 5th observation must fall between the second and third elements (cumulative sums 3 and 6).

So all we need to do is figure out the index where the median (5) fits.
np.searchsorted does exactly what we need. It finds the index at which an element should be inserted into a sorted array so that the array stays sorted.
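As a small sketch of the two steps on the example above (the variable names are only for illustration), np.cumsum builds the running totals and np.searchsorted, with its default side='left', locates the first running total that is at least half of the overall count:

```python
import numpy as np

counts = [1, 2, 3, 4]
running = np.cumsum(counts)   # running totals: 1, 3, 6, 10
half = running[-1] / 2.0      # half of the 10 observations

# First position whose running total is >= half.
index = np.searchsorted(running, half)

print(running, half, index)
```

Here the index comes out as 2, meaning the 5th observation belongs to the third element.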

The code looks like this:

import numpy as np

# my test data: one row of frequency counts per data set
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10, 20, 30, 40], [100, 10, 10, 10], [1, 1, 1, 100]])

# running total of observations along each row
c = np.cumsum(freq_count, axis=1)
# index where half of each row's total fits into that row's cumulative sums
indices = [np.searchsorted(row, row[-1]/2.0) for row in c]
masses = np.array([i * 10 for i in indices])  # correct if the masses are indeed 0, 10, 20, ...

# This is just for explanation.
print("median masses is:", masses)
print(freq_count)
print(np.hstack((c, c[:, -1, np.newaxis]/2.0)))

The output will be:

median masses is: [10 20 20  0 30]  
[[ 30 191   9   0]  <- The test data
 [ 10  20 300  10]  
 [ 10  20  30  40]  
 [100  10  10  10]  
 [  1   1   1 100]]  
[[  30.   221.   230.   230.   115. ]  <- cumsum results with median added to the end.
 [  10.    30.   330.   340.   170. ]     you can see from this where they fit in.
 [  10.    30.    60.   100.    50. ]  
 [ 100.   110.   120.   130.    65. ]  
 [   1.     2.     3.   103.    51.5]]  
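
The test data above stores each data set as a row, whereas the question describes weights stored in columns next to a column of masses. A minimal sketch of the same approach for that layout might look like the following; the function name weighted_median_by_column is hypothetical, and the sketch assumes the masses are sorted in ascending order and the weights are non-negative.

```python
import numpy as np

def weighted_median_by_column(values, weights):
    """For each column of `weights`, return the entry of `values` at which
    the cumulative weight first reaches half of that column's total."""
    values = np.asarray(values)
    weights = np.asarray(weights, dtype=float)
    cum = np.cumsum(weights, axis=0)  # running totals down each column
    half = cum[-1] / 2.0              # half of each column's total weight
    idx = [np.searchsorted(cum[:, j], half[j]) for j in range(cum.shape[1])]
    return values[idx]

masses = np.arange(0, 40, 10)  # 0, 10, 20, 30
# Two weight columns, taken from the first two rows of the test data above.
weights = np.array([[30, 10], [191, 20], [9, 300], [0, 10]])

medians = weighted_median_by_column(masses, weights)
print(medians)
```

One design caveat: when half of the total lands exactly on a cumulative sum, this picks the lower of the two neighbouring masses rather than averaging them, so it can differ from a conventional median in that tie case.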
