NumPy or SciPy to calculate weighted median

Question

I'm trying to automate a process that JMP does (Analyze->Distribution, entering column A as the "Y value", using subsequent columns as the "weight" value). In JMP you have to do this one column at a time - I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.

For example, if the mass array is [0, 10, 20, 30], and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.

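So far, I have

  1. Imported a csv showing the weights as an array, masking values of 0, and
  2. Created an array of the "Y value" with the same shape and size as the weights array (113x32). I'm not sure I need to do this, but I thought it would be easier than a for loop for the purpose of weighting.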

I'm not sure exactly where to go from here. Basically the "Y value" is a range of masses, and all of the columns in the array represent the number of data points found for each mass. I need to find the median mass, based on the frequency with which they were reported.

I'm not an expert in Python or statistics, so if I've omitted any details that would be useful let me know!

Update: here's some code for what I've done so far:

#Boilerplate & Import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt

inputFile = '/Users/cl/prov.csv'
origArray = genfromtxt(inputFile, delimiter = ",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)

#Mask values ==0
maTest = np.ma.masked_equal(nArray,0)

#Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]

massArr = []  # the list must exist before append() is called on it
for i in range(rowLength):
    massArr.append(np.arange(0, fieldLength*10, 10))
nmassArr = np.array(massArr).transpose()

Recommended answer

If I understood your problem correctly, we can sum up the observations; dividing that total by 2 gives the observation number corresponding to the median. From there we need to figure out which value that observation belongs to.

One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.

Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previous elements and itself. We have 10 observations here, so the median would be the 5th observation. (We get 5 by dividing the last element by 2.)
Looking at the cumsum result, we can easily see that the 5th observation must fall between the second and third elements (cumulative counts 3 and 6), so it belongs to the third element.
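
The running total from this example can be checked directly. This is only a minimal sketch; the variable names (counts, running, half) are illustrative and not part of the original answer:

```python
import numpy as np

counts = np.array([1, 2, 3, 4])   # number of observations per value
running = np.cumsum(counts)       # running total of observations
half = running[-1] / 2.0          # position of the median observation

print(running)  # [ 1  3  6 10]
print(half)     # 5.0
```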

So all we need to do is figure out the index where the median position (5) fits.
np.searchsorted does exactly what we need: it finds the index at which to insert an element into an array so that the array stays sorted.
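
As a small illustration of that lookup, assuming the cumulative counts from the [1,2,3,4] example (the variable names are again only illustrative):

```python
import numpy as np

running = np.array([1, 3, 6, 10])  # cumulative counts from the example
half = running[-1] / 2.0           # 5.0

# With the default side='left', searchsorted returns the index of the
# first running total that is >= half.
idx = np.searchsorted(running, half)
print(idx)  # 2, i.e. the third element
```

Note that when half lands exactly on a running total, this picks that element rather than averaging it with the next one, which may or may not match what JMP reports.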

The code to do it looks like this:

import numpy as np

# Test data: each row holds the frequency counts for masses 0, 10, 20, 30.
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10, 20, 30, 40], [100, 10, 10, 10], [1, 1, 1, 100]])

c = np.cumsum(freq_count, axis=1)  # running total along each row
# Index of the first running total that reaches half of the row total.
indices = [np.searchsorted(row, row[-1] / 2.0) for row in c]
masses = np.array(indices) * 10  # Correct if the masses are indeed 0, 10, 20,...

# This is just for explanation.
print("median masses is:", masses)
print(freq_count)
print(np.hstack((c, c[:, -1, np.newaxis] / 2.0)))

The output will be:

median masses is: [10 20 20  0 30]  
[[ 30 191   9   0]  <- The test data
 [ 10  20 300  10]  
 [ 10  20  30  40]  
 [100  10  10  10]  
 [  1   1   1 100]]  
[[  30.   221.   230.   230.   115. ]  <- cumsum results with median added to the end.
 [  10.    30.   330.   340.   170. ]     you can see from this where they fit in.
 [  10.    30.    60.   100.    50. ]  
 [ 100.   110.   120.   130.    65. ]  
 [   1.     2.     3.   103.    51.5]]  
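
The question describes the weights as columns (one column per weight set, one row per mass), whereas the test data above uses rows. A possible adaptation is sketched below. The function name weighted_median_per_column is hypothetical, and the sketch assumes the masses are sorted in ascending order and the weights are non-negative counts with at least one non-zero entry per column:

```python
import numpy as np

def weighted_median_per_column(masses, weights):
    """Return one weighted median per column of `weights` (hypothetical helper).

    masses  : 1-D array of ascending values, one per row of `weights`
    weights : 2-D array of counts, shape (len(masses), n_columns)
    """
    masses = np.asarray(masses)
    weights = np.asarray(weights, dtype=float)
    running = np.cumsum(weights, axis=0)   # running total down each column
    half = running[-1, :] / 2.0            # half of each column's total
    # First row in each column whose running total reaches half the total.
    idx = [np.searchsorted(running[:, j], half[j])
           for j in range(weights.shape[1])]
    return masses[idx]

masses = np.arange(0, 40, 10)              # 0, 10, 20, 30
weights = np.array([[30, 10],
                    [191, 20],
                    [9, 300],
                    [0, 10]])
result = weighted_median_per_column(masses, weights)
print(result)  # [10 20]
```

The first column reproduces the example from the question (weights [30, 191, 9, 0] give a weighted median of 10). A column whose weights are all zero would return the first mass, so such columns probably need to be handled separately.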
