如何计算分组数据集的中位数? [英] how to calculate the median on grouped dataset?

查看:404
本文介绍了如何计算分组数据集的中位数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的数据集如下:

salary  number
1500-1600   110
1600-1700   180
1700-1800   320
1800-1900   460
1900-2000   850
2000-2100   250
2100-2200   130
2200-2300   70
2300-2400   20
2400-2500   10

如何计算此数据集的中位数?这是我尝试过的:

How can I calculate the median of this dataset? Here's what I have tried:

x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- "numbers"
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]", 
              "(1900-2000]", "(2000,2100]", "(2100-2200]", "(2200-2300]",
              "(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow=length(x), dimnames=list(rownames, colnames))
data.frame(y, "cumsum"=cumsum(y))

            numbers cumsum
[1500-1600]     110    110
(1600-1700]     180    290
(1700-1800]     320    610
(1800-1900]     460   1070
(1900-2000]     850   1920
(2000,2100]     250   2170
(2100-2200]     130   2300
(2200-2300]      70   2370
(2300-2400]      20   2390
(2400-2500]      10   2400

在这里,您可以看到中途频率为2400/2 = 1200.它在10701920之间.因此,中位类别(1900-2000]组.您可以使用以下公式获得此结果:

Here, you can see the half-way frequency is 2400/2=1200. It is between 1070 and 1920. Thus the median class is the (1900-2000] group. You can use the formula below to get this result:

中位数= L + h/f(n/2-c)

Median = L + h/f (n/2 - c)

其中:

L 是中产阶级的下阶级边界
h 是中位类别的大小,即中位类别的上层和下层界限之间的差异
f 是中位课程的发生频率
c 是中位类别的先前累积频率
n/2 是总数.观测值除以2(即 f /2)

L is the lower class boundary of median class
h is the size of the median class i.e. difference between upper and lower class boundaries of median class
f is the frequency of median class
c is previous cumulative frequency of the median class
n/2 is total no. of observations divided by 2 (i.e. sum f / 2)

或者,中位类别是通过以下方法定义的:

Alternatively, median class is defined by the following method:

在累积频率列中找到n/2.

Locate n/2 in the column of cumulative frequency.

获取其中所在的类.

在代码中:

> 1900 + (1200 - 1070) / (1920 - 1070) * (2000 - 1900)    
[1] 1915.294

现在我要做的是使上面的表达更优雅-即1900+(1200-1070)/(1920-1070)*(2000-1900).我该如何实现?

Now what I want to do is to make the above expression more elegant - i.e. 1900+(1200-1070)/(1920-1070)*(2000-1900). How can I achieve this?

推荐答案

由于您已经知道公式,因此创建一个函数来为您进行计算应该很容易.

Since you already know the formula, it should be easy enough to create a function to do the calculation for you.

在这里,我创建了一个基本功能来帮助您入门.该函数带有四个参数:

Here, I've created a basic function to get you started. The function takes four arguments:

  • frequencies:频率的向量(第一个示例中为数字")
  • intervals:2行matrix,其列数与频率的长度相同,第一行是下层边界,第二行是上层边界.另外,"intervals"可能是data.frame中的一列,并且您可以指定sep(可能还有trim)以使函数自动为您创建所需的矩阵.
  • sep:data.frame中"intervals"列中的分隔符.
  • trim:字符的正则表达式,在尝试强制转换为数字矩阵之前需要将其删除.函数中内置了一种模式:trim = "cut".设置正则表达式模式以从输入中删除(,),[和].
  • frequencies: A vector of frequencies ("number" in your first example)
  • intervals: A 2-row matrix with the same number of columns as the length of frequencies, with the first row being the lower class boundary, and the second row being the upper class boundary. Alternatively, "intervals" may be a column in your data.frame, and you may specify sep (and possibly, trim) to have the function automatically create the required matrix for you.
  • sep: The separator character in your "intervals" column in your data.frame.
  • trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.

这是功能(带有注释,显示了我如何使用您的说明将其组合在一起):

Here's the function (with comments showing how I used your instructions to put it together):

GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the 
  #   required "intervals" matrix. "trim" removes any unwanted 
  #   characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }

  Midpoints <- rowMeans(intervals)
  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  cf2 <- cf[Midrow - 1]          # cumulative frequency class before median class
  n_2 <- max(cf)/2               # total observations divided by 2

  unname(L + (n_2 - cf2)/f * h)
}


以下是与data.frame一起使用的示例:


Here's a sample data.frame to work with:

mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800", 
    "1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300", 
    "2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L, 
    850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"), 
    class = "data.frame", row.names = c(NA, -10L))
mydf
#       salary number
# 1  1500-1600    110
# 2  1600-1700    180
# 3  1700-1800    320
# 4  1800-1900    460
# 5  1900-2000    850
# 6  2000-2100    250
# 7  2100-2200    130
# 8  2200-2300     70
# 9  2300-2400     20
# 10 2400-2500     10


现在,我们可以简单地做到:


Now, we can simply do:

GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294


以下是该函数对某些组合数据起作用的示例:


Here's an example of the function in action on some made up data:

set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
#           Var1 Freq
# 1   (1.9,11.7]    8
# 2  (11.7,21.5]    8
# 3  (21.5,31.4]    8
# 4  (31.4,41.2]   15
# 5    (41.2,51]   13
# 6    (51,60.8]    5
# 7  (60.8,70.6]   11
# 8  (70.6,80.5]   15
# 9  (80.5,90.3]   11
# 10  (90.3,100]    6

### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231

### ... and the output of median on the original vector
median(x)
# [1] 49.5


顺便说一句,在您提供的示例数据中,我认为您的一个范围内有一个错误(除了破折号以外,其他所有均用破折号隔开,其中一个用逗号隔开),因为strsplit使用了正则表达式默认会拆分,您可以使用如下函数:


By the way, with the sample data that you provided, where I think there was a mistake in one of your ranges (all were separated by dashes except one, which was separated by a comma), since strsplit uses a regular expression by default to split on, you can use the function like this:

x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
            "(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
            "(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294

这篇关于如何计算分组数据集的中位数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆