Find frequencies over 3rd quartile in table

Problem description

I have a big data frame (over 239k observations of 57 variables) with sickness descriptions and the medicines administered for those sicknesses to people in different age ranges. For each sickness description, I'd like to find the medicines in the top quartile of frequency of use.

To make a reproducible example, I created a data frame with 1000 observations:

set.seed(1)
sk <- as.factor(sample(c("sick A","sick B","sick C","sick D"), 1000, replace = TRUE))
# Only 5 values are drawn here; data.frame() recycles them to 1000 rows
md <- as.factor(sample(c("med 1","med 2","med 3","med 4","med 5")))
age <- as.factor(sample(c("group a","group b","group c"), 1000, replace = TRUE))
df <- data.frame(obs = 1:1000, md = md, sk = sk, age = age)

I can cross-tabulate the data with:

xt<-xtabs(~md+sk+age,df)

I can then produce a data frame for each age group:

XTDF_a<-as.data.frame(xt[,,"group a"])

and then find the 3rd quartile of the frequencies for each sickness with:

Q3_a<-apply(XTDF_a,2,function(x) quantile(x,probs = .75))

which I can compare against to see which medicines are over the 3rd quartile for each sickness:

XTDF_a>Q3_a


    sk
md      sick A sick B sick C sick D
  med 1  FALSE  FALSE   TRUE  FALSE
  med 2  FALSE  FALSE  FALSE  FALSE
  med 3   TRUE   TRUE  FALSE  FALSE
  med 4  FALSE  FALSE  FALSE   TRUE
  med 5  FALSE  FALSE  FALSE  FALSE

I can conclude that med 3 is the top selection for sickness A, and so on (I'm actually looping to extract that information). I then go back and repeat the process for groups b, c, and so on, which is almost impossible with the amount of data I have (sicknesses have about 4200 levels and medicines about 1150 levels).

I'm pretty sure there should be a different, easier way to achieve this. I'd appreciate a hint on a better path to follow.
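
A side note on the comparison step in the question: when a table of counts is compared with a shorter vector of per-column cutoffs, R recycles the vector down the columns, so a cell is not necessarily tested against its own column's quantile. The sketch below uses made-up counts (the `counts` matrix and its values are hypothetical, not taken from the data above) to show one way to do the comparison column by column with `sweep()`.

```r
# Hypothetical counts: rows are medicines, columns are sicknesses
counts <- matrix(c(1, 2, 3, 4,
                   10, 20, 30, 40),
                 nrow = 4,
                 dimnames = list(md = paste("med", 1:4),
                                 sk = c("sick A", "sick B")))

# One 3rd-quartile cutoff per column (per sickness)
q3 <- apply(counts, 2, quantile, probs = 0.75)

# Plain comparison: q3 is recycled cell by cell, down the columns
naive <- counts > q3

# sweep() lines q3 up with the columns (MARGIN = 2) before comparing
over <- sweep(counts, 2, q3, FUN = ">")

print(naive)
print(over)
```

In this toy case the two results differ, which suggests checking which form of comparison is intended before drawing conclusions from the logical table.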

Recommended answer

I think you can speed this up by writing a slightly more precise function and then using aggregate to get the results. You could also use by if you want a more list-based approach, which might be more useful for whatever you do next. I think it will still be slow, but not as slow as looping.

# Here is what you gave me originally
set.seed(1)
sk<-as.factor(sample(c("sick A","sick B","sick C","sick D"),1000,replace=T))
md<-as.factor(sample(c("med 1","med 2","med 3","med 4","med 5")))
age<-as.factor(sample(c("group a","group b","group c"),1000,replace=T))
df<-data.frame(obs=1:1000,md=md,sk=sk,age=age)

# Define a function that basically does what you did before, but uses table()
func.get_75th_meds <- function(vector_of_meds) {

    freq <- table(vector_of_meds)
    return(names(freq)[freq >= quantile(x = freq,probs = 0.75)])
}

aggregate(x = list(Meds = df$md),
          by = list(Sickness = df$sk,Group = df$age),
          FUN = func.get_75th_meds)

   Sickness   Group                       Meds
1    sick A group a               med 3, med 5
2    sick B group a               med 3, med 5
3    sick C group a med 1, med 2, med 4, med 5
4    sick D group a               med 2, med 4
5    sick A group b               med 4, med 5
6    sick B group b        med 1, med 2, med 5
7    sick C group b               med 1, med 2
8    sick D group b               med 2, med 3
9    sick A group c               med 2, med 5
10   sick B group c               med 2, med 4
11   sick C group c        med 1, med 2, med 4
12   sick D group c        med 1, med 3, med 4
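
To see why some groups above list several medicines, it may help to walk through the steps inside the function on a tiny vector. This is only a sketch: the `meds` and `tied` vectors are invented for illustration, and `as.vector()` is added here so that `quantile()` works on plain counts.

```r
# Hypothetical prescriptions for a single sickness/age group
meds <- c("med 1",
          "med 2", "med 2",
          "med 3", "med 3", "med 3",
          "med 4", "med 4", "med 4", "med 4")

freq <- table(meds)                                  # counts: 1, 2, 3, 4
cutoff <- quantile(as.vector(freq), probs = 0.75)    # 3rd quartile of the counts
top <- names(freq)[freq >= cutoff]                   # medicines at or above the cutoff
print(top)

# Because the test is >=, ties at the cutoff are all kept:
tied <- rep(c("med 1", "med 2", "med 3", "med 4"), each = 2)
freq_tied <- table(tied)
all_tied <- names(freq_tied)[freq_tied >= quantile(as.vector(freq_tied), probs = 0.75)]
print(all_tied)
```

With counts of 1, 2, 3 and 4 the cutoff is 3.25, so only med 4 is returned; when every medicine has the same count, all of them meet the cutoff.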

EDITED TO ADD: Here's the alternative with by(), using the same function.

by(data = df$md,
   INDICES = list(Sickness = df$sk,Group = df$age),
   FUN = func.get_75th_meds)

Sickness: sick A
Group: group a
[1] "med 3" "med 5"
---------------------------------------------------------------
Sickness: sick B
Group: group a
[1] "med 3" "med 5"
---------------------------------------------------------------
... and so on
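
If a plain named list is easier to work with downstream, a similar result can be sketched with `split()` and `lapply()`. This is an alternative to the calls above, not part of the original answer; the `toy` data frame and the `top_meds` helper are hypothetical stand-ins for `df` and `func.get_75th_meds`.

```r
# Small hypothetical data set: 2 sicknesses x 2 age groups
toy <- data.frame(
    sk  = rep(c("sick A", "sick B"), each = 6),
    age = rep(c("group a", "group b"), times = 6),
    md  = c("med 1", "med 2", "med 1", "med 2", "med 3", "med 3",
            "med 2", "med 1", "med 2", "med 3", "med 2", "med 3"),
    stringsAsFactors = FALSE
)

# Same idea as the function above: keep medicines at or over the 3rd quartile
top_meds <- function(meds) {
    freq <- table(meds)
    names(freq)[freq >= quantile(as.vector(freq), probs = 0.75)]
}

# One vector of medicines per sickness/age combination, then apply the helper
groups <- split(toy$md, list(toy$sk, toy$age), drop = TRUE)
result <- lapply(groups, top_meds)

# Elements are named "<sickness>.<age group>"
print(result[["sick A.group a"]])
```

Each element of `result` can then be looked up by name, which avoids looping over age groups by hand.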
