How to find the means of consecutive numbers in a column given consecutive strings in another column in R

Question

I have a dataset that looks similar to this:

head(df,20)
   mmpd tot
1     0   0
2    mm   0
3    mm   1
4     0   0
5     0   0
6    mm   0
7    mm   1
8    mm   3
9    mm   1
10    0   0
11    0   0
12    0   0
13    0   0
14   mm   0
15   mm   0
16    0   0
17    0   0
18   mm   4
19   mm   1
20   mm   0

I would like to get the average of df$tot over each run of consecutive mm values in df$mmpd. For the example dataset, I'd like to get the following numbers: .5, 1.25, 0, 1.667. df$mmpd will always consist of either runs of more than one mm, or 0s, and the column can begin with either 0 or a run of mm.

Is there a way to do this without a for loop?
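
To try the code in the answer, the 20 printed rows can be rebuilt as a data frame. This is only a sketch: it assumes mmpd is stored as a character column (it could equally be a factor in the original data), and it copies the values from the head(df,20) output above.

```r
# Rebuild the 20 rows shown in the question.
# Assumption: mmpd is a character column holding "0" or "mm".
df <- data.frame(
  mmpd = c("0", "mm", "mm", "0", "0", "mm", "mm", "mm", "mm", "0",
           "0", "0", "0", "mm", "mm", "0", "0", "mm", "mm", "mm"),
  tot  = c(0, 0, 1, 0, 0, 0, 1, 3, 1, 0,
           0, 0, 0, 0, 0, 0, 0, 4, 1, 0),
  stringsAsFactors = FALSE
)
```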

Recommended answer

Using data.table

library(data.table) # v 1.9.5+
setDT(df)[,.(my=mean(tot)), by=.(indx=rleid(mmpd),mmpd)][,indx:=NULL][mmpd=='mm']
   mmpd       my
#1:   mm 0.500000
#2:   mm 1.250000
#3:   mm 0.000000
#4:   mm 1.666667
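
For comparison, the same grouping idea can be sketched in base R with rle(), which avoids the data.table dependency. This is not the approach benchmarked in this answer, and the names run_id, run_means and mm_means are illustrative only. The sample data is rebuilt from the question, assuming mmpd is a character column.

```r
# Sample data copied from the question (mmpd assumed to be character).
df <- data.frame(
  mmpd = c("0", "mm", "mm", "0", "0", "mm", "mm", "mm", "mm", "0",
           "0", "0", "0", "mm", "mm", "0", "0", "mm", "mm", "mm"),
  tot  = c(0, 0, 1, 0, 0, 0, 1, 3, 1, 0,
           0, 0, 0, 0, 0, 0, 0, 4, 1, 0),
  stringsAsFactors = FALSE
)

# rle() gives the length and value of each run of identical entries.
r <- rle(df$mmpd)

# Expand the runs back into one run id per row.
run_id <- rep(seq_along(r$lengths), r$lengths)

# Mean of tot within each run, in run order.
run_means <- as.vector(tapply(df$tot, run_id, mean))

# Keep only the runs whose value is "mm".
mm_means <- run_means[r$values == "mm"]
mm_means
```

Each element of mm_means corresponds to one run of mm, in the order the runs appear in the column.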

Apparently, there are many ways to do this (see "r search along a vector and calculate the mean"). The data.table method was the fastest there and is adapted here.

Note: rleid can be used outside of the data.table syntax. This looks more like "traditional" R syntax and produces the same results.

subset(aggregate(tot ~ indx + mmpd, 
          data=cbind(df,indx=rleid(df$mmpd)),
          FUN=mean),mmpd=="mm")

Speed comparison of different ways to generate the run ids that rleid returns (myrleid is from @JasonAizkalns's answer).

> set.seed(1); x<-sample(1:2,100000,replace=T); 
  microbenchmark(rleid(x),
                 myrleid2=cumsum(c(1,diff(x)!=0)),
                 myrleid(x))
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval cld
   rleid(x) 1.422263 1.500873 1.586482 1.571315 1.662982  1.938254   100 a  
   myrleid2 3.860290 3.908308 4.369646 3.962497 4.177673 15.674611   100  b 
 myrleid(x) 7.282868 7.386515 7.753515 7.444008 7.654126 18.864898   100   c
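
To see why the myrleid2 expression yields run ids, here is a small worked sketch on a hand-made vector (the vector x below is illustrative, not the benchmark input). diff(x) != 0 marks every position where the value changes, and the cumulative sum of those marks increases by one at the start of each new run.

```r
# Small illustrative vector with four runs: 1 1 | 2 2 2 | 1 | 2 2
x <- c(1, 1, 2, 2, 2, 1, 2, 2)

# TRUE wherever an element differs from the one before it.
changed <- diff(x) != 0

# Prepend 1 for the first element, then accumulate the change flags.
run_id <- cumsum(c(1, changed))
run_id
# 1 1 2 2 2 3 4 4

# Cross-check against a run-length encoding built with base rle().
r <- rle(x)
run_id_rle <- rep(seq_along(r$lengths), r$lengths)
```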

For non-numeric x:

>  set.seed(1); x<-sample(c('a','b'),100000,replace=T); 
>  microbenchmark(rleid(x),myrleid2=cumsum(c(1,diff(as.numeric(factor(x)))!=0)),myrleid(x))
Unit: milliseconds
       expr       min        lq      mean    median       uq       max neval cld
   rleid(x)  1.465466  1.571662  1.684568  1.606614  1.66080  2.900983   100 a
   myrleid2  8.705447  9.276787 12.393393  9.907403 10.35032 61.080374   100  b
 myrleid(x) 11.970271 13.176144 18.779256 13.790767 14.09626 69.845587   100   c
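
The factor conversion is needed above only because diff() requires numbers. As a hedged alternative that was not part of the benchmark, neighbouring elements can be compared directly, which works for character vectors without conversion. The sketch below uses a small illustrative vector and checks that both forms agree. It assumes x has at least two elements and contains no NA.

```r
# Small illustrative character vector: a a | b | a a | b b
x <- c("a", "a", "b", "a", "a", "b", "b")

# Form used in the benchmark: convert to numeric codes, then diff().
id_factor <- cumsum(c(1, diff(as.numeric(factor(x))) != 0))

# Alternative (not benchmarked here): compare each element with its predecessor.
id_compare <- cumsum(c(TRUE, x[-1] != x[-length(x)]))

id_compare
# 1 1 2 3 3 4 4
```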
