How to compute the mean in different categories using left_join and nest in R?


Problem description

I'm trying to compute the mean values for binned data using left_join and nest.

bin.size = 100 

First dataframe:

df = data.frame(x =c(300,400), 
                y = c("sca1","sca2"))
    x    y
1 300 sca1
2 400 sca2

Second dataframe:

df2 = data.frame(snp = c(1,2,10,100,1,2,14,16,399), 
                 r2 = c(0.70,0.80,0.70,0.10,0.90,0.98,0.80,0.80,0.01),
                 sca = c("sca1","sca1","sca1","sca1","sca2","sca2","sca2","sca2","sca2"))

  snp   r2  sca
1   1 0.70 sca1
2   2 0.80 sca1
3  10 0.70 sca1
4 100 0.10 sca1
5   1 0.90 sca2
6   2 0.98 sca2
7  14 0.80 sca2
8  16 0.80 sca2
9 399 0.01 sca2

Code from @r2evans:

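library(dplyr)  # %>%, left_join, mutate, select, data_frame
library(tidyr)  # nest, unnest
library(purrr)  # map, pmap

output_bin_LD = df %>%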
  left_join(nest(df2, snp, .key = "snp"), by = c("y" = "sca")) %>%
  mutate(
    cuts = map(x, ~ seq(0, ., by = bin.size)),
    tbls = pmap(
      .l = list(snp, cuts),
      .f = function(xx, breaks) {
        z <- table(cut(xx$snp, breaks))
        data_frame(cut = names(z), count = z)
      }
    )
  ) %>%
  select(y, tbls) %>%
  unnest()

This code produces:

     y       cut count
1 sca1   (0,100]     4
2 sca1 (100,200]     0
3 sca1 (200,300]     0
4 sca2   (0,100]     4
5 sca2 (100,200]     0
6 sca2 (200,300]     0
7 sca2 (300,400]     1

The end goal would be to have

     y       cut count  mean
1 sca1   (0,100]     4 0.575
2 sca1 (100,200]     0     0
3 sca1 (200,300]     0     0
4 sca2   (0,100]     4  0.87
5 sca2 (100,200]     0     0
6 sca2 (200,300]     0     0
7 sca2 (300,400]     1  0.01

So far I've tried this:

df %>%
  left_join(nest(df2, snp, r2, .key = "snp"), 
            by = c("y" = "sca")) %>%
  mutate(
    cuts = map(x, ~ seq(0, ., by = 100)),
    tbls = pmap(
      .l = list(snp, cuts),
      .f = function(xx, breaks) {
        z <- table(cut(xx$snp, breaks))
        a <- mean(cut(xx$r2, breaks))
        data_frame(cut = names(z), count = z, mean = a)
      } # .f 
    ) # closing pmap
  ) %>% # mutate
  select(y, tbls) %>%
  unnest()

But it outputs NAs and warning messages:

     y       cut count mean
1 sca1   (0,100]     4   NA
2 sca1 (100,200]     0   NA
3 sca1 (200,300]     0   NA
4 sca2   (0,100]     4   NA
5 sca2 (100,200]     0   NA
6 sca2 (200,300]     0   NA
7 sca2 (300,400]     1   NA
Warning messages:
1: In mean.default(cut(xx$r2, breaks)) :
  argument is not numeric or logical: returning NA
2: In mean.default(cut(xx$r2, breaks)) :
  argument is not numeric or logical: returning NA

How should I fix this? Do I need to double nest the table?

Solution

Not sure about your approach, but here's a more straightforward one using the data.table package, if you're interested. It relies on non-equi joins, a new feature, so you will need data.table v1.9.8 or later (the current version is 1.10.0).

require(data.table) ## v1.9.8+
ans <- b[a, on=.(sca=y, snp>start, snp<=end),       ## 1
         .(count=.N, mean=mean(r2, na.rm=TRUE)),    ## 2
         by=.EACHI]                                 ## 3

  1. For each row in a, find the matching row indices in b, using the conditions given in the on argument.

  2. length(matching row indices) == .N gives count and mean() gives the mean of r2 for those matching indices.

  3. The expression in (2) is run for each row in a.
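
To make the three steps concrete, here is a rough base-R sketch of the same logic. It is only an illustration of what the join computes, not how data.table implements it; the names matches, count and mean are chosen here for the example, and the data are retyped from the question.

```r
# Illustrative base-R re-creation of steps 1-3 (not data.table internals).
bin.size <- 100
df <- data.frame(x = c(300, 400), y = c("sca1", "sca2"),
                 stringsAsFactors = FALSE)
b <- data.frame(
  snp = c(1, 2, 10, 100, 1, 2, 14, 16, 399),
  r2  = c(0.70, 0.80, 0.70, 0.10, 0.90, 0.98, 0.80, 0.80, 0.01),
  sca = rep(c("sca1", "sca2"), times = c(4, 5)),
  stringsAsFactors = FALSE
)

# One row per (y, bin), playing the role of the table `a`
a <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  data.frame(y     = df$y[i],
             start = seq(0, df$x[i] - 1, by = bin.size),
             end   = seq(bin.size, df$x[i], by = bin.size),
             stringsAsFactors = FALSE)
}))

# Step 1: for each row of a, the indices of the matching rows in b
matches <- lapply(seq_len(nrow(a)), function(i) {
  which(b$sca == a$y[i] & b$snp > a$start[i] & b$snp <= a$end[i])
})

# Steps 2 and 3: count and mean of r2, evaluated once per row of a
a$count <- lengths(matches)
a$mean  <- vapply(matches, function(idx) mean(b$r2[idx], na.rm = TRUE),
                  numeric(1))
print(a)
```

In this sketch an empty bin gets count 0 and mean NaN, because the mean of a zero-length vector is NaN; replace those values afterwards if 0 is preferred.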

where a and b are:

require(data.table) ## v1.9.8+
a <- setDT(df)[, .(start=seq(0, x-1, by=bin.size), 
                   end=seq(bin.size, x, by=bin.size)), 
                 by=y]

b <- fread("snp   r2  sca
      1 0.70 sca1
      2 0.80 sca1
     10 0.70 sca1
    100 0.10 sca1
      1 0.90 sca2
      2 0.98 sca2
     14 0.80 sca2
     16 0.80 sca2
    399 0.01 sca2")
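
As for the warning in the question: cut() returns a factor, so mean(cut(xx$r2, breaks)) is given a non-numeric argument and returns NA. A minimal sketch, assuming the aim is the mean of r2 within each snp bin, is to bin snp once and group r2 by that factor with base R's tapply(). The variable names below are illustrative and the data are the sca1 rows from the question.

```r
# Sketch: per-bin mean of r2, grouping by the factor that cut() returns.
snp    <- c(1, 2, 10, 100)
r2     <- c(0.70, 0.80, 0.70, 0.10)
breaks <- seq(0, 300, by = 100)

bins <- cut(snp, breaks)   # a factor, one level per interval
is.numeric(bins)           # a factor is not numeric, hence the warning

count    <- as.vector(table(bins))              # rows per bin
bin_mean <- as.vector(tapply(r2, bins, mean))   # mean of r2 per bin

result <- data.frame(cut = levels(bins), count = count, mean = bin_mean)
print(result)
```

Bins with no rows come back as NA from tapply(), so they would need replacing if the desired output uses 0 there.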
