How to compute the mean in different categories using left_join and nest in R?
Problem description
I'm trying to compute the mean values for binned data using left_join and nest.

    bin.size = 100
First dataframe:
    df = data.frame(x = c(300, 400),
                    y = c("sca1", "sca2"))

        x    y
    1 300 sca1
    2 400 sca2
Second dataframe:
    df2 = data.frame(snp = c(1, 2, 10, 100, 1, 2, 14, 16, 399),
                     r2  = c(0.70, 0.80, 0.70, 0.10, 0.90, 0.98, 0.80, 0.80, 0.01),
                     sca = c("sca1", "sca1", "sca1", "sca1",
                             "sca2", "sca2", "sca2", "sca2", "sca2"))

      snp   r2  sca
    1   1 0.70 sca1
    2   2 0.80 sca1
    3  10 0.70 sca1
    4 100 0.10 sca1
    5   1 0.90 sca2
    6   2 0.98 sca2
    7  14 0.80 sca2
    8  16 0.80 sca2
    9 399 0.01 sca2
Code from @r2evans:
    output_bin_LD = df %>%
      left_join(nest(df2, snp, .key = "snp"), by = c("y" = "sca")) %>%
      mutate(
        cuts = map(x, ~ seq(0, ., by = bin.size)),
        tbls = pmap(
          .l = list(snp, cuts),
          .f = function(xx, breaks) {
            z <- table(cut(xx$snp, breaks))
            data_frame(cut = names(z), count = z)
          }
        )
      ) %>%
      select(y, tbls) %>%
      unnest()
This code produces:

         y       cut count
    1 sca1   (0,100]     4
    2 sca1 (100,200]     0
    3 sca1 (200,300]     0
    4 sca2   (0,100]     4
    5 sca2 (100,200]     0
    6 sca2 (200,300]     0
    7 sca2 (300,400]     1
The end goal would be to have
         y       cut count  mean
    1 sca1   (0,100]     4 0.575
    2 sca1 (100,200]     0     0
    3 sca1 (200,300]     0     0
    4 sca2   (0,100]     4  0.87
    5 sca2 (100,200]     0     0
    6 sca2 (200,300]     0     0
    7 sca2 (300,400]     1   399
So far I've tried this:
    df %>%
      left_join(nest(df2, snp, r2, .key = "snp"), by = c("y" = "sca")) %>%
      mutate(
        cuts = map(x, ~ seq(0, ., by = 100)),
        tbls = pmap(
          .l = list(snp, cuts),
          .f = function(xx, breaks) {
            z <- table(cut(xx$snp, breaks))
            a <- mean(cut(xx$r2, breaks))
            data_frame(cut = names(z), count = z, mean = a)
          } # .f
        ) # closing pmap
      ) %>% # mutate
      select(y, tbls) %>%
      unnest()
But it gives me NAs and warning messages:

         y       cut count mean
    1 sca1   (0,100]     4   NA
    2 sca1 (100,200]     0   NA
    3 sca1 (200,300]     0   NA
    4 sca2   (0,100]     4   NA
    5 sca2 (100,200]     0   NA
    6 sca2 (200,300]     0   NA
    7 sca2 (300,400]     1   NA
    Warning messages:
    1: In mean.default(cut(xx$r2, breaks)) :
      argument is not numeric or logical: returning NA
    2: In mean.default(cut(xx$r2, breaks)) :
      argument is not numeric or logical: returning NA
How should I fix this? Do I need to double nest the table?
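The warning can be reproduced in isolation. The following is a minimal sketch, assuming the r2 values for sca1 and the breaks used in the attempt above: cut() returns a factor of interval labels, and mean() of a factor returns NA with exactly this warning. In other words, the r2 values are being binned rather than averaged.

```r
# r2 values for sca1 and the snp breaks used in the attempt above
r2     <- c(0.70, 0.80, 0.70, 0.10)
breaks <- seq(0, 300, by = 100)

binned <- cut(r2, breaks)   # a factor of interval labels, not numbers
class(binned)

# mean() on a factor returns NA and emits
# "argument is not numeric or logical: returning NA"
res <- suppressWarnings(mean(binned))
is.na(res)
```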
Solution
Not sure about your approach, but here's a more straightforward one using the data.table package, if you're interested. You will need v1.9.8 or later (the current version is 1.10.0), since non-equi joins are a new feature.

    require(data.table) ## v1.9.8+
    ans <- b[a, on = .(sca = y, snp > start, snp <= end),   ## 1
             .(count = .N, mean = mean(r2, na.rm = TRUE)),  ## 2
             by = .EACHI]                                   ## 3
(1) For each row in a, find the matching row indices in b, matching on the conditions provided to the on argument.
(2) length(matching row indices) == .N gives count, and mean() gives the mean of r2 for those matching indices.
(3) The expression in (2) is run for each row in a.

where a and b are:

    require(data.table) ## v1.9.8+
    a <- setDT(df)[, .(start = seq(0, x - 1, by = bin.size),
                       end   = seq(bin.size, x, by = bin.size)),
                   by = y]
    b <- fread("snp r2 sca
    1 0.70 sca1
    2 0.80 sca1
    10 0.70 sca1
    100 0.10 sca1
    1 0.90 sca2
    2 0.98 sca2
    14 0.80 sca2
    16 0.80 sca2
    399 0.01 sca2")
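If you would rather stay close to the original cut()-based approach, one option is to bin snp once and then average r2 within those bins. The following is a minimal base R sketch, not the data.table method above; the helper name bin_summary is hypothetical, and the data are re-typed from the question.

```r
bin.size <- 100

df  <- data.frame(x = c(300, 400), y = c("sca1", "sca2"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  snp = c(1, 2, 10, 100, 1, 2, 14, 16, 399),
  r2  = c(0.70, 0.80, 0.70, 0.10, 0.90, 0.98, 0.80, 0.80, 0.01),
  sca = rep(c("sca1", "sca2"), times = c(4, 5)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count and mean r2 per snp bin for one scaffold
bin_summary <- function(scaffold, upper) {
  d    <- df2[df2$sca == scaffold, ]
  bins <- cut(d$snp, breaks = seq(0, upper, by = bin.size))
  data.frame(
    y     = scaffold,
    cut   = levels(bins),
    count = as.vector(table(bins)),
    # tapply() averages r2 within each bin; empty bins come back as NA
    mean  = as.vector(tapply(d$r2, bins, mean)),
    stringsAsFactors = FALSE
  )
}

# One summary per row of df, stacked into a single data frame
result <- do.call(rbind, Map(bin_summary, df$y, df$x))
rownames(result) <- NULL
result
```

Empty bins get NA here rather than 0; if zeros are preferred, as in the target table, they can be set with result$mean[is.na(result$mean)] <- 0. With this data the (300,400] bin of sca2 has a mean of 0.01, the r2 value of snp 399.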