R, ggplot, separate mean by range of x value
Problem description
I have a data set that looks like this:
      CHROM   POS GT DIFF
    1 chr01 14653 CT  254
    2 chr01 14907 AG  254
    3 chr01 14930 AG   23
    4 chr01 15190 GA  260
    5 chr01 15211 TG   21
    6 chr01 16378 TC 1167
POS ranges from 1xxxx to 1xxxxxxx, and CHROM is a categorical variable that contains the values "chr01" to "chr22" and "chrX".
I want to plot a scatterplot:
- y(DIFF) vs. X(POS)
- having panels separated by CHROM
- grouped by GT (different colors by GT)
I'm creating a ggplot with a running average (although this is not time-series data).
What I want is the average DIFF for every 1,000,000-wide range of POS, computed separately for each GT.
For example,
for x in range(1 ~ 1,000,000) , DIFF average = _____
for x in range(1,000,001 ~ 2,000,000), DIFF average = _____
and I want to plot horizontal lines on the ggplot (coloured by GT).
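A minimal sketch of the binning described above, using base R on a few made-up rows (the `toy` data frame, its values, and the `POSbin` column name are assumptions for illustration, not part of the real data). The bin index is computed as `(POS - 1) %/% 1e6` so that position 1,000,000 lands in the first range and 1,000,001 in the second, matching the ranges written out above.

```r
# Toy data in the same shape as the real data (values are made up)
toy <- data.frame(
  POS  = c(1, 500000, 1000000, 1000001, 1500000, 2000000),
  GT   = c("CT", "CT", "AG", "CT", "AG", "AG"),
  DIFF = c(10, 30, 50, 70, 90, 110)
)

# Bin index: 0 for POS 1..1,000,000, 1 for 1,000,001..2,000,000, and so on
toy$POSbin <- (toy$POS - 1) %/% 1e6

# Mean DIFF for every combination of bin and GT
bin_means <- aggregate(DIFF ~ POSbin + GT, data = toy, FUN = mean)
bin_means
```

Each row of `bin_means` is one horizontal line to draw: its height is the mean, and its x extent is the range covered by that bin.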
What I have so far, before applying your function:
After applying your function (image: https://i.stack.imgur.com/4w5PU.jpg):
I tried to apply your solution to what I already have and ran into some problems:
- There are several panels, so the mean values should differ from panel to panel, but when I apply your code the horizontal mean lines in every panel are identical to those in the first panel.
- The x-axis range differs between panels, so when I apply your function the extra range is automatically filled with the previous horizontal mean line.
Here is my code from before:
    library(ggplot2)
    library(scales)  # trans_breaks(), pretty_breaks()

    ggplot(data1, aes(x = POS, y = DIFF, colour = GT)) +
      geom_point() +
      facet_grid(~ CHROM, scales = "free_x", space = "free_x") +
      theme(strip.text.x = element_text(size = 40),
            strip.background = element_rect(colour = 'lightblue', fill = 'lightblue'),
            legend.position = "top",
            legend.title = element_text(size = 40, colour = "darkblue"),
            legend.text = element_text(size = 40),
            legend.key.size = unit(2.5, "cm")) +
      guides(fill = guide_legend(title.position = "top",
                                 title = "Legend:GT='REF'+'ALT'"),
             shape = guide_legend(override.aes = list(size = 10))) +
      scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x, n = 10)) +
      scale_x_continuous(breaks = pretty_breaks(n = 3))
Solution
This was tougher than I expected! This should at least get you started, though:
    # It saves a lot of headaches to just make factors as you need them
    options(stringsAsFactors = FALSE)
    library(ggplot2)
    library(plyr)

    # Here's some made-up data - it always helps if you can post a subset of
    # your real data, though. The dput() function is really useful for that.
    dat <- data.frame(POS = seq(1, 1e7, by = 1e4))

    # Add random GT value
    dat$GT <- sample(x = c("CT", "AG", "GA", "TG", "TC"),
                     size = nrow(dat),
                     replace = TRUE)

    # Group by millions - there are several ways to do this that I can
    # never remember, but here's a simple way to split by millions
    dat$POSgroup <- floor(dat$POS / 1e6)

    # Add an arbitrary DIFF value
    dat$DIFF <- rnorm(n = nrow(dat),
                      mean = 200 * dat$POSgroup,
                      sd = 300)

    # Aggregate the data by GT and POS-group
    # Ideally, you'd do this inside of the plot using stat_summary,
    # but I couldn't get that to work. Using two datasets in a plot
    # is okay, though.
    datsum <- ddply(dat, .variables = "POSgroup", .fun = function(x) {
      # Calculate the mean DIFF value for each GT group in this POSgroup
      meandiff <- ddply(x, .variables = "GT", .fun = summarise, ymean = mean(DIFF))
      # Add the center of the POSgroup range as the x position
      meandiff$center <- (x$POSgroup[1] * 1e6) + 0.5e6
      # Return the results
      meandiff
    })

    # On the plot, these results will be grouped by both POS and GT - but
    # ggplot will only accept one vector for grouping. So make a combination.
    datsum$combogroup <- paste(datsum$GT, datsum$POSgroup)

    # Plot it
    ggplot() +
      # First, a layer for the points themselves
      # Large numbers of points can get pretty slow - you might try getting
      # the plot to work with a subsample (~1000) and then add in the rest of
      # your data
      geom_point(data = dat,
                 aes(x = POS, y = DIFF, color = as.factor(GT))) +
      # Then another layer for the means. There are a variety of geoms you could
      # use here, but crossbar with ymin and ymax set to the group mean
      # is a simple one
      geom_crossbar(data = datsum,
                    aes(x = center,
                        y = ymean,
                        ymin = ymean,
                        ymax = ymean,
                        color = as.factor(GT),
                        group = combogroup),
                    size = 1) +
      # Some other niceties
      scale_x_continuous(breaks = seq(0, 1e7, by = 1e6)) +
      labs(x = "POS", y = "DIFF", color = "GT") +
      theme_bw()
Which results in this:
There's probably a more straightforward way to do this, but I don't know it. Hope this helps.
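One possible way to handle the two problems raised in the question (identical mean lines in every panel, and lines filling the extra x range) is sketched below. When a layer's data has no column for the faceting variable, ggplot2 repeats that layer in every panel, so the sketch includes CHROM in the summary. It also swaps `geom_crossbar` for `geom_segment`, so each mean line spans only its own bin. The data here is made up, and the `POSbin`, `xstart` and `xend` column names are assumptions introduced for illustration.

```r
library(ggplot2)

set.seed(1)
# Made-up data: two chromosomes of different lengths (illustrative only)
dat <- data.frame(
  CHROM = rep(c("chr01", "chr02"), times = c(300, 200)),
  POS   = c(sort(sample(1:3e6, 300)), sort(sample(1:2e6, 200))),
  GT    = sample(c("CT", "AG", "GA"), 500, replace = TRUE)
)
# Strictly positive DIFF values so the log scale is valid
dat$DIFF <- round(rexp(nrow(dat), rate = 1 / 200)) + 1

# Bins of one million positions: 1..1e6 is bin 0, 1e6+1..2e6 is bin 1, ...
dat$POSbin <- (dat$POS - 1) %/% 1e6

# Summarise per CHROM as well, so every panel gets its own means
datsum <- aggregate(DIFF ~ CHROM + POSbin + GT, data = dat, FUN = mean)
datsum$xstart <- datsum$POSbin * 1e6 + 1
datsum$xend   <- (datsum$POSbin + 1) * 1e6

p <- ggplot(dat, aes(x = POS, y = DIFF, colour = GT)) +
  geom_point(alpha = 0.4) +
  # One horizontal segment per CHROM / bin / GT combination
  # (linewidth needs ggplot2 >= 3.4; use size = 1 on older versions)
  geom_segment(data = datsum,
               aes(x = xstart, xend = xend, y = DIFF, yend = DIFF),
               linewidth = 1) +
  facet_grid(~ CHROM, scales = "free_x", space = "free_x") +
  scale_y_log10()
p
```

Because `datsum` only contains bins that actually hold data for a given chromosome, shorter chromosomes do not pick up mean lines from bins they do not have.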