R, ggplot: separate mean by range of x value


Question


I have a set of data that looks like this:

      CHROM   POS GT DIFF
    1 chr01 14653 CT 254
    2 chr01 14907 AG 254
    3 chr01 14930 AG 23
    4 chr01 15190 GA 260
    5 chr01 15211 TG 21
    6 chr01 16378 TC 1167
    

POS ranges from 1xxxx to 1xxxxxxx, and CHROM is a categorical variable that contains the values "chr01" to "chr22" and "chrX".

I want to draw a scatterplot:

• y (DIFF) vs. x (POS)
• with panels separated by CHROM
• grouped by GT (a different colour for each GT)

I'm creating a ggplot with a running average (though this is not time series data).

What I want is the average of DIFF for every 1,000,000-wide range of POS, by GT.

For example:

for x in the range 1 ~ 1,000,000, DIFF average = _____

for x in the range 1,000,001 ~ 2,000,000, DIFF average = _____

I want to plot these averages as horizontal lines on the ggplot (coloured by GT).
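The binning described above can be sketched in base R. This is only a minimal illustration under stated assumptions: the toy values are made up, `data1` is the data frame name used in the question's code, and the helper columns `bin`, `xstart`, and `xend` are hypothetical names introduced here. The bin index is computed as `(POS - 1) %/% 1e6` so that POS = 1,000,000 still falls in the first range, as in the example ranges.

```r
# Toy data in the same shape as the question's table (values are made up).
data1 <- data.frame(
  CHROM = c("chr01", "chr01", "chr01", "chr01", "chr02", "chr02"),
  POS   = c(14653, 14907, 1200000, 1900000, 500, 999999),
  GT    = c("CT", "CT", "CT", "CT", "AG", "AG"),
  DIFF  = c(100, 300, 10, 30, 5, 15)
)

# Bin index: 0 for POS 1..1,000,000, 1 for 1,000,001..2,000,000, and so on.
data1$bin <- (data1$POS - 1) %/% 1e6

# Mean DIFF per chromosome, genotype, and bin.
bin_means <- aggregate(DIFF ~ CHROM + GT + bin, data = data1, FUN = mean)

# Start and end of each bin, usable later as the ends of a horizontal line.
bin_means$xstart <- bin_means$bin * 1e6 + 1
bin_means$xend   <- (bin_means$bin + 1) * 1e6

print(bin_means)
```

Grouping by CHROM as well as GT keeps the averages of different chromosomes apart, which matters once the plot is split into panels.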


What I have so far, before applying your function:

After applying your function:

[image: https://i.stack.imgur.com/4w5PU.jpg]

I tried to apply your solution to what I already have, and ran into some problems:

• There are different panels, so the mean values differ from panel to panel, but when I apply your code, the horizontal mean lines in every panel are identical to those in the first panel.
• The panels have different x-axis ranges, so when I apply your function, it automatically fills the extra range with the previous horizontal mean line.

Here is my code from before:

    library(ggplot2)
    library(scales)  # provides trans_breaks() and pretty_breaks()

    ggplot(data1, aes(x=POS,y=DIFF,colour=GT)) +
      geom_point() +
      facet_grid(~ CHROM,scales="free_x",space="free_x") + 
      theme(strip.text.x = element_text(size=40),
            strip.background = element_rect(color='lightblue',fill='lightblue'),
            legend.position="top",
            legend.title = element_text(size=40,colour="darkblue"),
            legend.text = element_text(size=40),
            legend.key.size = unit(2.5, "cm")) +
      guides(fill = guide_legend(title.position="top",
                                 title = "Legend:GT='REF'+'ALT'"),
             shape = guide_legend(override.aes=list(size=10))) +
      scale_y_log10(breaks=trans_breaks("log10", function(x) 10^x, n=10)) + 
      scale_x_continuous(breaks = pretty_breaks(n=3))
    

Solution

This was tougher than I expected! This should at least get you started, though:

    # It saves a lot of headaches to just make factors as you need them
    options(stringsAsFactors = FALSE)
    
    
    
    library(ggplot2)
    library(plyr)
    
    # Here's some made-up data - it always helps if you can post a subset of
    # your real data, though. The dput() function is really useful for that.
    dat <- data.frame(POS = seq(1, 1e7, by = 1e4))
    
    
    # Add random GT value
    dat$GT <- sample(x = c("CT", "AG", "GA", "TG", "TC"),
                     size = nrow(dat),
                     replace = TRUE)
    
    # Group by millions - there are several ways to do this that I can 
    # never remember, but here's a simple way to split by millions
    dat$POSgroup <- floor(dat$POS / 1e6)
    
    
    # Add an arbitrary DIFF value
    dat$DIFF <- rnorm(n = nrow(dat),
                      mean = 200 * dat$POSgroup,
                      sd = 300)
    
    
    
    # Aggregate the data by GT and POS-group
    # Ideally, you'd do this inside of the plot using stat_summary,
    # but I couldn't get that to work. Using two datasets in a plot 
    # is okay, though.
    datsum <- ddply(dat, .variables = "POSgroup", .fun = function(x) {
    
        # Calculate the mean DIFF value for each GT group in this POSgroup
        meandiff <- ddply(x, .variables = "GT", .fun = summarise, ymean = mean(DIFF))
    
        # Add the center of the POSgroup range as the x position
        meandiff$center <- (x$POSgroup[1] * 1e6) + 0.5e6
    
        # Return the results
        meandiff
    
    })
    
    
    # On the plot, these results will be grouped by both POS and GT - but
    # ggplot will only accept one vector for grouping. So make a combination.
    datsum$combogroup <- paste(datsum$GT, datsum$POSgroup)
    
    
    # Plot it
    ggplot() +
    
        # First, a layer for the points themselves
        # Large numbers of points can get pretty slow - you might try getting
        # the plot to work with a subsample (~1000) and then add in the rest of
        # your data
        geom_point(data = dat, 
                   aes(x = POS, y = DIFF, color = as.factor(GT))) +
    
        # Then another layer for the means. There are a variety of geoms you could
        # use here, but crossbar with ymin and ymax set to the group mean
        # is a simple one
        geom_crossbar(data = datsum, aes(x = center, 
                                         y = ymean, 
                                         ymin = ymean,  # same value as y, so the bar collapses to a line
                                         ymax = ymean, 
                                         color = as.factor(GT),
                                         group = combogroup),
                      size = 1) +
    
    
        # Some other niceties
        scale_x_continuous(breaks = seq(0, 1e7, by = 1e6)) +
        labs(x = "POS", y = "DIFF", color = "GT") +
        theme_bw()
    

Which results in this:

There's probably a more straightforward way to do this, but I don't know it. Hope this helps.
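For the faceted case raised in the question, a possible adaptation is sketched below. It is not the answer's own method: it swaps in base R `aggregate()` and `geom_segment()` in place of `ddply()` and `geom_crossbar()`, and the simulated data and the column names `bin`, `xstart`, and `xend` are assumptions made for illustration. The idea rests on how ggplot2 facets work: a layer whose data has no CHROM column is repeated in every panel, which would explain identical mean lines in all panels, so the summary table here keeps CHROM as a grouping column.

```r
library(ggplot2)

set.seed(1)

# Simulated stand-in for the real data: two chromosomes of different length.
dat <- data.frame(
  CHROM = rep(c("chr01", "chr02"), times = c(300, 200)),
  POS   = c(sort(sample(1:3e6, 300)), sort(sample(1:2e6, 200))),
  GT    = sample(c("CT", "AG", "GA"), 500, replace = TRUE)
)
dat$DIFF <- round(10^runif(500, min = 1, max = 3))  # positive, for the log axis

# Bin POS into 1,000,000-wide ranges and average DIFF per CHROM, GT, and bin.
dat$bin <- (dat$POS - 1) %/% 1e6
datsum <- aggregate(DIFF ~ CHROM + GT + bin, data = dat, FUN = mean)
datsum$xstart <- datsum$bin * 1e6 + 1
datsum$xend   <- (datsum$bin + 1) * 1e6

# Points plus one horizontal segment per (CHROM, GT, bin) mean.
# Because datsum contains CHROM, each panel only receives its own segments.
p <- ggplot(dat, aes(x = POS, y = DIFF, colour = GT)) +
  geom_point(alpha = 0.4) +
  geom_segment(data = datsum,
               aes(x = xstart, xend = xend, y = DIFF, yend = DIFF, colour = GT)) +
  facet_grid(~ CHROM, scales = "free_x", space = "free_x") +
  scale_y_log10()

print(p)
```

Since the means are computed separately for each chromosome, a panel only gets segments for bins that actually contain data on that chromosome, so a shorter chromosome is not padded with lines carried over from another panel. The segments mark arithmetic means, which are then simply drawn at their position on the log-scaled axis.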
