SuperImpose Histogram fits in one plot ggplot


Problem Description


I have about 5 very large vectors (~108 MM entries each), so any plot or computation I do with them in R takes quite a long time.

I am trying to visualize their distributions (histograms), and was wondering what would be the best way to superimpose the histograms in R without taking too long. I am thinking of first fitting a distribution to each histogram, and then plotting all the fitted distribution lines together in one plot.

Do you have some suggestions on how to do that?

Let us say my vectors are:

x1, x2, x3, x4, x5.

I am trying to use this code: Overlaying histograms with ggplot2 in R

Example of the code I am using for 3 vectors (R fails to do the plot):

library(ggplot2)

n <- length(x1)
dat <- data.frame(xx = c(x1, x2, x3), yy = rep(letters[1:3], each = n))
ggplot(dat, aes(x = xx)) +
    geom_histogram(data = subset(dat, yy == 'a'), fill = "red", alpha = 0.2) +
    geom_histogram(data = subset(dat, yy == 'b'), fill = "blue", alpha = 0.2) +
    geom_histogram(data = subset(dat, yy == 'c'), fill = "green", alpha = 0.2)

But it takes forever to produce the plot, and eventually the R session terminates. Any ideas on how to use ggplot2 efficiently for large vectors? It seems to me that I would have to create a data frame of 5 * 108 MM entries and then plot it, which is highly inefficient in my case.

Thanks!

Solution

Here's a little snippet of Rcpp that bins data very efficiently - on my computer it takes about a second to bin 100,000,000 observations:

library(Rcpp)
cppFunction('
  std::vector<int> bin3(NumericVector x, double width, double origin = 0) {
    int bin, nmissing = 0;
    std::vector<int> out;

    NumericVector::iterator x_it = x.begin();
    for(; x_it != x.end(); ++x_it) {
      double val = *x_it;
      if (ISNAN(val)) {
        ++nmissing;
      } else {
        bin = (val - origin) / width;
        if (bin < 0) continue;

        // Make sure there\'s enough space
        if (bin >= out.size()) {
          out.resize(bin + 1);
        }
        ++out[bin];
      }
    }

    // Put missing values in the last position
    out.push_back(nmissing);
    return out;
  }
')

x8 <- runif(1e8)
system.time(bin3(x8, 1/100))
#   user  system elapsed 
#  1.373   0.000   1.373 
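
For readers who cannot compile Rcpp code, roughly the same fixed-width binning can be sketched in base R with tabulate(). The helper name bin_r is hypothetical and not part of the original answer. It uses floor(), so it should agree with bin3 for values at or above origin, and it is likely to be slower and more memory-hungry than the C++ version on 1e8 values.

```r
# Hypothetical base-R counterpart of bin3 (an illustration, not the author's code).
bin_r <- function(x, width, origin = 0) {
  nmissing <- sum(is.na(x))                 # NA and NaN are counted separately
  # Zero-based bin index of every non-missing value
  idx <- floor((x[!is.na(x)] - origin) / width)
  idx <- idx[idx >= 0]                      # drop values below the origin
  nbins <- if (length(idx)) max(idx) + 1 else 0
  # tabulate() counts 1-based integer codes; missing count goes last, as in bin3
  c(tabulate(idx + 1, nbins = nbins), nmissing)
}

# Small usage example on simulated data (the sample size is an assumption)
set.seed(1)
counts <- bin_r(runif(1e5), 1/100)
length(counts)
```

The last element of the result is the number of missing values, so it has to be dropped before plotting.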

That said, hist is pretty fast here too:

system.time(hist(x8, breaks = 100, plot = FALSE))
#   user  system elapsed 
#  7.281   1.362   8.669 
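
If hist is fast enough, the overlay can be done without building one giant data frame: bin each vector separately with shared breaks and keep only the counts. The following is a minimal sketch in base graphics; the simulated vecs list stands in for x1 to x5 and is an assumption, as is the choice of 100 bins.

```r
set.seed(42)
# Hypothetical stand-ins for x1..x5; the real vectors are far larger.
vecs <- replicate(5, rnorm(1e5), simplify = FALSE)
names(vecs) <- paste0("x", 1:5)

# Shared breaks that cover every vector, so the histograms are comparable.
rng <- range(unlist(lapply(vecs, range)))
breaks <- seq(floor(rng[1]), ceiling(rng[2]), length.out = 101)

# One hist() call per vector; only the 100 counts per vector are kept.
counts <- sapply(vecs, function(v) hist(v, breaks = breaks, plot = FALSE)$counts)
mids <- (head(breaks, -1) + tail(breaks, -1)) / 2

# Overlay the five frequency polygons in a single plot.
matplot(mids, counts, type = "l", lty = 1, col = 1:5,
        xlab = "value", ylab = "count")
legend("topright", legend = names(vecs), col = 1:5, lty = 1)
```

Only a 100 x 5 matrix of counts reaches the plotting step, regardless of how long the input vectors are.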

It's straightforward to use bin3 to make a histogram or frequency polygon:

# First we create some sample data, and bin each column

library(reshape2)
library(ggplot2)

df <- as.data.frame(replicate(5, runif(1e6)))
bins <- vapply(df, bin3, 1/100, FUN.VALUE = integer(100 + 1))

# Next we match up the bins with the breaks
binsdf <- data.frame(
  # Left edge of each of the 100 bins of width 1/100; NA marks the missing-value count
  breaks = c((0:99) / 100, NA),
  bins)

# Then melt and plot
binsm <- subset(melt(binsdf, id = "breaks"), !is.na(breaks))
qplot(breaks, value, data = binsm, geom = "line", colour = variable)
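
To my knowledge qplot() is deprecated in recent ggplot2 releases, so the final step may be easier to maintain with ggplot() and geom_line(). The sketch below also builds the long-format data frame with base R instead of reshape2. The simulated counts matrix is an assumption standing in for the output of the binning step, and the plot is only drawn if ggplot2 is installed.

```r
set.seed(1)
# Hypothetical pre-binned counts: 100 bins of width 0.01 for 5 vectors.
breaks <- seq(0, 1, by = 0.01)
counts <- sapply(1:5, function(i) {
  hist(runif(1e4), breaks = breaks, plot = FALSE)$counts
})
colnames(counts) <- paste0("V", 1:5)

# Long format: one row per (bin, vector) pair, in column-major order.
binsm <- data.frame(
  breaks   = rep(head(breaks, -1), times = ncol(counts)),
  variable = rep(colnames(counts), each = nrow(counts)),
  value    = as.vector(counts)
)

# Draw one line per vector, coloured by vector name.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(binsm, aes(x = breaks, y = value, colour = variable)) +
    geom_line()
  print(p)
}
```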

FYI, the reason I had bin3 on hand is that I'm working on how to make this speed the default in ggplot2 :)
