SuperImpose Histogram fits in one plot ggplot
Problem description
I have about 5 very large vectors (roughly 108 million entries each), so any plotting or processing I do with them in R takes quite a long time.
I am trying to visualize their distributions (histograms), and was wondering what the best way would be to superimpose the histograms in R without it taking too long. I am thinking of first fitting a distribution to each histogram, and then plotting all the fitted distribution lines together in one plot.
Do you have some suggestions on how to do that?
Let us say my vectors are:
x1, x2, x3, x4, x5.
I am trying to use the code from "Overlaying histograms with ggplot2 in R".
Here is an example of the code I am using for 3 vectors (R fails to produce the plot):
library(ggplot2)
n <- length(x1)
# Stack the vectors into one long data frame with a group label
dat <- data.frame(xx = c(x1, x2, x3), yy = rep(letters[1:3], each = n))
ggplot(dat, aes(x = xx)) +
  geom_histogram(data = subset(dat, yy == 'a'), fill = "red", alpha = 0.2) +
  geom_histogram(data = subset(dat, yy == 'b'), fill = "blue", alpha = 0.2) +
  geom_histogram(data = subset(dat, yy == 'c'), fill = "green", alpha = 0.2)
But it takes forever to produce the plot, and eventually it kicks me out of R. Any ideas on how to use ggplot2 efficiently with large vectors? It seems to me that I have to create a data frame of 5 * 108 million entries and then plot it, which is highly inefficient in my case.
Thanks!
Here's a little snippet of Rcpp that bins data very efficiently - on my computer it takes about a second to bin 100,000,000 observations:
library(Rcpp)
cppFunction('
  std::vector<int> bin3(NumericVector x, double width, double origin = 0) {
    int bin, nmissing = 0;
    std::vector<int> out;
    NumericVector::iterator x_it = x.begin();
    for(; x_it != x.end(); ++x_it) {
      double val = *x_it;
      if (ISNAN(val)) {
        ++nmissing;
      } else {
        bin = (val - origin) / width;
        if (bin < 0) continue;
        // Make sure there\'s enough space
        if (bin >= out.size()) {
          out.resize(bin + 1);
        }
        ++out[bin];
      }
    }
    // Put missing values in the last position
    out.push_back(nmissing);
    return out;
  }
')
x8 <- runif(1e8)
system.time(bin3(x8, 1/100))
#    user  system elapsed
#   1.373   0.000   1.373
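For readers without a C++ toolchain, the same fixed-width binning can be sketched in base R. This is only an illustration, not the code timed above: `bin_r` is a hypothetical helper name, and it uses `floor()` where `bin3` relies on integer truncation, so the two should differ only for values just below `origin`. It is likely to be slower and to use more memory on very large vectors.

```r
# Base R sketch of fixed-width binning (bin_r is a hypothetical name)
bin_r <- function(x, width, origin = 0) {
  # Count missing values separately, as bin3 does
  nmissing <- sum(is.na(x))
  # Zero-based bin index for every non-missing value
  idx <- floor((x[!is.na(x)] - origin) / width)
  # Values below the origin are dropped
  idx <- idx[idx >= 0]
  # tabulate() counts positive integers, so shift the index by one
  counts <- tabulate(idx + 1)
  # Missing-value count goes in the last position
  c(counts, nmissing)
}

bin_r(c(0.5, 1.5, 1.5, 2.5, NA, -1), width = 1)
# 1 2 1 1
```

The last element is the number of missing values, so drop it before plotting the counts.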
That said, hist is pretty fast here too:
system.time(hist(x8, breaks = 100, plot = FALSE))
#    user  system elapsed
#   7.281   1.362   8.669
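With `plot = FALSE`, `hist()` returns the binned summary instead of drawing anything, so the counts can be reused for plotting later. A small sketch on a smaller sample (the sample size and the explicit breaks are my assumptions, not part of the timing above); passing a vector of breaks rather than a single number matters because a single number is only treated as a suggestion:

```r
set.seed(1)
x <- runif(1e5)

# Explicit breaks give exactly 100 bins of width 0.01 on [0, 1]
h <- hist(x, breaks = seq(0, 1, by = 0.01), plot = FALSE)

length(h$counts)  # number of bins
sum(h$counts)     # every observation is counted exactly once
head(h$mids)      # bin midpoints, usable as x positions in a plot
```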
It's straightforward to use bin3 to make a histogram or frequency polygon:
# First we create some sample data, and bin each column
library(reshape2)
library(ggplot2)
df <- as.data.frame(replicate(5, runif(1e6)))
bins <- vapply(df, bin3, 1/100, FUN.VALUE = integer(100 + 1))
# Next we match up the bins with the breaks (the left edge of each bin);
# the final NA row holds the missing-value counts
binsdf <- data.frame(
  breaks = c(seq(0, 0.99, length.out = 100), NA),
  bins)
# Then melt and plot, dropping the missing-value row
binsm <- subset(melt(binsdf, id.vars = "breaks"), !is.na(breaks))
qplot(breaks, value, data = binsm, geom = "line", colour = variable)
FYI, the reason I had bin3 on hand is that I'm working on how to make this speed the default in ggplot2 :)
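Applied to the question's setup, the same idea means binning each vector on its own and plotting only the small summary, so the long data frame is never built. A rough sketch, assuming three stand-in vectors (their sizes and distributions are made up for illustration) and using `hist()` so that it runs without Rcpp; densities rather than counts are plotted so that vectors of different lengths stay comparable:

```r
# Stand-ins for the asker's vectors; sizes and distributions are assumptions
set.seed(42)
vecs <- list(x1 = rnorm(1e5), x2 = rnorm(1e5, mean = 1), x3 = rnorm(1e5, sd = 2))

# One shared set of breaks covering the range of every vector
rng <- range(unlist(lapply(vecs, range, na.rm = TRUE)))
breaks <- seq(rng[1], rng[2], length.out = 101)

# Bin each vector separately; only 100 rows per vector are kept
binned <- do.call(rbind, lapply(names(vecs), function(nm) {
  h <- hist(vecs[[nm]], breaks = breaks, plot = FALSE)
  data.frame(variable = nm, mid = h$mids, density = h$density)
}))

# Draw one line per vector if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(binned, aes(x = mid, y = density, colour = variable)) +
    geom_line()
  print(p)
}
```

Swapping `geom_line()` for `geom_col(position = "identity", alpha = 0.2)` with a `fill` aesthetic should give overlaid translucent bars closer to the plot in the question.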