R/ggplot Cumulative Sum in Histogram


Problem description


I have a dataset with user IDs and the number of objects each user created. I drew the histogram using ggplot, and now I'm trying to include the cumulative sum of the x-values as a line. The aim is to see how much the bins contribute to the total number. I tried the following:

ggplot(data=userStats,aes(x=Num_Tours)) + geom_histogram(binwidth = 0.2)+
   scale_x_log10(name = 'Number of planned tours',breaks=c(1,5,10,50,100,200))+
   geom_line(aes(x=Num_Tours, y=cumsum(Num_Tours)/sum(Num_Tours)*3500),color="red")+
   scale_y_continuous(name = 'Number of users', sec.axis = sec_axis(~./3500, name = "Cumulative percentage of routes [%]"))

This does not work because I don't include any bins. (The resulting plot is not preserved here.)

I also tried:

ggplot(data=userStats,aes(x=Num_Tours)) + geom_histogram(binwidth = 0.2)+
   scale_x_log10(name = 'Number of planned tours',breaks=c(1,5,10,50,100,200))+
   stat_bin(aes(y=cumsum(..count..)),binwidth = 0.2, geom="line",color="red")+
   scale_y_continuous(name = 'Number of users', sec.axis = sec_axis(~./3500, name = "Cumulative percentage of routes [%]"))

This produces a second plot (image not preserved here).

Here the cumsum of the count is considered. What I want is the cumsum of count * value of the bin, normalized so that it can be displayed in the same plot. (The example image of the desired result is not preserved here.)

I would appreciate any input! Thanks

Edit: As test data, this should work:

userID <- c(1:100)
Num_Tours <- sample(1:100,100)
userStats <- data.frame(userID,Num_Tours)
userStats$cumulative <- cumsum(userStats$Num_Tours/sum(userStats$Num_Tours))

Solution

Here is an illustrative example that could be helpful for you.

set.seed(111)
userID <- c(1:100)
Num_Tours <- sample(1:100, 100, replace = TRUE)
userStats <- data.frame(userID, Num_Tours)

# Sort the x data so the cumulative share increases along the x-axis
userStats$Num_Tours <- sort(userStats$Num_Tours)
userStats$cumulative <- cumsum(userStats$Num_Tours / sum(userStats$Num_Tours))

library(ggplot2)
# Manually fix the maximum value of the primary y-axis; the cumulative
# share (0 to 1) is multiplied by ymax so the line uses the same range
ymax <- 40
ggplot(data = userStats, aes(x = Num_Tours)) +
   geom_histogram(binwidth = 0.2, col = "white") +
   scale_x_log10(name = 'Number of planned tours', breaks = c(1, 5, 10, 50, 100, 200)) +
   geom_line(aes(x = Num_Tours, y = cumulative * ymax), col = "red", lwd = 1) +
   # The secondary axis divides by ymax again to recover the cumulative share
   scale_y_continuous(name = 'Number of users', sec.axis = sec_axis(~ . / ymax,
    name = "Cumulative percentage of routes [%]"))
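
The line in the solution above accumulates over individual users rather than over histogram bins. If the per-bin quantity from the question is wanted (the cumulative sum of the tours falling into each bin, normalized to 1), it can be computed before plotting. The following is only a minimal sketch under stated assumptions: the names `log_x`, `bin_edges`, `binned`, `center`, and `cum_share` are made up for illustration, and the bin edges are chosen by hand to be centred on multiples of the bin width in log10 space, which is intended to mimic ggplot2's default placement but is not guaranteed to match it exactly.

```r
library(ggplot2)

set.seed(111)
userStats <- data.frame(userID = 1:100,
                        Num_Tours = sample(1:100, 100, replace = TRUE))

# Bin in log10 space with the same width as the histogram.
# Assumption: bins are centred on multiples of binwidth.
binwidth <- 0.2
log_x <- log10(userStats$Num_Tours)
bin_edges <- seq(floor(min(log_x) / binwidth) - 0.5,
                 ceiling(max(log_x) / binwidth) + 0.5) * binwidth
bin <- cut(log_x, breaks = bin_edges)  # right-closed intervals

# Sum of tours per bin; empty bins come back as NA and are set to 0
tours_per_bin <- as.vector(tapply(userStats$Num_Tours, bin, sum))
tours_per_bin[is.na(tours_per_bin)] <- 0

binned <- data.frame(
  center = 10^(head(bin_edges, -1) + binwidth / 2),  # back to data units
  users  = as.vector(table(bin)),
  tours  = tours_per_bin
)
# Cumulative share of all tours contributed up to and including each bin
binned$cum_share <- cumsum(binned$tours) / sum(binned$tours)

# Scale the share (0 to 1) to the height of the tallest bar
ymax <- max(binned$users)
p <- ggplot(userStats, aes(x = Num_Tours)) +
  geom_histogram(binwidth = binwidth, col = "white") +
  geom_line(data = binned, aes(x = center, y = cum_share * ymax),
            col = "red") +
  scale_x_log10(name = "Number of planned tours",
                breaks = c(1, 5, 10, 50, 100, 200)) +
  scale_y_continuous(name = "Number of users",
                     sec.axis = sec_axis(~ . / ymax * 100,
                       name = "Cumulative percentage of routes [%]"))
print(p)
```

Here `ymax` is taken from the tallest bin instead of being fixed by hand, and the secondary axis multiplies by 100 so that its values agree with the `[%]` label. Both are design choices of this sketch, not part of the original answer.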
