Looking for a better way to visualise a distribution in R and ggplot2


Question

I'd like to visualise the following data: a hotel observes that each year some of its customers are repeat customers. So, each year about half of all customers are first-time customers, 20% are second-time customers, and so on. Below is some R code that includes the data and a visualisation. However, I'm not happy with it and I'm looking for improvements:

  • R doesn't like colour bands with many colours, so maybe group the data?
  • Would a step curve be a better visualisation altogether?
  • The number of visits is treated as a factor. Is this the right approach?

Stacking bars makes it easy to compare first-time guests across years, but not the other groups. Should I pick a different visualisation?

#!/usr/bin/env Rscript

library(ggplot2)

d <- read.table(header=TRUE, text='
    year visit count
    2013 1 1641
    2013 2 604
    2013 3 256
    2013 4 89
    2013 5 32
    2013 6 10
    2013 7 4
    2013 8 3
    2014 1 1365
    2014 2 637
    2014 3 276
    2014 4 154
    2014 5 86
    2014 6 39
    2014 7 19
    2014 8 6
    2014 9 4
    2014 10 2
    2014 11 1
    2014 12 1
    2015 1 1251
    2015 2 608
    2015 3 288
    2015 4 143
    2015 5 88
    2015 6 52
    2015 7 21
    2015 8 8
    2015 9 8
    2015 10 3
    2015 11 2
    2015 12 1')

d$year  <- factor(d$year)
d$visit <- factor(d$visit)

p <- ggplot(d, aes(year,count))
p <- p + geom_bar(aes(fill=visit),position="fill",stat="identity")
p <- p + xlab("Year") + ylab("Distribution")
# pdf("returners.pdf",9,6)
print(p)
# dev.off()

Recommended answer

Why not visualize them like actual distributions?

p <- ggplot(d, aes(visit, count))
p <- p + geom_bar(stat="identity", width=0.75)
p <- p + scale_x_discrete(expand=c(0,0))
p <- p + scale_y_continuous(expand=c(0,0))
p <- p + facet_wrap(~year)
p <- p + labs(x=NULL, y="Visits")
p <- p + ggthemes::theme_tufte(base_family="Helvetica") 
p <- p + theme(legend.position="none")
p <- p + theme(panel.grid=element_line(color="#2b2b2b", size=0.15))
p <- p + theme(panel.grid.minor=element_blank())
p <- p + theme(panel.grid.major.x=element_blank())
p <- p + theme(axis.ticks=element_blank())
p <- p + theme(strip.text=element_text(hjust=0))
p <- p + theme(panel.spacing.x=unit(1, "cm"))  # panel.margin.x was renamed panel.spacing.x in ggplot2 2.2.0
p
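
The question also asks whether a step curve would work better and whether the visit number should be a factor. One possible reading, sketched here under assumptions: treat the visit number as numeric and plot the cumulative share of guests per year as a step curve. The names `hotel`, `cum_share` and `p_step` are made up for this sketch, and the data frame is just the question's counts rebuilt so the snippet runs on its own.

```r
library(ggplot2)

# The question's counts, rebuilt compactly (8 rows for 2013, 12 each for 2014 and 2015).
hotel <- data.frame(
  year  = rep(c(2013, 2014, 2015), times = c(8, 12, 12)),
  visit = c(1:8, 1:12, 1:12),
  count = c(1641, 604, 256, 89, 32, 10, 4, 3,
            1365, 637, 276, 154, 86, 39, 19, 6, 4, 2, 1, 1,
            1251, 608, 288, 143, 88, 52, 21, 8, 8, 3, 2, 1)
)

# Rows must be in visit order within each year before accumulating.
hotel <- hotel[order(hotel$year, hotel$visit), ]

# Cumulative share of guests within each year: running total / yearly total.
hotel$cum_share <- ave(hotel$count, hotel$year,
                       FUN = function(x) cumsum(x) / sum(x))

# visit stays numeric here, so the x-axis is a real scale rather than categories.
p_step <- ggplot(hotel, aes(visit, cum_share, colour = factor(year))) +
  geom_step() +
  scale_x_continuous(breaks = 1:12) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Visit number", y = "Cumulative share of guests", colour = "Year")
p_step
```

Each curve starts at the share of first-time guests and climbs to 100%, so a year with more repeat guests shows up as a curve that sits lower on the left.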

To see how the count for each visit number changes from year to year, you can just swap the facets:

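d$year  <- factor(d$year)
# visit may already be a factor (the question's code converts it), and "%d" needs an integer
d$visit <- sprintf("Visit: %d", as.integer(as.character(d$visit)))
d$visit <- factor(d$visit, levels=unique(d$visit))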

p <- ggplot(d, aes(year, count))
p <- p + geom_segment(aes(xend=year, yend=0), size=0.3)
p <- p + geom_point()
p <- p + scale_x_discrete(expand=c(0, 0.25))
p <- p + scale_y_continuous(label=scales::comma)
p <- p + facet_wrap(~visit, scales="free_y")
p <- p + labs(x="NOTE: Free y-axis scale", y="Count")
p <- p + ggthemes::theme_tufte(base_family="Helvetica") 
p <- p + theme(legend.position="none")
p <- p + theme(panel.grid=element_line(color="#2b2b2b", size=0.15))
p <- p + theme(panel.grid.minor=element_blank())
p <- p + theme(panel.grid.major.x=element_blank())
p <- p + theme(axis.ticks=element_blank())
p <- p + theme(strip.text=element_text(hjust=0))
p <- p + theme(panel.spacing=unit(1.5, "cm"))  # panel.margin was renamed panel.spacing in ggplot2 2.2.0
p
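
Twelve visit numbers mean twelve panels here, or twelve colours in the question's stacked bars. The question itself suggests grouping the data; a rough sketch of that idea lumps every visit number from 5 upward into one "5+" bucket before plotting. The cut-off at 5 and the names `hotel`, `visit_group`, `grouped` and `p_grouped` are assumptions of this sketch, not part of the original answer.

```r
library(ggplot2)

# The question's counts, rebuilt so the snippet runs on its own.
hotel <- data.frame(
  year  = rep(c(2013, 2014, 2015), times = c(8, 12, 12)),
  visit = c(1:8, 1:12, 1:12),
  count = c(1641, 604, 256, 89, 32, 10, 4, 3,
            1365, 637, 276, 154, 86, 39, 19, 6, 4, 2, 1, 1,
            1251, 608, 288, 143, 88, 52, 21, 8, 8, 3, 2, 1)
)

# Lump visit numbers 5 and above into a single bucket (the cut-off is arbitrary).
hotel$visit_group <- ifelse(hotel$visit >= 5, "5+", as.character(hotel$visit))
hotel$visit_group <- factor(hotel$visit_group,
                            levels = c("1", "2", "3", "4", "5+"))

# Sum the counts within each year and bucket.
grouped <- aggregate(count ~ year + visit_group, data = hotel, FUN = sum)

# Same stacked, normalised bars as in the question, now with five colours.
p_grouped <- ggplot(grouped, aes(factor(year), count, fill = visit_group)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Year", y = "Share of guests", fill = "Visit number")
p_grouped
```

The same `grouped` data frame could also feed the faceted plots, giving five panels instead of twelve.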

Or, you can look at year-over-year (YoY) growth for each visit number, as a percentage:

library(dplyr)

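# For each visit number, compare every year's count with the previous year's.
d <- d %>%
  group_by(visit) %>%
  arrange(year) %>%
  mutate(prev=lag(count),                             # previous year's count within the group
         chg_pct=(count-prev)/prev,                   # relative change
         chg_pct=ifelse(is.na(chg_pct), 0, chg_pct),  # first observed year has no predecessor; shown as 0
         pos=as.character(sign(chg_pct))) %>%         # "-1", "0" or "1", used for colouring
  ungroup()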

p <- ggplot(d, aes(year, chg_pct))
p <- p + geom_hline(yintercept=0, color="#2b2b2b", size=0.5)
p <- p + geom_segment(aes(xend=year, yend=0, color=pos), size=0.3)
p <- p + geom_point(aes(color=pos))
p <- p + scale_x_discrete(expand=c(0, 0.25))
p <- p + scale_y_continuous(label=scales::percent)
p <- p + scale_color_manual(values=c("-1"="#b2182b", "0"="#878787", "1"="#7fbc41"))  # named, so the mapping does not depend on sort order
p <- p + facet_wrap(~visit, scales="free_y")
p <- p + labs(x="NOTE: free y-axis", y="YoY % Difference per visit count")
p <- p + ggthemes::theme_tufte(base_family="Helvetica") 
p <- p + theme(legend.position="none")
p <- p + theme(panel.grid=element_line(color="#2b2b2b", size=0.15))
p <- p + theme(panel.grid.minor=element_blank())
p <- p + theme(panel.grid.major.x=element_blank())
p <- p + theme(axis.ticks=element_blank())
p <- p + theme(strip.text=element_text(hjust=0))
p <- p + theme(panel.spacing=unit(1.5, "cm"))  # panel.margin was renamed panel.spacing in ggplot2 2.2.0
p
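
To make the arithmetic behind `chg_pct` concrete, here is a small base-R sketch of the same (count - previous) / previous calculation, run on the question's counts rebuilt as a stand-alone data frame. The names `hotel` and `prev` are illustrative. Unlike the dplyr code above, it leaves the first observed year of each visit number as NA rather than 0. That is a choice of this sketch: it keeps "no earlier year to compare with" distinct from "no change".

```r
# The question's counts, rebuilt so the snippet runs on its own.
hotel <- data.frame(
  year  = rep(c(2013, 2014, 2015), times = c(8, 12, 12)),
  visit = c(1:8, 1:12, 1:12),
  count = c(1641, 604, 256, 89, 32, 10, 4, 3,
            1365, 637, 276, 154, 86, 39, 19, 6, 4, 2, 1, 1,
            1251, 608, 288, 143, 88, 52, 21, 8, 8, 3, 2, 1)
)

# Sort so that, within each visit number, years are in ascending order.
hotel <- hotel[order(hotel$visit, hotel$year), ]

# Previous year's count within each visit number (NA for the first observed year).
hotel$prev <- ave(hotel$count, hotel$visit,
                  FUN = function(x) c(NA, head(x, -1)))

# Relative year-over-year change; stays NA where there is no previous year.
hotel$chg_pct <- (hotel$count - hotel$prev) / hotel$prev

# First-time guests: (1365 - 1641) / 1641 for 2014, (1251 - 1365) / 1365 for 2015,
# i.e. roughly -16.8% and -8.4%.
print(hotel[hotel$visit == 1, c("year", "count", "prev", "chg_pct")])
```

Visit numbers 9 to 12 do not occur in 2013, so their first observed year is 2014 and only the 2015 change is defined for them.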
