Understanding dates and plotting a histogram with ggplot2 in R


Problem Description

    Main Question

    I'm having trouble understanding why the handling of dates, labels and breaks is not working as I expected in R when making a histogram with ggplot2.

    I'm looking for:

    • A histogram of the frequency of my dates
    • Tick marks centered under the matching bars
    • Date labels in %Y-%b format
    • Appropriate limits; minimized empty space between edge of grid space and outermost bars

    I've uploaded my data to pastebin to make this reproducible. I've created several columns as I wasn't sure the best way to do this:

    > dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
    > head(dates)
           YM       Date Year Month
    1 2008-Apr 2008-04-01 2008     4
    2 2009-Apr 2009-04-01 2009     4
    3 2009-Apr 2009-04-01 2009     4
    4 2009-Apr 2009-04-01 2009     4
    5 2009-Apr 2009-04-01 2009     4
    6 2009-Apr 2009-04-01 2009     4
    

    Here's what I tried:

    library(ggplot2)
    library(scales)
    dates$converted <- as.Date(dates$Date, format="%Y-%m-%d")
    
    ggplot(dates, aes(x=converted)) + geom_histogram() +
        opts(axis.text.x = theme_text(angle=90))
    

    Which yields this graph. I wanted %Y-%b formatting, though, so I hunted around and tried the following, based on this SO:

    ggplot(dates, aes(x=converted)) + geom_histogram() +
        scale_x_date(labels=date_format("%Y-%b"),
                     breaks = "1 month") +
        opts(axis.text.x = theme_text(angle=90))
    
    stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
    

    That gives me this graph

    • Correct x axis label format
    • The frequency distribution has changed shape (binwidth issue?)
    • Tick marks don't appear centered under bars
    • The xlims have changed as well

    I worked through the example in the ggplot2 documentation at the scale_x_date section and geom_line() appears to break, label, and center ticks correctly when I use it with my same x-axis data. I don't understand why the histogram is different.
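
One way to see what a calendar-aligned histogram should contain is to count per month outside ggplot2. The sketch below uses base R only and a small made-up vector of dates (not the pastebin data). `cut()` with `breaks = "month"` bins on calendar months, whereas the default `range/30` bins reported by stat_bin all have the same width in days and so need not line up with month boundaries.

```r
# Hypothetical sample: every record is dated on the first of a month,
# mirroring the layout of the Date column shown above.
d <- as.Date(c("2009-04-01", "2009-04-01", "2009-05-01", "2009-07-01"))

# cut() on Date objects accepts "month" and bins on calendar months,
# keeping months without any records as empty levels.
monthly <- table(cut(d, breaks = "month"))
print(monthly)

# Labels in the %Y-%b style asked for in the question
# (month abbreviations depend on the current locale).
month_labels <- format(as.Date(names(monthly)), "%Y-%b")
print(month_labels)
```

Comparing such a table with the plotted bars is a quick check on whether a bar has gone missing or two months have been merged.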


    Updates based on answers from edgester and gauden

    I initially thought gauden's answer helped me solve my problem, but am now puzzled after looking more closely. Note the differences between the two answers' resulting graphs after the code.

    Assume for both:

    library(ggplot2)
    library(scales)
    dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
    

    Based on @edgester's answer below, I was able to do the following:

    freqs <- aggregate(dates$Date, by=list(dates$Date), FUN=length)
    freqs$names <- as.Date(freqs$Group.1, format="%Y-%m-%d")
    
    ggplot(freqs, aes(x=names, y=x)) + geom_bar(stat="identity") +
           scale_x_date(breaks="1 month", labels=date_format("%Y-%b"),
                        limits=c(as.Date("2008-04-30"),as.Date("2012-04-01"))) +
           ylab("Frequency") + xlab("Year and Month") +
           theme_bw() + opts(axis.text.x = theme_text(angle=90))
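
In current ggplot2 releases `opts()` and `theme_text()` are no longer available; `theme()` and `element_text()` take their place. The following is a minimal sketch of the aggregated approach in that syntax. It uses a small made-up data frame instead of the pastebin file, and the column names `month` and `n` are hypothetical.

```r
library(ggplot2)

# Made-up stand-in for the pastebin data.
dates <- data.frame(Date = as.Date(c("2009-04-01", "2009-04-01",
                                     "2009-05-01", "2009-07-01")))

# Count records per distinct date, then turn the labels back into Dates.
freqs <- as.data.frame(table(month = dates$Date), responseName = "n")
freqs$month <- as.Date(as.character(freqs$month))

p <- ggplot(freqs, aes(x = month, y = n)) +
  geom_col() +  # equivalent to geom_bar(stat = "identity")
  scale_x_date(date_breaks = "1 month", date_labels = "%Y-%b") +
  ylab("Frequency") + xlab("Year and Month") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90))
print(p)
```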
    

    Here is my attempt based on gauden's answer:

    dates$Date <- as.Date(dates$Date)
    ggplot(dates, aes(x=Date)) + geom_histogram(binwidth=30, colour="white") +
           scale_x_date(labels = date_format("%Y-%b"),
                        breaks = seq(min(dates$Date)-5, max(dates$Date)+5, 30),
                        limits = c(as.Date("2008-05-01"), as.Date("2012-04-01"))) +
           ylab("Frequency") + xlab("Year and Month") +
           theme_bw() + opts(axis.text.x = theme_text(angle=90))
    

    Plot based on edgester's approach:

    Plot based on gauden's approach:

    Note the following:

    • gaps in gauden's plot for 2009-Dec and 2010-Mar; table(dates$Date) reveals that there are 19 instances of 2009-12-01 and 26 instances of 2010-03-01 in the data
    • edgester's plot starts at 2008-Apr and ends at 2012-May. This is correct based on a minimum value in the data of 2008-04-01 and a max date of 2012-05-01. For some reason gauden's plot starts in 2008-Mar and still somehow manages to end at 2012-May. After counting bins and reading along the month labels, for the life of me I can't figure out which plot has an extra or is missing a bin of the histogram!

    Any thoughts on the differences here? edgester's method of creating a separate count
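
One possible source of such gaps is that a fixed 30-day bin width drifts against calendar months, which are 28 to 31 days long. The base R sketch below is only an illustration. It assumes bins aligned to multiples of 30 days counted from 1970-01-01, which is an assumption about the bin origin and not something confirmed here for ggplot2. It then checks which bins receive no first-of-month date at all and which receive two.

```r
# First-of-month dates spanning the range mentioned in the question.
month_starts <- seq(as.Date("2008-04-01"), as.Date("2012-05-01"), by = "month")

# Assumed binning rule: 30-day bins aligned to multiples of 30 days
# since 1970-01-01 (Dates are stored as days since that origin).
bin_index <- as.numeric(month_starts) %/% 30

# Bins inside the covered range that contain no month start.
all_bins   <- seq(min(bin_index), max(bin_index))
empty_bins <- setdiff(all_bins, bin_index)

# Bins that contain more than one month start.
shared_bins <- unique(bin_index[duplicated(bin_index)])

# Left edge of each empty bin, as a date.
print(as.Date(empty_bins * 30, origin = "1970-01-01"))
# Month starts that share a bin with another month start.
print(month_starts[bin_index %in% shared_bins])
```

Under this assumption a bar can vanish for one month while a neighbouring bar absorbs two months, even though every month has data.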


    Related References

    As an aside, here are other locations that have information about dates and ggplot2 for passers-by looking for help:

    • Started here at learnr.wordpress, a popular R blog. It stated that I needed to get my data into POSIXct format, which I now think is false and wasted my time.
    • Another learnr post recreates a time series in ggplot2, but wasn't really applicable to my situation.
    • r-bloggers has a post on this, but it appears outdated. The simple format= option did not work for me.
    • This SO question is playing with breaks and labels. I tried treating my Date vector as continuous and don't think it worked so well. It looked like it was overlaying the same label text over and over so the letters looked kind of odd. The distribution is sort of correct but there are odd breaks. My attempt based on the accepted answer was like so (result here).

    Solution

    UPDATE

    Version 2: Using Date class

    I have updated the example to demonstrate aligning the labels and setting limits on the plot. It also shows that as.Date does indeed work when used consistently (it is probably a better fit for your data than my earlier example).

    The Target Plot v2

    The Code v2

    And here is (somewhat excessively) commented code:

    library("ggplot2")
    library("scales")
    
    dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
    dates$Date <- as.Date(dates$Date)
    
    # convert the Date to its numeric equivalent
    # Note that Dates are stored as number of days internally,
    # hence it is easy to convert back and forth mentally
    dates$num <- as.numeric(dates$Date)
    
    bin <- 60 # used for aggregating the data and aligning the labels
    
    p <- ggplot(dates, aes(num, ..count..))
    p <- p + geom_histogram(binwidth = bin, colour="white")
    
    # The numeric data is treated as a date,
    # breaks are set to an interval equal to the binwidth,
    # and a set of labels is generated and adjusted in order to align with bars
    p <- p + scale_x_date(breaks = seq(min(dates$num)-20, # change -20 term to taste
                                       max(dates$num), 
                                       bin),
                          labels = date_format("%Y-%b"),
                          limits = c(as.Date("2009-01-01"), 
                                     as.Date("2011-12-01")))
    
    # from here, format at ease
    p <- p + theme_bw() + xlab(NULL) + opts(axis.text.x  = theme_text(angle=45,
                                                                      hjust = 1,
                                                                      vjust = 1))
    p
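
For readers on a recent ggplot2, the same idea can be sketched with the newer syntax: `after_stat(count)` in place of `..count..`, the `date_breaks` and `date_labels` arguments of `scale_x_date()`, and `theme()` with `element_text()` in place of `opts()` with `theme_text()`. This is a rough equivalent rather than a line-for-line port. The data are randomly generated stand-ins for the pastebin file, and the Date column is mapped directly instead of going through a numeric copy.

```r
library(ggplot2)

# Made-up data: 200 records, each dated on the first of a month.
set.seed(1)
month_starts <- seq(as.Date("2009-01-01"), as.Date("2011-12-01"), by = "month")
dates <- data.frame(Date = sample(month_starts, 200, replace = TRUE))

bin <- 60  # bin width in days, as in the code above

p <- ggplot(dates, aes(x = Date, y = after_stat(count))) +
  geom_histogram(binwidth = bin, colour = "white") +
  scale_x_date(date_breaks = "3 months", date_labels = "%Y-%b") +
  theme_bw() + xlab(NULL) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
print(p)
```

No limits are set in this sketch, so every record lands in some bin; adding `limits` to the scale would drop the records that fall outside them.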
    

    Version 1: Using POSIXct

    This is an attempt at a solution that does everything in ggplot2, drawing without the aggregation and setting the limits on the x-axis between the beginning of 2009 and the end of 2011.

    The Target Plot v1

    The Code v1

    library("ggplot2")
    library("scales")
    
    dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
    dates$Date <- as.POSIXct(dates$Date)
    
    p <- ggplot(dates, aes(Date, ..count..)) + 
        geom_histogram() +
        theme_bw() + xlab(NULL) +
        scale_x_datetime(breaks = date_breaks("3 months"),
                         labels = date_format("%Y-%b"),
                         limits = c(as.POSIXct("2009-01-01"), 
                                    as.POSIXct("2011-12-01")) )
    
    p
    

    Of course, the label options on the axis could still be tuned, but the point here is to do all the plotting with a short, clean routine inside the plotting package.
