转换数据框以在ggplot2中制作瀑布图 [英] Convert Dataframe to make Waterfall Chart in ggplot2

查看:99
本文介绍了转换数据框以在ggplot2中制作瀑布图的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想将数据框转换为适合瀑布图的格式.

I want to transform my dataframe into a format that would be suitable for a waterfall chart.

我的数据框如下:

employee <- c('A','B','C','D','E','F', 
              'A','B','C','D','E','F',
              'A','B','C','D','E','F',
              'A','B','C','D','E','F')
revenue <- c(10, 20, 30, 40, 10, 40, 
              8, 10, 20, 50, 20, 10,
              2,  5, 70, 30, 10, 50,
             40,  8, 30, 40, 10, 40)
date <- as.Date(c('2017-03-01','2017-03-01','2017-03-01',
                  '2017-03-01','2017-03-01','2017-03-01',
                  '2017-03-02','2017-03-02','2017-03-02',
                  '2017-03-02','2017-03-02','2017-03-02',
                  '2017-03-03','2017-03-03','2017-03-03',
                  '2017-03-03','2017-03-03','2017-03-03',
                  '2017-03-04','2017-03-04','2017-03-04',
                  '2017-03-04','2017-03-04','2017-03-04'))
df <- data.frame(date,employee,revenue)

         date employee revenue
1  2017-03-01        A      10
2  2017-03-01        B      20
3  2017-03-01        C      30
4  2017-03-01        D      40
5  2017-03-01        E      10
6  2017-03-01        F      40
7  2017-03-02        A       8
8  2017-03-02        B      10
9  2017-03-02        C      20
10 2017-03-02        D      50
11 2017-03-02        E      20
12 2017-03-02        F      10
13 2017-03-03        A       2
14 2017-03-03        B       5
15 2017-03-03        C      70
16 2017-03-03        D      30
17 2017-03-03        E      10
18 2017-03-03        F      50
19 2017-03-04        A      40
20 2017-03-04        B       8
21 2017-03-04        C      30
22 2017-03-04        D      40
23 2017-03-04        E      10
24 2017-03-04        F      40

如何转换此数据框,以便可以将其转换为ggplot2中瀑布图的形式?

How do I transform this dataframe so that I can get it into a form for a waterfall chart in ggplot2?

amount列是与员工的总天数之差.

The amount column is the difference from the total day by employee.

end列是start列减去amount列.

start列是前一天的Total最终值.

The start column is the Total end values from previous day.

最终数据框应如下所示:

Final dataframe should look like this:

         date employee     start    end    amount    total_for_day
1  2017-03-01        A         0     10        10               10
2  2017-03-01        B         0     20        20               20
3  2017-03-01        C         0     30        30               30
4  2017-03-01        D         0     40        40               40
5  2017-03-01        E         0     10        10               10
6  2017-03-01        F         0     40        40               40
7  2017-03-01    Total         0    150       150              150
8  2017-03-02        A       150    148        -2                8
9  2017-03-02        B       150    140       -10               10
10 2017-03-02        C       150    140       -10               20
11 2017-03-02        D       150    160        10               50 
12 2017-03-02        E       150    160        10               20
13 2017-03-02        F       150    120       -30               10  
14 2017-03-02    Total       150    118       -32               98
15 2017-03-03        A       118    112        -6                2                      
16 2017-03-03        B       118    113        -5                5                  
17 2017-03-03        C       118    168        50               70
18 2017-03-03        D       118     98       -20               30  
19 2017-03-03        E       118    108       -10               10  
20 2017-03-03        F       118    158        40               50
21 2017-03-03    Total       118    167        49              170  
22 2017-03-04        A       167    205        38               40
23 2017-03-04        B       167    170         3                8
24 2017-03-04        C       167    127       -40               30
25 2017-03-04        D       167    177        10               40
26 2017-03-04        E       167    167         0               10
27 2017-03-04        F       167    157       -10               40 
28 2017-03-04    Total       167    168         1              168

推荐答案

有一些步骤可以帮助您实现这一目标,并且我认为dplyr软件包会有所帮助(在下面大量使用).

There are a few steps to get you to this, and I think that the dplyr package will help (used heavily below).

我的理解是revenue给出的是累计总收入,而不是每日变化.如果那是错误的,则需要逆转其中一些计算.

My understanding is that revenue gives the cumulative total revenue, rather than the daily change. If that is wrong, you would need to reverse some of these calculations.

第一步是创建一个新的data.frame来计算每日总计,然后将其绑定回data.frame.然后,您可以group_by雇员(包括总计")并添加将分别为每个雇员创建的列(前一天的值,更改,然后是增加还是减少).

The first step is to create a new data.frame that calculates the daily totals, then bind that back to the data.frame. Then, you can group_by the employees (including "Total") and add columns that will be created separately for each employee (value on the previous day, the change, and then whether it was an increase or a decrease).

toPlot <-
  bind_rows(
    df
    , df %>%
      group_by(date) %>%
      summarise(revenue = sum(revenue)) %>%
      mutate(employee = "Total") 
  ) %>%
  group_by(employee) %>%
  mutate(
    previousDay = lag(revenue, default = 0) 
    , change = revenue - previousDay
    , direction = ifelse(change > 0
                         , "Positive"
                         , "Negative"))

返回:

         date employee revenue previousDay change direction
       <date>    <chr>   <dbl>       <dbl>  <dbl>     <chr>
1  2017-03-01        A      10           0     10  Positive
2  2017-03-01        B      20           0     20  Positive
3  2017-03-01        C      30           0     30  Positive
4  2017-03-01        D      40           0     40  Positive
5  2017-03-01        E      10           0     10  Positive
6  2017-03-01        F      40           0     40  Positive
7  2017-03-02        A       8          10     -2  Negative
8  2017-03-02        B      10          20    -10  Negative
9  2017-03-02        C      20          30    -10  Negative
10 2017-03-02        D      50          40     10  Positive
# ... with 18 more rows

然后,我们可以使用以下方法进行绘制:

Then, we can plot that using:

toPlot %>%
  ggplot(aes(xmin = date - 0.5
             , xmax = date + 0.5
             , ymin = previousDay
             , ymax = revenue
             , fill = direction)) +
  geom_rect(col = "black"
            , show.legend = FALSE) +
  facet_wrap(~employee
             , scale = "free_y") +
  scale_fill_brewer(palette = "Set1")

给予

请注意,将总计"包括在内会超出范围(要求使用免费范围),因此我宁愿忽略它:

Note that including "Total" throws off the scale (requiring the free scales), so I would prefer to omit it:

toPlot %>%
  filter(employee != "Total") %>%
  ggplot(aes(xmin = date - 0.5
             , xmax = date + 0.5
             , ymin = previousDay
             , ymax = revenue
             , fill = direction)) +
  geom_rect(col = "black"
            , show.legend = FALSE) +
  facet_wrap(~employee) +
  scale_fill_brewer(palette = "Set1")

为此,员工之间可以直接进行比较

For this to allow direct comparsion between employees

这是整个总数

toPlot %>%
  filter(employee == "Total") %>%
  ggplot(aes(xmin = date - 0.5
             , xmax = date + 0.5
             , ymin = previousDay
             , ymax = revenue
             , fill = direction)) +
  geom_rect(col = "black"
            , show.legend = FALSE) +
  scale_fill_brewer(palette = "Set1")

尽管我仍然发现折线图更易于解释(尤其是比较员工):

though I still find line graphs to be easier to interpret (especially comparing employees):

toPlot %>%
  filter(employee != "Total") %>%
  ggplot(aes(x = date
             , y = revenue
             , col = employee)) +
  geom_line() +
  scale_fill_brewer(palette = "Dark2")

如果您想按日自己绘制更改,则可以执行以下操作:

If you want to plot the changes themselves by day, you can do:

toPlot %>%
  filter(employee != "Total") %>%
  ggplot(aes(x = date
             , y = change
             , fill = employee)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Dark2")

获得:

但是现在您与瀑布"图输出相差很远.如果您确实确实想使瀑布在各图之间具有可比性,那么它将非常难看(我强烈建议在上面的线图上使用

but now you are getting rather far from the "waterfall" plot outputs. If you really, really want to make a waterfall comparable across plots you can, but it is going to be rather ugly (I'd strongly recommend the line plot above instead).

在这里,您需要手动移动框,如果您更改输出宽高比(或大小)或员工人数,则需要进行一些修补.您还需要包括员工和更改方向的颜色,这些颜色开始看起来很粗糙.这属于可以,但可能不应该"的类别-显示这些数据可能是一种更好的方法.

Here, you need to manually move the boxes around, and this will require some tinkering if you change the output aspect ratio (or size) or the number of employees. You also need to include colors for both the employee and the direction of the change, which starts to look rough. This falls into the category of "can, but probably shouldn't" -- there is likely a better way to display these data.

toPlot %>%
  filter(employee != "Total") %>%
  ungroup() %>%
  mutate(empNumber = as.numeric(as.factor(employee))) %>%
  ggplot(aes(xmin = (empNumber) - 0.4
             , xmax = (empNumber) + 0.4
             , ymin = previousDay
             , ymax = revenue
             , col = direction
             , fill = employee)) +
  geom_rect(size = 1.5) +
  facet_grid(~date) +
  scale_fill_brewer(palette = "Dark2") +
  theme(axis.text.x = element_blank()
        , axis.ticks.x = element_blank())

给予

这篇关于转换数据框以在ggplot2中制作瀑布图的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆