Weighted sankey / alluvial diagram for visualizing discrete and continuous panel data?

Questions

I'm trying to visualize panel data on individuals that includes both a discrete or categorical choice and a continuous choice in each time period. One common example of this situation is customers purchasing a product/subscription and then choosing how frequently to use the product/service.

I would like to show "flows" across time periods weighted by the continuous variable in each time period -- some sort of cross between a weighted stacked bar chart and a sankey or alluvial diagram. Sankey and alluvial diagrams fundamentally represent flows between nodes, where each flow has a single magnitude. Instead, I would like to show "flows" representing a continuous choice that might have different values in different time periods, even for the same individual. The resulting diagram would look very similar to a sankey or alluvial plot, except that the alluvia or "flows" would gradually change widths between time periods. For example, suppose a customer buys the same subscription in two time periods, but uses it more frequently in the second period; that usage could be represented by a band or "flow" that increases in width from the first to the second time period.

  1. Does this chart type already exist anywhere? I was unable to find any examples in a fairly extensive search. If it doesn't exist, I hope that the value of such a chart type is clear and that someone will name and create it! :)
  2. How might such a graph be "hacked" in R using existing alluvial or sankey libraries? I imagine this is not trivial, since those chart types are defined by constant flows between nodes.

Example in R

I'll walk through an example using R to clarify the problem. Here's an example data set:

library(tidyr)
library(dplyr)
library(alluvial)
library(ggplot2)
library(forcats)

set.seed(42)
individual <- rep(LETTERS[1:10],each=2)
timeperiod <- paste0("time_",rep(1:2,10))
discretechoice <- factor(paste0("choice_",sample(letters[1:3],20, replace=TRUE)))
continuouschoice <- ceiling(runif(20, 0, 100))
d <- data.frame(individual, timeperiod, discretechoice, continuouschoice)

I can visualize panel data for the discrete or categorical choice piece perfectly well. A stacked bar chart can be used to show how the number of individuals in each category changes over time. Alluvial or sankey diagrams can additionally show the individual movements that are causing changes in the category totals. For example:

# stacked bar diagram of discrete choice by individual
g <- ggplot(data=d,aes(timeperiod,fill=fct_rev(discretechoice)))
g + geom_bar(position="stack") + guides(fill=guide_legend(title=NULL))


# alluvial diagram of discrete choice by individual
d_alluvial <- d %>%
  select(individual,timeperiod,discretechoice) %>%
  spread(timeperiod,discretechoice) %>%
  group_by(time_1,time_2) %>%
  summarize(count=n()) %>%
  ungroup()
alluvial(select(d_alluvial,-count),freq=d_alluvial$count)

I can also look at the continuous choice totals by category and across time periods by weighting the stacked bar chart.

# stacked bar diagram of discrete choice, weighting by continuous choice
g + geom_bar(position="stack",aes(weight=continuouschoice))

However, I cannot add any kind of individual "flows" across time periods to this weighted stacked bar chart. Those "flows" would have a different width in time period 1 than in time period 2, so they would need to be shown as gradually changing widths between the time periods. Sankey and alluvial diagrams, by contrast, have a single magnitude or width for each flow.
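
To make the changing widths concrete, here is a minimal base R sketch that places each individual's continuous choice in the two periods side by side. The table name `widths` and the `change` column are illustrative names, not part of the original question, and the data set is rebuilt so the block runs on its own.

```r
# Rebuild the example data so this block is self-contained.
set.seed(42)
individual <- rep(LETTERS[1:10], each = 2)
timeperiod <- paste0("time_", rep(1:2, 10))
discretechoice <- factor(paste0("choice_", sample(letters[1:3], 20, replace = TRUE)))
continuouschoice <- ceiling(runif(20, 0, 100))
d <- data.frame(individual, timeperiod, discretechoice, continuouschoice)

# One row per individual, with the continuous choice of each period in its own column.
# reshape() names the new columns continuouschoice.time_1 and continuouschoice.time_2.
widths <- reshape(
  d[, c("individual", "timeperiod", "continuouschoice")],
  idvar = "individual",
  timevar = "timeperiod",
  direction = "wide"
)

# Hypothetical helper column: how much each individual's band would widen or narrow.
widths$change <- widths$continuouschoice.time_2 - widths$continuouschoice.time_1
widths
```

A nonzero `change` marks an individual whose band would have to widen or narrow between the two periods, which a fixed-width flow cannot show.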

Solution

I faced just this sort of confusion when I began adapting the alluvial package to the ggplot2 framework. It's not uncommon for Sankey and alluvial diagrams to change weight from position to position, but alluvial was not built to handle data in a format suitable for encoding that. (Edit: The alluvial_ts() function in alluvial was; see an example in the README. However, it doesn't produce stacked histograms at each time period.)

One option may be to use the parallel set geoms in the development version of ggforce, though I'm not familiar with them myself. The other I'm aware of is my own package, ggalluvial. Here's one solution to your problem, I think, using your dataset d (notice that the colors differ):

library(ggalluvial)
ggplot(
  data = d,
  aes(
    x = timeperiod,
    stratum = discretechoice,
    alluvium = individual,
    y = continuouschoice
  )
) +
  geom_stratum(aes(fill = discretechoice)) +
  geom_flow()

It's also possible to color the flows between the time periods; see the examples.
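
As a rough sketch of what coloring the flows might look like (not necessarily the example the answer refers to), the fill aesthetic can be mapped inside `geom_flow()` as well. The `alpha` value and the object name `p` are arbitrary choices for illustration, and the data set is rebuilt so the block runs on its own.

```r
library(ggplot2)
library(ggalluvial)

# Rebuild the example data so this block is self-contained.
set.seed(42)
individual <- rep(LETTERS[1:10], each = 2)
timeperiod <- paste0("time_", rep(1:2, 10))
discretechoice <- factor(paste0("choice_", sample(letters[1:3], 20, replace = TRUE)))
continuouschoice <- ceiling(runif(20, 0, 100))
d <- data.frame(individual, timeperiod, discretechoice, continuouschoice)

p <- ggplot(
  data = d,
  aes(
    x = timeperiod,
    stratum = discretechoice,
    alluvium = individual,
    y = continuouschoice
  )
) +
  # Flows drawn first, filled by discrete choice and made semi-transparent.
  geom_flow(aes(fill = discretechoice), alpha = 0.6) +
  # Strata drawn on top so the stacked bars stay readable.
  geom_stratum(aes(fill = discretechoice))
p
```

With the default flow statistic, each flow should take the color of the category it leaves in the first period, so the chart shows where each band of usage came from.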

I couldn't find a good discussion of the differences in data formats, i.e. in which each row corresponds to one subject across all time periods versus one subject at one time period, so I tried to write one in the vignette. If you have any suggestions, I'd be glad to hear them!
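
The two layouts can be illustrated with tidyr, which the question already loads. This is only a sketch of the reshaping, not the vignette's own example. The object names `long`, `wide` and `long_again` are illustrative, and the data set is rebuilt so the block runs on its own.

```r
library(tidyr)

# Rebuild the example data so this block is self-contained.
set.seed(42)
individual <- rep(LETTERS[1:10], each = 2)
timeperiod <- paste0("time_", rep(1:2, 10))
discretechoice <- factor(paste0("choice_", sample(letters[1:3], 20, replace = TRUE)))
continuouschoice <- ceiling(runif(20, 0, 100))
d <- data.frame(individual, timeperiod, discretechoice, continuouschoice)

# Long layout: one row per individual per time period (the shape d already has).
long <- d[, c("individual", "timeperiod", "discretechoice")]

# Wide layout: one row per individual, one column per time period.
wide <- pivot_wider(long, names_from = timeperiod, values_from = discretechoice)

# Back to the long layout.
long_again <- pivot_longer(
  wide,
  cols = c(time_1, time_2),
  names_to = "timeperiod",
  values_to = "discretechoice"
)
```

The wide layout has room for only one value per individual and period unless more columns are added, which is why the long layout suits a weight that changes from period to period.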
