Weighted sankey / alluvial diagram for visualizing discrete and continuous panel data?

Question

I'm trying to visualize panel data on individuals that includes both a discrete or categorical choice and a continuous choice in each time period. One common example of this situation is customers purchasing a product/subscription and then choosing how frequently to use the product/service.

I would like to show "flows" across time periods weighted by the continuous variable in each time period -- some sort of cross between a weighted stacked bar chart and a sankey or alluvial diagram. Sankey and alluvial diagrams fundamentally represent flows between nodes, where each flow has a single magnitude. Instead, I would like to show "flows" representing a continuous choice that might have different values in different time periods, even for the same individual. The resulting diagram would look very similar to a sankey or alluvial plot, except that the alluvia or "flows" would gradually change widths between time periods. For example, suppose a customer buys the same subscription in two time periods, but uses it more frequently in the second period; that usage could be represented by a band or "flow" that increases in width from the first to the second time period.

  1. Does this chart type already exist anywhere? I was unable to find any examples in a fairly extensive search. If it doesn't exist, I hope that the value of such a chart type is clear and that someone will name and create it! :)
  2. How might such a graph be "hacked" in R using existing alluvial or sankey libraries? I imagine this is not trivial, since those chart types are defined by constant flows between nodes.

Example in R

I'll walk through an example using R to clarify the problem. Here's an example data set:

library(tidyr)
library(dplyr)
library(alluvial)
library(ggplot2)
library(forcats)

set.seed(42)
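# ten individuals, each observed in two time periods
individual <- rep(LETTERS[1:10], each = 2)
timeperiod <- paste0("time_", rep(1:2, 10))
# one categorical choice and one continuous choice per individual per period
discretechoice <- factor(paste0("choice_", sample(letters[1:3], 20, replace = TRUE)))
continuouschoice <- ceiling(runif(20, 0, 100))
d <- data.frame(individual, timeperiod, discretechoice, continuouschoice)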

I can visualize panel data for the discrete or categorical choice piece perfectly well. A stacked bar chart can be used to show how the number of individuals in each category changes over time. Alluvial or sankey diagrams can additionally show the individual movements that are causing changes in the category totals. For example:

# stacked bar diagram of discrete choice by individual
g <- ggplot(data=d,aes(timeperiod,fill=fct_rev(discretechoice)))
g + geom_bar(position="stack") + guides(fill=guide_legend(title=NULL))


# alluvial diagram of discrete choice by individual
d_alluvial <- d %>%
  select(individual,timeperiod,discretechoice) %>%
  spread(timeperiod,discretechoice) %>%
  group_by(time_1,time_2) %>%
  summarize(count=n()) %>%
  ungroup()
alluvial(select(d_alluvial,-count),freq=d_alluvial$count)

I can also look at the continuous choice totals by category and across time periods by weighting the stacked bar chart.

# stacked bar diagram of discrete choice, weighting by continuous choice
g + geom_bar(position="stack",aes(weight=continuouschoice))
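
As a rough check on what the weighted bars show, the segment heights should equal the sums of continuouschoice within each time period and category. A minimal sketch, rebuilding the example data so that it runs on its own (the name `totals` is only illustrative):

```r
library(dplyr)

# Rebuild the example data so this snippet is self-contained
set.seed(42)
individual <- rep(LETTERS[1:10], each = 2)
timeperiod <- paste0("time_", rep(1:2, 10))
discretechoice <- factor(paste0("choice_", sample(letters[1:3], 20, replace = TRUE)))
continuouschoice <- ceiling(runif(20, 0, 100))
d <- data.frame(individual, timeperiod, discretechoice, continuouschoice)

# Sum the continuous choice within each time period and category;
# these sums are the segment heights of the weighted stacked bar chart
totals <- d %>%
  group_by(timeperiod, discretechoice) %>%
  summarize(total = sum(continuouschoice), .groups = "drop")
print(totals)
```

The table gives the category totals per period, but it says nothing about which individuals moved between categories, and that is the part the flows would need to add.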

However, I cannot add any kind of individual "flows" across time periods to this weighted stacked bar chart. Those "flows" would have a different width in time period 1 than in time period 2, so they would need to be shown as gradually changing widths between the time periods. Sankey and alluvial diagrams, by contrast, have a single magnitude or width for each flow.

Solution

I faced just this sort of confusion at the beginning of adapting the alluvial package to the ggplot2 framework. It's not uncommon for Sankey and alluvial diagrams to change weight from position to position, but alluvial was not built to handle data in a format suitable to encode it. (Edit: The alluvial_ts() function in alluvial was built for such data, as an example in the README shows, but it doesn't produce stacked histograms at each time period.)

One option may be to use the parallel set geoms in the development version of ggforce, though I'm not familiar with them myself. The other I'm aware of is my own, ggalluvial. Here's one solution to your problem, I think, using your dataset d (notice that the colors differ):

library(ggalluvial)
ggplot(
  data = d,
  aes(
    x = timeperiod,
    stratum = discretechoice,
    alluvium = individual,
    y = continuouschoice
  )
) +
  geom_stratum(aes(fill = discretechoice)) +
  geom_flow()

It's also possible to color the flows between the time periods; see the examples.
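
As a minimal sketch of colored flows, assuming it is enough to map fill inside geom_flow() as well as geom_stratum(); the alpha value and the object name `p` are arbitrary choices for illustration, and the data are rebuilt so that the snippet runs on its own:

```r
library(ggplot2)
library(ggalluvial)

# Rebuild the example data so this snippet is self-contained
set.seed(42)
individual <- rep(LETTERS[1:10], each = 2)
timeperiod <- paste0("time_", rep(1:2, 10))
discretechoice <- factor(paste0("choice_", sample(letters[1:3], 20, replace = TRUE)))
continuouschoice <- ceiling(runif(20, 0, 100))
d <- data.frame(individual, timeperiod, discretechoice, continuouschoice)

p <- ggplot(
  data = d,
  aes(
    x = timeperiod,
    stratum = discretechoice,
    alluvium = individual,
    y = continuouschoice
  )
) +
  # flows are filled by category and made translucent so overlaps stay readable
  geom_flow(aes(fill = discretechoice), alpha = 0.5) +
  # strata are drawn on top of the flow ends
  geom_stratum(aes(fill = discretechoice)) +
  guides(fill = guide_legend(title = NULL))
print(p)
```

Because y is mapped to continuouschoice, each individual's flow can have a different width at time_1 than at time_2.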

I couldn't find a good discussion of the differences in data formats, i.e. in which each row corresponds to one subject across all time periods versus one subject at one time period, so I tried to write one in the vignette. If you have any suggestions, I'd be glad to hear them!
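
To make the two formats concrete, here is a small sketch using tidyr's pivot_wider() on the example data. The dataset d is already in the long form (one row per individual per time period, which ggalluvial calls lodes form); the name `d_wide` is only illustrative and holds the wide form (one row per individual across all time periods, called alluvia form).

```r
library(tidyr)

# Rebuild the example data so this snippet is self-contained
set.seed(42)
individual <- rep(LETTERS[1:10], each = 2)
timeperiod <- paste0("time_", rep(1:2, 10))
discretechoice <- factor(paste0("choice_", sample(letters[1:3], 20, replace = TRUE)))
continuouschoice <- ceiling(runif(20, 0, 100))
d <- data.frame(individual, timeperiod, discretechoice, continuouschoice)

# Long form: 20 rows, one per individual per time period
nrow(d)

# Wide form: 10 rows, one per individual, with one column per
# variable and time period (e.g. continuouschoice_time_1)
d_wide <- pivot_wider(
  d,
  id_cols = individual,
  names_from = timeperiod,
  values_from = c(discretechoice, continuouschoice)
)
print(d_wide)
```

The long form is the one that naturally carries a separate continuouschoice value for each time period, which is why the plot above can use d directly.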
