Good Ways to Visualize Longitudinal Categorical Data in R

Problem Description

[Update: Although I've accepted an answer, please add another answer if you have additional visualization ideas (whether in R or another language/program). Texts on categorical data analysis don't seem to say much about visualizing longitudinal data, while texts on longitudinal data analysis don't seem to say much about visualizing within-subject changes over time in category membership. Having more answers to this question will make it a better resource on an issue that doesn't get much coverage in standard references.]

A colleague just gave me a longitudinal categorical data set to look at and I'm trying to figure out how to capture the longitudinal aspect in a visualization. I'm posting here because I'd like to do this in R, but please let me know if it makes sense to also cross-post to Cross Validated, since cross-posting is generally discouraged.

Quick background: The data track the academic standing from term to term for students who went through an academic advising program. The data are in long format and have five variables: "id", "cohort", "term", "standing", and "termGPA". The first two identify the student and the term in which they were in the advising program. The last three give the term of each record and the student's academic standing and GPA in that term. I've pasted in some sample data below using dput.

I've created a mosaic plot (see below) that groups students by cohort, standing, and term. This shows what fraction of students were in each academic-standing category in each term. But this doesn't capture the longitudinal aspect--the fact that individual students are tracked over time. I'd like to track the path that groups of students with a given academic standing take over time.

For example: Of students with standing "AP" (academic probation) in Fall 2009 ("F09"), what fraction were still AP in future terms, and what fraction moved into other categories (e.g., GS, "good standing")? Are there differences between cohorts in terms of movement between categories with time since entry into the advising program?

I couldn't quite figure out how to capture this longitudinal aspect in an R graphic. The vcd package has facilities for visualizing categorical data, but doesn't seem to address longitudinal categorical data. Are there "standard" methods for visualizing longitudinal categorical data? Does R have packages designed for this? Is long format appropriate for this type of data or would I be better off with wide format?

I would appreciate suggestions for solving this particular problem and also suggestions for articles, books, etc. for learning more about visualizing longitudinal categorical data.

Here's the code I used to make the mosaic plot. The code uses the data listed below with dput.

library(RColorBrewer)

# create a table object for plotting
df1.tab = table(df1$cohort, df1$term, df1$standing,
            dnn=c("Cohort\nAcademic Standing", "Term", "Standing"))

# create a mosaic plot
plot(df1.tab, las=1, dir=c("h","v","h"), 
     col=brewer.pal(8,"Dark2"),
     main="Fall 2009 and Fall 2010 Cohorts")

Here's the mosaic plot (side question: is there any way to make the columns for the F10 cohort sit directly under and have the same width as the columns for the F09 cohort, even when there's no data for some terms in the F10 cohort?):

And here's the data used to create the table and the plot:

df1 =
structure(list(id = c(101L, 102L, 103L, 104L, 105L, 106L, 107L, 
108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 
119L, 120L, 121L, 122L, 123L, 124L, 125L, 101L, 102L, 103L, 104L, 
105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 
116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L, 101L, 
102L, 103L, 104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 
113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 
124L, 125L, 101L, 102L, 103L, 104L, 105L, 106L, 107L, 108L, 109L, 
110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 
121L, 122L, 123L, 124L, 125L, 101L, 102L, 103L, 104L, 105L, 106L, 
107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 
118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L, 101L, 102L, 103L, 
104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 
115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L, 
101L, 102L, 103L, 104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 
112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 
123L, 124L, 125L), cohort = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 
1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L), .Label = c("F09", "F10"), class = c("ordered", 
"factor")), term = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L), .Label = c("S09", "F09", "S10", 
"F10", "S11", "F11", "S12"), class = c("ordered", "factor")), 
    standing = structure(c(2L, 4L, 1L, 4L, NA, 4L, 1L, NA, NA, 
    NA, NA, 2L, 2L, 1L, 4L, 4L, 1L, 3L, NA, NA, 4L, 3L, 1L, 4L, 
    NA, 2L, 1L, 3L, 3L, NA, 1L, 2L, NA, NA, NA, NA, 2L, 4L, 3L, 
    4L, 4L, 4L, 2L, NA, NA, 4L, 2L, 4L, 4L, NA, 3L, 4L, 6L, 6L, 
    1L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 4L, 6L, 4L, 4L, 1L, 4L, 1L, 
    2L, 4L, 3L, 1L, 4L, 1L, 6L, 1L, 6L, 6L, 7L, 4L, 4L, 2L, 2L, 
    4L, 2L, 6L, 4L, 6L, 7L, 4L, 2L, 4L, 1L, 2L, 4L, 6L, 6L, 4L, 
    2L, 2L, 3L, 6L, 6L, 7L, 4L, 4L, 3L, 4L, 4L, 6L, 2L, 1L, 6L, 
    6L, 4L, 2L, 1L, 7L, 2L, 4L, 6L, 6L, 4L, 4L, 3L, 6L, 4L, 6L, 
    2L, 4L, 4L, 6L, 4L, 4L, 6L, 3L, 2L, 6L, 6L, 4L, 2L, 6L, 3L, 
    4L, 4L, 6L, 6L, 4L, 4L, 5L, 6L, 4L, 6L, 4L, 4L, 4L, 5L, 4L, 
    4L, 6L, 6L, 2L, 6L, 6L, 4L, 3L, 6L, 6L, 4L, 4L, 6L, 6L, 4L, 
    4L), .Label = c("AP", "CP", "DQ", "GS", "DM", "NE", "WD"), class = "factor"), 
    termGPA = c(1.433, 1.925, 1, 1.68, NA, 1.579, 1.233, NA, 
    NA, NA, NA, 2.009, 1.675, 0, 1.5, 1.86, 0.5, 0.94, NA, NA, 
    1.777, 1.1, 1.133, 1.675, NA, 2, 1.25, 1.66, 0, NA, 1.525, 
    2.25, NA, NA, NA, NA, 1.66, 2.325, 0, 2.308, 1.6, 1.825, 
    2.33, NA, NA, 2.65, 2.65, 2.85, 3.233, NA, 1.25, 1.575, NA, 
    NA, 1, 2.385, 3.133, 0, 0, 1.729, 1.075, 0, 4, NA, 2.74, 
    0, 1.369, 2.53, 0, 2.65, 2.75, 0, 0.333, 3.367, 1, NA, 0.1, 
    NA, NA, 1, 2.2, 2.18, 2.31, 1.75, 3.073, 0.7, NA, 1.425, 
    NA, 2.74, 2.9, 0.692, 2, 0.75, 1.675, 2.4, NA, NA, 3.829, 
    2.33, 2.3, 1.5, NA, NA, NA, 2.69, 1.52, 0.838, 2.35, 1.55, 
    NA, 1.35, 0.66, NA, NA, 1.35, 1.9, 1.04, NA, 1.464, 2.94, 
    NA, NA, 3.72, 2.867, 1.467, NA, 3.133, NA, 1, 2.458, 1.214, 
    NA, 3.325, 2.315, NA, 1, 2.233, NA, NA, 2.567, 1, NA, 0, 
    3.325, 2.077, NA, NA, 3.85, 2.718, 1.385, NA, 2.333, NA, 
    2.675, 1.267, 1.6, 1.388, 3.433, 0.838, NA, NA, 0, NA, NA, 
    2.6, 0, NA, NA, 1, 2.825, NA, NA, 3.838, 2.883)), .Names = c("id", 
"cohort", "term", "standing", "termGPA"), row.names = c("101.F09.s09", 
"102.F09.s09", "103.F09.s09", "104.F09.s09", "105.F10.s09", "106.F09.s09", 
"107.F09.s09", "108.F10.s09", "109.F10.s09", "110.F10.s09", "111.F10.s09", 
"112.F09.s09", "113.F09.s09", "114.F09.s09", "115.F09.s09", "116.F09.s09", 
"117.F09.s09", "118.F09.s09", "119.F10.s09", "120.F10.s09", "121.F09.s09", 
"122.F09.s09", "123.F09.s09", "124.F09.s09", "125.F10.s09", "101.F09.f09", 
"102.F09.f09", "103.F09.f09", "104.F09.f09", "105.F10.f09", "106.F09.f09", 
"107.F09.f09", "108.F10.f09", "109.F10.f09", "110.F10.f09", "111.F10.f09", 
"112.F09.f09", "113.F09.f09", "114.F09.f09", "115.F09.f09", "116.F09.f09", 
"117.F09.f09", "118.F09.f09", "119.F10.f09", "120.F10.f09", "121.F09.f09", 
"122.F09.f09", "123.F09.f09", "124.F09.f09", "125.F10.f09", "101.F09.s10", 
"102.F09.s10", "103.F09.s10", "104.F09.s10", "105.F10.s10", "106.F09.s10", 
"107.F09.s10", "108.F10.s10", "109.F10.s10", "110.F10.s10", "111.F10.s10", 
"112.F09.s10", "113.F09.s10", "114.F09.s10", "115.F09.s10", "116.F09.s10", 
"117.F09.s10", "118.F09.s10", "119.F10.s10", "120.F10.s10", "121.F09.s10", 
"122.F09.s10", "123.F09.s10", "124.F09.s10", "125.F10.s10", "101.F09.f10", 
"102.F09.f10", "103.F09.f10", "104.F09.f10", "105.F10.f10", "106.F09.f10", 
"107.F09.f10", "108.F10.f10", "109.F10.f10", "110.F10.f10", "111.F10.f10", 
"112.F09.f10", "113.F09.f10", "114.F09.f10", "115.F09.f10", "116.F09.f10", 
"117.F09.f10", "118.F09.f10", "119.F10.f10", "120.F10.f10", "121.F09.f10", 
"122.F09.f10", "123.F09.f10", "124.F09.f10", "125.F10.f10", "101.F09.s11", 
"102.F09.s11", "103.F09.s11", "104.F09.s11", "105.F10.s11", "106.F09.s11", 
"107.F09.s11", "108.F10.s11", "109.F10.s11", "110.F10.s11", "111.F10.s11", 
"112.F09.s11", "113.F09.s11", "114.F09.s11", "115.F09.s11", "116.F09.s11", 
"117.F09.s11", "118.F09.s11", "119.F10.s11", "120.F10.s11", "121.F09.s11", 
"122.F09.s11", "123.F09.s11", "124.F09.s11", "125.F10.s11", "101.F09.f11", 
"102.F09.f11", "103.F09.f11", "104.F09.f11", "105.F10.f11", "106.F09.f11", 
"107.F09.f11", "108.F10.f11", "109.F10.f11", "110.F10.f11", "111.F10.f11", 
"112.F09.f11", "113.F09.f11", "114.F09.f11", "115.F09.f11", "116.F09.f11", 
"117.F09.f11", "118.F09.f11", "119.F10.f11", "120.F10.f11", "121.F09.f11", 
"122.F09.f11", "123.F09.f11", "124.F09.f11", "125.F10.f11", "101.F09.s12", 
"102.F09.s12", "103.F09.s12", "104.F09.s12", "105.F10.s12", "106.F09.s12", 
"107.F09.s12", "108.F10.s12", "109.F10.s12", "110.F10.s12", "111.F10.s12", 
"112.F09.s12", "113.F09.s12", "114.F09.s12", "115.F09.s12", "116.F09.s12", 
"117.F09.s12", "118.F09.s12", "119.F10.s12", "120.F10.s12", "121.F09.s12", 
"122.F09.s12", "123.F09.s12", "124.F09.s12", "125.F10.s12"), reshapeLong = structure(list(
    varying = list(c("s09as", "f09as", "s10as", "f10as", "s11as", 
    "f11as", "s12as"), c("s09termGPA", "f09termGPA", "s10termGPA", 
    "f10termGPA", "s11termGPA", "f11termGPA", "s12termGPA")), 
    v.names = c("standing", "termGPA"), idvar = c("id", "cohort"
    ), timevar = "term"), .Names = c("varying", "v.names", "idvar", 
"timevar")), class = "data.frame")

Solution

Here are a few ideas for plotting your data. I've used ggplot2, and I've reformatted the data a bit in places.

Figure 1

I've used a stacked barplot to mimic your mosaic plot and solve the alignment issue.

Figure 2

Data points for each student are connected by a gray line, making this reminiscent of a parallel coordinates plot. Coloring the points shows the categorical standing. Using GPA on the y-axis helps spread out the points to reduce overplotting, and shows the correlation between standing and GPA. A major problem is that many valid standing data points drop out of the plot because they lack a matching termGPA value.
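Before relying on a GPA-based plot, it may help to count how many rows would be lost this way. The following is a minimal sketch on a small made-up data frame (toy and its values are hypothetical; only the column names mirror df1):

```r
# Toy data (hypothetical values); column names follow df1.
toy <- data.frame(
  term     = c("S09", "S09", "F09", "F09", "F09"),
  standing = c("AP", "NE", "GS", NA, "WD"),
  termGPA  = c(1.4, NA, 2.6, NA, NA),
  stringsAsFactors = FALSE
)

# Rows with a recorded standing but no GPA: these would be
# silently dropped from a plot that uses termGPA on the y-axis.
lost <- !is.na(toy$standing) & is.na(toy$termGPA)

n_lost <- sum(lost)                  # total number of such rows
lost_by_term <- table(toy$term[lost])  # breakdown by term
```

Running the same two lines on df1 instead of toy should show how much of the standing information Figure 2 leaves out.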

Figure 3

Here I've created a new variable called initial_standing to use for faceting. Each panel contains students who match in both cohort and initial_standing. Plotting the id as text makes this figure a bit cluttered, but could be useful in some cases.

Figure 4

This plot is like a heatmap where each row is a student. I controlled the order of the id axis to force initial_standing and cohort groupings to stay together. If you have many more rows, you may want to consider sorting rows by some type of clustering.
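One possible way to sort rows by clustering is sketched below. It is only an illustration under stated assumptions: the wide data frame is made up, the distance is a simple count of terms in which two students differ (a Hamming distance), and average-linkage hclust is an arbitrary choice. The toy data have no missing values; real data would need a rule for NA.

```r
# Toy data (hypothetical): one row per student, one column per term.
wide <- data.frame(
  S09 = c("AP", "GS", "AP", "GS"),
  F09 = c("AP", "GS", "GS", "GS"),
  S10 = c("CP", "GS", "GS", "AP"),
  row.names = c("101", "102", "103", "104"),
  stringsAsFactors = FALSE
)

# Hamming distance: number of terms in which two students differ.
m <- as.matrix(wide)
n <- nrow(m)
d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- sum(m[i, ] != m[j, ])
  }
}

# Hierarchical clustering; hc$order gives a row order that keeps
# similar standing sequences next to each other.
hc <- hclust(as.dist(d), method = "average")
id_levels <- rownames(m)[hc$order]
```

The resulting id_levels could then be used as the levels of the id factor, in place of the ordering by initial_standing and cohort.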

library(ggplot2)

# Create new data frame for determining initial standing.
standing_data = data.frame(id=unique(df1$id), initial_standing=NA, cohort=NA)

for (i in 1:nrow(standing_data)) {
    id = standing_data$id[i]
    subdat = df1[df1$id == id, ]
    subdat = subdat[complete.cases(subdat), ]
    initial_standing = subdat$standing[which.min(subdat$term)]
    standing_data[i, "initial_standing"] = as.character(initial_standing)
    standing_data[i, "cohort"] = as.character(subdat$cohort[1])
}

standing_data$cohort = factor(standing_data$cohort, levels=levels(df1$cohort))
standing_data$initial_standing = factor(standing_data$initial_standing,
                                        levels=levels(df1$standing))

# Add the new column (initial_standing) to df1.
df1 = merge(df1, standing_data[, c("id", "initial_standing")], by="id")

# Remove rows where standing is missing. Make some plots tidier.
df1 = df1[!is.na(df1$standing), ]

# Create id factor, controlling the sort order of the levels.     
id_order = order(standing_data$initial_standing, standing_data$cohort)
df1$id = factor(df1$id, levels=as.character(standing_data$id)[id_order])


p1 = ggplot(df1, aes(x=term, fill=standing)) +
     geom_bar(position="fill", colour="grey20", size=0.5, width=1.0) +
     facet_grid(cohort ~ .) +
     scale_fill_brewer(palette="Set1")

p2 = ggplot(df1, aes(x=term, y=termGPA, group=id)) + 
     geom_line(colour="grey70") + 
     geom_point(aes(colour=standing), size=4) + 
     facet_grid(cohort ~ .) +
     scale_colour_brewer(palette="Set1")

p3 = ggplot(df1, aes(x=term, y=termGPA, group=id)) +
     geom_line(colour="grey70") + 
     geom_point(aes(colour=standing), size=4) + 
     geom_text(aes(label=id), hjust=-0.30, size=3) +
     facet_grid(initial_standing ~ cohort) +
     scale_colour_brewer(palette="Set1")


p4 = ggplot(df1, aes(x=term, y=id, fill=standing)) + 
     geom_tile(colour="grey20") +
     facet_grid(initial_standing ~ ., space="free_y", scales="free_y") +
     scale_fill_brewer(palette="Set1") +
     theme(panel.grid.major=element_blank(),
           panel.grid.minor=element_blank())

ggsave("plot_1.png", p1, width=10, height=6.25, dpi=80)
ggsave("plot_2.png", p2, width=10, height=6.25, dpi=80)
ggsave("plot_3.png", p3, width=10, height=6.25, dpi=80)
ggsave("plot_4.png", p4, width=10, height=6.25, dpi=80)
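
The question's example (of students on AP in F09, what fraction are still on AP in later terms) can also be tabulated directly, alongside the plots. This is a minimal sketch on made-up data: toy, base, later and standing_F09 are hypothetical names, and only id, term and standing mirror df1.

```r
# Toy long-format data (hypothetical): 4 students, 3 terms each.
toy <- data.frame(
  id = rep(1:4, each = 3),
  term = factor(rep(c("F09", "S10", "F10"), times = 4),
                levels = c("F09", "S10", "F10"), ordered = TRUE),
  standing = c("AP", "AP", "GS",
               "AP", "GS", "GS",
               "GS", "GS", "GS",
               "AP", "AP", "AP"),
  stringsAsFactors = FALSE
)

# Each student's standing in the baseline term (F09).
base <- toy[toy$term == "F09", c("id", "standing")]
names(base)[2] <- "standing_F09"

# Attach the baseline standing to every later record.
later <- merge(toy[toy$term != "F09", ], base, by = "id")
later$term <- droplevels(later$term)

# Counts by baseline standing, later standing, and term.
tab <- table(later$standing_F09, later$standing, later$term,
             dnn = c("F09 standing", "later standing", "term"))

# Proportions within each baseline standing and term, so that each
# row of each term's slice sums to 1.
prop <- prop.table(tab, margin = c(1, 3))
```

Here prop["AP", "AP", "S10"] is the share of the F09 AP students who are still on AP in S10. Adding cohort as a further table dimension would allow the cohort comparison the question asks about.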
