tidyr spread function generates sparse matrix when compact vector expected


Problem description


I'm learning dplyr, having come from plyr, and I want to generate (per group) columns (per interaction) from the output of xtabs.

Short summary: I'm getting

A    B
1    NA
NA   2

when I wanted

A    B
1    2


xtabs data looks like this:

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T)))
       A
P       FALSE TRUE
  FALSE     1    2
  TRUE      1    1

Now do() wants its data in data frames, like this:

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% as.data.frame
      P     A Freq
1 FALSE FALSE    1
2  TRUE FALSE    1
3 FALSE  TRUE    2
4  TRUE  TRUE    1

Now I want a single row output with columns being the interaction of levels. Here's what I'm looking for:

FALSE_FALSE TRUE_TRUE FALSE_TRUE TRUE_FALSE
          1         1          2          1

But instead I get

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% 
    as.data.frame %>% 
    unite(S,A,P) %>% 
    spread(S,Freq)
  FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
1           1         NA         NA        NA
2          NA          1         NA        NA
3          NA         NA          2        NA
4          NA         NA         NA         1

I'm clearly misunderstanding something here. I'm looking for the equivalent of reshape2's code here (using magrittr pipes for consistency):

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% 
    as.data.frame %>% # can be omitted. (safely??)
    melt %>% 
    mutate(S=interaction(P,A),value=value) %>% 
    dcast(NA~S)
Using P, A as id variables
  NA FALSE.FALSE TRUE.FALSE FALSE.TRUE TRUE.TRUE
1 NA           1          1          2         1

(note NA is used here because I don't have a grouping variable in this simplified example)


Update - interestingly, adding a single grouping column seems to fix this - why does it synthesise (presumably from row_name) a grouping column without me telling it?

> xtabs(data=data.frame(h="foo",P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% 
  as.data.frame %>% 
  unite(S,A,P) %>% 
  spread(S,Freq)
    h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
1 foo           1          1          2         1

This seems like a partial solution.

Solution

The key here is that spread doesn't aggregate the data.

Hence, if you hadn't already used xtabs to aggregate first, you would be doing this:

a <- data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T), Freq = 1) %>% 
    unite(S,A,P)
a
##             S Freq
## 1 FALSE_FALSE    1
## 2  FALSE_TRUE    1
## 3  TRUE_FALSE    1
## 4   TRUE_TRUE    1
## 5  TRUE_FALSE    1

a %>% spread(S, Freq)
##   FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1           1         NA         NA        NA
## 2          NA          1         NA        NA
## 3          NA         NA          1        NA
## 4          NA         NA         NA         1
## 5          NA         NA          1        NA

No other result would make sense without aggregation: rows 3 and 5 both carry a TRUE_FALSE value, and spread has no rule for combining them into one cell.

This is predictable based on the help file for the fill parameter:

If there isn't a value for every combination of the other variables and the key column, this value will be substituted.
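To make the role of fill concrete, here is a minimal sketch that assumes the same five observations plus an illustrative grouping column h (not part of the original question). The data is aggregated first, so fill only has to cover the combinations of h and S that never occur; it replaces NA in those cells and does not merge rows.

```r
library(dplyr)
library(tidyr)

# Same observations as in the question, plus an assumed grouping column h.
b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
  unite(S, A, P)

# Aggregate first so each (h, S) pair appears once, then spread.
# fill = 0 is substituted only where an (h, S) combination has no value.
filled <- b %>%
  group_by(h, S) %>%
  summarize(Freq = sum(Freq)) %>%
  ungroup() %>%
  spread(S, Freq, fill = 0)

filled
```

The result should have one row per value of h, with 0 instead of NA for the missing combinations.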

In your case, there aren't any other variables to combine with the key column. Had there been, then...

b <- data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T), Freq = 1
                                , h = rep(c("foo", "bar"), length.out = 5)) %>% 
    unite(S,A,P)
b
##             S Freq   h
## 1 FALSE_FALSE    1 foo
## 2  FALSE_TRUE    1 bar
## 3  TRUE_FALSE    1 foo
## 4   TRUE_TRUE    1 bar
## 5  TRUE_FALSE    1 foo

b %>% spread(S, Freq)
## Error: Duplicate identifiers for rows (3, 5)

...it would fail, because it can't aggregate rows 3 and 5 (because it isn't designed to).
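One way to anticipate this error is to count the rows for each combination of identifier and key before spreading. The sketch below rebuilds b as above and uses dplyr's count and filter; the name dupes is only illustrative.

```r
library(dplyr)
library(tidyr)

b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
  unite(S, A, P)

# Any (h, S) pair that occurs more than once would have to share a single
# cell after spreading, which spread cannot resolve on its own.
dupes <- b %>%
  count(h, S) %>%
  filter(n > 1)

dupes
```

If dupes has any rows, the data needs to be aggregated (or given an extra identifying column) before it can be spread.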

The tidyr/dplyr way to do it would be group_by and summarize instead of xtabs, because summarize preserves the grouping column, hence spread can tell which observations belong in the same row:

b %>%   group_by(h, S) %>%
    summarize(Freq = sum(Freq))
## Source: local data frame [4 x 3]
## Groups: h
## 
##     h           S Freq
## 1 bar  FALSE_TRUE    1
## 2 bar   TRUE_TRUE    1
## 3 foo FALSE_FALSE    1
## 4 foo  TRUE_FALSE    2

b %>%   group_by(h, S) %>%
    summarize(Freq = sum(Freq)) %>%
    spread(S, Freq)
## Source: local data frame [2 x 5]
## 
##     h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1 bar          NA          1         NA         1
## 2 foo           1         NA          2        NA
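For readers on a newer tidyr, spread has since been superseded by pivot_wider, which can aggregate while reshaping. The following is a hedged sketch rather than part of the original answer; it assumes tidyr 1.1.0 or later, where values_fn accepts a single function and values_fill a single value.

```r
library(dplyr)
library(tidyr)

b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
  unite(S, A, P)

# values_fn = sum collapses duplicate (h, S) pairs while widening;
# values_fill = 0 covers the combinations that never occur.
wide <- b %>%
  pivot_wider(names_from = S,
              values_from = Freq,
              values_fn = sum,
              values_fill = 0)

wide
```

This folds the group_by and summarize step into the reshape itself; whether that is clearer than aggregating explicitly first is a matter of taste.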
