tidyr spread function generates sparse matrix when compact vector expected

Question

I'm learning dplyr, having come from plyr, and I want to generate (per group) columns (per interaction) from the output of xtabs.

Short summary: I get

A    B
1    NA
NA   2

when I want

A    B
1    2

xtabs data looks like this:

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T)))
       A
P       FALSE TRUE
  FALSE     1    2
  TRUE      1    1

Now do() wants its data in data frames, like this:

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% as.data.frame
      P     A Freq
1 FALSE FALSE    1
2  TRUE FALSE    1
3 FALSE  TRUE    2
4  TRUE  TRUE    1

Now I want a single row output with columns being the interaction of levels. Here's what I'm looking for:

FALSE_FALSE TRUE_TRUE FALSE_TRUE TRUE_FALSE
          1         1          2          1

But I get:

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% 
    as.data.frame %>% 
    unite(S,A,P) %>% 
    spread(S,Freq)
  FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
1           1         NA         NA        NA
2          NA          1         NA        NA
3          NA         NA          2        NA
4          NA         NA         NA         1

I'm clearly misunderstanding something here. I'm looking for the equivalent of reshape2's code here (using magrittr pipes for consistency):

> xtabs(data=data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% 
    as.data.frame %>% # can be omitted. (safely??)
    melt %>% 
    mutate(S=interaction(P,A),value=value) %>% 
    dcast(NA~S)
Using P, A as id variables
  NA FALSE.FALSE TRUE.FALSE FALSE.TRUE TRUE.TRUE
1 NA           1          1          2         1

(Note: NA is used here because I don't have a grouping variable in this simplified example.)

Update: interestingly, adding a single grouping column seems to fix this. Why does it synthesise (presumably from row_name) a grouping column without me telling it to?

> xtabs(data=data.frame(h="foo",P=c(F,T,F,T,F),A=c(F,F,T,T,T))) %>% 
  as.data.frame %>% 
  unite(S,A,P) %>% 
  spread(S,Freq)
    h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
1 foo           1          1          2         1

This seems like a partial solution.

Recommended answer

The key here is that spread doesn't aggregate the data.

Hence, if you hadn't already used xtabs to aggregate first, you would be doing this:

a <- data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T), Freq = 1) %>% 
    unite(S,A,P)
a
##             S Freq
## 1 FALSE_FALSE    1
## 2  FALSE_TRUE    1
## 3  TRUE_FALSE    1
## 4   TRUE_TRUE    1
## 5  TRUE_FALSE    1

a %>% spread(S, Freq)
##   FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1           1         NA         NA        NA
## 2          NA          1         NA        NA
## 3          NA         NA          1        NA
## 4          NA         NA         NA         1
## 5          NA         NA          1        NA

Which wouldn't make sense any other way (without aggregation).

This is predictable based on the help file for the fill parameter:

If there isn't a value for every combination of the other variables and the key column, this value will be substituted.
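As a small illustration of that help text, here is a sketch on a made-up data frame. The column names g, k and v are hypothetical and not part of the question's data. The combination (g = "y", k = "b") has no row, so spread puts NA there by default and the fill value when one is supplied:

```r
library(tidyr)

# Hypothetical toy data: group "y" has no row for key "b".
toy <- data.frame(g = c("x", "x", "y"),
                  k = c("a", "b", "a"),
                  v = c(1, 2, 3))

# Default: the missing combination becomes NA.
wide_na <- spread(toy, k, v)

# With fill = 0 the missing combination becomes 0 instead.
wide_fill <- spread(toy, k, v, fill = 0)

print(wide_na)
print(wide_fill)
```

Note that fill only covers combinations that are absent from the data; it does not merge duplicated ones.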

In your case, there aren't any other variables to combine with the key column. Had there been, then...

b <- data.frame(P=c(F,T,F,T,F),A=c(F,F,T,T,T), Freq = 1
                                , h = rep(c("foo", "bar"), length.out = 5)) %>% 
    unite(S,A,P)
b
##             S Freq   h
## 1 FALSE_FALSE    1 foo
## 2  FALSE_TRUE    1 bar
## 3  TRUE_FALSE    1 foo
## 4   TRUE_TRUE    1 bar
## 5  TRUE_FALSE    1 foo

b %>% spread(S, Freq)
## Error: Duplicate identifiers for rows (3, 5)

...it would fail, because it can't aggregate rows 3 and 5 (because it isn't designed to).
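One way to see which combinations trigger that error before calling spread is to count rows per (h, S) pair. This is only a sketch using dplyr's count and filter on the same b as above; the name dups is made up for illustration:

```r
library(dplyr)
library(tidyr)

# Same data as b above, rebuilt so this snippet runs on its own.
b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
  unite(S, A, P)

# Any (h, S) pair that occurs more than once cannot be placed
# in a single cell by spread, so it has to be aggregated first.
dups <- b %>%
  count(h, S) %>%
  filter(n > 1)

print(dups)
```

Here the only pair that appears more than once is h = "foo" with S = "TRUE_FALSE", which corresponds to rows 3 and 5 in the error message.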

The tidyr/dplyr way to do it would be group_by and summarize instead of xtabs, because summarize preserves the grouping column, hence spread can tell which observations belong in the same row:

b %>%   group_by(h, S) %>%
    summarize(Freq = sum(Freq))
## Source: local data frame [4 x 3]
## Groups: h
## 
##     h           S Freq
## 1 bar  FALSE_TRUE    1
## 2 bar   TRUE_TRUE    1
## 3 foo FALSE_FALSE    1
## 4 foo  TRUE_FALSE    2

b %>%   group_by(h, S) %>%
    summarize(Freq = sum(Freq)) %>%
    spread(S, Freq)
## Source: local data frame [2 x 5]
## 
##     h FALSE_FALSE FALSE_TRUE TRUE_FALSE TRUE_TRUE
## 1 bar          NA          1         NA         1
## 2 foo           1         NA          2        NA
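For readers on a newer tidyr, the same result can be sketched with pivot_wider, which is not part of the original answer. The assumption here is a tidyr version (1.1.0 or later) where values_fn accepts a single function and values_fill accepts a single value; values_fn does the aggregation that spread refuses to do, and values_fill plays the role of fill:

```r
library(dplyr)
library(tidyr)

# Same data as b above, rebuilt so this snippet runs on its own.
b <- data.frame(P = c(FALSE, TRUE, FALSE, TRUE, FALSE),
                A = c(FALSE, FALSE, TRUE, TRUE, TRUE),
                Freq = 1,
                h = rep(c("foo", "bar"), length.out = 5)) %>%
  unite(S, A, P)

# h is the only remaining column, so it identifies the output rows.
# Duplicated (h, S) cells are summed; absent cells are filled with 0.
wide <- b %>%
  pivot_wider(names_from = S,
              values_from = Freq,
              values_fn = sum,
              values_fill = 0)

print(wide)
```

Whether to sum inside pivot_wider or in an explicit group_by and summarize step is a matter of taste; the explicit step makes the aggregation easier to spot when reading the pipeline.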
