R: Cleaning up a wide and untidy dataframe

Question

I have a data frame that looks like:

d<-data.frame(id=(1:9), 
                  grp_id=(c(rep(1,3), rep(2,3), rep(3,3))), 
                  a=rep(NA, 9), 
                  b=c("No", rep(NA, 3), "Yes", rep(NA, 4)), 
                  c=c(rep(NA,2), "No", rep(NA,6)), 
                  d=c(rep(NA,3), "Yes", rep(NA,2), "No", rep(NA,2)), 
                  e=c(rep(NA, 7), "No", NA), 
                  f=c(NA, "No", rep(NA,3), "No", rep(NA,2), "No"))
>d
  id grp_id  a    b    c    d    e    f
1  1      1 NA   No <NA> <NA> <NA> <NA>
2  2      1 NA <NA> <NA> <NA> <NA>   No
3  3      1 NA <NA>   No <NA> <NA> <NA>
4  4      2 NA <NA> <NA>  Yes <NA> <NA>
5  5      2 NA  Yes <NA> <NA> <NA> <NA>
6  6      2 NA <NA> <NA> <NA> <NA>   No
7  7      3 NA <NA> <NA>   No <NA> <NA>
8  8      3 NA <NA> <NA> <NA>   No <NA>
9  9      3 NA <NA> <NA> <NA> <NA>   No

Within each group (grp_id) there is only 1 "Yes" or "No" value associated with each of the columns a:f.

I'd like to create a single row for each grp_id to get a data frame that looks like the following:

grp_id  a    b    c    d    e    f
     1 NA   No   No <NA> <NA>   No
     2 NA  Yes <NA>  Yes <NA>   No
     3 NA <NA> <NA>   No   No   No

I recognize that the tidyr package is probably the best tool and the 1st steps are likely to be

d %>% 
   group_by(grp_id) %>%
     summarise()

I would appreciate help with the commands within summarise, or any solution really. Thanks.

Recommended answer

You've received some good answers but neither of them actually uses the tidyr package. (The summarize() and summarize_at() family of functions is from dplyr.)

In fact, a tidyr-only solution for your problem is very doable.

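library(dplyr)  # provides select() (and the %>% pipe)
library(tidyr)  # provides gather() and spread()

d %>%
    gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
    na.omit() %>%
    select(-id) %>%
    spread(col, value, fill=NA, drop=FALSE)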

The only hard part is ensuring that you get the a column in your output. For your example data, it is entirely NA. The trick is the factor_key=TRUE argument to gather() and the drop=FALSE argument to spread(). Without those two arguments being set, the output would not have an a column, and would only have columns with at least one non-NA entry.
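To see what the two arguments buy you, the pipeline can be run with and without them and the column names compared. This is a minimal sketch on the example data; `stringsAsFactors = FALSE` is an assumption added here so the columns are character vectors regardless of R version, and the names `with_trick` and `without_trick` are purely illustrative.

```r
library(dplyr)  # select() and the %>% pipe
library(tidyr)  # gather() and spread()

# Example data from the question (stringsAsFactors = FALSE is an added assumption)
d <- data.frame(id = 1:9,
                grp_id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
                a = rep(NA, 9),
                b = c("No", rep(NA, 3), "Yes", rep(NA, 4)),
                c = c(rep(NA, 2), "No", rep(NA, 6)),
                d = c(rep(NA, 3), "Yes", rep(NA, 2), "No", rep(NA, 2)),
                e = c(rep(NA, 7), "No", NA),
                f = c(NA, "No", rep(NA, 3), "No", rep(NA, 2), "No"),
                stringsAsFactors = FALSE)

# With factor_key = TRUE and drop = FALSE: the all-NA column a survives
with_trick <- d %>%
  gather(col, value, -id, -grp_id, factor_key = TRUE) %>%
  na.omit() %>%
  select(-id) %>%
  spread(col, value, fill = NA, drop = FALSE)

# Without them: col is a plain character column, so a vanishes after na.omit()
without_trick <- d %>%
  gather(col, value, -id, -grp_id) %>%
  na.omit() %>%
  select(-id) %>%
  spread(col, value)

names(with_trick)     # includes "a"
names(without_trick)  # does not include "a"
```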

Here's a description of how it works:

gather(col, value, -id, -grp_id, factor_key=TRUE) %>%

This tidies your data: it effectively replaces columns a - f with new columns col and value, forming a long-format "tidy" data frame. The entries in the col column are the letters a - f. And because we've used factor_key=TRUE, this column is a factor with levels a - f, not just a character vector.

na.omit() %>%

This removes all the NA values from the long data.
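The first two steps can be inspected on their own to see why the factor matters. The sketch below (again assuming `stringsAsFactors = FALSE`, with `long` and `kept` as illustrative names) shows that after na.omit() no row refers to column a any more, yet "a" is still a level of col, which is what spread() later picks up.

```r
library(tidyr)  # gather()

# Example data from the question (stringsAsFactors = FALSE is an added assumption)
d <- data.frame(id = 1:9,
                grp_id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
                a = rep(NA, 9),
                b = c("No", rep(NA, 3), "Yes", rep(NA, 4)),
                c = c(rep(NA, 2), "No", rep(NA, 6)),
                d = c(rep(NA, 3), "Yes", rep(NA, 2), "No", rep(NA, 2)),
                e = c(rep(NA, 7), "No", NA),
                f = c(NA, "No", rep(NA, 3), "No", rep(NA, 2), "No"),
                stringsAsFactors = FALSE)

# Step 1: long format, one row per (id, original column) pair
long <- gather(d, col, value, -id, -grp_id, factor_key = TRUE)
nrow(long)          # 54: 9 rows x 6 gathered columns
is.factor(long$col) # TRUE because of factor_key = TRUE

# Step 2: drop the rows whose value is NA
kept <- na.omit(long)
nrow(kept)          # 9: one non-NA value per original row
table(kept$col)     # the count for "a" is 0 ...
levels(kept$col)    # ... but "a" is still a level of the factor
```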

select(-id) %>%

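This drops the id column. That matters because spread() treats every remaining column as a row identifier; with id still present, the output would keep one row per id instead of one row per grp_id.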

spread(col, value, fill=NA, drop=FALSE)

This re-widens the data, using the values in the col column to define new column names, and the values in the value column to fill in the entries of the new columns. When data is missing, a value of fill (here NA) is used instead. And the drop=FALSE means that when col is a factor, there will be one column per level of the factor, no matter whether that level appears in the data or not. This, along with setting col to be a factor, is what gets a as an output column.
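In newer tidyr releases (1.0.0 and later), gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a rough sketch of the same idea with those functions, not part of the original answer. Instead of relying on factor levels, it adds any all-NA column back by hand after widening; `value_cols`, `d2` and `wide` are illustrative names, and the up-front conversion to character is an assumption made so that all gathered columns share one type.

```r
library(dplyr)  # select() and the %>% pipe
library(tidyr)  # pivot_longer() / pivot_wider(), tidyr >= 1.0.0 assumed

# Example data from the question (stringsAsFactors = FALSE is an added assumption)
d <- data.frame(id = 1:9,
                grp_id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
                a = rep(NA, 9),
                b = c("No", rep(NA, 3), "Yes", rep(NA, 4)),
                c = c(rep(NA, 2), "No", rep(NA, 6)),
                d = c(rep(NA, 3), "Yes", rep(NA, 2), "No", rep(NA, 2)),
                e = c(rep(NA, 7), "No", NA),
                f = c(NA, "No", rep(NA, 3), "No", rep(NA, 2), "No"),
                stringsAsFactors = FALSE)

value_cols <- letters[1:6]

# Make every value column character so they can be stacked into one column
d2 <- d
d2[value_cols] <- lapply(d2[value_cols], as.character)

wide <- d2 %>%
  pivot_longer(cols = a:f, names_to = "col", values_to = "value",
               values_drop_na = TRUE) %>%
  select(-id) %>%
  pivot_wider(names_from = col, values_from = value)

# Columns that were entirely NA (here: a) are gone; add them back explicitly
wide <- as.data.frame(wide)
for (nm in setdiff(value_cols, names(wide))) {
  wide[[nm]] <- NA_character_
}

# Restore the original column order
wide <- wide[c("grp_id", value_cols)]
wide
```

The explicit re-adding step is clunkier than the factor trick, but it does not depend on any particular argument of the pivot functions.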

I personally find this approach more readable than the approaches requiring subsetting or lapply stuff. Additionally, this approach will fail with an error if your data is not actually one-hot (that is, if some group has more than one non-NA value in the same column), whereas other approaches may "work" and give you unexpected output. The downside of this approach is that the output columns a - f are not factors, but character vectors. If you need factor output you should be able to do (untested)

mutate(value = factor(value, levels=c('Yes', 'No', NA))) %>%

anywhere between the gather() and spread() functions to ensure factor output.
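Two points from this closing paragraph can be checked with a short sketch. First, if converting value before spread() does not give you factors, converting the columns after widening is a fallback that relies only on base R. Second, the failure on non-one-hot data can be provoked on purpose. The other answers are not shown here, so the summarise_at() call below is only an assumed stand-in for that style of solution, and `first_non_na`, `bad`, `tidy_try` and `silent` are hypothetical names.

```r
library(dplyr)  # select(), group_by(), summarise_at(), %>%
library(tidyr)  # gather() and spread()

# Example data from the question (stringsAsFactors = FALSE is an added assumption)
d <- data.frame(id = 1:9,
                grp_id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
                a = rep(NA, 9),
                b = c("No", rep(NA, 3), "Yes", rep(NA, 4)),
                c = c(rep(NA, 2), "No", rep(NA, 6)),
                d = c(rep(NA, 3), "Yes", rep(NA, 2), "No", rep(NA, 2)),
                e = c(rep(NA, 7), "No", NA),
                f = c(NA, "No", rep(NA, 3), "No", rep(NA, 2), "No"),
                stringsAsFactors = FALSE)

tidy_wide <- function(df) {
  df %>%
    gather(col, value, -id, -grp_id, factor_key = TRUE) %>%
    na.omit() %>%
    select(-id) %>%
    spread(col, value, fill = NA, drop = FALSE)
}

# 1) Factor output: convert the value columns after widening
wide <- tidy_wide(d)
value_cols <- letters[1:6]
wide[value_cols] <- lapply(wide[value_cols], factor, levels = c("Yes", "No"))
sapply(wide[value_cols], is.factor)

# 2) Non-one-hot data: group 1 now has two non-NA values in column b
bad <- d
bad$b[2] <- "Yes"

# The gather()/spread() pipeline refuses to guess and raises an error
tidy_try <- tryCatch(tidy_wide(bad), error = function(e) e)
inherits(tidy_try, "error")

# A "first non-NA value per group" summary runs and silently keeps one value
first_non_na <- function(x) x[!is.na(x)][1]
silent <- bad %>%
  group_by(grp_id) %>%
  summarise_at(vars(a:f), first_non_na)
silent$b[1]  # the second value in group 1 has been discarded
```

Whether the error or the silent pick is preferable depends on how much you trust the input; the error is the safer default when the one-value-per-group assumption has not been verified.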
