标题卡在行中的整理和转换数据 [英] Tidy and Cast Data With Headers Stuck in Rows

查看:28
本文介绍了标题卡在行中的整理和转换数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

demodf <- data.frame(
  name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
  Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"))

   name     Field    Values
1  Mike EDUCATION EDUCATION
2  Mike    Degree   Masters
3  Mike     Title   Student
4  Mike      WORK      WORK
5  Mike     Title  VP Sales
6   Joe EDUCATION EDUCATION
7   Joe    Degree  Bachelors
8   Joe     Title   Student
9   Joe      WORK      WORK
10  Joe     Title   Analyst

我想将 tidyr::spreadreshape2::dcast 转换为宽格式,其中 Field 成为列标题.

I want to tidyr::spread or reshape2::dcast into wide format, where Field becomes the column headers.

该代码看起来像 dcast(demodf, name ~ Values)demodf %>% spread(Field, Values).但是,dcast 强制转换为数字,spread 会抛出错误.

That code would look like dcast(demodf, name ~ Values) or demodf %>% spread(Field, Values). However, dcast coerces to numeric, and spread throws an error.

问题是标题"重复了.您可以看到,由于数据中的一个怪癖,我们将 EDUCATION 和 WORK 作为数据中的假"标头.是否可以用大写的标题标记每个 Field 条目,以便 dcast 可以工作(即 Title_EDUCATIONTitle_WORK>)?将这种转换应用到整个 Field 会更好,所以EDUCATION"和WORK"一起消失,我们只剩下 Degree_EDUCATION, TITLE_EDUCATION……等等).

The problem is that "Title" repeats. You can see that due to a quirk in the data, we have EDUCATION and WORK as "false" headers within the data. Is it possible to tag each Field entry with the capitalized header so that dcast will work (i.e. Title_EDUCATION, and Title_WORK)? And it would be even better to apply that transformation to the whole Field, so "EDUCATION" and "WORK" disappear all together, and we're left with Degree_EDUCATION, TITLE_EDUCATION... etc.).

请注意,实际数据中有更多的标头,因此最好将假标头"标识为全部大写的条目,或者Field == Values

Note that there are many more headers in the actual data, so it would be best to identify the "false headers" as the all-cap entries, or the entries where Field == Values

所需的输出:

output <- data.frame(
 Name=c("Mike", "Joe"),
 Degree_EDUCATION =c("Masters", "Bachelors"),
 Title_EDUCATION = c("Student", "Student"),
 Title_WORK= c("VP Sales", "Analyst"))

  Name Degree_EDUCATION Title_EDUCATION Title_WORK
1 Mike          Masters         Student   VP Sales
2  Joe        Bachelors         Student    Analyst

推荐答案

关键是将重复的类别行添加为新列,然后您可以轻松使用它.

The key is to add that repeated category line as a new column, then you can work with it easily.

首先,添加stringsAsFactors=FALSE 以便可以比较FieldValues:

First, adding stringsAsFactors=FALSE so can compare Field and Values:

demodf <- data.frame(
  name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
  Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"),
  stringsAsFactors=FALSE)

现在使用 tidyrdplyr 添加列,如果该行是一个类别和该类别的名称,然后填写缺失值,然后删除额外的行和列.

Now use tidyr and dplyr to add columns for if that row is a category and the name of that category, then to fill down the missing values, then to remove the extra rows and columns.

library(tidyr)
library(dplyr)
d2 <- demodf %>% mutate(IsCategory=Field==Values,
                        Category=ifelse(IsCategory, Field, NA)) %>%
  fill(Category) %>% subset(!IsCategory, select=-IsCategory)
d2
##    name  Field    Values  Category
## 2  Mike Degree   Masters EDUCATION
## 3  Mike  Title   Student EDUCATION
## 5  Mike  Title  VP Sales      WORK
## 7   Joe Degree Bachelors EDUCATION
## 8   Joe  Title   Student EDUCATION
## 10  Joe  Title   Analyst      WORK

dcast 将如您所愿!

library(reshape2)    
dcast(d2, name ~ Field+Category, value.var="Values")
##   name Degree_EDUCATION Title_EDUCATION Title_WORK
## 1  Joe        Bachelors         Student    Analyst
## 2 Mike          Masters         Student   VP Sales

这篇关于标题卡在行中的整理和转换数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆