标题卡在行中的整理和转换数据 [英] Tidy and Cast Data With Headers Stuck in Rows
问题描述
demodf <- data.frame(
name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"))
name Field Values
1 Mike EDUCATION EDUCATION
2 Mike Degree Masters
3 Mike Title Student
4 Mike WORK WORK
5 Mike Title VP Sales
6 Joe EDUCATION EDUCATION
7 Joe Degree Bachelors
8 Joe Title Student
9 Joe WORK WORK
10 Joe Title Analyst
我想将 tidyr::spread
或 reshape2::dcast
转换为宽格式,其中 Field
成为列标题.
I want to tidyr::spread
or reshape2::dcast
into wide format, where Field
becomes the column headers.
该代码看起来像 dcast(demodf, name ~ Values)
或 demodf %>% spread(Field, Values)
.但是,dcast
强制转换为数字,spread
会抛出错误.
That code would look like dcast(demodf, name ~ Values)
or demodf %>% spread(Field, Values)
. However, dcast
coerces to numeric, and spread
throws an error.
问题是标题"重复了.您可以看到,由于数据中的一个怪癖,我们将 EDUCATION 和 WORK 作为数据中的假"标头.是否可以用大写的标题标记每个 Field
条目,以便 dcast
可以工作(即 Title_EDUCATION
和 Title_WORK
>)?将这种转换应用到整个 Field
会更好,所以EDUCATION"和WORK"一起消失,我们只剩下 Degree_EDUCATION, TITLE_EDUCATION
……等等).
The problem is that "Title" repeats. You can see that due to a quirk in the data, we have EDUCATION and WORK as "false" headers within the data. Is it possible to tag each Field
entry with the capitalized header so that dcast
will work (i.e. Title_EDUCATION
, and Title_WORK
)? And it would be even better to apply that transformation to the whole Field
, so "EDUCATION" and "WORK" disappear all together, and we're left with Degree_EDUCATION, TITLE_EDUCATION
... etc.).
请注意,实际数据中有更多的标头,因此最好将假标头"标识为全部大写的条目,或者Field == Values
Note that there are many more headers in the actual data, so it would be best to identify the "false headers" as the all-cap entries, or the entries where Field == Values
所需的输出:
output <- data.frame(
Name=c("Mike", "Joe"),
Degree_EDUCATION =c("Masters", "Bachelors"),
Title_EDUCATION = c("Student", "Student"),
Title_WORK= c("VP Sales", "Analyst"))
Name Degree_EDUCATION Title_EDUCATION Title_WORK
1 Mike Masters Student VP Sales
2 Joe Bachelors Student Analyst
推荐答案
关键是将重复的类别行添加为新列,然后您可以轻松使用它.
The key is to add that repeated category line as a new column, then you can work with it easily.
首先,添加stringsAsFactors=FALSE
以便可以比较Field
和Values
:
First, adding stringsAsFactors=FALSE
so can compare Field
and Values
:
demodf <- data.frame(
name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"),
stringsAsFactors=FALSE)
现在使用 tidyr
和 dplyr
添加列,如果该行是一个类别和该类别的名称,然后填写缺失值,然后删除额外的行和列.
Now use tidyr
and dplyr
to add columns for if that row is a category and the name of that category, then to fill down the missing values, then to remove the extra rows and columns.
library(tidyr)
library(dplyr)
d2 <- demodf %>% mutate(IsCategory=Field==Values,
Category=ifelse(IsCategory, Field, NA)) %>%
fill(Category) %>% subset(!IsCategory, select=-IsCategory)
d2
## name Field Values Category
## 2 Mike Degree Masters EDUCATION
## 3 Mike Title Student EDUCATION
## 5 Mike Title VP Sales WORK
## 7 Joe Degree Bachelors EDUCATION
## 8 Joe Title Student EDUCATION
## 10 Joe Title Analyst WORK
dcast
将如您所愿!
library(reshape2)
dcast(d2, name ~ Field+Category, value.var="Values")
## name Degree_EDUCATION Title_EDUCATION Title_WORK
## 1 Joe Bachelors Student Analyst
## 2 Mike Masters Student VP Sales
这篇关于标题卡在行中的整理和转换数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!