Tidying datasets with multiple sections/headers at variable positions
Question
Context
I am trying to read in and tidy an Excel file with multiple headers/sections placed at variable positions. The content of these headers needs to be added as a variable. The input files are relatively large Excel files that are formatted with (human) readability in mind, but little more than that.
Input:
Let's say the data set contains the distribution of car types (based on the fuel they use) for a number of cities. As you will see, in the original file the name of the city is used as a header (or divider, if you will). We need this header as a variable. Unfortunately, not all types are listed for every city and some values are missing. Here's a fictional example set:
df <- data.frame(
  col1 = c("Seattle", "Diesel", "Gasoline", "LPG", "Electric", "Boston", "Diesel", "Gasoline", "Electric"),
  col2 = c(NA, 80, NA, 10, 10, NA, 65, 25, 10)
)
      col1 col2
1  Seattle   NA
2   Diesel   80
3 Gasoline   NA
4      LPG   10
5 Electric   10
6   Boston   NA
7   Diesel   65
8 Gasoline   25
9 Electric   10
Desired result:
     city     type value
1 Seattle   Diesel    80
2 Seattle Gasoline    NA
3 Seattle      LPG    10
4 Seattle Electric    10
5  Boston   Diesel    65
6  Boston Gasoline    25
7  Boston Electric    10
What I tried
The closest I got was using dplyr's dense_rank() and lag(), but this was not an ideal solution.
Any input is much appreciated!
Recommended answer
Assuming you have a finite list of measures (Diesel, Electric, etc.), you can make a list to check against. Any value of col1 not in that set of measures is presumably a city. Extract those (note that col1 is currently a factor, so I used as.character), fill down, and remove any heading rows.
library(dplyr)
meas <- c("Diesel", "Gasoline", "LPG", "Electric")
df %>%
  mutate(city = ifelse(!col1 %in% meas, as.character(col1), NA)) %>%
  tidyr::fill(city) %>%
  filter(col1 != city)
#>       col1 col2    city
#> 1   Diesel   80 Seattle
#> 2 Gasoline   NA Seattle
#> 3      LPG   10 Seattle
#> 4 Electric   10 Seattle
#> 5   Diesel   65  Boston
#> 6 Gasoline   25  Boston
#> 7 Electric   10  Boston
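The output above still uses the column names col1 and col2, with city as the last column. A minimal sketch of how the same idea could be extended to match the column layout of the desired result is shown here. The names is_header and tidy are illustrative and not part of the original answer, and the sketch assumes col1 is read as character (set explicitly with stringsAsFactors = FALSE) rather than as a factor.

```r
library(dplyr)
library(tidyr)

# Example data from the question; stringsAsFactors = FALSE keeps col1 as character
df <- data.frame(
  col1 = c("Seattle", "Diesel", "Gasoline", "LPG", "Electric",
           "Boston", "Diesel", "Gasoline", "Electric"),
  col2 = c(NA, 80, NA, 10, 10, NA, 65, 25, 10),
  stringsAsFactors = FALSE
)

# Known fuel types; any other value in col1 is assumed to be a city header
meas <- c("Diesel", "Gasoline", "LPG", "Electric")

tidy <- df %>%
  mutate(
    is_header = !col1 %in% meas,                     # TRUE on city header rows
    city = if_else(is_header, col1, NA_character_)   # city name on header rows only
  ) %>%
  fill(city) %>%                                     # carry each city down its section
  filter(!is_header) %>%                             # drop the header rows
  select(city, type = col1, value = col2)            # rename to the desired columns

tidy
```

Filtering on an explicit is_header flag, instead of comparing col1 with city, is a design choice made for this sketch: it keeps the header test in one place. Note that the headers cannot be detected from missing values in col2 alone, because a regular row such as Seattle's Gasoline also has NA there.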