Tidying datasets with multiple sections/headers at variable positions

Problem description

Context

I am trying to read in and tidy an Excel file with multiple headers/sections placed at variable positions. The content of these headers needs to be added as a variable. The input files are relatively large Excel files that are formatted with (human) readability in mind, but little more than that.

Input:

Let's say the data set contains the distribution of car types (based on the fuel they use) for a number of cities. As you will see, in the original file the name of the city is used as a header (or divider, if you will). We need this header as a variable. Unfortunately, not all types are listed for every city and some values are missing. Here's a fictional example set:

df <- data.frame(
  col1 = c("Seattle", "Diesel", "Gasoline", "LPG", "Electric", "Boston", "Diesel", "Gasoline", "Electric"),
  col2 = c(NA, 80, NA, 10, 10, NA, 65, 25, 10)
)



      col1 col2
1  Seattle   NA
2   Diesel   80
3 Gasoline   NA
4      LPG   10
5 Electric   10
6   Boston   NA
7   Diesel   65
8 Gasoline   25
9 Electric   10

Desired result:

     city     type value
1 Seattle   Diesel    80
2 Seattle Gasoline    NA
3 Seattle      LPG    10
4 Seattle Electric    10
5  Boston   Diesel    65
6  Boston Gasoline    25
7  Boston Electric    10

My attempt

The closest I got was using dplyr's dense_rank() and lag(), but this was not an ideal solution.

Any input is much appreciated!

Recommended answer

Assuming you have a finite list of measures (Diesel, Electric, etc.), you can make a vector to check against. Any value of col1 not in that set of measures is presumably a city. Extract those into a new column (note that col1 is currently a factor, so I used as.character), fill down, and remove the heading rows.

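library(dplyr)

# Known fuel types; any col1 value outside this set is treated as a city header
meas <- c("Diesel", "Gasoline", "LPG", "Electric")

df %>%
  # Copy the city name onto header rows, NA elsewhere
  mutate(city = ifelse(!col1 %in% meas, as.character(col1), NA)) %>%
  # Carry each city name down to the rows below it
  tidyr::fill(city) %>%
  # Drop the header rows themselves
  filter(col1 != city)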
#>       col1 col2    city
#> 1   Diesel   80 Seattle
#> 2 Gasoline   NA Seattle
#> 3      LPG   10 Seattle
#> 4 Electric   10 Seattle
#> 5   Diesel   65  Boston
#> 6 Gasoline   25  Boston
#> 7 Electric   10  Boston
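
The output above has the right rows but not the column names or order asked for in the question (city, type, value). A minimal sketch of one way to get there, assuming the same example df and the same meas vector: it is a variation on the answer, not part of it. It filters header rows with col1 %in% meas instead of col1 != city, and renames inside select(). The name tidy is only illustrative.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  col1 = c("Seattle", "Diesel", "Gasoline", "LPG", "Electric",
           "Boston", "Diesel", "Gasoline", "Electric"),
  col2 = c(NA, 80, NA, 10, 10, NA, 65, 25, 10)
)

# Assumption: the full set of fuel types is known in advance
meas <- c("Diesel", "Gasoline", "LPG", "Electric")

tidy <- df %>%
  mutate(
    col1 = as.character(col1),                            # works whether or not col1 is a factor
    city = if_else(col1 %in% meas, NA_character_, col1)   # city name on header rows only
  ) %>%
  fill(city) %>%                                          # carry the city name downwards
  filter(col1 %in% meas) %>%                              # keep only the measurement rows
  select(city, type = col1, value = col2)                 # rename and reorder

print(tidy)
```

Like the answer, this only works if no city name matches a fuel type and every fuel type that appears in the file is listed in meas. An unlisted type would be mistaken for a city.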
