Tidying datasets with multiple sections/headers at variable positions

Views: 67 Published: 2020/10/26 3:29:20 Tags: r dplyr

Question

Context

I am trying to read in and tidy an Excel file with multiple headers/sections placed at variable positions. The content of these headers needs to be added as a variable. The input files are relatively large Excel files, formatted with (human) readability in mind but little more than that.

Input:

Let's say the data set contains the distribution of car types (based on the fuel they use) for a number of cities. As you will see, in the original file the name of the city is used as a header (or divider, if you will). We need this header as a variable. Unfortunately, not all types are listed for every city, and some values are missing. Here's a fictional example set:

 df <- data.frame(
        col1= c("Seattle","Diesel","Gasoline","LPG","Electric","Boston","Diesel","Gasoline","Electric"),
        col2= c(NA, 80 ,NA,10,10,NA,65,25,10)
 )

      col1 col2
1  Seattle   NA
2   Diesel   80
3 Gasoline   NA
4      LPG   10
5 Electric   10
6   Boston   NA
7   Diesel   65
8 Gasoline   25
9 Electric   10

Desired result:

     city     type value
1 Seattle   Diesel    80
2 Seattle Gasoline    NA
3 Seattle      LPG    10
4 Seattle Electric    10
5  Boston   Diesel    65
6  Boston Gasoline    25
7  Boston Electric    10

What I tried

The closest I got was using dplyr's dense_rank() and lag(), but this was not an ideal solution.

Any input is much appreciated!

Recommended answer

Assuming you have a finite list of measures (Diesel, Electric, etc.), you can make a list to check against. Any value of col1 not in that set of measures is presumably a city. Extract those into a new column (note that col1 is a factor here, which is the data.frame() default before R 4.0, so I used as.character), fill the city names down, and remove the heading rows.

library(dplyr)

meas <- c("Diesel", "Gasoline", "LPG", "Electric")

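df %>%
  # rows whose col1 is not a known fuel type are city headers
  mutate(city = ifelse(!col1 %in% meas, as.character(col1), NA)) %>%
  # carry each city name down over the rows below it
  tidyr::fill(city) %>%
  # drop the header rows themselves
  filter(col1 != city)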
#>       col1 col2    city
#> 1   Diesel   80 Seattle
#> 2 Gasoline   NA Seattle
#> 3      LPG   10 Seattle
#> 4 Electric   10 Seattle
#> 5   Diesel   65  Boston
#> 6 Gasoline   25  Boston
#> 7 Electric   10  Boston
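
The output above still uses the original column names and order. To get all the way to the city / type / value layout from the question, the same idea can be finished with a renaming select(). This is only a sketch under a few assumptions: it rebuilds the example df with stringsAsFactors = FALSE (my addition, so that it behaves the same before and after R 4.0), it filters on membership in meas rather than comparing col1 to city, and the name tidy is just an illustrative choice.

```r
library(dplyr)
library(tidyr)

# Example data from the question; stringsAsFactors = FALSE is an assumption
# added here so col1 is a character vector on every R version.
df <- data.frame(
  col1 = c("Seattle", "Diesel", "Gasoline", "LPG", "Electric",
           "Boston", "Diesel", "Gasoline", "Electric"),
  col2 = c(NA, 80, NA, 10, 10, NA, 65, 25, 10),
  stringsAsFactors = FALSE
)

# Known fuel types; anything else in col1 is treated as a city header.
meas <- c("Diesel", "Gasoline", "LPG", "Electric")

tidy <- df %>%
  # copy the city name out of header rows, NA everywhere else
  mutate(city = ifelse(col1 %in% meas, NA_character_, col1)) %>%
  # carry each city name down to the rows below it
  fill(city) %>%
  # keep only the measurement rows
  filter(col1 %in% meas) %>%
  # reorder and rename to match the desired result
  select(city, type = col1, value = col2)

print(tidy)
```

Filtering on col1 %in% meas keeps the missing Gasoline value for Seattle as NA instead of dropping the row, which is why header rows cannot simply be detected by a missing col2.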
