Subset a data frame into 2 data frames based on breaks in the tables


Question

I download a CSV file that contains 3 different tables on the same sheet. I only need the top table and the bottom table, but the number of rows varies depending on when I download the file. I have attached an image of the file below: a CSV file with the 3 tables separated by blank rows.

What I am hoping to accomplish is to read the 1st table and the 3rd table as two separate data frames. I was hoping to use grep/grepl to get DF1 up to the 1st break (row 202) and to start DF2 after the 2nd break (row 212).

I know I can subset the data by going into the file and skipping and/or dropping rows, but I wanted to see if there is a method to automatically identify these tables and subset them.
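For reference, the grep-based plan from the question can be sketched as follows. This is a minimal sketch against a made-up two-table file; in the real file the same `grep` call would locate the break rows (202 and 212) instead of hard-coding them:

```r
# Minimal sketch of the grep/grepl plan on a made-up two-table file;
# the separator rows are found by matching lines that are empty or commas-only.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,11,21", "2,12,22",
             ",,,,,",
             "aa,bb,cc,dd", "31,41,51,61"), tmp)
txt    <- readLines(tmp)
breaks <- grep("^,*$", txt)                          # blank / comma-only separator rows
df1 <- read.csv(tmp, nrows = breaks[1] - 2)          # table 1: header plus rows before the break
df2 <- read.csv(tmp, skip = breaks[length(breaks)])  # last table: everything after the final break
```

This works for exactly two breaks known in advance; the answer below generalizes it to any number of tables.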

Answer

(The best fix is to correct the origin of this data: don't do this; it is better to use separate files or some other format. Lacking that ...)

I can only guess from an image, so here's a sample file.

a,b,c
1,11,21
2,12,22
,,,,,
aa,bb,cc,dd
31,41,51,61
,,,,,
aaa,bbb,ccc,ddd,eee,fff
111,222,333,444,555,666

Use this function:

##' Read multi-part CSV files.
##'
##' @details
##' A typical CSV file contains rows, unbroken by spaces,
##' with an equal number of columns separated by a fixed character
##' (typically "," or "\\t"). Occasionally, some rows are incomplete
##' (insufficient number of fields); this issue is handled by
##' \code{read.csv} directly with the \code{fill = TRUE} argument.
##'
##' Two other issues can arise in a seemingly compliant CSV file:
##'
##' \itemize{
##'
##'   \item{The header row is repeated multiple times throughout the
##' document. This likely spoils the results from \code{read.csv} by
##' forcing all columns to be factors or characters, instead of the
##' actual data (e.g., numeric, integer).}
##'
##'   \item{There are blank lines separating truly disparate tables.
##' With just \code{read.csv}, the blank lines will typically be
##' \code{fill}ed, all tables will be expanded to the width of the
##' widest table, and all headers will be from the first table.}
##' }
##'
##' This function mitigates both of these issues.
##'
##' NOTE: arguments passed to \code{read.csv} are used with all
##' tables, so if you have blank lines with disparate tables, the
##' presence or absence of headers will not be handled gracefully.
##' @param fname character or vector, the file name(s)
##' @param by.header logical (default TRUE), whether to split by identical header rows
##' @param by.space logical (default TRUE), whether to split by empty lines
##' @param ... arguments passed to \code{readLines} or \code{read.csv}
##' @return list, one named entry per filename, each containing a list
##' containing the recursive tables in the CSV file
##' @export
readMultiCSV <- function(fname, by.header = TRUE, by.space = TRUE, ...) {
    dots <- list(...)

    readlinesopts <- (names(dots) %in% names(formals(readLines)))
    readcsvopts <- (! readlinesopts) & (names(dots) %in% names(formals(read.csv)))

    ret <- lapply(fname, function(fn) {
        txt <- do.call(readLines, c(list(con = fn), dots[readlinesopts]))

        starts <- 1
        stops <- length(txt)   # default when 'by.space' is FALSE

        if (by.space) {
            starts <- sort(c(starts, 1 + which(txt == ''), 1 + grep("^,*$", txt)))
            stops <- c(starts[-1], length(txt) + 2) - 2
        }

        if (by.header) {
            morestarts <- unlist(mapply(
                function(x,y)
                    if ((x+1) < y)
                        x + which(txt[x] == txt[(x+1):y]),
                starts,
                ## I do "- 2" to remove the empty lines found in the by.space block
                c(starts[-1], length(txt) + 2) - 2, SIMPLIFY = TRUE))
            starts <- sort(c(starts, morestarts))
            stops <- sort(c(stops, morestarts - 1))
        }

        ## filter out empty ranges
        nonEmpties <- (stops - starts) > 0
        starts <- starts[nonEmpties]
        stops <- stops[nonEmpties]

        mapply(function(x,y) do.call(read.csv, c(list(file = fn, skip = x-1, nrows = y-x), dots[readcsvopts])),
               starts, stops, SIMPLIFY = FALSE)
    })
    names(ret) <- basename(fname)
    ret
}

Demonstration:

readMultiCSV("~/StackOverflow/11815793/61091149.csv")
# $`61091149.csv`
# $`61091149.csv`[[1]]
#   a  b  c
# 1 1 11 21
# 2 2 12 22
# $`61091149.csv`[[2]]
#   aa bb cc dd
# 1 31 41 51 61
# $`61091149.csv`[[3]]
#   aaa bbb ccc ddd eee fff
# 1 111 222 333 444 555 666

Excel will often out-smart us and instead pad every table with trailing commas out to the right-most edge of the widest table. That gives us a file like:

a,b,c,,,
1,11,21,,,
2,12,22,,,
,,,,,
aa,bb,cc,dd,,
31,41,51,61,,
,,,,,
aaa,bbb,ccc,ddd,eee,fff
111,222,333,444,555,666

This doesn't break anything; it just gives you more work on the back side:

readMultiCSV("~/StackOverflow/11815793/61091149.csv")
# $`61091149.csv`
# $`61091149.csv`[[1]]
#   a  b  c  X X.1 X.2
# 1 1 11 21 NA  NA  NA
# 2 2 12 22 NA  NA  NA
# $`61091149.csv`[[2]]
#   aa bb cc dd  X X.1
# 1 31 41 51 61 NA  NA
# $`61091149.csv`[[3]]
#   aaa bbb ccc ddd eee fff
# 1 111 222 333 444 555 666
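That back-side work can be a one-liner. A hypothetical cleanup helper (the name `drop_na_cols` is mine, not part of the answer above) drops the all-NA filler columns that Excel's trailing commas produce:

```r
# Hypothetical post-processing helper (not part of the answer above):
# drop columns that are entirely NA, i.e. the filler columns created
# by Excel's trailing commas.
drop_na_cols <- function(d) d[, colSums(!is.na(d)) > 0, drop = FALSE]

d <- data.frame(a = 1:2, b = c(11, 12), X = NA, X.1 = NA)
drop_na_cols(d)   # keeps only columns a and b
```

Applied to the output of `readMultiCSV`, this would be `lapply(res[[1]], drop_na_cols)`.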

