How to skip extra lines before the header of a tab-delimited file in R


Question

The software I am using produces log files that contain a variable number of lines of summary information followed by a large amount of tab-delimited data. I am trying to write a function that reads the data from these log files into a data frame while ignoring the summary information. The summary information never contains a tab, so the following function works:

read.parameters <- function(file.name, ...){
  lines <- scan(file.name, what="character", sep="\n")
  first.line <- min(grep("\\t", lines))
  return(read.delim(file.name, skip=first.line-1, ...))
}

However, these log files are quite big, so reading the file twice is very slow. Surely there is a better way?

Edited to add:

Marek suggested using a textConnection object. The approach suggested in his answer fails on a big file, but the following works:

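read.parameters <- function(file.name, ...){
  conn <- file(file.name, "r")
  on.exit(close(conn))
  repeat{
    line <- readLines(conn, 1)
    # At end of file readLines returns character(0); stop instead of looping forever.
    if (length(line) == 0) stop("no tab-delimited line found in ", file.name)
    # The first line containing a tab is the header: push it back for read.delim.
    if (length(grep("\\t", line))) {
      pushBack(line, conn)
      break
    }
  }
  df <- read.delim(conn, ...)
  return(df)
}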
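As a rough illustration of the connection-based approach above, here is a minimal self-contained sketch run against a tiny temporary file. The helper name `read_after_summary`, the file contents, and the column names are all made up for the example; real log files will of course look different.

```r
# Hypothetical helper: skip leading lines without a tab, then parse the rest.
read_after_summary <- function(file.name, ...) {
  conn <- file(file.name, "r")
  on.exit(close(conn))
  repeat {
    line <- readLines(conn, 1)
    if (length(line) == 0) stop("no tab-delimited line found")
    if (grepl("\t", line, fixed = TRUE)) {
      pushBack(line, conn)  # return the header line to the connection
      break
    }
  }
  read.delim(conn, ...)
}

# Made-up sample log: two summary lines, then a header and two data rows.
log.file <- tempfile(fileext = ".log")
writeLines(c("Summary line one",
             "Summary line two",
             "id\tvalue",
             "1\t10.5",
             "2\t20.25"), log.file)

df <- read_after_summary(log.file)
print(df)
unlink(log.file)
```

Because the file is read through a single open connection, the summary lines are consumed once and the data lines are parsed directly, so nothing is read twice.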

Edited again: Thanks to Marek for further improving the function above.

Recommended answer

You don't need to read the file twice. Call textConnection on the lines returned by the first read and pass that to read.delim:

read.parameters <- function(file.name, ...){
  lines <- scan(file.name, what="character", sep="\n") # the question originally had "tmp.log" here; file.name is presumably what was meant
  first.line <- min(grep("\\t", lines))
  return(read.delim(textConnection(lines), skip=first.line-1, ...))
}
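To make the `skip=first.line-1` step concrete, the sketch below applies the same logic to a small in-memory character vector instead of a file. The vector `log.lines` and its contents are invented for illustration; the only assumption carried over from the question is that summary lines never contain a tab.

```r
# Made-up lines standing in for the result of scan(): two summary lines,
# then a tab-delimited header and one data row.
log.lines <- c("summary text",
               "more summary text",
               "a\tb",
               "1\t2")

# grep() returns the indices of every line containing a tab;
# the smallest index is the header line.
first.line <- min(grep("\\t", log.lines))

# Skipping first.line - 1 lines leaves the header as the first line read.
tc <- textConnection(log.lines)
df <- read.delim(tc, skip = first.line - 1)
close(tc)  # release the text connection explicitly

print(first.line)
print(df)
```

Opening the text connection explicitly and closing it afterwards is a small design choice here: it avoids leaving an unused connection behind once the data frame has been built.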
