Faster way to read fixed-width files

Problem Description

I work with a lot of fixed width files (i.e., no separating character) that I need to read into R. So, there is usually a definition of the column width to parse the string into variables. I can use read.fwf to read in the data without a problem. However, for large files, this can take a long time. For a recent dataset, this took 800 seconds to read in a dataset with ~500,000 rows and 143 variables.

seer9 <- read.fwf("~/data/rawdata.txt", 
  widths = cols,
  header = FALSE,
  buffersize = 250000,
  colClasses = "character",
  stringsAsFactors = FALSE)

fread in the data.table package in R is awesome for solving most data read problems, except it doesn't parse fixed width files. However, I can read each line in as a single character string (~500,000 rows, 1 column). This takes 3-5 seconds. (I love data.table.)

seer9 <- fread("~/data/rawdata.txt", colClasses = "character",
               sep = "\n", header = FALSE, verbose = TRUE)

There are a number of good posts on SO on how to parse text files. See JHoward's suggestion here, to create a matrix of start and end columns, and substr to parse the data. See GSee's suggestion here to use strsplit. I couldn't figure out how to make that work with this data. (Also, Michael Smith made some suggestions on the data.table mailing list involving sed that were beyond my ability to implement.) Now, using fread and substr() I can do the whole thing in about 25-30 seconds. Note that coercing to a data.table at the end takes a chunk of time (~5 sec).

end_col <- cumsum(cols)
start_col <- end_col - cols + 1
start_end <- cbind(start_col, end_col) # matrix of start and end positions
text <- lapply(seer9, function(x) {
        apply(start_end, 1, function(y) substr(x, y[1], y[2])) 
        })
dt <- data.table(text$V1)
setnames(dt, old = 1:ncol(dt), new = seervars)

What I am wondering is whether this can be improved any further? I know I am not the only one who has to read fixed width files, so if this could be made faster, it would make loading even larger files (with millions of rows) more tolerable. I tried using parallel with mclapply and data.table instead of lapply, but those didn't change anything. (Likely due to my inexperience in R.) I imagine that an Rcpp function could be written to do this really fast, but that is beyond my skill set. Also, I may not be using lapply and apply appropriately.
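
For what it's worth, a hedged sketch of what that parallel attempt might look like (assumes a Unix-alike, since mclapply forks; the 4-way chunking and core count are illustrative, not from the original post):

library(parallel)
library(data.table)

# split the single character column into chunks and parse each chunk in
# parallel; substr() is vectorized, so each worker handles a whole chunk
chunks <- split(seer9$V1, cut(seq_along(seer9$V1), 4, labels = FALSE))
parsed <- mclapply(chunks, function(x) {
  apply(start_end, 1, function(y) substr(x, y[1], y[2]))
}, mc.cores = 4)
dt <- data.table(do.call(rbind, parsed))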

My data.table implementation (with magrittr chaining) takes the same time:

text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>% 
  data.table(.)

Can anyone make suggestions to improve the speed of this? Or is this about as good as it gets?

Here is code to create a similar data.table within R (rather than linking to actual data). Each row should have 331 characters, and there should be 500,000 rows. There are spaces to simulate missing fields in the data, but this is NOT space-delimited data. (I am reading raw SEER data, in case anyone is interested.) I am also including column widths (cols) and variable names (seervars) in case this helps someone else. These are the actual column and variable definitions for SEER data.

seer9 <-
  data.table(rep((paste0(paste0(letters, 1000:1054, " ", collapse = ""), " ")),
                 500000))

cols = c(8,10,1,2,1,1,1,3,4,3,2,2,4,4,1,4,1,4,1,1,1,1,3,2,2,1,2,2,13,2,4,1,1,1,1,3,3,3,2,3,3,3,3,3,3,3,2,2,2,2,1,1,1,1,1,6,6,6,2,1,1,2,1,1,1,1,1,2,2,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,5,4,10,3,3,2,2,2,3,1,1,1,1,2,2,1,1,2,1,9,5,5,1,1,1,2,2,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,1,4,1,4,1,1,3,3,3,3,2,2,2,2)
seervars <- c("CASENUM", "REG", "MAR_STAT", "RACE", "ORIGIN", "NHIA", "SEX", "AGE_DX", "YR_BRTH", "PLC_BRTH", "SEQ_NUM", "DATE_mo", "DATE_yr", "SITEO2V", "LATERAL", "HISTO2V", "BEHO2V", "HISTO3V", "BEHO3V", "GRADE", "DX_CONF", "REPT_SRC", "EOD10_SZ", "EOD10_EX", "EOD10_PE", "EOD10_ND", "EOD10_PN", "EOD10_NE", "EOD13", "EOD2", "EOD4", "EODCODE", "TUMOR_1V", "TUMOR_2V", "TUMOR_3V", "CS_SIZE", "CS_EXT", "CS_NODE", "CS_METS", "CS_SSF1", "CS_SSF2", "CS_SSF3", "CS_SSF4", "CS_SSF5", "CS_SSF6", "CS_SSF25", "D_AJCC_T", "D_AJCC_N", "D_AJCC_M", "D_AJCC_S", "D_SSG77", "D_SSG00", "D_AJCC_F", "D_SSG77F", "D_SSG00F", "CSV_ORG", "CSV_DER", "CSV_CUR", "SURGPRIM", "SCOPE", "SURGOTH", "SURGNODE", "RECONST", "NO_SURG", "RADIATN", "RAD_BRN", "RAD_SURG", "SS_SURG", "SRPRIM02", "SCOPE02", "SRGOTH02", "REC_NO", "O_SITAGE", "O_SEQCON", "O_SEQLAT", "O_SURCON", "O_SITTYP", "H_BENIGN", "O_RPTSRC", "O_DFSITE", "O_LEUKDX", "O_SITBEH", "O_EODDT", "O_SITEOD", "O_SITMOR", "TYPEFUP", "AGE_REC", "SITERWHO", "ICDOTO9V", "ICDOT10V", "ICCC3WHO", "ICCC3XWHO", "BEHANAL", "HISTREC", "BRAINREC", "CS0204SCHEMA", "RAC_RECA", "RAC_RECY", "NHIAREC", "HST_STGA", "AJCC_STG", "AJ_3SEER", "SSG77", "SSG2000", "NUMPRIMS", "FIRSTPRM", "STCOUNTY", "ICD_5DIG", "CODKM", "STAT_REC", "IHS", "HIST_SSG_2000", "AYA_RECODE", "LYMPHOMA_RECODE", "DTH_CLASS", "O_DTH_CLASS", "EXTEVAL", "NODEEVAL", "METSEVAL", "INTPRIM", "ERSTATUS", "PRSTATUS", "CSSCHEMA", "CS_SSF8", "CS_SSF10", "CS_SSF11", "CS_SSF13", "CS_SSF15", "CS_SSF16", "VASINV", "SRV_TIME_MON", "SRV_TIME_MON_FLAG", "SRV_TIME_MON_PA", "SRV_TIME_MON_FLAG_PA", "INSREC_PUB", "DAJCC7T", "DAJCC7N", "DAJCC7M", "DAJCC7STG", "ADJTM_6VALUE", "ADJNM_6VALUE", "ADJM_6VALUE", "ADJAJCCSTG")

UPDATE: LaF did the entire read in just under 7 seconds from the raw .txt file. Maybe there is an even faster way, but I doubt anything could do appreciably better. Amazing package.
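
For reference, a minimal sketch of the LaF read (the original post does not show the call, so the path and the use of the cols/seervars definitions below are assumptions):

library(LaF)

laf <- laf_open_fwf("~/data/rawdata.txt",
                    column_widths = cols,
                    column_types = rep("character", length(cols)),
                    column_names = seervars)
seer9_laf <- laf[ , ]  # bracket indexing reads the whole file into a data.frame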

27 July 2015 Update Just wanted to provide a small update to this. I used the new readr package, and I was able to read in the entire file in 5 seconds using readr::read_fwf.

seer9_readr <- read_fwf("path_to_data/COLRECT.TXT",
  col_positions = fwf_widths(cols))

Also, the updated stringi::stri_sub function is at least twice as fast as base::substr(). So, in the code above that uses fread to read the file (about 4 seconds), followed by apply to parse each line, the extraction of 143 variables took about 8 seconds with stringi::stri_sub compared to 19 for base::substr. So, fread plus stri_sub is still only about 12 seconds to run. Not bad.

seer9 <-  fread("path_to_data/COLRECT.TXT",     
  colClasses = "character", 
  sep = "\n", 
  header = FALSE)
text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>% 
  data.table(.)
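
The stri_sub timing above comes from swapping stringi::stri_sub() into the same one-liner (a sketch; the original post shows only the substr() version):

library(stringi)
library(magrittr)

text <- seer9[ , apply(start_end, 1, function(y) stri_sub(V1, y[1], y[2]))] %>% 
  data.table(.)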

10 Dec 2015 update:

Please also see the answer below by @MichaelChirico who has added some great benchmarks and the iotools package.

Answer

Now that there are (between this and the other major question about effective reading of fixed-width files) a fair number of options on offer for reading in such files, I think some benchmarking is appropriate.

I'll use the following on-the-large-side (400 MB) file for comparison. It's just a bunch of random characters with randomly defined fields and widths:

set.seed(21394)
wwidth = 400L
rrows = 1000000
    
# create the contents at random, then write them to file
contents = replicate(
  rrows,
  paste0(sample(letters, wwidth, replace = TRUE), collapse = "")
)
write.table(contents, file = "testfwf.txt",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
    
#defining the fields & writing a dictionary
n_fields = 40L
endpoints = unique(
  c(1L, sort(sample(wwidth, n_fields - 1L)), wwidth + 1L)
)
cols = list(
  beg = endpoints[-(n_fields + 1L)], 
  end = endpoints[-1L] - 1L
)
    
dict = data.frame(
  column = paste0("V", seq_len(length(endpoints) - 1L)),
  start = endpoints[-length(endpoints)] - 1,
  length = diff(endpoints)
)
    
write.csv(dict, file = "testdic.csv", quote = FALSE, row.names = FALSE)

I'll compare five methods mentioned between these two threads (I'll add some others if the authors would like): the base version (read.fwf), piping the result of in2csv to fread (@AnandaMahto's suggestion), Hadley's new readr (read_fwf), the LaF/ffbase approach (@jwijffls' suggestion), and an improved (streamlined) version of the one suggested by the question author (@MarkDanese), combining fread with stri_sub from stringi.

Here is the benchmark code:

library(data.table)
library(stringi)
library(readr)
library(LaF)
library(ffbase)
library(microbenchmark)
    
microbenchmark(
  times = 5L,
  utils = read.fwf("testfwf.txt", diff(endpoints), header = FALSE),
  in2csv = fread(cmd = sprintf(
    "in2csv -f fixed -s %s %s",
    "testdic.csv", "testfwf.txt"
  )),
  readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
  LaF = {
    my.data.laf = laf_open_fwf(
      'testfwf.txt', 
      column_widths = diff(endpoints),
      column_types = rep("character", length(endpoints) - 1L)
    )
    my.data = laf_to_ffdf(my.data.laf, nrows = rrows)
    as.data.frame(my.data)
  },
  fread = {
    DT = fread("testfwf.txt", header = FALSE, sep = "\n")
    DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]
  }
)

And the output:

# Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#   utils 423.76786 465.39212 499.00109 501.87568 543.12382 560.84598     5   c
#  in2csv  67.74065  68.56549  69.60069  70.11774  70.18746  71.39210     5 a  
#   readr  10.57945  11.32205  15.70224  14.89057  19.54617  22.17298     5 a  
#     LaF 207.56267 236.39389 239.45985 237.96155 238.28316 277.09798     5  b 
#   fread  14.42617  15.44693  26.09877  15.76016  20.45481  64.40581     5 a  

So it seems readr and fread + stri_sub are pretty competitive as the fastest; built-in read.fwf is the clear loser.

Note that the real advantage of readr here is that you can pre-specify column types; with fread you'll have to type-convert afterwards.
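
For example, a hedged sketch of pre-specifying types in readr (the col_types value is illustrative, not from the original answer; the readr::cols() function is distinct from the cols list defined earlier):

library(readr)

seer9_typed = read_fwf(
  "testfwf.txt",
  col_positions = fwf_widths(diff(endpoints)),
  col_types = readr::cols(.default = readr::col_character())
)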

At @AnandaMahto's suggestion I am including some more options, including one that appears to be a new winner! To save time I excluded the slowest options above in the new comparison. Here's the new code:

library(iotools)
    
microbenchmark(
  times = 5L,
  readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
  fread = {
    DT = fread("testfwf.txt", header = FALSE, sep = "\n")
    DT[ , lapply(seq_len(length(cols$beg)), function(ii) {
      stri_sub(V1, cols$beg[ii], cols$end[ii])
    })]
  },
  iotools = input.file(
    "testfwf.txt", formatter = dstrfw, 
    col_types = rep("character", length(endpoints) - 1L), 
    widths = diff(endpoints)
  ),
  awk = fread(header = FALSE, cmd = sprintf(
    "awk -v FIELDWIDTHS='%s' -v OFS=', ' '{$1=$1 \"\"; print}' < testfwf.txt",
    paste(diff(endpoints), collapse = " ")
  ))
)

And the new output:

# Unit: seconds
#     expr       min        lq      mean    median        uq       max neval cld
#    readr  7.892527  8.016857 10.293371  9.527409  9.807145 16.222916     5  a 
#    fread  9.652377  9.696135  9.796438  9.712686  9.807830 10.113160     5  a 
#  iotools  5.900362  7.591847  7.438049  7.799729  7.845727  8.052579     5  a 
#      awk 14.440489 14.457329 14.637879 14.472836 14.666587 15.152156     5   b

So it appears iotools is both very fast and very consistent.
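
Applied back to the question's SEER file, the iotools call would look something like this (a sketch: the path is the question's, and cols here means the question's vector of column widths, not the beg/end list used in the benchmark):

library(iotools)

seer9_io = input.file("~/data/rawdata.txt",
                      formatter = dstrfw,
                      col_types = rep("character", length(cols)),
                      widths = cols)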
