Faster way to read fixed-width files

Question

I work with a lot of fixed width files (i.e., no separating character) that I need to read into R. So, there is usually a definition of the column width to parse the string into variables. I can use read.fwf to read in the data without a problem. However, for large files, this can take a long time. For a recent dataset, this took 800 seconds to read in a dataset with ~500,000 rows and 143 variables.

seer9 <- read.fwf("~/data/rawdata.txt", 
  widths = cols,
  header = FALSE,
  buffersize = 250000,
  colClasses = "character",
  stringsAsFactors = FALSE)

fread in the data.table package in R is awesome for solving most data read problems, except it doesn't parse fixed width files. However, I can read each line in as a single character string (~500,000 rows, 1 column). This takes 3-5 seconds. (I love data.table.)

seer9 <- fread("~/data/rawdata.txt", colClasses = "character",
               sep = "\n", header = FALSE, verbose = TRUE)

There are a number of good posts on SO on how to parse text files. See JHoward's suggestion here to create a matrix of start and end columns and use substr to parse the data, and GSee's suggestion here to use strsplit. I couldn't figure out how to make those work with this data. (Also, Michael Smith made some suggestions on the data.table mailing list involving sed that were beyond my ability to implement.) Now, using fread and substr() I can do the whole thing in about 25-30 seconds. Note that coercing to a data.table at the end takes a chunk of time (5 sec?).

end_col <- cumsum(cols)
start_col <- end_col - cols + 1
start_end <- cbind(start_col, end_col) # matrix of start and end positions
# seer9 has a single column (V1) of raw lines; substr() out each field per row
text <- lapply(seer9, function(x) {
        apply(start_end, 1, function(y) substr(x, y[1], y[2])) 
        })
dt <- data.table(text$V1) # text$V1 is a rows-by-fields character matrix
setnames(dt, old = 1:ncol(dt), new = seervars)

What I am wondering is whether this can be improved any further? I know I am not the only one who has to read fixed-width files, so if this could be made faster, it would make loading even larger files (with millions of rows) more tolerable. I tried using parallel with mclapply and data.table instead of lapply, but those didn't change anything. (Likely due to my inexperience in R.) I imagine that an Rcpp function could be written to do this really fast, but that is beyond my skill set. Also, I may not be using lapply and apply appropriately.
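
For reference, a minimal sketch of what the mclapply attempt might have looked like (a reconstruction, not the exact code I ran; mc.cores = 4 is an arbitrary choice, and mclapply forks, so it runs serially on Windows):

library(parallel)
library(data.table)

# Parse one field per list element, splitting the work across cores;
# start_end and seervars are as defined above.
text_par <- mclapply(seq_len(nrow(start_end)), function(i) {
  substr(seer9$V1, start_end[i, 1], start_end[i, 2])
}, mc.cores = 4)
dt <- setDT(text_par) # convert the list of columns to a data.table by reference
setnames(dt, seervars)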

My data.table implementation (with magrittr chaining) takes the same time:

text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>% 
  data.table(.)

Can anyone make suggestions to improve the speed of this? Or is this about as good as it gets?

Here is code to create a similar data.table within R (rather than linking to actual data). It should have 331 characters, and 500,000 rows. There are spaces to simulate missing fields in the data, but this is NOT space delimited data. (I am reading raw SEER data, in case anyone is interested.) Also including column widths (cols) and variable names (seervars) in case this helps someone else. These are the actual column and variable definitions for SEER data.

seer9 <-
  data.table(rep((paste0(paste0(letters, 1000:1054, " ", collapse = ""), " ")),
                 500000))

cols = c(8,10,1,2,1,1,1,3,4,3,2,2,4,4,1,4,1,4,1,1,1,1,3,2,2,1,2,2,13,2,4,1,1,1,1,3,3,3,2,3,3,3,3,3,3,3,2,2,2,2,1,1,1,1,1,6,6,6,2,1,1,2,1,1,1,1,1,2,2,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,5,4,10,3,3,2,2,2,3,1,1,1,1,2,2,1,1,2,1,9,5,5,1,1,1,2,2,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,1,4,1,4,1,1,3,3,3,3,2,2,2,2)
seervars <- c("CASENUM", "REG", "MAR_STAT", "RACE", "ORIGIN", "NHIA", "SEX", "AGE_DX", "YR_BRTH", "PLC_BRTH", "SEQ_NUM", "DATE_mo", "DATE_yr", "SITEO2V", "LATERAL", "HISTO2V", "BEHO2V", "HISTO3V", "BEHO3V", "GRADE", "DX_CONF", "REPT_SRC", "EOD10_SZ", "EOD10_EX", "EOD10_PE", "EOD10_ND", "EOD10_PN", "EOD10_NE", "EOD13", "EOD2", "EOD4", "EODCODE", "TUMOR_1V", "TUMOR_2V", "TUMOR_3V", "CS_SIZE", "CS_EXT", "CS_NODE", "CS_METS", "CS_SSF1", "CS_SSF2", "CS_SSF3", "CS_SSF4", "CS_SSF5", "CS_SSF6", "CS_SSF25", "D_AJCC_T", "D_AJCC_N", "D_AJCC_M", "D_AJCC_S", "D_SSG77", "D_SSG00", "D_AJCC_F", "D_SSG77F", "D_SSG00F", "CSV_ORG", "CSV_DER", "CSV_CUR", "SURGPRIM", "SCOPE", "SURGOTH", "SURGNODE", "RECONST", "NO_SURG", "RADIATN", "RAD_BRN", "RAD_SURG", "SS_SURG", "SRPRIM02", "SCOPE02", "SRGOTH02", "REC_NO", "O_SITAGE", "O_SEQCON", "O_SEQLAT", "O_SURCON", "O_SITTYP", "H_BENIGN", "O_RPTSRC", "O_DFSITE", "O_LEUKDX", "O_SITBEH", "O_EODDT", "O_SITEOD", "O_SITMOR", "TYPEFUP", "AGE_REC", "SITERWHO", "ICDOTO9V", "ICDOT10V", "ICCC3WHO", "ICCC3XWHO", "BEHANAL", "HISTREC", "BRAINREC", "CS0204SCHEMA", "RAC_RECA", "RAC_RECY", "NHIAREC", "HST_STGA", "AJCC_STG", "AJ_3SEER", "SSG77", "SSG2000", "NUMPRIMS", "FIRSTPRM", "STCOUNTY", "ICD_5DIG", "CODKM", "STAT_REC", "IHS", "HIST_SSG_2000", "AYA_RECODE", "LYMPHOMA_RECODE", "DTH_CLASS", "O_DTH_CLASS", "EXTEVAL", "NODEEVAL", "METSEVAL", "INTPRIM", "ERSTATUS", "PRSTATUS", "CSSCHEMA", "CS_SSF8", "CS_SSF10", "CS_SSF11", "CS_SSF13", "CS_SSF15", "CS_SSF16", "VASINV", "SRV_TIME_MON", "SRV_TIME_MON_FLAG", "SRV_TIME_MON_PA", "SRV_TIME_MON_FLAG_PA", "INSREC_PUB", "DAJCC7T", "DAJCC7N", "DAJCC7M", "DAJCC7STG", "ADJTM_6VALUE", "ADJNM_6VALUE", "ADJM_6VALUE", "ADJAJCCSTG")

UPDATE: LaF did the entire read in just under 7 seconds from the raw .txt file. Maybe there is an even faster way, but I doubt anything could do appreciably better. Amazing package.
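
For completeness, a minimal sketch of the LaF approach (reconstructed, assuming the cols widths and seervars names defined above; the arguments in my actual run may have differed slightly):

library(LaF)
laf <- laf_open_fwf("~/data/rawdata.txt",
                    column_widths = cols,
                    column_types = rep("character", length(cols)),
                    column_names = seervars)
seer9_laf <- laf[ , ] # bracket-indexing an open LaF object reads the rows into a data.frame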

27 July 2015 Update Just wanted to provide a small update to this. I used the new readr package, and I was able to read in the entire file in 5 seconds using readr::read_fwf.

seer9_readr <- read_fwf("path_to_data/COLRECT.TXT",
  col_positions = fwf_widths(cols))

Also, the updated stringi::stri_sub function is at least twice as fast as base::substr(). So, in the code above that uses fread to read the file (about 4 seconds), followed by apply to parse each line, the extraction of 143 variables took about 8 seconds with stringi::stri_sub compared to 19 seconds with base::substr. So, fread plus stri_sub still runs in only about 12 seconds. Not bad.

seer9 <-  fread("path_to_data/COLRECT.TXT",     
  colClasses = "character", 
  sep = "\n", 
  header = FALSE)
text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>% 
  data.table(.)
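
Swapping in stri_sub is a one-line change to the parsing step (a sketch, assuming the start_end matrix and magrittr are already in place; everything else matches the block above):

library(stringi)
text <- seer9[ , apply(start_end, 1, function(y) stri_sub(V1, y[1], y[2]))] %>% 
  data.table(.)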

10 Dec 2015 update:

Please also see the answer below by @MichaelChirico who has added some great benchmarks and the iotools package.

Answer

Now that there are (between this and the other major question about effective reading of fixed-width files) a fair number of options on offer for reading such files, I think some benchmarking is appropriate.

I'll use the following largish (400 MB) file for comparison. It's just a bunch of random characters with randomly defined fields and widths:

set.seed(21394)
wwidth = 400L
rrows = 1000000

#creating the contents at random
contents = 
  write.table(replicate(rrows, paste0(sample(letters, wwidth, replace = TRUE),
                                      collapse = "")), file="testfwf.txt",
              quote = FALSE, row.names = FALSE, col.names = FALSE)

#defining the fields & writing a dictionary
n_fields = 40L
endpoints = unique(c(1L, sort(sample(wwidth, n_fields - 1L)), wwidth + 1L))
cols = list(beg = endpoints[-(n_fields + 1L)],
            end = endpoints[-1L] - 1L)

dict = data.frame(column = paste0("V", seq_len(length(endpoints)) - 1L),
                  start = endpoints[-length(endpoints)] - 1,
                  length = diff(endpoints))

write.csv(dict, file = "testdic.csv", quote = FALSE, row.names = FALSE)

I'll compare five methods mentioned between these two threads (I'll add some others if the authors would like): the base version (read.fwf), piping the result of in2csv to fread (@AnandaMahto's suggestion), Hadley's new readr (read_fwf), the one using LaF/ffbase (@jwijffls' suggestion), and an improved (streamlined) version of the one suggested by the question author (@MarkDanese), combining fread with stri_sub from stringi.

Here is the benchmarking code:

library(data.table)
library(stringi)
library(readr)
library(LaF); library(ffbase)
library(microbenchmark)

microbenchmark(times = 5L,
               utils = read.fwf("testfwf.txt", diff(endpoints), header = FALSE),
               in2csv = 
                 fread(paste("in2csv -f fixed -s",
                             "~/Desktop/testdic.csv",
                             "~/Desktop/testfwf.txt")),
               readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
               LaF = {
                 my.data.laf = 
                   laf_open_fwf('testfwf.txt', column_widths=diff(endpoints),
                                column_types = rep("character", 
                                                   length(endpoints) - 1L))
                 my.data = laf_to_ffdf(my.data.laf, nrows = rrows)
                 as.data.frame(my.data)},
               fread = fread(
                 "testfwf.txt", header = FALSE, sep = "\n"
                 )[ , lapply(seq_len(length(cols$beg)),
                             function(ii) 
                               stri_sub(V1, cols$beg[ii], cols$end[ii]))])

The output:

# Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#   utils 423.76786 465.39212 499.00109 501.87568 543.12382 560.84598     5   c
#  in2csv  67.74065  68.56549  69.60069  70.11774  70.18746  71.39210     5 a  
#   readr  10.57945  11.32205  15.70224  14.89057  19.54617  22.17298     5 a  
#     LaF 207.56267 236.39389 239.45985 237.96155 238.28316 277.09798     5  b 
#   fread  14.42617  15.44693  26.09877  15.76016  20.45481  64.40581     5 a  

So it seems readr and fread + stri_sub are pretty competitive as the fastest; built-in read.fwf is the clear loser.

Note that the real advantage of readr here is that you can pre-specify column types; with fread you'll have to type convert afterwards.
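
For example, something along these lines should force every field to character up front (a sketch using the benchmark's widths; the compact "c" string is readr's shorthand for character columns):

dat <- read_fwf("testfwf.txt",
                col_positions = fwf_widths(diff(endpoints)),
                col_types = strrep("c", length(diff(endpoints))))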

At @AnandaMahto's suggestion I am including some more options, including one that appears to be a new winner! To save time I excluded the slowest options above in the new comparison. Here's the new code:

library(iotools)

microbenchmark(times = 5L,
               readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
               fread = fread(
                 "testfwf.txt", header = FALSE, sep = "\n"
                 )[ , lapply(seq_len(length(cols$beg)),
                             function(ii) 
                               stri_sub(V1, cols$beg[ii], cols$end[ii]))],
               iotools = input.file("testfwf.txt", formatter = dstrfw, 
                                    col_types = rep("character",
                                                    length(endpoints) - 1L), 
                                    widths = diff(endpoints)),
               awk = fread(paste(
                 "awk -v FIELDWIDTHS='", 
                 paste(diff(endpoints), collapse = " "), 
                 "' -v OFS=', ' '{$1=$1 \"\"; print}' < ~/Desktop/testfwf.txt", 
                 collapse = " "), header = FALSE))

The new output:

# Unit: seconds
#     expr       min        lq      mean    median        uq       max neval cld
#    readr  7.892527  8.016857 10.293371  9.527409  9.807145 16.222916     5  a 
#    fread  9.652377  9.696135  9.796438  9.712686  9.807830 10.113160     5  a 
#  iotools  5.900362  7.591847  7.438049  7.799729  7.845727  8.052579     5  a 
#      awk 14.440489 14.457329 14.637879 14.472836 14.666587 15.152156     5   b

So it appears iotools is both very fast and very consistent.
