Faster way to read fixed-width files in R

Problem Description

I work with a lot of fixed width files (i.e., no separating character) that I need to read into R. So, there is usually a definition of the column width to parse the string into variables. I can use read.fwf to read in the data without a problem. However, for large files, this can take a long time. For a recent dataset, this took 800 seconds to read in a dataset with ~500,000 rows and 143 variables.

seer9 <- read.fwf("~/data/rawdata.txt",
  widths = cols,
  header = FALSE,
  buffersize = 250000,
  colClasses = "character",
  stringsAsFactors = FALSE)

fread in the data.table package in R is awesome for solving most data read problems, except it doesn't parse fixed width files. However, I can read each line in as a single character string (~500,000 rows, 1 column). This takes 3-5 seconds. (I love data.table.)

seer9 <- fread("~/data/rawdata.txt", colClasses = "character",
               sep = "\n", header = FALSE, verbose = TRUE)

There are a number of good posts on SO on how to parse text files. See JHoward's suggestion here, to create a matrix of start and end columns, and substr to parse the data. See GSee's suggestion here to use strsplit. I couldn't figure out how to make that work with this data. (Also, Michael Smith made some suggestions on the data.table mailing list involving sed that were beyond my ability to implement.) Now, using fread and substr() I can do the whole thing in about 25-30 seconds. Note that coercing to a data.table at the end takes a chunk of time (5 sec?).

end_col <- cumsum(cols)
start_col <- end_col - cols + 1
start_end <- cbind(start_col, end_col) # matrix of start and end positions
text <- lapply(seer9, function(x) {
        apply(start_end, 1, function(y) substr(x, y[1], y[2])) 
        })
dt <- data.table(text$V1)
setnames(dt, old = 1:ncol(dt), new = seervars)

What I am wondering is whether this can be improved any further? I know I am not the only one who has to read fixed width files, so if this could be made faster, it would make loading even larger files (with millions of rows) more tolerable. I tried using parallel with mclapply and data.table instead of lapply, but those didn't change anything. (Likely due to my inexperience in R.) I imagine that an Rcpp function could be written to do this really fast, but that is beyond my skill set. Also, I may not be using lapply and apply appropriately.
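For anyone curious, a minimal sketch of what such a parallel attempt might look like is below. It assumes a Unix-like system where parallel::mclapply can fork, an arbitrary choice of 4 cores, and reuses the start_end matrix and substr() parsing from above; it is illustrative rather than the exact code that was benchmarked.

library(parallel)
library(data.table)

n_cores  <- 4                                    # assumed core count
chunk_id <- cut(seq_len(nrow(seer9)), n_cores, labels = FALSE)
chunks   <- split(seer9$V1, chunk_id)            # one block of raw lines per worker

parsed <- mclapply(chunks, function(lines) {
  # each worker returns a character matrix: rows = lines, columns = fields
  apply(start_end, 1, function(y) substr(lines, y[1], y[2]))
}, mc.cores = n_cores)

dt <- data.table(do.call(rbind, parsed))         # recombine the per-chunk matrices
setnames(dt, seervars)

The overhead of splitting and recombining the chunks may be part of why this kind of approach did not help here.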

My data.table implementation (with magrittr chaining) takes the same time:

text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>% 
  data.table(.)

Can anyone make suggestions to improve the speed of this? Or is this about as good as it gets?

Here is code to create a similar data.table within R (rather than linking to actual data). It should have 331 characters, and 500,000 rows. There are spaces to simulate missing fields in the data, but this is NOT space delimited data. (I am reading raw SEER data, in case anyone is interested.) Also including column widths (cols) and variable names (seervars) in case this helps someone else. These are the actual column and variable definitions for SEER data.

seer9 <-
  data.table(rep((paste0(paste0(letters, 1000:1054, " ", collapse = ""), " ")),
                 500000))

cols = c(8,10,1,2,1,1,1,3,4,3,2,2,4,4,1,4,1,4,1,1,1,1,3,2,2,1,2,2,13,2,4,1,1,1,1,3,3,3,2,3,3,3,3,3,3,3,2,2,2,2,1,1,1,1,1,6,6,6,2,1,1,2,1,1,1,1,1,2,2,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,5,4,10,3,3,2,2,2,3,1,1,1,1,2,2,1,1,2,1,9,5,5,1,1,1,2,2,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,1,4,1,4,1,1,3,3,3,3,2,2,2,2)
seervars <- c("CASENUM", "REG", "MAR_STAT", "RACE", "ORIGIN", "NHIA", "SEX", "AGE_DX", "YR_BRTH", "PLC_BRTH", "SEQ_NUM", "DATE_mo", "DATE_yr", "SITEO2V", "LATERAL", "HISTO2V", "BEHO2V", "HISTO3V", "BEHO3V", "GRADE", "DX_CONF", "REPT_SRC", "EOD10_SZ", "EOD10_EX", "EOD10_PE", "EOD10_ND", "EOD10_PN", "EOD10_NE", "EOD13", "EOD2", "EOD4", "EODCODE", "TUMOR_1V", "TUMOR_2V", "TUMOR_3V", "CS_SIZE", "CS_EXT", "CS_NODE", "CS_METS", "CS_SSF1", "CS_SSF2", "CS_SSF3", "CS_SSF4", "CS_SSF5", "CS_SSF6", "CS_SSF25", "D_AJCC_T", "D_AJCC_N", "D_AJCC_M", "D_AJCC_S", "D_SSG77", "D_SSG00", "D_AJCC_F", "D_SSG77F", "D_SSG00F", "CSV_ORG", "CSV_DER", "CSV_CUR", "SURGPRIM", "SCOPE", "SURGOTH", "SURGNODE", "RECONST", "NO_SURG", "RADIATN", "RAD_BRN", "RAD_SURG", "SS_SURG", "SRPRIM02", "SCOPE02", "SRGOTH02", "REC_NO", "O_SITAGE", "O_SEQCON", "O_SEQLAT", "O_SURCON", "O_SITTYP", "H_BENIGN", "O_RPTSRC", "O_DFSITE", "O_LEUKDX", "O_SITBEH", "O_EODDT", "O_SITEOD", "O_SITMOR", "TYPEFUP", "AGE_REC", "SITERWHO", "ICDOTO9V", "ICDOT10V", "ICCC3WHO", "ICCC3XWHO", "BEHANAL", "HISTREC", "BRAINREC", "CS0204SCHEMA", "RAC_RECA", "RAC_RECY", "NHIAREC", "HST_STGA", "AJCC_STG", "AJ_3SEER", "SSG77", "SSG2000", "NUMPRIMS", "FIRSTPRM", "STCOUNTY", "ICD_5DIG", "CODKM", "STAT_REC", "IHS", "HIST_SSG_2000", "AYA_RECODE", "LYMPHOMA_RECODE", "DTH_CLASS", "O_DTH_CLASS", "EXTEVAL", "NODEEVAL", "METSEVAL", "INTPRIM", "ERSTATUS", "PRSTATUS", "CSSCHEMA", "CS_SSF8", "CS_SSF10", "CS_SSF11", "CS_SSF13", "CS_SSF15", "CS_SSF16", "VASINV", "SRV_TIME_MON", "SRV_TIME_MON_FLAG", "SRV_TIME_MON_PA", "SRV_TIME_MON_FLAG_PA", "INSREC_PUB", "DAJCC7T", "DAJCC7N", "DAJCC7M", "DAJCC7STG", "ADJTM_6VALUE", "ADJNM_6VALUE", "ADJM_6VALUE", "ADJAJCCSTG")

UPDATE: LaF did the entire read in just under 7 seconds from the raw .txt file. Maybe there is an even faster way, but I doubt anything could do appreciably better. Amazing package.

27 July 2015 Update: Just wanted to provide a small update to this. I used the new readr package, and I was able to read in the entire file in 5 seconds using readr::read_fwf.

seer9_readr <- read_fwf("path_to_data/COLRECT.TXT",
  col_positions = fwf_widths(cols))

Also, the updated stringi::stri_sub function is at least twice as fast as base::substr(). So, in the code above that uses fread to read the file (about 4 seconds), followed by apply to parse each line, the extraction of 143 variables took about 8 seconds with stringi::stri_sub compared to 19 seconds for base::substr. So, fread plus stri_sub is still only about 12 seconds to run. Not bad.

seer9 <-  fread("path_to_data/COLRECT.TXT",     
  colClasses = "character", 
  sep = "\n", 
  header = FALSE)
text <- seer9[ , apply(start_end, 1, function(y) substr(V1, y[1], y[2]))] %>% 
  data.table(.)
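For completeness, the stri_sub variant described above would look roughly like this; the only change from the block just shown is swapping base::substr for stringi::stri_sub (a sketch, not necessarily the exact code that was timed):

library(stringi)

# same per-column extraction, but with the faster stringi substring function
text <- seer9[ , apply(start_end, 1, function(y) stri_sub(V1, y[1], y[2]))] %>%
  data.table(.)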

10 Dec 2015 update:

Please also see the answer below by @MichaelChirico who has added some great benchmarks and the iotools package.
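(For the iotools route, my understanding is that the call looks roughly like the sketch below, reading and parsing the file in one pass with input.file and the dstrfw fixed-width formatter; treat the argument names as my assumption and see that answer for the exact, benchmarked code.)

library(iotools)

# read + parse the fixed-width file in a single pass (sketch)
seer9_iotools <- input.file("path_to_data/COLRECT.TXT",
  formatter = dstrfw,
  col_types = rep("character", length(cols)),
  widths = cols)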

Solution

You can use the LaF package, which was written to handle large fixed width files (also too large to fit into memory). To use it you first need to open the file using laf_open_fwf. You can then index the resulting object as you would a normal data frame to read the data you need. In the example below, I read the entire file, but you can also read specific columns and/or lines:

library(LaF)
laf <- laf_open_fwf("foo.dat", column_widths = cols, 
  column_types = rep("character", length(cols)),
  column_names = seervars)
seer9 <- laf[,]
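Because LaF reads lazily, you can also pull just a subset instead of materializing the whole file; for example (illustrative indices, not tied to the SEER layout):

seer9_first <- laf[1:1000, ]       # first 1,000 rows, all columns
seer9_cols  <- laf[ , c(1, 3, 7)]  # all rows, selected columns by position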

Your example using 5000 lines (instead of your 500,000) took 28 seconds using read.fwf and 1.6 seconds using LaF.

Addition: Your example using 50,000 lines (instead of your 500,000) took 258 seconds using read.fwf and 7 seconds using LaF on my machine.
