Quickly reading very large tables as dataframes in R


Problem description

I have very large tables (30 million rows) that I would like to load as data frames in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.

I know that reading in a table as a list using scan() can be quite fast, e.g.:

datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))

But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:

df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))

Is there a better way of doing this? Or quite possibly a completely different approach to the problem?

Solution

An update, several years later

This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:

  1. Using fread in data.table for importing data from CSV/tab-delimited files directly into R. See mnel's answer. (Options 1, 2, and 4 are sketched in the code after this list.)

  2. Using read_table in readr (on CRAN from April 2015). This works much like fread above. The README in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).

  3. read.csv.raw from iotools provides a third option for quickly reading CSV files.

  4. Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.

  5. Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), or the h5 or rhdf5 packages for HDF5 format.
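A minimal sketch of options 1, 2, and 4 above, using the layout from the question (tab-separated, no header; columns url, popularity, mintime, maxtime). The file name and column names are the question's placeholders, and the arguments shown are one reasonable way to call these packages, not the only one:

# Option 1: data.table::fread, with the column types supplied up front
library(data.table)
dt <- fread("myfile", sep = "\t", header = FALSE,
            colClasses = c("character", "numeric", "numeric", "numeric"),
            col.names = c("url", "popularity", "mintime", "maxtime"))

# Option 2: readr. For a tab-delimited file, read_tsv is the natural entry
# point (read_table targets whitespace-aligned columns); the compact
# col_types string "cddd" means character, double, double, double.
library(readr)
tbl <- read_tsv("myfile",
                col_names = c("url", "popularity", "mintime", "maxtime"),
                col_types = "cddd")

# Option 4: sqldf::read.csv.sql stages the file in a temporary SQLite
# database and pulls the query result into R
library(sqldf)
df <- read.csv.sql("myfile", sql = "select * from file",
                   header = FALSE, sep = "\t", dbname = tempfile())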


The original answer

There are a couple of simple things to try, whether you use read.table or scan; a combined example follows the list.

  1. Set nrows = the number of records in your data (nmax in scan).

  2. Make sure that comment.char="" to turn off interpretation of comments.

  3. Explicitly define the classes of each column using colClasses in read.table.

  4. Setting multi.line=FALSE may also improve performance in scan.
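Putting those tips together for the question's file (a sketch; the 30-million row count and the column layout are taken from the question):

df <- read.table("myfile", sep = "\t", header = FALSE,
                 col.names = c("url", "popularity", "mintime", "maxtime"),
                 colClasses = c("character", "numeric", "numeric", "numeric"), # tip 3
                 nrows = 30000000,   # tip 1: known record count
                 comment.char = "")  # tip 2: no comment scanning

# The scan() version: nmax plays the role of nrows, and multi.line = FALSE
# (tip 4) promises that every record sits on a single line
datalist <- scan("myfile", sep = "\t",
                 list(url = "", popularity = 0, mintime = 0, maxtime = 0),
                 nmax = 30000000, multi.line = FALSE, comment.char = "")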

If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.

The other alternative is filtering your data before you read it into R.

Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS; next time you can retrieve it faster with readRDS.
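A sketch of that round trip (the .rds file name is a placeholder):

saveRDS(df, "myfile.rds")    # one-off: cache the parsed data frame in binary form
df <- readRDS("myfile.rds")  # every later session: fast binary read, no parsing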
