Quickly reading very large tables as dataframes


Question

I have very large tables (30 million rows) that I would like to load as dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.

I know that reading in a table as a list using scan() can be quite fast, e.g.:

datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))

But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:

df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))

Is there a better way of doing this? Or quite possibly a completely different approach to the problem?

Answer

An update, several years later

This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:


1. Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer (a minimal usage sketch also follows this list).

2. Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).

3. read.csv.raw from iotools provides a third option for quickly reading CSV files.

4. Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.

5. Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
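As a quick illustration of option 1, here is a minimal fread sketch using the file layout from the question (the column classes are assumptions taken from the question; fread can usually infer them on its own):

library(data.table)
# a minimal sketch, assuming the tab-separated, headerless layout from the question
dt <- fread('myfile', sep='\t', header=FALSE,
            col.names=c('url', 'popularity', 'mintime', 'maxtime'),
            colClasses=c('character', 'numeric', 'numeric', 'numeric'))
df <- as.data.frame(dt)  # fread returns a data.table; convert only if a plain data frame is required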

The original answer

There are a couple of simple things to try, whether you use read.table or scan.


1. Set nrows = the number of records in your data (nmax in scan).

2. Make sure that comment.char = "" to turn off interpretation of comments.

3. Explicitly define the classes of each column using colClasses in read.table.

4. Setting multi.line=FALSE may also improve performance in scan. (A sketch combining these tips follows this list.)
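A sketch combining these tips, assuming the file layout from the question (the nrows value and column classes are assumptions):

# read.table with tips 1-3 applied; all values are assumptions from the question
df <- read.table('myfile', sep='\t', header=FALSE,
                 nrows=30000000,   # tip 1: the number of records, if known in advance
                 comment.char='',  # tip 2: turn off interpretation of comments
                 colClasses=c('character', 'numeric', 'numeric', 'numeric'))  # tip 3
# the scan() equivalent would use nmax=30000000 and multi.line=FALSE (tip 4)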

If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.

The other alternative is filtering your data before you read it into R.
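For instance, a hypothetical sketch that pipes the file through a shell command so R only parses the matching rows (the grep pattern is made up for illustration):

# hypothetical: let grep discard unwanted rows before R parses anything
con <- pipe("grep 'example.com' myfile")
df <- read.table(con, sep='\t',
                 colClasses=c('character', 'numeric', 'numeric', 'numeric'))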

Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save/saveRDS, then next time you can retrieve it faster with load/readRDS.
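A minimal sketch of that read-once, load-fast pattern (the .rds file name is an assumption):

# one-time import, then save the data frame as a binary blob
df <- read.table('myfile', sep='\t',
                 colClasses=c('character', 'numeric', 'numeric', 'numeric'))
saveRDS(df, 'myfile.rds')

# in later sessions, reload in a fraction of the time it takes to re-parse the text
df <- readRDS('myfile.rds')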

