Quickly reading very large tables as dataframes in R
Problem description
I have very large tables (30 million rows) that I would like to load as dataframes in R. `read.table()` has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.
I know that reading in a table as a list using `scan()` can be quite fast, e.g.:
```r
datalist <- scan('myfile', sep='\t',
                 list(url='', popularity=0, mintime=0, maxtime=0))
```
But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:
```r
df <- as.data.frame(scan('myfile', sep='\t',
                         list(url='', popularity=0, mintime=0, maxtime=0)))
```
Is there a better way of doing this? Or possibly a completely different approach to the problem?
An update, several years later
This answer is old, and R has moved on. Tweaking `read.table` to run a bit faster has precious little benefit. Your options are:
- Using `fread` in `data.table` for importing data from CSV/tab-delimited files directly into R. See mnel's answer.
- Using `read_table` in `readr` (on CRAN from April 2015). This works much like `fread` above. The README in the link explains the difference between the two functions (`readr` currently claims to be "1.5-2x slower" than `data.table::fread`).
- `read.csv.raw` from `iotools` provides a third option for quickly reading CSV files.
- Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) `read.csv.sql` in the `sqldf` package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the `RODBC` package, and the reverse-depends section of the `DBI` package page. `MonetDB.R` gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its `monetdb.read.csv` function. `dplyr` allows you to work directly with data stored in several types of database.
- Storing data in binary formats can also be useful for improving performance. Use `saveRDS`/`readRDS` (see below), or the `h5` or `rhdf5` packages for HDF5 format.
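As a minimal sketch of the `fread` route, here is the layout from the question read from a throwaway temp file (the file path and column names are stand-ins for the real data; assumes the `data.table` package is installed):

```r
library(data.table)

# Stand-in for 'myfile': a small headerless tab-delimited file
tmp <- tempfile()
writeLines(c("a\t1\t10\t20", "b\t2\t30\t40"), tmp)

# fread detects column types automatically; col.names supplies names
# since the file has no header row
dt <- fread(tmp, sep = "\t", header = FALSE,
            col.names = c("url", "popularity", "mintime", "maxtime"))
```

`fread` returns a `data.table`, which is also a `data.frame`, so it can usually be dropped into existing code unchanged.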
The original answer
There are a couple of simple things to try, whether you use `read.table` or `scan`.

- Set `nrows` = the number of records in your data (`nmax` in `scan`).
- Make sure that `comment.char=""` to turn off interpretation of comments.
- Explicitly define the classes of each column using `colClasses` in `read.table`.
- Setting `multi.line=FALSE` may also improve performance in `scan`.
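Put together, those tips look like this for `read.table` (a sketch assuming a headerless tab-delimited file with known column types, mirroring the layout in the question):

```r
# Stand-in for the real file: two rows, four tab-separated columns
tmp <- tempfile()
writeLines(c("a\t1\t10\t20", "b\t2\t30\t40"), tmp)

df <- read.table(tmp, sep = "\t",
                 nrows = 2,            # known row count: avoids re-allocation
                 comment.char = "",    # disable comment scanning
                 colClasses = c("character", "integer", "integer", "integer"),
                 col.names = c("url", "popularity", "mintime", "maxtime"))
```

With `colClasses` given, `read.table` skips the type-guessing pass over the columns, which is one of the main costs on large files.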
If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of `read.table` based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with `saveRDS`, so that next time you can retrieve it faster with `readRDS`.