read.csv is extremely slow in reading csv files with large numbers of columns
Question
I have a .csv file, example.csv, with 8000 columns x 40000 rows. The csv file has a string header for each column. All fields contain integer values between 0 and 10. When I try to load this file with read.csv, it turns out to be extremely slow. It is also very slow when I add the parameter nrow=100. Is there a way to speed up read.csv, or some other function I can use instead of read.csv to load the file into memory as a matrix or data.frame?
Thanks in advance.
Recommended answer
If your CSV only contains integers, you should use scan instead of read.csv, since ?read.csv says:
‘read.table’ is not the right tool for reading large matrices,
especially those with many columns: it is designed to read _data
frames_ which may have columns of very different classes. Use
‘scan’ instead for matrices.
Since your file has a header, you will need skip=1 (and sep="," because the fields are comma-separated), and it will probably be faster if you set what=integer(). If you must use read.csv and speed / memory consumption are a concern, setting the colClasses argument is a huge help.
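As a rough illustration of both suggestions, here is a minimal sketch. It assumes a small temporary file standing in for example.csv; the variable names (path, header, body, m, df) and the 4 x 5 toy dimensions are made up for the example and are not from the question.

```r
# Hypothetical small stand-in for example.csv (the real file is 8000 x 40000).
path <- tempfile(fileext = ".csv")
n_col <- 5L
n_row <- 4L
set.seed(1)
values <- matrix(sample(0:10, n_row * n_col, replace = TRUE), nrow = n_row)
colnames(values) <- paste0("col", seq_len(n_col))
write.csv(values, path, row.names = FALSE, quote = FALSE)

# Read only the first line to recover the column names.
header <- scan(path, what = character(), sep = ",", nlines = 1, quiet = TRUE)

# Read the body as one long integer vector, skipping the header line.
body <- scan(path, what = integer(), sep = ",", skip = 1, quiet = TRUE)

# scan() returns the values in file (row) order, so fill the matrix by row.
m <- matrix(body, ncol = length(header), byrow = TRUE,
            dimnames = list(NULL, header))

# Alternative when a data.frame is required: declare every column as integer
# up front so read.csv does not have to guess each column's type.
df <- read.csv(path, colClasses = "integer")
```

The single string passed to colClasses is recycled across all columns, which is convenient when every column has the same type. How much time either approach saves on the full-size file depends on the machine and the R version, so it is worth timing both with system.time() on the real data.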