read.csv is extremely slow in reading csv files with large numbers of columns


Question

I have a .csv file, example.csv, with 8000 columns x 40000 rows. The csv file has a string header for each column, and all fields contain integer values between 0 and 10. When I try to load this file with read.csv it turns out to be extremely slow. It is also very slow when I add the parameter nrows=100. Is there a way to accelerate read.csv, or some other function to use instead of read.csv, to load the file into memory as a matrix or data.frame?

Thanks in advance.

Recommended answer

If your CSV only contains integers, you should use scan instead of read.csv, since ?read.csv says:

 ‘read.table’ is not the right tool for reading large matrices,
 especially those with many columns: it is designed to read _data
 frames_ which may have columns of very different classes.  Use
 ‘scan’ instead for matrices.
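A minimal sketch of the scan route, using a small temporary file so it is self-contained; the file name and column count are illustrative, not from the question:

```r
# Create a small all-integer CSV with a header (stand-in for example.csv).
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6"), tmp)

# skip = 1 skips the header line; what = integer() tells scan() every
# field is an integer, so it does no per-column type guessing.
vals <- scan(tmp, what = integer(), sep = ",", skip = 1)

# scan() returns one flat vector; reshape it into a matrix.
# byrow = TRUE because the file stores values row by row.
m <- matrix(vals, ncol = 3, byrow = TRUE)
```

For the real file you would replace tmp with the path to example.csv and ncol = 3 with 8000.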

Since your file has a header, you will need skip=1, and it will probably be faster if you set what=integer(). If you must use read.csv and speed or memory consumption is a concern, setting the colClasses argument is a huge help.

