Quicker way to read single column of CSV file
Problem description
I am trying to read a single column of a CSV file into R as quickly as possible. Compared with the standard methods, I am hoping to cut the time it takes to get the column into RAM by a factor of 10.
What is my motivation? I have two files: one called Main.csv, which is 300000 rows and 500 columns, and one called Second.csv, which is 300000 rows and 5 columns. If I system.time() the command read.csv("Second.csv"), it takes 2.2 seconds. Now if I use either of the two methods below to read the first column of Main.csv (which is 20% the size of Second.csv, since it is 1 column instead of 5), it takes over 40 seconds. That is the same amount of time it takes to read the whole 600-megabyte file, which is clearly unacceptable.
Method 1
colClasses <- rep('NULL', 500)
colClasses[1] <- NA
system.time(
  read.csv("Main.csv", colClasses = colClasses)
) # 40+ seconds, unacceptable
Method 2
read.table(pipe("cut -f1 Main.csv")) #40+ seconds, unacceptable
How can I reduce this time? I am hoping for an R solution.
I would suggest
scan(pipe("cut -f1 -d, Main.csv"))
This differs from the original proposal (read.table(pipe("cut -f1 Main.csv"))) in a couple of ways:
- since the file is comma-separated and cut assumes tab-separation by default, you need to specify -d, to select comma-separation
- scan() is much faster than read.table for simple/unstructured data reads.
According to the comments by the OP this takes about 4 rather than 40+ seconds.
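To make the two points above concrete, here is a minimal sketch on a tiny stand-in file. The file contents, the header row, and the integer first column are assumptions made for illustration and are not taken from the OP's Main.csv. It also assumes a Unix-like system with cut on the PATH.

```r
# Build a small comma-separated file standing in for Main.csv
# (hypothetical data; the real file has 300000 rows and 500 columns).
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:5, x = letters[1:5], y = (1:5) / 2)
write.csv(demo, tmp, row.names = FALSE, quote = FALSE)

# Without -d, cut splits on tabs, finds none, and passes whole lines through,
# so R still receives every column of every row.
whole_lines <- readLines(pipe(paste("cut -f1", shQuote(tmp))))

# With -d, only the first comma-separated field reaches R.
# skip = 1 drops the header row (assumed present);
# what = integer() assumes the first column holds integers.
first_col <- scan(pipe(paste("cut -f1 -d,", shQuote(tmp))),
                  what = integer(), skip = 1, quiet = TRUE)

print(whole_lines)
print(first_col)
```

If the first column holds text rather than numbers, what = "" (character) would be the choice to try instead. Note also that cut knows nothing about CSV quoting, so this approach is only safe when no field before the wanted column contains a quoted comma.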