Quicker way to read single column of CSV file

Views: 172 | Published: 2017/2/24 19:53:02 | Tags: r performance optimization csv io


Problem description


I am trying to read a single column of a CSV file into R as quickly as possible. I am hoping to get the column into RAM about 10 times faster than the standard methods do.

What is my motivation? I have two files: Main.csv, which has 300000 rows and 500 columns, and Second.csv, which has 300000 rows and 5 columns. If I wrap read.csv("Second.csv") in system.time(), it takes 2.2 seconds. If I use either of the two methods below to read just the first column of Main.csv (which is 20% the size of Second.csv, since it is 1 column instead of 5), it takes over 40 seconds. That is as long as it takes to read the whole 600-megabyte file, which is clearly unacceptable.

Method 1

colClasses <- rep('NULL', 500)
colClasses[1] <- NA
system.time(
  read.csv("Main.csv", colClasses = colClasses)
) # 40+ seconds, unacceptable

Method 2

 read.table(pipe("cut -f1 Main.csv")) #40+ seconds, unacceptable

How to reduce this time? I am hoping for an R solution.

Solution

I would suggest

scan(pipe("cut -f1 -d, Main.csv"))

This differs from the original proposal (read.table(pipe("cut -f1 Main.csv"))) in two ways:

Since the file is comma-separated and cut assumes tab-separation by default, you need to pass -d, to specify comma-separation.
scan() is much faster than read.table for simple/unstructured data reads.

According to the comments by the OP this takes about 4 rather than 40+ seconds.
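
To make the suggestion concrete, here is a minimal, self-contained sketch of the same approach on a tiny stand-in file. The temporary file, its contents, and the names tmp, demo, cmd and first_col are assumptions made for illustration and are not from the original post. It also assumes a Unix-like system where cut is on the PATH, a file with no header row, and a numeric first column with no quoted fields containing commas.

```r
# Build a small comma-separated file to stand in for Main.csv (hypothetical data).
tmp <- tempfile(fileext = ".csv")
demo <- data.frame(a = c(1.5, 2.5, 3.5), b = c(10, 20, 30), c = c(7, 8, 9))
write.table(demo, tmp, sep = ",", row.names = FALSE, col.names = FALSE)

# cut keeps only field 1, splitting on "," instead of its default tab;
# scan() then parses the resulting one-value-per-line stream as numeric.
cmd <- paste("cut -f1 -d,", shQuote(tmp))
first_col <- scan(pipe(cmd), quiet = TRUE)

print(first_col)
unlink(tmp)
```

If the real file has a header row, scan() would likely need skip = 1, and a non-numeric column would need something like what = character(). Both are arguments of scan(), but whether they apply depends on what Main.csv actually contains.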
