Quicker way to read single column of CSV file


Question


I am trying to read a single column of a CSV file into R as quickly as possible. I am hoping to cut the time it takes to get the column into RAM by a factor of 10 compared with the standard methods.

What is my motivation? I have two files: Main.csv, which has 300000 rows and 500 columns, and Second.csv, which has 300000 rows and 5 columns. Timing read.csv("Second.csv") with system.time() gives 2.2 seconds. But if I use either of the two methods below to read just the first column of Main.csv (which should be about 20% the size of Second.csv, since it is 1 column instead of 5), it takes over 40 seconds. That is as long as it takes to read the whole 600-megabyte file, which is clearly unacceptable.

  • Method 1

    colClasses <- rep('NULL', 500)
    colClasses[1] <- NA
    system.time(
      read.csv("Main.csv", colClasses = colClasses)
    ) # 40+ seconds, unacceptable
    

  • Method 2

     read.table(pipe("cut -f1 Main.csv")) #40+ seconds, unacceptable
    

How to reduce this time? I am hoping for an R solution.

Answer

I would suggest

    scan(pipe("cut -f1 -d, Main.csv"))

This differs from the original proposal (read.table(pipe("cut -f1 Main.csv"))) in a couple of different ways:

  • since the file is comma-separated and cut assumes tab-separation by default, you need to pass -d, to make cut split on commas
  • scan() is much faster than read.table for simple/unstructured data reads.
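
To see why the -d, matters, here is a small sketch on a throwaway file. The file name and contents are made up for illustration, and it assumes a Unix-like system where cut is on the PATH. Without -d, cut finds no tab on a line and passes the whole line through unchanged, so nothing has been trimmed before R parses it.

```r
# Hypothetical three-column, headerless CSV written to a temp file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,10,100", "2,20,200", "3,30,300"), tmp)

# Default delimiter is tab: there is no tab on any line,
# so cut emits each full line untouched.
whole <- readLines(pipe(paste("cut -f1", shQuote(tmp))))

# With -d, only the first comma-separated field reaches R.
first <- scan(pipe(paste("cut -f1 -d,", shQuote(tmp))), quiet = TRUE)

print(whole)
print(first)
unlink(tmp)
```

Here whole still holds the complete lines, while first holds only the values from the first field.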

According to the comments by the OP this takes about 4 rather than 40+ seconds.
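
One caveat: scan() reads numbers by default and knows nothing about header rows, so the one-liner works as written only if the first column is numeric and the file has no header line. Below is a minimal sketch under assumptions not stated in the question: the file has a single header line, and no quoted field contains a comma. read_first_column is a hypothetical helper name, not part of R.

```r
# Hypothetical helper: read the first column of a comma-separated file via cut.
# Assumes a Unix-like system with `cut`, and no quoted fields containing commas
# (cut splits on every comma and does not understand CSV quoting).
read_first_column <- function(path, header = TRUE, what = double()) {
  cmd <- paste("cut -f1 -d,", shQuote(path))
  # skip = 1 drops the header line before scan() starts parsing values.
  scan(pipe(cmd), what = what, skip = if (header) 1 else 0, quiet = TRUE)
}

# Small made-up file to exercise the helper.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,x,y", "1,0.5,a", "2,1.5,b", "3,2.5,c"), tmp)

ids <- read_first_column(tmp)                          # first column as numbers
labels <- read_first_column(tmp, what = character())   # same column as text
unlink(tmp)
```

Wrapping the call as system.time(read_first_column("Main.csv")) gives a timing directly comparable with the two methods in the question.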
