read.csv faster than data.table::fread


Question

Across the web I read that I should use data.table and fread to load my data.

But when I run a benchmark, I get the following results:

Unit: milliseconds
expr       min        lq      mean    median        uq        max neval
test1  1.229782  1.280000  1.382249  1.366277  1.460483   1.580176    10
test3  1.294726  1.355139  1.765871  1.391576  1.542041   4.770357    10
test2 23.115503 23.345451 42.307979 25.492186 57.772522 125.941734    10

The code can be seen below.

library(microbenchmark)
library(magrittr)  # provides the %>% pipe used below

loadpath <- readRDS("paths.rds")

microbenchmark(
  test1 = read.csv(paste0(loadpath, "data.csv"), header = TRUE, sep = ";", stringsAsFactors = FALSE, colClasses = "character"),
  test2 = data.table::fread(paste0(loadpath, "data.csv"), sep = ";"),
  test3 = read.csv(paste0(loadpath, "data.csv")),
  times = 10
) %>%
  print(order = "min")

I understand that fread() should be faster than read.csv() because read.csv() first reads rows into memory as character and then tries to convert them to data types such as integer and factor. fread(), on the other hand, simply reads everything as character.

If this is true, shouldn't test2 be faster than test3?

Can someone explain to me why I do not achieve a speed-up, or at least the same speed, with test2 as with test1? :)
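One way to check the assumption about type conversion is to compare the column classes each reader actually returns. The sketch below uses a small made-up file (the column names `id`, `value`, and `label` are purely illustrative, not the asker's data). As far as I know, fread() detects column types by default rather than keeping everything as character, so the premise may not hold.

```r
# Write a tiny semicolon-separated file to a temporary location (illustrative data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;value;label", "1;0.5;a", "2;1.5;b"), tmp)

# Column classes inferred by read.csv with its defaults (as in test3).
base_classes <- sapply(read.csv(tmp, sep = ";"), class)
print(base_classes)

# Column classes when everything is forced to character (as in test1).
char_classes <- sapply(read.csv(tmp, sep = ";", colClasses = "character"), class)
print(char_classes)

# Column classes inferred by fread, only if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  fread_classes <- sapply(data.table::fread(tmp, sep = ";"), class)
  print(fread_classes)
}
```

If the printed classes for fread are not all "character", then test2 is doing type detection work that test1 skips, which is worth keeping in mind when comparing the two timings.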

Recommended answer

The significant performance advantage of data.table::fread becomes clear if you consider larger files. Here is a fully reproducible example.


  1. Let's generate a CSV file consisting of 10^5 rows and 100 columns

if (!file.exists("test.csv")) {
    set.seed(2017)
    df <- as.data.frame(matrix(runif(10^5 * 100), nrow = 10^5))
    write.csv(df, "test.csv", quote = FALSE)
}


  2. We run a microbenchmark analysis (note that this may take a couple of minutes depending on your hardware)

    library(microbenchmark)
    res <- microbenchmark(
        read.csv = read.csv("test.csv", header = TRUE, stringsAsFactors = FALSE, colClasses = "numeric"),
        fread = data.table::fread("test.csv", sep = ",", stringsAsFactors = FALSE, colClasses = "numeric"),
        times = 10)
    res
    #          Unit: milliseconds
    #     expr        min         lq       mean     median         uq        max
    # read.csv 17034.2886 17669.8653 19369.1286 18537.7057 20433.4933 23459.4308
    #    fread   287.1108   311.6304   432.8106   356.6992   460.6167   888.6531
    
    
    library(ggplot2)
    autoplot(res)
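To see how the gap depends on file size, the comparison can be repeated for a small and a larger file. The following is only a rough sketch: the helper `time_readers` and the chosen sizes are assumptions made for illustration, and a single `system.time` measurement is much noisier than the microbenchmark run above.

```r
# Time read.csv and fread once each on a generated file with n_rows rows.
# time_readers is a hypothetical helper, not part of any package.
time_readers <- function(n_rows, n_cols = 10) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))  # remove the temporary file when the function exits
  set.seed(2017)
  df <- as.data.frame(matrix(runif(n_rows * n_cols), nrow = n_rows))
  write.csv(df, path, row.names = FALSE)

  out <- c(rows = n_rows,
           read.csv = system.time(read.csv(path))[["elapsed"]],
           fread = NA_real_)
  # fread is timed only if data.table is installed; otherwise it stays NA.
  if (requireNamespace("data.table", quietly = TRUE)) {
    out[["fread"]] <- system.time(data.table::fread(path))[["elapsed"]]
  }
  out
}

# One row of timings (in seconds) per file size; sizes are illustrative.
timings <- do.call(rbind, lapply(c(100, 10000), time_readers))
print(timings)
```

For very small files, such as the one in the question, fixed per-call overhead can dominate the measurement, so larger sizes are needed before the difference between the two readers becomes visible.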
    

