Reason behind speed of fread in data.table package in R

Question

I am amazed by the speed of the fread function in data.table on large data files, but how does it manage to read so fast? What are the basic implementation differences between fread and read.csv?

Recommended answer

I assume we are comparing to read.csv with all known advice applied, such as setting colClasses, nrows, etc. read.csv(filename) without any other arguments is slow mainly because it first reads everything into memory as if it were character and then attempts to coerce that to integer or numeric as a second step.
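To make the baseline concrete, here is a minimal sketch of the tuned read.csv call described above. The temporary file, its column names, and the row count are made up for illustration; only the colClasses and nrows arguments come from the text.

```r
# Hypothetical example data written to a temporary file.
path <- tempfile(fileext = ".csv")
n <- 1000L
df <- data.frame(id = seq_len(n),
                 x  = seq_len(n) / 10,
                 g  = rep(c("a", "b"), length.out = n))
write.csv(df, path, row.names = FALSE)

# Untuned call: read.csv has to work out the column types itself.
untuned <- read.csv(path)

# Tuned call: declare the column types and the number of rows up front.
tuned <- read.csv(path,
                  colClasses = c("integer", "numeric", "character"),
                  nrows = n)
```

Both calls return the same data here; the tuned call simply spares read.csv the type-guessing work, which is the fair baseline to compare fread against.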

So, comparing fread to read.csv(filename, colClasses=, nrows=, etc) ...

They are both written in C, so it's not that.

There isn't one reason in particular, but essentially, fread memory-maps the file and then iterates through it using pointers, whereas read.csv reads the file into a buffer via a connection.
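The connection side of that contrast can be illustrated in plain R. The sketch below is not how read.csv is implemented internally; it only shows the general pattern of pulling a file through a connection one chunk at a time. The helper name count_lines_chunked and the chunk size are hypothetical choices for this example, and memory mapping is not shown.

```r
# Hypothetical helper: count the lines of a file by reading it
# through a connection in fixed-size chunks.
count_lines_chunked <- function(path, chunk = 100L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    lines <- readLines(con, n = chunk)  # continues from the current position
    if (length(lines) == 0L) break      # nothing left to read
    total <- total + length(lines)
  }
  total
}

# Made-up input: one header line plus 250 data lines.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b", sprintf("%d,%d", 1:250, 251:500)), path)
n_lines <- count_lines_chunked(path)
```

Every chunk is copied into R's memory before anything can be parsed. A memory-mapped reader instead lets the operating system page the file in and walks over the bytes in place.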

If you run fread with verbose=TRUE, it will tell you how it works and report the time spent in each of the steps. For example, notice that it skips straight to the middle and the end of the file to make a much better guess of the column types (although in this case the top 5 rows were enough).

> fread("test.csv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.486 GB
File is opened and mapped ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 6 columns
First row with 6 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 10000001
Subtracted 1 for last eol and any trailing empty lines, leaving 10000000 data rows
Type codes (   first 5 rows): 113431
Type codes (+ middle 5 rows): 113431
Type codes (+   last 5 rows): 113431
Type codes: 113431 (after applying colClasses and integer64)
Type codes: 113431 (after applying drop or select (if supplied)
Allocating 6 column slots (6 - 0 dropped)
Read 10000000 rows and 6 (of 6) columns from 0.486 GB file in 00:00:44
  13.420s ( 31%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   3.210s (  7%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   1.310s (  3%) Allocation of 10000000x6 result (xMB) in RAM
  25.580s ( 59%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.040s (  0%) Changing na.strings to NA
  43.560s        Total

NB: these timings are from my very slow netbook with no SSD. Both the absolute and relative times of each step will vary widely from machine to machine. For example, if you rerun fread a second time, you may notice that the time to mmap is much less because your OS has cached the file from the previous run.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            20
Model:                 2
Stepping:              0
CPU MHz:               800.000         # i.e. my slow netbook
BogoMIPS:              1995.01
Virtualisation:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
NUMA node0 CPU(s):     0,1
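The type-detection step in the verbose output samples rows from the top, middle and end of the file. The base R sketch below shows why that helps. It is not fread's actual algorithm: the helper guess_types, its arguments, and the sample data are all hypothetical, and it uses type.convert for the guessing.

```r
# Hypothetical helper: guess column types from k rows at the top,
# the middle and the end of a set of delimited text lines.
guess_types <- function(rows, sep = ",", k = 5L) {
  n <- length(rows)
  mid <- max(1L, n %/% 2L - k %/% 2L)
  idx <- unique(c(seq_len(min(k, n)),
                  seq(mid, min(n, mid + k - 1L)),
                  seq(max(1L, n - k + 1L), n)))
  fields <- strsplit(rows[idx], sep, fixed = TRUE)
  cols <- do.call(rbind, fields)  # character matrix, one row per sampled line
  apply(cols, 2, function(v) class(type.convert(v, as.is = TRUE)))
}

# Made-up data: the first column looks like integers until the very last row.
rows <- c(sprintf("%d,%.1f,x", 1:20, 1:20 / 2), "21,10.5,y", "22.5,11.0,z")

top_only <- guess_types(rows[1:5])  # sees only the first five rows
sampled  <- guess_types(rows)       # also sees the middle and the end
```

Looking only at the top rows suggests an integer first column, while the sample that includes the end of the data finds the value 22.5 and picks numeric. Guessing right the first time avoids re-reading or coercing a column later, which is what the "type bumps" lines in the timing report refer to.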
