Why is reading rows faster than reading columns?


Question

I am analysing a dataset with 200 rows and 1200 columns, stored in a .CSV file. To process it, I read the file with R's read.csv() function.

R takes ≈ 600 seconds to read this dataset. Later I had an idea: I transposed the data inside the .CSV file and read it again with read.csv(). I was amazed to see that it took only ≈ 20 seconds, roughly 30 times faster.

I verified this over the following iterations (first the original 200 * 1200 file, then the transposed 1200 * 200 file):

> system.time(dat <- read.csv(file = "data.csv", sep = ",", header = F))

   user  system elapsed 
 610.98    6.54  618.42 # 1st iteration
 568.27    5.83  574.47 # 2nd iteration
 521.13    4.73  525.97 # 3rd iteration
 618.31    3.11  621.98 # 4th iteration
 603.85    3.29  607.50 # 5th iteration

Reading 1200 rows * 200 columns (transposed):

> system.time(dat <- read.csv(file = "data_transposed.csv",
      sep = ",", header = F))

   user  system elapsed 
  17.23    0.73   17.97 # 1st iteration
  17.11    0.69   17.79 # 2nd iteration
  20.70    0.89   21.61 # 3rd iteration
  18.28    0.82   19.11 # 4th iteration
  18.37    1.61   20.01 # 5th iteration

In any dataset, observations go in the rows and the columns hold the variables being observed. Transposing changes this structure. Is it good practice to transpose the data for processing, even though it makes the data look odd?

I am wondering what makes R read the dataset so much faster when it is transposed. I am sure it is because the dimensions changed from 200 * 1200 to 1200 * 200 after the transpose operation. Why does R read the data faster when I transpose it?
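For what it is worth, here is a minimal sketch (assuming the file names above and all-numeric columns) of reading the transposed file and flipping it back in memory, so downstream code still sees 200 rows * 1200 columns:

## Read the transposed (long) file, then flip it back in memory.
## t() returns a matrix, so wrap it in as.data.frame() if a data frame is needed;
## this only makes sense when all columns share one type, otherwise t() coerces
## everything to character.
dat_t <- read.csv("data_transposed.csv", sep = ",", header = FALSE)
dat   <- as.data.frame(t(dat_t))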

I initially asked this question because RStudio was taking a long time to read and process a high-dimensional dataset (many more columns than rows: 200 rows, 1200 columns) with the built-in read.csv() function. Following the suggestions in the comments, I later experimented with read.csv2() and fread(); they all work well, but they are slow on my original dataset [200 rows * 1200 columns] and read the transposed dataset faster.
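For reference, a hedged sketch of the fread() alternative mentioned above (the exact arguments used in the experiment are not given in the question, so these are assumptions):

## data.table's fread guesses the separator and column types itself.
library(data.table)
dat_dt <- fread("data.csv", header = FALSE)

Note that read.csv2() is read.csv() with ';' as the field separator and ',' as the decimal mark, so it only applies when the file was written in that locale convention.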

I observed that the same holds for MS-Excel and LibreOffice Calc. I even tried opening the file in Sublime Text, and even for this text editor the transposed data was easier (faster) to read. I still cannot figure out why all these applications behave this way. They all struggle when the data has many more columns than rows.

So, to wrap up the whole story, I have only 3 questions:

  1. What kind of problem is this? Is it related to the operating system, or is it an application-level issue?
  2. Is it good practice to transpose data for processing?
  3. Why do R and/or other applications read my data faster when I transpose it?

My experiments perhaps helped me rediscover some 'already known' wisdom, but I could not find anything relevant on the internet. Kindly share such good programming/data-analysis practices.

Answer

Your question is basically: is reading a long dataset much faster than reading a wide one?

What I give here is not a final answer, but a new starting point.

For any performance-related issue, it is always better to profile than to guess. system.time is good, but it only tells you the total run time, not how the time is split internally. If you take a quick glance at the source code of read.table (read.csv is merely a wrapper around read.table), you will see it has three stages:

  1. call scan to read 5 rows of your data (I am not entirely sure about the purpose of this part);
  2. call scan to read your complete data. Basically this reads your data column by column into a list of character strings, where each column is a "record";
  3. type conversion, either implicitly by type.convert, or explicitly (if you have specified column classes) by, say, as.numeric, as.Date, etc. (a small colClasses sketch follows the next paragraph).

The first two stages are done at C level, while the final stage is done at R level with a for loop over all records.
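As a side note, here is a minimal sketch (assuming all 1200 columns of the original dataset are numeric) of supplying column classes up front, which replaces the implicit type.convert guessing in stage 3 with an explicit conversion:

## Assumed all-numeric data: declaring the column classes lets stage 3 use
## a direct as.numeric conversion instead of type.convert's guessing.
dat <- read.csv("data.csv", header = FALSE,
                colClasses = rep("numeric", 1200))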

A basic profiling tool is Rprof together with summaryRprof. The following is a very, very simple example.

## configure size
m <- 10000
n <- 100

## a very very simple example, where all data are numeric
x <- runif(m * n)

## long and wide .csv
write.csv(matrix(x, m, n), file = "long.csv", row.names = FALSE, quote = FALSE)
write.csv(matrix(x, n, m), file = "wide.csv", row.names = FALSE, quote = FALSE)

## profiling (sample stage)
Rprof("long.out")
long <- read.csv("long.csv")
Rprof(NULL)

Rprof("wide.out")
wide <- read.csv("wide.csv")
Rprof(NULL)

## profiling (report stage)
summaryRprof("long.out")[c(2, 4)]
summaryRprof("wide.out")[c(2, 4)]

The c(2, 4) extracts the "by.total" time for all R-level functions with enough samples, plus the total CPU time (which may be lower than the wall-clock time). The following is what I get on an Intel i5-2557M @ 1.1 GHz (turbo boost disabled), Sandy Bridge 2011.

## "long.csv"
#$by.total
#               total.time total.pct self.time self.pct
#"read.csv"            7.0       100       0.0        0
#"read.table"          7.0       100       0.0        0
#"scan"                6.3        90       6.3       90
#".External2"          0.7        10       0.7       10
#"type.convert"        0.7        10       0.0        0
#
#$sampling.time
#[1] 7

## "wide.csv"
#$by.total
#               total.time total.pct self.time self.pct
#"read.table"        25.86    100.00      0.06     0.23
#"read.csv"          25.86    100.00      0.00     0.00
#"scan"              23.22     89.79     23.22    89.79
#"type.convert"       2.22      8.58      0.38     1.47
#"match.arg"          1.20      4.64      0.46     1.78
#"eval"               0.66      2.55      0.12     0.46
#".External2"         0.64      2.47      0.64     2.47
#"parent.frame"       0.50      1.93      0.50     1.93
#".External"          0.30      1.16      0.30     1.16
#"formals"            0.08      0.31      0.04     0.15
#"make.names"         0.04      0.15      0.04     0.15
#"sys.function"       0.04      0.15      0.02     0.08
#"as.character"       0.02      0.08      0.02     0.08
#"c"                  0.02      0.08      0.02     0.08
#"lapply"             0.02      0.08      0.02     0.08
#"sys.parent"         0.02      0.08      0.02     0.08
#"sapply"             0.02      0.08      0.00     0.00
#
#$sampling.time
#[1] 25.86

So reading the long dataset takes 7 s of CPU time, while reading the wide dataset takes 25.86 s.

It might be confusing at first glance that more functions are reported for the wide case. In fact, both the long and the wide case execute the same set of functions, but the long case is faster, so many functions take less time than the sampling interval (0.02 s) and hence cannot be measured.
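If you want those short-lived functions to show up as well, Rprof lets you lower the sampling interval. This is just a sketch; very small intervals add overhead, and how small an interval is actually honoured is platform-dependent.

## Sample every 5 ms instead of the default 20 ms so that short-lived
## R-level calls in the fast (long) case are still caught.
Rprof("long_fine.out", interval = 0.005)
long <- read.csv("long.csv")
Rprof(NULL)
summaryRprof("long_fine.out")[c(2, 4)]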

But anyway, the run time is dominated by scan and type.convert (the implicit type conversion). For this example, we see that:

  • type conversion is not too costly even though it is done at R level; for both the long and the wide case it accounts for no more than 10% of the time;
  • scan is basically all that read.csv is working with, but unfortunately we are unable to split that time further between stage-1 and stage-2. Do not take it for granted that stage-1 will be very fast just because it only reads 5 rows; in debugging mode I actually found that stage-1 can take quite a long time (a minimal way to look at this yourself is sketched right after this list).
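A minimal way to peek at the two scan calls yourself, which is a workflow suggestion of mine rather than part of the original answer: step through read.table interactively and time each stage by hand.

## Step into read.table once; inside the browser you can watch the first
## scan call (5 rows) and the second scan call (full data) execute.
debugonce(utils::read.table)
wide <- read.csv("wide.csv")
## type `n` to advance line by line, `c` to continue, `Q` to quit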

So what should we do next?

  • It would be great if we could find a way to measure the time spent in stage-1 and stage-2 (one rough starting point is sketched below);
  • you may want to profile the more general case, where your dataset has a mixture of data classes.
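As one rough starting point for the first item (a sketch of mine, not part of the original answer): call scan directly on the two files, which roughly approximates the stage-2 cost without read.table's bookkeeping or the type conversion.

## Read everything as character fields, much like read.table's stage-2 does.
## skip = 1 drops the header row written by write.csv above.
system.time(raw_long <- scan("long.csv", what = character(),
                             sep = ",", skip = 1, quiet = TRUE))
system.time(raw_wide <- scan("wide.csv", what = character(),
                             sep = ",", skip = 1, quiet = TRUE))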
