How to efficiently use Rprof in R?


Problem description

I would like to know whether it is possible to profile R code in a way similar to MATLAB's Profiler, that is, to find out which line numbers are especially slow.

What I have achieved so far is not satisfactory. I used Rprof to create a profile file. Using summaryRprof I get something like the following:

$by.self
                  self.time self.pct total.time total.pct
[.data.frame               0.72     10.1       1.84      25.8
inherits                   0.50      7.0       1.10      15.4
data.frame                 0.48      6.7       4.86      68.3
unique.default             0.44      6.2       0.48       6.7
deparse                    0.36      5.1       1.18      16.6
rbind                      0.30      4.2       2.22      31.2
match                      0.28      3.9       1.38      19.4
[<-.factor                 0.28      3.9       0.56       7.9
levels                     0.26      3.7       0.34       4.8
NextMethod                 0.22      3.1       0.82      11.5
...

and

$by.total
                      total.time total.pct self.time self.pct
data.frame                  4.86      68.3      0.48      6.7
rbind                       2.22      31.2      0.30      4.2
do.call                     2.22      31.2      0.00      0.0
[                           1.98      27.8      0.16      2.2
[.data.frame                1.84      25.8      0.72     10.1
match                       1.38      19.4      0.28      3.9
%in%                        1.26      17.7      0.14      2.0
is.factor                   1.20      16.9      0.10      1.4
deparse                     1.18      16.6      0.36      5.1
...

To be honest, from this output I don't get where my bottlenecks are because (a) I use data.frame pretty often and (b) I never use e.g., deparse. Furthermore, what is [?

So I tried Hadley Wickham's profr, but judging by the resulting graph (not reproduced here), it was no more useful.

Is there a more convenient way to see which line numbers and particular function calls are slow?
Or, is there some literature that I should consult?

Any hints appreciated.

EDIT 1:
Based on Hadley's comment, I will paste the code of my script below, along with the base graphics version of the plot. Note, however, that my question is not about this specific script. It is just a random script that I wrote recently. I am looking for a general way to find bottlenecks and speed up R code.

The data (x) looks like this:

type      word    response    N   Classification  classN
Abstract  ANGER   bitter      1   3a              3a
Abstract  ANGER   control     1   1a              1a
Abstract  ANGER   father      1   3a              3a
Abstract  ANGER   flushed     1   3a              3a
Abstract  ANGER   fury        1   1c              1c
Abstract  ANGER   hat         1   3a              3a
Abstract  ANGER   help        1   3a              3a
Abstract  ANGER   mad         13  3a              3a
Abstract  ANGER   management  2   1a              1a
... until row 1700

The script (with short explanations) is this:

Rprof("profile1.out")

# A new dataset is produced with each line of x contained x$N times 
y <- vector('list',length(x[,1]))
for (i in 1:length(x[,1])) {
  y[[i]] <- data.frame(rep(x[i,1],x[i,"N"]),rep(x[i,2],x[i,"N"]),rep(x[i,3],x[i,"N"]),rep(x[i,4],x[i,"N"]),rep(x[i,5],x[i,"N"]),rep(x[i,6],x[i,"N"]))
}
all <- do.call('rbind',y)
colnames(all) <- colnames(x)

# create a dataframe out of a word x class table
table_all <- table(all$word,all$classN)
dataf.all <- as.data.frame(table_all[,1:length(table_all[1,])])
dataf.all$words <- as.factor(rownames(dataf.all))
dataf.all$type <- "no"
# get type of the word.
words <- levels(dataf.all$words)
for (i in 1:length(words)) {
  dataf.all$type[i] <- as.character(all[pmatch(words[i],all$word),"type"])
}
dataf.all$type <- as.factor(dataf.all$type)
dataf.all$typeN <- as.numeric(dataf.all$type)

# aggregate response categories
dataf.all$c1 <- apply(dataf.all[,c("1a","1b","1c","1d","1e","1f")],1,sum)
dataf.all$c2 <- apply(dataf.all[,c("2a","2b","2c")],1,sum)
dataf.all$c3 <- apply(dataf.all[,c("3a","3b")],1,sum)

Rprof(NULL)

library(profr)
ggplot.profr(parse_rprof("profile1.out"))

Final data looks like this:

1a    1b  1c  1d  1e  1f  2a  2b  2c  3a  3b  pa  words   type    typeN   c1  c2  c3  pa
3 0   8   0   0   0   0   0   0   24  0   0   ANGER   Abstract    1   11  0   24  0
6 0   4   0   1   0   0   11  0   13  0   0   ANXIETY Abstract    1   11  11  13  0
2 11  1   0   0   0   0   4   0   17  0   0   ATTITUDE    Abstract    1   14  4   17  0
9 18  0   0   0   0   0   0   0   0   8   0   BARREL  Concrete    2   27  0   8   0
0 1   18  0   0   0   0   4   0   12  0   0   BELIEF  Abstract    1   19  4   12  0

The base graphics plot and the ggplot2 plot are not reproduced here. Running the script again today changed the ggplot2 graph a little (basically only the labels).

Solution

Alert readers of yesterday's breaking news (R 3.0.0 is finally out) may have noticed something interesting that is directly relevant to this question:

  • Profiling via Rprof() now optionally records information at the statement level, not just the function level.

And indeed, this new feature answers my question and I will show how.


Let's say we want to compare whether vectorizing and pre-allocating are really better than good old for-loops and incremental building of data when calculating a summary statistic such as the mean. The (relatively stupid) code is the following:

# create big data frame:
n <- 1000
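x <- data.frame(group = sample(letters[1:4], n, replace = TRUE), condition = sample(LETTERS[1:10], n, replace = TRUE), data = rnorm(n), stringsAsFactors = TRUE)  # factors are needed for levels() below; no longer the default since R 4.0.0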

# reasonable operations:
marginal.means.1 <- aggregate(data ~ group + condition, data = x, FUN=mean)

# unreasonable operations:
marginal.means.2 <- marginal.means.1[NULL,]

row.counter <- 1
for (condition in levels(x$condition)) {
  for (group in levels(x$group)) {  
    tmp.value <- 0
    tmp.length <- 0
    for (c in 1:nrow(x)) {
      if ((x[c,"group"] == group) & (x[c,"condition"] == condition)) {
        tmp.value <- tmp.value + x[c,"data"]
        tmp.length <- tmp.length + 1
      }
    }
    marginal.means.2[row.counter,"group"] <- group 
    marginal.means.2[row.counter,"condition"] <- condition
    marginal.means.2[row.counter,"data"] <- tmp.value / tmp.length
    row.counter <- row.counter + 1
  }
}

# does it produce the same results?
all.equal(marginal.means.1, marginal.means.2)

To use this code with Rprof, we need to parse it. That is, it needs to be saved in a file and then called from there. Hence, I uploaded it to pastebin, but it works exactly the same with local files.

Now, we

  • simply create a profile file and indicate that we want to save the line number,
  • source the code with the incredible combination eval(parse(..., keep.source = TRUE)) (seemingly the infamous fortune(106) does not apply here, as I haven't found another way)
  • stop the profiling and indicate that we want the output based on the line numbers.

The code is:

Rprof("profile1.out", line.profiling=TRUE)
eval(parse(file = "http://pastebin.com/download.php?i=KjdkSVZq", keep.source=TRUE))
Rprof(NULL)

summaryRprof("profile1.out", lines = "show")

Which gives:

$by.self
                           self.time self.pct total.time total.pct
download.php?i=KjdkSVZq#17      8.04    64.11       8.04     64.11
<no location>                   4.38    34.93       4.38     34.93
download.php?i=KjdkSVZq#16      0.06     0.48       0.06      0.48
download.php?i=KjdkSVZq#18      0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#23      0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#6       0.02     0.16       0.02      0.16

$by.total
                           total.time total.pct self.time self.pct
download.php?i=KjdkSVZq#17       8.04     64.11      8.04    64.11
<no location>                    4.38     34.93      4.38    34.93
download.php?i=KjdkSVZq#16       0.06      0.48      0.06     0.48
download.php?i=KjdkSVZq#18       0.02      0.16      0.02     0.16
download.php?i=KjdkSVZq#23       0.02      0.16      0.02     0.16
download.php?i=KjdkSVZq#6        0.02      0.16      0.02     0.16

$by.line
                           self.time self.pct total.time total.pct
<no location>                   4.38    34.93       4.38     34.93
download.php?i=KjdkSVZq#6       0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#16      0.06     0.48       0.06      0.48
download.php?i=KjdkSVZq#17      8.04    64.11       8.04     64.11
download.php?i=KjdkSVZq#18      0.02     0.16       0.02      0.16
download.php?i=KjdkSVZq#23      0.02     0.16       0.02      0.16

$sample.interval
[1] 0.02

$sampling.time
[1] 12.54

Checking the source code tells us that the problematic line (#17) is indeed the stupid if-statement in the for-loop. By comparison, calculating the same result with vectorized code (line #6) takes basically no time.
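Looking up each reported line by hand gets tedious for longer scripts. As a rough sketch, the row labels of the by.line table can be split at the "#" and matched against the source text. The helper name annotate_lines and the file name script.R are made up for illustration. The timings are copied from the output above, and only the two source lines of interest are filled in.

```r
# Hypothetical helper: attach the source text to each row of a by.line table.
annotate_lines <- function(by_line, src_lines) {
  labels <- rownames(by_line)
  # Labels look like "script.R#17"; "<no location>" carries no line number.
  has_line <- grepl("#[0-9]+$", labels)
  line_no <- rep(NA_integer_, length(labels))
  line_no[has_line] <- as.integer(sub("^.*#", "", labels[has_line]))
  # Indexing with NA yields NA, so rows without a location get no source text.
  by_line$source <- trimws(src_lines[line_no])
  # Slowest lines first.
  by_line[order(-by_line$self.time), ]
}

# Illustrative input: three rows taken from the $by.line output above.
by_line <- data.frame(
  self.time = c(4.38, 0.02, 8.04),
  row.names = c("<no location>", "script.R#6", "script.R#17")
)

# In practice this would be readLines("script.R"); here only lines 6 and 17 are set.
src_lines <- character(17)
src_lines[6]  <- "marginal.means.1 <- aggregate(data ~ group + condition, data = x, FUN=mean)"
src_lines[17] <- "      if ((x[c,\"group\"] == group) & (x[c,\"condition\"] == condition)) {"

annotated <- annotate_lines(by_line, src_lines)
print(annotated)
```

With a real profile, by_line would be summaryRprof(..., lines = "show")$by.line and src_lines the result of readLines() on the profiled file.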

I haven't tried it with any graphical output, but I am already very impressed by what I got so far.
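For a local file, the same workflow can be sketched as follows. This is a minimal example under a few assumptions: the script is a small made-up workload written to a temporary file, and source(..., keep.source = TRUE) is used in place of eval(parse(...)), on the understanding that it also keeps the source references that line profiling relies on. Exact timings vary from run to run, so no output is shown.

```r
# Write a small, deliberately slow script to a temporary file so that
# every statement has a file name and line number attached.
script <- tempfile(fileext = ".R")
writeLines(c(
  'x <- data.frame(v = rnorm(3000))',
  'total <- 0',
  'for (i in seq_len(nrow(x))) {',
  '  for (k in 1:20) total <- total + x[i, "v"]',   # slow element-wise access
  '}',
  'fast <- sum(x$v) * 20'                           # vectorized equivalent
), script)

# Start profiling with line information, run the script, stop profiling.
prof_file <- tempfile(fileext = ".out")
env <- new.env()
Rprof(prof_file, line.profiling = TRUE, interval = 0.01)
source(script, local = env, keep.source = TRUE)
Rprof(NULL)

# Summarize by line; row labels have the form "<file name>#<line number>".
res <- summaryRprof(prof_file, lines = "show")
print(res$by.line)
```

The row for line 4 of the temporary script should dominate the table, while the vectorized line 6 should barely register, if it appears at all.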
