How to efficiently use Rprof in R?
I would like to know whether it is possible to profile R code in a way similar to MATLAB's Profiler, that is, to find out which line numbers are especially slow.
What I have achieved so far is not satisfactory. I used Rprof to create a profile file. Using summaryRprof I get something like the following:
$by.self
               self.time self.pct total.time total.pct
[.data.frame        0.72     10.1       1.84      25.8
inherits            0.50      7.0       1.10      15.4
data.frame          0.48      6.7       4.86      68.3
unique.default      0.44      6.2       0.48       6.7
deparse             0.36      5.1       1.18      16.6
rbind               0.30      4.2       2.22      31.2
match               0.28      3.9       1.38      19.4
[<-.factor          0.28      3.9       0.56       7.9
levels              0.26      3.7       0.34       4.8
NextMethod          0.22      3.1       0.82      11.5
...
and
$by.total
             total.time total.pct self.time self.pct
data.frame         4.86      68.3      0.48      6.7
rbind              2.22      31.2      0.30      4.2
do.call            2.22      31.2      0.00      0.0
[                  1.98      27.8      0.16      2.2
[.data.frame       1.84      25.8      0.72     10.1
match              1.38      19.4      0.28      3.9
%in%               1.26      17.7      0.14      2.0
is.factor          1.20      16.9      0.10      1.4
deparse            1.18      16.6      0.36      5.1
...
To be honest, I cannot tell from this output where my bottlenecks are, because (a) I use data.frame pretty often and (b) I never use, for example, deparse. Furthermore, what is [ ?
So I tried Hadley Wickham's profr, but judging by the following graph it was not any more useful:
Is there a more convenient way to see which line numbers and particular function calls are slow?
Or, is there some literature that I should consult?
Any hints appreciated.
EDIT 1:
Based on Hadley's comment, I will paste the code of my script below, along with the base graphics version of the plot. Note, however, that my question is not about this specific script. It is just a random script that I recently wrote. I am looking for a general way to find bottlenecks and speed up R code.
The data (x) looks like this:
type     word  response    N Classification classN
Abstract ANGER bitter      1 3a             3a
Abstract ANGER control     1 1a             1a
Abstract ANGER father      1 3a             3a
Abstract ANGER flushed     1 3a             3a
Abstract ANGER fury        1 1c             1c
Abstract ANGER hat         1 3a             3a
Abstract ANGER help        1 3a             3a
Abstract ANGER mad        13 3a             3a
Abstract ANGER management  2 1a             1a
... until row 1700
The script (with short explanations) is this:
Rprof("profile1.out")

# A new dataset is produced with each line of x contained x$N times
y <- vector('list',length(x[,1]))
for (i in 1:length(x[,1])) {
  y[[i]] <- data.frame(rep(x[i,1],x[i,"N"]),rep(x[i,2],x[i,"N"]),rep(x[i,3],x[i,"N"]),rep(x[i,4],x[i,"N"]),rep(x[i,5],x[i,"N"]),rep(x[i,6],x[i,"N"]))
}
all <- do.call('rbind',y)
colnames(all) <- colnames(x)

# create a dataframe out of a word x class table
table_all <- table(all$word,all$classN)
dataf.all <- as.data.frame(table_all[,1:length(table_all[1,])])
dataf.all$words <- as.factor(rownames(dataf.all))
dataf.all$type <- "no"

# get type of the word.
words <- levels(dataf.all$words)
for (i in 1:length(words)) {
  dataf.all$type[i] <- as.character(all[pmatch(words[i],all$word),"type"])
}
dataf.all$type <- as.factor(dataf.all$type)
dataf.all$typeN <- as.numeric(dataf.all$type)

# aggregate response categories
dataf.all$c1 <- apply(dataf.all[,c("1a","1b","1c","1d","1e","1f")],1,sum)
dataf.all$c2 <- apply(dataf.all[,c("2a","2b","2c")],1,sum)
dataf.all$c3 <- apply(dataf.all[,c("3a","3b")],1,sum)

Rprof(NULL)

library(profr)
ggplot.profr(parse_rprof("profile1.out"))
Final data looks like this:
1a 1b 1c 1d 1e 1f 2a 2b 2c 3a 3b pa words    type     typeN c1 c2 c3 pa
 3  0  8  0  0  0  0  0  0 24  0  0 ANGER    Abstract     1 11  0 24  0
 6  0  4  0  1  0  0 11  0 13  0  0 ANXIETY  Abstract     1 11 11 13  0
 2 11  1  0  0  0  0  4  0 17  0  0 ATTITUDE Abstract     1 14  4 17  0
 9 18  0  0  0  0  0  0  0  0  8  0 BARREL   Concrete     2 27  0  8  0
 0  1 18  0  0  0  0  4  0 12  0  0 BELIEF   Abstract     1 19  4 12  0
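The base graph plot (running the script again today also changed the ggplot2 plot slightly, essentially only the labels):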
Alert readers of yesterday's breaking news (R 3.0.0 is finally out) may have noticed something interesting that is directly relevant to this question:
- Profiling via Rprof() now optionally records information at the statement level, not just the function level.
And indeed, this new feature answers my question and I will show how.
Let's say we want to compare whether vectorizing and pre-allocating are really better than good old for-loops and incremental building of data when calculating a summary statistic such as the mean. The relatively stupid code is the following:
# create big data frame:
n <- 1000
x <- data.frame(group = sample(letters[1:4], n, replace=TRUE), condition = sample(LETTERS[1:10], n, replace = TRUE), data = rnorm(n))
# reasonable operations:
marginal.means.1 <- aggregate(data ~ group + condition, data = x, FUN=mean)
# unreasonable operations:
marginal.means.2 <- marginal.means.1[NULL,]
row.counter <- 1
for (condition in levels(x$condition)) {
  for (group in levels(x$group)) {
    tmp.value <- 0
    tmp.length <- 0
    for (c in 1:nrow(x)) {
      if ((x[c,"group"] == group) & (x[c,"condition"] == condition)) {
        tmp.value <- tmp.value + x[c,"data"]
        tmp.length <- tmp.length + 1
      }
    }
    marginal.means.2[row.counter,"group"] <- group
    marginal.means.2[row.counter,"condition"] <- condition
    marginal.means.2[row.counter,"data"] <- tmp.value / tmp.length
    row.counter <- row.counter + 1
  }
}
# does it produce the same results?
all.equal(marginal.means.1, marginal.means.2)
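A note for readers on newer R versions: since R 4.0.0, data.frame() no longer converts character columns to factors by default, so levels(x$condition) returns NULL and the loops above would not iterate at all. A minimal sketch of the adjusted data frame, assuming such a version (the explicit stringsAsFactors = TRUE is the only change and is not part of the original script):

```r
# Assumption: R >= 4.0.0, where stringsAsFactors defaults to FALSE.
# Requesting factors explicitly restores the behaviour the loops rely on.
n <- 1000
x <- data.frame(group = sample(letters[1:4], n, replace = TRUE),
                condition = sample(LETTERS[1:10], n, replace = TRUE),
                data = rnorm(n),
                stringsAsFactors = TRUE)

# Both grouping columns are factors again, so levels() is not NULL
levels(x$group)
levels(x$condition)
```

On R 3.x the extra argument is harmless, because it matches the old default.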
To use this code with Rprof, we need to parse it. That is, it needs to be saved in a file and then called from there. Hence, I uploaded it to pastebin, but it works exactly the same with local files.
Now, we
- simply create a profile file and indicate that we want to save the line numbers,
- source the code with the incredible combination eval(parse(..., keep.source = TRUE)) (seemingly the infamous fortune(106) does not apply here, as I haven't found another way),
- stop the profiling and indicate that we want the output based on the line numbers.
The code is:
Rprof("profile1.out", line.profiling=TRUE)
eval(parse(file = "http://pastebin.com/download.php?i=KjdkSVZq", keep.source=TRUE))
Rprof(NULL)
summaryRprof("profile1.out", lines = "show")
Which gives:
$by.self
self.time self.pct total.time total.pct
download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11
<no location> 4.38 34.93 4.38 34.93
download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48
download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16
$by.total
total.time total.pct self.time self.pct
download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11
<no location> 4.38 34.93 4.38 34.93
download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48
download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16
$by.line
self.time self.pct total.time total.pct
<no location> 4.38 34.93 4.38 34.93
download.php?i=KjdkSVZq#6 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#16 0.06 0.48 0.06 0.48
download.php?i=KjdkSVZq#17 8.04 64.11 8.04 64.11
download.php?i=KjdkSVZq#18 0.02 0.16 0.02 0.16
download.php?i=KjdkSVZq#23 0.02 0.16 0.02 0.16
$sample.interval
[1] 0.02
$sampling.time
[1] 12.54
Checking the source code tells us that the problematic line (#17) is indeed the stupid if-statement in the for-loop. In comparison, calculating the same result with vectorized code (line #6) takes basically no time.
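The same workflow can be sketched for a local file. The example below is only an illustration under a few assumptions: the script contents and the function name slow_sum() are hypothetical, and source(..., keep.source = TRUE) is used in place of eval(parse(...)), since it also keeps the source references that line profiling needs.

```r
# Write a small test script to a temporary file. The script contents and the
# function name slow_sum() are hypothetical, chosen only for this sketch.
script <- tempfile(fileext = ".R")
writeLines(c(
  "slow_sum <- function(x) {",
  "  total <- 0",
  "  for (i in seq_len(nrow(x))) {",
  "    total <- total + x[i, \"data\"]",
  "  }",
  "  total",
  "}",
  "set.seed(1)",
  "x <- data.frame(data = rnorm(1e5))",
  "res.loop <- slow_sum(x)",
  "res.vec <- sum(x$data)"
), script)

# Start profiling and ask Rprof to record line numbers
profile.file <- tempfile(fileext = ".out")
Rprof(profile.file, line.profiling = TRUE)

# Run the script while keeping its source references
source(script, keep.source = TRUE)

# Stop profiling
Rprof(NULL)

# Summarise by line and pick the line with the largest self time
prof <- summaryRprof(profile.file, lines = "show")
by.line <- prof$by.line
by.line[which.max(by.line$self.time), ]
```

Row names in by.line have the form file#line, so the last expression points at the line where the profiler most often found the code. Because Rprof samples at fixed intervals, the exact timings differ from run to run.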
I haven't tried it with any graphical output, but I am already very impressed by what I got so far.
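For a quick graphical look, the $by.line table can be passed to base graphics. A minimal sketch, using the self.time values from the output above typed in by hand (the shortened row labels and the plot title are my own choices, not produced by summaryRprof):

```r
# self.time per line, copied from the $by.line output shown above.
# Row labels are shortened to the line number for readability.
by.line <- data.frame(
  self.time = c(4.38, 0.02, 0.06, 8.04, 0.02, 0.02),
  row.names = c("<no location>", "#6", "#16", "#17", "#18", "#23")
)

# Sort so that the slowest line ends up at the top of the horizontal bar chart
ord <- order(by.line$self.time)
mids <- barplot(by.line$self.time[ord],
                names.arg = rownames(by.line)[ord],
                horiz = TRUE, las = 1,
                xlab = "self.time (seconds)",
                main = "Self time by line")
```

With a real profile, the data frame would come directly from summaryRprof("profile1.out", lines = "show")$by.line.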