Digging into R profiling information

Question

I am trying to optimize a bit of code, and am puzzled about information from summaryRprof(). In particular, it looks like a number of calls are made to external C programs, but I'm not able to pin down which C program, from which R function. I am planning to resolve this through a bunch of slicing and dicing of the code, but wondered if I am overlooking some better way to interpret the profiling data.

The highest-consuming function is .Call, which is apparently a generic description for calls to C code; the next leading functions appear to be assignment operations:

$by.self
                             self.time self.pct total.time total.pct
".Call"                        2281.0    54.40     2312.0     55.14
"[.data.frame"                  145.0     3.46      218.5      5.21
"initialize"                    123.5     2.95      217.5      5.19
"$<-.data.frame"                121.5     2.90      121.5      2.90
"as.vector"                     110.5     2.64      416.0      9.92

I decided to focus on .Call to see how this arises. I looked through the profiling file to find the entries with .Call in the call stack, and the following are the top entries in those stacks (by number of appearances):

13640 "eval"
11252 "["
7044 "standardGeneric"
4691 "<Anonymous>"
4658 "tryCatch"
4654 "tryCatchList"
4652 "tryCatchOne"
4648 "doTryCatch"

This list is as clear as mud: I have <Anonymous> and standardGeneric in there.
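
For reference, counts like those above can be tallied from the raw profiling file with something along these lines (a sketch; "profile.out" is whatever file Rprof() wrote, and the tokens keep their surrounding quotes):

raw_lines <- readLines("profile.out")[-1]          # drop the sample.interval header
stacks    <- strsplit(raw_lines, " ")              # one character vector per sample
keep      <- vapply(stacks, function(s) '".Call"' %in% s, logical(1))
sort(table(unlist(stacks[keep])), decreasing = TRUE)[1:8]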

I believe this is due to calls to functions in the Matrix package, but that's because I'm looking at the code and that package appears to be the only possible source of C code. However, a lot of different functions from Matrix are called in this package, and it seems very difficult to determine which function is consuming this time.

So, my question is pretty basic: is there some other way of deciphering and attributing these calls (e.g. .Call, <Anonymous>, etc.)? The call graph for this code is rather tricky to render, given the number of functions involved.

The fallback tactics I see are to either (1) comment out bits of code (and hack around to make the code work with this) to see where the time consumption occurs, or to (2) wrap certain operations inside of other functions and see when those functions appear on the call stack. The latter is inelegant, but it seems like it's the best way to add a tag to the call stack. The former is unpleasant because it takes quite some time to run the code, and iteratively uncommenting code and rerunning is unpleasant.
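
For tactic (2), the wrapper only needs to put a recognizable name on the stack; a minimal sketch (tag_subset_step is a made-up name):

tag_subset_step <- function(expr) expr   # lazy evaluation: expr is forced inside this frame
M <- matrix(rnorm(1e6), nrow = 1000)
y <- tag_subset_step(M[1:10, ])          # shows up as "tag_subset_step" in Rprof samples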

Answer

It seems that the short answer is "No" and the long answer is "Yes, but you're not going to enjoy this." Even answering this question is going to take some time (so stick around, I may be updating it).

There are several basic things to get one's head around when working with profiling in R:

First, there are many different ways to think about profiling. It is quite typical to think in terms of a call stack. At any given instant, this is the sequence of function calls that are active, essentially nested within each other (subroutines, if you will). This is quite useful for understanding the state of evaluations, where functions will return, and lots of other things that are important for seeing things as the computer / interpreter / OS may see them. Rprof does call stack profiling.

Second, a different perspective is that I've got a bunch of code and a particular call is taking a long time: which line in my code caused that call to be made? This is line profiling. R doesn't have line profiling, as far as I can tell. This is in contrast with Python and Matlab, which both have line profilers.

Third, the map from lines to calls is surjective, but it is not bijective: given a particular call stack, we cannot guarantee that we can map it back to the code. In fact, call stack analyses often summarize the calls completely out of context of the whole stack (i.e. cumulative times are reported no matter where that call sat in the various stacks in which it occurred).

Fourth, even though we have these constraints, we can put on our statistical hats and analyze the call stack data carefully and see what we can make of it. The call stack information is data and we like data, don't we? :)

Just a quick intro to a call stack. Let's just assume that our call stack looked like this:

"C" "B" "A"

This means that function A called B, which then called C (the order is reversed), and the call stack is 3 levels deep. In my code, the call stack gets as deep as 41 levels. Since the stacks can be so deep and are presented in reverse order, this is easier for software to interpret than for a human. Naturally, we begin cleaning and transforming this data. :)

Now, our data really comes along looking like:

".Call" "subCsp_cols" "[" "standardGeneric" "[" "eval" "eval" "callGeneric"
"[" "standardGeneric" "[" "myFunc2" "myFunc1" "eval" "eval" "doTryCatch"
"tryCatchOne" "tryCatchList" "tryCatch" "FUN" "lapply" "mclapply"
"<Anonymous>" "%dopar%"

Miserable, isn't it? It even has duplicates of things like eval, some guy called <Anonymous> - probably some darn hacker. (Anonymous is legion, by the way. :-))

The first step in transforming this into something useful was to split each line of Rprof() output and reverse the entries (via strsplit and rev). The first 12 entries (last 12 if you look at the raw call stack, rather than the post-rev version) were the same for every line (of which there were about 12000, the sampling interval was 0.5 seconds - so about 100 minutes of profiling), and these can be discarded.
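
A sketch of that cleaning pass (as before, the tokens keep their surrounding quotes, e.g. '"eval"'):

raw_lines <- readLines("profile.out")[-1]              # drop the sample.interval header
stacks    <- lapply(strsplit(raw_lines, " "), rev)     # outermost call now comes first
stacks    <- lapply(stacks, function(s) s[-seq_len(12)])  # discard the shared 12-frame prefix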

Remember, we're still interested in knowing which line(s) led to .Call, which took so much time. Before we get to that question, we put on our statistical caps: the profiling reports, e.g. from summaryRprof, profr, ggplot, etc., only reflect the cumulative time spent for a given call or for calls beneath a given call. What does this cumulative information not tell us? Bingo: whether that call was made many times, or a few, and whether the time spent was constant over all invocations of that call or whether there are some outliers. A particular function might be executed 100 times or 100K times, but all of the cost may come from a single invocation (it shouldn't, but we don't know until we look at the data).

This only begins to describe the fun. The A->B->C example doesn't reflect the way things may really appear, such as A->B->C->D->B->E. Now, "B" may be counted a couple of times. What's more, suppose that a lot of time is spent in the C level, but we never sample at precisely that level, only seeing its child calls in the stack. We may see a sizable time for "total.time", but not for "self.time". If there are lots of different child calls under C, we may lose sight of what to optimize - should we take out C altogether or tweak the children, B, D, and E?

Just to account for the time spent, I took the sequences and ran them through digest, storing counts for the digested values, via hash. I also split up the sequences, storing {(A),(A,B), (A,B,C), etc.}. This doesn't seem so interesting, but removing singletons from the counts helps a lot in cleaning up the data. We can also store the time spent in each call by using rle(). This is useful for analyzing the distribution of time spent for a given call.
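
A sketch of that bookkeeping, reusing the reversed stacks list from the cleaning pass above (digest() is from the digest package, hash() from the hash package):

library(digest)                          # digest(): hash any R object to a string
library(hash)                            # hash(): a mutable key/value store

tally <- hash()
for (s in stacks) {
  for (k in seq_along(s)) {              # every prefix: (A), (A,B), (A,B,C), ...
    key <- digest(s[seq_len(k)])
    tally[[key]] <- if (has.key(key, tally)) tally[[key]] + 1L else 1L
  }
}
keep <- Filter(function(k) tally[[k]] > 1L, keys(tally))   # drop singletons

# Runs of identical consecutive samples approximate single invocations,
# so run length * sampling interval estimates the time per invocation:
runs      <- rle(vapply(stacks, digest, character(1)))
durations <- runs$lengths * 0.5          # seconds, given the 0.5 s interval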

Still, we're no closer to finding the actual time spent per line of code. We'll never get lines of code from the call stack. A simpler way to do this is to store a list of times throughout the code, capturing the output of proc.time() at each point of interest. Taking the differences of these times reveals which lines or sections of code are taking a long time. (Hint: that's what we're really looking for, not the actual calls.)
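
A minimal sketch of that checkpoint idea (mark() is a made-up helper, and the Sys.sleep() calls stand in for real sections of code):

checkpoints <- list()
mark <- function(label) checkpoints[[label]] <<- proc.time()

mark("start")
Sys.sleep(0.2)                       # ... first section of real code ...
mark("after_section_1")
Sys.sleep(0.1)                       # ... second section ...
mark("after_section_2")

elapsed <- vapply(checkpoints, `[[`, numeric(1), "elapsed")
diff(elapsed)                        # seconds spent in each section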

However, we have this call stack and we might as well do something useful. Going up the stack is somewhat interesting, but if we rewind the profile information to a little earlier, we can find which calls tend to precede the longer running calls. This allows us to look for landmarks in the call stack - positions where we can tie a call to a particular line of code. This makes it a bit easier to map more calls back to code, if all we have is the call stack, rather than instrumented code. (As I keep mentioning: out of context, there isn't a 1:1 mapping, but at a fine enough granularity, especially in repeatedly hit calls that are distinctive, you may be able to find landmarks in the calls that map to code.)
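
One crude version of the landmark hunt, again using the reversed stacks from above: for every sample containing a target call, tabulate its immediate caller (a sketch):

target  <- '".Call"'
callers <- unlist(lapply(stacks, function(s) {
  i <- which(s == target)
  s[i[i > 1] - 1L]                   # the frame that invoked each .Call
}))
sort(table(callers), decreasing = TRUE)[1:5]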

Altogether, I was able to find which calls were taking a lot of time, whether that was based on one long interval or many small ones, and what the distribution of time spent looked like; with some effort, I was able to map the most important and time-consuming calls back to the code and discover which parts of the code could benefit the most from rewriting or from a change in algorithms.

Statistical analysis of the call stack is loads of fun, but investigating a particular call based on cumulative time consumption is not a very good way to go. The cumulative time consumed by a call is informative on a relative basis, but it doesn't tell us whether one or many calls consumed this time, nor the depth of the call in the stack, nor the section of code responsible for the invocations. The first two things can be addressed via a bit more R code, while the last is best pursued through instrumented code.

As R doesn't yet have line profilers like Python and Matlab, the simplest way to handle this is to just instrument one's code.
