What caused my elapsed time to be much longer than my user time?


Question

I am benchmarking some R statements (see details here) and found that my elapsed time is way longer than my user time.

   user  system elapsed 
  7.910   7.750  53.916 

Could someone help me to understand what factors (R or hardware) determine the difference between user time and elapsed time, and how I can improve it? In case it helps: I am running data.table data manipulation on a MacBook Air with a 1.7 GHz i5 and 4 GB of RAM.

Update: My crude understanding is that user time is the time my CPU takes to process my job, while elapsed time is the length of time from when I submit the job until I get the data back. What else did my computer need to do after processing for 8 seconds?

Update: as suggested in the comments, I ran the procedure a couple of times on two data.tables: Y, with 104 columns (sorry, I add more columns as time goes by), and X, a subset of Y with only the 3 key columns. Below are the updated timings. Please note that I ran these two procedures consecutively, so the memory state should be similar.

X <- Y[, list(Year, MemberID, Month)]

system.time({
    X[, Month := -Month]
    setkey(X, Year, MemberID, Month)
    X[, Month := -Month]
})
   user  system elapsed 
  3.490   0.031   3.519 

system.time({
    Y[, Month := -Month]
    setkey(Y, Year, MemberID, Month)
    Y[, Month := -Month]
})
   user  system elapsed 
  8.444   5.564  36.284 

Here are the sizes of the only two objects in my workspace (commas added):

object.size(X)
83,237,624 bytes

 object.size(Y)
2,449,521,080 bytes

Thanks!

Answer

User time is how many seconds the computer spent doing your calculations. System time is how much time the operating system spent responding to your program's requests. Elapsed time is the sum of those two, plus whatever "waiting around" your program and/or the OS had to do. It's important to note that these numbers are the aggregate of time spent. Your program might compute for 1 second, then wait on the OS for one second, then wait on disk for 3 seconds and repeat this cycle many times while it's running.
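
To make that "waiting around" component concrete, here is a minimal sketch (my addition, not one of the benchmarks below): a program that does nothing but sleep. Sleeping consumes no CPU, so under `time` it should show real close to 3 seconds while user and sys stay near zero -- the mirror image of a CPU-bound run.

#include <unistd.h>

int main()
{
    /* sleep() blocks without using the CPU, so this program accumulates
     * almost no user or sys time -- only elapsed (real) time. */
    sleep(3);
    return 0;
}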

Based on the fact that your program took about as much system time as user time, it was doing something very IO intensive: reading from disk a lot, or writing to disk a lot. RAM is pretty fast, usually a few hundred nanoseconds per access, so if everything fits in RAM the elapsed time is usually just a little bit longer than the user time. But a disk might take a few milliseconds to seek and even longer to reply with the data; that's slower by a factor of ten thousand or more.
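
If you want to check the IO theory directly rather than inferring it from `time`, one option (a sketch of mine, not something from the original question) is getrusage(), which reports how many major page faults -- memory accesses that had to be served from disk -- a process has taken, along with its user and system CPU time:

#include <stdio.h>
#include <sys/resource.h>

int main()
{
    /* ... do the work you want to measure here ... */

    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return 1;

    /* Major faults needed disk IO; minor faults were satisfied from RAM. */
    printf("major page faults (disk): %ld\n", usage.ru_majflt);
    printf("minor page faults (RAM):  %ld\n", usage.ru_minflt);
    printf("user CPU:   %ld.%06ld s\n",
           (long)usage.ru_utime.tv_sec, (long)usage.ru_utime.tv_usec);
    printf("system CPU: %ld.%06ld s\n",
           (long)usage.ru_stime.tv_sec, (long)usage.ru_stime.tv_usec);
    return 0;
}

A large major-fault count relative to the work done is a strong hint that the process spent its missing elapsed time waiting on the disk (or, later in this answer, on swap).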

We've determined that your processor was "doing stuff" for ~8 + ~8 = ~ 16 seconds. What was it doing for the other ~54 - ~16 = ~38 seconds? Waiting for the hard drive to send it the data it asked for.

Update 1:

Matthew made some excellent points: I'm making assumptions that I probably shouldn't be making. Adam, if you'd care to publish a list of all the columns in your table (the datatypes are all we need), we can get a better idea of what's going on.

I just cooked up a little do-nothing program to validate my assumption that time not spent in userspace and kernel space is likely spent waiting for IO.

#include <stdio.h>

int main()
{
    /* Pure arithmetic in a tight loop: all CPU, no system calls.
     * Compile without optimization (e.g. gcc -O0), or the compiler
     * may remove the loop entirely. */
    int i;
    for(i = 0; i < 1000000000; i++)
    {
        int j, k, l, m;
        j = 10;
        k = i;
        l = j + k;
        m = j + k - i + l;
    }
    return 0;
}

When I run the resulting program and time it I see something like this:

mike@computer:~$ time ./waste_user
real    0m4.670s
user    0m4.660s
sys 0m0.000s
mike@computer:~$ 

As you can see by inspection, the program does no real work, and as such it doesn't ask the kernel to do anything beyond loading it into RAM and starting it running. So nearly ALL the "real" time is spent as "user" time.

Now a kernel-heavy do-nothing program (with fewer iterations to keep the running time reasonable):

#include <stdio.h>

int main()
{
    /* Reading /dev/urandom makes the kernel generate pseudo-random
     * bytes, so the time is charged to sys rather than user. */
    FILE * random;
    random = fopen("/dev/urandom", "r");
    if(random == NULL) { return 1; }
    int i;
    for(i = 0; i < 10000000; i++)
    {
        fgetc(random);
    }
    fclose(random);
    return 0;
}

When I run that one, I see something more like this:

mike@computer:~$ time ./waste_sys
real    0m1.138s
user    0m0.090s
sys     0m1.040s
mike@computer:~$ 

Again it's easy to see by inspection that the program does little more than ask the kernel to give it random bytes. /dev/urandom is a non-blocking source of entropy. What does that mean? The kernel uses a pseudo-random number generator to quickly generate "random" values for our little test program. That means the kernel has to do some computation but it can return very quickly. So this program mostly waits for the kernel to compute for it, and we can see that reflected in the fact that almost all the time is spent on sys.

Now we're going to make one little change. Instead of reading from /dev/urandom, which is non-blocking, we'll read from /dev/random, which is blocking. What does that mean? It doesn't do much computing, but rather it waits around for stuff to happen on your computer that the kernel developers have empirically determined is random. (We'll also do far fewer iterations, since this stuff takes much longer.)

#include <stdio.h>

int main()
{
    /* /dev/random blocks until the kernel has gathered enough entropy,
     * so this program spends nearly all its time waiting, not computing. */
    FILE * random;
    random = fopen("/dev/random", "r");
    if(random == NULL) { return 1; }
    int i;
    for(i = 0; i < 100; i++)
    {
        fgetc(random);
    }
    fclose(random);
    return 0;
}

And when I run and time this version of the program, here's what I see:

mike@computer:~$ time ./waste_io
real    0m41.451s
user    0m0.000s
sys     0m0.000s
mike@computer:~$ 

It took 41 seconds to run, but immeasurably small amounts of time on user and sys. Why is that? All the time was spent in the kernel, but not doing active computation. The kernel was just waiting for stuff to happen. Once enough entropy was collected, the kernel would wake back up and send the data back to the program. (Note it might take much less or much more time to run on your computer, depending on what all is going on.) I argue that the difference in time between user+sys and real is IO.

So what does all this mean? It doesn't prove that my answer is right because there could be other explanations for why you're seeing the behavior that you are. But it does demonstrate the differences between user compute time, kernel compute time and what I'm claiming is time spent doing IO.

Here's my source for the difference between /dev/urandom and /dev/random: http://en.wikipedia.org/wiki//dev/random

Update 2:

I thought I would try and address Matthew's suggestion that perhaps L2 cache misses are at the root of the problem. The Core i7 has a 64 byte cache line. I don't know how much you know about caches, so I'll provide some details. When you ask for a value from memory the CPU doesn't get just that one value, it gets all 64 bytes around it. That means if you're accessing memory in a very predictable pattern -- like say array[0], array[1], array[2], etc -- it takes a while to get value 0, but then 1, 2, 3, 4... are much faster. Until you get to the next cache line, that is. If this were an array of ints, 0 would be slow, 1..15 would be fast, 16 would be slow, 17..31 would be fast, etc.

http://software.intel.com/en-us/forums/topic/296674

In order to test this out I've made two programs. They both have an array of structs in them with 1024*1024 elements. In one case the struct has a single double in it, in the other it's got 8 doubles in it. A double is 8 bytes long so in the second program we're accessing memory in the worst possible fashion for a cache. The first will get to use the cache nicely.

#include <stdio.h>
#include <stdlib.h>
#define MANY_MEGS 1048576
typedef struct {
    double a;
} PartialLine;
int main()
{
    int i, j;
    PartialLine* many_lines;
    int total_bytes = MANY_MEGS * sizeof(PartialLine);
    printf("Striding through %d total bytes, %zu bytes at a time\n",
           total_bytes, sizeof(PartialLine));
    many_lines = (PartialLine*) malloc(total_bytes);
    if(many_lines == NULL) { return 1; }
    PartialLine line;
    double x;
    /* Sequential 8-byte strides: 8 structs share each 64-byte cache
     * line, so 7 out of every 8 accesses are cache hits. */
    for(i = 0; i < 300; i++)
    {
        for(j = 0; j < MANY_MEGS; j++)
        {
            line = many_lines[j];
            x = line.a;
        }
    }
    return 0;
}

When I run this program I see this output:

mike@computer:~$ time ./cache_hits
Striding through 8388608 total bytes, 8 bytes at a time
real    0m3.194s
user    0m3.140s
sys     0m0.016s
mike@computer:~$

Here's the program with the big structs; each one takes up 64 bytes of memory, not 8.

#include <stdio.h>
#include <stdlib.h>
#define MANY_MEGS 1048576
typedef struct {
    double a, b, c, d, e, f, g, h;
} WholeLine;
int main()
{
    int i, j;
    WholeLine* many_lines;
    int total_bytes = MANY_MEGS * sizeof(WholeLine);
    printf("Striding through %d total bytes, %zu bytes at a time\n",
           total_bytes, sizeof(WholeLine));
    many_lines = (WholeLine*) malloc(total_bytes);
    if(many_lines == NULL) { return 1; }
    WholeLine line;
    double x;
    /* Each 64-byte struct fills a whole cache line, so every access
     * lands on a new line: the worst case for the cache. */
    for(i = 0; i < 300; i++)
    {
        for(j = 0; j < MANY_MEGS; j++)
        {
            line = many_lines[j];
            x = line.a;
        }
    }
    return 0;
}

And when I run it, I see this:

mike@computer:~$ time ./cache_misses
Striding through 67108864 total bytes, 64 bytes at a time
real    0m14.367s
user    0m14.245s
sys     0m0.088s
mike@computer:~$ 

The second program -- the one designed to have cache misses -- took five times as long to run for the exact same number of memory accesses.

Also worth noting is that in both cases, all the time was spent in user, not sys. That means the OS counts the time your program spends waiting for data against your program, not against the operating system. Given these two examples, I think it's unlikely that cache misses are causing your elapsed time to be substantially longer than your user time.

Update 3:

I just saw your update that the really slimmed down table ran about 10x faster than the regular-sized one. That too would indicate to me that (as another Matthew also said) you're running out of RAM.

Once your program tries to use more memory than your computer actually has installed, it starts swapping to disk. This is better than your program crashing, but it's much slower than RAM and can cause substantial slowdowns.

I'll try and put together an example that shows swap problems tomorrow.

Update 4:

Okay, here's an example program which is very similar to the previous one. But now the struct is 4096 bytes, not 8 bytes. In total this program will use 2GB of memory rather than 64MB. I also changed things up a bit to make sure I access things randomly instead of element-by-element, so that the kernel can't get smart and start anticipating my program's needs. The caches are driven by hardware (driven solely by simple heuristics), but it's entirely possible that kswapd (the kernel swap daemon) could be substantially smarter than the cache.

#include <stdio.h>
#include <stdlib.h>
typedef struct {
    double numbers[512];   /* 4096 bytes: one memory page per struct */
} WholePage;
int main()
{
    int memory_ops = 1024*1024;
    int total_memory = memory_ops / 2;
    int num_chunks = 8;
    int chunk_bytes = total_memory / num_chunks * sizeof(WholePage);
    int i, j, k, l;
    printf("Bouncing through %d MB, %zu bytes at a time\n",
           chunk_bytes/1024*num_chunks/1024, sizeof(WholePage));
    WholePage* many_pages[num_chunks];
    for(i = 0; i < num_chunks; i++)
    {
        many_pages[i] = (WholePage*) malloc(chunk_bytes);
        if(many_pages[i] == NULL) { exit(1); }
    }
    WholePage* page_list;
    WholePage* page;
    double x;
    /* Jump to a random page on every access, so neither the hardware
     * prefetcher nor the kernel can predict what we'll touch next. */
    for(i = 0; i < 300*memory_ops; i++)
    {
        j = rand() % num_chunks;
        k = rand() % (total_memory / num_chunks);
        l = rand() % 512;
        page_list = many_pages[j];
        page = page_list + k;
        x = page->numbers[l];
    }
    return 0;
}

Going from the program I called cache_hits to cache_misses, we saw the amount of memory increase 8x and the execution time increase 5x. What do you expect to see when we run this program? It uses 32x as much memory as cache_misses but has the same number of memory accesses.

mike@computer:~$ time ./page_misses
Bouncing through 2048 MB, 4096 bytes at a time
real    2m1.327s
user    1m56.483s
sys     0m0.588s
mike@computer:~$ 

It took 8x as long as cache_misses and 40x as long as cache_hits. And this is on a computer with 4GB of RAM. I used 50% of my RAM in this program versus 1.5% for cache_misses and 0.2% for cache_hits. It got substantially slower even though it wasn't using up ALL the RAM my computer has. It was enough to be significant.

I hope this is a decent primer on how to diagnose problems with programs running slow.
