How do I retrieve the processor time in Linux without function calls?


Question

I need to measure the running time of a portion of (C++) code, and I want to do this by counting the clock ticks that elapse while the code executes.

I want to read the processor time at the beginning of the code and again at the end, then subtract the two to find the number of elapsed ticks.

This can be done with the clock function. However, the time I'm measuring needs to be very precise, and a function call proved to be very intrusive: because of the caller-saved registers, the register allocator spilled many variables on each call.

So I can't use any function calls and need to retrieve the processor time myself. Assembly code is fine.

I am using Debian and an Intel i7 processor. I can't use a profiler because it's too intrusive.

Recommended answer

You should read time(7). Be aware that even if it is written in assembler, your program will be rescheduled at arbitrary moments (perhaps a context switch every millisecond; look also into /proc/interrupts and see proc(5)). A raw hardware timer reading is then meaningless. Even using the RDTSC x86-64 machine instruction to read the hardware timestamp counter is useless, since the count would be wrong after any context switch, and the Linux kernel does preemptive scheduling, which can happen at any time.
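For readers who still want to see what a call-free tick read looks like, here is a minimal sketch, assuming GCC or Clang on x86, where the `__rdtsc()` intrinsic from `<x86intrin.h>` compiles to the RDTSC instruction inline. The helper names `read_ticks`, `sum_to` and `ticks_for_sum` are hypothetical, and the non-x86 branch is only a fallback so the snippet builds elsewhere. The caveats above still apply: a context switch between the two reads inflates the result, and neither the compiler nor the CPU is required to keep the measured work strictly between the two reads.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // __rdtsc() on GCC/Clang for x86
#else
#include <chrono>       // fallback for non-x86 builds only
#endif

// Hypothetical helper: read a raw tick counter.
static inline std::uint64_t read_ticks() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();  // expands inline to the RDTSC instruction
#else
    // Not call-free: an ordinary library call, used only off x86.
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical workload standing in for the code being measured.
static inline std::uint64_t sum_to(std::uint64_t n) {
    std::uint64_t s = 0;
    for (std::uint64_t i = 1; i <= n; ++i) s += i;
    return s;
}

// Runs the workload between two tick reads; stores the workload's
// result in *result and returns the raw tick difference.
static inline std::uint64_t ticks_for_sum(std::uint64_t n,
                                          std::uint64_t* result) {
    assert(result != nullptr);
    const std::uint64_t t0 = read_ticks();
    *result = sum_to(n);
    const std::uint64_t t1 = read_ticks();
    return t1 - t0;
}
```

Calling `ticks_for_sum(1000, &r)` returns a tick difference for one run. Treat a single value as noise and compare many runs instead.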

You should consider clock_gettime(2). It is really fast (about 3.5 or 4 nanoseconds on my i5-4690S, when measuring thousands of calls to it) because of vdso(7). BTW, it is a system call, so you could code the assembler instructions for it directly. I don't think that is worth the trouble (and it could be slower than the vdso call).
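The per-call cost quoted above can be estimated with a short sketch along these lines. It assumes a POSIX system that provides `clock_gettime` with `CLOCK_MONOTONIC`. The helper names `monotonic_ns` and `average_call_cost_ns` are made up for illustration, and the figure you get depends on your machine and kernel.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>  // clock_gettime, CLOCK_MONOTONIC (POSIX)

// Hypothetical helper: current CLOCK_MONOTONIC reading in nanoseconds.
static inline std::int64_t monotonic_ns() {
    timespec ts;
    const int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;  // silence the unused warning when NDEBUG is defined
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Average cost, in nanoseconds, of one clock_gettime call, estimated
// from `calls` back-to-back calls.
static inline double average_call_cost_ns(int calls) {
    assert(calls > 0);
    const std::int64_t start = monotonic_ns();
    std::int64_t last = start;
    for (int i = 0; i < calls; ++i) {
        last = monotonic_ns();
    }
    return static_cast<double>(last - start) / calls;
}
```

For example, `average_call_cost_ns(100000)` gives a rough per-call figure. If it comes out far above a few tens of nanoseconds, the call may be going through a real system call instead of the vDSO on that system.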

BTW, any kind of profiling or benchmarking is somewhat intrusive.

Finally, if your benchmarked function runs very quickly (much less than a microsecond), cache misses become significant and even dominant (remember that an L3 cache miss that requires an actual access to the DRAM modules can take roughly a hundred nanoseconds or more, enough to run hundreds of machine instructions from the L1 I-cache). You might (and probably should) try to benchmark several (hundreds of) consecutive calls. But you won't be able to measure precisely and accurately.
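One way to benchmark many consecutive calls is sketched below, under stated assumptions. `mix` is a hypothetical stand-in for the function under test (a xorshift step) and `ns_per_call` is an illustrative helper, neither taken from the question. Each call consumes the previous result so the compiler cannot drop the loop, and the clock is read only twice, so its cost is spread over all repetitions.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>  // clock_gettime, CLOCK_MONOTONIC (POSIX)

// Hypothetical function under test: one xorshift32 step.
static std::uint32_t mix(std::uint32_t x) {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

// Times `repetitions` dependent calls to mix() and returns the average
// nanoseconds per call; the final value is stored in *out so the work
// stays observable.
static double ns_per_call(int repetitions, std::uint32_t* out) {
    assert(repetitions > 0 && out != nullptr);
    timespec t0, t1;
    std::uint32_t v = 2463534242u;  // arbitrary non-zero seed
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < repetitions; ++i) {
        v = mix(v);  // depends on the previous call's result
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *out = v;
    const double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                    + (t1.tv_nsec - t0.tv_nsec);
    return ns / repetitions;
}
```

Repeating the whole measurement several times and keeping the minimum or the median is a common way to reduce the influence of preemption and cold caches, though it still does not give an exact figure.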

Hence I believe you cannot do much better than clock_gettime, and I don't understand why it is not good enough for your case. BTW, clock(3) calls clock_gettime with CLOCK_PROCESS_CPUTIME_ID, so IMHO it should be enough, and simpler.
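As a rough illustration of measuring processor time (as opposed to wall-clock time) this way, the sketch below reads `CLOCK_PROCESS_CPUTIME_ID` directly. It assumes a POSIX system that supports that clock. `process_cpu_seconds` and `cpu_seconds_spinning` are hypothetical names, and the spin loop only stands in for real work.

```cpp
#include <cassert>
#include <time.h>  // clock_gettime, CLOCK_PROCESS_CPUTIME_ID (POSIX)

// Hypothetical helper: CPU time consumed by this process, in seconds.
static double process_cpu_seconds() {
    timespec ts;
    const int rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    assert(rc == 0);
    (void)rc;  // silence the unused warning when NDEBUG is defined
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

// Burns CPU for `iterations` loop steps and returns the CPU seconds
// charged to the process in the meantime.
static double cpu_seconds_spinning(unsigned long iterations) {
    const double before = process_cpu_seconds();
    volatile unsigned long sink = 0;  // volatile keeps the loop alive
    for (unsigned long i = 0; i < iterations; ++i) {
        sink = sink + i;
    }
    return process_cpu_seconds() - before;
}
```

Unlike a wall-clock reading, this value should not grow while the process is descheduled, which addresses part of the context-switch concern raised earlier.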

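In other words, I believe that avoiding any function call is a misconception on your part. Remember that function call overhead is a lot cheaper than a cache miss!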

See this answer to a related question (as unclear as yours); consider also using perf(1), gprof(1), oprofile(1), time(1). See this.

Finally, you should consider asking your compiler for more optimization. Have you considered compiling and linking with g++ -O3 -flto -march=native (that is, with link-time optimization)?

If your code is of a numerical and vectorial nature (so obviously and massively parallelisable), you could even consider spending months of your development time to port its core code (the numerical compute kernels) to your GPGPU in OpenCL or CUDA. But are you sure it is worth such an effort? You'll need to tune and redevelop your code when changing hardware!

You could also redesign your application to use multi-threading, JIT compilation with partial evaluation and metaprogramming techniques, multiprocessing or cloud computing (with inter-process communication, such as socket(7)-s, maybe using 0mq or other messaging libraries). This could take years of development. There is No Silver Bullet.

(Don't forget to take development costs into account; prefer algorithmic improvements where possible.)
