C ++代码执行时间随源代码的更改而变化,这不应该带来任何额外的工作 [英] C++ code execution time varies with small source change that shouldn't introduce any extra work

查看:73
本文介绍了C ++代码执行时间随源代码的更改而变化,这不应该带来任何额外的工作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在对某些代码进行基准测试时,我发现即使是最无害的代码更改,其执行时间也会有所不同.

While working on benchmarking some code, I found that its execution time would vary with even the most innocuous code changes.

我试图将下面的代码简化为最小的测试用例,但是它仍然相当冗长(对此我深表歉意).几乎更改任何内容都会在很大程度上影响基准测试结果.

I have attempted to boil down the code below to the most minimal test case, but it is still rather lengthy (for which I apologize). Changing virtually anything largely affects the benchmark results.

#include <string>
#include <vector>
#include <iostream>
#include <random>
#include <chrono>
#include <functional>

constexpr double usec_to_sec = 1000000.0;

// Simple convenience timer
class Timer
{
    std::chrono::high_resolution_clock::time_point start_time;
public:
    Timer() : start_time(std::chrono::high_resolution_clock::now()) { }
    int64_t operator()() const {
        return static_cast<int64_t>(
        std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::high_resolution_clock::now()-start_time).count()
        );
    }
};

// Convenience random number generator
template <typename T>
class RandGen
{
    mutable std::default_random_engine generator;
    std::uniform_int_distribution<T> distribution;

    constexpr unsigned make_seed() const {
        return static_cast<unsigned>(std::chrono::system_clock::now().time_since_epoch().count());
    }
public:
    RandGen(T min, T max) : generator(make_seed()), distribution(min, max) { }
    T operator ()() { return distribution(generator); }
};

// Printer class
class Printer
{
    std::string filename;
    template <class S>    
    friend Printer &operator<<(Printer &, S &&s);
public:
    Printer(const char *filename) : filename(filename) {}
};

template <class S>
Printer &operator<<(Printer &pm, S &&s) {
    std::cout << s;
    return pm;
}

// +------------+
// | Main Stuff |
// +------------+
void runtest(size_t run_length)
{
    static RandGen<size_t> word_sz_generator(10, 20);
    static RandGen<int> rand_char_generator(0, 25);

    size_t total_char_count = 0;
    std::vector<std::string> word_list;
    word_list.reserve(run_length);

    Printer printer("benchmark.dat");
    printer << "Running test... ";

    Timer timer; // start timer
    for (auto i = 0; i < run_length; i++) {

        size_t word_sz = word_sz_generator();
        std::string word;
        for (auto sz = 0; sz < word_sz; sz++) {
            word.push_back(static_cast<char>(rand_char_generator())+'a');
        }
        word_list.emplace_back(std::move(word));
        total_char_count += word_sz;
    }
    int64_t execution_time_usec = timer(); // stop timer

    printer << /*run_length*/ word_list.size() << " words, and " 
            << total_char_count << " total characters, were built in "
            << execution_time_usec/usec_to_sec << " seconds.\n";
}

int main(int argc, char **argv)
{
    constexpr size_t iterations = 30;
    constexpr size_t run_length = 50000000;

    for (auto i = 0; i < iterations; i++)
        runtest(run_length);

    return EXIT_SUCCESS;
}

第一个 st Timer只是用于计时代码的小型便利类(为简洁起见,其功能不完善).

The 1st class, Timer, is just a small convenience class (intentionally not well-featured, for brevity) for timing the code.

我尝试不使用2 nd RandGen(它只会生成随机值),但是尝试从测试代码中排除该类会使该问题自动消失.因此,我怀疑这个问题与它有关.但是我不知道怎么办.

I tried to do without the 2nd class RandGen (which just generates random values), but any attempt to exclude this from the test code made the problem auto-magically disappear. So, I suspect the issue has something to do with it. But I can't figure out how.

对于这个问题,3 rd Printer似乎完全没有必要,但同样,包括它似乎加剧了该问题.

The 3rd class Printer seems entirely unnecessary for this question, but again, including it seems to exacerbate the issue.

因此,现在我们来讨论main()(仅运行测试)和runtest().

So, now we're down to main() (which just runs the test) and runtest().

runtest()令人毛骨悚然,所以请不要从干净的代码"的角度来看它.以任何方式更改它(例如将内部for loop移动到其自己的函数)都会导致基准测试结果发生变化.最简单,最令人困惑的示例是最后一行:

runtest() is hideous, so please don't look at it from a "clean code" point-of-view. Changing it in any way (ex. moving the inner for loop to its own function) results in a change in benchmark results. The simplest, and most perplexing example is the last line:

printer << /*run_length*/ word_list.size() << " words, and " 
        << total_char_count << " total characters, were built in "
        << execution_time_usec/usec_to_sec << " seconds.\n";

在上面的行中,run_lengthword_list.size()相同.向量word_list的大小由run_length定义.但是,如果我按原样运行代码,则平均执行时间为 9.8秒,而如果取消注释run_length和注释掉word_list.size(),则执行时间实际上为增加到平均 10.6秒.我无法理解如此微不足道的代码更改如何在一定程度上影响整个程序的时序.

In the line above, run_length and word_list.size() are the same. The size of vector word_list is defined by run_length. But, if I run the code as-is, I get an average execution time of 9.8 seconds, whereas if I uncomment run_length and comment-out word_list.size(), the execution time actually increases to an average of 10.6 seconds. I can't fathom how such an insignificant code change could affect the timing of the whole program to such an extent.

换句话说...

9.8秒:

printer << /*run_length*/ word_list.size() << " words, and " 
        << total_char_count << " total characters, were built in "
        << execution_time_usec/usec_to_sec << " seconds.\n";

10.6秒:

printer << run_length /*word_list.size()*/ << " words, and " 
        << total_char_count << " total characters, were built in "
        << execution_time_usec/usec_to_sec << " seconds.\n";

我已经多次重复注释和取消注释上面提到的变量,然后重新运行基准测试.基准是可重复且一致的-即分别分别为9.8秒和10.6秒.

I have repeated the exercise of commenting and uncommenting the variables noted above, and re-running the benchmarks, many times. The benchmarks are repeatable and consistent - i.e. they are consistently 9.8 seconds and 10.6 seconds, respectively.

在两种情况下,代码输出如下所示:

The code output looks like this, for the two cases:

Running test... 50000000 words, and 750000798 total characters, were built in 9.83379 seconds.
Running test... 50000000 words, and 749978210 total characters, were built in 9.84541 seconds.
Running test... 50000000 words, and 749996688 total characters, were built in 9.87418 seconds.
Running test... 50000000 words, and 749995415 total characters, were built in 9.85704 seconds.
Running test... 50000000 words, and 750017699 total characters, were built in 9.86186 seconds.
Running test... 50000000 words, and 749998680 total characters, were built in 9.83395 seconds.
...

Running test... 50000000 words, and 749988517 total characters, were built in 10.604 seconds.
Running test... 50000000 words, and 749958011 total characters, were built in 10.6283 seconds.
Running test... 50000000 words, and 749994387 total characters, were built in 10.6374 seconds.
Running test... 50000000 words, and 749995242 total characters, were built in 10.6445 seconds.
Running test... 50000000 words, and 749988379 total characters, were built in 10.6543 seconds.
Running test... 50000000 words, and 749969532 total characters, were built in 10.6722 seconds.
...

任何会引起这种差异的信息将不胜感激.

Any information on what would cause this discrepancy would be greatly appreciated.

注释:

  1. 即使从Printer类中删除未使用的std::string filename成员对象,也会产生不同的基准结果-在这种情况下,消除(或减小比例不大)上述两个基准之间的差异.
  2. 使用g ++编译时(在Ubuntu上),这似乎不是问题.虽然,我不能确定地说.我在Ubuntu上进行的测试是在同一台Windows计算机上的VM中进行的,其中该VM可能无法访问所有资源和处理器增强功能.
  3. 我正在使用Visual Studio Community 2017(版本15.7.4)
    • 编译器版本:19.14.26431
    • 所有测试和报告的结果均为发布版本(64位)
  1. Even removing the unused std::string filename member object from the Printer class yields different benchmark results - where doing so, eliminates (or reduces to insignificant proportions) the difference between the two benchmarks provided above.
  2. This does not seem to be an issue when compiling with g++ (on Ubuntu). Although, I can’t say this definitively; my tests with Ubuntu were in a VM on the same Windows machine, where the VM perhaps did not have access to all of the resources and processor enhancements.
  3. I am using Visual Studio Community 2017 (version 15.7.4)
    • Compiler version: 19.14.26431
    • All tests and reported results are Release Build, 64-bit

推荐答案

您可能会遇到某种代码对齐效果.现代的x86-64 CPU在大多数情况下在对齐方面都相当健壮,但是对齐会影响分支预测变量(如提到的@rcgldr)中的分支互为别名,以及各种前端效果.

You're probably running into some kind of code-alignment effect. Modern x86-64 CPUs are fairly robust with respect to alignment most of the time, but alignment can affect branches aliasing each other in the branch predictors (like @rcgldr mentioned), and various front-end effects.

请参见 https://agner.org/optimize/,以及x86标签Wiki .但老实说,我认为这里没有任何有用的解释,除了您发现循环对前端或分支预测的对齐效果很敏感.这意味着即使主程序中不同对齐方式的相同机器代码也可能具有不同的性能.

See https://agner.org/optimize/, and performance links in the x86 tag wiki. But honestly I don't think there's any useful explanation here, other than that you've discovered your loop is sensitive to alignment effects, either from the front-end or from branch prediction. This means that even identical machine code at different alignments in your main program could have different performance.

这是一个已知现象.在某处有一篇文章介绍了如何以不同顺序链接目标文件会如何影响性能(这是工具链产生的意想不到的效果),但我找不到它.

This is a known phenomenon. An answer on Code alignment in one object file is affecting the performance of a function in another object file has some general comments about how alignment can matter, and see also Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly? There's an article somewhere about how linking object files in a different order can affect performance (and that this is an unexpected effect from the toolchain), but I couldn't find it.

您可以使用硬件性能计数器来衡量分支预测错误率,以查看是否可以解释一个版本比另一个版本慢的原因.或是否还有其他前端影响.

You can use HW performance counters to measure branch misprediction rates to see if that explains why one version is slower than the other. Or if there's some other front-end effect.

但是不幸的是,您无能为力.琐碎的源代码差异,如果它们完全影响了组件,将会改变所有内容的对齐方式.

But unfortunately there's not much you can do; trivial source differences, if they affect the asm at all, will change alignment for everything.

您有时可以通过将分支替换为无分支代码来重新设计对分支预测不太敏感的内容.例如总是生成16个字节的随机字母,并将其截断为随机长度. (复制时在大小上可能会出现一些分支,除非创建16字节的std::string然后将其截断可能是无分支的.)

You can sometimes redesign things to be less sensitive to branch-prediction by replacing branches with branchless code. e.g. always generate 16 bytes of random letters, and truncate that to a random length. (Some branching on size when copying it is probably inevitable, unless creating a 16-byte std::string and then truncating it can be branchless.)

您可以使用SIMD加快速度,例如使用矢量化的PRNG,例如每个生成1GiB的空格分隔的随机ASCII数字在3.9GHz的Skylake上,约0.03秒会很有用,尽管它分布不均匀,因为65536%10有余数(如65536/25),但是您可以更改质量与速度之间的权衡,并且仍然可以快速运行. )

You might speed that up with SIMD, e.g. use a vectorized PRNG like with an SSE2 or AVX2 xorshift+ to generate 16 bytes of random letters at a time. (efficiently getting a uniform 0..25 distribution with packed-byte operations may be tricky, but maybe the same technique as the 0..9 distribution I used to generate 1GiB of space-separated random ASCII digits per ~0.03 seconds on a 3.9GHz Skylake would be useful. It's not perfectly uniformly distributed, though, because 65536 % 10 has a remainder (like 65536/25), but you can probably change the quality vs. speed tradeoff and still run fast.)

runtest函数中两个内部循环版本的asm基本相同,至少如果编译器的asm输出我们看到 在该Godbolt编译器资源管理器 与您从MSVC可执行文件中实际获取的内容匹配. (与gcc/clang不同,它的asm输出不一定要组装到一个工作目标文件中.)如果您的实际发行版进行了可以内联某些库代码的任何链接时优化,则最终可能会做出不同的优化选择.可执行文件.

The asm for both versions of the inner loop in the runtest function are essentially identical, at least if the compiler asm output we see on the Godbolt compiler explorer matches what you're actually getting in the executable from MSVC. (Unlike with gcc/clang, its asm output can't necessarily be assembled into a working object file.) If your real release build does any link-time optimization that could inline some library code, it might make different optimization choices in the final executable.

我放入了#ifdef,因此可以使用-DUSE_RL来具有两个MSVC 2017输出,这些输出以不同的方式构建相同的源,并将这些asm输出提供给差异窗格. (差异"窗格位于我链接的凌乱布局的底部;单击全屏框仅显示它即可..)

I put in a #ifdef so I could use -DUSE_RL to have two MSVC 2017 outputs that built the same source different ways, and feed those asm outputs to a diff pane. (The diff pane is at the bottom in the messy layout that I linked; click the fullscreen box on it to show just that.)

整个功能的唯一区别是:

The only differences in the whole function are:

  • 该函数顶部的一些指令(如mov edx, DWORD PTR _tls_indexmov QWORD PTR run_length$GSCopy$1$[rbp-121], rcx)的排序和注册选择,这些指令仅运行一次. (但不是代码大小,因此它们以后不会影响对齐).这应该不会对以后的代码产生任何影响,它们最终会对架构状态做出相同的更改,只是使用了一个不再使用的不同的暂存规则.
  • 堆栈布局(局部变量相对于RBP的位置).但是所有偏移量都在+127以下,因此它们仍然可以使用[rbp + disp8]寻址模式.
  • 与实际源代码差异不同的代码源:

  • ordering and register choice for a few instructions like mov edx, DWORD PTR _tls_index and mov QWORD PTR run_length$GSCopy$1$[rbp-121], rcx at the top of the function which only run once. (But not in code-size so they won't affect alignment later). This should have no effect on later code, and they end up making the same changes to architectural state, just using a different scratch reg which is not used again.
  • stack layout (position of local variables relative to RBP). But all the offsets are under +127, so they can all still use a [rbp + disp8] addressing mode.
  • Different code-gen from the actual source difference:

      mov     rdx, QWORD PTR word_list$[rbp-113]
      sub     rdx, QWORD PTR word_list$[rbp-121]  ; word_list.size() = end - start 
      ...
      sar     rdx, 5               ; >> 5   arithmetic right shift

vs.

      mov     rdx, rsi             ; copy run_length from another register

不,仅这些说明无法解释速度差异.它们仅在每个I/O之前的每个时间间隔运行一次.

And no, these instructions alone can't possibly explain the speed difference. They're only run once per timing interval, before some I/O.

在上述代码差异之后,在函数底部附近的分支目标之前(在call _Xtime_get_ticks之后)的一个额外的npad 7用于对齐.

An extra npad 7 for alignment before a branch target near the bottom of the function (after a call _Xtime_get_ticks), after the above code difference.

有很大的红色/绿色差异,但来自标签编号不同,但功能开始时的这三个指令除外.

There's a big block of red/green differences, but those are only from different numbering of labels, except for those three instructions at the start of the function.

但在runtest之前,word_list.size()版本包含用于??$?6_K@@YAAEAVPrinter@@AEAV0@$QEA_K@Z PROC函数的代码,对于使用run_length的版本,该代码在任何地方都不会出现. (C ++名称处理在函数的asm名称中将类型转换为时髦的字符.)这对class Printer起作用.

But before runtest, the word_list.size() version includes code for a ??$?6_K@@YAAEAVPrinter@@AEAV0@$QEA_K@Z PROC function which doesn't appear anywhere for the version using run_length. (C++ name-mangling turns types into funky characters in the asm names of functions.) This is doing something for class Printer.

您说过从Printer中删除未使用的std::string filename会消除代码源差异.那么该功能可能会随着该更改而消失. IDK 为什么 MSVC决定完全发布它,更不用说仅在一个版本中发布另一个版本了.

You said removing the unused std::string filename from Printer removed the code-gen difference. Well that function probably goes away with that change. IDK why MSVC decided to emit it at all, let alone only in one version vs. another.

可能g++ -O3没有该代码源差异,这就是为什么看不到差异的原因. (假设您的VM是硬件虚拟化,则由g ++生成的机器代码仍将在CPU上本地运行.从OS获取新的内存页面可能会在VM中花费一点时间,但是循环所花费的主要时间可能是在此代码的用户空间中.)

Probably g++ -O3 doesn't have that code-gen difference, and that's why you don't see a difference. (Assuming your VM is hardware virtualization, g++-generated machine code is still running natively on the CPU. Getting a new page of memory from the OS might take a tiny bit longer in the VM, but the main time spent in the loop is probably in user-space in this code.)

顺便说一句,海湾合作委员会警告

BTW, gcc warns

<source>:72:24: warning: comparison of integer expressions of different signedness: 'int' and 'size_t' {aka 'long unsigned int'} [-Wsign-compare]

     for (auto i = 0; i < run_length; i++) {
                      ~~^~~~~~~~~~~~

我没有仔细查看asm输出,以查看这是否导致gcc或MSVC的代码生成更糟,或者如果传递大量输入是否只是不安全.

I didn't look closely at the asm output to see if that led to worse code-gen with gcc or MSVC, or if it's just going to be unsafe if you pass large inputs.

这篇关于C ++代码执行时间随源代码的更改而变化,这不应该带来任何额外的工作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆