Why is C++ executable running so much faster when linked against newer libstdc++.so?


Problem Description

I have a project (code here) in which I run benchmarks to compare the performance of different methods of computing the dot product (naive method, Eigen library, SIMD implementations, etc.). I am testing on a fresh CentOS 7.6 VM. I have noticed that when I use different versions of libstdc++.so.6, I get significantly different performance.

When I spin up a new CentOS 7.6 instance, the default C++ standard library is libstdc++.so.6.0.19. When I run my benchmark executable (linked against this version of libstdc++), the output is as follows:

Naive Implementation, 1000000 iterations: 1448.74 ns average time
Optimized Implementation, 1000000 iterations: 1094.2 ns average time
AVX2 implementation, 1000000 iterations: 1069.57 ns average time
Eigen Implementation, 1000000 iterations: 1027.21 ns average time
AVX & FMA implementation 1, 1000000 iterations: 1028.68 ns average time
AVX & FMA implementation 2, 1000000 iterations: 1021.26 ns average time

If I download libstdc++.so.6.0.26 and change the symbolic link libstdc++.so.6 to point to this newer library and rerun the executable (without recompiling or changing anything else), the results are as follows:

Naive Implementation, 1000000 iterations: 297.981 ns average time
Optimized Implementation, 1000000 iterations: 156.649 ns average time
AVX2 implementation, 1000000 iterations: 131.577 ns average time
Eigen Implementation, 1000000 iterations: 92.9909 ns average time
AVX & FMA implementation 1, 1000000 iterations: 78.136 ns average time
AVX & FMA implementation 2, 1000000 iterations: 80.0832 ns average time

Why is there such a significant improvement in speed (some implementations are 10x faster)?

Due to my use case, I may be required to link against libstdc++.so.6.0.19. Is there anything I can do in my code / on my side to see these speed improvements while using the older version of libstdc++?

Edit: I created a minimal reproducible example.

main.cpp

#include <iostream>
#include <vector>
#include <chrono>
#include <cmath>
#include <cstdlib>    // srand, rand
#include <cstring>
#include <ctime>      // time
#include <stdexcept>  // std::logic_error

typedef std::chrono::high_resolution_clock Clock;

const size_t SIZE_FLOAT = 512;

double computeDotProductOptomized(const std::vector<uint8_t>& v1, const std::vector<uint8_t>& v2);
void generateNormalizedData(std::vector<uint8_t>& v);

int main() {
    // Seed the random number generator
    srand(time(nullptr));

    std::vector<uint8_t> v1;
    std::vector<uint8_t> v2;

    generateNormalizedData(v1);
    generateNormalizedData(v2);

    const size_t numIterations = 10000000;
    double totalTime = 0.0;

    for (size_t i = 0; i < numIterations; ++i) {
        auto t1 = Clock::now(); 
        auto similarity = computeDotProductOptomized(v1, v2);
        auto t2 = Clock::now();

        totalTime += std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    }

    std::cout << "Average Time Taken: " << totalTime / numIterations << '\n';

    return 0;
}

double computeDotProductOptomized(const std::vector<uint8_t>& v1, const std::vector<uint8_t>& v2) {
    const auto *x = reinterpret_cast<const float*>(v1.data());
    const auto *y = reinterpret_cast<const float*>(v2.data());

    double similarity = 0;

    for (size_t i = 0; i < SIZE_FLOAT; ++i) {
        similarity += *(x + i) * *(y + i);
    }

    return similarity;
}

void generateNormalizedData(std::vector<uint8_t>& v) {
    std::vector<float> vFloat(SIZE_FLOAT);
    v.resize(SIZE_FLOAT * sizeof(float));

    for (float & i : vFloat) {
        i = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
    }

    // Normalize the vector
    float mod = 0.0;

    for (float i : vFloat) {
        mod += i * i;
    }

    float mag = std::sqrt(mod);

    if (mag == 0) {
        throw std::logic_error("The input vector is a zero vector");
    }

    for (float & i : vFloat) {
        i /= mag;
    }

    memcpy(v.data(), vFloat.data(), v.size());
}
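A side note on the example itself: reinterpret_casting the uint8_t buffer to const float* and reading through it breaks C++'s strict aliasing rules, even though GCC usually produces the expected code here. A minimal aliasing-safe sketch (packFloats and dotProductSafe are illustrative names, not from the original project) copies the bytes with memcpy before reading:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Pack a float vector into a byte buffer, as generateNormalizedData does.
std::vector<uint8_t> packFloats(const std::vector<float>& src) {
    std::vector<uint8_t> bytes(src.size() * sizeof(float));
    std::memcpy(bytes.data(), src.data(), bytes.size());
    return bytes;
}

// Aliasing-safe dot product: copy the bytes back into float storage
// before reading them, instead of reinterpret_casting the buffer.
double dotProductSafe(const std::vector<uint8_t>& v1,
                      const std::vector<uint8_t>& v2,
                      size_t count) {
    std::vector<float> x(count), y(count);
    std::memcpy(x.data(), v1.data(), count * sizeof(float));
    std::memcpy(y.data(), v2.data(), count * sizeof(float));

    double similarity = 0.0;
    for (size_t i = 0; i < count; ++i) {
        similarity += static_cast<double>(x[i]) * y[i];
    }
    return similarity;
}
```

Copying into a small buffer once per call is cheap next to the 512 multiply-adds, and it keeps the code well-defined even under -Ofast.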

CMakeLists.txt

cmake_minimum_required(VERSION 3.14)
project(dot-prod-benchmark-min-reproducible-example C CXX)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC -Ofast -ffast-math -march=broadwell")
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_CXX_STANDARD 14)

add_executable(benchmark main.cpp)

Compiled on centos-release-7-6.1810.2.el7.centos.x86_64 with cmake version 3.16.2 and gcc (GCC) 7.3.1 20180303, running on an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz with 4 vCPUs.

Using libstdc++.so.6.0.19: Average Time Taken: 1279.41
Using libstdc++.so.6.0.26: Average Time Taken: 168.219

Answer

rustyx was correct. The calls to auto t1 = Clock::now(); inside the loop were causing the poor performance. Once I moved the timing outside the loop (timing the total run instead of each iteration), the two library versions run equally fast:

    const size_t numIterations = 10000000;
    auto t1 = Clock::now(); 

    for (size_t i = 0; i < numIterations; ++i) {
        auto similarity = computeDotProductOptomized(v1, v2);
    }

    auto t2 = Clock::now();

    std::cout << "Total Time Taken: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
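The gap can be probed directly. Older libstdc++ builds on CentOS 7 were likely configured without a usable clock_gettime in libc, so the std::chrono clocks fall back to a real system call rather than the vDSO fast path; a rough sketch (the function name clockOverheadNs is mine) for estimating the per-call cost of Clock::now():

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

using HrClock = std::chrono::high_resolution_clock;

// Estimate the average cost of a single HrClock::now() call by timing
// a tight loop of calls with one outer measurement, so the probe itself
// stays out of the measured region.
double clockOverheadNs(std::size_t calls) {
    auto start = HrClock::now();
    for (std::size_t i = 0; i < calls; ++i) {
        auto t = HrClock::now();  // the call being measured
        (void)t;
    }
    auto end = HrClock::now();
    auto totalNs =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)
            .count();
    return static_cast<double>(totalNs) / static_cast<double>(calls);
}
```

Printing clockOverheadNs(1000000) under each library version makes the difference visible: with a syscall-backed clock each timestamp can cost hundreds of nanoseconds, and two of them per iteration dominate a ~100 ns dot product. One caveat about the fixed loop above: the unused similarity value could in principle let the optimizer discard the computation, so accumulating the results and printing the sum after the loop is a safer pattern.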
