Optimizing a comparison over array elements with two conditions; C++ abstraction mechanisms?


Question

My question is a follow-up to How to make this code faster (learning best practices)?, which has been put on hold (bummer). The problem is to optimize a loop over an array with floats which are tested for whether they lie within a given interval. Indices of matching elements in the array are to be stored in a provided result array.

The test consists of two conditions (less than or equal to the upper threshold and greater than or equal to the lower one). The obvious code for the test is if( elem <= upper && elem >= lower ) ... . I observed that branching (including the implicit branch involved in the short-circuiting operator &&) is much more expensive than the second comparison. What I came up with is below. It is about 20%-40% faster than a naive implementation, which is more than I expected. It uses the fact that bool is an integer type: the result of the condition test is used as an index into two result arrays. Only one of them will contain the desired data; the other one can be discarded. This replaces program structure with data structure and computation.

I am interested in more ideas for optimization. "Technical hacks" (of the kind provided here) are welcome. I'm also interested in whether modern C++ could provide means to be faster, e.g. by enabling the compiler to create parallel running code. Think visitor pattern/functor. Computations on the single srcArr elements are almost independent, except that the order of indices in the result array depends on the order of testing the source array elements. I would loosen the requirements a little so that the order of the matching indices reported in the result array is irrelevant. Can anybody come up with a fast way?

Here is the source code of the function. A supporting main is below. gcc needs -std=c++11 because of chrono. VS 2013 express was able to compile this too (and created 40% faster code than gcc -O3).

#include <cstdlib>
#include <iostream>
#include <chrono>

using namespace std;
using namespace std::chrono;

/// Check all elements in srcArr whether they lie in 
/// the interval [lower, upper]. Store the indices of
/// such elements in the array pointed to by destArr[1]
/// and return the number of matching elements found.
/// This has been highly optimized, mainly to avoid branches.
int findElemsInInterval(    const float srcArr[],   // contains candidates
                            int **const destArr,    // two arrays to be filled with indices
                            const int arrLen,       // length of each array
                            const float lower, const float upper // interval
                        )
{
    // Instead of branching, use the condition 
    // as an index into two distinct arrays. We need to keep
    // separate indices for both those arrays.
    int destIndices[2];     
    destIndices[0] = destIndices[1] = 0;
    for( int srcInd=0; srcInd<arrLen; ++srcInd )
    {
        // If the element is inside the interval, both conditions
        // are true and therefore equal. In all other cases
        // exactly one condition is true so that they are not equal
        // (this assumes lower <= upper and that the element is not NaN;
        // for a NaN both conditions are false and thus compare equal).
        // Matching elements' indices are therefore stored in destArr[1].
        // destArr[0] is a kind of a dummy (it will incidentally contain
        // indices of non-matching elements).
        // This used to be (with a simple int *destArr)
        // if( srcArr[srcInd] <= upper && srcArr[srcInd] >= lower) destArr[destIndex++] = srcInd;
        int isInInterval = (srcArr[srcInd] <= upper) == (srcArr[srcInd] >= lower);
        destArr[isInInterval][destIndices[isInInterval]++] = srcInd;    
    }

    return destIndices[1];  // the number of elements in the results array 
}



int main(int argc, char *argv[])
{
    int arrLen = 1000*1000*100;
    if( argc > 1 ) arrLen = atol(argv[1]);

    // destArr[1] will hold the indices of elements which
    // are within the interval.
    int *destArr[2];

    // we don't check destination boundaries, so make them 
    // the same length as the source.
    destArr[0] = new int[arrLen];   
    destArr[1] = new int[arrLen];

    float *srcArr = new float[arrLen];

    // Create always the same numbers for comparison (don't srand).
    for( int srcInd=0; srcInd<arrLen; ++srcInd ) srcArr[srcInd] = rand();

    // Create an interval in the middle of the rand() spectrum
    float lowerLimit = RAND_MAX/3;
    float upperLimit = lowerLimit*2;
    cout << "lower = " << lowerLimit << ", upper = " << upperLimit << endl;

    int numInterval = 0;    // initialized so it is never read indeterminate
    auto t1 = high_resolution_clock::now(); // measure clock time as an approximation

    // Call the function a few times to get a longer run time
    for( int srcInd=0; srcInd<10; ++srcInd )  
        numInterval = findElemsInInterval( srcArr, destArr, arrLen, lowerLimit, upperLimit );

    auto t2 = high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    cout << numInterval << " elements found in " << duration << " milliseconds. " << endl;

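    // Release the buffers allocated with new[] above.
    delete[] srcArr;
    delete[] destArr[0];
    delete[] destArr[1];

    return 0;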
}
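
One caveat of the `==` trick above is that it also reports a match when both comparisons are false, which happens for NaN elements or when lower > upper. The following is a minimal sketch of an alternative that needs only one destination array: every index is stored unconditionally and the write position advances by the 0/1 result of a non-short-circuiting `&`. The function name `findElemsInIntervalSingle` and its parameters are hypothetical and not part of the original code, and whether the compiler actually emits branch-free machine code should be checked for the target platform.

```cpp
// Hypothetical single-destination variant (not from the original post).
// Stores the indices of all src elements within [lower, upper] in dest
// and returns how many were found.
// dest must have room for at least len entries, because every index is
// written once before the write position is conditionally advanced.
int findElemsInIntervalSingle(const float src[], int dest[], int len,
                              float lower, float upper)
{
    int count = 0;
    for (int i = 0; i < len; ++i)
    {
        // Unconditional store; it is overwritten by the next
        // iteration if element i turns out not to match.
        dest[count] = i;

        // Both comparisons are always evaluated (no short circuit).
        // Each yields a bool, which is promoted to 0 or 1, so the sum
        // advances count only when both conditions hold. For a NaN
        // both comparisons are false and count stays unchanged.
        count += (src[i] >= lower) & (src[i] <= upper);
    }
    return count;
}
```

Compared with the two-array version, this halves the destination memory and keeps NaN values out of the result, at the cost of writing every index to the same array.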


Recommended answer

If you allow yourself vectorization using the SSE (or better, AVX) instruction set, you can perform 4 (SSE) or 8 (AVX) comparisons in one go, do this twice, AND the results, and retrieve the 4 results (-1 or 0). At the same time, this unrolls the loop.

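// Requires #include <emmintrin.h> (SSE2)
// Preload the bounds into all four lanes
__m128 lo= _mm_set1_ps(lower);
__m128 up= _mm_set1_ps(upper);

int srcInd= 0, dstIndex= 0;
int res[4]; // The 4 comparison results, each -1 (match) or 0

for (; srcInd + 3 < arrLen; )
{
  __m128 src= _mm_loadu_ps(&srcArr[srcInd]); // Load 4 values (no alignment required)
  __m128 tst= _mm_and_ps(_mm_cmpge_ps(src, lo), _mm_cmple_ps(src, up)); // lower <= src <= upper
  _mm_storeu_si128((__m128i*)res, _mm_castps_si128(tst)); // Retrieve the 4 results

  // Copy the 4 indexes with conditional incrementation
  dstArr[dstIndex]= srcInd++; dstIndex-= res[0];
  dstArr[dstIndex]= srcInd++; dstIndex-= res[1];
  dstArr[dstIndex]= srcInd++; dstIndex-= res[2];
  dstArr[dstIndex]= srcInd++; dstIndex-= res[3];
}

// Handle the remaining (at most 3) elements without SSE
for (; srcInd < arrLen; ++srcInd)
{
  dstArr[dstIndex]= srcInd;
  dstIndex+= (srcArr[srcInd] >= lower) & (srcArr[srcInd] <= upper);
}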

Caution: unchecked code.
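
The vectorized idea can also be packaged as a self-contained function. The sketch below is one possible way to do that, assuming an x86 target with SSE support. It deviates from the snippet above in one respect: instead of reading the four -1/0 lanes individually, it uses `_mm_movemask_ps` to collapse the comparison result into a 4-bit mask and advances the write position by the corresponding bit. The function name `findElemsInIntervalSSE` is hypothetical, and the destination array is assumed to have room for `arrLen` entries.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical wrapper around the SSE loop (name and signature assumed).
// Stores indices of elements within [lower, upper] in dstArr, in
// ascending order, and returns how many were found.
int findElemsInIntervalSSE(const float srcArr[], int dstArr[], int arrLen,
                           float lower, float upper)
{
    const __m128 lo = _mm_set1_ps(lower);  // lower bound in all 4 lanes
    const __m128 up = _mm_set1_ps(upper);  // upper bound in all 4 lanes
    int srcInd = 0, dstIndex = 0;

    // Main loop: 4 elements per iteration.
    for (; srcInd + 3 < arrLen; srcInd += 4)
    {
        // Unaligned load, so srcArr needs no particular alignment.
        const __m128 src = _mm_loadu_ps(&srcArr[srcInd]);
        // A lane is all ones where lower <= src <= upper, else all zeros.
        const __m128 tst = _mm_and_ps(_mm_cmpge_ps(src, lo),
                                      _mm_cmple_ps(src, up));
        // Bit k of mask is the sign bit of lane k (1 if it matched).
        const int mask = _mm_movemask_ps(tst);

        for (int lane = 0; lane < 4; ++lane)
        {
            dstArr[dstIndex] = srcInd + lane;  // unconditional store
            dstIndex += (mask >> lane) & 1;    // advance only on a match
        }
    }

    // Scalar tail for the remaining (at most 3) elements.
    for (; srcInd < arrLen; ++srcInd)
    {
        dstArr[dstIndex] = srcInd;
        dstIndex += (srcArr[srcInd] >= lower) & (srcArr[srcInd] <= upper);
    }
    return dstIndex;
}
```

Whether this beats the scalar versions depends on the compiler and the data; the store of the indices remains scalar, so only the comparisons are vectorized.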
