Grayscale bilinear patch extraction - SSE optimization

Question

My program makes intensive use of small sub-images extracted from larger grayscale images using bilinear interpolation.

I am using the following function for this purpose:

bool extract_patch_bilin(const cv::Point2f &patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch)
{
    const int hsize = patch.rows/2;

    // ...
    // Precondition checks: patch is a preallocated square matrix and both patch and image have continuous buffers
    // ...

    int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize;
    if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows)
        return false;

    float x=patch_ctr.x-hsize-floorx;
    float y=patch_ctr.y-hsize-floory;
    float xy = x*y;
    float w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy;
    int img_stride = img.cols-patch.cols;
    uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx;
    uchar* buff_img1 = buff_img0+img.cols;
    uchar* buff_patch = (uchar*)patch.data;
    for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) {
        for(int u=0; u<patch.cols; ++u,++buff_patch,++buff_img0,++buff_img1)
            buff_patch[0] = cv::saturate_cast<uchar>(buff_img0[0]*w00+buff_img0[1]*w01+buff_img1[0]*w10+buff_img1[1]*w11);
    }
    return true;
}

Long story short, I am already using parallelization in other parts of the program, and I am considering using SSE to optimize the execution of this function, because I am mostly using 8x8 patches and it seems like a good idea to process groups of 8 pixels at a time using SSE.

However, I am not sure how to deal with the multiplication by the float interpolation weights (i.e. w00, w01, w10 and w11). These weights are necessarily non-negative and no greater than 1, hence the multiplication cannot overflow the unsigned char datatype.

Does anyone know how to proceed?

I tried to do this as follows (assuming 16x16 patches), but there is no significant speed-up:

bool extract_patch_bilin_16x16(const cv::Point2f& patch_ctr, const cv::Mat_<uchar> &img, cv::Mat_<uchar> &patch)
{
    // ...
    // Precondition checks
    // ...

    const int hsize = patch.rows/2;
    int floorx=(int)floor(patch_ctr.x)-hsize, floory=(int)floor(patch_ctr.y)-hsize;
    // Check that the full extracted patch is inside the image
    if(floorx<0 || img.cols-1<floorx+patch.cols || floory<0 || img.rows-1<floory+patch.rows)
        return false;

    // Compute the constant bilinear weights
    float x=patch_ctr.x-hsize-floorx;
    float  y=patch_ctr.y-hsize-floory;
    float  xy = x*y;
    float  w00=1-x-y+xy, w01=x-xy, w10=y-xy, w11=xy;
    // Prepare image resampling loop
    int img_stride = img.cols-patch.cols;
    uchar* buff_img0 = (uchar*)img.data+img.cols*floory+floorx;
    uchar* buff_img1 = buff_img0+img.cols;
    uchar* buff_patch = (uchar*)patch.data;
    // Precompute weighting variables
    const __m128i CONST_0 = _mm_setzero_si128();
    __m128i w00x256_32i = _mm_set1_epi32(cvRound(w00*256));
    __m128i w01x256_32i = _mm_set1_epi32(cvRound(w01*256));
    __m128i w10x256_32i = _mm_set1_epi32(cvRound(w10*256));
    __m128i w11x256_32i = _mm_set1_epi32(cvRound(w11*256));
    __m128i w00x256_16i = _mm_packs_epi32(w00x256_32i,w00x256_32i);
    __m128i w01x256_16i = _mm_packs_epi32(w01x256_32i,w01x256_32i);
    __m128i w10x256_16i = _mm_packs_epi32(w10x256_32i,w10x256_32i);
    __m128i w11x256_16i = _mm_packs_epi32(w11x256_32i,w11x256_32i);
    // Process pixels
    int ngroups = patch.cols>>4; // number of 16-pixel groups per patch row
    for(int v=0; v<patch.rows; ++v,buff_img0+=img_stride,buff_img1+=img_stride) {
        for(int g=0; g<ngroups; ++g,buff_patch+=16,buff_img0+=16,buff_img1+=16) {
                ////////////////////////////////
                // Load the data (16 pixels in one load)
                ////////////////////////////////
                __m128i val00 = _mm_loadu_si128((__m128i*)buff_img0);
                __m128i val01 = _mm_loadu_si128((__m128i*)(buff_img0+1));
                __m128i val10 = _mm_loadu_si128((__m128i*)buff_img1);
                __m128i val11 = _mm_loadu_si128((__m128i*)(buff_img1+1));
                ////////////////////////////////
                // Process the lower 8 values
                ////////////////////////////////
                // Unpack into 16-bits integers
                __m128i val00_lo = _mm_unpacklo_epi8(val00,CONST_0);
                __m128i val01_lo = _mm_unpacklo_epi8(val01,CONST_0);
                __m128i val10_lo = _mm_unpacklo_epi8(val10,CONST_0);
                __m128i val11_lo = _mm_unpacklo_epi8(val11,CONST_0);
                // Multiply with the integer weights
                __m128i w256val00_lo = _mm_mullo_epi16(val00_lo,w00x256_16i);
                __m128i w256val01_lo = _mm_mullo_epi16(val01_lo,w01x256_16i);
                __m128i w256val10_lo = _mm_mullo_epi16(val10_lo,w10x256_16i);
                __m128i w256val11_lo = _mm_mullo_epi16(val11_lo,w11x256_16i);
                // Divide by 256 to get the approximate result of the multiplication with floating-point weights
                __m128i wval00_lo = _mm_srli_epi16(w256val00_lo,8);
                __m128i wval01_lo = _mm_srli_epi16(w256val01_lo,8);
                __m128i wval10_lo = _mm_srli_epi16(w256val10_lo,8);
                __m128i wval11_lo = _mm_srli_epi16(w256val11_lo,8);
                // Add pairwise
                __m128i sum0_lo = _mm_add_epi16(wval00_lo,wval01_lo);
                __m128i sum1_lo = _mm_add_epi16(wval10_lo,wval11_lo);
                __m128i final_lo = _mm_add_epi16(sum0_lo,sum1_lo);
                ////////////////////////////////
                // Process the higher 8 values
                ////////////////////////////////
                // Unpack into 16-bits integers
                __m128i val00_hi = _mm_unpackhi_epi8(val00,CONST_0);
                __m128i val01_hi = _mm_unpackhi_epi8(val01,CONST_0);
                __m128i val10_hi = _mm_unpackhi_epi8(val10,CONST_0);
                __m128i val11_hi = _mm_unpackhi_epi8(val11,CONST_0);
                // Multiply with the integer weights
                __m128i w256val00_hi = _mm_mullo_epi16(val00_hi,w00x256_16i);
                __m128i w256val01_hi = _mm_mullo_epi16(val01_hi,w01x256_16i);
                __m128i w256val10_hi = _mm_mullo_epi16(val10_hi,w10x256_16i);
                __m128i w256val11_hi = _mm_mullo_epi16(val11_hi,w11x256_16i);
                // Divide by 256 to get the approximate result of the multiplication with floating-point weights
                __m128i wval00_hi = _mm_srli_epi16(w256val00_hi,8);
                __m128i wval01_hi = _mm_srli_epi16(w256val01_hi,8);
                __m128i wval10_hi = _mm_srli_epi16(w256val10_hi,8);
                __m128i wval11_hi = _mm_srli_epi16(w256val11_hi,8);
                // Add pairwise
                __m128i sum0_hi = _mm_add_epi16(wval00_hi,wval01_hi);
                __m128i sum1_hi = _mm_add_epi16(wval10_hi,wval11_hi);
                __m128i final_hi = _mm_add_epi16(sum0_hi,sum1_hi);
                ////////////////////////////////
                // Repack all values
                ////////////////////////////////
                __m128i final_val = _mm_packus_epi16(final_lo,final_hi);
                _mm_storeu_si128((__m128i*)buff_patch,final_val);
        }
    }
    return true;
}

Any idea what could be done to improve the speed-up?

Recommended answer

I would consider sticking to integers: your weights are multiples of 1/64, so working in 8.6 fixed point is enough, and that fits in 16-bit numbers.
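As a rough illustration of the fixed-point idea, the sketch below assumes the sub-pixel offsets are quantized to eighths of a pixel, which is what makes the four weights multiples of 1/64. The names `Weights6`, `make_weights` and `blend` are hypothetical and do not come from the question or the answer.

```cpp
#include <cassert>
#include <cstdint>

// Four bilinear weights scaled by 64 (6 fractional bits); they always sum to 64.
struct Weights6 {
    std::uint16_t w00, w01, w10, w11;
};

// fx, fy: sub-pixel offsets in eighths of a pixel, each in [0, 7] (assumption).
inline Weights6 make_weights(int fx, int fy) {
    assert(0 <= fx && fx < 8 && 0 <= fy && fy < 8);
    Weights6 w;
    w.w00 = static_cast<std::uint16_t>((8 - fx) * (8 - fy));
    w.w01 = static_cast<std::uint16_t>(fx * (8 - fy));
    w.w10 = static_cast<std::uint16_t>((8 - fx) * fy);
    w.w11 = static_cast<std::uint16_t>(fx * fy);
    return w;
}

// Blend one quad of pixels. The weighted sum is at most 255 * 64 = 16320,
// so it fits in an unsigned 16-bit accumulator.
inline std::uint8_t blend(const Weights6& w, std::uint8_t p00, std::uint8_t p01,
                          std::uint8_t p10, std::uint8_t p11) {
    const std::uint16_t acc = static_cast<std::uint16_t>(
        p00 * w.w00 + p01 * w.w01 + p10 * w.w10 + p11 * w.w11);
    return static_cast<std::uint8_t>((acc + 32) >> 6);  // round, then drop 6 bits
}
```

Because the sum is shifted only once, the rounding error stays within half a grey level, whereas shifting each of the four products separately (as in the question's code) can lose up to one level per term.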

Bilinear interpolation is best done as three linear ones (two on Y, then one on X; you can reuse the second Y interpolation for the neighboring patch).
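The decomposition into three linear interpolations can be sketched in scalar code as follows. This is only an illustration under assumptions: offsets are given in eighths of a pixel, the reuse shown here is between horizontally neighboring quads, and `interp_row` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Interpolate one output row of `width` pixels from two input rows.
// top and bot must each hold width + 1 pixels; fx, fy are offsets in eighths (0..7).
inline std::vector<std::uint8_t> interp_row(const std::uint8_t* top, const std::uint8_t* bot,
                                            int width, int fx, int fy) {
    assert(width > 0 && 0 <= fx && fx < 8 && 0 <= fy && fy < 8);
    std::vector<std::uint8_t> out(width);
    // First vertical interpolation (scaled by 8): left column of the first quad.
    int left = top[0] * (8 - fy) + bot[0] * fy;
    for (int u = 0; u < width; ++u) {
        // Second vertical interpolation: right column of this quad.
        const int right = top[u + 1] * (8 - fy) + bot[u + 1] * fy;
        // Horizontal interpolation; the total scale is 8 * 8 = 64.
        out[u] = static_cast<std::uint8_t>((left * (8 - fx) + right * fx + 32) >> 6);
        left = right;  // the right column becomes the next quad's left column
    }
    return out;
}
```

Each output pixel then costs one new vertical interpolation plus one horizontal one, instead of four multiplications.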

To perform a linear interpolation between two values, you pre-store, once and for all, the interpolation weights P and Q (8 to 1 and 0 to 7), and multiply and add them in pairs like V0.P[i]+V1.Q[i]. This is done efficiently with the PMADDUBSW instruction (after appropriate data interleaving and replication of the values V0 and V1, with PUNPCKLBW and the like).
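To make the data layout concrete, here is a scalar model of what PMADDUBSW computes, applied to the interleaved V0/V1 bytes and P/Q weights described above. It deliberately avoids the SSSE3 intrinsic so that it builds without special compiler flags; `maddubs_model` and `lerp8` are hypothetical names used only for this sketch.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of PMADDUBSW: unsigned bytes of `a` times signed bytes of `b`,
// adjacent products added with signed 16-bit saturation.
inline std::array<std::int16_t, 8> maddubs_model(const std::array<std::uint8_t, 16>& a,
                                                 const std::array<std::int8_t, 16>& b) {
    std::array<std::int16_t, 8> r{};
    for (int i = 0; i < 8; ++i) {
        const int s = a[2 * i] * b[2 * i] + a[2 * i + 1] * b[2 * i + 1];
        r[i] = static_cast<std::int16_t>(std::min(32767, std::max(-32768, s)));
    }
    return r;
}

// Eight values interpolated between v0 and v1, each still scaled by 8.
inline std::array<std::int16_t, 8> lerp8(std::uint8_t v0, std::uint8_t v1) {
    std::array<std::uint8_t, 16> pix{};
    std::array<std::int8_t, 16> wgt{};
    for (int i = 0; i < 8; ++i) {
        pix[2 * i] = v0;                               // interleaved V0, V1, V0, V1, ...
        pix[2 * i + 1] = v1;
        wgt[2 * i] = static_cast<std::int8_t>(8 - i);  // P[i] = 8 .. 1
        wgt[2 * i + 1] = static_cast<std::int8_t>(i);  // Q[i] = 0 .. 7
    }
    const std::array<std::int16_t, 8> r = maddubs_model(pix, wgt);
    // P[i] + Q[i] == 8, so every lane is at most 255 * 8 = 2040: saturation never triggers.
    for (std::int16_t v : r) {
        assert(v >= 0 && v <= 2040);
    }
    return r;
}
```

In a real SSE version the two 16-byte arrays would be `__m128i` registers and `maddubs_model` would be replaced by the instruction itself.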

In the end, divide by the total weight (PSRLW) and rescale to bytes (PACKUSWB). (This step needs to be performed only once, combining the two interpolations.)

You could think of doubling all the weights, so that the final scaling is by 8 bits and PACKUSWB alone would suffice, but unfortunately it saturates the values and there is no unsaturated equivalent.

It may be better to precompute all 64 interpolation weights and sum the four bilinear terms.
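One possible reading of this suggestion is a lookup table with one set of four weights per sub-pixel position on an 8x8 grid. The sketch below follows that reading; the table layout and the names `WeightTable`, `build_weight_table` and `blend_quad` are assumptions, not something the answer specifies.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One entry per (fy, fx) offset in eighths of a pixel.
// Each entry holds {w00, w01, w10, w11} scaled by 64.
using WeightTable = std::array<std::array<std::array<std::uint8_t, 4>, 8>, 8>;

inline WeightTable build_weight_table() {
    WeightTable t{};
    for (int fy = 0; fy < 8; ++fy) {
        for (int fx = 0; fx < 8; ++fx) {
            t[fy][fx][0] = static_cast<std::uint8_t>((8 - fx) * (8 - fy));  // w00
            t[fy][fx][1] = static_cast<std::uint8_t>(fx * (8 - fy));        // w01
            t[fy][fx][2] = static_cast<std::uint8_t>((8 - fx) * fy);        // w10
            t[fy][fx][3] = static_cast<std::uint8_t>(fx * fy);              // w11
        }
    }
    return t;
}

// Sum the four bilinear terms for one quad at offset (fx, fy).
inline std::uint8_t blend_quad(const WeightTable& t, int fx, int fy,
                               std::uint8_t p00, std::uint8_t p01,
                               std::uint8_t p10, std::uint8_t p11) {
    assert(0 <= fx && fx < 8 && 0 <= fy && fy < 8);
    const std::array<std::uint8_t, 4>& w = t[fy][fx];
    const int acc = p00 * w[0] + p01 * w[1] + p10 * w[2] + p11 * w[3];
    return static_cast<std::uint8_t>((acc + 32) >> 6);  // round, then divide by 64
}
```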

Update:

If the goal is to interpolate with fixed coefficients for all pixel quads (actually achieving a subpixel translation), the strategy is different.

You will load a run of 8 (16?) pixels corresponding to the upper-left corners, a run of 8 shifted one pixel to the right (corresponding to the upper-right corners), and similarly for the next row (the bottom corners); multiply the pixel values by the corresponding interpolation weights and add them in pairs (PMADDUBSW), then combine the pairs (PADDW). Store the weights with replication.

Another option is to avoid PMADDUBSW and perform separate multiplies (PMULLW) and adds (PADDW). This simplifies the reorganization scheme.

After scaling (as above), you end up with a run of 8 interpolated values.
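A minimal SSE2 sketch of the PMULLW/PADDW variant for one run of 8 pixels is given below. It assumes the four weights have already been converted to integers scaled by 64, and `interp8_sse2` is a hypothetical name. Only SSE2 intrinsics from `<emmintrin.h>` are used.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Interpolate 8 pixels with fixed weights scaled by 64 (w00 + w01 + w10 + w11 == 64).
// row0 and row1 must each have at least 9 readable bytes; dst receives 8 bytes.
inline void interp8_sse2(const std::uint8_t* row0, const std::uint8_t* row1,
                         int w00, int w01, int w10, int w11, std::uint8_t* dst) {
    assert(w00 >= 0 && w01 >= 0 && w10 >= 0 && w11 >= 0);
    assert(w00 + w01 + w10 + w11 == 64);
    const __m128i zero = _mm_setzero_si128();
    // Load 8 bytes per corner and widen them to 16-bit lanes (PUNPCKLBW with zero).
    const __m128i p00 = _mm_unpacklo_epi8(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row0)), zero);
    const __m128i p01 = _mm_unpacklo_epi8(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row0 + 1)), zero);
    const __m128i p10 = _mm_unpacklo_epi8(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row1)), zero);
    const __m128i p11 = _mm_unpacklo_epi8(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(row1 + 1)), zero);
    // PMULLW + PADDW: the sum is at most 255 * 64 = 16320, so 16-bit lanes do not overflow.
    __m128i acc = _mm_mullo_epi16(p00, _mm_set1_epi16(static_cast<short>(w00)));
    acc = _mm_add_epi16(acc, _mm_mullo_epi16(p01, _mm_set1_epi16(static_cast<short>(w01))));
    acc = _mm_add_epi16(acc, _mm_mullo_epi16(p10, _mm_set1_epi16(static_cast<short>(w10))));
    acc = _mm_add_epi16(acc, _mm_mullo_epi16(p11, _mm_set1_epi16(static_cast<short>(w11))));
    acc = _mm_add_epi16(acc, _mm_set1_epi16(32));  // rounding term
    acc = _mm_srli_epi16(acc, 6);                  // PSRLW: divide by the total weight
    // PACKUSWB: narrow to bytes and store the low 8 of them.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), _mm_packus_epi16(acc, acc));
}
```

Compared with the code in the question, this shifts once after the four products have been summed rather than once per product, which saves three shifts per run and reduces the rounding error.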

This can work as well for variable interpolation weights, as long as you interpolate exactly one pixel per quad.
