Optimizing an arithmetic coder


Question

I'm in the process of optimizing the encoding step of a C++ library called PackJPG.

I've profiled the code with Intel VTune and found that the current bottleneck is the following function in the arithmetic coder that PackJPG uses:

void aricoder::encode( symbol* s )
{   
    // update steps, low count, high count
    unsigned int delta_plus_one = ((chigh - clow) + 1);
    cstep = delta_plus_one / s->scale;
    chigh = clow + ( cstep * s->high_count ) - 1;
    clow  = clow + ( cstep * s->low_count );

    // e3 scaling is performed for speed and to avoid underflows
    // if low and high are both in the lower half or both in the upper half,
    // one bit can be safely shifted out
    while ( ( clow >= CODER_LIMIT050 ) || ( chigh < CODER_LIMIT050 ) ) {
        if ( chigh < CODER_LIMIT050 ) { // both high and low are below the halfway point, so a 0 can be safely shifted out
            // write 0 bit
            write_zero();
            // shift out remaining e3 bits
            write_nrbits_as_one();

        }
        else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
            // write 1 bit
            write_one();
            clow  &= CODER_LIMIT050 - 1;
            chigh &= CODER_LIMIT050 - 1;
            // shift out remaining e3 bits

            write_nrbits_as_zeros();
        }
        clow  <<= 1;
        chigh = (chigh << 1) | 1;

    }

    // e3 scaling, to make sure that there's enough space between low and high
    while ( ( clow >= CODER_LIMIT025 ) && ( chigh < CODER_LIMIT075 ) ) {
        ++nrbits;
        clow  &= CODER_LIMIT025 - 1;
        chigh ^= CODER_LIMIT025 + CODER_LIMIT050;
        // clow  -= CODER_LIMIT025;
        // chigh -= CODER_LIMIT025;
        clow  <<= 1;
        chigh = (chigh << 1) | 1;

    }
}

This function seems to borrow some ideas from: http://paginas.fe.up.pt/~vinhoza/itpa/bodden-07-arithmetic-TR.pdf. I've managed to optimize the function somewhat (primarily by speeding up the bit writing) but now I'm stuck.

Right now the biggest bottleneck seems to be the division at the beginning. This screenshot from VTune shows the timing results as well as the generated assembly (the blue assembly on the right corresponds to the source line selected on the left).

s->scale is not necessarily a power of 2, so the division can't simply be replaced with a bit shift.
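
If s->scale were a power of two, the whole division would collapse to a shift, roughly like the sketch below (illustrative only, not part of PackJPG; the helper name and the log2 loop are made up):

#include <cstdint>

// Illustrative fast path: valid solely when scale == 1u << k for some k,
// which PackJPG cannot guarantee for s->scale.
static inline uint32_t div_if_pow2(uint32_t n, uint32_t scale)
{
    unsigned int k = 0;
    while ((1u << k) < scale) ++k;   // k = log2(scale), assuming a power of two
    return n >> k;
}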

The code was compiled with MSVC (from Visual Studio 2013) with the following settings:

/GS /Qpar- /GL /analyze- /W3 /Gy- /Zc:wchar_t /Zi /Gm- /Ox /sdl /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_USRDLL" /D "PACKJPG_EXPORTS" /D "_CRT_SECURE_NO_WARNINGS" /D "BUILD_DLL" /D "_WINDLL" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:IA32 /Gd /Oy- /Oi /MT /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\PackJPG.pch" 

Any ideas on how to optimize this further?

UPDATE 1

I've now tried all the suggestions so far, and this is the fastest version:

void aricoder::encode( symbol* s )
{   
    unsigned int clow_copy = clow;
    unsigned int chigh_copy = chigh;
    // update steps, low count, high count
    unsigned int delta_plus_one = ((chigh_copy - clow_copy) + 1);
    unsigned register int cstep = delta_plus_one / s->scale;

    chigh_copy = clow_copy + (cstep * s->high_count) - 1;
    clow_copy = clow_copy + (cstep * s->low_count);

    // e3 scaling is performed for speed and to avoid underflows
    // if low and high are both in the lower half or both in the upper half,
    // one bit can be safely shifted out
    while ((clow_copy >= CODER_LIMIT050) || (chigh_copy < CODER_LIMIT050)) {
        if (chigh_copy < CODER_LIMIT050) {  // both high and low are below the halfway point, so a 0 can be safely shifted out
            // write 0 bit
            write_zero();
            // shift out remaining e3 bits
            write_nrbits_as_one();

        }
        else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
            // write 1 bit
            write_one();
            clow_copy &= CODER_LIMIT050 - 1;
            chigh_copy &= CODER_LIMIT050 - 1;
            // shift out remaining e3 bits

            write_nrbits_as_zeros();
        }
        clow_copy <<= 1;
        chigh_copy = (chigh_copy << 1) | 1;

    }

    // e3 scaling, to make sure that there's enough space between low and high
    while ((clow_copy >= CODER_LIMIT025) & (chigh_copy < CODER_LIMIT075)){
        ++nrbits;
        clow_copy &= CODER_LIMIT025 - 1;
        chigh_copy ^= CODER_LIMIT025 + CODER_LIMIT050;
        // clow  -= CODER_LIMIT025;
        // chigh -= CODER_LIMIT025;
        clow_copy <<= 1;
        chigh_copy = (chigh_copy << 1) | 1;

    }
    clow = clow_copy;
    chigh = chigh_copy;
}

Here are the updated VTune results with this version. The new version includes the following changes:

  • Avoid one branch by using & instead of && in the last while loop (that trick did not help in the first loop).
  • Copy the class fields to local variables (see the sketch right after this list for why that helps).
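
A generic sketch of why the second change pays off (widget, counter, and log_step are made-up names, not PackJPG code): once a member has been copied into a local, the optimizer no longer has to assume that an opaque call such as write_zero() might have modified it, so the value can stay in a register instead of being reloaded from memory on every iteration.

struct widget {
    unsigned int counter;
    void log_step();       // defined in another translation unit, like write_zero() above
    void tick_direct();
    void tick_local();
};

// Works on the member directly: after every call to log_step() the compiler
// must assume counter may have been modified, so it reloads and re-stores it
// on each iteration.
void widget::tick_direct()
{
    for (int i = 0; i < 8; ++i) {
        log_step();
        counter = counter * 2 + 1;
    }
}

// Copies the member to a local first: the local cannot alias anything the
// opaque call touches, so it can live in a register for the whole loop, with
// a single load at the top and a single store at the end.
void widget::tick_local()
{
    unsigned int local = counter;
    for (int i = 0; i < 8; ++i) {
        log_step();
        local = local * 2 + 1;
    }
    counter = local;
}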

The following suggestions unfortunately did not improve performance:

  • Replacing the first while loop with a switch with goto statements.
  • Using fixed point arithmetic for the division (it created rounding errors; see the worked example after this list).
  • Doing a switch on s->scale and doing bit-shifts instead of division for even powers of 2.
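
On the fixed-point division point: the exact scheme that was tried isn't shown here, but this small worked example (constants chosen purely for illustration) shows the kind of truncation error a naive reciprocal multiply produces, which an arithmetic coder cannot tolerate because encoder and decoder must derive identical intervals.

#include <cassert>
#include <cstdint>

int main()
{
    // Naive 16-bit fixed-point reciprocal of 3; the fractional part is truncated.
    uint32_t scale  = 3;
    uint32_t recip  = 65536u / scale;        // 21845 instead of 21845.33...
    uint32_t approx = (6u * recip) >> 16;    // (6 * 21845) >> 16 == 1
    uint32_t exact  = 6u / scale;            // 2
    assert(approx != exact);                 // off by one: exactly the kind of rounding error mentioned
    return 0;
}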

@example suggested that it's not the division that's slow but the memory access for one of the operands of the division. That seems to be correct. According to VTune we are getting cache misses here quite often. Any suggestions on how to fix that?

Solution

According to VTune we are getting cache misses here quite often. Any suggestions on how to fix that?

The way you organize data directly affects performance, because data locality determines how the cache behaves. To work with the cache, a program should access memory as linearly as possible and should avoid indirect reads and writes through pointer-based data structures; that way the data it needs is far more likely to already be in the L1 cache.
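
As a generic illustration of that point (not PackJPG code), compare a linear scan over contiguous storage with the same summation done by chasing pointers; the first pattern lets cache lines and the hardware prefetcher work for you, the second can pay a cache miss per element.

#include <cstddef>
#include <vector>

struct node { unsigned int value; node* next; };

// Cache-friendly: elements are contiguous, so each 64-byte cache line brings
// in several values and the hardware prefetcher can stay ahead of the loop.
unsigned int sum_linear(const std::vector<unsigned int>& v)
{
    unsigned int sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i) sum += v[i];
    return sum;
}

// Cache-hostile: every step is a dependent load from a potentially far-away
// address, so each node can cost a full cache miss.
unsigned int sum_chased(const node* head)
{
    unsigned int sum = 0;
    for (const node* p = head; p != nullptr; p = p->next) sum += p->value;
    return sum;
}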

Looking at your code and the VTune report, it seems the most important data is the argument passed to this function: its various data members are read from memory inside it.

void aricoder::encode( symbol* s )

The function accesses the following data members of that object:

s->scale
s->high_count
s->low_count

From the VTune report we can see that the three memory accesses have different timings, which indicates that these members live at different offsets within the object. While accessing one of them (s->high_count), the read misses the L1 cache and therefore takes longer, since the data first has to be brought into the cache; s->low_count then benefits because it has been pulled into L1 alongside it. From this I can suggest the following points:

  1. Put your most frequently accessed data members in the hot zone of the object, i.e. at the front of it. That way there is a better chance that they all fit into the object's first cache line, so try to reorganize the object's memory layout according to how its members are accessed (see the sketch after this list). I assume you are not dealing with a virtual table in this object, since virtual tables are not cache-friendly.

  2. It is possible that your overall program is organized in such a way that, around this point (i.e. the execution of this function), the L1 cache is already full, so the access falls through to L2, and that transition costs extra CPU cycles (the spike you see). In that scenario I don't think there is much we can do: this is a limitation of the machine, and in a sense we would be stretching our boundaries too far by trying to deal with such low-level details.

  3. Your object s seems to be a POD type, so its members are accessed linearly. That is good and leaves little room for improvement there. However, the way the object is allocated can still affect the cache: if a fresh symbol is allocated every time, that can have an impact while executing the current function.
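
A minimal sketch of point 1, assuming symbol carries more members than the three read in encode() (the layout and the extra fields are hypothetical; PackJPG's real definition may differ). The idea is simply to move the hot members to the front and, optionally, assert that they share a cache line:

#include <cstddef>

// Hypothetical layout: the three members encode() reads come first, so they
// land in the object's first 64-byte cache line; colder bookkeeping follows.
struct symbol {
    unsigned int scale;        // hot: read on every encode()
    unsigned int low_count;    // hot
    unsigned int high_count;   // hot
    // ... colder members (statistics, bookkeeping, etc.) go below ...
};

// Compile-time check that the hot members fit within the first cache line.
static_assert(offsetof(symbol, high_count) + sizeof(unsigned int) <= 64,
              "hot members of symbol should fit in one cache line");

If symbols are created per call, allocating them together (for example in a contiguous array or a pool) would also keep consecutive encodes touching nearby memory, which speaks to point 3 as well.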

Apart from that, I think we should also refer to the following SO posts, which discuss these concepts (data cache / instruction cache) in great detail. These posts also contain excellent links with in-depth analysis and further information:

What is "cache-friendly" code?

How to write instruction cache friendly program in c++?

I suggest you try to go through these posts. They will be really helpful for understanding the internals of these concepts, even if they may not help you optimize your current piece of code. It may be that your program is already well optimized and there is very little we can do here :).
