ARM NEON: How to implement a 256-byte lookup table

Views: 34. Published: 2021/11/17 22:06:38. Tags: optimization, assembly, arm, neon


Question

I am porting some code I wrote to NEON using inline assembly.

One of the things I need is to convert byte values in the range [0..128] to other byte values, taken from a table, that span the full range [0..255].

The table is short, but the math behind it is not simple, so I don't think it is worth calculating the values "on the fly" each time. I want to try lookup tables instead.

I have used VTBL for a 32-byte case, and it works as expected.

For the full range, one idea would be to first check which sub-range each source value falls in and then do a different lookup for each (i.e., having four 32-byte lookup tables).

My question is: is there a more efficient way to do this?

EDIT

After some trials, I have done it with four lookups and, although the instructions are not yet scheduled, I am happy with the results. I leave a piece of the inline assembly here in case someone finds it useful or thinks it can be improved.

// d0 holds the original data
// d1 holds the value #32 in every byte
// d6,d7,d8,d9 hold the images for the values [0..31]

    //First we look for the 0..31 images. The values out of range will be 0
    "vtbl.u8 d2,{d6,d7,d8,d9},d0    \n\t"

    // Now we subtract #32 (held in d1) from d0 and find the images for [32..63], which have been previously loaded in d10,d11,d12,d13
    "vsub.u8 d0,d0,d1\n\t"
    "vtbl.u8 d3,{d10,d11,d12,d13},d0    \n\t"

    // Do the same to calculate the images for [64..95]
    "vsub.u8 d0,d0,d1\n\t"
    "vtbl.u8 d4,{d14,d15,d16,d17},d0    \n\t"

    // Last step: images for [96..127]
    "vsub.u8 d0,d0,d1\n\t"
    "vtbl.u8 d5,{d18,d19,d20,d21},d0    \n\t"

    // Now we add them all. No need to saturate, since at most one partial result is nonzero for each byte
    "vadd.u8 d2,d2,d3\n\t"
    "vadd.u8 d4,d4,d5\n\t"
    "vadd.u8 d2,d2,d4\n\t"   // Leave the result in d2
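
As a rough illustration of why the four partial results can simply be added, here is a scalar C model of a single byte lane. It is only a sketch: the helper names (`vtbl32_lane`, `lut128_vtbl_add`, `lut128_vtbl_add_selfcheck`) and the test table are hypothetical and not part of the original code, and the model assumes a 128-byte table split into four 32-byte chunks.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of VTBL with a four-register (32-byte) table:
   an index outside 0..31 yields 0. */
static uint8_t vtbl32_lane(const uint8_t table[32], uint8_t index)
{
    return index < 32 ? table[index] : 0;
}

/* Look up one byte in a 128-byte table using four 32-byte lookups
   followed by a sum of the partial results. */
static uint8_t lut128_vtbl_add(const uint8_t lut[128], uint8_t value)
{
    uint8_t index = value;
    uint8_t sum = 0;
    for (int chunk = 0; chunk < 4; ++chunk) {
        /* At most one chunk sees an in-range index; the others add 0. */
        sum = (uint8_t)(sum + vtbl32_lane(lut + 32 * chunk, index));
        index = (uint8_t)(index - 32); /* wraps modulo 256, like vsub.u8 */
    }
    return sum;
}

/* Compares the chunked lookup with a direct lookup for every input in
   0..127, using arbitrary (hypothetical) table contents. Returns 1. */
static int lut128_vtbl_add_selfcheck(void)
{
    uint8_t lut[128];
    for (int i = 0; i < 128; ++i)
        lut[i] = (uint8_t)(255 - 2 * i + (i & 1)); /* arbitrary test data */
    for (int v = 0; v < 128; ++v)
        assert(lut128_vtbl_add(lut, (uint8_t)v) == lut[v]);
    return 1;
}
```

In this model, an index that wraps around after the subtraction always lands at 160 or above, so it never falls back into 0..31 and the other three lookups contribute 0.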

Recommended answer

The proper sequence is:

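// d1 holds the indices; d31 is assumed to hold 32 in every byte
vtbl.8  d0, { d2,d3,d4,d5 }, d1       // first value
vsub.i8 d1, d1, d31                   // decrement index
vtbx.8  d0, { d6,d7,d8,d9 }, d1       // all the subsequent values
vsub.i8 d1, d1, d31                   // decrement index
vtbx.8  d0, { d10,d11,d12,d13 }, d1   // d10..d13 = q5,q6
vsub.i8 d1, d1, d31
vtbx.8  d0, { d14,d15,d16,d17 }, d1   // d14..d17 = q7,q8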

The difference between vtbl and vtbx is that vtbl zeroes the destination element in d0 when the corresponding index in d1 is out of range (>= 32 for a four-register table), whereas vtbx leaves the original value in d0 intact. Thus there's no need for the trickery as in my comment and no need to merge the partial values.
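
To make the difference concrete, the following scalar C sketch models one byte lane of each instruction and chains them the way the sequence above does. The function names and the test table are hypothetical, chosen only for illustration; the model assumes four-register (32-byte) tables and a 128-byte lookup table overall.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of VTBL with a 32-byte table: out-of-range index gives 0. */
static uint8_t vtbl32_lane(const uint8_t table[32], uint8_t index)
{
    return index < 32 ? table[index] : 0;
}

/* One lane of VTBX with a 32-byte table: out-of-range index keeps dest. */
static uint8_t vtbx32_lane(uint8_t dest, const uint8_t table[32], uint8_t index)
{
    return index < 32 ? table[index] : dest;
}

/* 128-byte lookup modeled on the answer: one VTBL, then three VTBX steps,
   subtracting 32 from the index between steps. No merging is needed. */
static uint8_t lut128_vtbl_vtbx(const uint8_t lut[128], uint8_t value)
{
    uint8_t index = value;
    uint8_t result = vtbl32_lane(lut, index);
    for (int chunk = 1; chunk < 4; ++chunk) {
        index = (uint8_t)(index - 32); /* wraps modulo 256, like vsub.i8 */
        result = vtbx32_lane(result, lut + 32 * chunk, index);
    }
    return result;
}

/* Checks the chain against a direct lookup for 0..127, and checks that
   inputs 128..255 yield 0 in this model. Returns 1 when done. */
static int lut128_vtbl_vtbx_selfcheck(void)
{
    uint8_t lut[128];
    for (int i = 0; i < 128; ++i)
        lut[i] = (uint8_t)(i * 7 + 3); /* arbitrary test data */
    for (int v = 0; v < 128; ++v)
        assert(lut128_vtbl_vtbx(lut, (uint8_t)v) == lut[v]);
    for (int v = 128; v < 256; ++v)
        assert(lut128_vtbl_vtbx(lut, (uint8_t)v) == 0);
    return 1;
}
```

Because the first step is a vtbl, any input that matches none of the four chunks ends up as 0 in this model; the vtbx steps only overwrite a lane when its index falls in their own 32-byte chunk.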
