How would you transpose a binary matrix?

Views: 149 | Published: 2020/5/6 10:38:10 | Tags: c++ math matrix binary transpose

Problem description

I have binary matrices in C++ that I represent with a vector of 8-bit values.

For example, the following matrix:

1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1 1 1

is represented as:

const uint8_t matrix[] = {
    0b01010101,
    0b00110011,
    0b00001111,
};

I do it this way because computing the product of such a matrix and an 8-bit vector then becomes simple and efficient: just one bitwise AND and one parity computation per row, which is much better than calculating each bit individually.
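To make the AND-plus-parity step concrete, here is a minimal sketch of the product over GF(2). The names parity8 and multiply_gf2 are hypothetical, and the result layout (first row in the highest used bit, at most 32 rows) is an assumption chosen to match the MSB-first convention of the example above.

```cpp
#include <cstddef>
#include <cstdint>

// Parity of an 8-bit value via XOR folding: 1 if an odd number of bits is set.
inline unsigned parity8(unsigned v) {
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

// Multiplies a binary matrix (one uint8_t per row) by an 8-bit vector over GF(2).
// Each output bit is parity(row AND vec). Assumption: rows <= 32, and the
// first row ends up in the most significant used bit of the result.
inline uint32_t multiply_gf2(const uint8_t* matrix, std::size_t rows, uint8_t vec) {
    uint32_t result = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        result = (result << 1) | parity8(matrix[r] & vec);
    }
    return result;
}
```

With the three-row matrix above, a vector with only the lowest bit set selects the last column, which is all ones.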

I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without computing each bit manually.

To clarify, for the example above I'd like the transposition to produce the following result:

const uint8_t transposed[] = {
    0b00000000,
    0b00000100,
    0b00000010,
    0b00000110,
    0b00000001,
    0b00000101,
    0b00000011,
    0b00000111,
};
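For reference, the bit-by-bit approach that the question wants to avoid can be sketched as follows. It is only a baseline that pins down the bit convention of the expected result; transpose_naive is a hypothetical name, and the limit of at most 8 input rows is an assumption made so that each output row fits in a uint8_t.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive reference transpose for up to 8 rows of 8 columns each.
// Output row j collects column j (counted from the MSB of the input rows);
// input row r lands in bit (n - 1 - r), so the first row is the highest used bit.
inline std::vector<uint8_t> transpose_naive(const std::vector<uint8_t>& rows) {
    const std::size_t n = rows.size();  // assumption: n <= 8
    std::vector<uint8_t> out(8, 0);
    for (std::size_t j = 0; j < 8; ++j) {
        for (std::size_t r = 0; r < n; ++r) {
            const unsigned bit = (rows[r] >> (7 - j)) & 1u;
            out[j] = static_cast<uint8_t>(out[j] | (bit << (n - 1 - r)));
        }
    }
    return out;
}
```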

NOTE: I would prefer an algorithm that works for arbitrarily sized matrices, but I am also interested in algorithms that can only handle certain sizes.

Recommended answer

I've spent more time looking for a solution, and I've found some good ones.

On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions, it is possible to process a 16×8 matrix.

This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've received so far for this question.

The idea is simple:

1. #include <emmintrin.h>
2. Pack 16 uint8_t variables into an __m128i
3. Use _mm_movemask_epi8 to get the MSB of each byte, producing a uint16_t
4. Use _mm_slli_epi64 to shift the 128-bit register left by one
5. Repeat until you've got all 8 uint16_ts
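The steps above can be sketched roughly as follows. This is not the code from the blog post, only a minimal illustration: transpose16x8 and the HAVE_SSE2_SKETCH macro are hypothetical names, and the scalar branch is an assumed fallback so that the snippet also builds on targets without SSE2. Note that _mm_movemask_epi8 places row 0 in bit 0 of each result, so the output is LSB-first with respect to rows.

```cpp
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_SKETCH 1  // hypothetical macro, local to this sketch
#endif

// Transposes a 16x8 bit block. rows[i] is row i (MSB = column 0).
// cols[j] receives column j, with row i stored in bit i.
inline void transpose16x8(const uint8_t rows[16], uint16_t cols[8]) {
#ifdef HAVE_SSE2_SKETCH
    // Pack the 16 row bytes into one 128-bit register.
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(rows));
    for (int j = 0; j < 8; ++j) {
        // Collect the MSB of every byte: one full column per call.
        cols[j] = static_cast<uint16_t>(_mm_movemask_epi8(v));
        // Move the next lower bit of every byte into the MSB position.
        // Bits leaking in from the neighbouring byte only reach the low
        // bits, which are never read within 8 iterations.
        v = _mm_slli_epi64(v, 1);
    }
#else
    // Assumed scalar fallback with the same output layout.
    for (int j = 0; j < 8; ++j) {
        unsigned c = 0;
        for (int i = 0; i < 16; ++i) {
            c |= ((rows[i] >> (7 - j)) & 1u) << i;
        }
        cols[j] = static_cast<uint16_t>(c);
    }
#endif
}
```

If the rows should end up MSB-first instead, one option is to load them in reverse order before calling the function.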

Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just find the NEON equivalents, but the Cortex-M CPU (unlike the Cortex-A) does not have SIMD capabilities, so NEON isn't too useful for me at the moment.

NOTE: Because the Cortex-M doesn't have native 64-bit arithmetic, I could not use the ideas from the answers that suggest treating an 8x8 block as a uint64_t. Most microcontrollers with a Cortex-M CPU also don't have much memory, so I prefer to do all this without a lookup table.

After some thinking, the same algorithm can be implemented using plain 32-bit arithmetic and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a colleague, and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number to multiply by such that the MSBs of the four bytes end up next to each other in the upper 32 bits of the result.

1. Pack 4 uint8_ts in a 32-bit variable
2. Mask the first bit (the MSB) of each byte (using 0x80808080)
3. Multiply it by 0x02040810
4. Take the 4 LSBs of the upper 32 bits of the product

Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply by the magic number shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants, so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as for the first series of 4 bits.
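A minimal sketch of the multiplication trick follows, using hypothetical names (gather_msbs, transpose4x8). The masked bits sit at positions 7, 15, 23 and 31, and the constant has bits 4, 11, 18 and 25 set, so the partial products 7+25, 15+18, 23+11 and 31+4 land on bits 32 to 35 while all other partial products fall on distinct bits elsewhere and cause no carries. Note that this sketch shifts the input left instead of shifting the mask and the constant; that is a swapped-in variant, not the answer's exact scheme, chosen because it sidesteps the edge case with the last bit. The uint64_t cast is only there to reach the upper half of the product; whether it compiles to a single widening multiply depends on the target.

```cpp
#include <cstdint>

// Gathers the MSB of each of the four bytes of x into the low 4 bits of the
// result: bit k of the result is the MSB of byte k (byte 0 = least significant).
inline uint32_t gather_msbs(uint32_t x) {
    const uint32_t masked = x & 0x80808080u;  // keep bits 7, 15, 23, 31
    const uint64_t product = static_cast<uint64_t>(masked) * 0x02040810u;
    return static_cast<uint32_t>(product >> 32) & 0xFu;  // bits 32..35
}

// Transposes a 4x8 block. cols[j] receives column j (counted from the MSB),
// with row k stored in bit k.
inline void transpose4x8(const uint8_t rows[4], uint8_t cols[8]) {
    const uint32_t packed = static_cast<uint32_t>(rows[0])
                          | (static_cast<uint32_t>(rows[1]) << 8)
                          | (static_cast<uint32_t>(rows[2]) << 16)
                          | (static_cast<uint32_t>(rows[3]) << 24);
    for (int j = 0; j < 8; ++j) {
        // Shifting left by j moves bit (7 - j) of every byte into its MSB.
        cols[j] = static_cast<uint8_t>(gather_msbs(packed << j));
    }
}
```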

If you do this with two 4×8 blocks, you get an 8x8 block done, and you can arrange the resulting bits so that everything goes into the right place.
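One possible way to combine two 4×8 blocks into an 8x8 transpose is sketched below. The helper names (gather_msbs, pack_reversed, transpose8x8) are hypothetical, and the bit arrangement is an assumption: rows are packed in reverse order so that the first row ends up in the most significant bit of each output byte, matching the MSB-first convention of the question. As in any such sketch, the input is shifted rather than the mask and constant.

```cpp
#include <cstdint>

// Bit k of the result is the MSB of byte k of x (byte 0 = least significant).
inline uint32_t gather_msbs(uint32_t x) {
    const uint64_t product =
        static_cast<uint64_t>(x & 0x80808080u) * 0x02040810u;
    return static_cast<uint32_t>(product >> 32) & 0xFu;
}

// Packs four rows so that r[0] occupies the most significant byte.
inline uint32_t pack_reversed(const uint8_t* r) {
    return static_cast<uint32_t>(r[3])
         | (static_cast<uint32_t>(r[2]) << 8)
         | (static_cast<uint32_t>(r[1]) << 16)
         | (static_cast<uint32_t>(r[0]) << 24);
}

// 8x8 transpose, MSB-first: bit (7 - r) of cols[j] equals bit (7 - j) of rows[r].
// rows and cols must not alias.
inline void transpose8x8(const uint8_t rows[8], uint8_t cols[8]) {
    const uint32_t top = pack_reversed(rows);         // rows 0..3
    const uint32_t bottom = pack_reversed(rows + 4);  // rows 4..7
    for (int j = 0; j < 8; ++j) {
        const uint32_t hi = gather_msbs(top << j);     // rows 0..3 in bits 3..0
        const uint32_t lo = gather_msbs(bottom << j);  // rows 4..7 in bits 3..0
        cols[j] = static_cast<uint8_t>((hi << 4) | lo);
    }
}
```

Applying transpose8x8 twice returns the original block, which makes for a simple sanity check.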
