C语言中的二进制向量和矩阵操作 [英] Binary vectors and matrix manipulation in C

查看:171
本文介绍了C语言中的二进制向量和矩阵操作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在C中实现一个数据结构,该数据结构将使我能够有效地操纵二进制矩阵(仅包含1或0).我将说明我必须对该矩阵执行哪些操作,并想知道什么是最好的数据结构?

I'm trying to implement in C a data structure that would allow me to manipulate efficiently a**binary** matrix (containing only 1 or 0). I'll explain what operations I have to apply to this matrix, and would like to know what's the best possible data structure to use ?

这些操作在字段F_2中完成(这意味着1 + 1 = 0,其他操作保持不变).我有一个名为Hk * n矩阵(k< n).最多k = 2325和n =3009.

The operations are done in the field F_2 (which means 1+1 = 0 the other operations remain unchanged). I have one k*n matrix (k < n) called H. At most, k = 2325 and n = 3009.

我将要在此矩阵上进行的操作是:

我将仅使用行交换和行添加来部分对角线化.完成后,我将不再使用行操作,并且将在此矩阵上添加许多(!)列(我的意思是很多"大约等于((nk)/2 )³列添加项)

I will partially diagonalize it using only row swap and row additions. Once it is done, I will not use anymore row operations and will operate a lot (!) of columns additions over this matrix (what I mean by "a lot" is about ((n-k)/2)³ columns additions)

我正在考虑用于矩阵的数据结构:

对于矩阵系数,我正在考虑一次将多个位的序列存储在一个单个无符号int中.例如,我可以将序列(11001011)存储到uint8_t 203(从二进制转换为十进制)

For the matrix coefficient, I was thinking about storing sequences of multiple bits at once in one single unsigned int. For instance, I could store the sequence (11001011) to the uint8_t 203 (converting from binary to decimal)

  • 这是个好主意吗?

如果这样做,我有两个选择:

If I do so, I have two options :

我可以使用uint16_tuint64_t系数将矩阵H拆分为许多4 * 4或8 * 8子矩阵.

I could use uint16_t or uint64_t coefficients to split my matrix H in many 4*4 or 8*8 submatrices.

  • 这是一个好选择(就时间效率而言),如果是,那么使用uint16_tuint64_t更好吗?
  • Is this a good option (in terms of time efficiency) and if it is, is it better to use uint16_t or uint64_t ?

否则,我正在考虑将每一行存储在多个uint32_tuint64_t中,然后进行部分对角线化.接下来切换到将矩阵编码为 n列向量的结构,以处理剩余的操作.

Else I was thinking about storing every row in multiple uint32_t or uint64_t, then operate my partial diagonalization. Next switch to a structure that would code the matrix as n columns vectors to handle the remaining operations.

  • 您认为这样更有效吗?

无论使用哪种方法,我都必须有效地访问无符号整数(uint163264)的第n位.我该怎么办?

Whatever method I use, I will have to efficiently access the n'th bit of an unsigned int (uint16, 32 or 64). How do I do that ?

推荐答案

为获得最佳性能,请使用行指针数组进行行交换和行添加.使用<stdint.h>和最小支持字长的快速无符号整数类型-我建议使用uint_fast32_t,除非您打算在16位或8位处理器上运行它.

For best performance, use an array of row pointers for the row swap and row additions. Use <stdint.h>, and a fast unsigned integer type of minimum supported word size -- I recommend either uint_fast32_t, unless you intend to run this on 16- or 8-bit processors.

完成所有的行交换和行添加后,转置数组.尽管此操作慢",但随后的列操作将如此快以抵消转置成本.

When all the row swap and row additions are done, transpose the array. Although this operation is "slow", the following column operations will be so fast to offset the transpose cost.

请考虑以下内容:

#include <stdint.h>
#include <limits.h>

typedef uint_fast32_t  word;
#define WORD_BITS      (CHAR_BIT * sizeof (word))

typedef struct {
    int    rows;  /* Number of row vectors */
    int    cols;  /* Number of defined bits in each row */
    int    words; /* Number of words per row vector */
    word **row;   /* Array of pointers */
} row_matrix;

typedef struct {
    int    rows;  /* Number of defined bits in each column */
    int    cols;  /* Number of col vectors */
    int    words; /* Number of words per col vector */
    word **col;
} col_matrix;

尽管您可以使用单个类型来描述两种矩阵形式,但是使用单独的类型可以使代码和函数更易于维护.您最终将获得一些重复的代码,但是与拥有清晰,直观的类型相比,这是一个小问题.

Although you could use a single type to describe the two matrix forms, using separate types makes the code and functions easier to maintain. You'll end up having some duplicated code, but that's a minor issue compared to having clear, intuitive types.

在32位系统上,uint_fast32_t通常是32位类型.在64位系统上,通常为64位. WORD_BITS宏扩展为word中的位数-并不总是32!

On 32-bit systems, uint_fast32_t is typically a 32-bit type. On 64-bit systems, it is typically 64-bit. The WORD_BITS macro expands to the number of bits in a word -- it is NOT always 32!

对位进行编号的最简单方法是将矩阵中最左边的位指定为位0,并将这些位存储在每个字的最低有效位中.如果您有row_matrix *rm,则行row中列col的位是

The easiest way to number the bits is to designate the leftmost bit in the matrix as bit 0, and store the bits in the least significant bits in each word. If you have row_matrix *rm, then the bit at row row, column col is

!!(rm->row[row][col / WORD_BITS] & ((word)1U << (col % WORD_BITS)))

!!是非运算符:如果参数为非零,则产生1,否则产生0.因为我们从单词中屏蔽了一位,所以位已设置"值将为某些值2的幂(1、2、4、8、16、32、64等).

The !! is the not-not operator: if the argument is nonzero, it yields 1, otherwise it yields 0. Because we mask a single bit from the word, the "bit is set" value would otherwise be some power of two (1, 2, 4, 8, 16, 32, 64, and so on).

要设置该位,请使用

rm->row[row][col / WORD_BITS] |= (word)1U << (col % WORD_BITS);

要清除一点,您需要使用除目标位1以外的所有掩码进行二进制与.这可以使用not运算符~轻松实现:

To clear a bit, you need to binary-AND with a mask having all except the target bit 1. This is easy to achieve using the not operator ~:

rm->row[row][col / WORD_BITS] &= ~((word)1U << (col % WORD_BITS));

col_matrix *cm的相应操作是

!!(cm->col[col][row / WORD_BITS] & ((word)1U << (row % WORD_BITS)))
cm->col[col][row / WORD_BITS] |= (word)1U << (row % WORD_BITS);
cm->col[col][row / WORD_BITS] &= ~((word)1U << (row % WORD_BITS));

尽管除数/和模数(或余数)%通常较慢(与加法,减法,甚至是乘法相比),此处WORD_BITS将是两个编译时常数的幂所有广泛使用的架构.我所知道的所有编译器都可以将以上内容转换为快速移位和二进制与运算符.

Although the division / and modulus (or remainder) % are generally slow-ish (compared to addition, subtraction, and even multiplication), here WORD_BITS will be a power of two compile-time constant on all widely used architectures. All compilers I know of will turn the above into fast bit shifts and binary-AND operators.

要将第srcrow行添加到第dstrow行,您只需对所有单词执行二进制异或运算即可:

To add row srcrow to row dstrow, you simply do a binary exclusive-or on all the words:

{
    const word *const src = rm->row[srcrow];
    word *const       dst = rm->row[dstrow];
    int               w = rm->words;
    while (w-->0)
        dst[w] ^= src[w];
}

类似地,对于列矩阵,

{
    const word *const src = cm->col[srccol];
    word *const       dst = cm->col[dstcol];
    int               w = cm->words;
    while (w-->0)
        dst[w] ^= src[w];
}

请注意,如果合并多于两行,则可以非常高效地完成;比连续进行添加要快得多. Intel和AMD CPU非常擅长预测上述模式,因此您可以使用多个源行/列.同样,目的地也不必参与结果,尽管如果我正确地猜到您要实现的算法,我想您也可以.

Note that if you combine more than two rows, you can do so very efficiently; it'll be MUCH faster than doing the additions consecutively. Intel and AMD CPUs are very good at predicting the above pattern, so you can just use more than one source row/column. Also, the destination does not have to participate in the result, although if I guess correctly what algorithm you're implementing, I guess you want it to.

如果您知道目标体系结构具有SSE2或更高版本,甚至是AVX,则可以将emmintrin.himmintrin.h头文件分别用于编译器内置类型和运算符,以使您可以对128位和一次分别为256位;有时会给您很大的帮助.

If you know the target architectures have SSE2 or better, or even AVX, you can use the emmintrin.h or immintrin.h header files, respectively, for compiler built-in types and operators that allow you to XOR 128 bits and 256 bits, respectively, at once; sometimes giving you quite a bit of a boost.

由于向量类型需要C标准所说的过度对齐",因此您还需要包括mm_malloc.h,并使用_mm_malloc()_mm_free()数据-并且显然将words向上取整,因此您可以将行/列作为合适的整数词类型(对于SSE *为__m128i,对于AVX为__m256i)进行访问.

Since the vector types require what the C standards call "excess alignment", you'll then also need to include mm_malloc.h, and use _mm_malloc() and _mm_free() to allocate the row/column vectors for the word data -- and obviously round words up so you can access the row/column as a suitable integer word type (__m128i for SSE*, __m256i for AVX).

就我个人而言,我总是先实现 unvectorized 版本,然后再实现一些讨厌的"测试用例进行单元测试,然后才能实现矢量化.这样做的好处是,您可以将非向量化版本作为使用它的用户的初步版本,并且可以比较向量化和非向量化案例之间的测试用例结果,以查看其中一个是否存在错误.

Personally, I always implement the unvectorized version first, then some "nasty" test cases for unit testing, and only then see about vectorizing it. This has the benefit that you can give the unvectorized version as a preliminary version for those who will be using it, and you can compare the test case results between the vectorized and unvectorized cases, to see if one or the other has a bug in it.

转置操作非常简单,尽管我确实建议使用三重循环:在单个字中的位的最内层循环.另外,您可能想要检查哪种顺序(行或列主音)对外部循环效果最好;根据矩阵大小,您可能会看到巨大的差异. (这是由于缓存行为所致:您希望CPU能够预测访问模式,而不必重新加载相同的缓存行.在最好的情况下,使用年限最短的AMD和Intel x86- 64个处理器,如果两个矩阵都适合高速缓存,则可以获得接近高速缓存的速度.)

The transpose operation is quite straightforward, although I do recommend using a triple loop: innermost looping over the bits in a single word. Also, you'll probably want to check which order -- row or column major -- works best for the outer loop; depending on the matrix size, you may see a huge difference. (This is due to cache behaviour: you want the CPU to be able to predict the access pattern, and not have to reload the same cachelines. In the best case, on at-most-few-years-old AMD and Intel x86-64 processors, you can get near cache speeds if both matrixes fit in the cache.)

以上所有内容都可以在单个头文件中实现-如果目标体系结构支持SSE2/AVX,则甚至可以包括矢量化版本-因此实现起来应该不会太困难.

All of the above can be implemented in a single header file -- even including the vectorized versions if the target architecture supports SSE2/AVX --, so it should not be too difficult to implement.

问题?

这篇关于C语言中的二进制向量和矩阵操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆