程序集处理三角矩阵存储器的算法 [英] algorithm of addressing a triangle matrices memory using assembly

查看:100
本文介绍了程序集处理三角矩阵存储器的算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我当时在ASM中使用NASM做一个有关帕斯卡三角形的项目

I was doing a project in ASM about pascal triangle using NASM

因此在项目中,您需要计算从0行到63行的帕斯卡三角形

so in the project you need to calculate pascal triangle from line 0 to line 63

我的第一个问题是将计算结果存储在哪里->内存

my first problem is where to store the results of calculation -> memory

第二个问题,我在内存中使用哪种存储类型,以了解我的意思,我有3种方法第一声明一个完整矩阵,因此将是这种方式

second problem what type of storage I use in memory, to understand what I mean I have 3 way first declare a full matrices so will be like this way

memoryLabl:  resd 63*64     ; 63 rows of 64 columns each

但是这样的问题是没有使用一半的矩阵,这使我的程序无效,所以让我们开始第二种方法可用

but the problem in this way that half of matrices is not used that make my program not efficient so let's go the second method is available

为每行声明一个内存标签 例如:

which is declare for every line a label for memory for example :

line0: dd   1
line1: dd  1,1
line2: dd 1,2,1      ; with pre-filled data for example purposes
...
line63: resd 64      ; reserve space for 64 dword entries

这种操作方式就像手工操作一样,

this way of doing it is like do it by hand,

该类中的其他人尝试使用宏,因为您可以在此处 但我不明白

some other from the class try to use macro as you can see here but i don't get it

到目前为止很好

让我们转到我用过的最后一个 就像第一个一样,但是我使用了三角形矩阵,那是怎么回事, 通过仅声明我需要的内存量

let's go to the last one that i have used which is like the first one but i use a triangle matrices , how is that, by declaring only the amount of memory that i need

所以要存储0到63帕斯卡三角形的行,这给了我一个三角形矩阵,因为我每增加一行都会添加一个单元格

so to store line 0 to line 63 line of pascal triangle, it's give me a triangle matrices because every new line I add a cell

我已经为三角形矩阵分配了2080 dword,这是怎么回事? 用2080 dword解释:

I have allocate 2080 dword for the triangle matrices how is that ?? explain by 2080 dword:

                okey we have line0 have 1 dword /* 1 number in first line */
                             line1 have 2 dword /* 2 numbers in second line */
                             line2 have 3 dword /* 3 numbers in third line */
                             ...
                             line63 have 64 dword /* 64 numbers in final line*/
                  so in the end we have 2080   as the sum of them 

我给每个数字1个双字

okey现在我们已经创建了用于存储结果的内存,让我们开始计算

okey now we have create the memory to store results let's start calculation

第一个#在帕斯卡三角形中,您在第0行中的所有单元格都具有值

first# in pascal triangle you have all the cells in row 0 have value 1

我将用伪代码完成此操作,以便您了解如何在第0行的所有单元格中放入一个:

I will do it in pseudo code so you understand how I put one in all cells of row 0:

s=0
for(i=0;i<64;i++):
     s = s+i
     mov dword[x+s*4],1   /* x is addresses of  triangle matrices */

帕斯卡三角形的第二部分

每行的最后一行等于1

second part in pascal triangle is to have the last row of each line equal to 1

我将使用伪代码使其变得简单

I will use pseudo code to make it simple

s=0
for(i=2;i<64;i++):
     s = s+i
     mov dword[x+s*4],1

我从等于2的i开始,因为i = 0(i = 1)是line0(line1),line0(line1)已满,因为仅持有一个(拖曳)值,如我在上面的解释中所述

I start from i equal to 2 because i = 0 (i=1) is line0 (line1) and line0 (line1)is full because is hold only one (tow) value as I say in above explanation

因此,两个伪代码将使我的矩形看起来像在内存中:

so the tow pseudo code will make my rectangle look like in memory :

1
1  1
1      1
1          1
1             1
1                 1
1                     1
1                         1
1                              1
1                                 1
...
1                                        1

现在最困难的部分是使用三角形中的该值填充所有三角形单元格的计算

now come the hard part is the calculation using this value in triangle to fill all the triangle cells

让我们从这里开始构想

let's take cell[line][row]

we have cell[2][1] = cell[1][0]+cell[1][1]
and     cell[3][1]= cell[2][0]+cell[2][1]
        cell[3][2]= cell[2][1]+cell[2][2]

in **general** we have 
        cell[line][row]= cell[line-1][row-1]+cell[line-1][row]

我的问题,因为我有一个

很奇怪的三角形矩阵,谁能用一种关系或非常基本的伪代码或asm代码来帮助打破它?

triangle matrices which weird to work with can any one help me to break it using a relation or very basic pseudo code or asm code ?

推荐答案

TL:DR:您只需要顺序遍历数组,因此不必计算索引.请参阅第二部分.

TL:DR: you just need to traverse the array sequentially, so you don't have to work out the indexing. See the 2nd section.

要随机访问索引到(下部)三角矩阵 ,行r在大小为r-1的三角形之后开始.大小为n的三角形使用 Gauss的元素总数为n*(n+1)/2从1到n-1的数字总和的公式.因此,大小为r-1的三角形具有(r-1)*r/2元素.一旦我们知道了行开始的地址,在索引行中的列当然就很简单了.

To random access index into a (lower) triangular matrix, row r starts after a triangle of size r-1. A triangle of size n has n*(n+1)/2 total elements, using Gauss's formula for the sum of numbers from 1 to n-1. So a triangle of size r-1 has (r-1)*r/2 elements. Indexing a column within a row is of course trivial, once we know the address of the start of a row.

每个DWORD元素的宽度为4个字节,我们可以在进行乘法时将其缩放,因为

Each DWORD element is 4 bytes wide, and we can take care of that scaling as part of the multiply, because lea lets us shift and add as well as put the result in a different register. We simplify n*(n-1)/2 elements * 4 bytes / elem to n*(n-1) * 2 bytes.

以上推理适用于基于1的索引,其中第1行包含1个元素.如果要在计算之前在行索引上加1,则需要对此进行调整,以便从零开始,因此我们需要一个三角形的大小 与r+1 - 1行,因此为r*(r+1)/2 * 4 bytes.有助于将线性数组索引放到一个三角形中,以便快速仔细检查公式

The above reasoning works for 1-based indexing, where row 1 has 1 element. We have to adjust for that if we want zero-based indexing by adding 1 to row indices before the calculation, so we want the size of a triangle with r+1 - 1 rows, thus r*(r+1)/2 * 4 bytes. It helps to put the linear array index into a triangle to quickly double-check the formula

 0
 4   8
12  16  20
24  28  32  36
40  44  48  52  56
60  64  68  72  76  80
84  88  92  96 100  104  108

第四行(我们称为行3")从整个数组的开头开始24个字节.那是(3+1)*(3+1-1) * 2 = (3+1)*3 * 2;是的,r*(r+1)/2公式有效.

The 4th row, which we're calling "row 3", starts 24 bytes from the start of the whole array. That's (3+1)*(3+1-1) * 2 = (3+1)*3 * 2; yes the r*(r+1)/2 formula works.

;; given a row number in EDI, and column in ESI (zero-extended into RSI)
;; load triangle[row][col] into eax

lea    ecx, [2*rdi + 2]
imul   ecx, edi                        ; ecx = r*(r+1) * 2 bytes

mov    eax, [triangle + rcx + rsi*4]

假设32位绝对寻址是可以的( x86寻址模式可以当其中之一为常数时,仅包含3个分量.但是,对于您的静态triangle来说,是 情况,因此我们可以通过对列使用缩放索引,将base作为我们计算出的行偏移量并将实际数组地址作为对位移.

This assuming 32-bit absolute addressing is ok (32-bit absolute addresses no longer allowed in x86-64 Linux?). If not, use a RIP-relative LEA to get the triangle base address in a register, and add that to rsi*4. x86 addressing modes can only have 3 components when one of them is a constant. But that is the case here for your static triangle, so we can take full advantage by using a scaled index for the column, and base as our calculated row offset, and the actual array address as the displacement.

这里的窍门是,您只需要按顺序遍历即可;您不需要随机访问给定的行/列.

The trick here is that you only need to loop over it sequentially; you don't need random access to a given row/column.

您在写下面一行的同时阅读了一行. 当您到达一行的末尾时,下一个元素是下一行的开始.当您向下一行时,源指针和目标指针将彼此越来越远,因为目的地始终是整整1行.而且您知道行的长度=行号,因此实际上可以使用行计数器作为偏移量.

You read one row while writing the one below. When you get to the end of a row, the next element is the start of the next row. The source and destination pointers will get farther and farther from each other as you go down the rows, because the destination is always 1 whole row ahead. And you know the length of a row = row number, so you can actually use the row counter as the offset.

global _start
_start:
    mov  esi, triangle         ; src = address of triangle[0,0]
    lea  rdi, [rsi+4]          ; dst = address of triangle[1,0]

    mov  dword [rsi], 1      ; triangle[0,0] = 1  special case: there is no source
.pascal_row:                   ; do {
    mov  rcx, rdi               ; RCX = one-past-end of src row = start of dst row
    xor  eax, eax               ; EAX = triangle[row-1][col-1] = 0 for first iteration
    ;; RSI points to start of src row: triangle[row-1][0]
    ;; RDI points to start of dst row: triangle[row  ][0]
  .column:
     mov   edx, [rsi]           ; tri[r-1, c]           ; will load 1 on the first iteration
     add   eax, edx             ; eax = tri[r-1, c-1] + tri[r-1, c]
     mov  [rdi], eax            ; store to triangle[row, col]

     add  rdi, 4                ; ++dst
     add  rsi, 4                ; ++src
     mov  eax, edx              ; becomes col-1 src value for next iteration

     cmp  rsi, rcx
     jb   .column              ; }while(src < end_src)

    ;; RSI points to one-past-end of src row, i.e. start of next row = src for next iteration
    ;; RDI points to last element of dst row  (because dst row is 1 element longer than src row)

    mov  dword [rdi], 1        ;  [r,r] = 1   end of a row
    add  rdi, 4                ;  this is where dst-src distance grows each iteration

    cmp  rdi, end_triangle
    jb  .pascal_row

       ;;; triangle is constructed.  Set a breakpoint here to look at it with a debugger

    xor   edi,edi
    mov   eax, 231
    syscall               ; Linux sys_exit_group(0), 64-bit ABI



section .bss

; you could just as well use  resd 64*65/2
; but put a label on each row for debugging convenience.

ALIGN 16
triangle:
%assign i 0
%rep    64
    row %+ i:  resd  i + 1
%assign i i+1
%endrep
end_triangle:

我对此进行了测试,它的工作原理是:纠正内存中的值,并将其停在正确的位置.但是请注意,整数溢出会在您到达最后一行之前发生.如果使用64位整数(可以简单地更改寄存器名称和偏移量,并且不要忘记resdresq),可以避免这种情况. 64选择32是1832624140942590534 = 2 ^ 60.66.

I tested this and it works: correct values in memory, and it stops at the right place. But note that integer overflow happens before you get down to the last row. This would be avoided if you used 64-bit integers (simple change to register names and offsets, and don't forget resd to resq). 64 choose 32 is 1832624140942590534 = 2^60.66.

%rep块用于保留空间并将每一行标记为row0row1等,来自

The %rep block to reserve space and label each row as row0, row1, etc. is from my answer to the question you linked about macros, much more sane than the other answer IMO.

您标记了此NASM,所以这就是我所使用的,因为我对此很熟悉.您在问题中使用的语法是MASM(直到最后一次编辑). MASM中的主要逻辑是相同的,但是请记住,您需要使用OFFSET三角形来立即获取地址,而不是从地址中加载.

You tagged this NASM, so that's what I used because I'm familiar with it. The syntax you used in your question was MASM (until the last edit). The main logic is the same in MASM, but remember that you need OFFSET triangle to get the address as an immediate, instead of loading from it.

我使用x86-64是因为32位已过时,但是我避免了太多寄存器,因此如果需要,您可以轻松地将其移植到32位.如果将其放入函数而不是独立程序中,请不要忘记保存/恢复调用保留的寄存器.

I used x86-64 because 32-bit is obsolete, but I avoided too many registers, so you can easily port this to 32-bit if needed. Don't forget to save/restore call-preserved registers if you put this in a function instead of a stand-alone program.

展开内部循环可以节省一些指令,用于在其周围复制寄存器,以及减少循环开销.这是一个经过优化的实现,但是我主要将其限制为使代码更简单以及更小/更快的优化. (除了使用指针增量而不是索引.)花了一些时间使它简洁明了. :P

Unrolling the inner loop could save some instructions copying registers around, as well as the loop overhead. This is a somewhat optimized implementation, but I mostly limited it to optimizations that make the code simpler as well as smaller / faster. (Except maybe for using pointer increments instead of indexing.) It took a while to make it this clean and simple. :P

在不同的CPU上,执行数组索引的不同方法会更快.例如对于内部循环中的负载,也许使用索引寻址模式(相对于dst),因此仅需要一个指针增量.但是,如果您希望它快速运行,则SSE2或AVX2 vpaddd 可能做个好人.使用palignr进行改组可能会很有用,但也可能是未对齐的负载,而不是某些改组,尤其是使用AVX2或AVX512时.

Different ways of doing the array indexing would be faster on different CPUs. e.g. perhaps use an indexed addressing mode (relative to dst) for the loads in the inner loop, so only one pointer increment is needed. But if you want it to run fast, SSE2 or AVX2 vpaddd could be good. Shuffling with palignr might be useful, but probably also unaligned loads instead of some of the shuffling, especially with AVX2 or AVX512.

但是无论如何,这是我的版本;我并不是想按照自己的方式来编写它,而是需要为任务分配自己编写的内容.我正在为将来的读者写信,他们可能会了解x86的高效功能. (另请参见 x86标签Wiki 中的性能"部分.)

But anyway, this is my version; I'm not trying to write it the way you would, you need to write your own for your assignment. I'm writing for future readers who might learn something about what's efficient on x86. (See also the performance section in the x86 tag wiki.)

我是怎么写的:

我从头开始编写代码,但是很快意识到,一次过的错误将非常棘手,并且我不想只用循环内分支的特殊方式编写这种愚蠢的方式.

I started writing the code from the top, but quickly realized that off-by-one errors were going to be tricky, and I didn't want to just write it the stupid way with branches inside the loops for special cases.

最终的帮助是在内部循环的指针上写了前置条件和后置条件的注释.显然,我需要使用eax=0而不是eax=1进入循环,并将eax作为循环内的第一个操作或类似的东西存储.

What ended up helping was writing the comments for the pre and post conditions on the pointers for the inner loop. That made it clear I needed to enter the loop with eax=0, instead of with eax=1 and storing eax as the first operation inside the loop, or something like that.

显然每个源值只需要读取一次,所以我不想编写一个读取[rsi][rsi+4]之类的内部循环.此外,这将使正确确定边界条件(将不存在的值必须读取为0)变得更加困难.

Obviously each source value only needs to be read once, so I didn't want to write an inner loop that reads [rsi] and [rsi+4] or something. Besides, that would have made it harder to get the boundary condition right (where a non-existant value has to read as 0).

花了一些时间来决定我是否要在寄存器中添加用于行长或行号的实际计数器,然后才结束对整个三角形的使用终点指针的操作.在我完成操作之前,使用纯指针增量/比较并不会节省这么多指令(在上限为像end_triangle这样的构建时常数时进行寄存器存储)并不清楚,但是效果很好.

It took some time to decide whether I was going to have an actual counter in a register for row length or row number, before I ended up just using an end-pointer for the whole triangle. It wasn't obvious before I finished that using pure pointer increments / compares was going to save so many instructions (and registers when the upper bound is a build-time constant like end_triangle), but it worked out nicely.

这篇关于程序集处理三角矩阵存储器的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆