VHDL simulation stuck in for loop

Problem description

I'm doing simulation testing for some VHDL I wrote and when I run it in ModelSim it gets stuck. When I hit 'break' it has an arrow pointing to the For loop in the following function:

function MOD_3 (a, b, c : UNSIGNED (1023 downto 0)) return UNSIGNED is

  VARIABLE x : UNSIGNED (1023 downto 0) := TO_UNSIGNED(1, 1024);
  VARIABLE y : UNSIGNED (1023 downto 0) := a;
  VARIABLE b_temp : UNSIGNED (1023 downto 0) := b;

begin

  for I in 0 to 1024 loop
    if b_temp > 0 then
      if b_temp MOD 2 = 1 then
        x := (x * y) MOD c;
      end if;
      y := (y * y) MOD c;
      b_temp := b_temp / 2;
    else
      exit;
    end if;
  end loop;

  return x MOD c;

end function;

I originally had this as a while loop which I realize is not good for synthesizing. So I converted it to a for loop with the condition that b_temp is greater than 0. b_temp is a 1024-bit unsigned and so if it is the largest number that could be represented by 1024 bits and divided in half (which I do in each iteration) 1024 times, shouldn't it definitely be 0?

I have a feeling my problem lies in the large multiplications...if I comment out x := (x * y) MOD c and y := (y * y) MOD c then it exits the loop. So the only thing I can think of is it takes too long to carry out these 1024-bit multiplications? If this is the case, is there any built-in way I can optimize this to make it faster, or is my only option to implement something like Karatsuba multiplication, etc...?

Solution

I'm of the opinion implementing a Montgomery multiplier in numeric_std function calls may not improve simulation as much as you'd like (while giving synthesis eligibility).

The issue is the number of dynamically elaborated subprogram calls vs. their operand sizes vs. fitting in your CPU-running-Modelsim's L1/L2/L3 caches.

It does do wonders for targeting synthesis in an FPGA or a SIMD GPU implementation.

See Subversion Repositories BasicRSA file modmult.vhd (which has a generic size). I successfully converted this to using numeric_std[_unsigned].

If I recall correctly, this appears to be inspired by a Master's thesis (Efficient Hardware Architectures for Modular Multiplication) by David Narh Amanor in 2005, outlining a Java and a VHDL implementation in various word sizes.

I found the OpenCores implementation mentioned in a Stack Overflow question (Montgomery multiplication VHDL Implementation) and found the generic-sized version in the SVN repository (the downloadable version is 16 bit). The thesis is mentioned in A 1024 – Bit Implementation of the Faster Montgomery Multiplier Using VHDL (by David Narh Amanor, the original link having expired). Note the quoted FPGA implementation performance of under 42 usec.

Notice that the length-1024 version specified by a generic would still be performing dynamically elaborated function calls with length-1024 operands (although not the "*"s, the "mod"s or the "/"s). You'd still be doing millions of function calls with dynamically elaborated (passed on an expression stack) 1024-bit parameters. We're simply changing how many millions of large-parameter subroutine calls there are and how long they can take.
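To make that concrete, here is a minimal sketch of a generic-width modular multiply that uses only shifts, adds, compares and subtracts. It is not the OpenCores modmult.vhd code. The names modmult_sketch, mod_mult and modmult_sketch_tb are hypothetical, and the function assumes m /= 0 and b < m.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package modmult_sketch is
  -- Returns (a * b) mod m with the width of m.
  -- Assumption: m /= 0 and b < m; a may have any width.
  function mod_mult (a, b, m : unsigned) return unsigned;
end package modmult_sketch;

package body modmult_sketch is
  function mod_mult (a, b, m : unsigned) return unsigned is
    constant N   : natural := m'length;
    -- Two guard bits: the running value stays below 3*m before reduction.
    constant mm  : unsigned(N + 1 downto 0) := resize(m, N + 2);
    constant bb  : unsigned(N + 1 downto 0) := resize(b, N + 2);
    variable acc : unsigned(N + 1 downto 0) := (others => '0');
    alias aa     : unsigned(a'length - 1 downto 0) is a;
  begin
    for i in aa'range loop              -- bits of a, MSB first
      acc := shift_left(acc, 1);        -- acc := 2 * acc
      if aa(i) = '1' then
        acc := acc + bb;                -- add the multiplicand
      end if;
      -- At most two subtractions bring acc back below m.
      if acc >= mm then acc := acc - mm; end if;
      if acc >= mm then acc := acc - mm; end if;
    end loop;
    return acc(N - 1 downto 0);
  end function mod_mult;
end package body modmult_sketch;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.modmult_sketch.all;

entity modmult_sketch_tb is
end entity modmult_sketch_tb;

architecture sim of modmult_sketch_tb is
begin
  process
    constant a : unsigned(7 downto 0) := to_unsigned(200, 8);
    constant b : unsigned(7 downto 0) := to_unsigned(123, 8);
    constant m : unsigned(7 downto 0) := to_unsigned(251, 8);
  begin
    -- 200 * 123 = 24600 = 98 * 251 + 2
    assert mod_mult(a, b, m) = 2 report "mod_mult mismatch" severity failure;
    wait;
  end process;
end architecture sim;
```

By a rough count, each call with 1024-bit operands runs 1024 loop iterations with up to six numeric_std operator calls apiece (one shift, one add, two compares, two subtracts). An exponentiation that needs up to 2 x 1024 such multiplies therefore makes on the order of 2048 x 1024 x 6, or about 1.3 x 10^7, operator calls on roughly 1026-bit operands, which is consistent with the "millions" mentioned above.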

And that also brings up the possibility of an integer vector implementation (bignum equivalent) in VHDL, which would potentially increase simulation performance even more (and you're likely in uncharted territory here).

A subprogram based version of the OpenCores model using variable parameters would be telling. (Whether or not you can impress anyone showing them a simulation model executing, or whether there's this looong pause interrupted by everyone taking furtive glances at the wall clock and looking bored).
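As a rough illustration of what a subprogram-based version could look like, the sketch below does the exponentiation with procedures that update variable-class parameters in place and avoids "*", "mod" and "/" entirely. It is not a port of the OpenCores model. The names modexp_sketch, mod_exp, mod_mult_v and modexp_sketch_tb are hypothetical, and it assumes m > 1, base < m, and equal lengths for base, m and result.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package modexp_sketch is
  -- result := (base ** expo) mod m.
  -- Assumption: m > 1, base < m, equal lengths for base, m and result.
  procedure mod_exp (
    base, expo, m   : in  unsigned;
    variable result : out unsigned);
end package modexp_sketch;

package body modexp_sketch is

  -- p := (p * q) mod m by shift-add with a reduction after every step.
  -- Assumption: q < m and p'length = m'length.
  procedure mod_mult_v (
    variable p : inout unsigned;
    q, m       : in    unsigned) is
    constant N   : natural := m'length;
    constant mm  : unsigned(N + 1 downto 0) := resize(m, N + 2);
    constant qq  : unsigned(N + 1 downto 0) := resize(q, N + 2);
    variable acc : unsigned(N + 1 downto 0) := (others => '0');
  begin
    for i in p'range loop               -- left to right, MSB first
      acc := shift_left(acc, 1);
      if p(i) = '1' then
        acc := acc + qq;
      end if;
      if acc >= mm then acc := acc - mm; end if;
      if acc >= mm then acc := acc - mm; end if;
    end loop;
    p := acc(N - 1 downto 0);
  end procedure mod_mult_v;

  procedure mod_exp (
    base, expo, m   : in  unsigned;
    variable result : out unsigned) is
    constant N : natural := m'length;
    alias e    : unsigned(expo'length - 1 downto 0) is expo;
    variable x : unsigned(N - 1 downto 0) := to_unsigned(1, N);
    variable y : unsigned(N - 1 downto 0) := base;
    variable t : unsigned(N - 1 downto 0);
  begin
    for i in 0 to e'length - 1 loop     -- exponent bits, LSB first
      if e(i) = '1' then
        mod_mult_v(x, y, m);            -- x := (x * y) mod m
      end if;
      t := y;                           -- copy so p and q are distinct objects
      mod_mult_v(y, t, m);              -- y := (y * y) mod m
    end loop;
    result := x;
  end procedure mod_exp;

end package body modexp_sketch;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.modexp_sketch.all;

entity modexp_sketch_tb is
end entity modexp_sketch_tb;

architecture sim of modexp_sketch_tb is
begin
  process
    variable r : unsigned(7 downto 0);
  begin
    -- 3 ** 5 = 243 = 34 * 7 + 5
    mod_exp(to_unsigned(3, 8), to_unsigned(5, 8), to_unsigned(7, 8), r);
    assert r = 5 report "mod_exp mismatch" severity failure;
    wait;
  end process;
end architecture sim;
```

The loop runs a fixed number of times (one pass per exponent bit) instead of exiting when the remaining exponent reaches zero, so small exponents do not finish early. Whether this simulates faster than the "*" and "mod" version at 1024 bits would have to be measured in the simulator.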
