How to deduce the bottleneck from a synthesis report


Problem description

I coded the 80c51 architecture in VHDL using Xilinx tools. In an attempt to increase the clock frequency, I pipelined all the 80c51 instructions. The instructions execute as desired; for example, while the first instruction is being processed, the second instruction is fetched.

However, according to the synthesis report, I only get a slightly higher clock frequency (around +/-10Hz) despite creating a pipeline depth of 3. I figured out that the bottleneck is due to one operation identified in the synthesis report, but I could not understand the report itself.

May I ask what the data path from 'SEQ/decode_3' to 'SEQ/i_ram_addr_7' is doing? (My guess is that it comes from the case/when statement that checks the 100+ relevant opcodes, but I am not sure whether that is the bottleneck.)

Hence, my two questions are:

Firstly, is it possible that pipelining does not increase the clock frequency, and that the testbench is the only way to demonstrate the reduction in timing?

Secondly, how can I deduce which part of my code corresponds to the bottleneck path from 'SEQ/decode_3' to 'SEQ/i_ram_addr_7'?

Thanks to anyone who can help explain my doubts!

Timing Summary:
---------------
Speed Grade: -4

   Minimum period: 12.542ns (Maximum Frequency: 79.730MHz)
   Minimum input arrival time before clock: 10.501ns
   Maximum output required time after clock: 5.698ns
   Maximum combinational path delay: No path found

Timing Detail:
--------------
All values displayed in nanoseconds (ns)

=========================================================================
Timing constraint: Default period analysis for Clock 'clk'
  Clock period: 12.542ns (frequency: 79.730MHz)
  Total number of paths / destination ports: 113114 / 2670
-------------------------------------------------------------------------
Delay:               12.542ns (Levels of Logic = 10)
  Source:            SEQ/decode_3 (FF)
  Destination:       SEQ/i_ram_addr_7 (FF)
  Source Clock:      clk rising
  Destination Clock: clk rising

  Data Path: SEQ/decode_3 to SEQ/i_ram_addr_7
                                Gate     Net
    Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
    ----------------------------------------  ------------
     FDC:C->Q            102   0.591   1.364  SEQ/decode_3 (SEQ/decode_3)
     LUT4_D:I1->O         10   0.643   0.885  SEQ/de_state_cmp_eq002111 (N314)
     LUT4:I3->O            7   0.648   0.740  SEQ/de_state_cmp_eq00711 (SEQ/de_state_cmp_eq0071)
     LUT4:I2->O            3   0.648   0.534  SEQ/i_ram_addr_mux0000<0>11111 (N2301)
     LUT4:I3->O            1   0.648   0.000  SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0_F (N1284)
     MUXF5:I0->O           1   0.276   0.423  SEQ/i_ram_addr_mux0000<0>11270_SW0_SW0 (N955)
     LUT4_D:I3->O          6   0.648   0.701  SEQ/i_ram_addr_mux0000<0>11270 (SEQ/i_ram_addr_mux0000<0>11270)
     LUT3_L:I2->LO         1   0.648   0.103  SEQ/i_ram_addr_mux0000<7>221_SW2_SW0 (N1208)
     LUT4:I3->O            1   0.648   0.423  SEQ/i_ram_addr_mux0000<7>351_SW1 (N1085)
     LUT4:I3->O            1   0.648   0.423  SEQ/i_ram_addr_mux0000<7>2 (SEQ/i_ram_addr_mux0000<7>2)
     LUT4:I3->O            1   0.648   0.000  SEQ/i_ram_addr_mux0000<7>167 (SEQ/i_ram_addr_mux0000<7>)
     FDE:D                     0.252          SEQ/i_ram_addr_7
    ----------------------------------------
    Total                     12.542ns (6.946ns logic, 5.596ns route)
                                       (55.4% logic, 44.6% route)

=========================================================================
Timing constraint: Default OFFSET IN BEFORE for Clock 'clk'
  Total number of paths / destination ports: 154 / 154
-------------------------------------------------------------------------
Offset:              8.946ns (Levels of Logic = 6)
  Source:            rst (PAD)
  Destination:       SEQ/i_ram_diByte_1 (FF)
  Destination Clock: clk rising

  Data Path: rst to SEQ/i_ram_diByte_1
                                Gate     Net
    Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
    ----------------------------------------  ------------
     IBUF:I->O           444   0.849   1.392  rst_IBUF (REG/ext_int/fd_out1_0__or0000)
     BUF:I->O            445   0.648   1.425  rst_IBUF_1 (rst_IBUF_1)
     LUT3:I2->O            4   0.648   0.730  ROM/data<1>1 (i_rom_data<1>)
     LUT4:I0->O            1   0.648   0.500  SEQ/i_ram_diByte_mux0000<1>17_SW0 (N1262)
     LUT4:I1->O            1   0.643   0.563  SEQ/i_ram_diByte_mux0000<1>32 (SEQ/i_ram_diByte_mux0000<1>32)
     LUT4:I0->O            1   0.648   0.000  SEQ/i_ram_diByte_mux0000<1>60 (SEQ/i_ram_diByte_mux0000<1>)
     FDE:D                     0.252          SEQ/i_ram_diByte_1
    ----------------------------------------
    Total                      8.946ns (4.336ns logic, 4.610ns route)
                                       (48.5% logic, 51.5% route)

=========================================================================


To be more specific, here is a snippet of example code from the decode phase for one opcode.

The following is one such case when decoding an opcode, here a mov instruction. There are 100+ opcodes (100+ instructions), which means this case statement has over 100 when branches.

case opcode is

--MOV A, Rn
when "11101000" | "11101001" | "11101010" | "11101011" | "11101100" | "11101101" | "11101110" | "11101111" =>
      case de_state is
          when E7 =>

              de_state <= E8;

          when E8 =>


              de_state <= E9;

          when E9 =>


              de_state <= E10;
          when E10 =>
              --Draw PSW
              i_ram_addr <= x"D0";
              i_ram_rdByte <= '1';

              de_state <= E11;
          when E11 =>
              --Draw from Rn
              i_ram_addr <= "000" & i_ram_doByte(4 downto 3)& opcode(2 downto 0);
              i_ram_rdByte <= '1';

              de_state <= E12;

          when E12 =>
              --Place into EDR
              EDR <= i_ram_doByte;
              --close rdByte
              i_ram_rdByte <= '0';

          when others =>

          end case;

I hope this gives you a better idea of my VHDL code. I would appreciate any form of help. Thank you!

Recommended answer

There can be no good answer from this information alone; we can only guess what source code produced this hardware.

But it is clear that you need to examine the source, form a hypothesis about why it is slow, take action to correct the problem, and test the solution.

Repeat until fast enough.

My guess, given your hint that there is a case statement to decode the opcodes...

one of the arms is something like:

when <some expression involving decode>  =>
   address <= <some address calculation>;

The problem is that the two expressions are often inter-related, so they are evaluated in the same cycle. An example solution would be to precompute the address expression into a register (i.e. in the previous cycle) and rewrite the case arm as:

when <some expression involving decode>  =>
   address <= register;
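
As a rough illustration of the precompute idea, here is a minimal sketch. It assumes the address expression is the Rn address from the question's E11 state and that its inputs are already valid one cycle earlier; the entity name, `sel_rn` and `rn_addr_reg` are hypothetical and do not come from the original design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch: only i_ram_addr, i_ram_doByte and opcode are names
-- from the question; everything else is invented for illustration.
entity addr_precompute is
  port (
    clk          : in  std_logic;
    sel_rn       : in  std_logic;  -- assumed decode result: '1' when the Rn address is wanted
    opcode       : in  std_logic_vector(7 downto 0);
    i_ram_doByte : in  std_logic_vector(7 downto 0);
    i_ram_addr   : out std_logic_vector(7 downto 0)
  );
end entity addr_precompute;

architecture rtl of addr_precompute is
  signal rn_addr_reg : std_logic_vector(7 downto 0) := (others => '0');
begin
  -- Cycle N: evaluate the address expression and store it in a register.
  precompute : process (clk)
  begin
    if rising_edge(clk) then
      rn_addr_reg <= "000" & i_ram_doByte(4 downto 3) & opcode(2 downto 0);
    end if;
  end process precompute;

  -- Cycle N+1: the decode arm only copies the register, so the decode
  -- condition and the address calculation no longer share one clock period.
  use_reg : process (clk)
  begin
    if rising_edge(clk) then
      if sel_rn = '1' then
        i_ram_addr <= rn_addr_reg;
      end if;
    end if;
  end process use_reg;
end architecture rtl;
```

The trade-off is one extra cycle of latency on the address, so the surrounding state machine has to be arranged so that the inputs are available a cycle earlier.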

If you guessed right, the result will be slightly faster, and you will have another (similar) bottleneck to fix. Repeat until fast enough...

But without the source AND the timing analysis, don't expect a more specific answer.

EDIT: now that a fraction of the source code has been posted, the picture is a little clearer: you have two nested case statements, each quite large. You clearly need some simplification...

I note that only 2 of the inner case arms assign to i_ram_addr, yet the timing analysis shows a huge and complex mux on i_ram_addr; clearly there are a lot of other case arms that contribute terms to i_ram_addr...
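
The figures in the question's timing report can be read as follows. All numbers are taken from that report; only the arithmetic is added here.

```latex
\begin{align*}
T_{\min} &= t_{\text{logic}} + t_{\text{route}}
          = 6.946\,\text{ns} + 5.596\,\text{ns} = 12.542\,\text{ns} \\
f_{\max} &= \frac{1}{T_{\min}} = \frac{1}{12.542\,\text{ns}} \approx 79.73\,\text{MHz}
\end{align*}
```

Of the 10 levels of logic on the path, the first two are de_state_cmp_eq comparisons and the remaining eight all carry the name i_ram_addr_mux0000. The source flip-flop SEQ/decode_3 also drives a fanout of 102, which by itself accounts for 1.364 ns of net delay.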

I would suggest that you might have to treat i_ram_addr separately from the main case statement and write the simplest machine you can to generate i_ram_addr alone. For example, I would note that the OPCODE case arm is equivalent to:

if OPCODE(7 downto 3) = "11101" then ...

and ask how simple you can make a decoder for i_ram_addr alone. You may find that a lot of other case arms do very similar things with i_ram_addr (the original 8051 designers would have jumped at the chance to simplify logic!). Synthesis tools can be quite clever at simplifying logic, but when things get too complex they can miss opportunities.

(At this stage I would comment out the i_ram_addr assignments and leave the rest of the decoder alone)
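
A dedicated i_ram_addr generator along these lines might look like the sketch below. It covers only the MOV A, Rn arm shown in the question. The entity name and the `psw_phase` / `rn_phase` strobes are hypothetical stand-ins for "de_state is E10" and "de_state is E11"; the real design would derive them from its own state encoding.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch of a stand-alone address generator. Only opcode,
-- i_ram_doByte and i_ram_addr are names taken from the question.
entity ram_addr_gen is
  port (
    clk          : in  std_logic;
    opcode       : in  std_logic_vector(7 downto 0);
    psw_phase    : in  std_logic;  -- assumed: '1' in the state that reads PSW (E10 in the post)
    rn_phase     : in  std_logic;  -- assumed: '1' in the state that reads Rn (E11 in the post)
    i_ram_doByte : in  std_logic_vector(7 downto 0);
    i_ram_addr   : out std_logic_vector(7 downto 0)
  );
end entity ram_addr_gen;

architecture rtl of ram_addr_gen is
  signal is_mov_a_rn : std_logic;
begin
  -- One 5-bit compare replaces the eight-way "when ... | ..." choice list.
  is_mov_a_rn <= '1' when opcode(7 downto 3) = "11101" else '0';

  addr_reg : process (clk)
  begin
    if rising_edge(clk) then
      if is_mov_a_rn = '1' then
        if psw_phase = '1' then
          i_ram_addr <= x"D0";  -- PSW address, as in state E10
        elsif rn_phase = '1' then
          -- Register bank bits from PSW plus the register number, as in state E11
          i_ram_addr <= "000" & i_ram_doByte(4 downto 3) & opcode(2 downto 0);
        end if;
      end if;
    end if;
  end process addr_reg;
end architecture rtl;
```

Other opcodes that address RAM in the same way could then be folded into the same few conditions, which keeps the mux in front of i_ram_addr small and easy to inspect in the next synthesis report.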
