epu timing tool

Postby notzed » Mon Jul 14, 2014 1:49 am

I've been hacking on a tool which can be used to display some timing information about instruction sequences as shown in the last update on this post:

http://a-hackers-craic.blogspot.com.au/ ... -tool.html

If you used the spu-timing-tool from IBM for the CELL B.E., the final output might look familiar. It works on object files (linked or not).

I have a super-rough pipeline model which tracks the register dependencies and dual-issue rules, based on the arch ref and some testing on the hardware. I've not tried to implement the fetch logic as yet, although this also affects dual-issue rates in ways I've yet to fully determine. At this point the intention is just a static analysis tool, but I'll see where the wind takes me.

I don't have a release package yet; I want to fix a few things and pretty up the output a bit first.

This topic is for any expressions of interest and discussion.

Update: I have now published a version here:
http://www.users.on.net/~notzed/software/ezetool.html

It's not complete but it might be useful or just interesting.
Last edited by notzed on Fri Jul 18, 2014 7:11 am, edited 1 time in total.

Re: epu timing tool

Postby aolofsson » Mon Jul 14, 2014 8:46 am

Very cool! This was another one of those projects that we always wanted to do... but somehow never got around to. Amazing to see you get something working in an afternoon! Let me know if there is anything I can help out with in terms of pipeline description so that you don't have to do too much reverse engineering.

Andreas

Re: epu timing tool

Postby timpart » Mon Jul 14, 2014 1:16 pm

One thing I've never been clear on: as the core works through a sequence of instructions, when does the start of the pipeline fetch a new double word of instructions from memory? Since an instruction can span a double word boundary, presumably it must fetch a fresh double word in good time. Presumably this is connected with the enigmatic "IAB". (Does B stand for buffer?) I'm curious how easy it is to avoid a clash if you want to do a load into a register from the same bank of memory.

Does the timing of when a double word fetch is started depend on whether its address is on or off core?

Presumably Dual Issuing is still possible if the other instruction is on the following double word?

Thanks in advance,

Tim

Re: epu timing tool

Postby aolofsson » Mon Jul 14, 2014 1:52 pm

Tim,

The IAB is something I don't want to document too much, because it is likely to change (for the better) in future versions of the chip, but here are a few more details:

-The IAB works like a FIFO with a prefetch circuit that continuously brings in 64 bits from the memory whenever it can.

-Instructions get pulled out of the FIFO by the dual issue scheduling logic.

-Detecting FIFO full by the prefetch engine was a little tricky, considering that the "pull rate" depends on the instructions being single issue, double issue, 16/32.

-The IAB gets "flushed" on branches.

Notes:
-Align critical code sections on double word boundaries (loop starts, for example).
-Using 16-bit instructions helps, since the IAB can then fetch up to 4 instructions per clock cycle.
-It is possible to load a word from the same bank as the instruction fetch without suffering a penalty (but it depends on the scenario).

Andreas
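
A rough way to picture the numbers in the post above is plain arithmetic on the 64-bit fetch width. The Python sketch below is only a toy: the function names are made up for illustration, and it does not model the real IAB (its depth, full detection and flush behaviour are not documented in this thread). It shows how many instructions one fetch can hold, and how the "pull rate" varies with issue width and instruction size.

```python
FETCH_BITS = 64  # prefetch width stated in the post above


def instructions_per_fetch(insn_bits):
    """Whole instructions of a single size that fit in one 64-bit fetch."""
    if insn_bits not in (16, 32):
        raise ValueError("instructions are 16 or 32 bits wide")
    return FETCH_BITS // insn_bits


def pull_bits(issued_sizes):
    """Bits removed from the FIFO in one cycle (hypothetical helper).

    issued_sizes lists the sizes of the one or two instructions issued.
    """
    if not 1 <= len(issued_sizes) <= 2:
        raise ValueError("one or two instructions issue per cycle")
    return sum(issued_sizes)


if __name__ == "__main__":
    for issued in ([16], [32], [16, 16], [16, 32], [32, 32]):
        used = pull_bits(issued)
        print(issued, "pulls", used, "bits; fetch surplus", FETCH_BITS - used)
```

Under these assumptions only a dual issue of two 32-bit instructions consumes a full fetch per cycle; every other mix leaves the prefetcher ahead, which fits the remark that detecting "FIFO full" depends on the instruction mix.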

Re: epu timing tool

Postby shodruk » Tue Jul 15, 2014 2:48 am

spu_timing! I used to love it. :D
Shodruky

Re: epu timing tool

Postby timpart » Tue Jul 15, 2014 12:08 pm

aolofsson wrote:
-The IAB works like a FIFO with a prefetch circuit that continuously brings in 64 bits from the memory whenever it can.
[...]
-Detecting FIFO full by the prefetch engine was a little tricky, considering that the "pull rate" depends on the instructions being single issue, double issue, 16/32.

-The IAB gets "flushed" on branches.

Thanks Andreas. So I glean from that:
-Working out when instruction fetches happen from memory isn't easy, and may be different between chip versions.
-They are likely to happen in the instructions immediately after a branch (and by implication less often a bit later).

I've been thinking about when best to align code to double word boundaries for non-critical code, but I'll probably start another thread for that when I've pondered long enough.

Tim

Re: epu timing tool

Postby notzed » Wed Jul 16, 2014 6:11 pm

aolofsson wrote:
Very cool! This was another one of those projects that we always wanted to do..but somehow never got around to. Amazing to see you get something working in an afternoon! Let me know if there is anything I can help out with in terms of pipeline description so that you don't have to do too much reverse engineering.


Cheers. A few hacking sessions later i'm close to a first cut code drop: that should show anything i've missed. I got most of the info i needed from the manual+some posts on here, and some of the finer points from test snippets I needed to verify in the simulator.

(I should really check for an updated architecture reference manual, or look in the docs errata thread, because I copied some of the instruction-encoding errors from the copy I have; although if bitr really has a shift parameter that would be handy.)

There are some missed dual-issue opportunities on the hardware i haven't fully worked out the rules for - but i need to create some test-cases to show them and they are not critical.

Here's one I just noticed, though it's not the only one I've seen:

This will not dual issue the fadd with the ldr.

Code:
   .global   _stalls
   .balign   8
_stalls:
   mov   r1,#data
   nop
   nop
   nop

   ldr.l   r2,[r1]
   fadd.l   r3,r3,r3
   
   rts


Any r3 ...?

... after some experimentation I found that dual issue is unnecessarily blocked when the paired instruction uses the other register of the even/odd pair that a load smaller than double-word writes, but it doesn't stall execution.

None of these dual-issue:

Code:
   .global   _stalls
   .balign   8
_stalls:
   mov   r1,#data

   nop

   ldr.l   r19,[r1]
   fadd.l   r18,r18,r18

   nop

   ldr.l   r16,[r1]
   fadd.l   r17,r17,r17

   nop

   fadd.l   r18,r18,r18
   ldr.l   r19,[r1]

   nop

   fadd.l   r17,r17,r17
   ldr.l   r16,[r1]

   rts


I may or may not implement every such case ...!

Re: epu timing tool

Postby aolofsson » Wed Jul 16, 2014 9:57 pm

notzed,

Good catch!! I had forgotten about that "optimization". This was on one of the critical paths, so I took some liberties: loads and stores work on register pairs.

For good measure, here is the complete decode logic :D Dual issue happens only if none of these conditions are true:
--RAW CODE

//load-->load
assign iab_waw_00_de = iab_op0_wr_rd0_write_de &
iab_op1_wr_rd0_write_de &
(iab_op0_wr_rd0_addr_de[RFAW-1:1]==iab_op1_wr_rd0_addr_de[RFAW-1:1]);

//load-->ialu/fpu
assign iab_waw_01_de = iab_op0_wr_rd0_write_de &
iab_op1_wr_rd1_write_de &
(iab_op0_wr_rd0_addr_de[RFAW-1:1]==iab_op1_wr_rd1_addr_de[RFAW-1:1]);

//ialu/fpu-->load
assign iab_waw_10_de = iab_op0_wr_rd1_write_de &
iab_op1_wr_rd0_write_de &
(iab_op0_wr_rd1_addr_de[RFAW-1:1]==iab_op1_wr_rd0_addr_de[RFAW-1:1]);

//ialu/fpu-->ialu/fpu
assign iab_waw_11_de = iab_op0_wr_rd1_write_de &
iab_op1_wr_rd1_write_de &
(iab_op0_wr_rd1_addr_de[RFAW-1:0]==iab_op1_wr_rd1_addr_de[RFAW-1:0]);

//ialu-fpu-->rn-read
assign iab_raw_10_de = iab_op0_wr_rd1_write_de & iab_op1_rd_rn_read_de &
(iab_op0_wr_rd1_addr_de[RFAW-1:0]==iab_op1_rd_rn_addr_de[RFAW-1:0]);

//ialu-fpu-->rm-read
assign iab_raw_11_de = iab_op0_wr_rd1_write_de & iab_op1_rd_rm_read_de &
(iab_op0_wr_rd1_addr_de[RFAW-1:0]==iab_op1_rd_rm_addr_de[RFAW-1:0]);

//ialu-fpu-->store
//note: matching on odd/even
assign iab_raw_12_de = iab_op0_wr_rd1_write_de & iab_op1_rd_rs_read_de &
(iab_op0_wr_rd1_addr_de[RFAW-1:1]==iab_op1_rd_rs_addr_de[RFAW-1:1]);

//ialu-fpu-->acc-fpu
assign iab_raw_13_de = iab_op0_wr_rd1_write_de & iab_op1_rd_acc_read_de &
(iab_op0_wr_rd1_addr_de[RFAW-1:0]==iab_op1_rd_acc_addr_de[RFAW-1:0]);

//load-->rn-read
assign iab_raw_00_de = iab_op0_wr_rd0_write_de & iab_op1_rd_rn_read_de &
(iab_op0_wr_rd0_addr_de[RFAW-1:1]==iab_op1_rd_rn_addr_de[RFAW-1:1]);

//load-->rm-read
assign iab_raw_01_de = iab_op0_wr_rd0_write_de & iab_op1_rd_rm_read_de &
(iab_op0_wr_rd0_addr_de[RFAW-1:1]==iab_op1_rd_rm_addr_de[RFAW-1:1]);

//load-->store
assign iab_raw_02_de = iab_op0_wr_rd0_write_de & iab_op1_rd_rs_read_de &
(iab_op0_wr_rd0_addr_de[RFAW-1:1]==iab_op1_rd_rs_addr_de[RFAW-1:1]);

//load-->acc-fpu
assign iab_raw_03_de = iab_op0_wr_rd0_write_de & iab_op1_rd_acc_read_de &
(iab_op0_wr_rd0_addr_de[RFAW-1:1]==iab_op1_rd_acc_addr_de[RFAW-1:1]);

//store-->ialu/fpu write
assign iab_war_hazard_de = iab_op0_rd_rs_read_de & iab_op1_wr_rd1_write_de &
(iab_op0_rd_rs_addr_de[RFAW-1:1]==iab_op1_wr_rd1_addr_de[RFAW-1:1]);
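
For anyone wanting to try the conditions above in a static tool, here is a minimal Python sketch of the same comparisons. It is a reading of the decode logic, not a verified model. The `Op` class and `blocks_dual_issue` function are hypothetical names, and the mapping of assembly operands onto `rd0`/`rd1`/`rn`/`rm`/`rs`/`acc` (for example, treating a load's base register as `rn`) is an assumption. A field set to `None` stands for the matching write or read enable being low. "Pair" comparisons drop bit 0 of the register number, mirroring the `[RFAW-1:1]` slices; "exact" comparisons mirror `[RFAW-1:0]`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Op:
    """Register usage of one instruction; None means 'not used'."""
    rd0: Optional[int] = None  # written by a load
    rd1: Optional[int] = None  # written by ialu/fpu
    rn: Optional[int] = None   # first source operand
    rm: Optional[int] = None   # second source operand
    rs: Optional[int] = None   # store data source
    acc: Optional[int] = None  # fpu accumulator read


def _hit(a, b, pair):
    """Compare two register numbers, per pair or exactly."""
    if a is None or b is None:
        return False
    return (a >> 1) == (b >> 1) if pair else a == b


def blocks_dual_issue(op0: Op, op1: Op) -> List[str]:
    """Names of the conditions that stop op0 and op1 issuing together."""
    checks = {
        "waw_00": _hit(op0.rd0, op1.rd0, True),
        "waw_01": _hit(op0.rd0, op1.rd1, True),
        "waw_10": _hit(op0.rd1, op1.rd0, True),
        "waw_11": _hit(op0.rd1, op1.rd1, False),
        "raw_10": _hit(op0.rd1, op1.rn, False),
        "raw_11": _hit(op0.rd1, op1.rm, False),
        "raw_12": _hit(op0.rd1, op1.rs, True),
        "raw_13": _hit(op0.rd1, op1.acc, False),
        "raw_00": _hit(op0.rd0, op1.rn, True),
        "raw_01": _hit(op0.rd0, op1.rm, True),
        "raw_02": _hit(op0.rd0, op1.rs, True),
        "raw_03": _hit(op0.rd0, op1.acc, True),
        "war": _hit(op0.rs, op1.rd1, True),
    }
    return [name for name, hit in checks.items() if hit]


if __name__ == "__main__":
    ldr_r19 = Op(rd0=19, rn=1)           # ldr.l r19,[r1]
    fadd_r18 = Op(rd1=18, rn=18, rm=18)  # fadd.l r18,r18,r18
    print(blocks_dual_issue(ldr_r19, fadd_r18))
    print(blocks_dual_issue(fadd_r18, ldr_r19))
```

With these operand assumptions, the `ldr.l r19` / `fadd.l r18` pair from the earlier post trips `waw_01`, `raw_00` and `raw_01`, and the reversed order trips `waw_10`. Both results come from r18 and r19 sharing an even/odd pair. Only the register comparisons are modelled here; anything else that might prevent dual issue is left out.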

Re: epu timing tool

Postby notzed » Thu Jul 17, 2014 2:31 pm

Cool thanks. I'll have to spend a bit of time deciphering it but i can follow it i think.

Here's another interesting case. I'm not particularly concerned if i miss this one out though.

Code:
_stalls:
       0: 070094fc        strd.l  r4,[r13],#-1   
       4: 0400d47c        strd.l  r6,[r13,#+0]   
       8: 01a2       nop.s                   
       a: 01a2       nop.s                   
       c: 01a2       nop.s                   
       e: 01a2       nop.s                   

      10: 0007       fadd.s  r0,r0,r0       
      12: fc02fcef        mov.l   r63,r63         
      16: 2487       fadd.s  r1,r1,r1       
      18: fc02fcef        mov.l   r63,r63         
      1c: 4907       fadd.s  r2,r2,r2       
      1e: fc02fcef        mov.l   r63,r63         
      22: 6d87       fadd.s  r3,r3,r3        ***
      24: fc02fcef        mov.l   r63,r63         
 
      28: 01a2       nop.s                   
      2a: 0600d4ec        ldrd.l  r6,[r13],#+1   
      2e: 0400946c        ldrd.l  r4,[r13,#+0]   
      32: 0402194f        jr.l    r14             


This will not dual-issue every fadd/mov pair. I determined that, in this particular sequence, the fadd r3,r3,r3 (marked ***) is the one that doesn't.

If I cycle through r0-r7 using the above sequence, every time r3 comes around it drops another dual issue. If I change all instructions to be 32-bit, then they all dual-issue.

I'm guessing this is something to do with instruction fetch whereby only 1 instruction is presented in that cycle.
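
One way to look into the fetch guess is to tabulate where each instruction in the listing sits relative to 64-bit fetch boundaries. The Python sketch below does only that. The `layout` helper is a hypothetical name, the 8-byte width is taken from the IAB description earlier in the thread, and nothing here models the IAB itself or explains the lost dual issue.

```python
FETCH_BYTES = 8  # 64-bit fetch width, per the IAB description in this thread


def layout(start, sizes):
    """Return (address, size, first_dword, last_dword) per instruction.

    first_dword/last_dword are indices of the 8-byte fetch units that
    contain the first and last byte of the instruction.
    """
    rows = []
    addr = start
    for size in sizes:
        first = addr // FETCH_BYTES
        last = (addr + size - 1) // FETCH_BYTES
        rows.append((addr, size, first, last))
        addr += size
    return rows


if __name__ == "__main__":
    # Four fadd.s (2 bytes) / mov.l (4 bytes) pairs starting at 0x10,
    # matching the addresses in the listing above.
    for addr, size, first, last in layout(0x10, [2, 4] * 4):
        note = "spans two double words" if first != last else ""
        print(f"{addr:#04x} size={size} dword={first * FETCH_BYTES:#04x} {note}")
```

In this layout the only instruction that straddles a double-word boundary is the mov.l at 0x1e, which immediately precedes the fadd.s r3 at 0x22. Whether that is what costs the dual issue is not established here; the table is just a starting point for building test cases.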

