Parallella Community

by **jenesuispasgoth** » Wed Apr 24, 2013 8:52 pm

Hi,

I'm currently reading the reference manual for the Epiphany processor. In section 7.11 (Branch Penalties), it says that basically all branches (conditional and unconditional) will be predicted non-taken. This is relatively "fine" for conditional branches, but it looks like it is also the case for unconditional ones.

Is there a reason for this behavior? If I have a loop and it systematically takes 3 cycles each time I iterate, this is really NOT cool...

by **ysapir** » Wed Apr 24, 2013 9:10 pm

Obviously, there is a reason for such behaviour. When the design target is to reduce power consumption as much as possible, teh architect needs to make a tradeoff between added features, performance and power. Branch prediction is achieved using a BTB (Branch Target Buffer) mechanism, which is like a cache for branch targets. Since Epiphany doesn't have a data cache for reasons of efficiency, it does not have an address cache from same reason.

The way to mitigate the branch penalty is to unroll your loops (space permitting). When compiling you C modules, you can use the -funroll-loops command line option to get that automatically.

by **tnt** » Thu Apr 25, 2013 6:46 am

Well, "prediction" for unconditional branches doesn't need a BTB, it just needs to match the unconditional opcode to know it's going to be taken.

by **ysapir** » Thu Apr 25, 2013 8:10 am

Yes, this is correct.

by **tino** » Thu Apr 25, 2013 3:10 pm

Yes, tnt actually makes a very valid point. An adjustment to the opcode decoder (and/or control unit) for these unconditional branches would save a 3 clock cycle penalty. Depending on the application and the level of parallelism, this adjustment could be beneficial for performance.

However I'm sure there's a reason the Parallela team decided to go with this option. Right?! (again probably regarding mux sizes, dynamic power etc)

by **aolofsson** » Thu Apr 25, 2013 4:09 pm

tino,

A couple of pointers, not sure if it helps explain our reasoning.

-In order to keep the instruction pipe fed with our pipeline we 'prefetch' instructions from the local instruction memory to the instruction decoder on every clock cycle. This basically does a PC+8 (brings in 64 bits at a time). In order to do this on every clock cycle, we can't wait until the instruction comes back, decode it, and decide which PC to fetch next. That's the penalty of having a pipeline that sends out a PC on FE, gets data back from memory at IM, decodes the instruction at DE, accesses the register file at RA, and executes an integer instruction (or branch) at E1. The current branch penalty is not something we can solve with a different instruction decoding methods (unless I am missing something??)

-We do have conditional move instructions that can help with simple control constructs (if, else) without the branch penalty.

-Except for repeating for loops, we felt that the 3 cycle branch penalty was not the end of the world.(compared to things like register file save/restore for function calls).

-For critical 'for' type static repeating loops, you can unroll the loops like Yaniv suggested, or you could try using our experimental "zero-overhead" hardware looping feature.(undocumented, I am waiting for someone to promise to test it

)

Andreas

by **tino** » Thu Apr 25, 2013 6:49 pm

Thanks Andreas

Definitely helped clarify things. It makes sense why those design choices were made for branch instructions now.

(P.S. I was more so referring to a "prefetching" and "predecoding" unit which would just look for the unconditional branch cases, and then update the PC to prefetch from the right address on the next cycle. Then from flushing 24 ((PC+8)*3c.c) instructions, we could reduce it to 8. But in retrospect if we start adding other units it might be just beneficial to start with branch prediction and other power hungry alternatives. What Parallela offers will more than suffice).

Can't wait!

by **piotr5** » Fri Apr 26, 2013 8:22 am

with hardware-loop you mean some piece of hardware regularily resetting the register that's storing where the current execution happens? I guess other cores could do that for you too, even better a clock-interrupt should be capable of that. so how is it a hidden feature in hardware, without overhead? (if the clock would fire an interrupt in exactly adjusted cycles, then there is an overhead for the memory storing all the useless return-addresses.)

I don't own any parallella, so I cannot test my theory. but I suspect changing the right memory-pos will cause a no-penalty jump, right? if that's the case then some other core could take over the whole branch-prediction. :-)

by **ysapir** » Fri Apr 26, 2013 11:24 am

Hardware loops, or Zero-Overhead-Loops in some processor contexts, are simply what their name imply - you have special core registers for the beginning, the end and the counter, and as the program flows, the PC is tested against the loop's end address. When it is about to get there, the counter is decremented and the fetch begins from the beginning of the loop (the loop-start register's value). Therefore, no stall cycles are wasted due to branch misprediction and pipeline flush at the top of the loop.

by **jenesuispasgoth** » Tue Apr 30, 2013 3:54 pm

Hi everyone,

First of all, thanks for all these answers.
Second, I was aware of the problems of having (or not having) BTBs, and my question was a bit provocative on purpose. :-)

My question was mostly because of the unconditional branches because if there had been a way to have no penalties on them, I could see an "easy" way to save 1 cycle on loops while putting a heavier cost on end of loops. Something like:

LOOP:
do_stuff_in_loop;
...
if (!LOOP_MUST_END) // predicted not taken
goto PC+8;
else
goto LOOP; // predicted "taken"
stuff_outside_the_loop;

The way I was seeing it, it would cost "only" 2 cycles per iteration, but 4 cycles if you needed to leave the loop.

Anyway, now that I know that there is a "zero overhead" in the works, I am very curious indeed.

In summary, thanks for your answers, as well as the diplomatic tone to my rather undiplomatic reaction.

Parallella Community

Branch instructions and branch penalty

Branch instructions and branch penalty

Re: Branch instructions and branch penalty

Re: Branch instructions and branch penalty

Re: Branch instructions and branch penalty

Re: Branch instructions and branch penalty

Re: Branch instructions and branch penalty

Re: Branch instructions and branch penalty

Re: Branch instructions and branch penalty

Re: Branch instructions and branch penalty

Re: Branch instructions and branch penalty

Who is online