Are ADD/SUB signed or unsigned operations?

Re: Are ADD/SUB signed or unsigned operations?

Postby notzed » Sat Aug 30, 2014 3:15 am

cmcconnell wrote:
notzed wrote:I know it seems a bit weird but that's what I saw from the cycle counters and a whole lot of tiny tests I used to isolate the precise behaviour.

I suppose it would be a lot of complexity, for only a little reward, to have a scheduler check for this kind of scenario. I.e. to be able to look ahead and see that it would be better to dual issue the second candidate pair.


I would think a full out of order execution pipeline would be a good deal more complex. Possibly the implementation of interrupts has a bearing on this part of the design.

notzed wrote:The eor's are just waiting for the fmadd results. The fmadd 'stage numbers' displayed should probably just go to 5 for rounding mode because you need 4 whole cycles between "1" of the first and "1" of the dependent instruction in this case.

Ah, suddenly it all makes sense! My problem was the uncertainty I mentioned previously about the precise meaning of 'Cycle Separation'. I was confusing the 4 execution stages with the stated 4-cycle separation. But, of course, your definition must be correct; there is after all a 1-cycle separation when you go the other way - from an IALU to a dependent FPU instruction - and that fits the same definition of a gap between the '1s' (i.e. 1_1234).


I kept thinking 4-instruction staggering and wondering why the hardware counters were showing such poor values. But it's really 5. It's separation, not stride?
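To make the "gap between the 1s" reading concrete, here is a minimal sketch in Python. The separation values (4 from an FPU instruction to a dependent IALU instruction in rounding mode, 1 the other way) are the ones quoted in this thread. The table, the function names and the single-line picture are made up for illustration, and this is not how ezetime itself is implemented.

```python
# Idle cycles between the '1' of a producer and the '1' of a dependent
# consumer. Values as quoted in this thread (rounding mode); the table and
# function names are illustrative only.
SEPARATION = {("fpu", "ialu"): 4, ("ialu", "fpu"): 1}

# Execution stages as displayed: FPU shows 1-4, IALU shows a single 1.
STAGES = {"fpu": "1234", "ialu": "1"}


def earliest_issue(producer_unit, producer_cycle, consumer_unit):
    """Cycle in which the dependent instruction can show its '1'."""
    gap = SEPARATION[(producer_unit, consumer_unit)]
    return producer_cycle + gap + 1


def timeline(producer_unit, consumer_unit):
    """One-line picture: producer issued at cycle 0, '_' for idle cycles."""
    shown = STAGES[producer_unit]
    start = earliest_issue(producer_unit, 0, consumer_unit)
    return shown + "_" * (start - len(shown)) + STAGES[consumer_unit]


print(timeline("ialu", "fpu"))   # 1_1234
print(timeline("fpu", "ialu"))   # 1234_1
```

With the producer at cycle 0, a dependent IALU instruction after an FPU instruction lands on cycle 5, which is why staggering by 4 instructions is one short.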

Not sure about changing the display to go up to 5, though. I think I was just being a bit dim, and the change might not really be all that necessary. (Plus, to be consistent you would really need to add an extra stage to the IALU instructions too, displaying them as '12'.)


Well not really because the way it's displayed the iop value is usable after "1", but the flop result is ready after "5".

Another interpretation is that iops results are ready at the start of E1 and flops results (rounding mode) are ready at the end of E4 - i.e. shifting everything one to the left. But you can't really represent edges with ASCII and showing the non-E stages looks a bit cluttered.

Earlier you said -
The pipeline diagram details where the instructions are executed and retired (figure 14), and yes I believe it is one cycle less when using truncate mode. i.e. they will retire in E3 and not E4.

But does that mean that the required Cycle Separation is also reduced from 4 to 3? If it is, then a useful enhancement to your tool might be a command line option to specify that truncate mode be assumed, and to adjust the timings accordingly.


Yeah that's what I meant by retire. To confirm it absolutely I would do a test with the event counters, but if it didn't behave this way it would be inconsistent with everything else we know about the hardware.

I'm aware of it so I guess it's on the TODO list as such. I was thinking yesterday it's probably fairly easy to hack in; I already have a spot to track the FPU mode at issue time although with the delays in mode switching I think the pipeline is flushed between them anyway.

I'd recommend running your code on the hardware with the event counters running just to cross-check any details (edit: if you need absolutely accurate results that is. The tool doesn't simulate memory conflicts either).

notzed wrote:With that set of instructions you're probably not going to get any better timing: you need more work to fill all those holes.

That was just an opening fragment, as a test. I'm trying to figure out how best to take a particular, long set of repeating instruction sequences and interleave them to avoid stalls (plus maybe change some further ADDs to IADDs, to increase the occupancy of IALU2). Once I get the basic pattern right, I can hopefully just keep going. So long as I can get somewhere close to the best possible timing for the given task, I'll be happy. Even if it's simply not possible to get anywhere close to full occupancy of both units.

Ideally there'd be a loop involved, rather than fully unrolling everything. With a loop, I guess I would have to watch out for dependencies between the instructions at the end of the loop and those at the beginning.

Fun times ahead. :D


If the loop uses a branch then not very much because it takes one non-dual-issue instruction (1 full cycle) and then flushes the pipeline (3 full cycles). With hardware loops then of course yes.

Often unrolling 2-4 times will do enough unless it's something really simple or flop-serialised like convolution.
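As a rough sketch of why a small unroll factor is usually enough, the branch cost described above (1 cycle for the branch plus 3 cycles of flush) can be spread over the unrolled copies. The 10-cycle loop body is a made-up number and the function name is illustrative; real figures should come from the event counters.

```python
BRANCH_ISSUE_CYCLES = 1    # the branch itself does not dual-issue
PIPELINE_FLUSH_CYCLES = 3  # flush after the taken branch


def cycles_per_iteration(body_cycles, unroll):
    """Average cost of one original iteration when the body is unrolled."""
    overhead = BRANCH_ISSUE_CYCLES + PIPELINE_FLUSH_CYCLES
    return body_cycles + overhead / unroll


# Hypothetical 10-cycle loop body.
for unroll in (1, 2, 4, 8):
    print(unroll, cycles_per_iteration(10, unroll))
```

Going from 1 to 4 copies removes 3 of the 4 overhead cycles per iteration; going from 4 to 8 only removes another half cycle.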

Setting up for the next loop at the end of the current one can be an easy way to fill scheduling slots and reduce the stalls from the values they create. Usually it doesn't matter if you over-run a read buffer, for instance, and it saves the hassle of creating epilogue code for the edge cases.

Extending that idea a bit further: if the calculation can be separated into completely separate stages they can be "pipelined" instead of unrolled by just interleaving the stages. It works particularly well if each stage only requires results from the previous stage/iteration and not earlier ones.
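The shape of such an interleaved loop can be sketched in Python. The two stages here are hypothetical placeholders; Python gains nothing from this ordering, the point is only where each stage sits relative to the loop body. This version uses an explicit epilogue rather than over-running the input.

```python
def stage_a(x):
    """Hypothetical first stage (e.g. load and scale)."""
    return x * 2


def stage_b(a):
    """Hypothetical second stage; needs only the result of stage_a."""
    return a + 1


def straight(xs):
    """Reference: run both stages back to back for each item."""
    return [stage_b(stage_a(x)) for x in xs]


def pipelined(xs):
    """Same results, with stage A of item i+1 next to stage B of item i."""
    out = []
    if not xs:
        return out
    a = stage_a(xs[0])            # prologue: prime the first stage
    for x in xs[1:]:
        a_next = stage_a(x)       # stage A for the next item ...
        out.append(stage_b(a))    # ... interleaved with stage B for this one
        a = a_next
    out.append(stage_b(a))        # epilogue: drain the last item
    return out
```

Inside the loop the two stage calls do not depend on each other, which is what leaves room to schedule their instructions into each other's stall slots.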
notzed
 
Posts: 331
Joined: Mon Dec 17, 2012 12:28 am
Location: Australia

Re: Are ADD/SUB signed or unsigned operations?

Postby cmcconnell » Sun Aug 31, 2014 7:22 pm

notzed wrote:I would think a full out of order execution pipeline would be a good deal more complex. Possibly the implementation of interrupts has a bearing on this part of the design.

This is well outside my area of expertise, but it seems to me there is no 'out of order' element to what I was considering. It's kind of like an order of precedence choice - Given a sequence of instructions A,B,C, where B will execute on a different unit to A and C, we can choose to treat that as 'A, then (B+C)', or as '(A+B), then C'.

Which is not to say that considering both and choosing the best isn't still too complex to be worthwhile.
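A small sketch of that choice, in Python: it lists every way to group an instruction sequence into issue slots without changing program order. The pairing rule (two adjacent instructions may share a slot only if they use different units) and the unit assignments are simplifying assumptions for illustration, not a description of what the hardware does.

```python
# Hypothetical unit assignment for the A, B, C example.
UNITS = {"A": "ialu", "B": "fpu", "C": "ialu"}


def groupings(names):
    """Yield every in-order grouping of `names` into issue slots.

    A slot holds one instruction, or two adjacent instructions that run on
    different units. Program order is never changed.
    """
    if not names:
        yield []
        return
    # Option 1: issue the first instruction on its own.
    for rest in groupings(names[1:]):
        yield [names[:1]] + rest
    # Option 2: pair it with the next one if the units differ.
    if len(names) >= 2 and UNITS[names[0]] != UNITS[names[1]]:
        for rest in groupings(names[2:]):
            yield [names[:2]] + rest


two_slot = [g for g in groupings("ABC") if len(g) == 2]
print(two_slot)
```

Both two-slot groupings keep A before B before C; they differ only in which neighbour B is paired with.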

notzed wrote:
notzed wrote:The eor's are just waiting for the fmadd results. The fmadd 'stage numbers' displayed should probably just go to 5 for rounding mode because you need 4 whole cycles between "1" of the first and "1" of the dependent instruction in this case.

Ah, suddenly it all makes sense! My problem was the uncertainty I mentioned previously about the precise meaning of 'Cycle Separation'. I was confusing the 4 execution stages with the stated 4-cycle separation. But, of course, your definition must be correct; there is after all a 1-cycle separation when you go the other way - from an IALU to a dependent FPU instruction - and that fits the same definition of a gap between the '1s' (i.e. 1_1234).


I kept thinking 4-instruction staggering and wondering why the hardware counters were showing such poor values. But it's really 5. It's separation, not stride?

Not sure about changing the display to go up to 5, though. I think I was just being a bit dim, and the change might not really be all that necessary. (Plus, to be consistent you would really need to add an extra stage to the IALU instructions too, displaying them as '12'.)


Well not really because the way it's displayed the iop value is usable after "1", but the flop result is ready after "5".

Another interpretation is that iops results are ready at the start of E1 and flops results (rounding mode) are ready at the end of E4 - i.e. shifting everything one to the left. But you can't really represent edges with ASCII and showing the non-E stages looks a bit cluttered.

Once again, I'm slightly confused. I think I need to take a closer look at the documentation, and conduct a few experiments, rather than just hassle you with an endless stream of questions!

But the thinking behind my previous comment was -

The documentation says there is a cycle separation of 4 between an FPU instruction and an IALU instruction, and a cycle separation of 1 between an IALU instruction and an FPU instruction (in both cases, assuming a register dependency).

I wasn't really sure what this means, but by experiment I saw that your tool displays a gap of 1 cycle between the '1' of an IALU instruction and the '1' of a dependent FPU instruction, and also a gap of 1 between the '4' of an FPU instruction and the '1' of a dependent IALU instruction (which equates to a gap of 4 from '1' to '1').

So I thought I had worked out how to relate the ezetime output to the figures in the documentation, but now I'm not so sure.


notzed wrote:I'm aware of it so I guess it's on the TODO list as such. I was thinking yesterday it's probably fairly easy to hack in; I already have a spot to track the FPU mode at issue time although with the delays in mode switching I think the pipeline is flushed between them anyway.

I think this may be another area where I don't yet have all the facts straight. I was working under the assumption that it would not be feasible for you to in any way dynamically track the mode of the FPU, since I believe it is controlled by setting the config register, which could be done in a different object file to the one being analysed. So I presumed a command-line argument would be the only way to handle this.

For the code I will be building, my intention is to compile the C files with the options -mfp-mode=int and -mfp-mode=truncate. But it's not clear to me whether I also need to manually set the config register somewhere during the program's initialisation.

(There's no fp code involved, and I don't want the overhead of switching between modes when I call and return from my asm routine, which depends on the FPU being in IALU2 mode.)

notzed wrote:If the loop uses a branch then not very much because it takes one non-dual-issue instruction (1 full cycle) and then flushes the pipeline (3 full cycles). With hardware loops then of course yes.

Often unrolling 2-4 times will do enough unless it's something really simple or flop-serialised like convolution.

Setting up for the next loop at the end of the current one can be an easy way to fill scheduling slots and reduce the stalls from the values they create. Usually it doesn't matter if you over-run a read buffer, for instance, and it saves the hassle of creating epilogue code for the edge cases.

Extending that idea a bit further: if the calculation can be separated into completely separate stages they can be "pipelined" instead of unrolled by just interleaving the stages. It works particularly well if each stage only requires results from the previous stage/iteration and not earlier ones.

Thanks for the tips. It's probably a bad idea that I am simultaneously trying to learn the general principles of how to optimise assembly code, plus the specifics of how to apply those principles to the Epiphany. Fortunately, this is a hobby project and I'm not working to a deadline. :)
Colin.
cmcconnell
 
Posts: 99
Joined: Thu May 22, 2014 6:58 pm

Re: Are ADD/SUB signed or unsigned operations?

Postby notzed » Mon Sep 01, 2014 12:47 pm

cmcconnell wrote:
notzed wrote:I would think a full out of order execution pipeline would be a good deal more complex. Possibly the implementation of interrupts has a bearing on this part of the design.

This is well outside my area of expertise, but it seems to me there is no 'out of order' element to what I was considering. It's kind of like an order of precedence choice - Given a sequence of instructions A,B,C, where B will execute on a different unit to A and C, we can choose to treat that as 'A, then (B+C)', or as '(A+B), then C'.

Which is not to say that considering both and choosing the best isn't still too complex to be worthwhile.


Well if you had ABCDE and B was on FPU and the rest were not, and B required a 4-cycle stall from a previous instruction:

fpu: ----B-
alu: ACDE

And that is out of order (C starts before B). It was what happened as I developed the model before adding the RA stall-interlock. On the face of it it seems like that is how you would expect the hardware to work for best efficiency, but this doesn't match the output of the hardware counters. And since it doesn't there's probably a good reason.
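The difference between the two behaviours can be shown with a toy scheduler in Python. Everything here is a simplification: one instruction per unit per cycle, a fixed "ready" cycle for B standing in for its 4-cycle stall, and, in the in-order case, the assumption that C may issue in the same cycle as B. The thread only establishes that C does not start before B.

```python
def schedule(instrs, in_order):
    """Return {name: issue_cycle} for a list of (name, unit, ready_cycle).

    in_order=False lets a later instruction issue ahead of a stalled one;
    in_order=True never lets an instruction issue before its predecessor.
    """
    unit_free = {"fpu": 0, "alu": 0}   # next free cycle on each unit
    issued = {}
    prev_issue = 0
    for name, unit, ready in instrs:
        cycle = max(ready, unit_free[unit])
        if in_order:
            cycle = max(cycle, prev_issue)
        issued[name] = cycle
        unit_free[unit] = cycle + 1
        prev_issue = cycle
    return issued


# ABCDE from the post: B is on the FPU and cannot start before cycle 4.
prog = [("A", "alu", 0), ("B", "fpu", 4), ("C", "alu", 0),
        ("D", "alu", 0), ("E", "alu", 0)]

print(schedule(prog, in_order=False))  # matches the diagram above
print(schedule(prog, in_order=True))   # C, D, E wait behind B
```

In the out-of-order case C, D and E issue on cycles 1 to 3 while B waits; in the in-order case they issue on cycles 4 to 6, which is the kind of extra stall the hardware counters showed.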

Once again, I'm slightly confused. I think I need to take a closer look at the documentation, and conduct a few experiments, rather than just hassle you with an endless stream of questions!

But the thinking behind my previous comment was -

The documentation says there is a cycle separation of 4 between an FPU instruction and an IALU instruction, and a cycle separation of 1 between an IALU instruction and an FPU instruction (in both cases, assuming a register dependency).

I wasn't really sure what this means, but by experiment I saw that your tool displays a gap of 1 cycle between the '1' of an IALU instruction and the '1' of a dependent FPU instruction, and also a gap of 1 between the '4' of an FPU instruction and the '1' of a dependent IALU instruction (which equates to a gap of 4 from '1' to '1').

So I thought I had worked out how to relate the ezetime output to the figures in the documentation, but now I'm not so sure.


I think that understanding is correct, or at least matches mine.

The tool is tracking which execution stage in the pipeline model the instruction is 'alive' for. And since there are only E1 ... E4 it only displays 1-4. But a dependent flop doesn't enter E1 until one slot after the E4 of the source instruction because the register being ready is tested at the "edge" of the cycle (and the timing doesn't match the manual or the hardware counters otherwise). I could either 'go to 5' (as I am now, even if I'm just printing a blank instead of a 5), or just not display the '1' for iops - but that isn't very useful.

No doubt the model could be improved.

Finding out yourself saves the confusion of forum posts, but at the cost of having to find out yourself :)

notzed wrote:I'm aware of it so I guess it's on the TODO list as such. I was thinking yesterday it's probably fairly easy to hack in; I already have a spot to track the FPU mode at issue time although with the delays in mode switching I think the pipeline is flushed between them anyway.

I think this may be another area where I don't yet have all the facts straight. I was working under the assumption that it would not be feasible for you to in any way dynamically track the mode of the FPU, since I believe it is controlled by setting the config register, which could be done in a different object file to the one being analysed. So I presumed a command-line argument would be the only way to handle this.

For the code I will be building, my intention is to compile the C files with the options -mfp-mode=int and -mfp-mode=truncate. But it's not clear to me whether I also need to manually set the config register somewhere during the program's initialisation.

(There's no fp code involved, and I don't want the overhead of switching between modes when I call and return from my asm routine, which depends on the FPU being in IALU2 mode.)


Oh yeah - a command line argument would be the way I will do it. I was originally going to snoop the status flags and track the current state at the time the instruction was issued, but that would need most of an emulator and it's not worth it. (Sorry, I wasn't thinking about that detail in the last reply, just that trunc mode was something I was considering from the start.)

You'll have to explicitly set the integer/rounding mode somewhere in your code. Even if the C startup code (something like 'crt0.s' in newlib somewhere) sets it, it won't be what you want.

Thanks for the tips. It's probably a bad idea that I am simultaneously trying to learn the general principles of how to optimise assembly code, plus the specifics of how to apply those principles to the Epiphany. Fortunately, this is a hobby project and I'm not working to a deadline. :)


Well, luckily the Epiphany ISA is about as simple as possible for a dual-pipelined processor, and most of the scheduling rules are too. And it's got a lot of registers, so you likely don't have to deal with optimising the working set, which is just not very fun.

Just as well you have no deadline; it can be a very effective time-sink!
