Parallella Community

by **solardiz** » Thu Apr 11, 2013 10:06 am

Weighing benefit against cost, here's my current proposal for the most important additions, removals, and changes:

Add destination register update modes, likely implemented in or near the register file (not in IALU/FPU). As a minimum, have:

0 - direct write (as is done now)
1 - XOR with current value in the register

Much better:

00 - direct write
01 - XOR
10 - NOT (ignores the current value in the register)
11 - ADD/ADC (depending on what instruction it's bundled with)

XOR and NOT can possibly be viewed as specialized versions of ADD/ADC, reusing a portion of the adder circuitry (XOR disables the carries, NOT also forces one operand to all 1's).

Alternatively, instead of NOT:

10 - SUB/SBB

If we can afford even more (luxury):

000 - direct write
001 - XOR
010 - NOT
011 - ADD/ADC
100 - AND
101 - OR
110 - reverse SUB/SBB (operands swapped) or AND-NOT
111 - SUB/SBB

For "reverse SUB/SBB (operands swapped) or AND-NOT", we can implement either one, or we can have this choice made based on whether the instruction is a load or arithmetic (implement reverse SUB/SBB for these) vs. bitwise (implement AND-NOT for these). In AND-NOT, the "NOT" should be applied to the instruction output, and then it should be AND'ed with the previous RD contents.
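To make the 3-bit table concrete, here is a minimal Python sketch of how the register file might combine an instruction's result with the old RD value. The function name `apply_update_mode` is made up for illustration. It assumes that SUB means "old RD minus result" (which matches IMSUB being obtainable as IMUL,sub), that mode 110 is the AND-NOT variant, and it leaves out the carry/borrow inputs of ADC/SBB.

```python
MASK32 = 0xFFFFFFFF

def apply_update_mode(mode, old_rd, result):
    """Combine an instruction result with the old RD value (hypothetical model)."""
    if mode == 0b000:      # direct write
        new = result
    elif mode == 0b001:    # XOR with current RD
        new = old_rd ^ result
    elif mode == 0b010:    # NOT: invert the result, ignore old RD
        new = ~result
    elif mode == 0b011:    # ADD (carry input not modeled)
        new = old_rd + result
    elif mode == 0b100:    # AND
        new = old_rd & result
    elif mode == 0b101:    # OR
        new = old_rd | result
    elif mode == 0b110:    # AND-NOT: NOT applied to the result, then AND with old RD
        new = old_rd & ~result
    elif mode == 0b111:    # SUB, assumed to be old RD minus result
        new = old_rd - result
    else:
        raise ValueError("mode must fit in 3 bits")
    return new & MASK32

# IMUL,add behaving like IMADD: RD = RD + RN * RM
rd = apply_update_mode(0b011, 10, (3 * 4) & MASK32)
```

With only the 1-bit or 2-bit variants, the same model applies with the unused branches removed.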

These extra bits, 1 to 3 of them depending on which approach we choose, would need to be encoded in integer instructions that have a destination register and in loads. For the loads, we may either introduce extra opcodes (e.g., just one more group of load opcodes if we only support 0/1 for direct/XOR) or we'd have to reduce the maximum displacement - well, or we may choose not to support these modes in loads with displacements (but support them in other kinds of load instructions).

The XOR mode alone is surprisingly capable. Even with no other changes (no added instructions), it lets us implement bitselect in 2 instead of 3 instructions. bitselect may be written not only in the obvious form:

c = (a & sel) | (b & ~sel)

but also:

c = ((a ^ b) & sel) ^ b

If we're constrained to 3-register instructions anyway, we'll put b in RD, a in RN and sel in RM, which gives:

RD ^= (RN ^ RD) & RM

Notice the outer XOR, which is implementable via the register update mode. Additionally, depending on where the inputs came from, the inner XOR may also be implementable as a previous instruction's destination register update mode.
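As a quick sanity check of the identity above, this Python sketch compares the obvious form, the XOR form, and the in-place form with b living in RD. The function names are illustrative only.

```python
MASK32 = 0xFFFFFFFF

def bitselect_obvious(a, b, sel):
    # c = (a & sel) | (b & ~sel)
    return ((a & sel) | (b & ~sel)) & MASK32

def bitselect_xor(a, b, sel):
    # c = ((a ^ b) & sel) ^ b
    return (((a ^ b) & sel) ^ b) & MASK32

def bitselect_in_place(rd, rn, rm):
    # RD ^= (RN ^ RD) & RM, with b in RD, a in RN, sel in RM
    inner = (rn ^ rd) & rm      # what the instruction itself computes
    return (rd ^ inner) & MASK32  # outer XOR done by the XOR update mode

a, b, sel = 0xDEADBEEF, 0x01234567, 0x00FFFF00
c = bitselect_in_place(b, a, sel)
```

The equivalence holds per bit: where sel is 1 the two XORs with b cancel and leave a, and where sel is 0 only b survives.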

XOR update mode also gives us load-XOR (for ciphers and e.g. RAID parity computation), and 3-input XOR-XOR (actually desirable in lots of places!), AND-XOR, OR-XOR (for ciphers, including reducing instruction count in bitslice implementations).

The NOT update mode gives us AND-NOT, XOR-NOT, OR-NOT - that is, three additional 2-input bitwise ops, which are present on many RISC archs (and AND-NOT is also present on x86's MMX and SSE*). These are handy for implementations of ciphers, and for a bit more.

The ADD/ADC update mode is really powerful. When combined with 32-bit loads, it should work as ADD (no carry input) - we get 1-cycle load-adds. With 64-bit loads, it should work as 32-bit ADD for the first register and as 32-bit ADC for the other, so we get 1-cycle 64-bit load-add. :-)
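A minimal Python model of the 64-bit load-add just described, assuming the low word goes to the first register of the pair and the high word to the second. `load_add64` is a made-up name, not an existing instruction.

```python
MASK32 = 0xFFFFFFFF

def load_add64(rd_lo, rd_hi, mem_lo, mem_hi):
    """Model a 64-bit load with ADD/ADC update mode on a register pair."""
    lo = rd_lo + mem_lo              # 32-bit ADD on the first register
    carry = lo >> 32                 # carry out of bit 31
    hi = rd_hi + mem_hi + carry      # 32-bit ADC on the second register
    return lo & MASK32, hi & MASK32

# Accumulating 0x00000000_00000001 into 0x00000000_FFFFFFFF
lo, hi = load_add64(0xFFFFFFFF, 0x00000000, 0x00000001, 0x00000000)
```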

With arithmetic instructions, it should work as ADD (e.g., we can add 3 32-bit numbers in one instruction, or subtract two and add a third).

The SUB/SBB update mode is similar, allowing for 32-bit load-sub, 64-bit load-sub, and more.

The luxury AND, OR, AND-NOT update modes give us 3-input bitwise ops (beyond XOR-XOR), as well as load-AND, load-AND-NOT, and load-OR. This can be handy for speeding up crypto, although for other uses perhaps reverse SUB/SBB (instead of just one of these luxury modes) is more important (if we have to make this choice, although maybe we can have both, with one of them chosen by instruction category).

Now to instruction additions/removals/changes:

Replace ADD with ADD_ROT (with 5-bit immediate rotate count). If we have the XOR update mode, we don't need an ARX instruction - rather, we obtain it as ADD_ROT,xor. Indeed, we also obtain simple ADD and simple rotate in the obvious ways (need a zeroed register for the latter, though).
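Here is a small Python sketch of the semantics this seems to imply: the sum is rotated by the immediate, and the XOR update mode then folds it into RD. Rotating left and rotating after the add are assumptions; the post does not pin down either.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

def add_rot(rn, rm, n):
    # ADD_ROT RD,RN,RM,n (assumed): RD = rotl(RN + RM, n)
    return rotl32((rn + rm) & MASK32, n)

def add_rot_xor(rd, rn, rm, n):
    # ADD_ROT,xor: one ARX step, RD ^= rotl(RN + RM, n)
    return rd ^ add_rot(rn, rm, n)

plain_add = add_rot(5, 7, 0)              # rotate count 0 gives simple ADD
plain_rot = add_rot(0x80000001, 0, 1)     # zeroed RM gives simple rotate
arx = add_rot_xor(0xFF, 1, 2, 4)          # add, rotate, XOR in one instruction
```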

If we've added ADD update mode, drop the IMADD instruction. We'll obtain it as IMUL,add.

If we've added SUB update mode, drop the IMSUB instruction. We'll obtain it as IMUL,sub.

The ability to drop separate IMADD/IMSUB instructions makes me more confident that we can implement the ADD/SUB update modes without having to lower the clock rate. We also free up some opcode space for other instructions, such as:

Add XOR_AND, which will compute:

RD = (RN ^ RD) & RM

and be usable as bitselect when invoked with XOR update mode, (XOR_AND,xor).

Add FXOR, a XOR on the FPU, primarily to be usable to trigger the update modes via the FPU, for dual-issue. This way, the peak performance will be 4 integer+bitwise ops per cycle (at least one of them being XOR), or 3 (can be e.g. 3 integer adds). Arguably, FIADD (integer addition on FPU) would be of greater GP use (can do up to 4 integer adds/cycle then), but FXOR is also usable as MOV (when the two inputs are set to the same register), which lets us access any of the update modes as separate instructions via the FPU (to dual-issue them along with anything we do on IALU).

In fact, maybe XOR_AND and FXOR can be one and the same instruction, with a flag bit in it (enabling/disabling the AND). XOR_AND would probably be in the FPU anyway since it needs 3 inputs.

In load/store instructions, have the index treated as such, and not as byte offset. I guess the extra mux that Andreas was referring to would be incurred if both index and byte offset modes were to be supported; I'd be fine with only the index mode being supported, so the mux would be avoided then?
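To illustrate the difference, a short Python sketch of the two address computations. The function names are made up, and the scaling by access size is my reading of "index treated as such".

```python
MASK32 = 0xFFFFFFFF

def address_byte_offset(base, rm):
    # RM used as a byte offset
    return (base + rm) & MASK32

def address_index(base, rm, size_bytes):
    # Proposed: RM used as an element index, scaled by the access size
    return (base + rm * size_bytes) & MASK32

i = 3
# Byte-offset mode needs a separate shift to turn an index into an offset
with_shift = address_byte_offset(0x1000, i << 2)
# Index mode would fold that shift into the load/store itself
direct = address_index(0x1000, i, 4)
```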

Note that none of this increases the instruction count; it may even decrease it by one. We're dropping two (IMADD, IMSUB), adding two (XOR_AND, FXOR) or just one (XOR_AND with a flag bit that enables or disables the AND portion), and replacing one (ADD with ADD_ROT).

I am no expert, but my gut feeling is that all of the above, except maybe for things marked luxury, should be implementable with under 10% cost (spent mostly on the register update modes), but the benefits are huge: up to 4x speedup for both GP integer and crypto code, and some speedup (likely in excess of the under 10% cost) for much floating-point code as well (due to faster computation of array indices, loop control logic, and such).

Other instruction additions that I had mentioned before may be considered as well, but compared to the changes proposed above they have lower benefit/cost.

by **solardiz** » Thu Apr 11, 2013 10:37 am

Here's how endianness swap will be implementable:

LSL RD,RN,8
AND tmp,RN,maskFF00FF00
AND RD,RD,maskFF00FF00
LSR,xor RD,tmp,8
ADD_ROT RD,RD,zero,16

where maskFF00FF00 and zero are registers filled in with these constants prior to the loop that needs fast endianness swap. This is 2 instructions fewer than we need now (the ,xor would be a separate instruction now, and we'd need 2 instructions for the rotate, assuming that we can use LSR and IMADD there).
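The 5-instruction sequence can be checked with a short Python simulation, one statement per instruction. It assumes ADD_ROT rotates the sum of its two source registers (for a count of 16 the rotate direction does not matter).

```python
MASK32 = 0xFFFFFFFF

def rot32(x, n):
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

def bswap_with_update_modes(rn):
    mask = 0xFF00FF00                        # maskFF00FF00, set up before the loop
    zero = 0                                 # zero register, set up before the loop
    rd = (rn << 8) & MASK32                  # LSL RD,RN,8
    tmp = rn & mask                          # AND tmp,RN,maskFF00FF00
    rd = rd & mask                           # AND RD,RD,maskFF00FF00
    rd = rd ^ (tmp >> 8)                     # LSR,xor RD,tmp,8
    rd = rot32((rd + zero) & MASK32, 16)     # ADD_ROT RD,RD,zero,16
    return rd

swapped = bswap_with_update_modes(0xAABBCCDD)
```

After the fourth instruction the bytes are swapped within each 16-bit half; the final rotate by 16 swaps the halves.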

Alternatively, if we don't need to preserve the mask register's contents:

AND tmp,RN,maskFF00FF00
LSL,and maskFF00FF00,RN,8
LSR,xor maskFF00FF00,tmp,8
ADD_ROT RD,maskFF00FF00,zero,16

This is 4 instructions, or 3 fewer than we need now.

If we also add 2x16-bit SIMD instructions (not proposed in the previous comment, but may be part of another extension):

ADD_ROT RD,RN,zero,16
ADD_ROT16 RD,RD,zero,8

That's 2 instructions instead of 7, or 3.5x reduction (and no pre-initialized register value is trashed). :-)
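A Python sketch of the 2-instruction version, assuming ADD_ROT16 adds the two 16-bit lanes independently (no carry between lanes) and then rotates each lane by the immediate. These lane semantics are my guess at what a 2x16-bit extension would do.

```python
MASK32 = 0xFFFFFFFF

def add_rot(rn, rm, n):
    s = (rn + rm) & MASK32
    n &= 31
    return ((s << n) | (s >> (32 - n))) & MASK32

def add_rot16(rn, rm, n):
    """Assumed ADD_ROT16: per-lane 16-bit add, then per-lane rotate."""
    n &= 15
    out = 0
    for shift in (0, 16):
        lane = (((rn >> shift) & 0xFFFF) + ((rm >> shift) & 0xFFFF)) & 0xFFFF
        lane = ((lane << n) | (lane >> (16 - n))) & 0xFFFF
        out |= lane << shift
    return out

def bswap_simd(rn):
    zero = 0
    rd = add_rot(rn, zero, 16)     # ADD_ROT RD,RN,zero,16: swap the halves
    rd = add_rot16(rd, zero, 8)    # ADD_ROT16 RD,RD,zero,8: swap bytes in each half
    return rd

swapped = bswap_simd(0xAABBCCDD)
```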

If we do add a 2x16-bit SIMD extension, it will need to have an ADD16, so it'd be natural to include ADD_ROT16 instead.

by **timpart** » Thu Apr 11, 2013 12:43 pm

by **ysapir** » Thu Apr 11, 2013 11:00 pm

by **Hoernchen** » Fri Apr 12, 2013 10:11 am

by **solardiz** » Mon Apr 15, 2013 2:21 pm

by **solardiz** » Mon Apr 15, 2013 7:23 pm

In terms of where to fit my proposed additions into the existing opcode table:

The 3 bits of destination register update mode may be put into bits 20-22, which works (and makes sense) for: ADD, SUB, AND, ORR, EOR, ASR, LSR, LSL, IADD, ISUB, IMUL, IMADD, IMSUB, FIX, MOV<COND>, BITR.
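For illustration only, here is how a tool might pack and unpack a 3-bit update mode in bits 20-22 of a 32-bit instruction word. The helper names are hypothetical and the rest of the word is treated as opaque; this is not an Epiphany encoder.

```python
UPDATE_MODE_SHIFT = 20
UPDATE_MODE_MASK = 0b111 << UPDATE_MODE_SHIFT   # bits 20-22

def set_update_mode(word, mode):
    """Return the 32-bit instruction word with bits 20-22 set to mode."""
    if not 0 <= mode <= 0b111:
        raise ValueError("update mode must fit in 3 bits")
    return (word & ~UPDATE_MODE_MASK & 0xFFFFFFFF) | (mode << UPDATE_MODE_SHIFT)

def get_update_mode(word):
    return (word & UPDATE_MODE_MASK) >> UPDATE_MODE_SHIFT

# A word with the field left at 000 decodes as direct write, i.e. today's behavior
xor_variant = set_update_mode(0x00000000, 0b001)
```

If those bits are currently zero in existing binaries, mode 000 as direct write keeps the old meaning of every existing instruction word.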

For MOV<COND>, ideally we need the update mode to apply only when the move does take place, and be skipped otherwise. We're less interested in having it applied even if the destination register's value is written back to the register (for no change) when the condition is false (although for some of the update modes, such semantics are useful too). If our preferred semantics are not easily implementable (that is, if MOV<COND> is currently implemented as always updating the destination register, with either its own value or the source register's value), then instead of supporting the update mode feature in MOV<COND> we may use those free bits to support an equivalent of C's ternary operator ?: in MOV<COND>(32). Unfortunately, this would mean either deviating from the encoding of the 16-bit version and from B<COND>'s in terms of where the condition bits are encoded, or encoding the register in an unusual place. If none of this is easily implementable, then I'm OK with having the update mode apply unconditionally to the value written even if it's been taken from the destination register itself (this is somewhat useful, e.g. with AND and OR it's the same as having it applied only if the condition is true, with XOR it means "zeroize or move", with ADD it means "double or move", with NOT it means "conditionally move and invert"). I'm also OK with leaving MOV<COND> alone for now; it would have been nice to enhance it while we're similarly enhancing other instructions, but the benefit here is relatively small.
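The fallback semantics (update mode applied unconditionally to whatever value is written) can be modeled in a few lines of Python. This is only a sketch of that one reading; `movcond_always_update` is a made-up name.

```python
MASK32 = 0xFFFFFFFF

MODES = {
    "and": lambda old, new: old & new,
    "or":  lambda old, new: old | new,
    "xor": lambda old, new: old ^ new,
    "add": lambda old, new: old + new,
    "not": lambda old, new: ~new,       # old RD ignored
}

def movcond_always_update(mode, cond, rd, rn):
    # MOV<COND> modeled as always writing: RN if the condition holds, else RD itself
    written = rn if cond else rd
    return MODES[mode](rd, written) & MASK32

rd, rn = 0xF0, 0x3C
unchanged = movcond_always_update("and", False, rd, rn)   # RD & RD leaves RD alone
zeroized = movcond_always_update("xor", False, rd, rn)    # RD ^ RD clears RD
doubled = movcond_always_update("add", False, rd, rn)     # RD + RD doubles RD
inverted = movcond_always_update("not", False, rd, rn)    # ~RD
```

When the condition is true, each mode simply combines RD with RN as it would for an unconditional MOV with that update mode.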

In load instructions, unfortunately bits 20-22 are in use, but we have bits 16-19 free - so it may be best to move the 3 bits currently in 20-22 to 16-18, so that we can put the new update mode bits in 20-22 (same as in other instructions). Bit 19 may be used to enable/disable carry from data bit 31 to 32 on 64-bit load-add and load-sub (in other words, on loads we'll be able to specify up to 16 update modes, in bits 19-22, and we'll initially define and implement 10 modes).

Unfortunately, there are not enough free bits to encode a 5-bit rotate count in the existing ADD instruction (for ADD_ROT). However, there are exactly 6 free bits to encode another register in LSR(IMM) and LSL(IMM). Luckily, the proposed SHRD/"bit align" instruction (shall we call it LSRD for consistency with existing Epiphany mnemonics?) makes both LSR(IMM)(32) and LSL(IMM)(32) redundant (as long as there's a register containing zero), which gives us 7 bits to play with. LSR(IMM)(32) and LSL(IMM)(32) may be replaced with LSRD, the RM register may be put in its usual place (replacing the bits currently holding S2, S3, S4 and using 3 free bits at 23-25). Shift count bits S0, S1, S2 may be put in bits 4-6, S3 and S4 in 21 and 22. This leaves us bit 20 for the update mode, and it'd need to function as direct write (0) or XOR (1) here (this gives us fused rotate-XOR as needed for many symmetric crypto primitives).

It is a bit nasty that encoding LSR(IMM)(16) and LSL(IMM)(16) then has the shift count in different bit positions than LSRD does. To reduce this discrepancy, I propose to move the shift count in these instructions' encoding one bit position to the right, from bits 5-9 to bits 4-8. The LSR/LSL choice bit is then moved from 4 to 9. This way, S0, S1, and S2 are in the same place as they are in LSRD, with only S3 and S4 encoded differently. (I think this is better than having to encode RM differently.)

At assembly mnemonics level, we'll need to implement alias instructions (we already have one: RTS) that will map LSR and LSL with immediate shift count and at least one of the registers with number outside of the 0 to 7 range, to LSRD. It is nasty that this requires a register with zero in it. When explicitly programming with LSRD, this is not a big deal (the programmer will know to keep such a register, initializing it outside of any performance-critical loop). When adhering to an ABI, that ABI can stipulate that e.g. R31 (one of the registers currently described as "reserved for constants") must be kept at 0, and may be assumed to be 0 in compiler-generated code. I think this is fine. We merely need to include this requirement in the description of the LSR and LSL alias instructions. However, if we really want LSR and LSL to appear unrestricted to the first 8 registers even when operating outside of the ABI constraints, we may spend R31 on a read-as-zero, write-discard register, like many other RISC architectures have (and they incur a bigger hit from it since they usually have only 32 registers total, not 64). I don't know if treating writes into R31 specially would introduce a performance hit, though. My preference would be to stipulate its special status at ABI level only, not in hardware.
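To show why a zero register is enough, here is a Python sketch of the aliasing, assuming LSRD is an x86 SHRD-style funnel shift that returns the low 32 bits of a 64-bit value (high:low) shifted right by the count. The operand order and the names `lsrd`, `lsr_alias`, `lsl_alias` are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF

def lsrd(low, high, n):
    """Assumed LSRD: low 32 bits of ((high << 32) | low) >> n, n in 0..31."""
    return (((high << 32) | low) >> (n & 31)) & MASK32

def lsr_alias(x, n, zero=0):
    # LSR x,n: zeros are shifted in from the high word
    return lsrd(x, zero, n)

def lsl_alias(x, n, zero=0):
    # LSL x,n for n in 1..31: x in the high word, shifted right by 32-n
    if n & 31 == 0:
        return lsrd(x, zero, 0)   # a shift by 0 is just a move
    return lsrd(zero, x, 32 - n)

top_bit = lsl_alias(1, 31)
low_bit = lsr_alias(0x80000000, 31)
```

With different source registers in the two operand slots the same instruction gives a bit-align, and with the same register in both it gives a rotate.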

My proposed XOR-AND instruction (to form bitselect when used along with the XOR update mode), or rather IEOR_AND using naming consistent with Epiphany's current mnemonics (how confusing), may use the opcode of e.g. FABS when the FPU is operating in integer mode (it needs to be on the FPU anyway, to access 3 input registers). My proposed XOR-on-FPU (for dual-issue) may use the opcode of e.g. FLOAT when the FPU is operating in integer mode. Its Epiphany-consistent mnemonic would be IEOR. (FIX would then remain available even in integer mode, so that existing floating-point data may be converted and processed.)

My proposal for treating the index in load and store as literally index and not byte offset does not require any change to the instruction encodings - only to the semantics. Alternatively, if we want to introduce it early, before we'd break binary compatibility for other reasons anyway, we may use a bit in the 16 to 19 range to request the new treatment of index.

Summary: the most important things I proposed in previous comments appear to be implementable with very little change to the instruction encodings. A subset of them is even implementable without breaking binary compatibility (an update mode of up to 3 bits on arithmetic instructions, the addition of the IEOR_AND and IEOR instructions if FABS and FLOAT were not currently defined as supported in the FPU's integer mode, and the new treatment of index on loads). The effect of these changes on the performance of integer and especially symmetric crypto code is huge (a speedup of several times). The effect on floating-point code is probably also positive, due to speeding up the integer portions (array indices, loop control).


Wishlist for next revision of Epiphany ISA
