Parallella Community

by **solardiz** » Mon Apr 15, 2013 11:41 pm

Regarding turning R31 into a read-as-zero, write-discard register: it may not be too bad an idea, given that there are no comparison instructions, so a register is probably wasted on them anyway (as the destination register for SUB/ISUB/FSUB when these are used for comparison). Without special status in hardware, an ABI can't both keep R31 at zero and use it as the SUB/ISUB/FSUB destination in comparisons - it has to use another register there (maybe without specifying which, letting the programmer/compiler decide on a case-by-case basis, but that's an occasional waste of an extra register anyway). So if we need a register holding zero, then making R31 special in hardware may actually save us a register.

by **solardiz** » Tue Apr 16, 2013 1:23 am

by **solardiz** » Tue Apr 16, 2013 2:47 am

For a simple example, let's see how MD5 is implementable. Its step function is:

#define STEP(f, a, b, c, d, x, t, s) \
(a) += f((b), (c), (d)) + (x) + (t); \
(a) = (((a) << (s)) | (((a) & 0xffffffff) >> (32 - (s)))); \
(a) += (b);

where f() is one of:

#define F(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
#define G(x, y, z) ((y) ^ ((z) & ((x) ^ (y))))
#define H(x, y, z) ((x) ^ (y) ^ (z))
#define I(x, y, z) ((y) ^ ((x) | ~(z)))

Notice that F() and G() correspond to bitselect(). Additionally, due to symmetry in H(), one of the two XOR results may in 8 of the 16 steps using H() be reused for the next step, so a double-XOR instruction is of slightly less help than it would be otherwise. Typically, x is in memory, t is a 32-bit constant (may be in memory or pre-loaded into a register outside of the loop for a subset of the steps - we have 64 steps, so can't do it for all), s is a small constant, the rest of the inputs to the STEP macro are in registers.
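To spell out the bitselect() correspondence, here is a minimal C sketch. The helper names (bitselect, md5_f, md5_g, check_bitselect_forms) are made up for illustration, and the argument order of bitselect() is an assumption (mask first); an actual instruction could order its operands differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: per bit, take a where mask is 1, else b. */
static uint32_t bitselect(uint32_t mask, uint32_t a, uint32_t b)
{
    return (a & mask) | (b & ~mask);
}

/* Same expressions as the F() and G() macros above. */
static uint32_t md5_f(uint32_t x, uint32_t y, uint32_t z)
{
    return z ^ (x & (y ^ z));
}

static uint32_t md5_g(uint32_t x, uint32_t y, uint32_t z)
{
    return y ^ (z & (x ^ y));
}

/* Checks both identities for one set of inputs. */
static void check_bitselect_forms(uint32_t x, uint32_t y, uint32_t z)
{
    assert(md5_f(x, y, z) == bitselect(x, y, z)); /* x picks y over z */
    assert(md5_g(x, y, z) == bitselect(z, x, y)); /* z picks x over y */
}
```

So F() is a select controlled by x and G() is the same select with the operands rotated, which is why a single bitselect instruction would cover both.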

For now, let's take F() and assume that our t is in memory. On current Epiphany, I think the code may be something like this:

LDRW R0,[x]
LDRW R1,[t]
; or maybe we can do LDRD R0,[x] if we can somehow intermix x and t
EOR R2,c,d
AND R2,R2,b
EOR R2,R2,d
ADD R2,R2,a
IADD R0,R0,R1
ADD R0,R0,R2
LSR R1,R0,(32-s)
MOV R3,(1<<s) ; can hopefully do it outside of the loop, not here
IMADD R1,R0,R3
ADD a,R1,b

This is 10 to 12 instructions.
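The rotate in the listing above is done with LSR plus a multiply-add instead of two shifts and an OR. A small C model of that trick, assuming IMADD computes Rd += Rn * Rm modulo 2^32 (the function names are made up for this sketch):

```c
#include <assert.h>
#include <stdint.h>

/* Reference rotate-left, as in the second line of the STEP macro. */
static uint32_t rotl_shift_or(uint32_t a, unsigned s)
{
    assert(s >= 1 && s <= 31);
    return (a << s) | (a >> (32 - s));
}

/* Model of the LSR / MOV / IMADD sequence. */
static uint32_t rotl_mul_add(uint32_t a, unsigned s)
{
    assert(s >= 1 && s <= 31);
    uint32_t acc = a >> (32 - s);     /* LSR   R1,R0,(32-s)           */
    uint32_t mul = (uint32_t)1 << s;  /* MOV   R3,(1<<s)              */
    acc += a * mul;                   /* IMADD R1,R0,R3 (assumed +=)  */
    return acc;
}
```

The addition can stand in for the OR because a * (1 << s) has its low s bits clear, while a >> (32 - s) fits entirely in those low s bits, so no carry is ever produced.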

Here's how we can optimize this using some of the proposed enhancements to the ISA:

LDRW R0,[x]
LDRW R1,[t]
; or maybe we can do LDRD R0,[x] if we can somehow intermix x and t
MOV R2,d
EOR_AND,EOR R2,c,b ; alternatively, could reuse d here and use R2 in place of d further
ADD,ADD R2,a,R0
IADD R2,R2,R1
MOV a,b
LSRD,ADD a,R2,R2,(32-s) ; alternatively, could reuse b here, and swap a and b in the following code

This is 7 or 8 instructions. Two of them are MOV - would be nice to have them for free! Let's take it further:

LDRW,ADD a,[x]
MOV R2,d
LDRW,ADD a,[t]
EOR_AND,EOR R2,c,b ; alternatively, could reuse d here and use R2 in place of d further
ADD R2,R2,a
MOV a,b
LSRD,ADD a,R2,R2,(32-s) ; alternatively, could reuse b here, and swap a and b in the following code

Again 7 instructions, 2 of them MOV. This is like 5 instructions if the moves are free.

If the moves are non-free, we have less incentive to use the complicated instruction forms here. We can do:

LDRW,ADD a,[x]
MOV R2,d
LDRW,ADD a,[t]
EOR_AND,EOR R2,c,b ; alternatively, could reuse d here and use R2 in place of d further
ADD a,a,R2
LSRD a,a,a,(32-s)
ADD a,a,b

Also 7 instructions, but only 1 of them is MOV. If we consider making MOVs free (maybe at a later time), then the version with more MOVs is better.

ADD_ROT would save an instruction here.
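For readers following the LSRD lines: the listings use it as a rotate by passing the same register twice. The C model below assumes LSRD is a funnel shift, i.e. it shifts the 64-bit concatenation of its two source registers right and keeps the low 32 bits. That semantics and the function names are assumptions of this sketch, not a definition taken from this post.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed LSRD semantics: shift the pair hi:lo right by n, keep low 32 bits. */
static uint32_t lsrd(uint32_t hi, uint32_t lo, unsigned n)
{
    assert(n < 32);
    uint64_t pair = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(pair >> n);
}

/* LSRD a,R2,R2,(32-s): both sources are the same register, giving a rotate. */
static uint32_t rotl_via_lsrd(uint32_t x, unsigned s)
{
    assert(s >= 1 && s <= 31);
    return lsrd(x, x, 32 - s);
}

/* LSRD,ADD a,R2,R2,(32-s) with a preloaded with b (the ADD update mode). */
static uint32_t rotl_then_add(uint32_t a, uint32_t x, unsigned s)
{
    return a + rotl_via_lsrd(x, s);
}
```

Under these assumptions the last two lines of the STEP macro collapse into one instruction, which is where the saving over the LSR/MOV/IMADD/ADD sequence comes from.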

In all of these examples, I assume that possible stalls can be addressed by interleaving instructions from many instances of MD5 together (we're talking multi-core parallelism anyway, so we can as well have more of it per core). Also, more of the ADDs may be replaced with IADD as optimal.

And now a version using 64-bit load-ops for 2x32-bit SIMD computation:

LDRD,ADD32 a,[x]
LDRD,ADD32 a,[t]
LDRD R2,[c]
LDRD,EOR R2,[d]
LDRD,AND R2,[b]
LDRD,EOR R2,[d]
LDRD,ADD32 R2,[a]
LSRD a,R2,R2,(32-s)
LSRD a+1,R3,R3,(32-s)
LDRD,ADD32 a,[b]

10 instructions (8 loads, 2 IALU) for two instances of MD5. :-)

Notice that all of these instructions fit in the not-too-invasive instruction encoding changes that I proposed above. Here, the "[register]" notation refers to the memory-mapped address of the register (is such use of loads currently OK?), and indeed all registers are meant to be even-numbered. Similarly, "a+1" refers to the next register after "a". This is not real assembly source; the correct syntax for these things will need to be substituted. Also, the addresses of memory-mapped registers will need to be preloaded into other registers before the loop (or at least the base address will need to be preloaded, and then we'd need support for at least some limited displacements along with load-ops - looks practical to me). For actual implementation, this 2x SIMD implementation would probably need to be intermixed with a non-SIMD one above, so the loads will dual-issue along with IALU and FPU instructions there. There happen to be exactly two loads in the non-SIMD implementation, so they may be issued along with the LSRD instructions in the SIMD implementation (where the load/store unit would otherwise be idle). Even more instances may be intermixed (e.g., two SIMD and two scalar, computing a total of 6 MD5s in parallel) to hide instruction latencies.
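To make the intended lane behaviour of the 2x32-bit version concrete, here is a minimal C model. It assumes that ADD32 adds the two 32-bit halves of a 64-bit value independently, with no carry from the low half into the high half, and that the two LSRD lines rotate each half on its own. The names add32x2 and rotl32x2 are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ADD32 update mode: two independent 32-bit adds in one 64-bit value. */
static uint64_t add32x2(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a + (uint32_t)b;                  /* wraps, no carry out */
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32);
    return ((uint64_t)hi << 32) | lo;
}

/* The pair of LSRD lines: rotate each 32-bit half left by s. */
static uint64_t rotl32x2(uint64_t v, unsigned s)
{
    assert(s >= 1 && s <= 31);
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)(v >> 32);
    lo = (lo << s) | (lo >> (32 - s));
    hi = (hi << s) | (hi >> (32 - s));
    return ((uint64_t)hi << 32) | lo;
}
```

Each half then carries one MD5 instance, so a register pair such as a and a+1 holds the same state variable for two independent computations.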

As it happens, in the two latest code revisions above I am using the ADD update mode on loads only. So only two extra 32-bit adders are being made use of, out of a maximum of four extra 32-bit adders needed per my earlier proposal. Maybe in practice it'd be sufficient to have 3 of the 4 (e.g. two for loads and one for either IALU or FPU, but not for both - supporting only bitwise update modes on top of the other unit's instructions).

On a related note, loads from memory-mapped registers are very powerful, if they are in fact supported and are fast - are they? Even without load-ops, such a 64-bit load can be used to replace two MOVs. Will it currently do the trick of copying two registers to two other registers in 1 cycle (in terms of throughput, not latency), which with MOVs would be two cycles?

by **solardiz** » Tue Jul 15, 2014 2:03 am

by **xilman** » Tue Jul 15, 2014 1:51 pm

Here's my tuppence worth. Bear in mind that I'm a computational number theory geek with interests in crypto. Numerous forms of asymmetric crypto would benefit from the following.

1) 64*64 -> 128 unsigned multiply.
1a) 64*64+64 -> 128 MAC would be even nicer.
2) 128/64 -> 64 quotient and 64 remainder.
3) 64-bit unsigned add with carry in and carry out.
4) 64-bit unsigned subtract with borrow in and borrow out.
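As a reference for what items 1) and 3) would compute, here is a portable C sketch built from 32-bit pieces. The type and function names (u128, mul64x64, adc64) are made up for illustration; this is only a model of the arithmetic, not a proposed encoding.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;          /* 128-bit result as two halves */
typedef struct { uint64_t sum; unsigned carry; } adc_result;

/* Item 1: 64*64 -> 128 unsigned multiply via four 32x32 -> 64 products. */
static u128 mul64x64(uint64_t a, uint64_t b)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    /* Sum of three values below 2^32 each, so it cannot overflow 64 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    u128 r;
    r.lo = (mid << 32) | (uint32_t)p00;
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}

/* Item 3: 64-bit add with carry in and carry out. */
static adc_result adc64(uint64_t a, uint64_t b, unsigned carry_in)
{
    assert(carry_in <= 1);
    adc_result r;
    uint64_t s = a + b;
    r.carry = s < a;            /* carry out of a + b */
    r.sum = s + carry_in;
    r.carry |= r.sum < s;       /* carry out of adding carry_in */
    return r;
}
```

Chaining adc64 across limbs gives multi-precision addition, and mul64x64 plus adc64 together form the inner loop of schoolbook big-number multiplication, which is the kind of code these instructions would speed up.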

by **notzed** » Wed Jul 30, 2014 12:39 pm

Here's a few ideas in no particular order. Some have been mentioned here or in other threads already. I've considered how it fits with a 16-bit instruction set where applicable.

1) big +1 for a bitselect. Whilst not ideal, I think a 3-register version would still be useful and fits in 16 bits.

2) exponent modifier for float and fix. They already seem to have spare bits for rm for this.

3) fcmpCC and cmpCC instructions which send results to all bits of a register, as here: viewtopic.php?f=43&t=1550#p9621 or vcmp from ARMv7. Ideally 3 reg + Cx bits but this doesn't fit the instruction layout and wouldn't support 16-bit opcodes. Even a 2-reg version would help.

fcmp could use the same mode-selection mechanism as in 6) below so also handle integers but a separate single-cycle version that runs on the ialu pipeline would also be useful.

4) small immediate size/shift multiplier for register-indexed/pm load/stores using bits [19:16]. i.e. ldr r0,[r1,r2*4]

5) immediate and/orr/eor with sign-extended immediates. It doesn't cover the whole range but covers a lot of very useful sizes and includes a unary 'not' operation. Only if there is room in the instruction selector space though.

6) Add the rounding and integer mode bits to the 32-bit versions of fadd/fsub/fmul/fmadd/fmsub/float/fix within [22:20] which completely override the ones in the config register. Mode switches are extremely expensive in time and code-size and there's no reason to 'parameterise' it. The one bit left could (future) select double-mode, or for the i* equivalents, 64-bit output.

6a) 16-bit immediate mode instruction for setting the fpu mode bits for 16-bit f* instructions that operates with no cycle penalty and doesn't need to read/modify/write anything to set the mode. fmode #i3 (with various aliases). 3 bits being integer/float, rounding/truncate, (future) double/single to mirror the 3-bits available in 32-bit f* opcodes.

6b) other ways to save/restore the fp modes. Obvious easiest solution is an alias of the config register which only includes round/truncate and integer/float so it can be written to without reading first. Actually the same goes for almost every functional group of bits in CONFIG.

7) load/store multiple?

8) Drop movCC for cmpCC + selb + "mov rd,rn" or if possible "mov[d][e] rd,rn". Where the 'd' bit indicates double (register pair), and 'e' indicates 'exchange' (swap).

1 and 3 are for better branchless algorithms. Good for performance and critical for using the hardware-loop to its fullest. Together they could replace movCC as in 8) - I'd go so far as to suggest that if it's a question of movCC or cmpCC, then movCC should be removed, but Adapteva may not want to break backward binary compatibility (cost of doing so seems low mind you).
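A rough C sketch of how 1) and 3) combine into branchless code. It assumes a cmpCC that writes all ones or all zeros to its destination, and a bit select named selb as in 8). The function names and the choice of an unsigned less-than condition are only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cmpCC result: 0xffffffff if a < b (unsigned), else 0. */
static uint32_t cmp_ltu_mask(uint32_t a, uint32_t b)
{
    return (uint32_t)0 - (uint32_t)(a < b);
}

/* Assumed selb: per bit, take a where mask is 1, else b. */
static uint32_t selb(uint32_t mask, uint32_t a, uint32_t b)
{
    return (a & mask) | (b & ~mask);
}

/* min without a branch: one compare and one select. */
static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return selb(cmp_ltu_mask(a, b), a, b);
}

/* clamp without a branch: two compares and two selects. */
static uint32_t clamp_u32(uint32_t v, uint32_t lo, uint32_t hi)
{
    assert(lo <= hi);
    v = selb(cmp_ltu_mask(v, lo), lo, v);   /* raise to lo if below */
    v = selb(cmp_ltu_mask(hi, v), hi, v);   /* lower to hi if above */
    return v;
}
```

Since nothing here changes control flow, a loop body written this way has no taken-branch penalty and stays usable inside a hardware loop.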

2, 4, 5 would help with code-size, fixed-point algorithms, and array indexing and seem to me to be pretty low-hanging fruit.
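To illustrate what 5) buys, here is a small C model of logical operations with a sign-extended immediate. The 11-bit field width used in the usage note is an assumption chosen for the example, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low `bits` bits of field to 32 bits. */
static uint32_t sign_extend(uint32_t field, unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    uint32_t sign = (uint32_t)1 << (bits - 1);
    field &= (sign << 1) - 1;        /* keep only the immediate field */
    return (field ^ sign) - sign;    /* replicate the sign bit upward */
}

/* eor rd,rn,#simm: an immediate of -1 gives a unary 'not'. */
static uint32_t eor_simm(uint32_t rn, uint32_t field, unsigned bits)
{
    return rn ^ sign_extend(field, bits);
}

/* and rd,rn,#simm: negative immediates clear low bits and keep the rest. */
static uint32_t and_simm(uint32_t rn, uint32_t field, unsigned bits)
{
    return rn & sign_extend(field, bits);
}
```

With an 11-bit field, eor_simm(x, 0x7ff, 11) is ~x, and and_simm(x, 0x700, 11) masks with 0xffffff00, i.e. it rounds x down to a multiple of 256 without first loading a 32-bit constant.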

6 is pretty critical to general performance - it would have a big time/space impact on most types of code. Right now the i* "instructions" are essentially worthless outside of full integer/fixed-point code due to the mode-switching cost. Even innocuous statements like "i=(int)f" or "f=foo[j*k]*3.0f" can turn into a couple of dozen lost cycles due to mode switching. The compiler optimiser doesn't seem to like it at all either.

by **algorithm** » Wed Aug 06, 2014 7:11 am

Branches are a big bottleneck for Epiphany, since there is no branch prediction unit. A simple way to increase speed is to transform branches into something like: j* offset, slotnum, where slotnum is a 3-bit number that indicates after how many instructions to branch. As a result, the misprediction penalty (Epiphany predicts branches as not taken) is greatly reduced. The slotnum can be encoded in the higher bits of offset, since offset is 24 bits long and the per-core memory address is just 20 bits long. Notice that if slotnum is 0, it is the same as the current branch instructions.

by **aolofsson** » Wed Aug 06, 2014 11:02 am

by **algorithm** » Wed Aug 06, 2014 12:12 pm

I don't understand you. Is the problem complexity?
Detailed algorithm:
Epiphany can execute at most 2 instructions per clock, so in 3 clocks it executes at most 6 instructions.
We can restrict slotnum to at most 6 instructions.
When the core finds a branch, it just changes the ip address.
When it has executed the required number of instructions, it discards the remaining wrongly fetched instructions and continues executing from the offset.

by **timpart** » Thu Aug 07, 2014 11:10 am

The chip doesn't execute an instruction immediately after fetching it. There is a pipeline involved, and the three-cycle delay involved in a branch or jump is caused by refilling the pipeline.

Tim

Wishlist for next revision of Epiphany ISA
