For a simple example, let's see how MD5 is implementable. Its step function is:
#define STEP(f, a, b, c, d, x, t, s) \
(a) += f((b), (c), (d)) + (x) + (t); \
(a) = (((a) << (s)) | (((a) & 0xffffffff) >> (32 - (s)))); \
(a) += (b);
where f() is one of:
#define F(x, y, z) ((z) ^ ((x) & ((y) ^ (z))))
#define G(x, y, z) ((y) ^ ((z) & ((x) ^ (y))))
#define H(x, y, z) ((x) ^ (y) ^ (z))
#define I(x, y, z) ((y) ^ ((x) | ~(z)))
Notice that F() and G() correspond to bitselect(). Additionally, due to symmetry in H(), one of the two XOR results may in 8 of the 16 steps using H() be reused for the next step, so a double-XOR instruction is of slightly less help than it would be otherwise. Typically, x is in memory, t is a 32-bit constant (may be in memory or pre-loaded into a register outside of the loop for a subset of the steps - we have 64 steps, so can't do it for all), s is a small constant, the rest of the inputs to the STEP macro are in registers.
For now, let's take F() and assume that our t is in memory. On current Epiphany, I think the code may be something like this:
LDRW R0,[x]
LDRW R1,[t]
; or maybe we can do LDRD R0,[x] if we can somehow intermix x and t
EOR R2,c,d
AND R2,R2,b
EOR R2,R2,d
ADD R2,R2,a
IADD R0,R0,R1
ADD R0,R0,R2
LSR R1,R0,(32-s)
MOV R3,(1<<s) ; can hopefully do it outside of the loop, not here
IMADD R1,R0,R3
ADD a,R1,b
This is 10 to 12 instructions.
Here's how we can optimize this using some of the proposed enhancements to the ISA:
LDRW R0,[x]
LDRW R1,[t]
; or maybe we can do LDRD R0,[x] if we can somehow intermix x and t
MOV R2,d
EOR_AND,EOR R2,c,b ; alternatively, could reuse d here and use R2 in place of d further
ADD,ADD R2,a,R0
IADD R2,R2,R1
MOV a,b
LSRD,ADD a,R2,R2,(32-s) ; alternatively, could reuse b here, and swap a and b in the following code
This is 7 or 8 instructions. Two of them are MOV - would be nice to have them for free! Let's take it further:
LDRW,ADD a,[x]
MOV R2,d
LDRW,ADD a,[t]
EOR_AND,EOR R2,c,b ; alternatively, could reuse d here and use R2 in place of d further
ADD R2,R2,a
MOV a,b
LSRD,ADD a,R2,R2,(32-s) ; alternatively, could reuse b here, and swap a and b in the following code
Again 7 instructions, 2 of them MOV. This is like 5 instructions if the moves are free.
If the moves are non-free, we have less incentive to use the complicated instruction forms here. We can do:
LDRW,ADD a,[x]
MOV R2,d
LDRW,ADD a,[t]
EOR_AND,EOR R2,c,b ; alternatively, could reuse d here and use R2 in place of d further
ADD a,a,R2
LSRD a,a,a,(32-s)
ADD a,a,b
Also 7 instructions, but only 1 of them is MOV. If we consider making MOVs free (maybe at a later time), then the version with more MOVs is better.
ADD_ROT would save an instruction here.
In all of these examples, I assume that possible stalls can presumably be addressed by interleaving instructions from many instances of MD5 together (we're talking multi-core parallelism anyway, can as well have more of it per core). Also, more of the ADDs may be replaced with IADD as optimal.
And now a version using 64-bit load-ops for 2x32-bit SIMD computation:
LDRD,ADD32 a,[x]
LDRD,ADD32 a,[t]
LDRD R2,[c]
LDRD,EOR R2,[d]
LDRD,AND R2,[b]
LDRD,EOR R2,[d]
LDRD,ADD32 R2,[a]
LSRD a,R2,R2,(32-s)
LSRD a+1,R3,R3,(32-s)
LDRD,ADD32 a,[b]
10 instructions (8 loads, 2 IALU) for two instances of MD5.
Notice that all of these instructions fit in the not-too-invasive instruction encoding changes that I proposed above. Here, the "[register]" notation refers to the memory-mapped address of the register (is such use of loads currently OK?), and indeed all registers are meant to be even-numbered. Similarly, "a+1" refers to the next register after "a". This is not real assembly source; the correct syntax for these things will need to be substituted. Also, the addresses of memory-mapped registers will need to be preloaded into other registers before the loop (or at least the base address will need to be preloaded, and then we'd need support for at least some limited displacements along with load-ops - looks practical to me). For actual implementation, this 2x SIMD implementation would probably need to be intermixed with a non-SIMD one above, so the loads will dual-issue along with IALU and FPU instructions there. There happen to be exactly two loads in the non-SIMD implementation, so they may be issued along with the LSRD instructions in the SIMD implementation (where the load/store unit would otherwise be idle). Even more instances may be intermixed (e.g., two SIMD and two scalar, computing a total of 6 MD5s in parallel) to hide instruction latencies.
As it happens, in two latest code revisions above I am using the ADD update mode on loads only. So only two extra 32-bit adders are being made use of, out of a maximum of four extra 32-bit adders needed per my earlier proposal. Maybe in practice it'd be sufficient to have 3 of the 4 (e.g. two for loads and one for either IALU or FPU, but not for both - supporting only bitwise update modes on top of the other unit's instructions).
On a related note, loads from memory-mapped registers are very powerful, if they are in fact supported and are fast - are they? Even without load-ops, such a 64-bit load can be used to replace two MOVs. Will it currently do the trick of copying two registers to two other registers in 1 cycle (in terms of throughput, not latency), which with MOVs would be two cycles?