Parallella Community

by **solardiz** » Wed Apr 10, 2013 2:54 am

It's not on my current wishlist, but I imagine that some others (and maybe me later) would also be interested in 16-bit SIMD/SWAR integer ops (similar to PA-RISC MAX, which did two 16-bit ops in 32-bit regs) and/or in 16-bit half-precision floating-point (similarly, two such operations per instruction executed). Perhaps this is fairly low-cost, considering that PA-RISC got it in 1994.
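For readers unfamiliar with SWAR, here is a minimal C sketch of what a packed 16-bit add costs in software today, which is the work a dedicated instruction would replace. The function name `swar_add16x2` is made up for illustration and is not part of any Epiphany toolchain.

```c
#include <stdint.h>

/* Add two pairs of 16-bit lanes packed into 32-bit words.
   Each lane wraps around on its own; no carry leaks from the
   low lane into the high lane. */
static uint32_t swar_add16x2(uint32_t a, uint32_t b)
{
    const uint32_t H = 0x80008000u;        /* top bit of each 16-bit lane */
    uint32_t low = (a & ~H) + (b & ~H);    /* add the low 15 bits of each lane */
    return low ^ ((a ^ b) & H);            /* fold the top bits back in without carry */
}
```

That is roughly five ALU operations plus a constant load for what a SWAR add instruction would do in one.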

by **aolofsson** » Wed Apr 10, 2013 12:40 pm

by **piotr5** » Wed Apr 10, 2013 1:49 pm

an addition that I would like to see in parallella is a random number generator, like the one in some x86. the reason for that is "compressed sensing" where the algorithms require that a totally random matrix gets created -- pseudo-random probably won't cut it. also encryption could gain something when entropy is created by 16 or 64 cores in parallel! remember, epiphany cores can't use the rng from linux, but linux can use their rng...

by **timpart** » Wed Apr 10, 2013 7:21 pm

Here are some thoughts on the current chip and what I thought were limitations.

Byte and halfword loads stall the chip for two cycles. Could this not be reduced, preferably to zero? My naive assumption is that it is just some wires that go diagonally and some selection gates. The results of a load can't be used for a cycle, so I'm a bit surprised it couldn't be fitted in.

There is no instruction that adds an immediate amount to a register (saving the answer back into the register) then uses the result to do a load or store. There is a handy instruction which changes the register after it has been used as an address. This results in the anomaly that a stack needs one instruction to go one way but two to go the other.

Add, subtract and shift (by one place) with carry. Also a clear-carry instruction, or some way of starting with no carry going in but carry coming out. This would simplify construction of multiword operations.
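To illustrate why carry-in/carry-out matters for multiword work, here is a sketch in portable C of a multiword add that has to reconstruct the carry with comparisons. The name `add_multiword` and the limb layout (least significant word first) are assumptions for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* r = a + b over n 32-bit limbs, least significant limb first.
   Returns the carry out of the most significant limb. */
static uint32_t add_multiword(uint32_t *r, const uint32_t *a,
                              const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t s = a[i] + carry;
        uint32_t c1 = s < carry;   /* wrapped while adding the carry in */
        s += b[i];
        uint32_t c2 = s < b[i];    /* wrapped while adding b[i] */
        r[i] = s;
        carry = c1 | c2;           /* at most one of the two can be set */
    }
    return carry;
}
```

With an add-with-carry instruction the loop body would shrink to a single add per limb, since the flag would carry the state between iterations.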

I was surprised to discover there is no condition test that checks just the overflow flag. Not a major one this but I had a situation where it would have been useful to branch on this.

I notice that one of the conditions is taken up for branch and link. This is only useful for branches and not the other conditional uses. Perhaps an opportunity there to squeeze in another condition?

The processor is either in floating point or signed integer mode and switching between the two takes the compiler quite a lot of instructions (it disables interrupts while doing it then turns them back on again). In the future there may be more options (double word versions?). Would it be possible to make these freely mixable each with their own opcode? The 32 bit instructions would seem to have some spare bits. Would make the pipeline more complex as it would have to remember which mode it is in as the data moves down it.

A rotate by a constant operation would make SHA easier to do.
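As a rough illustration of the cost without a rotate instruction, here is a C sketch of a 32-bit rotate built from two shifts and an OR, applied to the SHA-256 Σ0 function (rotate right by 2, 13 and 22). The helper names are made up for this example.

```c
#include <stdint.h>

/* Rotate right built from shifts; the masking keeps n == 0 well defined. */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

/* SHA-256 "big sigma 0": three rotates by constants, XORed together. */
static uint32_t sha256_big_sigma0(uint32_t x)
{
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}
```

Without a rotate, each of the three terms takes two shifts and an OR (or XOR), so a rotate-by-constant instruction would cut this function from about eight operations to five.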

Population (bit) count is or was popular for some things I hear and longwinded to do in other instructions.
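For reference, this is the usual branch-free way to do a population count in plain C when there is no instruction for it (a sketch; `popcount32` is just a name picked for this example):

```c
#include <stdint.h>

/* Count set bits by summing in parallel: pairs, then nibbles, then bytes. */
static unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                  /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums */
    return (unsigned)((x * 0x01010101u) >> 24);        /* add the four bytes */
}
```

That is around a dozen operations including a multiply, which gives an idea of what "longwinded" means here.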

Regards,

Tim

by **amylaar** » Wed Apr 10, 2013 8:58 pm

by **solardiz** » Wed Apr 10, 2013 10:41 pm

Thank you all for comments and additional ideas! Here are some more from me:

Instead of a 32-bit rotate instruction, we could introduce one that is more generic and even more suitable for crypto: ARX, which is the acronym for add-rotate-XOR. The instruction would compute:

RD ^= rotate(RN + RM, imm5)

It would need 3*6 = 18 bits for register indices and 5 bits for the rotate count, leaving up to 9 bits for the opcode.
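A small C model of the proposed semantics may help when discussing encodings. This is only a sketch: the rotate direction (left here) is an assumption, since the formula above does not specify it, and `arx` and `rotl32` are names invented for this example.

```c
#include <stdint.h>

/* Rotate left; the masking keeps a zero count well defined. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Model of the proposed instruction: RD ^= rotate(RN + RM, imm5).
   Returns the new value of RD. Rotate direction is assumed to be left. */
static uint32_t arx(uint32_t rd, uint32_t rn, uint32_t rm, unsigned imm5)
{
    return rd ^ rotl32(rn + rm, imm5 & 31);
}
```

A Salsa20-style step such as `b ^= rotl(a + d, 7)` would then map to a single `arx(b, a, d, 7)` instead of an add, two shifts, an OR and an XOR.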

Maybe the regular ADD and XOR instructions should be converted into forms of this instruction (sharing some circuitry with its implementation). ADD is the same as the ARX instruction proposed above, but with the immediate value at 0 and with the XOR (register file update mode?) disabled (one bit in the instruction encoding? or maybe 0 in the 5-bit immediate value would be treated specially, enabling this mode?) Then we don't even have to spend a new opcode on this instruction - it will be an extension/replacement of ADD. :-)

XOR is also implementable as a special mode of ARX, but for the full 3-register form it's a bit trickier (need to have a bit in the instruction encoding that will change RM into input to XOR rather than to ADD).

If this is too tricky, then the fallback option is to implement just an ADD_ROT, which will also be usable as ADD in the straightforward way (at 0 rotate count), but having a full ARX instruction would provide speedup for ciphers that would use all 3 components of it or that would use the ADD and XOR components.

If opcode space permitted, we could even do:

RD ^= ((RN + RM) << n) ^ ((RN + RM) >> m)

where "n" and "m" are separate immediate values, 5-bit each, but this leaves only 4 bits for the opcode (32 - 18 - 10), which is clearly unacceptable (we'd waste 1/16th of our opcode space on this one instruction). That's a pity, since this instruction could also be usable for some bit permutations, with rotation being only a special case, yet not consume much/any more logic.
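To make the "rotation is only a special case" point concrete, here is a C sketch of this two-shift form. The function name `add_shift_xor` is hypothetical, and the shift counts are masked to 5 bits to mirror the proposed immediates.

```c
#include <stdint.h>

/* Model of: RD ^= ((RN + RM) << n) ^ ((RN + RM) >> m)
   with n and m as independent 5-bit immediates.
   Returns the new value of RD. */
static uint32_t add_shift_xor(uint32_t rd, uint32_t rn, uint32_t rm,
                              unsigned n, unsigned m)
{
    uint32_t s = rn + rm;
    return rd ^ ((s << (n & 31)) ^ (s >> (m & 31)));
}
```

When n + m = 32 (with n from 1 to 31) the two shifted parts do not overlap, so the inner XOR acts as an OR and the result is a rotate left by n. Other pairs, such as n = m = 4, give shift-and-mix patterns that a plain rotate cannot express.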

Speaking of bit permutations, we need to check out . 243 pages on optimal choice of bit permutation instruction(s) to implement. :-)

by **aolofsson** » Wed Apr 10, 2013 11:25 pm

More great inputs and again a reminder of the fun but exhausting days working on the TigerSHARC. A lot of the tradeoffs for future revisions will be about trading performance for ease of use (and programming method). The sky's the limit in terms of optimization. Check out this Galois field arithmetic accelerator that Yaniv worked on while at Analog Devices a few years back.

http://www.google.com/patents/US7269615.pdf

If a firmware programmable accelerator is acceptable, a 10-50x boost over the existing "pragma-less" C-programmable Epiphany ISA is achievable for some applications (can't speak specifically about crypto).

Andreas

by **solardiz** » Wed Apr 10, 2013 11:31 pm

by **solardiz** » Thu Apr 11, 2013 12:02 am

By the way, AMD GCN ISA (used on AMD's latest and largest GPUs) has load-ops. My copy of AMD_Southern_Islands_Instruction_Set_Architecture.pdf on page 231+ lists the instruction encodings for the load-ops. The available load-ops include addition, subtraction, reverse subtraction (I guess this can be fairly important), increment and decrement with saturation, min/max, bitwise ops (including 3-input XOR-NOT OR), compare and store, compare and swap. Quite impressive for a GPU, considering how many SIMD units they pack per chip. I think we'd be fine with a smaller set of load-ops: addition, subtraction, reverse subtraction, 2-input bitwise ops.

Wishlist for next revision of Epiphany ISA
