Wishlist for next revision of Epiphany ISA

Forum for anything not suitable for the other forums.

Wishlist for next revision of Epiphany ISA

Postby solardiz » Wed Apr 10, 2013 2:24 am

Here are some additions to the ISA to consider:

  • 8-bit gather loads. These may be used e.g. to implement four DES S-box lookups in parallel (these are 6-to-4-bit, but 8-bit is more generic and needed for other things). They will also be usable to implement big-endian loads, with a displacement vector of 0x00010203, thereby eliminating the need for separate big-endian load or byte-swap instructions. And they will also be usable for other byte permutes, including reading from registers (since they're memory-mapped), eliminating the need for separate instructions of this sort.
  • 16-bit and/or other size gather loads.
  • 64-bit integer addition/subtraction support instructions, similar to 386's ADC/SBB - or is this functionality already present in some form and I missed it?
  • Trivial 64-bit integer operations on pairs of registers, similar to how Epiphany already has 64-bit loads and stores. By trivial I mean at least bitwise ops, and possibly also addition/subtraction and bit shifts/rotates, but not multiplication.
  • A bitselect() instruction (in OpenCL terms) would be very handy for speeding up many symmetric crypto algorithms. Since it has 3 inputs and 1 output, it'd probably either have to overwrite one of the inputs (really bad, will require extra register copies in many cases) or will in fact need to be encoded with 4 register operands (much better, if practical). The documentation says that the register file only allows for 3 reads + 1 write by the FPU, not IALU, so maybe this instruction will have to be in the FPU. This makes me wonder how IMADD/IMSUB work, though, since they also read 3 operands and write 1. (Perhaps the documentation needs to be improved in this respect.)
  • A full set of 2-input bitwise ops (16 total, 10 of them non-trivial), like UltraSPARC VIS has. This, along with bitselect(), will help reduce instruction count (and thus improve speed) of bitslice implementations of ciphers.
  • A few 3-input bitwise ops besides bitselect() - this will need serious consideration (I'd be happy to help identify the needed ones). At least the basic functions of MD4/MD5/SHA-1/SHA-256 (and SHA-512 if we go 64-bit) need to be implementable with only one instruction each. bitselect() achieves this in some cases, but some others are 2+ instructions even with bitselect(). A dual-XOR (that is, A xor B xor C) would be handy in other crypto contexts as well.
  • A 32-bit rotate instruction, also mostly for crypto (with immediate rotate count only is sufficient for most uses). Currently, IMADD and LSR may be used to implement this operation in two instructions, so it's not too bad (can IMADD's result be used on the very next cycle?), but having it as one instruction is preferable.
  • CISC'ish load-op instructions for some trivial ops. x86 CPUs manage to have single-cycle load-op SIMD instructions starting with Pentium MMX, so this should be implementable without too much signal propagation delay and die area cost. I don't know how they do it, but here's an idea (might be naive): can the register file support not only read/write, but also write-with-trivial-op accesses? Then those load-ops would have the op performed in the register file, not in ALU. I guess for bitwise ops this may be cheap enough, so we'd have e.g. LDR_EOR, etc (ideally, for a full set of possible 2-input bitwise ops). In fact, since we already have 64-bit loads defined, this will also automatically give us 64-bit bitwise ops, so we may avoid wasting separate instructions and ALU logic on those then.

I hope it is just unclear documentation, but in case this is really missing, then also:

  • Scaling of the index in LDR/STR instructions, maybe only by exactly the size of the datum being loaded/stored. In fact, shouldn't this (have been) the only mode of operation, given that unaligned access is not allowed? That is, the base is a byte offset, but the index is the array element number (0-based), not its byte displacement. The instruction forms with immediate displacement got this right: they don't encode those always-0 bits. (All of this understanding of the current state of things is per my reading of the documentation only.)

Since most of the above is not needed for floating-point code, I guess it won't be accepted - however, maybe the cost (in terms of possible negative effect on GFLOPS/chip and GFLOPS/Watt) for at least some of these is low or non-existent and they can get in anyway?

In a similar spirit, how about adding bitwise ops to the FPU as well (yes, a hack), to allow for dual-issue of them (along with IALU and with loads/stores)? Not consistent with Epiphany's main goals, but can double the speed of some crypto codes for maybe negligible cost.

Alternatively, maybe the bitwise ops should be moved to exist solely(?) as a special update mode into the register file (and implemented there), and they would then be accessible via MOV-like instructions (copy the input operand as-is) from any execution unit (thus, we'll get dual-issue for bitwise ops as well). As a bonus, we'll also get things like combined ADD_EOR (just an ADD insn with the "exclusive OR write mode" bit set), which are found in many ciphers. And if we don't remove the bitwise ops that are already supported in IALU, then we get those things like dual-XOR I mentioned above for free (without needing extra instructions). Together, the things I listed so far may provide like a 3x boost for symmetric crypto. That was just some brainstorming, crazy as it should be. :-)
solardiz
 
Posts: 28
Joined: Wed Apr 10, 2013 12:36 am

Re: Wishlist for next revision of Epiphany ISA

Postby solardiz » Wed Apr 10, 2013 2:54 am

It's not on my current wishlist, but I imagine that some others (and maybe me later) would also be interested in 16-bit SIMD/SWAR integer ops (similar to PA-RISC MAX, which did two 16-bit ops in 32-bit regs) and/or in 16-bit half-precision floating-point (similarly, two such operations per instruction executed). Perhaps this is fairly low-cost, considering that PA-RISC got it in 1994.
solardiz
 
Posts: 28
Joined: Wed Apr 10, 2013 12:36 am

Re: Wishlist for next revision of Epiphany ISA

Postby aolofsson » Wed Apr 10, 2013 12:40 pm

Solardiz,
Thanks for all the fantastic ISA suggestions. If we ever do a custom crypto version, we now know who to talk to first. :D

solardiz wrote:Here are some additions to the ISA to consider:

  • 8-bit gather loads. These may be used e.g. to implement four DES S-box lookups in parallel (these are 6-to-4-bit, but 8-bit is more generic and needed for other things). They will also be usable to implement big-endian loads, with a displacement vector of 0x00010203, thereby eliminating the need for separate big-endian load or byte-swap instructions. And they will also be usable for other byte permutes, including reading from registers (since they're memory-mapped), eliminating the need for separate instructions of this sort.
  • 16-bit and/or other size gather loads.
  • 64-bit integer addition/subtraction support instructions, similar to 386's ADC/SBB - or is this functionality already present in some form and I missed it?
  • Trivial 64-bit integer operations on pairs of registers, similar to how Epiphany already has 64-bit loads and stores. By trivial I mean at least bitwise ops, and possibly also addition/subtraction and bit shifts/rotates, but not multiplication.
  • A bitselect() instruction (in OpenCL terms) would be very handy for speeding up many symmetric crypto algorithms. Since it has 3 inputs and 1 output, it'd probably either have to overwrite one of the inputs (really bad, will require extra register copies in many cases) or will in fact need to be encoded with 4 register operands (much better, if practical). The documentation says that the register file only allows for 3 reads + 1 write by the FPU, not IALU, so maybe this instruction will have to be in the FPU. This makes me wonder how IMADD/IMSUB work, though, since they also read 3 operands and write 1. (Perhaps the documentation needs to be improved in this respect.)
  • A full set of 2-input bitwise ops (16 total, 10 of them non-trivial), like UltraSPARC VIS has. This, along with bitselect(), will help reduce instruction count (and thus improve speed) of bitslice implementations of ciphers.
  • A few 3-input bitwise ops besides bitselect() - this will need serious consideration (I'd be happy to help identify the needed ones). At least the basic functions of MD4/MD5/SHA-1/SHA-256 (and SHA-512 if we go 64-bit) need to be implementable with only one instruction each. bitselect() achieves this in some cases, but some others are 2+ instructions even with bitselect(). A dual-XOR (that is, A xor B xor C) would be handy in other crypto contexts as well.
  • A 32-bit rotate instruction, also mostly for crypto (with immediate rotate count only is sufficient for most uses). Currently, IMADD and LSR may be used to implement this operation in two instructions, so it's not too bad (can IMADD's result be used on the very next cycle?), but having it as one instruction is preferable.
  • CISC'ish load-op instructions for some trivial ops. x86 CPUs manage to have single-cycle load-op SIMD instructions starting with Pentium MMX, so this should be implementable without too much signal propagation delay and die area cost. I don't know how they do it, but here's an idea (might be naive): can the register file support not only read/write, but also write-with-trivial-op accesses? Then those load-ops would have the op performed in the register file, not in ALU. I guess for bitwise ops this may be cheap enough, so we'd have e.g. LDR_EOR, etc (ideally, for a full set of possible 2-input bitwise ops). In fact, since we already have 64-bit loads defined, this will also automatically give us 64-bit bitwise ops, so we may avoid wasting separate instructions and ALU logic on those then.

[*]8-bit gather loads-->GP useful, moderately expensive in terms of hw size, speed
[*]16-bit and/or other size gather loads.-->GP useful, moderately expensive in terms of hw,speed
[*]64-bit integer addition/subtraction support instructions-->GP useful, moderately expensive in terms of hw size, speed
[*]Trivial 64-bit integer operations on pairs of registers->GP useful, moderately expensive
[*]A bitselect() instruction (in OpenCL terms)-->IMADD is done in the FPU,not GP?, moderately expensive.
[*]A full set of 2-input bitwise ops (16 total, 10 of them non-trivial)-->not GP?, moderately expensive
[*]A few 3-input bitwise ops besides bitselect() -->not GP?,moderately expensive
[*]A 32-bit rotate instruction, also mostly for crypto-->not GP?, inexpensive
[*]CISC'ish load-op instructions for some trivial ops. -->GP,speed/complexity issue

A few more requests that have come up in the past based on benchmarks:
-unsigned multiply with 64 bit result
-add/subtract with carry
-sign extension on byte/short load
-1/x, 1/sqrt(x) seeds


solardiz wrote:
  • Scaling of the index in LDR/STR instructions, maybe only by exactly the size of the datum being loaded/stored. In fact, shouldn't this (have been) the only mode of operation, given that unaligned access is not allowed? That is, the base is a byte offset, but the index is the array element number (0-based), not its byte displacement. The instruction forms with immediate displacement got this right: they don't encode those always-0 bits. (All of this understanding of the current state of things is per my reading of the documentation only.)

We did not want to do the array element offset on index addressing (like we did with displacement) because it would have required another mux in the datapath. We realized that this may be a little confusing.

solardiz wrote:In a similar spirit, how about adding bitwise ops to the FPU as well (yes, a hack), to allow for dual-issue of them (along with IALU and with loads/stores)? Not consistent with Epiphany's main goals, but can double the speed of some crypto codes for maybe negligible cost....Together, the things I listed so far may provide like a 3x boost for symmetric crypto. That was just some brainstorming, crazy as it should be. :-)


Many of these instructions would not be used for general purpose code[imho], meaning that all code would pay a power penalty (dynamic and leakage power) but only a few applications would gain in terms of performance. Let's say we gain 2-3X by putting in these instructions for crypto, but every other program get's a 10-20% increase in power consumption and cost.[more transistors, harder to meet timing goals, etc] That's a tough choice to make..

I used lead the execution unit team on the TigerSHARC DSP which had one of the most massive instruction sets seen to date. A number of the instructions you suggest were implemented in some fashion in the TS. Designing the datapath was "like death by a thousand paper cuts."

http://www.analog.com/static/imported-f ... 01_pgr.pdf

Cost estimates above are VERY rough, the only way to give an accurate cost of these functions is to write up the RTL code, synthesized,P&R the logic. It might take 1-3 months, but then we could tell you the cost of adding all of them. It would likely fit nicely if we would remove the FPU. I try to write up a more elaborate post soon, sorry if I was too brief here.

Andreas
User avatar
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Wishlist for next revision of Epiphany ISA

Postby piotr5 » Wed Apr 10, 2013 1:49 pm

an addition that I would like to see in parallella is a random number generator, like the one in some x86. the reason for that is "compressed sensing" where the algorithms require that a totally random matrix gets created -- pseudo-random probably wont cut it. also encryption could gain something when entropy is created by 16 or 64 cores in parallel! remember, epiphany cores can't use the rng from linux, but linux can use their rng...
piotr5
 
Posts: 230
Joined: Sun Dec 23, 2012 2:48 pm

Re: Wishlist for next revision of Epiphany ISA

Postby timpart » Wed Apr 10, 2013 7:21 pm

Here are some thoughts on the current chip and what I thought were limitations.

Byte and halfword loads stall the chip for two cycles. Could this not be reduced, preferably to zero? My naive assumption is that it just some wires that go diagonally and some selection gates. The results of a load can't be used for a cycle so I'm a bit surprised it couldn't be fitted in.

There is no instruction that adds an immediate amount to a register (saving the answer back into the register) then uses the result to do a load or store. There is a handy instruction which changes the register after it has been used as an address. This results in the anomaly that a stack needs one instruction to go one way but two to go the other.

Adds subtracts and shift (one place) with carry. Also a clear carry instruction or some way of starting with no carry going in but carry coming out. Simplifies construction of multiword operations.

I was surprised to discover there is no condition test that checks just the overflow flag. Not a major one this but I had a situation where it would have been useful to branch on this.

I notice that one of the conditions is taken up for branch and link. This is only useful for branches and not the other conditional uses. Perhaps an oportunity there to squeeze in another condition?

The processor is either in floating point or signed integer mode and switching between the two takes the compiler quite a lot of instructions (it disables interrupts while doing it then turns them back on again). In the future there may be more options (double word versions?). Would it be possible to make these freely mixable each with their own opcode? The 32 bit instructions would seem to have some spare bits. Would make the pipeline more complex as it would have to remember which mode it is in as the data moves down it.

A rotate by a constant operation would make SHA easier to do.

Population (bit) count is or was popular for some things I hear and longwinded to do in other instructions.

Regards,

Tim
timpart
 
Posts: 302
Joined: Mon Dec 17, 2012 3:25 am
Location: UK

Re: Wishlist for next revision of Epiphany ISA

Postby amylaar » Wed Apr 10, 2013 8:58 pm

piotr5 wrote:an addition that I would like to see in parallella is a random number generator,


No need to have an instruction for that - you could have a memory-mapped I/O circuitry that provides that.
amylaar
 
Posts: 31
Joined: Thu Jan 10, 2013 3:06 pm

Re: Wishlist for next revision of Epiphany ISA

Postby solardiz » Wed Apr 10, 2013 10:41 pm

Thank you all for comments and additional ideas! Here are some more from me:

Instead of a 32-bit rotate instruction, we could introduce one that is more generic and even more suitable for crypto: ARX, which is the acronym for add-rotate-XOR (as established in crypto lately). The instruction would compute:

RD ^= rotate(RN + RM, imm5)

It would need 3*6 = 18 bits for register indices and 5 bits for the rotate count, leaving up to 9 bits for the opcode.

Maybe the regular ADD and XOR instructions should be converted into forms of this instruction (sharing some circuitry with its implementation). ADD is the same as the ARX instruction proposed above, but with the immediate value at 0 and with the XOR (register file update mode?) disabled (one bit in the instruction encoding? or maybe 0 in the 5-bit immediate value would be treated specially, enabling this mode?) Then we don't even have to spend a new opcode on this instruction - it will be an extension/replacement of ADD. :-) XOR is also implementable as a special mode of ARX, but for the full 3-register form it's a bit trickier (need to have a bit in the instruction encoding that will change RM into input to XOR rather than to ADD).

If this is too tricky, then the fallback option is to implement just an ADD_ROT, which will also be usable as ADD in the straightforward way (at 0 rotate count), but having a full ARX instruction would provide speedup for ciphers that would use all 3 components of it or that would use the ADD and XOR components.

If opcode space permitted, we could even do:

RD ^= ((RN + RM) << n) ^ ((RN + RM) >> m)

where "n" and "m" are separate immediate values, 5-bit each, but this leaves only 3 bits for the opcode, which is clearly unacceptable (we'd waste 1/8th of our opcode space on this one instruction). That's a pity, since this instruction could also be usable for some bit permutations, with rotation being only a special case, yet not consume much/any more logic.

Speaking of bit permutations, we need to check out this paper. 243 pages on optimal choice of bit permutation instruction(s) to implement. :-)
solardiz
 
Posts: 28
Joined: Wed Apr 10, 2013 12:36 am

Re: Wishlist for next revision of Epiphany ISA

Postby aolofsson » Wed Apr 10, 2013 11:25 pm

More great inputs and again a reminder of the fun but exhausting days working on the tigersharc. A lot of the tradeoffs for future revisions will be about trading performance for ease of use(and programming method). The sky's the limit in terms of optimization. Check out this Galois field arithmetic accelerator that Yaniv worked on while at Analog Devices a few years back. :D http://www.google.com/patents/US7269615.pdf

If a firmware programmable accelerator is acceptable, a 10-50x boost over the existing "pragma-less" C-programmable Epiphany ISA is achievable for some applications (can't speak specifically about crypto).

Andreas
User avatar
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Wishlist for next revision of Epiphany ISA

Postby solardiz » Wed Apr 10, 2013 11:31 pm

Andreas,

Thanks a lot for your comments. A custom crypto version sounds interesting (perhaps primarily FPU-less, to pack more cores per chip, but also with the added tweaks for crypto friendliness). Another custom(?) version could be with double-precision floating-point - crucial for potential use in supercomputers. So that's three versions, then - or can it be reduced to two: one is integer-only (but with fixed-point and 2x16-bit SWAR), the other is double-precision capable?

aolofsson wrote:A bitselect() instruction (in OpenCL terms)-->IMADD is done in the FPU,not GP?, moderately expensive.


Regarding bitselect being "not GP?", what if we redefine the existing conditional moves as a form of bitselect, where the selector bits vector would be set to all 0's or all 1's according to the flags? There would need to be an encoding of this instruction where the selector bits vector is read from a register, and other encodings where it is all 0's or all 1's according to flags. Would this simplify things (fewer instructions) or/and add signal propagation delay, limiting clock rate? My gut feeling is that this instruction is still a lot simpler than multiplication, so I'd expect the clock rate to be limited (solely?) by multiplication - is it not?

aolofsson wrote:A 32-bit rotate instruction, also mostly for crypto-->not GP?, inexpensive


The only "inexpensive" one you've identified. :-) I don't see how/why bitselect is significantly(?) more expensive than rotate (two inputs vs. just one? is that it?), but you're the expert. What about my ARX suggestion? Would it be "moderately expensive" rather than "inexpensive"?

aolofsson wrote:CISC'ish load-op instructions for some trivial ops. -->GP,speed/complexity issue


Yes, they add complexity, but maybe it can be negated with clever design. As to speed, due to GP relevance I think any slowdown such as from a slightly lower clock rate would be compensated for by needing to execute significantly fewer instructions for the same computation (this is assuming that we have load-add and load-sub, not just load-bitwise-ops, though).

aolofsson wrote:-add/subtract with carry


Isn't it the same as my suggestion of 386-like ADC/SBB?

aolofsson wrote:-sign extension on byte/short load


I guess this is mostly needed for poorly written high-level language code, where the program could as well use zero extension, but just happens to use signed data types. (I am guilty for writing some code like this too. It just does not matter performance-wise on typical CPUs.) It's a pity, but yes, the problem and thus the need is there.

aolofsson wrote:We did not want to do the array element offset on index addressing (like we did with displacement) because it would have required another mux in the datapath. We realized that this may be a little confusing.


I don't see where/why it would need an extra mux (and have a gut feeling that perhaps it could be avoided), but you're the expert.

aolofsson wrote:Many of these instructions would not be used for general purpose code[imho], meaning that all code would pay a power penalty (dynamic and leakage power) but only a few applications would gain in terms of performance. Let's say we gain 2-3X by putting in these instructions for crypto, but every other program get's a 10-20% increase in power consumption and cost.[more transistors, harder to meet timing goals, etc] That's a tough choice to make..


10-20% overhead is a lot. I was/am hoping that it can be lower than that, or at least the gain for GP code can compensate for it. Perhaps the right subset of instructions/features to add needs to be chosen, and maybe some may be removed/replaced/generalized (e.g., not needing ADD when we have ADD_ROT or ARX).

aolofsson wrote:I used lead the execution unit team on the TigerSHARC DSP which had one of the most massive instruction sets seen to date. A number of the instructions you suggest were implemented in some fashion in the TS. Designing the datapath was "like death by a thousand paper cuts."

http://www.analog.com/static/imported-f ... 01_pgr.pdf


Wow. Ouch. Of course, we need to stay sane.

aolofsson wrote:I try to write up a more elaborate post soon, sorry if I was too brief here.


No, you were not too brief. Your response is just sufficient for the time being. Thank you!

It'd help if you provide some info on how much the various components of the existing eCore cost (in terms of die area, as well as in other ways if you can identify those). Maybe the (die area? other?) costs of the existing instructions (although I guess they're intertwined, so the costs are not strictly additive).

Thanks again,
Alexander
solardiz
 
Posts: 28
Joined: Wed Apr 10, 2013 12:36 am

Re: Wishlist for next revision of Epiphany ISA

Postby solardiz » Thu Apr 11, 2013 12:02 am

By the way, AMD GCN ISA (used on AMD's latest and largest GPUs) has load-ops. My copy of AMD_Southern_Islands_Instruction_Set_Architecture.pdf on page 231+ lists the instruction encodings for the load-ops. The available load-ops include addition, subtraction, reverse subtraction (I guess this can be fairly important), increment and decrement with saturation, min/max, bitwise ops (including 3-input XOR-NOT OR), compare and store, compare and swap. Quite impressive for a GPU, considering how many SIMD units they pack per chip. I think we'd be fine with a smaller set of load-ops: addition, subtraction, reverse subtraction, 2-input bitwise ops.
solardiz
 
Posts: 28
Joined: Wed Apr 10, 2013 12:36 am

Next

Return to General Discussion

Who is online

Users browsing this forum: No registered users and 2 guests

cron