Given benefit/cost, here's my current proposal for most important additions/removals/changes:
Add destination register update modes, likely implemented in or near the register file (not in IALU/FPU). At a minimum, have:
0 - direct write (as is done now)
1 - XOR with current value in the register
Much better:
00 - direct write
01 - XOR
10 - NOT (ignores the current value in the register)
11 - ADD/ADC (depending on what instruction it's bundled with)
XOR and NOT can possibly be viewed as specialized versions of ADD/ADC, reusing a portion of the adder circuitry (XOR disables the carries, NOT also forces one operand to all 1's).
Alternatively, instead of NOT:
10 - SUB/SBB
If we can afford even more (luxury):
000 - direct write
001 - XOR
010 - NOT
011 - ADD/ADC
100 - AND
101 - OR
110 - reverse SUB/SBB (operands swapped) or AND-NOT
111 - SUB/SBB
For "reverse SUB/SBB (operands swapped) or AND-NOT", we can implement either one, or we can have this choice made based on whether the instruction is a load or arithmetic (implement reverse SUB/SBB for these) vs. bitwise (implement AND-NOT for these). In AND-NOT, the "NOT" should be applied to the instruction output, and then it should be AND'ed with the previous RD contents.
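To make the intended semantics concrete, here is a rough C model of the 3-bit variant. It is only a sketch: the enum names, the function name rd_update, and the "bitwise" selector argument are made up for illustration, and I am assuming that SUB/SBB means "old RD minus instruction output" while reverse SUB/SBB means "instruction output minus old RD".

```c
#include <stdint.h>

/* Hypothetical names for the 3-bit encoding listed above. */
enum upd_mode {
    UPD_WRITE        = 0, /* 000 - direct write */
    UPD_XOR          = 1, /* 001 - XOR */
    UPD_NOT          = 2, /* 010 - NOT (ignores the old register value) */
    UPD_ADD          = 3, /* 011 - ADD/ADC */
    UPD_AND          = 4, /* 100 - AND */
    UPD_OR           = 5, /* 101 - OR */
    UPD_RSUB_OR_ANDN = 6, /* 110 - reverse SUB/SBB or AND-NOT */
    UPD_SUB          = 7  /* 111 - SUB/SBB */
};

/*
 * old:     previous contents of RD
 * res:     output of the instruction (IALU/FPU result or loaded word)
 * bitwise: nonzero if the instruction is bitwise (picks AND-NOT for 110)
 * carry:   carry/borrow in; 0 for plain ADD/SUB, 0 or 1 for ADC/SBB
 */
uint32_t rd_update(enum upd_mode mode, uint32_t old, uint32_t res,
                   int bitwise, uint32_t carry)
{
    switch (mode) {
    case UPD_WRITE: return res;
    case UPD_XOR:   return old ^ res;          /* an add with carries disabled */
    case UPD_NOT:   return ~res;               /* same as 0xFFFFFFFF ^ res */
    case UPD_ADD:   return old + res + carry;
    case UPD_AND:   return old & res;
    case UPD_OR:    return old | res;
    case UPD_RSUB_OR_ANDN:
        /* AND-NOT: NOT applied to the output, then AND with the old RD. */
        return bitwise ? (old & ~res) : (res - old - carry);
    case UPD_SUB:   return old - res - carry;  /* assumed operand order */
    }
    return res;
}
```

For example, rd_update(UPD_XOR, rd, loaded_word, 0, 0) models a load-XOR, and rd_update(UPD_ADD, rd, product, 0, 0) models a multiply with the ADD update mode.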
These extra bits, 1 to 3 of them depending on which approach we choose, would need to be encoded in integer instructions that have a destination register and in loads. For the loads, we may either introduce extra opcodes (e.g., just one more group of load opcodes if we only support 0/1 for direct/XOR) or we'd have to reduce the maximum displacement - well, or we may choose not to support these modes in loads with displacements (but support them in other kinds of load instructions).
The XOR mode alone is surprisingly capable. Even with no other changes (no added instructions), it lets us implement bitselect in 2 instead of 3 instructions. bitselect may be written not only in the obvious form:
c = (a & sel) | (b & ~sel)
but also:
c = ((a ^ b) & sel) ^ b
If we're constrained to 3-register instructions anyway, we'll put b in RD, a in RN and sel in RM, which gives:
RD ^= (RN ^ RD) & RM
Notice the outer XOR, which is implementable via the register update mode. Additionally, depending on where the inputs came from, the inner XOR may also be implementable as a previous instruction's destination register update mode.
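As a sanity check, the two forms of bitselect and the two-instruction sequence can be modeled in C as below. This is just a sketch; the function names and the scratch register RT are mine, and the pseudo-assembly in the comments is illustrative rather than actual syntax.

```c
#include <stdint.h>

/* Obvious form: c = (a & sel) | (b & ~sel). */
uint32_t bitselect_ref(uint32_t a, uint32_t b, uint32_t sel)
{
    return (a & sel) | (b & ~sel);
}

/*
 * Two-instruction sequence using only the XOR update mode.
 * Register roles as above: b in RD, a in RN, sel in RM.
 * RT is a hypothetical scratch register holding the inner XOR.
 */
uint32_t bitselect_xor_mode(uint32_t rn_a, uint32_t rd_b, uint32_t rm_sel)
{
    uint32_t rt = rn_a ^ rd_b; /* XOR RT, RN, RD   (direct write) */
    rd_b ^= rt & rm_sel;       /* AND RD, RT, RM   (XOR update mode) */
    return rd_b;
}
```

Note that the sequence overwrites b, which is the price of staying within 3-register instructions.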
XOR update mode also gives us load-XOR (for ciphers and e.g. RAID parity computation), and 3-input XOR-XOR (actually desirable in lots of places!), AND-XOR, OR-XOR (for ciphers, including reducing instruction count in bitslice implementations).
The NOT update mode gives us AND-NOT, XOR-NOT, OR-NOT - that is, three additional 2-input bitwise ops, which are present on many RISC archs (and AND-NOT is also present on x86's MMX and SSE*). These are handy for implementations of ciphers, and for a bit more.
The ADD/ADC update mode is really powerful. When combined with 32-bit loads, it should work as ADD (no carry input) - we get 1-cycle load-adds. With 64-bit loads, it should work as 32-bit ADD for the first register and as 32-bit ADC for the other, so we get 1-cycle 64-bit load-add. With arithmetic instructions, it should work as ADD (e.g., we can add 3 32-bit numbers in one instruction, or subtract two and add a third).
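Here is a small C sketch of what the 64-bit load-add would compute. The struct and function names are hypothetical, and I am assuming that the first register of the pair holds the low 32 bits and that the low word comes first in memory.

```c
#include <stdint.h>

/* A 64-bit value held in a pair of 32-bit registers. */
struct reg_pair {
    uint32_t lo; /* first register: updated with ADD (no carry in) */
    uint32_t hi; /* second register: updated with ADC */
};

/* Model of a 64-bit load with the ADD/ADC update mode. */
struct reg_pair load64_add(struct reg_pair rd, const uint32_t mem[2])
{
    uint32_t lo = rd.lo + mem[0];
    uint32_t carry = lo < rd.lo; /* unsigned wraparound signals carry out */

    rd.hi = rd.hi + mem[1] + carry;
    rd.lo = lo;
    return rd;
}
```

The result matches a plain 64-bit addition of the register pair and the loaded doubleword, done by the update logic rather than by a separate add instruction.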
The SUB/SBB update mode is similar, allowing for 32-bit load-sub, 64-bit load-sub, and more.
The luxury AND, OR, AND-NOT update modes give us 3-input bitwise ops (beyond XOR-XOR), as well as load-AND, load-AND-NOT, and load-OR. This can be handy for speeding up crypto, although for other uses reverse SUB/SBB is perhaps more important than AND-NOT, the one luxury mode it would displace. That is if we have to make this choice at all; maybe we can have both, with one of them chosen by instruction category.
Now to instruction additions/removals/changes:
Replace ADD with ADD_ROT (with 5-bit immediate rotate count). If we have the XOR update mode, we don't need an ARX instruction - rather, we obtain it as ADD_ROT,xor. Indeed, we also obtain simple ADD and simple rotate in the obvious ways (need a zeroed register for the latter, though).
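A C sketch of how I picture ADD_ROT and its combination with the XOR update mode follows. The exact semantics are an assumption here: I take ADD_ROT to rotate the sum left by the immediate, i.e. rotl(RN + RM, n), which is the form used by ARX steps such as a ^= rotl(b + c, k). The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left by n bits; n is the 5-bit immediate, so 0..31. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n <= 31);
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Assumed semantics of ADD_ROT RD, RN, RM, #n with direct write. */
uint32_t add_rot(uint32_t rn, uint32_t rm, unsigned n)
{
    return rotl32(rn + rm, n);
}

/* ADD_ROT,xor: the XOR update mode turns it into an ARX step. */
uint32_t add_rot_xor(uint32_t rd, uint32_t rn, uint32_t rm, unsigned n)
{
    return rd ^ add_rot(rn, rm, n);
}
```

With n = 0 this degenerates to a simple ADD, and with one input being a zeroed register it degenerates to a simple rotate.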
If we've added ADD update mode, drop the IMADD instruction. We'll obtain it as IMUL,add.
If we've added SUB update mode, drop the IMSUB instruction. We'll obtain it as IMUL,sub.
The ability to drop separate IMADD/IMSUB instructions makes me more confident that we can implement the ADD/SUB update modes without having to lower the clock rate. We also free up some opcode space for other instructions, such as:
Add XOR_AND, which will compute:
RD = (RN ^ RD) & RM
and be usable as bitselect when invoked with XOR update mode (XOR_AND,xor).
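That XOR_AND,xor really is bitselect can be checked with a tiny truth-table test in C. This is only an illustrative sketch with hypothetical function names; since all the operations involved act on each bit independently, checking the eight single-bit input combinations is enough.

```c
#include <stdint.h>

/* XOR_AND as defined above: (RN ^ RD) & RM. */
uint32_t xor_and(uint32_t rn, uint32_t rd, uint32_t rm)
{
    return (rn ^ rd) & rm;
}

/* XOR_AND,xor: the XOR update mode folds the result into the old RD. */
uint32_t xor_and_xor(uint32_t rn, uint32_t rd, uint32_t rm)
{
    return rd ^ xor_and(rn, rd, rm);
}

/*
 * Returns 1 if XOR_AND,xor matches (a & sel) | (b & ~sel) for all eight
 * single-bit combinations of a (RN), b (RD) and sel (RM), else 0.
 */
int xor_and_xor_is_bitselect(void)
{
    for (uint32_t a = 0; a <= 1; a++)
        for (uint32_t b = 0; b <= 1; b++)
            for (uint32_t sel = 0; sel <= 1; sel++) {
                uint32_t want = ((a & sel) | (b & ~sel)) & 1;
                if ((xor_and_xor(a, b, sel) & 1) != want)
                    return 0;
            }
    return 1;
}
```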
Add FXOR, a XOR on the FPU, primarily to be usable to trigger the update modes via the FPU, for dual-issue. This way, the peak performance will be 4 integer+bitwise ops per cycle (at least one of them being XOR), or 3 (can be e.g. 3 integer adds). Arguably, FIADD (integer addition on FPU) would be of greater GP use (can do up to 4 integer adds/cycle then), but FXOR is also usable as MOV (when one of the two inputs is a zeroed register; XORing a register with itself would give zero rather than a copy), which lets us access any of the update modes as separate instructions via the FPU (to dual-issue them along with anything we do on IALU).
In fact, maybe XOR_AND and FXOR can be one and the same instruction, with a flag bit in it (enabling/disabling the AND). XOR_AND would probably be in the FPU anyway since it needs 3 inputs.
In load/store instructions, have the index treated as an actual index, and not as a byte offset. I guess the extra mux that Andreas was referring to would be incurred if both index and byte offset modes were to be supported; I'd be fine with only the index mode being supported, so the mux would be avoided then?
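To illustrate the difference, here is a C sketch of the two effective address computations. It reflects my reading only: I am assuming that "index" means the register value is scaled by the access size, and the function names and the log2_size parameter are hypothetical.

```c
#include <stdint.h>

/* Byte offset mode: the index register is added to the base as is. */
uint32_t ea_byte_offset(uint32_t base, uint32_t index)
{
    return base + index;
}

/*
 * Index mode (assumed): the index register is scaled by the access size,
 * which is 1 << log2_size bytes (0 = byte, 1 = halfword, 2 = word, ...).
 */
uint32_t ea_index(uint32_t base, uint32_t index, unsigned log2_size)
{
    return base + (index << log2_size);
}
```

With the index mode, accessing element i of an array of 32-bit words needs no separate shift: ea_index(array, i, 2) gives the same address as ea_byte_offset(array, i * 4).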
Note that the instruction count will either stay the same or decrease by one as a result: we're dropping two instructions (IMADD, IMSUB), adding two (XOR_AND, FXOR) or one (XOR_AND with a flag bit that enables/disables the AND portion), and replacing one (ADD with ADD_ROT).
I am no expert, but my gut feeling is that all of the above, except maybe for things marked luxury, should be implementable with under 10% cost (spent mostly on the register update modes), but the benefits are huge: up to 4x speedup for both GP integer and crypto code, and some speedup (likely in excess of the under 10% cost) for much floating-point code as well (due to faster computation of array indices, loop control logic, and such).
Other instruction additions that I had mentioned before may be considered as well, but compared to the changes proposed above they have lower benefit/cost.