by tnt » Tue Aug 04, 2015 7:43 am
Yeah, I'm pretty happy about it.
The big advantages of the Epiphany in this case are:
- Large register file : except for the FFT data load / store, there is no memory access for temporary results. Despite pipelining the loop and processing 4 data points per iteration (2 radix-2 ops in parallel), I only ever use registers, and only the "caller-saved" ones at that, so I don't even need to save/restore them.
- BITR opcode : infinitely useful for this :p
- Easy to predict low-level behavior : because I can understand exactly how the CPU will execute stuff, I can tailor the operations manually much better. Optimizing for ARM (or even worse, Intel) has so many rules to follow that I can't keep them all in my head ...
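To illustrate why a bit-reverse opcode like BITR is so handy here: a radix-2 FFT needs its input (or output) reordered by bit-reversed index. The sketch below is not the code from this post; it is a portable C stand-in, assuming the instruction reverses all 32 bits of a register, so that a single shift afterwards yields the reversed index. The names bitrev32 and fft_bitrev_index are made up for the example.

```c
#include <stdint.h>

/* Portable stand-in for a 32-bit bit-reverse instruction (assumption:
 * this is what BITR does in one op). Swaps progressively larger groups. */
static uint32_t bitrev32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);  /* single bits */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);  /* bit pairs   */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);  /* nibbles     */
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);  /* bytes       */
    return (x >> 16) | (x << 16);                             /* half-words  */
}

/* Hypothetical helper: bit-reversed index for an FFT of 2^log2n points.
 * Reverses only the low log2n bits of i; valid for 1 <= log2n <= 32. */
static uint32_t fft_bitrev_index(uint32_t i, unsigned log2n)
{
    return bitrev32(i) >> (32u - log2n);
}
```

For an 8-point FFT (log2n = 3), index 1 (001) maps to 4 (100) and index 3 (011) maps to 6 (110). With a hardware bit-reverse, the five mask-and-shift steps in bitrev32 collapse into one instruction, which is where the saving comes from.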
Next step will probably be to extend this to larger FFTs using multiple cores. (The current one is local memory only, so you can do at most 2048 points, but more realistically 1024 when using double-buffering.)
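A rough back-of-the-envelope check of those point limits, as a sketch only. It assumes 32 KB of local memory per core and single-precision complex samples (8 bytes per point); neither figure is stated in the post, and the constant and function names are made up for the example.

```c
#include <stddef.h>

/* Assumptions, not taken from the post: 32 KB of local memory per core,
 * and one complex sample = two 32-bit floats = 8 bytes. */
enum { LOCAL_MEM_BYTES = 32 * 1024, BYTES_PER_POINT = 8 };

/* Bytes needed for the FFT data alone: `buffers` copies of `points` samples.
 * Code, twiddle factors and stack must fit in whatever is left over. */
static size_t fft_buffer_bytes(size_t points, size_t buffers)
{
    return points * buffers * BYTES_PER_POINT;
}
```

Under these assumptions, fft_buffer_bytes(2048, 1) and fft_buffer_bytes(1024, 2) both come to 16 KB, i.e. half of the local memory, while fft_buffer_bytes(2048, 2) is the full 32 KB and leaves nothing for code or twiddles. That is consistent with 2048 points single-buffered and 1024 points double-buffered being the practical limits.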