Very Fast Fourrier Transform

Very Fast Fourrier Transform

Postby tnt » Mon Aug 03, 2015 8:37 pm

Hi,

Last month I spent some time writing some FFT code for the epiphany.

You can find the resulting code here : https://github.com/smunaut/parallella-e ... _fft_asm.S

This is a classic Decimation-In-Time implementation. The output is out-of-order and needs to be re-ordered (but for some applications, including mine, this can be merged with the next step without overhead). I might take a look if a Decimation-In-Frequency implementation works better/worse on the epiphany ... at the moment I just took the one in the lena example as reference and re-implemented it in assembly.

Every loop has been hand assembled to reach maximum speed. Nearly every cycle is dual-issued, doing a floating point ops at the same time as a integer/store/load ops. It also uses the HW loop feature for zero overhead inner loops. Each instruction has been scheduled to minimize stalls. The loop has also been "pipelined", so at iteration i, the loop does the store or result i-1, the computation of i, and the pre-load of data of i+1, all those interlaced to maximize dual-issuing. You also have to "load" and "flush" the loop pipeline. And also the first and last stage of the FFT decomposition are degenerate and handled in dedicated functions.

This can make the code pretty hard to follow :p

And finally here are the numbers :

Code: Select all
fftw         : 221.977509 Mflops
fftw NEON    : 409.619659 Mflops
epiphany C   : 170.642029 Mflops
epiphany ASM : 668.507629 Mflops


All of theses are for a single core (so one ARM core or one epiphany core), and they dont take memory transfer into account (so for ARM it's working out of cache and epiphany out of local mem).

I thought this was pretty much as good as it was going to get, but turns out there is a commercial lib of optimized primitives for the epiphany that has a FFT that's about 10% faster than this, using Radix-4 instead of Radix-2. I might at some point see if I can do a Radix-4 version, but for the time being it's good enough for me.

Cheers,

Sylvain
tnt
 
Posts: 408
Joined: Mon Dec 17, 2012 3:21 am

Re: Very Fast Fourrier Transform

Postby aolofsson » Mon Aug 03, 2015 9:13 pm

So your routine on a simple scalar Epiphany core at 600MHz runs 39% faster than FFTW running on a 4-way A9 core at 667MHz?
I'd say that's pretty darn impressive! :D

Andreas
User avatar
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Very Fast Fourrier Transform

Postby tnt » Tue Aug 04, 2015 7:43 am

Yeah, I'm pretty happy about it.

The big advantage of the epiphany in this case are:
- Large register file : except for fft data load / store, there is no memory access for temporary results. Despite having loop pipelining and processing 4 data per loop iteration (2 radix-2 ops in //), I only ever use registers, and even only the "caller saver" registers so I don't even need to save/restore them.
- BITR opcode : infinitely useful for this :p
- Easy to predict low level behavior: Because I can understand exactly how the CPU will execute stuff, I can tailor the operations manually much better. Optimizing for ARM (or even worse Intel) has so many rules to follow that I can't keep them all in my head ...

Next step will probably be to extend this for higher point FFTs using multiple cores. (The current one is local mem only, so you can do at most 2048 points, but more realistically 1024 when using double-buffering)
tnt
 
Posts: 408
Joined: Mon Dec 17, 2012 3:21 am

Re: Very Fast Fourrier Transform

Postby aolofsson » Tue Aug 04, 2015 12:26 pm

That's great to hear! Look forward to your inputs in the following topic.

viewtopic.php?f=23&t=3127
User avatar
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA


Return to Assembly

Who is online

Users browsing this forum: No registered users and 1 guest