Subobtimal code for 'complex float' type

Discussion about Parallella (and Epiphany) Software Development

Moderators: amylaar, jeremybennett, simoncook

Subobtimal code for 'complex float' type

Postby tnt » Wed Jan 09, 2013 10:13 am

Hi,

I was playing around with a bit of C to get a feel how the compiler behaved and how to code well for it.
(given I'm trying to minimize the size, I came to the conclusion trying to re-use libs was a bad idea, I need optimized version)

I'm working on SDR, so complex number processing is very importatn.

Here's a test function:


Code: Select all
#include <complex.h>

void test_complex(complex float *in, complex float *out, unsigned int n, complex float r)
{
        int i;

        for (i=0; i<n; i++) {
                out[i] = in[i] * r;
                r *= r;
        }
}


Compiled and dumped using

Code: Select all
e-gcc test.c -c -ffast-math -g -O3 -mfp-mode=round-nearest -mshort-calls
e-objdump -d test.o


And gives

Code: Select all
   0:   74dc 0400    str r3,[sp,+0x1]
   4:   883b 2000    sub r12,r2,0
   8:   545c 0400    str r2,[sp]
   c:   74dc 0400    str r3,[sp,+0x1]
  10:   954c 2400    ldr r12,[sp,+0x2]
  14:   e00b 4002    mov r23,0x0
  18:   200b 4002    mov r17,0x0
  1c:   2800         beq 6c <_test_complex+0x6c>
  1e:   1c7f 4806    lsl r16,r23,0x3
  22:   edaf 4007    fmul r23,r3,r3
  26:   a049 4100    ldr r21,[r0,+r16]
  2a:   8e2f 4087    fmul r20,r3,r12
  2e:   401f 410a    add r18,r0,r16
  32:   c8cc 4800    ldr r22,[r18,+0x1]
  36:   6eaf 4107    fmul r19,r3,r21
  3a:   041f 610a    add r24,r1,r16
  3e:   249b 4800    add r17,r17,1
  42:   4f2f 4107    fmul r18,r3,r22
  46:   653f 080a    sub r3,r17,r2
  4a:   7cef 0802    mov r3,r23
  4e:   e4ef 4802    mov r23,r17
  52:   734f 4507    fmsub r19,r12,r22
  56:   724f 0487    fmsub r3,r12,r12
  5a:   52bf 4507    fmadd r18,r12,r21
  5e:   920f 2907    fadd r12,r20,r20
  62:   6459 4100    str r19,[r1,+r16]
  66:   40dc 4c00    str r18,[r24,+0x1]
  6a:   da10         bne 1e <_test_complex+0x1e>
  6c:   194f 0402    rts


Now what's really annoying me here is the load and store operations.

Complex are two consecutive floats, so why wouldn't it use the ldrd and strd ?

Right now, I think those instructions :

Code: Select all
  26:   a049 4100    ldr r21,[r0,+r16]
  2e:   401f 410a    add r18,r0,r16
  32:   c8cc 4800    ldr r22,[r18,+0x1]
  3a:   041f 610a    add r24,r1,r16
  62:   6459 4100    str r19,[r1,+r16]
  66:   40dc 4c00    str r18,[r24,+0x1]


could be replaced by a couple of ldrd and strd with index (so that's 2 instruction instead of 6 and 8 bytes instead of 24 bytes).
That's a 15 % code reduction just because of this (and when trying to fit code in 8k, it starts to matter).

Cheers,

Sylvain
tnt
 
Posts: 408
Joined: Mon Dec 17, 2012 3:21 am

Re: Subobtimal code for 'complex float' type

Postby ysapir » Wed Jan 09, 2013 10:01 pm

@tnt,

This suboptimal issue was already identified and is in the to-do list.
User avatar
ysapir
 
Posts: 393
Joined: Tue Dec 11, 2012 7:05 pm

Re: Subobtimal code for 'complex float' type

Postby tnt » Thu Jan 10, 2013 12:12 am

Ok, good to know.

In the mean time I tried working around it using inline asm for the load / store, but unfortunately, trying to pass complex argument to inline asm just triggers a segfault / internal compiler error ...
tnt
 
Posts: 408
Joined: Mon Dec 17, 2012 3:21 am

Re: Subobtimal code for 'complex float' type

Postby jeremybennett » Thu Jan 10, 2013 11:09 am

I discussed this at some length with Joern Rennecke (who is responsible for the Epiphany compiler). I'll let him reply with a detailed explanation - there is too much risk of me getting the details wrong. However in summary, the underlying problem is with upstream GCC, which effectively insists on treating complex numbers as a pair of floats, 4 byte aligned.

This is something we want to fix, but it is a major piece of work, which will take some time. Although we would incorporate it first in GCC here, we would also want it included in upstream GCC as well (it will help other architectures, such as SH), which would have to wait until GCC 4.9 in 2014.

Joern has some suggestions for a workaround in the short term, which I'll let him explain.

HTH,


Jeremy
User avatar
jeremybennett
 
Posts: 61
Joined: Mon Dec 17, 2012 9:06 am

Re: Subobtimal code for 'complex float' type

Postby amylaar » Thu Jan 10, 2013 3:26 pm

As Jeremy already stated, you don't get 64 bit memory accesses for your code because of the missing alignment
of the complex type.
It is relatively straightforward to change your example code to use a union of complex float and long long, so that
the alignment is provided. However, we then run into issues with the tree optimizers decomposing the load/store
operations into two 32 bit operations each. I've filed a bug for this: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55923 .
Now, you can side-step this issue by using asms to join 64 bit load/stores with complex float operations. Except
that this doesn't work in gcc 4.6 / 4.7 due to an internal compiler error. This was also broken in gcc 4.8; I reported
this bug, and Jakub Jelinek has provided a fix, which I have just merged/pushed into our epiphany-gcc.git repository .

Now, the following code variant gets you 64 bit loads and stores:

Code: Select all
typedef union
{
  complex float cf;
  long long ll;
} ucf;
   
void test_complex2(ucf *in, ucf *out, unsigned int n, complex float r)
{
        int i;
        ucf ucf1;
        complex float cf;

        for (i=0; i<n; i++) {
                ucf1 = in[i];
                __asm ("": "=r" (cf): "0" (ucf1));
                cf *= r;
                __asm ("": "=r" (ucf1): "0" (cf));
                out[i] = ucf1;
                r *= r;
        }
}
amylaar
 
Posts: 31
Joined: Thu Jan 10, 2013 3:06 pm

Re: Subobtimal code for 'complex float' type

Postby tnt » Thu Jan 10, 2013 6:29 pm

Great thanks, works nicely. I had tried some ASM tricks and played with union and __attribute__((align)), but only hit errors ...

I've cleaned it up a little and wrapped the "hack" into accessors:

Code: Select all
#include <complex.h>

typedef union
{
        complex float cf;
        long long ll;
} ucf_t;

static inline complex float
cxf_read(complex float *p)
{
        ucf_t ucf;
        complex float cf;

        ucf = *((ucf_t *)p);
        __asm ("": "=r" (cf): "0" (ucf));

        return cf;
}

static inline void
cxf_write(complex float *p, complex float v)
{
        ucf_t ucf;

        __asm ("": "=r" (ucf): "0" (v));
        *((ucf_t *)p) = ucf;
}


void test_complex(complex float *in, complex float *out, unsigned int n, complex float r)
{
        do {
                cxf_write(out++, cxf_read(in++) * r);
                r *= r;
        } while (--n);
}


which compiles nicely into

Code: Select all
   0:   954c 2400    ldr r12,[sp,+0x2]
   4:   74dc 0400    str r3,[sp,+0x1]
   8:   8cef 4002    mov r20,r3
   c:   545c 0400    str r2,[sp]
  10:   74dc 0400    str r3,[sp,+0x1]
  14:   40ec 4200    ldrd r18,[r0],+0x1
  18:   d22f 4907    fmul r22,r20,r20
  1c:   68ef 0802    mov r3,r18
  20:   4a2f 4907    fmul r18,r18,r20
  24:   b22f 4507    fmul r21,r12,r20
  28:   48b3         sub r2,r2,1
  2a:   6e2f 0087    fmul r3,r3,r12
  2e:   4e4f 4887    fmsub r18,r19,r12
  32:   6e3f 0907    fmadd r3,r19,r20
  36:   98ef 4802    mov r20,r22
  3a:   924f 4487    fmsub r20,r12,r12
  3e:   968f 2907    fadd r12,r21,r21
  42:   6cef 4002    mov r19,r3
  46:   44fc 4200    strd r18,[r1],+0x1
  4a:   e510         bne 14 <_test_complex+0x14>
  4c:   194f 0402    rts


30% smaller than the original code :P
tnt
 
Posts: 408
Joined: Mon Dec 17, 2012 3:21 am

Re: Subobtimal code for 'complex float' type

Postby tnt » Sat Jan 12, 2013 10:10 am

Unfortunately, the register allocator is still pretty dumb and doesn't allocate contiguous registers for those values ... so you end up with a bunch of 'mov' in front of the strd. And most of the time the optimizer could have noticed those moves are not necessary and just replaced the destination registers of earlier opcodes.

Code: Select all
ldrd r2,[r19],+0x1
fmul r18,r12,r0
fmul r17,r2,r12
sub r20,r20,1
fmul r2,r2,r16
fmul r21,r16,r0
fmsub r18,r16,r1
fmsub r17,r3,r16
fmadd r2,r3,r12
mov r16,r21
fmadd r16,r12,r1
mov r12,r18
mov r5,r2
mov r3,r5
mov r2,r17
strd r2,[r19,-0x2]


And this is why I would have expected:

Code: Select all
ldrd r2,[r19],+0x1
fmul r18,r12,r0
fmul r17,r2,r12
sub r20,r20,1
fmul r2,r2,r16
fmul r21,r16,r0
fmsub r18,r16,r1
fmsub r2,r3,r16  /* Changed destination to r2 directly since it's not used after */
fmadd r3,r3,r12 /* Changed destination to r3 directly since it's not used after */
mov r16,r21
fmadd r16,r12,r1
mov r12,r18
strd r2,[r19,-0x2]
tnt
 
Posts: 408
Joined: Mon Dec 17, 2012 3:21 am


Return to Programming Q & A

Who is online

Users browsing this forum: No registered users and 5 guests