I was playing around with a bit of C to get a feel for how the compiler behaves and how to code well for it.

(Given that I'm trying to minimize code size, I came to the conclusion that re-using libs was a bad idea; I need optimized versions.)

I'm working on SDR, so complex number processing is very important.

Here's a test function:

```c
#include <complex.h>

void test_complex(complex float *in, complex float *out, unsigned int n, complex float r)
{
    int i;

    for (i=0; i<n; i++) {
        out[i] = in[i] * r;
        r *= r;
    }
}
```

Compiled and dumped using:

```
e-gcc test.c -c -ffast-math -g -O3 -mfp-mode=round-nearest -mshort-calls
e-objdump -d test.o
```

This gives:

```
 0: 74dc 0400 str r3,[sp,+0x1]
 4: 883b 2000 sub r12,r2,0
 8: 545c 0400 str r2,[sp]
 c: 74dc 0400 str r3,[sp,+0x1]
10: 954c 2400 ldr r12,[sp,+0x2]
14: e00b 4002 mov r23,0x0
18: 200b 4002 mov r17,0x0
1c: 2800      beq 6c <_test_complex+0x6c>
1e: 1c7f 4806 lsl r16,r23,0x3
22: edaf 4007 fmul r23,r3,r3
26: a049 4100 ldr r21,[r0,+r16]
2a: 8e2f 4087 fmul r20,r3,r12
2e: 401f 410a add r18,r0,r16
32: c8cc 4800 ldr r22,[r18,+0x1]
36: 6eaf 4107 fmul r19,r3,r21
3a: 041f 610a add r24,r1,r16
3e: 249b 4800 add r17,r17,1
42: 4f2f 4107 fmul r18,r3,r22
46: 653f 080a sub r3,r17,r2
4a: 7cef 0802 mov r3,r23
4e: e4ef 4802 mov r23,r17
52: 734f 4507 fmsub r19,r12,r22
56: 724f 0487 fmsub r3,r12,r12
5a: 52bf 4507 fmadd r18,r12,r21
5e: 920f 2907 fadd r12,r20,r20
62: 6459 4100 str r19,[r1,+r16]
66: 40dc 4c00 str r18,[r24,+0x1]
6a: da10      bne 1e <_test_complex+0x1e>
6c: 194f 0402 rts
```
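To make the listing easier to follow, here is a rough sketch of the same loop with the real and imaginary parts written out by hand. The function name `test_complex_expanded` and its flat `float` array interface are hypothetical, made up for illustration. As far as I can tell, the two `fmul`/`fmsub` and `fmul`/`fmadd` pairs in the dump correspond to the two output products, and the `fmul`/`fadd` pair to the imaginary part of `r * r`.

```c
/* Hypothetical helper (not part of the original test): the same loop
 * with explicit real/imaginary arithmetic. Samples are stored as
 * interleaved floats: re0, im0, re1, im1, ... */
void test_complex_expanded(const float *in, float *out, unsigned int n,
                           float r_re, float r_im)
{
    unsigned int i;

    for (i = 0; i < n; i++) {
        float a = in[2 * i];       /* real part of in[i] */
        float b = in[2 * i + 1];   /* imaginary part of in[i] */
        float t, new_re;

        /* out[i] = in[i] * r */
        out[2 * i]     = a * r_re - b * r_im;
        out[2 * i + 1] = a * r_im + b * r_re;

        /* r *= r: (re + i*im)^2 = (re^2 - im^2) + i*(2*re*im) */
        t      = r_re * r_im;
        new_re = r_re * r_re - r_im * r_im;
        r_im   = t + t;            /* 2*re*im computed as a sum */
        r_re   = new_re;
    }
}
```

Written this way, it is clear that the only memory traffic per iteration is two consecutive floats in and two consecutive floats out.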

What's really annoying me here is the load and store operations.

A complex float is two consecutive floats, so why doesn't the compiler use ldrd and strd?

Right now, I think these instructions:

```
26: a049 4100 ldr r21,[r0,+r16]
2e: 401f 410a add r18,r0,r16
32: c8cc 4800 ldr r22,[r18,+0x1]
3a: 041f 610a add r24,r1,r16
62: 6459 4100 str r19,[r1,+r16]
66: 40dc 4c00 str r18,[r24,+0x1]
```

could be replaced by one ldrd and one strd with an index register (so that's 2 instructions instead of 6, and 8 bytes instead of 24 bytes).

That's 16 of the function's 112 bytes, so close to a 15 % code reduction just because of this (and when trying to fit code in 8k, it starts to matter).
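One experiment that might be worth trying is to move each sample through a single 64-bit temporary, so the compiler at least sees one 8-byte access per sample. This is only a sketch: the name `test_complex_dw` is made up, and I have not checked whether e-gcc turns it into ldrd/strd. One possible reason it does not do so by default is alignment: C only guarantees that a `complex float` is aligned like a `float`, and if ldrd/strd need 8-byte-aligned addresses (which I believe is the case), the compiler cannot assume that for an arbitrary pointer.

```c
#include <complex.h>
#include <stdint.h>
#include <string.h>

/* The trick below relies on a complex float being exactly 8 bytes. */
_Static_assert(sizeof(complex float) == sizeof(uint64_t),
               "complex float is expected to be 8 bytes");

/* Hypothetical variant of test_complex: each sample is read and
 * written as one 64-bit unit. Whether this yields ldrd/strd on
 * Epiphany is an untested assumption. */
void test_complex_dw(const complex float *in, complex float *out,
                     unsigned int n, complex float r)
{
    unsigned int i;

    for (i = 0; i < n; i++) {
        uint64_t raw;
        complex float x, y;

        memcpy(&raw, &in[i], sizeof raw);    /* one 8-byte read */
        memcpy(&x, &raw, sizeof x);          /* reinterpret as complex */

        y = x * r;

        memcpy(&raw, &y, sizeof raw);
        memcpy(&out[i], &raw, sizeof raw);   /* one 8-byte write */

        r *= r;
    }
}
```

The fixed-size `memcpy` calls are normally optimized into plain register moves at `-O3`, so this should not cost extra code, but the disassembly is the only real way to tell.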

Cheers,

Sylvain