Page 1 of 2

Example: Reducing run time by moving code to internal memory

PostPosted: Tue Jun 18, 2013 6:46 pm
by Dr.BeauWebber
When compiling aplc programs for the Epiphany cores, it is possible to reduce run time significantly by moving code to internal memory.

Consider a simple memory intensive program :
Create a large interger array (shape: 7x8x9x10) - 5040 32bit integers,
and transpose it, to get a matrix with shape : 10x9x8x7 :
transpose_7.apl :
Code: Select all
V .is 7 8 9 10
V
N .is  .times / V
N
A .is V .rho .iota N
.rho A
T .is .transpose A
.rho T

Code: Select all
$ aplcc transpose_7.apl
$ ./a.exe
 7 8 9 10
 5040
 7 8 9 10
 10 9 8 7

Running this on an i7 processor (with timing statements added) :
Code: Select all
$  aplcc transpose_7_jts.apl

$ ./a.exe
 7 8 9 10
 5040
 7 8 9 10
 10 9 8 7
 0.0009999275

So about 1ms.

using the e-run simulator :
Code: Select all
$ ./e-aplcc transpose_7_jts.c

$ /cygdrive/d/home/jbww/Src/Git/Parallella/INSTALL/bin/e-run a.out
 7 8 9 10
 5040
 7 8 9 10
 10 9 8 7
 0.03300214

So about 33ms.

If we run an edited version (so we write to an output buffer), on the Epiphany cores :
Code: Select all
$ ./buildv11.sh

$ ./runv5.sh
  3: Message from eCore 0x888 ( 2, 0): " 7 8 9 10
 5040
 7 8 9 10
 10 9 8 7
"

etc.
which takes about 500ms to 1s.

On the Parallella Arm, we can use gprof to find out in which routines the most time is being spent :
To get timings we need it to run for a longer time, so we use a 5x6x7x8x9x10x11 array :
Code: Select all
$ aplcc transpose_11_jts.apl

$ gcc -pg -I ../include -L ../lib transpose_11_jts.c -laplc -lm
linaro@linaro-ubuntu-desktop:~/Src/Arm/aplc-6.17b/Work$ ./a.out
 5 6 7 8 9 10 11
 1663200
 5 6 7 8 9 10 11
 11 10 9 8 7 6 5
 0.82039

and then use gprof :
Code: Select all
linaro@linaro-ubuntu-desktop:~/Src/Arm/aplc-6.17b/Work$ gprof -b a.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 46.25      0.37     0.37                             __divsi3
 30.00      0.61     0.24                             main
 15.63      0.74     0.13                             aplc_vsize
  7.50      0.80     0.06                             __aeabi_idivmod
  0.63      0.80     0.01                             aplc_vectrealloc

So we copy ELDF=${ESDK}/bsps/current/aplc.ldf to ELDF=${ESDK}/bsps/current/aplc_transp.ldf
and modify it so
divsi3 (which is in the newlib library) is transferred to internal memory.
aplc_vsize is slightly trickier, it is in the aplc library routine runmem,c, which in its entirety is too large for the remaining space in the internal memory, so we spit runmen.c into two routines, one of which just contains aplc_vsize, and put the latter in internal memory.
As yet I am not sure where aeabi_idivmod is located.
Code: Select all
  8: Message from eCore 0x80a ( 0, 2): " 7 8 9 10
 5040
 7 8 9 10
 10 9 8 7
"

This now completes in about 100ms.


So a distinct speed-up.

Re: Example: Reducing run time by moving code to internal me

PostPosted: Wed Jun 19, 2013 6:52 am
by timpart
You mentioned wanting to track down where a routine comes from. The linker e-ld has an option to print out which input file a symbol is in. Use -y or --symbol-trace followed by the symbol name (you may need a leading underscore on the name). Doesn't seem to pick up references from libraries though.

Alternatively the -M map option on the linker might give you a clue as ou could see what is close to that routine and there might be another name you recognise.

Tim

Re: Example: Reducing run time by moving code to internal me

PostPosted: Wed Jun 19, 2013 1:14 pm
by ysapir
"e-objdump -D" will list a disassembly of the binary image. Doing "e-objdump -D file.e | grep '>:'" is what I use to see objects placement.

Can you please explain what "matrix with shape : 10x9x8x7" means? Is it a 4-dimensional array?

In your exercise, do you yry to invert the matrix (i.e., find its reciprocal) or to transpose it?

Re: Example: Reducing run time by moving code to internal me

PostPosted: Wed Jun 19, 2013 2:31 pm
by Dr.BeauWebber
Thanks Tim,
Nice to know these things.

Done more searching and it turns out that aeabi_idivmod is relevant to the linux version of the gcc compilations, but not the e-gcc compilations.

it would be nice to profile the code generated by e-gcc, but when I compile with -pg I get :
/opt/adapteva/esdk/tools/e-gnu.arm7l/bin/../lib/gcc/epiphany-elf/4.7.0/../../../../epiphany-elf/bin/ld: cannot find -lc_p
I have tried manually searching for a library libc_p.a or whatever and not found it.

Re: Example: Reducing run time by moving code to internal me

PostPosted: Wed Jun 19, 2013 2:42 pm
by Dr.BeauWebber
ysapir wrote:"e-objdump -D" will list a disassembly of the binary image. Doing "e-objdump -D file.e | grep '>:'" is what I use to see objects placement.

Thanks, I will have a look.
Can you please explain what "matrix with shape : 10x9x8x7" means? Is it a 4-dimensional array?

Yes it is.
In your exercise, do you yry to invert the matrix (i.e., find its reciprocal) or to transpose it?

This is a simple transpose. I wanted an example that would complete without having possible pathological cases.
So it is effectively just a data moving / structure changing example
The previous Matrix example : http://forums.parallella.org/viewtopic.php?f=37&t=372 finished with a matrix solve, closely related to an inverse.

Re: Example: Reducing run time by moving code to internal me

PostPosted: Wed Jun 19, 2013 3:05 pm
by ysapir
Interesting. Everybody knows how to transpose a 2-D matrix. I can easily imagine how to transpose a 3-D array (rotate the cuboid around the principal diagonal - but even then, I think there are 2 options for the amount of rotation - 120 degrees and 240 degrees). I have hard time imagining how to transpose a 4-D one...

Re: Example: Reducing run time by moving code to internal me

PostPosted: Wed Jun 19, 2013 3:10 pm
by ysapir
Dr.BeauWebber wrote:it would be nice to profile the code generated by e-gcc, but when I compile with -pg I get...


e-gcc does not support profiling at this time (there is no gprof support). Trying to profile, or measure time of the e-run runs is meaningless, as this is a functional simulator, not a cycle accurate one.

Re: Example: Reducing run time by moving code to internal me

PostPosted: Wed Jun 19, 2013 9:17 pm
by Dr.BeauWebber
ysapir wrote:Interesting. Everybody knows how to transpose a 2-D matrix. I can easily imagine how to transpose a 3-D array (rotate the cuboid around the principal diagonal - but even then, I think there are 2 options for the amount of rotation - 120 degrees and 240 degrees). I have hard time imagining how to transpose a 4-D one...

Indeed. Apl has various options for selecting some of these variants, but I get lost too.

Re: Example: Reducing run time by moving code to internal me

PostPosted: Wed Jun 19, 2013 9:22 pm
by Dr.BeauWebber
ysapir wrote:
Dr.BeauWebber wrote:it would be nice to profile the code generated by e-gcc, but when I compile with -pg I get...

this is a functional simulator, not a cycle accurate one.

Ah, another point I was incorrect on.

Re: Example: Reducing run time by moving code to internal me

PostPosted: Thu Jun 20, 2013 12:19 am
by ysapir
Dr.BeauWebber wrote:Ah, another point I was incorrect on.


Oh, well, as they say - RTFM (chapter 7) ;)

EDIT: *** I guess this is a little inappropriate. Please read as RTM ;) ***