Example: Reducing run time by moving code to internal memory

Moderator: Dr.BeauWebber

Example: Reducing run time by moving code to internal memory

Postby Dr.BeauWebber » Tue Jun 18, 2013 6:46 pm

When compiling aplc programs for the Epiphany cores, it is possible to reduce run time significantly by moving code to internal memory.

Consider a simple memory intensive program :
Create a large interger array (shape: 7x8x9x10) - 5040 32bit integers,
and transpose it, to get a matrix with shape : 10x9x8x7 :
transpose_7.apl :
Code: Select all
V .is 7 8 9 10
V
N .is  .times / V
N
A .is V .rho .iota N
.rho A
T .is .transpose A
.rho T

Code: Select all
$ aplcc transpose_7.apl
$ ./a.exe
 7 8 9 10
 5040
 7 8 9 10
 10 9 8 7

Running this on an i7 processor (with timing statements added) :
Code: Select all
$  aplcc transpose_7_jts.apl

$ ./a.exe
 7 8 9 10
 5040
 7 8 9 10
 10 9 8 7
 0.0009999275

So about 1ms.

using the e-run simulator :
Code: Select all
$ ./e-aplcc transpose_7_jts.c

$ /cygdrive/d/home/jbww/Src/Git/Parallella/INSTALL/bin/e-run a.out
 7 8 9 10
 5040
 7 8 9 10
 10 9 8 7
 0.03300214

So about 33ms.

If we run an edited version (so we write to an output buffer), on the Epiphany cores :
Code: Select all
$ ./buildv11.sh

$ ./runv5.sh
  3: Message from eCore 0x888 ( 2, 0): " 7 8 9 10
 5040
 7 8 9 10
 10 9 8 7
"

etc.
which takes about 500ms to 1s.

On the Parallella Arm, we can use gprof to find out in which routines the most time is being spent :
To get timings we need it to run for a longer time, so we use a 5x6x7x8x9x10x11 array :
Code: Select all
$ aplcc transpose_11_jts.apl

$ gcc -pg -I ../include -L ../lib transpose_11_jts.c -laplc -lm
linaro@linaro-ubuntu-desktop:~/Src/Arm/aplc-6.17b/Work$ ./a.out
 5 6 7 8 9 10 11
 1663200
 5 6 7 8 9 10 11
 11 10 9 8 7 6 5
 0.82039

and then use gprof :
Code: Select all
linaro@linaro-ubuntu-desktop:~/Src/Arm/aplc-6.17b/Work$ gprof -b a.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 46.25      0.37     0.37                             __divsi3
 30.00      0.61     0.24                             main
 15.63      0.74     0.13                             aplc_vsize
  7.50      0.80     0.06                             __aeabi_idivmod
  0.63      0.80     0.01                             aplc_vectrealloc

So we copy ELDF=${ESDK}/bsps/current/aplc.ldf to ELDF=${ESDK}/bsps/current/aplc_transp.ldf
and modify it so
divsi3 (which is in the newlib library) is transferred to internal memory.
aplc_vsize is slightly trickier, it is in the aplc library routine runmem,c, which in its entirety is too large for the remaining space in the internal memory, so we spit runmen.c into two routines, one of which just contains aplc_vsize, and put the latter in internal memory.
As yet I am not sure where aeabi_idivmod is located.
Code: Select all
  8: Message from eCore 0x80a ( 0, 2): " 7 8 9 10
 5040
 7 8 9 10
 10 9 8 7
"

This now completes in about 100ms.


So a distinct speed-up.
User avatar
Dr.BeauWebber
 
Posts: 114
Joined: Mon Dec 17, 2012 4:01 am
Location: England

Re: Example: Reducing run time by moving code to internal me

Postby timpart » Wed Jun 19, 2013 6:52 am

You mentioned wanting to track down where a routine comes from. The linker e-ld has an option to print out which input file a symbol is in. Use -y or --symbol-trace followed by the symbol name (you may need a leading underscore on the name). Doesn't seem to pick up references from libraries though.

Alternatively the -M map option on the linker might give you a clue as ou could see what is close to that routine and there might be another name you recognise.

Tim
timpart
 
Posts: 302
Joined: Mon Dec 17, 2012 3:25 am
Location: UK

Re: Example: Reducing run time by moving code to internal me

Postby ysapir » Wed Jun 19, 2013 1:14 pm

"e-objdump -D" will list a disassembly of the binary image. Doing "e-objdump -D file.e | grep '>:'" is what I use to see objects placement.

Can you please explain what "matrix with shape : 10x9x8x7" means? Is it a 4-dimensional array?

In your exercise, do you yry to invert the matrix (i.e., find its reciprocal) or to transpose it?
User avatar
ysapir
 
Posts: 393
Joined: Tue Dec 11, 2012 7:05 pm

Re: Example: Reducing run time by moving code to internal me

Postby Dr.BeauWebber » Wed Jun 19, 2013 2:31 pm

Thanks Tim,
Nice to know these things.

Done more searching and it turns out that aeabi_idivmod is relevant to the linux version of the gcc compilations, but not the e-gcc compilations.

it would be nice to profile the code generated by e-gcc, but when I compile with -pg I get :
/opt/adapteva/esdk/tools/e-gnu.arm7l/bin/../lib/gcc/epiphany-elf/4.7.0/../../../../epiphany-elf/bin/ld: cannot find -lc_p
I have tried manually searching for a library libc_p.a or whatever and not found it.
User avatar
Dr.BeauWebber
 
Posts: 114
Joined: Mon Dec 17, 2012 4:01 am
Location: England

Re: Example: Reducing run time by moving code to internal me

Postby Dr.BeauWebber » Wed Jun 19, 2013 2:42 pm

ysapir wrote:"e-objdump -D" will list a disassembly of the binary image. Doing "e-objdump -D file.e | grep '>:'" is what I use to see objects placement.

Thanks, I will have a look.
Can you please explain what "matrix with shape : 10x9x8x7" means? Is it a 4-dimensional array?

Yes it is.
In your exercise, do you yry to invert the matrix (i.e., find its reciprocal) or to transpose it?

This is a simple transpose. I wanted an example that would complete without having possible pathological cases.
So it is effectively just a data moving / structure changing example
The previous Matrix example : http://forums.parallella.org/viewtopic.php?f=37&t=372 finished with a matrix solve, closely related to an inverse.
User avatar
Dr.BeauWebber
 
Posts: 114
Joined: Mon Dec 17, 2012 4:01 am
Location: England

Re: Example: Reducing run time by moving code to internal me

Postby ysapir » Wed Jun 19, 2013 3:05 pm

Interesting. Everybody knows how to transpose a 2-D matrix. I can easily imagine how to transpose a 3-D array (rotate the cuboid around the principal diagonal - but even then, I think there are 2 options for the amount of rotation - 120 degrees and 240 degrees). I have hard time imagining how to transpose a 4-D one...
User avatar
ysapir
 
Posts: 393
Joined: Tue Dec 11, 2012 7:05 pm

Re: Example: Reducing run time by moving code to internal me

Postby ysapir » Wed Jun 19, 2013 3:10 pm

Dr.BeauWebber wrote:it would be nice to profile the code generated by e-gcc, but when I compile with -pg I get...


e-gcc does not support profiling at this time (there is no gprof support). Trying to profile, or measure time of the e-run runs is meaningless, as this is a functional simulator, not a cycle accurate one.
User avatar
ysapir
 
Posts: 393
Joined: Tue Dec 11, 2012 7:05 pm

Re: Example: Reducing run time by moving code to internal me

Postby Dr.BeauWebber » Wed Jun 19, 2013 9:17 pm

ysapir wrote:Interesting. Everybody knows how to transpose a 2-D matrix. I can easily imagine how to transpose a 3-D array (rotate the cuboid around the principal diagonal - but even then, I think there are 2 options for the amount of rotation - 120 degrees and 240 degrees). I have hard time imagining how to transpose a 4-D one...

Indeed. Apl has various options for selecting some of these variants, but I get lost too.
User avatar
Dr.BeauWebber
 
Posts: 114
Joined: Mon Dec 17, 2012 4:01 am
Location: England

Re: Example: Reducing run time by moving code to internal me

Postby Dr.BeauWebber » Wed Jun 19, 2013 9:22 pm

ysapir wrote:
Dr.BeauWebber wrote:it would be nice to profile the code generated by e-gcc, but when I compile with -pg I get...

this is a functional simulator, not a cycle accurate one.

Ah, another point I was incorrect on.
User avatar
Dr.BeauWebber
 
Posts: 114
Joined: Mon Dec 17, 2012 4:01 am
Location: England

Re: Example: Reducing run time by moving code to internal me

Postby ysapir » Thu Jun 20, 2013 12:19 am

Dr.BeauWebber wrote:Ah, another point I was incorrect on.


Oh, well, as they say - RTFM (chapter 7) ;)

EDIT: *** I guess this is a little inappropriate. Please read as RTM ;) ***
User avatar
ysapir
 
Posts: 393
Joined: Tue Dec 11, 2012 7:05 pm

Next

Return to APL

Who is online

Users browsing this forum: No registered users and 1 guest