Example: In-line maths calculation, in internal memory

Moderator: Dr.BeauWebber

Postby Dr.BeauWebber » Thu Jun 20, 2013 9:26 am

As a simple test of what I intend to be an all-internal-memory program, I use an in-line aplc maths calculation:
Generate 10,000 numbers only slightly larger than unity, and multiply them together:
Code: Select all
×/ 1 + 0.000001 ×⍳10000
4.414302827E21

In the aplc ascii notation, with added timing statements :
mathtst_3_jts.apl
Code: Select all
T0 .is #jts
.times / 1 + 0.000001 .times .iota 10000
#jts - T0

The full C program does make library calls, but I believe nearly all the execution time is spent in the following C code:
Code: Select all
for (i5 = 0; i5 < i7; i5++) {
    (res1.r = ((((double) 1) + (r_main[0] * ((double) i17++ ))) * res1.r));
}

Running this on an i7 :
Code: Select all
$ aplcc mathtst_3_jts.apl

$ ./a.exe
 4.414303e+21
 0.001000166

So about 1ms.
In the e-run simulator on an i7 :
Code: Select all
$ ./e-aplcc mathtst_3_jts.c
$ /cygdrive/d/home/jbww/Src/Git/Parallella/INSTALL/bin/e-run a.out
 4.414303e+21
 0.3440189

So about 1/3 s.

On the Epiphany cores (without the timing statements):
Code: Select all
$ ./buildv11.sh
$ ./runv5.sh
  0: Message from eCore 0x8ca ( 3, 2): " 4.414303e+21
"
  1: Message from eCore 0x84b ( 1, 3): " 4.414303e+21
"

etc.
But this requires nearly 10s to complete.
I have to be doing something wrong, or my understanding is faulty.
I am building an example archive showing what I am doing.

Re: Example: In-line maths calculation, in internal memory

Postby Dr.BeauWebber » Thu Jun 20, 2013 4:22 pm

Ah, I was including the libaplc.a library out of habit, and the resulting tar file was too large to upload.
But the .ldf file needs the library .o files, so it is easiest to build the library from the .c files.
So the .tar file is now attached.

Readme.txt :
Code: Select all
# To build and run ./src/mathtst_3_ed :
# J.B.W.Webber@kent.ac.uk 2013-06-20

./buildv11.sh
./runv5.sh

# once the library and buffer .o files are created, the relevant lines in the build script can be commented out.

See what timings you need for it to complete (in src/harness_01.c)
When it does complete you should get :
Code: Select all
  0: Message from eCore 0x8ca ( 3, 2): " 4.414303e+21
"
  1: Message from eCore 0x84b ( 1, 3): " 4.414303e+21
"

etc.

So what am I doing wrong that for me it is taking 10s to run ?

Edit, added :
The delay before the host reads the communication buffer is set in src/harness_01.c : the sleep statement.
Currently it is set for a 10s delay - please set it for shorter delays and re-build and run, to see if it completes with the correct output, on your hardware.

As the Readme says, once you have run it once, you can comment out the two compilation lines that generate the buffer and library .o files, so it will build quicker.
Attachments
skeleton_v4.tar.gz
example maths test - tar file
(221.09 KiB) Downloaded 613 times
Last edited by Dr.BeauWebber on Fri Jun 21, 2013 8:21 am, edited 1 time in total.
Reason: Added more information about testing the code.

Re: Example: In-line maths calculation, in internal memory

Postby shodruk » Sun Jun 23, 2013 8:31 am

Dr.BeauWebber wrote: So what am I doing wrong that for me it is taking 10s to run ?


This generation of Epiphany has no double-precision floating-point hardware, only single-precision.

According to the Epiphany Architecture Reference Manual:

"Double-precision floating-point arithmetic is emulated using software libraries and should be avoided if performance considerations outweigh the need for additional precision."

Re: Example: In-line maths calculation, in internal memory

Postby Hoernchen » Sun Jun 23, 2013 12:16 pm

Also keep in mind that IEEE 754 compliant code will also produce function calls, so if you want performance you might want to stick to one FP mode and drop IEEE 754 compliance.
Code: Select all
-mno-soft-cmpsf -mfp-mode=round-nearest -ffast-math

Re: Example: In-line maths calculation, in internal memory

Postby Dr.BeauWebber » Mon Jun 24, 2013 12:52 pm

OK, I created a modified version of the in-line calculation :
Code: Select all
      ×/ 1 + 10000 ⍴ 0.0001
2.718145927

i.e. in ASCII notation :
Code: Select all
T0 .is #jts
.times / 1 +  10000 .rho  0.0001
#jts - T0

and then changed the double calls to real,
and compiled the c code and libraries using the suggested flags :
Code: Select all
-mno-soft-cmpsf -mfp-mode=round-nearest -ffast-math

This runs fine, but still requires 1 to 5 s to complete :
Code: Select all
  9: Message from eCore 0x808 ( 0, 0): " 2.718597
"
 10: Message from eCore 0x8c8 ( 3, 0): " 2.718597
"
 11: Message from eCore 0x8c9 ( 3, 1): " 2.718597
"
 12: Message from eCore 0x88a ( 2, 2): " 2.718597
"

The double precision version run in the e-run simulator on an i7 takes about 120ms.

Question : can you point me to why
natural e ~= product of 10^N copies of (1 + 10^-N), to about N significant places ?
I know there is a simple relationship between ln and log_10, and I know some series expansions for e, but I still don't find this relationship obvious ....
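A short worked derivation may help here. It is the standard limit argument rather than anything taken from this thread, with n standing for the number of factors (n = 10000 in the example above, each factor being 1 + 1/n):

```latex
\ln\!\left[\left(1+\frac{1}{n}\right)^{n}\right]
  = n\,\ln\!\left(1+\frac{1}{n}\right)
  = n\left(\frac{1}{n}-\frac{1}{2n^{2}}+\frac{1}{3n^{3}}-\cdots\right)
  = 1-\frac{1}{2n}+\frac{1}{3n^{2}}-\cdots

\left(1+\frac{1}{n}\right)^{n}
  = e\cdot\exp\!\left(-\frac{1}{2n}+\cdots\right)
  \approx e\left(1-\frac{1}{2n}\right)
```

For n = 10^4 this gives e(1 - 5×10^-5) ≈ 2.718146, in agreement with the 2.718145927 computed above. The relative error is about 1/(2n), so choosing n as a power of ten, n = 10^k, gives roughly k correct significant digits; log_10 enters only through that choice of n.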
Last edited by Dr.BeauWebber on Mon Jun 24, 2013 1:05 pm, edited 1 time in total.
Reason: add information (e-run timing)

Re: Example: In-line maths calculation, in internal memory

Postby aolofsson » Mon Jun 24, 2013 3:14 pm

More often than not, when things run really slowly, it's because code is being fetched out of external memory. Any PC address above 0x000xxxxx is going to slow things down by >100X.

Can you try this just to make sure all the critical code is running out of local memory?

1.) e-objdump -D "your_epiphany.elf" > dump (tar and attach)
2.) e-run --trace "your_epiphany.elf" > trace (tar and attach)

Thanks,
Andreas

Re: Example: In-line maths calculation, in internal memory

Postby Dr.BeauWebber » Mon Jun 24, 2013 7:40 pm

Ah, thanks.
Ben has asked to reclaim access to their hardware, but I assume I can do this test on other hosts :
Can I just do the objdump and trace from a simple a.out file, as opposed to creating the actual Epiphany .elf files ?
I certainly get assembly code when I try.
If not, I am building a test Arm installation on a Raspberry Pi, to help validate the latest changes to the aplc compiler, and I could recreate the Epiphany code there.

Re: Example: In-line maths calculation, in internal memory

Postby ysapir » Mon Jun 24, 2013 8:55 pm

a.out is just the default name for the executable generated by the gnu compilers, if you don't specify an explicit name.

In your question, do you mean whether you could analyze the executable generated by an ARM/x86 compiler instead of the Epiphany compiler? If so, the answer is no, since the binaries differ between platforms and there is only a limited amount you can learn from that. However, you don't really need the actual hardware to do your test. You can work with the cross-build tools on your PC (esdk.4.13.04.24). e-objdump will work on the executable generated by e-gcc on any platform. e-run is a simulator, so there is no need for the hardware.

Re: Example: In-line maths calculation, in internal memory

Postby Dr.BeauWebber » Mon Jun 24, 2013 10:07 pm

Thanks, that seemed reasonable, but I am not always sure.
So these are the C files as compiled by aplc, without the addition of the I/O buffer and without the translation of fprintf etc. to sprintf.

OK, I attach the tarred files for
1.) e-objdump -D "your_epiphany.elf" > dump (tar and attach)
2.) e-run --trace "your_epiphany.elf" > trace (tar and attach)

Since the run trace for a 10k iteration would be huge, I use a dataset of 10 items :
mathtst_8.apl
Code: Select all
.times / 1 +  10 .rho  0.1

Code: Select all
$ aplcc mathtst_8.apl

$ ./a.exe
 2.593742

Edit .c file to convert double to real :
Code: Select all
$ ./e-aplcc mathtst_8_sngl.c

$ /cygdrive/d/home/jbww/Src/Git/Parallella/INSTALL/bin/e-run ./a.out
 2.593743

$ /cygdrive/d/home/jbww/Src/Git/Parallella/INSTALL/bin/e-objdump -D a.out > dump

$ /cygdrive/d/home/jbww/Src/Git/Parallella/INSTALL/bin/e-run --trace ./a.out >& trace

Well, the traced addresses seem to be outside the local range, e.g. 0x8001b726,
so not what we want.
Attachments
trace.tar.gz
compressed trace file
(330.96 KiB) Downloaded 621 times
dump.tar.gz
compressed dump file
(2.47 MiB) Downloaded 701 times

Re: Example: In-line maths calculation, in internal memory

Postby ysapir » Tue Jun 25, 2013 7:09 am

A quick look at your objdump listing shows that all of your code is being run from external memory. Also note that for the specific code you attached, no LDF was used in compilation, so all addresses are resolved using the default linker LDF, where external memory is at 0x80000000. I would be surprised if you managed to run this executable on real hardware.