Example : Matrix : simple array operations

Postby Dr.BeauWebber » Thu Jun 13, 2013 2:42 am

Matrix performs a series of simple array operations :

Generate vectors A and B;
perform an outer product to create a rectangular array C;
transpose C to obtain T;
perform an inner product (matrix multiply) to obtain O;
solve C against A (matrix divide) to obtain S, which recovers B.
Code: Select all
      A ← ⍳ 5
      A
1 2 3 4 5
      B ← ⍳ 9
      B
1 2 3 4 5 6 7 8 9
      C ← A ∘.× B
      C
1  2  3  4  5  6  7  8  9
2  4  6  8 10 12 14 16 18
3  6  9 12 15 18 21 24 27
4  8 12 16 20 24 28 32 36
5 10 15 20 25 30 35 40 45
      T ← ⍉ C
      T
1  2  3  4  5
2  4  6  8 10
3  6  9 12 15
4  8 12 16 20
5 10 15 20 25
6 12 18 24 30
7 14 21 28 35
8 16 24 32 40
9 18 27 36 45
      O ← C +.× T
      O
 285  570  855 1140 1425
 570 1140 1710 2280 2850
 855 1710 2565 3420 4275
1140 2280 3420 4560 5700
1425 2850 4275 5700 7125
      S ← C ⌹ A
      S
1 2 3 4 5 6 7 8 9
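
For readers without an APL interpreter, the same five steps can be sketched in plain Python. This is only an illustration, not how aplc implements them. The variable names mirror the APL session; everything else is an assumption, in particular that `C ⌹ A` with a vector A gives the least-squares solution x of A x = C, which for a single column A reduces to (A·C)/(A·A) for each column of C.

```python
# Plain-Python sketch of the APL session above (illustrative only).
A = list(range(1, 6))    # A ← ⍳ 5
B = list(range(1, 10))   # B ← ⍳ 9

# Outer product: C[i][j] = A[i] * B[j], a 5 x 9 array.
C = [[a * b for b in B] for a in A]

# Transpose: T is 9 x 5.
T = [list(col) for col in zip(*C)]

# Inner product (matrix multiply): O = C +.× T, a 5 x 5 array.
O = [[sum(x * y for x, y in zip(row, col)) for col in zip(*T)] for row in C]

# Matrix divide, assuming least squares with A as a one-column matrix:
# for each column c of C, x = (A . c) / (A . A).
aa = sum(a * a for a in A)
S = [sum(a * c for a, c in zip(A, col)) / aa for col in zip(*C)]

print(O[0])
print(S)
```

S comes out equal to B, as in the session, because every column of C is a multiple of A.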

Expressed in aplc's ascii notation :
Code: Select all
'A:'
A .is .iota 5
A
' '
'B:'
B .is .iota 9
B
' '
'C: outer product:'
C .is A .jot . .times B
C
' '
'T: transpose:'
T .is .tr C
T
' '
'O: inner product (matrix multiply):'
O .is  C  + . .times T
O
' '
'S: solve matrix:'
S .is C .domino A
S

This runs correctly, but is surprisingly slow, even allowing for the fact that most of this code runs from the library in external DRAM :
Code: Select all
 14: Message from eCore 0x8cb ( 3, 3): "A:
 1 2 3 4 5

B:
 1 2 3 4 5 6 7 8 9

C: outer product:
 1  2  3  4  5  6  7  8  9
 2  4  6  8 10 12 14 16 18
 3  6  9 12 15 18 21 24 27
 4  8 12 16 20 24 28 32 36
 5 10 15 20 25 30 35 40 45

T: transpose:
 1  2  3  4  5
 2  4  6  8 10
 3  6  9 12 15
 4  8 12 16 20
 5 10 15 20 25
 6 12 18 24 30
 7 14 21 28 35
 8 16 24 32 40
 9 18 27 36 45

O: inner product (matrix multiply):
  285  570  855 1140 1425
  570 1140 1710 2280 2850
  855 1710 2565 3420 4275
 1140 2280 3420 4560 5700
 1425 2850 4275 5700 7125

S: solve matrix:
 1 2 3 4 5 6 7 8 9
"

In fact, for the matrix solve to complete, the delay has to be increased beyond 10s. For comparison, on a NIOS II soft processor on a low-grade FPGA, the matrix solve takes only a just-noticeable delay. Perhaps there is severe memory contention ?
It is also possible that the residual precision being asked of the solve is inappropriate, but it has not needed changing for other computer hosts, and should be set in the aplc building stage by configure.

Re: Example : Matrix : simple array operations

Postby ysapir » Thu Jun 13, 2013 4:57 pm

Obviously something is very flawed if you need 10sec to calculate this simple expression. Even running from external mem (that is located on the moon, communicated by a dial-up modem) should not be so slow...

Re: Example : Matrix : simple array operations

Postby Dr.BeauWebber » Fri Jun 14, 2013 1:41 pm

something is very flawed

Yes indeed.
On my lap-top the matrix inversion takes 50ms (dropping to 10ms on repetition).

Using the e-run simulator on my lap-top (i7) the matrix inversion takes about 500ms.
Using the e-run simulator on the Parallella the matrix inversion takes about 3.5s.

I have been trying to run it on my lap-top with gprof, to see where it is spending longest, but keep getting _mcount undefined .... This is usually from compiling / linking something without -pg, but I can not see what I am doing wrong.
I was planning on moving the relevant library routines to internal memory.

Compiling the Epiphany code on the i7 laptop, and using e-gdb on the Parallella : just running takes about 8s.

Compiling the Epiphany code on the Parallella, and using e-gdb on the Parallella : just running takes about the same time, though as stated before, there is no output (using the same c source as in the lap-top case).

Single-stepping through this code appears to indicate that the following library functions are not found :
Code: Select all
srcw/newlib/libc/stdlib/mlock.c: No such file or directory.
srcw/newlib/libc/stdlib/malloc.c: No such file or directory.
srcw/libgcc/libgcc2.c: No such file or directory.
srcw/libgcc/fp-bit.c: No such file or directory.

Ah yes, the sdk source has been removed from this installation.
A lot of the calls are for srcw/libgcc/fp-bit.c .....

As I say, if I compile the modified version with sprintf instead of fprintf, using the output buffer, it does produce output, just extremely slowly, slower than the simulator.
Last edited by Dr.BeauWebber on Fri Jun 14, 2013 5:37 pm, edited 4 times in total.
Reason: adding information

Re: Example : Matrix : simple array operations

Postby Dr.BeauWebber » Sat Jun 15, 2013 4:16 pm

Some more timings for various parts of the calculation, for various systems :

Matrix with some simple integer declarations (S is real) and some more timing :
Separate timings for creating array C and for the matrix inversion :
matrixinvint_jts2.apl
Code: Select all
:decl #int  #vector A, B
:decl #int C
:decl #scalar T0, T1

T0 .is #jts
A .is .iota 5
B .is .iota 9
C .is A .jot . .times B
T1 .is #jts
T1 - T0
S .is  C .domino A
#jts - T1
S
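
The same two-stage timing pattern can be sketched in Python for comparison on any host. This assumes `#jts` returns a timestamp in seconds, as the printed values suggest; `time.perf_counter` stands in for it here, and the solve is the simple least-squares form for a one-column A, not aplc's own routine.

```python
import time

# Stage 1: build A, B and the outer product C (mirrors T1 - T0).
t0 = time.perf_counter()
A = list(range(1, 6))
B = list(range(1, 10))
C = [[a * b for b in B] for a in A]
t1 = time.perf_counter()
setup_seconds = t1 - t0

# Stage 2: the solve alone (mirrors #jts - T1).
# Assumed least-squares form for a one-column A: x = (A . c) / (A . A).
aa = sum(a * a for a in A)
S = [sum(a * c for a, c in zip(A, col)) / aa for col in zip(*C)]
solve_seconds = time.perf_counter() - t1

print(setup_seconds)
print(solve_seconds)
print(S)
```

Timing the two stages separately is what shows that the array setup and the solve scale differently from one system to the next.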

i7, Cygwin :
Code: Select all
$ aplcc matrixinvint_jts2.apl
$ ./a.exe
 0.001000166
 0.00999999
 1 2 3 4 5 6 7 8 9

i7 e-run :
Code: Select all
$ ./e-aplcc matrixinvint_jts2.c

$  ../../INSTALL/bin/e-run ./a.out
 0.0009999275
 0.488028
 1 2 3 4 5 6 7 8 9

The e-run simulated execution on the i7, for just the array definitions, is about the same speed as the native Cygwin timing.
The solve matrix is considerably slower.

Parallella e-run :
Code: Select all
$ e-run  matrixinvint_jts2.out
 0.002027035
 3.475946
 1 2 3 4 5 6 7 8 9

The e-run simulated execution on the Arm, for just the array definitions, is about twice as slow as the i7 e-run timing.
However, the matrix inversion is now taking much longer.

Epiphany :
Code: Select all
 matrixinvint_nojts_ed
2ms : just printing, after defining the variables
20s : successfully prints inverse

So the array definition timings are about the same as in the e-run simulator on the Arm - we could expect these to really speed up if everything were in internal memory rather than external memory.
The matrix inverse timing has extended even more.
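
To put the solve timings side by side, the slowdown relative to the native i7 run can be worked out directly from the figures quoted above. The dictionary labels are only descriptive names chosen for this sketch, and the Epiphany figure is the rounded "20s" from the post.

```python
# Solve timings in seconds, copied from the runs quoted in this post.
solve_seconds = {
    "i7 native (Cygwin)": 0.00999999,
    "i7 e-run": 0.488028,
    "Parallella e-run": 3.475946,
    "Epiphany hardware": 20.0,  # reported only as "20s"
}

baseline = solve_seconds["i7 native (Cygwin)"]
slowdown = {name: t / baseline for name, t in solve_seconds.items()}

for name, factor in slowdown.items():
    print(f"{name:20s} {factor:8.1f}x")
```

On these figures the solve is roughly 49, 348 and 2000 times slower than native for the i7 e-run, Parallella e-run and Epiphany hardware respectively, while the array setup differs only by about a factor of two.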

I am currently working on the hypothesis that the solution of the matrix inverse is asking for more precision than the hardware / library is easily capable of. To be continued.
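
A small sketch of why a precision target could matter, under stated assumptions: it compares the machine epsilon of IEEE-754 single and double precision, using `struct` to round to single. The tolerance value below is purely hypothetical and is not taken from aplc's configuration. If a solver iterated until its residual fell below such a tolerance while working in single precision, it could not get there; and if it works in double precision on hardware without double support, every operation goes through software emulation such as libgcc's fp-bit.c.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest IEEE-754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]

def machine_epsilon(round_fn):
    """Smallest power of two eps such that 1 + eps is still distinct from 1."""
    eps = 1.0
    while round_fn(1.0 + round_fn(eps / 2.0)) != 1.0:
        eps = round_fn(eps / 2.0)
    return eps

eps_double = machine_epsilon(float)      # double precision
eps_single = machine_epsilon(to_single)  # single precision

# Hypothetical residual tolerance, chosen only for illustration.
hypothetical_tolerance = 1e-12

print("double epsilon:", eps_double)
print("single epsilon:", eps_single)
print("reachable in single:", hypothetical_tolerance >= eps_single)
print("reachable in double:", hypothetical_tolerance >= eps_double)
```

This does not diagnose the slowdown; it only illustrates the kind of mismatch the hypothesis refers to.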