my first program and many questions
Posted: Wed Jul 08, 2015 2:50 pm
I have created my first Parallella program, an adaptation of an MPI program named flop. The purpose of flop is to calculate an approximate value for Pi.
I ported the MPI flop program to the Parallella using the ESDK, implementing similar logic: I extracted the computation from the main program and set it up to run on the Epiphany cores.
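For context, the numerical method is the usual midpoint-rule integration of 4/(1+x^2) over [0,1] that the standard MPI pi examples use. A minimal sketch of the work each core ends up doing (the function name partial_pi is just for illustration; the real kernel also has to read N from shared memory and write its partial sum back for the host to combine) looks like this:

/* Sketch of the per-core work: core `coreid` of `ncores` handles
 * every ncores-th interval of the midpoint rule for the integral
 * of 4/(1+x*x) on [0,1].  (Hypothetical name, for illustration.) */
double partial_pi(int n, int coreid, int ncores)
{
    double h   = 1.0 / (double)n;   /* interval width */
    double sum = 0.0;
    double x;
    int i;

    for (i = coreid + 1; i <= n; i += ncores) {
        x = h * ((double)i - 0.5);  /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x); /* f(x) = 4/(1+x^2)       */
    }
    return h * sum;                 /* this core's share of pi */
}

The host side then only needs to sum the 16 partial results to get the final approximation.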
The code, including the MPI version of flop, is available in my GitHub repository: http://github.com/njpacoma/e-flop.git. The Parallella code includes a build.sh and a run.sh.
For the MPI version, I am using MPICH3 on a cluster of 7 RPi Model B systems running at their standard clock rate. I am compiling and running the program with no special optimizations (mpicc flop.c -o flop).
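The MPI version follows the familiar broadcast / partial-sum / reduce layout. A minimal sketch of that structure (not necessarily line-for-line what is in flop.c, but the pattern I started from) is:

/* Minimal MPI pi sketch -- same structure as flop.c, not the exact code.
 * Build with something like: mpicc pi_sketch.c -o pi_sketch -lm          */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int n = 1000000;                 /* number of intervals, N */
    int rank, size, i;
    double h, x, sum, mypi, pi = 0.0, t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* every rank gets N */
    t0 = MPI_Wtime();

    h = 1.0 / (double)n;
    sum = 0.0;
    for (i = rank + 1; i <= n; i += size) {         /* strided interval split */
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - M_PI));
        printf("wall clock time = %f\n", MPI_Wtime() - t0);
    }
    MPI_Finalize();
    return 0;
}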
Now for my questions:
- is this a good use of the Parallella architecture?
- have I implemented the Parallella version correctly? -- I looked at many Parallella and Epiphany examples, as well as code examples on the web, to figure out how to duplicate the MPI logic
- is there a better way to implement this? -- aside from my recent exploration of MPI, the last time I did anything with parallel computing was in the early 1980s
- when N is set to 1 billion, the Parallella takes a very long time (436+ seconds), while my 7-system RPi cluster does the same computation in 24+ seconds -- this is partly why I am asking the other three questions above
Lastly, I wonder if I am reporting the performance correctly.
Here is a run where N is set to 1,000,000 and the computation is spread across all 16 cores (assuming I implemented the program correctly):
parallella@parallella:~/workarea/flop$ ./run.sh
number of intervals is 1000000
number of cores 16
pi is approximately 3.1415926535898797, Error is 0.0000000000000866
wall clock time = 0.480555
estimated MFLOPS = 12.485564
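For what it's worth, that MFLOPS figure at least looks internally consistent with counting about 6 floating-point operations per interval: 6 x 1,000,000 ops / (0.480555 s x 10^6) ≈ 12.49 MFLOPS, which matches the printed value.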
The reason I question my results is that the MPI version running on a single RPi Model B gives the following results:
swoyer@RPi-1 ~/mpich3/code/flop $ mpirun -n 1 -f /home/tswoyer/machineFile.rpi ./flop
Process 0 of 1 on RPi-1
number of intervals is 1000000
pi is approximately 3.1415926535897643, Error is 0.0000000000000289
wall clock time = 0.171937
Estimated MFLOPs = 34.896505
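The same check holds here: 34.896505 MFLOPs x 0.171937 s ≈ 6.0 million operations for N = 1,000,000, so both versions appear to count operations the same way; the difference is purely in wall clock time.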
If I raise the number of participating RPi systems, the results improve:
swoyer@RPi-1 ~/mpich3/code/flop $ mpirun -n 4 -f /home/tswoyer/machineFile.rpi ./flop
Process 0 of 4 on RPi-1
number of intervals is 1000000
Process 3 of 4 on RPi-4
Process 2 of 4 on RPi-3
Process 1 of 4 on RPi-2
pi is approximately 3.1415926535899033, Error is 0.0000000000001101
wall clock time = 0.065483
Estimated MFLOPs = 91.626704
tswoyer@RPi-1 ~/mpich3/code/flop $ mpirun -n 7 -f /home/tswoyer/machineFile.rpi ./flop
Process 0 of 7 on RPi-1
number of intervals is 1000000
Process 5 of 7 on RPi-Y
Process 4 of 7 on RPi-5
Process 3 of 7 on RPi-4
Process 1 of 7 on RPi-2
Process 2 of 7 on RPi-3
Process 6 of 7 on RPi-Z
pi is approximately 3.1415926535899055, Error is 0.0000000000001124
wall clock time = 0.055693
Estimated MFLOPs = 107.733178
Any feedback, suggestions, comments, corrections, discussion, etc. would be very welcome.