my first program and many questions
Posted: Wed Jul 08, 2015 2:50 pm
I have created my first Parallella program, an adaptation of an MPI program named flop. The purpose of flop is to calculate an approximate value for Pi.
I ported the MPI flop program to the Parallella using the ESDK, implementing similar logic: I extracted the computation from the main program and set it up to run on the Epiphany cores.
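For context, the numerical method is the usual midpoint-rule integration of 4/(1+x^2) over [0,1] that the standard MPI pi examples use. A minimal sketch of the work each core ends up doing (the function name partial_pi is just for illustration; the real kernel also has to read N from shared memory and write its partial sum back for the host to combine) looks like this:

/* Sketch of the per-core work: core `coreid` of `ncores` handles
 * every ncores-th interval of the midpoint rule for the integral
 * of 4/(1+x*x) on [0,1].  (Hypothetical name, for illustration.) */
double partial_pi(int n, int coreid, int ncores)
{
    double h   = 1.0 / (double)n;   /* interval width */
    double sum = 0.0;
    double x;
    int i;

    for (i = coreid + 1; i <= n; i += ncores) {
        x = h * ((double)i - 0.5);  /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x); /* f(x) = 4/(1+x^2)       */
    }
    return h * sum;                 /* this core's share of pi */
}

The host side then only needs to sum the 16 partial results to get the final approximation.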
The code, including the MPI version of flop, is available in my GitHub repository: http://github.com/njpacoma/e-flop.git. The Parallella code includes a build.sh and a run.sh.
For the MPI version, I am using MPICH3 on a cluster of 7 RPi Model B systems running at their standard clock rate. I am compiling and running the program with no special optimizations (mpicc flop.c -o flop).
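The MPI version follows the familiar broadcast / partial-sum / reduce layout. A minimal sketch of that structure (not necessarily line-for-line what is in flop.c, but the pattern I started from) is:

/* Minimal MPI pi sketch -- same structure as flop.c, not the exact code.
 * Build with something like: mpicc pi_sketch.c -o pi_sketch -lm          */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int n = 1000000;                 /* number of intervals, N */
    int rank, size, i;
    double h, x, sum, mypi, pi = 0.0, t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* every rank gets N */
    t0 = MPI_Wtime();

    h = 1.0 / (double)n;
    sum = 0.0;
    for (i = rank + 1; i <= n; i += size) {         /* strided interval split */
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - M_PI));
        printf("wall clock time = %f\n", MPI_Wtime() - t0);
    }
    MPI_Finalize();
    return 0;
}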
Now for my questions:
- is this a good use of the Parallella architecture?
- have I implemented the Parallella version correctly? -- I looked at many Parallella and Epiphany examples, as well as code examples on the web, to figure out how to duplicate the MPI logic
- is there a better way to implement this? -- aside from my recent exploration of MPI, the last time I did anything with parallel computing was in the early 1980s
- when N is set to 1 billion, the Parallella takes a very long time (436+ seconds), while my 7-system RPi cluster does the same computation in 24+ seconds -- this is partly why I am asking the other three questions above
Lastly, I wonder if I am reporting the performance correctly.
Here is a run where N is set to 1,000,000 and the computation is spread across all 16 cores (assuming I implemented the program correctly):
parallella@parallella:~/workarea/flop$ ./run.sh
number of intervals is 1000000
number of cores 16
pi is approximately 3.1415926535898797, Error is 0.0000000000000866
wall clock time = 0.480555
estimated MFLOPS = 12.485564
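For what it's worth, that MFLOPS figure at least looks internally consistent with counting about 6 floating-point operations per interval: 6 x 1,000,000 ops / (0.480555 s x 10^6) ≈ 12.49 MFLOPS, which matches the printed value.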
The reason I question my results is that the MPI version running on a single RPi Model B gives the following results:
swoyer@RPi-1 ~/mpich3/code/flop $ mpirun -n 1 -f /home/tswoyer/machineFile.rpi ./flop
Process 0 of 1 on RPi-1
number of intervals is 1000000
pi is approximately 3.1415926535897643, Error is 0.0000000000000289
wall clock time = 0.171937
Estimated MFLOPs = 34.896505
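The same check holds here: 34.896505 MFLOPs x 0.171937 s ≈ 6.0 million operations for N = 1,000,000, so both versions appear to count operations the same way; the difference is purely in wall clock time.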
If I raise the number of participating RPi systems, the results improve:
swoyer@RPi-1 ~/mpich3/code/flop $ mpirun -n 4 -f /home/tswoyer/machineFile.rpi ./flop
Process 0 of 4 on RPi-1
number of intervals is 1000000
Process 3 of 4 on RPi-4
Process 2 of 4 on RPi-3
Process 1 of 4 on RPi-2
pi is approximately 3.1415926535899033, Error is 0.0000000000001101
wall clock time = 0.065483
Estimated MFLOPs = 91.626704
tswoyer@RPi-1 ~/mpich3/code/flop $ mpirun -n 7 -f /home/tswoyer/machineFile.rpi ./flop
Process 0 of 7 on RPi-1
number of intervals is 1000000
Process 5 of 7 on RPi-Y
Process 4 of 7 on RPi-5
Process 3 of 7 on RPi-4
Process 1 of 7 on RPi-2
Process 2 of 7 on RPi-3
Process 6 of 7 on RPi-Z
pi is approximately 3.1415926535899055, Error is 0.0000000000001124
wall clock time = 0.055693
Estimated MFLOPs = 107.733178
Any feedback, suggestions, comments, corrections, discussion, etc. would be very welcome.