Hello Parallella Community,
I am currently developing on the Parallella board for a student project. The task is quite simple: there is an update function that is called for a large number of elements in every program loop, and the updates are independent of each other, so they can be performed completely in parallel.
I therefore programmed the following sequence:
Host: start -> init -> (*) -> send signal to cores -> process results -> init core memory -> goto (*)
Cores: start -> init -> (*) -> wait for signal -> process N elements -> goto (*)
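To make the sequence concrete, here is a minimal host-only sketch of the handshake. It is not the attached code: a POSIX thread stands in for one Epiphany core, a struct of C11 atomic flags stands in for the shared memory, and the names mailbox_t, worker_main, run_host_loops and N_ELEMENTS as well as the "+ 1.0f" update are hypothetical placeholders.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define N_ELEMENTS 8   /* hypothetical; the real tests use 16384 elements */

/* Hypothetical mailbox standing in for memory shared by host and core. */
typedef struct {
    atomic_int go;     /* host sets to 1: start processing */
    atomic_int done;   /* worker sets to 1: results are ready */
    atomic_int quit;   /* host sets to 1: leave the loop */
    float in[N_ELEMENTS];
    float out[N_ELEMENTS];
} mailbox_t;

/* Core side: (*) -> wait for signal -> process N elements -> goto (*) */
static void *worker_main(void *arg)
{
    mailbox_t *mb = arg;
    for (;;) {
        while (!atomic_load(&mb->go)) {          /* wait for signal */
            if (atomic_load(&mb->quit))
                return NULL;
            sched_yield();
        }
        atomic_store(&mb->go, 0);
        for (size_t i = 0; i < N_ELEMENTS; ++i)  /* process N elements */
            mb->out[i] = mb->in[i] + 1.0f;       /* placeholder update */
        atomic_store(&mb->done, 1);
    }
}

/* Host side: (*) -> send signal -> process results -> init core memory -> goto (*)
 * Returns the sum of all results seen, or -1.0f if the thread cannot start. */
static float run_host_loops(int loops)
{
    mailbox_t mb;
    pthread_t worker;
    float checksum = 0.0f;

    atomic_init(&mb.go, 0);
    atomic_init(&mb.done, 0);
    atomic_init(&mb.quit, 0);
    for (size_t i = 0; i < N_ELEMENTS; ++i)
        mb.in[i] = mb.out[i] = 0.0f;

    if (pthread_create(&worker, NULL, worker_main, &mb) != 0)
        return -1.0f;

    for (int l = 0; l < loops; ++l) {
        atomic_store(&mb.go, 1);                 /* send signal to core */
        while (!atomic_load(&mb.done))           /* wait until it has finished */
            sched_yield();
        atomic_store(&mb.done, 0);
        for (size_t i = 0; i < N_ELEMENTS; ++i)
            checksum += mb.out[i];               /* process results */
        for (size_t i = 0; i < N_ELEMENTS; ++i)
            mb.in[i] = mb.out[i];                /* init core memory for next pass */
    }

    atomic_store(&mb.quit, 1);
    pthread_join(worker, NULL);
    return checksum;
}
```

On a desktop this should build with something like gcc -pthread. On the real board the flags and buffers live in core or shared memory and the host reaches them through the Epiphany SDK, which this sketch deliberately leaves out.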
The software runs quite smoothly, but I am having trouble with the run times.
Here is an example output of my timing comparison procedure:
#; Starting host application ...
HOST simpleMath1 :: number of elements: 16384 loops: 8 measured time: 1400us
DEV worker1 simpleMath1 :: number of elements: 16384 loops: 8 measured time: 350us
HOST simpleMath2 :: number of elements: 16384 loops: 8 measured time: 11262us
HOST simpleMath3 :: number of elements: 16384 loops: 8 measured time: 10948us
DEV worker2 simpleMath2 :: number of elements: 16384 loops: 8 measured time: 5643153us
DEV worker3 simpleMath3 :: number of elements: 16384 loops: 8 measured time: 5645088us
DEV worker4 simpleMath3 :: number of elements: 16384 loops: 8 measured time: 4825429us
DEV worker4o simpleMath3 :: number of elements: 16384 loops: 8 measured time: 3800140us
#; Cleanup and finish application!
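For reference, a host-side measurement of this kind can be sketched as below. This is a simplified illustration and not the attached code: the helper names now_us, add_elements and time_add are hypothetical, the update is a plain element-wise float addition in the spirit of simpleMath1, and I am assuming a POSIX system that provides clock_gettime with CLOCK_MONOTONIC.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime under strict -std=c11 */
#endif

#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Current reading of the monotonic clock, in microseconds. */
static int64_t now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Placeholder update: add two float arrays element by element. */
static void add_elements(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Run `loops` passes over `n` elements and return the elapsed microseconds. */
static int64_t time_add(const float *a, const float *b, float *out,
                        size_t n, int loops)
{
    int64_t start = now_us();
    for (int l = 0; l < loops; ++l)
        add_elements(a, b, out, n);
    return now_us() - start;
}
```

Note that a device-side number measured from the host like this also includes the signalling and any data transfer, not only the time the cores spend computing.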
simpleMath1 just adds two floating-point numbers, and it runs much faster on the Epiphany (350us versus 1400us on the host).
simpleMath2 and simpleMath3 perform mathematically the same calculation; the functions differ only in how the parameters are passed. worker4 follows a different strategy, and worker4o additionally has compiler optimizations enabled. Still, the HOST version is far faster than its DEVICE counterpart: about 11 ms on the host versus 3.8 s to 5.6 s on the device.
Since I am already accessing the memory directly through pointers in this example, I do not know how to improve the run time any further. Any help or tips on how to do this, or on how to avoid the problem, are welcome!
For further investigation I have attached the source code of the example above. simpleMath.h contains the mathematical functions used. e_worker[1,2,3,4].c define the different ways of managing the cores' memory. main.c simply executes the tests mentioned above one after another. The tests are separated by "////..///".
Thanks!