GEMM with OpenCL

Generic algorithms that could be implemented in almost any language, e.g. matrix operations and FFT.

GEMM with OpenCL

Postby evalero » Fri Dec 11, 2015 1:01 am

I would like to implement a simple GEMM (single and double precision) as a benchmark on the Parallella. In order to be portable, the implementation must use OpenCL. This is my first experience with the Parallella and the initial results are very frustrating. Any ideas for achieving better results?

Here are the sources:

https://www.dropbox.com/s/1sv8a4lu4uio1 ... ar.gz?dl=0
https://www.dropbox.com/s/vfc04w1gz6t3l ... ar.gz?dl=0

Both were built with:

cc -I/usr/local/browndeer_new/include -o gemm_OpenCL gemm_OpenCL.c -L/usr/local/browndeer_new/lib -lcoprthr_opencl -O3 -lm -fopenmp -std=c99
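
For reference, a naive single-precision kernel of the kind being benchmarked might look something like the sketch below (illustrative only; the actual sources are in the tarballs above, and the kernel name and work-item mapping here are just assumptions):

    /* Naive GEMM: C = alpha*A*B + beta*C, square N x N, row-major.
       One work-item computes one element of C.                      */
    __kernel void sgemm_naive(const unsigned int N,
                              const float alpha,
                              __global const float *A,
                              __global const float *B,
                              const float beta,
                              __global float *C)
    {
        const unsigned int row = get_global_id(1);
        const unsigned int col = get_global_id(0);
        if (row >= N || col >= N)
            return;

        float acc = 0.0f;
        for (unsigned int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];

        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }

Every work-item here streams a whole row of A and column of B from off-chip memory, which is the access pattern that hurts most on the Parallella.
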
evalero
 
Posts: 5
Joined: Tue Nov 24, 2015 4:45 pm

Re: GEMM with OpenCL

Postby jar » Fri Dec 11, 2015 4:50 am

Double precision is a non-starter with the current Epiphany-III ISA. There are no double precision instructions so they are emulated in software, which is quite slow.

You might also look at these code resources which get reasonable performance for matrix multiply:
http://www.adapteva.com/white-papers/ef ... ng-opencl/
https://github.com/USArmyResearchLab/mp ... ter/cannon
https://github.com/adapteva/epiphany-ex ... /matmul-16
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: GEMM with OpenCL

Postby evalero » Fri Dec 11, 2015 12:55 pm

Hi Jar, thanks.

I had some doubts about the double precision implementation, but your answer has clarified a lot of them.

The matmul-16 example is working and its performance is fantastic, but the implementation is very obscure to me, and I would really like to know, for example, how to work with bigger matrices. Can someone explain it better, please? On the other hand, the eSDK is not a portable solution like OpenCL (;-)
evalero
 
Posts: 5
Joined: Tue Nov 24, 2015 4:45 pm

Re: GEMM with OpenCL

Postby jar » Fri Dec 11, 2015 11:49 pm

You can get portability or performance with the Epiphany; there is very little middle ground. The OpenCL C language lacks the semantics to achieve high performance on Epiphany. For high performance, one must introduce non-standard extensions, so your host code may be portable, but your high-performance device kernel code will be full of special things that will not work on other architectures. In particular, you must have an effective method to pass data between the cores. The off-chip bandwidth of the Epiphany in the Parallella is poor, but once data is on chip and can be shared between cores, performance is much better.
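
To give a flavour of what those "special things" look like: on Epiphany, each core's local SRAM is also visible at a global address, so device code can write straight into a neighbouring core's buffer. A rough eSDK-style sketch (buffer name, mesh size and neighbour choice are all just assumptions here, not code from any of the linked projects):

    #include <e_lib.h>

    #define BUF_WORDS 1024
    float outbox[BUF_WORDS];           /* sits in this core's local SRAM */

    /* Copy n floats into the same 'outbox' buffer on the core to our right.
       Assumes every core runs the same program image, so 'outbox' has the
       same local address on every core.                                   */
    void send_to_right_neighbour(const float *src, unsigned n)
    {
        unsigned row, col;
        e_coords_from_coreid(e_get_coreid(), &row, &col);

        unsigned nrow = row;
        unsigned ncol = (col + 1) % 4;             /* wrap on a 4x4 mesh */

        float *remote = (float *)e_get_global_address(nrow, ncol, outbox);
        for (unsigned i = 0; i < n && i < BUF_WORDS; i++)
            remote[i] = src[i];                    /* stays on chip */
    }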

The second link I provided uses a combination of the COPRTHR threaded MPI implementation and the eSDK (for off-chip DMAs). MPI is a standard message passing interface with legacy codes dating back to the early 1990s, used on networked clusters. It seems to be a reasonable method for message passing between Epiphany cores, but I wrote that device code, so I may be biased. It does allow you to calculate on larger matrices than the 32 KB of core-local memory would otherwise permit. Check out the code.
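
Roughly, the device-side structure is Cannon's algorithm: each core holds one tile of A, B and C, multiplies its local tiles, then circulates A along rows and B along columns of the core mesh. In standard-MPI terms the loop looks something like the simplified sketch below; the tile names are illustrative and the repository above is the authoritative version:

    #include <mpi.h>

    /* Cannon's algorithm on a P x P core mesh: each core owns one
       BLK x BLK tile of A, B and C. MPI_Init and host setup omitted. */
    void cannon_tile_mm(float *a, float *b, float *c, int BLK, int P)
    {
        int rank, coords[2], src, dst;
        int dims[2]    = {P, P};
        int periods[2] = {1, 1};                  /* wrap around the mesh */
        MPI_Comm grid;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
        MPI_Cart_coords(grid, rank, 2, coords);

        /* initial skew: row i of A shifted left by i, column j of B up by j */
        MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
        MPI_Sendrecv_replace(a, BLK*BLK, MPI_FLOAT, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
        MPI_Sendrecv_replace(b, BLK*BLK, MPI_FLOAT, dst, 1, src, 1,
                             grid, MPI_STATUS_IGNORE);

        int left, right, up, down;
        MPI_Cart_shift(grid, 1, -1, &right, &left);
        MPI_Cart_shift(grid, 0, -1, &down,  &up);

        for (int step = 0; step < P; step++) {
            for (int i = 0; i < BLK; i++)         /* local tile product */
                for (int j = 0; j < BLK; j++)
                    for (int k = 0; k < BLK; k++)
                        c[i*BLK + j] += a[i*BLK + k] * b[k*BLK + j];

            MPI_Sendrecv_replace(a, BLK*BLK, MPI_FLOAT, left, 0, right, 0,
                                 grid, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(b, BLK*BLK, MPI_FLOAT, up, 1, down, 1,
                                 grid, MPI_STATUS_IGNORE);
        }
    }
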
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: GEMM with OpenCL

Postby dobkeratops » Sat Dec 12, 2015 6:58 am

jar wrote: In particular, you must have an effective method to pass data between the cores. The off-chip bandwidth of the Epiphany in the Parallella is poor, but once data is on chip and can be shared between cores, performance is much better.


Sorry to repeat myself; this is for the benefit of the OP, who appears to be interested in OpenCL + Epiphany and who might not have seen my thread.

OpenCL on the Epiphany maps the scratchpad to '__private', which means no inter-core sharing, as jar identifies, and hence does not leverage the architecture.

So, could we not split it? E.g. say that half the scratchpad is __private, some of what remains is '__local' (sharable between workgroup members), and some is even __global (persistent and accessible to all workgroups). Call a rectangular group of adjacent cores (e.g. 2x2, 4x4) a workgroup, and give each member a pointer to the top-left core so that it addresses that group's local memory correctly.

Am I missing anything here? (is there some reason it wasn't done this way; have I got the semantics right..)

You could give a workgroup size that is the whole chip, if you wanted whole-chip sharing whilst keeping '__global' to mean off-chip memory.

OpenCL describes a sliding scale between locality and visibility; Epiphany hardware *does* have a sliding scale between locality and latency/network bandwidth, so I believe you should be able to use OpenCL's model to reason about that. Epiphany can read/write anywhere, but if you can constrain reads/writes to a 2x2 group, there is a benefit.
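
To make that concrete: if part of the scratchpad were exposed as __local across such a workgroup, a perfectly ordinary tiled OpenCL kernel would already express the on-chip sharing, with no vendor extensions. Standard OpenCL C below; how (or whether) this maps onto the Epiphany implementation is exactly the open question:

    #define TILE 16   /* assumes N is a multiple of TILE */

    __kernel void sgemm_tiled(const unsigned int N,
                              __global const float *A,
                              __global const float *B,
                              __global float *C)
    {
        /* each work-group cooperatively stages TILE x TILE blocks in
           __local memory -- on Epiphany that staging would stay on chip */
        __local float Asub[TILE][TILE];
        __local float Bsub[TILE][TILE];

        const unsigned int lr  = get_local_id(1),  lc  = get_local_id(0);
        const unsigned int row = get_global_id(1), col = get_global_id(0);

        float acc = 0.0f;
        for (unsigned int t = 0; t < N; t += TILE) {
            Asub[lr][lc] = A[row * N + (t + lc)];
            Bsub[lr][lc] = B[(t + lr) * N + col];
            barrier(CLK_LOCAL_MEM_FENCE);

            for (unsigned int k = 0; k < TILE; ++k)
                acc += Asub[lr][k] * Bsub[k][lc];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[row * N + col] = acc;
    }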

jar wrote: The second link I provided uses a combination of the COPRTHR threaded MPI implementation and the eSDK (for off-chip DMAs). MPI is a standard message passing interface with legacy codes dating back to the early 1990s, used on networked clusters. It seems to be a reasonable method for message passing between Epiphany cores, but I wrote that device code, so I may be biased. It does allow you to calculate on larger matrices than the 32 KB of core-local memory would otherwise permit. Check out the code.


I fully agree MPI seems to be a more natural fit.

OpenCL for Epiphany would just need an extremely complex compiler (e.g. shape analysis, substituting higher-order functions for recognised patterns of array indexing, which would in turn handle the DMA).

Maybe such work could be done in a way that translates to other hardware, making it useful well beyond the Epiphany (e.g. such shape-analysis substitution could have other uses).

OpenCL runs on widespread GPUs, but we also have clusters of boxes with GPUs, 2-4-way SLI, Facebook has an AI server with 8 cards in one box, there's AMD HSA which allows close collaboration between CPU and GPU, etc. OpenCL has the useful property of defining non-overlapping kernel invocations and requiring explicit barriers for inter-kernel synchronization, so there is more useful parallelism information to go on than in, say, traditional C++ programs.

Portability + performance IS difficult, but we have better tools today compared to 10 years ago (Clang, LLVM).
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

