GEMM with OpenCL

Generic algorithms that could be implemented in almost any language, e.g. matrix operations and FFT.

GEMM with OpenCL

Postby evalero » Fri Dec 11, 2015 1:01 am

I would like to implement a simple GEMM (single and double precision) as a benchmark on the Parallella. In order to be portable, the implementation must use OpenCL. This is my first experience with the Parallella and the initial results are very frustrating. Any ideas for achieving better results?

Here are the sources:

https://www.dropbox.com/s/1sv8a4lu4uio1 ... ar.gz?dl=0
https://www.dropbox.com/s/vfc04w1gz6t3l ... ar.gz?dl=0

Both were built with:

cc -I/usr/local/browndeer_new/include -o gemm_OpenCL gemm_OpenCL.c -L/usr/local/browndeer_new/lib -lcoprthr_opencl -O3 -lm -fopenmp -std=c99
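
For reference, a naive single-precision kernel of the kind being benchmarked might look something like the sketch below (illustrative only; the actual sources are in the tarballs above, and the kernel name and work-item mapping here are just assumptions):

    /* Naive GEMM: C = alpha*A*B + beta*C, square N x N, row-major.
       One work-item computes one element of C.                      */
    __kernel void sgemm_naive(const unsigned int N,
                              const float alpha,
                              __global const float *A,
                              __global const float *B,
                              const float beta,
                              __global float *C)
    {
        const unsigned int row = get_global_id(1);
        const unsigned int col = get_global_id(0);
        if (row >= N || col >= N)
            return;

        float acc = 0.0f;
        for (unsigned int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];

        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }

Every work-item here streams a whole row of A and column of B from off-chip memory, which is the access pattern that hurts most on the Parallella.
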
evalero
 
Posts: 5
Joined: Tue Nov 24, 2015 4:45 pm

Re: GEMM with OpenCL

Postby jar » Fri Dec 11, 2015 4:50 am

Double precision is a non-starter with the current Epiphany-III ISA. There are no double precision instructions so they are emulated in software, which is quite slow.

You might also look at these code resources which get reasonable performance for matrix multiply:
http://www.adapteva.com/white-papers/ef ... ng-opencl/
https://github.com/USArmyResearchLab/mp ... ter/cannon
https://github.com/adapteva/epiphany-ex ... /matmul-16
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: GEMM with OpenCL

Postby evalero » Fri Dec 11, 2015 12:55 pm

Hi Jar, thanks.

I had some doubts about the double precision implementation, but your answer has clarified a lot of them.

The matmul-16 example is working and its performance is fantastic, but the implementation is very obscure to me, and I would really like to know, for example, how to work with bigger matrices. Can someone explain it better, please? On the other hand, the eSDK is not a portable solution like OpenCL (;-)
evalero
 
Posts: 5
Joined: Tue Nov 24, 2015 4:45 pm

Re: GEMM with OpenCL

Postby jar » Fri Dec 11, 2015 11:49 pm

You can get portability or performance with the Epiphany; there is very little middle ground. The OpenCL C language lacks the semantics to achieve high performance on Epiphany. For high performance, one must introduce non-standard extensions, so your host code may be portable, but your high-performance device kernel code will be full of special things that will not work on other architectures. In particular, you must have an effective method to pass data between the cores. The off-chip bandwidth of the Epiphany in the Parallella is poor, but once data is on chip and can be shared between cores, performance is much better.
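
To give a flavour of what those "special things" look like: on Epiphany, each core's local SRAM is also visible at a global address, so device code can write straight into a neighbouring core's buffer. A rough eSDK-style sketch (buffer name, mesh size and neighbour choice are all just assumptions here, not code from any of the linked projects):

    #include <e_lib.h>

    #define BUF_WORDS 1024
    float outbox[BUF_WORDS];           /* sits in this core's local SRAM */

    /* Copy n floats into the same 'outbox' buffer on the core to our right.
       Assumes every core runs the same program image, so 'outbox' has the
       same local address on every core.                                   */
    void send_to_right_neighbour(const float *src, unsigned n)
    {
        unsigned row, col;
        e_coords_from_coreid(e_get_coreid(), &row, &col);

        unsigned nrow = row;
        unsigned ncol = (col + 1) % 4;             /* wrap on a 4x4 mesh */

        float *remote = (float *)e_get_global_address(nrow, ncol, outbox);
        for (unsigned i = 0; i < n && i < BUF_WORDS; i++)
            remote[i] = src[i];                    /* stays on chip */
    }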

The second link I provided uses a combination of the COPRTHR threaded MPI implementation and the eSDK (for off-chip DMAs). MPI is a standard message passing interface with legacy codes dating back to the early 1990s, used on networked clusters. It seems to be a reasonable method for message passing between Epiphany cores, but I wrote that device code, so I may be biased. It does allow you to calculate on larger matrices than the 32 KB of core-local memory would otherwise permit. Check out the code.
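
Roughly, the device-side structure is Cannon's algorithm: each core holds one tile of A, B and C, multiplies its local tiles, then circulates A along rows and B along columns of the core mesh. In standard-MPI terms the loop looks something like the simplified sketch below; the tile names are illustrative and the repository above is the authoritative version:

    #include <mpi.h>

    /* Cannon's algorithm on a P x P core mesh: each core owns one
       BLK x BLK tile of A, B and C. MPI_Init and host setup omitted. */
    void cannon_tile_mm(float *a, float *b, float *c, int BLK, int P)
    {
        int rank, coords[2], src, dst;
        int dims[2]    = {P, P};
        int periods[2] = {1, 1};                  /* wrap around the mesh */
        MPI_Comm grid;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
        MPI_Cart_coords(grid, rank, 2, coords);

        /* initial skew: row i of A shifted left by i, column j of B up by j */
        MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
        MPI_Sendrecv_replace(a, BLK*BLK, MPI_FLOAT, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
        MPI_Sendrecv_replace(b, BLK*BLK, MPI_FLOAT, dst, 1, src, 1,
                             grid, MPI_STATUS_IGNORE);

        int left, right, up, down;
        MPI_Cart_shift(grid, 1, -1, &right, &left);
        MPI_Cart_shift(grid, 0, -1, &down,  &up);

        for (int step = 0; step < P; step++) {
            for (int i = 0; i < BLK; i++)         /* local tile product */
                for (int j = 0; j < BLK; j++)
                    for (int k = 0; k < BLK; k++)
                        c[i*BLK + j] += a[i*BLK + k] * b[k*BLK + j];

            MPI_Sendrecv_replace(a, BLK*BLK, MPI_FLOAT, left, 0, right, 0,
                                 grid, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(b, BLK*BLK, MPI_FLOAT, up, 1, down, 1,
                                 grid, MPI_STATUS_IGNORE);
        }
    }
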
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: GEMM with OpenCL

Postby dobkeratops » Sat Dec 12, 2015 6:58 am

jar wrote: In particular, you must have an effective method to pass data between the cores. The off-chip bandwidth of the Epiphany in the Parallella is poor, but once data is on chip and can be shared between cores, performance is much better.


Sorry to repeat myself; this is for the benefit of the OP, who appears to be interested in OpenCL + Epiphany and who might not have seen my thread.

OpenCL on the Epiphany maps the scratchpad to '__private', which means no inter-core sharing, as jar identifies, and hence does not leverage the architecture.

So, could we not split it? E.g. say that half the scratchpad is __private, some of what remains is '__local' (sharable between workgroup members), and some is even __global (persistent and accessible to all workgroups). Call a rectangular group of adjacent cores (e.g. 2x2, 4x4) a workgroup, and give each member a pointer to the top-left core so that it addresses that group's local memory correctly.

Am I missing anything here? (is there some reason it wasn't done this way; have I got the semantics right..)

You could give a workgroup size that is the whole chip, if you wanted whole-chip sharing whilst keeping '__global' to mean off-chip memory.

OpenCL describes a sliding scale between locality and visibility; Epiphany hardware *does* have a sliding scale between locality and latency/network bandwidth, so I believe you should be able to use OpenCL's model to reason about that. Epiphany can read/write anywhere, but if you can constrain reads/writes to a 2x2 group, there is a benefit.
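
To make that concrete: if part of the scratchpad were exposed as __local across such a workgroup, a perfectly ordinary tiled OpenCL kernel would already express the on-chip sharing, with no vendor extensions. Standard OpenCL C below; how (or whether) this maps onto the Epiphany implementation is exactly the open question:

    #define TILE 16   /* assumes N is a multiple of TILE */

    __kernel void sgemm_tiled(const unsigned int N,
                              __global const float *A,
                              __global const float *B,
                              __global float *C)
    {
        /* each work-group cooperatively stages TILE x TILE blocks in
           __local memory -- on Epiphany that staging would stay on chip */
        __local float Asub[TILE][TILE];
        __local float Bsub[TILE][TILE];

        const unsigned int lr  = get_local_id(1),  lc  = get_local_id(0);
        const unsigned int row = get_global_id(1), col = get_global_id(0);

        float acc = 0.0f;
        for (unsigned int t = 0; t < N; t += TILE) {
            Asub[lr][lc] = A[row * N + (t + lc)];
            Bsub[lr][lc] = B[(t + lr) * N + col];
            barrier(CLK_LOCAL_MEM_FENCE);

            for (unsigned int k = 0; k < TILE; ++k)
                acc += Asub[lr][k] * Bsub[k][lc];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[row * N + col] = acc;
    }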

jar wrote: The second link I provided uses a combination of the COPRTHR threaded MPI implementation and the eSDK (for off-chip DMAs). MPI is a standard message passing interface with legacy codes dating back to the early 1990s, used on networked clusters. It seems to be a reasonable method for message passing between Epiphany cores, but I wrote that device code, so I may be biased. It does allow you to calculate on larger matrices than the 32 KB of core-local memory would otherwise permit. Check out the code.


I fully agree MPI seems to be a more natural fit.

OpenCL for Epiphany would just need an extremely complex compiler (e.g. shape analysis, substituting higher-order functions for recognised patterns of array indexing, which would in turn handle the DMA).

Maybe such work could be done in a way that translates to other hardware, making it useful well beyond the Epiphany (e.g. such shape-analysis substitution could have other uses).

OpenCL runs on widespread GPUs, but we also have clusters of boxes with GPUs, 2-4-way SLI, Facebook has an AI server with 8 cards in one box, there's AMD HSA which allows close collaboration between CPU and GPU, etc. OpenCL has the useful property of defining non-overlapping kernel invocations and requiring explicit barriers for inter-kernel synchronization, so there is more useful parallelism information to go on than in, say, traditional C++ programs.

Portability + performance IS difficult, but we have better tools today compared to 10 years ago (Clang, LLVM).
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

