by dar » Sat Dec 22, 2012 4:07 am
The kernel is certainly not optimized; maybe 10-15 minutes of thought went into matching it to Epiphany to get decent performance. I have not run it on the ARM. It was faster than the CPU used with the earlier eval kit.
Your question raises a good point that programmers new to OpenCL (and perhaps in general) should be aware of: being able to express an algorithm in a common API does not mean that the precise form of your code will run well (or even work) on different hardware platforms. That kind of portability is a bit of a fantasy. Code is portable across GPUs only to the extent that they share architectural features, and we sometimes begin to believe the fantasy. Performant code must be tuned for a given architecture.
With respect to the Mandelbrot kernel, you are correct, it was designed to write back a line to DRAM. There are other, less obvious factors. Epiphany cores are more like a CPU than a GPU, i.e., the architecture is not SIMD or SIMT, even though it can support both programming models. It's not a multi-threaded architecture, the cores are scalar, and its memory architecture is different from a GPU's. This impacts the code you write. For example, with a GPU you try to keep thousands of threads "in flight" and you have to pay attention to certain memory alignment rules between threads. With Epiphany I believe you gain nothing by keeping more threads in flight than the number of physical cores, and the memory rules are different.
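To make the "one thread per core, write back a line at a time" idea concrete, here is a rough sketch in plain C. It is not the actual OpenCL kernel from the demo. The names (core_kernel, run_all_cores, mandel_point), the image size, the 16-core count, and the interleaved row assignment are all assumptions made for illustration, and the cores are simply simulated one after another on the host.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WIDTH     64
#define HEIGHT    32
#define NUM_CORES 16   /* assumption: a 16-core Epiphany part */
#define MAX_ITER  255

/* Escape-time iteration count for one point c = cr + i*ci. */
unsigned char mandel_point(float cr, float ci)
{
    float zr = 0.0f, zi = 0.0f;
    int n = 0;
    while (n < MAX_ITER && zr * zr + zi * zi <= 4.0f) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        n++;
    }
    return (unsigned char)n;
}

/* Work for one hypothetical core: rows core_id, core_id + NUM_CORES, ...
 * Each row is computed into a small buffer (standing in for core-local
 * memory) and then copied out in one block (standing in for the
 * line-at-a-time write back to DRAM). Returns the number of rows done. */
int core_kernel(int core_id, unsigned char *image)
{
    unsigned char line[WIDTH];
    int rows_done = 0;

    for (int y = core_id; y < HEIGHT; y += NUM_CORES) {
        float ci = -1.0f + 2.0f * (float)y / (float)HEIGHT;
        for (int x = 0; x < WIDTH; x++) {
            float cr = -2.0f + 3.0f * (float)x / (float)WIDTH;
            line[x] = mandel_point(cr, ci);
        }
        memcpy(&image[y * WIDTH], line, WIDTH);  /* one write-back per line */
        rows_done++;
    }
    return rows_done;
}

/* Host-side stand-in: launch exactly one work item per core, no more.
 * On real hardware these calls would run concurrently, one per core. */
int run_all_cores(unsigned char *image)
{
    int total = 0;
    assert(image != NULL);
    for (int core = 0; core < NUM_CORES; core++)
        total += core_kernel(core, image);
    return total;
}
```

Calling run_all_cores on a buffer of WIDTH * HEIGHT bytes fills the whole image. The point of the sketch is the shape of the work: a fixed number of work items equal to the core count, each looping over many rows, rather than one tiny work item per pixel as you would typically write for a GPU.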
The intent is to eventually provide a programming "best practices" guide for Parallella to help explain some of these things by example.