Page 1 of 1

Very slow OpenCL execution time (with example code).

PostPosted: Tue Apr 26, 2016 2:47 pm
by CyberHiPriest
Hi there,

I've been playing a bit with this awesome board and wanting to do some raymarching gfx with the 16 cores.

However, I hit a bit of a wall when it comes to OpenCL performance on the Parallella board.
It seems that using clforka() in a loop is very resource-intensive as a simple example runs 100x slower on the Epiphany chip than it does on the dualcore ARM.

I'm probably doing something stupid, but can't really tell what and how to fix it.

I made a simple program to showcase the delays in clforka(), so people smarter than me can take a look at it.
raymarching example code
(2.56 KiB) Downloaded 1108 times

You can compile & run it on a Linux x86_64 host, Parallella ARM and Epiphany by giving the program different parameters:
Code: Select all
(for the Epiphany)
# ./raymarch -a -nthreads 16
(for the ARM CPUs)
# ./raymarch -c -nthreads 2
(on your host core-i7 or something like that)
# ./raymarch -c -nthreads 8
(currently it does not run out-of-the-box on nvida/amd/intel GPU's, but has support for it)
# ./raymarch -g -nthreads 256

I used the browndeer's coprthr/opencl wrapper at the default location (/usr/local/browndeer) for both the parallella and on my x86_64 PC.
The only extra dependency is libSDL1.2-dev.

My idea is that clforka() does the transfer of RTE + ocl_kernel to all the cores and sets up & starts all the cores behind the scenes. This is not needed to be done every loop in my example, I just want to feed the OpenCL kernel updated parameters.

Re: Very slow OpenCL execution time (with example code).

PostPosted: Tue Apr 26, 2016 8:46 pm
by jar
The OpenCL startup time is problematic because the eSDK serial loader is slow. This is being addressed in COPRTHR 2 ( It's going to get much better soon.

Alternatively, you need to you something called "persistent threads" if your clforka() call is inside a loop. Basically, move the loop inside the kernel device code instead of the host code. I have not looked at your code.

Re: Very slow OpenCL execution time (with example code).

PostPosted: Wed Apr 27, 2016 3:05 pm
by CyberHiPriest
After reading the other topic on the forum (Sobel 35x slower)

I started to investigate the assembly and saw some calls to generic float/double code that was probably executed in the shared DRAM.

'aolofsson' wasn't kidding when he said that such code is 100x slower. More like 200x slower in my example.

Anyway, after adding some 'superfast' routines (sfdiv, sfsin, ...) and making sure all constants are floats, my frame-time went down from 9000(!) milliseconds to an insane 35 milliseconds! Very cool!! :D

Here's the adjusted code for those who want to see the difference too.

Code: Select all
int e_dma_copy(void *, void *, unsigned long);
#include <string.h>
#define e_dma_copy(x,y,z) memcpy(x,y,z)

#define PI   3.141592653589793f
#define PI_2 1.57079632679489661923f
#define PI_4 0.78539816339744830962f

float sfinv(float a)
   union fu32_u {
      float f;
      uint32_t u;
   union fu32_u x;
   x.f = a;
   x.u = 0x7eeeeeee - x.u;
   x.f = x.f * (2.0f - a * x.f);
   return x.f;

float sfdiv(float a, float b)
   return a * sfinv(b);

float sffloor(float x)
   return (float)((int)x);

float sfmod(float x, float y)
   return x - y * sffloor(x / y);

float sfsin(const float a)
   float val = 1.0f;
   float theta = sfmod(a, 2.0f * PI);
   val = 1.0f - theta * theta * 0.083333333f * 0.076923077f * val;
   val = 1.0f - theta * theta * 0.1f * 0.090909091f * val;
   val = 1.0f - theta * theta * 0.125f * 0.111111111f * val;
   val = 1.0f - theta * theta * 0.166666667f * 0.142857143f * val;
   val = 1.0f - theta * theta * 0.25f * 0.2f * val;
   val = 1.0f - theta * theta * 0.5f * 0.333333333f * val;
   return theta * val;

float sfcos(const float a)
   float val = 1;
   float theta = sfmod(a, 2.0f * PI);
   val = 1.0f - theta * theta * 0.083333333f * 0.090909090f * val;
   val = 1.0f - theta * theta * 0.10000000f * 0.11111111f * val;
   val = 1.0f - theta * theta * 0.12500000f * 0.14285714f * val;
   val = 1.0f - theta * theta * 0.16666667f * 0.20000000f * val;
   val = 1.0f - theta * theta * 0.25000000f * 0.33333333f * val;
   val = 1.0f - theta * theta * 0.50000000f * 1.00000000f * val;
   return val;

__kernel void raymarch_kern(float time, uint pitch, uint xres, uint yres, __global uint * frame)
   uint line[xres];
   int i, j;
   float2 fres;
   fres.x = (float)xres;
   fres.y = (float)yres;
   int y = get_global_id(0);
   uint *dst = frame + ((yres - y) * xres);
   float2 c = (float2) { 0, y };
   float4 color;
   uint fcolor;
   for (i = 0; i < xres; i++) {
      c.x = (float)i;
      color = (float4) { sfdiv(c.x, fres.x),
               sfdiv(c.y, fres.y),
              0.5f + 0.5f * sfsin(time),
      fcolor = 0xff000000 +
         ((uint) (color.x * 255.0f) << 16) +
         ((uint) (color.y * 255.0f) << 8) +
         ((uint) (color.z * 255.0f));
      line[i] = fcolor;
   e_dma_copy(dst, line, pitch);