OpenCL __local memory

Moderator: dar

OpenCL __local memory

Postby djm » Mon Sep 15, 2014 5:33 am

Does OpenCL __local memory (declared as __local in kernel function signature, and set to null arg_value and appropriate size in clSetKernelArg() ) map to the 32K local memory of each epiphany RISC core? I ask because when I tried to launch a kernel to run on each core, and have each kernel write the address of its __local pointer into a global memory buffer, the addresses were all the same, and all with high order 12 bits set to 0x808.

I'm confused. In one sense, as per OpenCL, local memory should be shared by all work items in a work group. For OpenCL on epiphany via COPRTHR there is only 1 work group with max 16 work items, each work item corresponding to a RISC core. So __local buffers should indeed be shared by all epiphany cores as per OpenCL. The answer I got is 'correct' except that each core will pay a different access cost because although the __local memory was in local storage for 1 core, it is not local for any other core, and is accessed through the inter-core mesh.

How do I make use of the 32K RISC core local memory in an OpenCL kernel to speed up execution by avoiding repeated access to global memory?

Thanks for the help.
djm
 
Posts: 11
Joined: Mon Sep 15, 2014 5:08 am

Re: OpenCL __local memory

Postby dar » Mon Sep 15, 2014 2:02 pm

This is where OpenCL memory model does not really work since it does not cover architectures like this.

__local should not be used from within a kernel, e.g.,
Code: Select all
__kernel void my_kern(...)
{
   __local float buffer[32];
}

since this is not supported.

An attempt was made to allow the OpenCL concept of local memory, which is based on GPUs, to work if allocated
from the host code and declared as an argument to the kernel, e.g.,
Code: Select all
__kernel void my_kern( ..., __local float* buffer, ...)
{ ... }


However, the use of OpenCL local memory concept is STRONGLY discouraged since it does not map to Epiphany, which has per-core local memory which is completely different from the OpenCL concept. The way the second case is supported is by allocating a portion of core 0 local memory and allowing this to be accessed by all cores in the OpenCL sematic. It is not efficient and not how you want to program a RISC array like this at all.

In order to use the per-core local memory you simply allocate an array in the kernel, e.g.,
Code: Select all
__kernel void my_kern(...)
{
   float buffer[32];
}

That's it, just like normal C, and what you will have are buffers allocated for each thread. Its tempting to call this allocation "OpenCL private" but its not - this is what I mean when I say OpenCL memory model does not cover this case. Even though the allocation is local to the thread, its visible from all other threads/cores as part of the basic design of Epiphany. Note that this per-thread memory allocation only exists within the execution of a workgroup (or thread block would be a better term). However, this should not be a problem since the optimized use of a processor like this should only launch a number of threads that match the number of physical cores. There are extensions that are implemented to allow for obvious operations like read/write to the local memory of a specific core, etc. This should be in the documentation, but let me know if you have questions.

Just to summarize, do not use concept of OpenCL local memory, use concept of per-thread local memory that is allocated private but has visibility of shared.
dar
 
Posts: 90
Joined: Mon Dec 17, 2012 3:26 am

Re: OpenCL __local memory

Postby djm » Mon Sep 15, 2014 3:30 pm

Thanks for this excellent explanation! I was suspecting that this was the case but couldn't find it spelled out so nicely anywhere.
djm
 
Posts: 11
Joined: Mon Sep 15, 2014 5:08 am

Re: OpenCL __local memory

Postby djm » Mon Sep 15, 2014 3:33 pm

A related question … if I don't know the size of the per-core buffer at compile time, can malloc be called from the kernel code? Thanks.
djm
 
Posts: 11
Joined: Mon Sep 15, 2014 5:08 am

Re: OpenCL __local memory

Postby dar » Mon Sep 15, 2014 4:03 pm

if I don't know the size of the per-core buffer at compile time, can malloc be called from the kernel code?


Its unclear what malloc() as provided by stdlib for epiphany will do, but I think it most likely allocates from global memory, and even then, I do not think its behavior is well defined.

I had intended to implement a call for the exact purpose you describe, but never got around to it. A possibility here is to use alloca() which is an existing unix call that allocates memory off the stack. This is the same as what you want to accomplish to dynamically allocate per-core local memory. If you only intended the scope of the buffer to be for that of the present call, this would work. If you wanted to have a child call allocate memory and have it persist until an explicit free, then this would not work, but that is a rather complicated scenario for kernels. In that case I could write a call analogous to alloca() that allocates memory from the bottom of freemem for the core, which could be done since this is a known boundary within the coprthr implementation.

All of this is speculative - I never implemented this call. Sorry.

-DAR
dar
 
Posts: 90
Joined: Mon Dec 17, 2012 3:26 am

Re: OpenCL __local memory

Postby djm » Mon Sep 15, 2014 7:49 pm

Thanks again for your clear and detailed response.
djm
 
Posts: 11
Joined: Mon Sep 15, 2014 5:08 am

Re: OpenCL __local memory

Postby cmcconnell » Mon Sep 15, 2014 9:18 pm

dar wrote:Its unclear what malloc() as provided by stdlib for epiphany will do, but I think it most likely allocates from global memory, and even then, I do not think its behavior is well defined.

I haven't checked the epiphany implementation, but doesn't malloc(), by definition, allocate from the heap?

The location of the heap is specified in the linker description file, and presumably can be made local if desired. The SDK reference does say that it's global by default -
Note that with all of the predefined LDF’s, the heap is allocated externally. This means that use of stdio library will render the program very slow.
Colin.
cmcconnell
 
Posts: 99
Joined: Thu May 22, 2014 6:58 pm

Re: OpenCL __local memory

Postby dar » Mon Sep 15, 2014 10:13 pm

All of that is very much correct I would think, but just a word of caution. Epiphany is operated as a co-processor, and has memory allocated from the host using dmalloc() and has a very specific layout in per-core local memory where memory is at a premium. It is not clear how this is reconciled with the LDF. The conventional malloc() call uses pages for accounting and a complex highly efficient allocation algorithm that may not operate correctly if the heap is set so as to define an effective total memory of a few pages worth of storage. Its possible magic just happens - that always nice - but I would test before assuming how it will work. These are interesting questions. Some of this I just do not know the answer. In the end, the best and most efficient solution to dynamic per-core memory allocation will be to use a specialized alloca() implementations designed to operate in a constrained memory environment more like a cache than global memory, allocating memory by raising the free mem boundary controlled within the coprthr implementation. You can see reference to this boundary in the kernel launch code where it must be set just past the program and special data used by the implementation.

If anyone experiments with malloc() I would find the results interesting. Just expect the result may be worst than a seg fault.

-DAR
dar
 
Posts: 90
Joined: Mon Dec 17, 2012 3:26 am

Re: OpenCL __local memory

Postby dobkeratops » Wed Dec 02, 2015 11:38 pm

Would it be possible to divide the scratchpad into some that is __private, and some that is __local.

If i've understood correctly ,in OpenCL __local is 'local to a workgroup' (a bundle of related kernel invocations);
The epiphany has the ability to read/write adjacent cores, with increasing latency the further you go;

So how about reporting the 'workgroup size' as 4(=2x2) on the 16 core chip; divide its' 32k as: 24k code+__private,8k 'local'- each of the 2x2 cores would have 32k of '__local' (=2x2x8k) an average of 1 hop away (they all need a pointer to the top-left of the 2x2 I guess). Its' discontiguous of course.
Then you've got yourself a way of leveraging the epiphany's inter-core communication a little through the standard OpenCL way of doing things.

Code: Select all
+=======+=======+=======+=======+
I 24k   |   24k I 24k   |    24kI
Iprivate|       I       |       I 
I   /  32k  \   I   /  32k  \   I
+---|__local|---+--| __local|---+
I   \_______/   I   \_______/   I
I       |       I       |       I
I 24k   |   24k I 24k   |   24k I
+=======+=======+=======+=======+
I 24k   |   24k I 24k   |   24k I
I       |       I       |       I
I   /  32k  \   I   /  32k  \   I
+---|__local|---+--| __local|---+
I   \_______/   |   \_______/   I
I       |       |       |       I
I 24k   |   24k | 24k   |   24k I
+=======+=======+=======+=======+


.. then each block of cores gets to access '__local' memory that is 'reasonably close'. Think of it like "L2 scratchpads"
(I suppose on bigger chips you could extend that pattern and divide off another portion that is global, read/write for all but scattered across the entire chip; you'd really start needing to compile array access as tiled-array access I guess.)

I realise the main problem mapping to openCL is that it doesn't really directly express data flow, which is what epiphany *really* wants. Perhaps some extremely complex code-transformation tools looking at the information in how kernels read & write (in relation to their indices), and the event fences between them, could deal with that.. "kernel F writes A[i] .. kernel G reads A[i] and waited for X --> turn that into DMA between one group of cores running F, and another running G" .
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: OpenCL __local memory

Postby jar » Fri Dec 04, 2015 3:57 pm

The problem with the OpenCL standard as it relates to the Epiphany architecture is that the OpenCL C language does not define both locality and accessibility. The locality implicitly defines accessibility. The __private memory is accessible by threads within the processing element and __local memory is accessible by all threads within an OpenCL workgroup. Epiphany doesn't really have a hardware workgroup and it can access all memory at any location. There's no programming mechanism within OpenCL to allow thread 0 in workgroup 0 to access __private or __local memory from thread 0 in workgroup 1, although Epiphany could do it.

In my opinion, it would be good to just let OpenCL be and not try to force Epiphany to conform to it. The Epiphany architecture is much more capable than the virtual OpenCL device model.
User avatar
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Next

Return to OpenCL

Who is online

Users browsing this forum: No registered users and 1 guest

cron