clndrange_init2D

Moderator: dar

clndrange_init2D

Postby nickoppen » Mon Sep 08, 2014 10:49 am

Hi dar,

Could I just check with you on my understanding of clndrange_init2D(NULL, gsz0, lsz0, NULL, gsz1, lsz1) please?

Am I correct in saying that passing the resulting clndrange_t value to a fork or forka results in gsz0 * gsz1 calls to the given kernel, split between the available cores, with lsz0 and lsz1 being returned inside the kernel by get_local_size(0) and get_local_size(1) respectively?
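For concreteness, this is roughly the pattern I am using (just a sketch, with 0 for the offsets; gsz0, lsz0, gsz1, lsz1, krn, ctx and the buffer arguments a and b are placeholders, and I am assuming the usual stdcl.h calls):

    clndrange_t ndr = clndrange_init2d(0, gsz0, lsz0, 0, gsz1, lsz1);
    clforka(ctx, 0, krn, &ndr, CL_EVENT_NOWAIT, a, b);   /* a, b: clmalloc'd buffers */
    clwait(ctx, 0, CL_ALL_EVENT);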

If I am correct, are there any restrictions on what values these four arguments can take? Writing my latest blog post was extremely frustrating because it seemed that for many combinations of global and local sizes the kernel was not called at all. The code I posted was the only combination that seemed to work reliably.

I tried the same thing with a bigger data set and bigger global sizes (256 and 1024) and that did not seem to make any difference other than to produce the following output:


root@linaro-nano:/home/linaro/git/nD/Debug# ./nD
[3832] clmesg WARNING: libocl.c(158): cannot read ocl.conf, using ICD fallback (/etc/OpenCL/vendors
coprthr-1.6.0 (Freewill)
[3832] clmesg info: cmdsched.c(86): cmdqx1: run
[3832] clmesg WARNING: command_queue.c(39): __do_create_command_queue_1: cmdq exists
[3832] clmesg WARNING: command_queue.c(39): __do_create_command_queue_1: cmdq exists
dmalloc(0,65536): 0x8e104000 0x8e104000 0x90004000
dmalloc(0,65536): 0x8e104000 0x8e114000 0x90004000
[3832] clmesg info: cmdsched.c(86): cmdqx1: run
[3832] clmesg WARNING: ocl_enqueue.c(893): clEnqueueNDRangeKernel: ignoring global_work_offset
[3832] clmesg info: e32pth_engine_needham.c(100): e32_engine_startup: engine is path-through
[3832] clmesg ERROR: e32pth_engine_needham.c(182): exceeded maximum thread block size
[3832] clmesg WARNING: ocl_enqueue.c(893): clEnqueueNDRangeKernel: ignoring global_work_offset
[3832] clmesg ERROR: e32pth_engine_needham.c(182): exceeded maximum thread block size
[3832] clmesg info: cmdsched.c(181): cmdqx1: shutdown
root@linaro-nano:/home/linaro/git/nD/Debug#

Despite the errors being reported, everything seemed to be called and the program seemed to exit properly.

Confusing and frustrating.

nick

Re: clndrange_init2D

Postby dar » Wed Sep 10, 2014 2:55 pm

... for some reason I stopped getting email notifications - let me take a look at this and get back to you.

-DAR

Re: clndrange_init2D

Postby dar » Fri Sep 12, 2014 2:00 am

Without getting too far into the weeds looking at specific cases, let me provide some general comments and see if they shed some light on what you are seeing.

The maximum workgroup size for 1D is 16, and if you use a 2D NDRange the product of the workgroup sizes for each dimension must be <= 16. This limit is imposed by the implementation to achieve good performance and encourage programmers to use the chip most efficiently. Epiphany is not a multithreaded processor like a GPU, so over-subscribing the physical cores is not easily done, and not efficient. (We will put aside the benefit of amortizing memory stalls.) We experimented with software threading using setjmp/longjmp threads (picture software hyper-threading for each Epiphany core) but this was not the optimal model - the issue was the excessive use of precious local memory for multiple stacks.

So the bottom line is that workgroup sizes should be thought of as directly selecting the number of cores you wish to employ on the problem. The concept of workgroup size is an artificial one and does not follow from anything in the physical problem you are trying to solve - it's a tuning parameter for almost all algorithms. In some cases on GPUs the workgroup is used to carefully orchestrate collective actions by the threads (work items) to deal with memory access issues specific to a given GPU, but there is no general parallelization concept involved, it's just architecture-specific optimization. For Epiphany, you do not want to use any of these GPU tricks.
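For example, on the 16-core chip a legal range might look like this (just a sketch, the sizes are arbitrary):

    /* 2D: 64 x 64 global index space, 4 x 4 workgroup: 4*4 = 16 <= 16 cores */
    clndrange_t ndr2 = clndrange_init2d(0, 64, 4, 0, 64, 4);

    /* 1D equivalent: 4096 threads in workgroups of 16 */
    clndrange_t ndr1 = clndrange_init1d(0, 4096, 16);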

You are correct that for 2D the product of the global sizes for each dimension defines the total number of threads launched to execute your kernel. They can be thought of as being launched in "bundles" whose size is defined by the workgroup size(s).
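From inside the kernel each of those gsz0*gsz1 threads just identifies itself in the usual way (sketch; a and w are illustrative arguments):

    __kernel void krn( __global float* a, uint w )
    {
       int i = get_global_id(0);   /* 0 .. gsz0-1 */
       int j = get_global_id(1);   /* 0 .. gsz1-1 */
       /* each "bundle" contains get_local_size(0)*get_local_size(1) work items */
       a[j*w + i] = (float)(i + j);
    }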

For the Mandelbrot example, the reason this 2D problem uses a 1D NDRange is to avoid what you suggested would be complexity for complexity's sake. The best performance is attained by parallelizing over height and having a single thread loop over width. This way each thread is given enough work to amortize the overhead of its launch and shutdown. If you do not give a thread much to do you will see very bad performance.
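In sketch form the pattern is a 1D NDRange over rows, with each thread looping over the columns in its row (the body here is just a placeholder, not the actual Mandelbrot code):

    __kernel void row_kernel( __global float* out, uint width )
    {
       uint row = get_global_id(0);                 /* one thread per row */
       for (uint col = 0; col < width; col++) {
          /* per-pixel computation goes here */
          out[row*width + col] = (float)(row*col);
       }
    }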

You noted that NDRange addresses only the thread or kernel launch and is not connected directly to data partitioning - that nothing terribly clever is going on. This is correct. The programmer usually makes this connection; the API imposes nothing here. An argument can be made (and I will make it) that the N-dimensional part of the NDRange is a waste and overcomplicates things. It is there to access the very specific 2D hardware GPUs have had, but as a general programming concept it would be better to simply treat parallelization as applying n threads to solve the problem, like OpenMP, and dispense with the multi-dimensionality, since it can always be easily imposed by the programmer anyway - it's not needed in the API.
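For example, nothing stops you from launching a plain 1D NDRange over width*height threads and imposing the 2D view yourself (sketch):

    __kernel void krn_flat( __global float* a, uint width )
    {
       uint gid = get_global_id(0);
       uint x = gid % width;        /* column */
       uint y = gid / width;        /* row */
       a[gid] = (float)(x + y);     /* placeholder computation */
    }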

You noted you cannot cast the clmalloc'd memory to a 2D (C) array. This is correct, and it is the source of many mistakes when using math libraries. A pointer-based 2D C array (float **) is an array of pointers to 1D rows of data. This is different from a flat allocation, even if the flat allocation is sized to hold all the elements of a 2D array.
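In other words (host-side sketch; assumes a valid STDCL context handle ctx and the usual stdcl.h/stdlib.h headers):

    int m = 4, n = 8;

    /* flat allocation: one contiguous block, indexed by hand in row-major order */
    float* a = (float*)clmalloc(ctx, m*n*sizeof(float), 0);
    a[2*n + 3] = 1.0f;                      /* element (2,3) */

    /* a float** "2D array" is a different beast: an array of row pointers */
    float** b = (float**)malloc(m*sizeof(float*));
    for (int i = 0; i < m; i++)
       b[i] = (float*)malloc(n*sizeof(float));
    b[2][3] = 1.0f;                         /* two dereferences, rows need not be contiguous */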

Let me know if any of this helps clarify things or if questions still remain.

-DAR

Re: clndrange_init2D

Postby nickoppen » Sun Sep 21, 2014 7:41 am

Hi DAR,

Thanks for your reply and review of my blog post.

I agree that the whole NDRange idea seems not well suited to the Epiphany and a little undercooked in general.

When you say that the maximum workgroup size is 16, it appears from my experiments that 16 is actually the minimum, at least for clndrange_init1d. It also seems that multiples of 16 work for clndrange_init1d, although I agree with you that running the same code over the same data multiple times does not seem like a very good idea. When I attempt to use workgroup sizes other than multiples of 16 I get combinations of segfaults and Invalid Instruction errors, or it seems as if the kernel is not called at all.

Should we be talking about the number of cores on the Epiphany rather than 16? I don't have an e64 to experiment with, but I would think that the same reasoning would apply.

nick

Re: clndrange_init2D

Postby dar » Mon Sep 22, 2014 12:27 pm

You should be able to use a workgroup size of 8, for example, but I do not believe the scheduler at present would launch 2 workgroups at a time (8+8) to use the chip efficiently. If this is not working, something might be wrong. You should at least be able to launch < 16 threads on the 16-core chip. Possibly something subtle was broken when the low-level code was redesigned for the updated eSDK. And yes, we should be talking about cores - for the 64-core chip the limit would be 64.

The global index space should always be a multiple of the workgroup size - this is generally true for most platforms. Handling cases that are not commensurate only requires trivial code, so it's not clear why the acceptable cases are not more general.
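The usual way to handle a problem size n that is not commensurate is to round the global size up to a multiple of the workgroup size and guard in the kernel (sketch):

    /* host side: round n up to a multiple of the workgroup size */
    size_t lsz = 16;
    size_t gsz = ((n + lsz - 1)/lsz)*lsz;
    clndrange_t ndr = clndrange_init1d(0, gsz, lsz);

    /* kernel side: the padding threads simply do nothing */
    __kernel void krn( __global float* a, uint n )
    {
       uint i = get_global_id(0);
       if (i < n) a[i] = 2.0f*a[i];
    }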

You may find that you can use 32 for the workgroup size on the 16-core chip and conclude that 16 was a minimum, but this is technically not true; it should in fact be a maximum. The reason you might get away with it is if you have a kernel where the workgroup size is completely artificial - for a clear majority of algorithms it is just an artificial creation. However, if you had an algorithm that expected 32 threads to be running concurrently (or functionally equivalent to concurrently, with context switches) you would find a real problem. Since Epiphany does not support multi-threading per core (and the software design does not employ a software solution, for efficiency reasons), it must run 16 threads/kernels to completion without interruption.

Imagine the 32 work items in your workgroup blocked waiting for everyone to reach some point in the kernel. That will never happen, since the first 16 work items must complete before the second half can even launch. Deadlock. So whatever might be happening with a workgroup of > 16 I am not exactly sure - it might be broken up and run in succession, and this might be fine for many cases, but for complicated cases you will get into trouble.
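To make the deadlock concrete, picture a kernel like this launched with a workgroup size of 32 on the 16-core chip (sketch):

    __kernel void krn( __global float* a )
    {
       uint gid = get_global_id(0);
       a[gid] = (float)gid;
       /* all 32 work items in the workgroup must reach this barrier, but only
          16 can be resident at once - the first 16 wait here for the other 16,
          which cannot launch until the first 16 complete: deadlock */
       barrier(CLK_GLOBAL_MEM_FENCE);
       a[gid] += 1.0f;
    }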

