could openCL represent dataflow

Moderator: dar

Postby dobkeratops » Thu Dec 03, 2015 6:28 pm

So, in OpenCL you have buffer objects and invocations of kernels in a queue.

To fully utilize the Epiphany's unique features, you need to express dataflow: the ability to keep temporaries on chip, flowing directly between cores.

Q1 Does the OpenCL programming model already have sufficient information to deduce this (even if an implementation doesn't)?

Q2 If not, is it on their roadmap?

Q3 Could a minor nonstandard extension to OpenCL (ifdef'd out elsewhere), pre-emptively implemented for the Epiphany chip, fix this (and provide a hint to their standards committee)?

example -
Let's say you have 3 buffer objects A, B, C and 2 kernels F, G;
then you enqueue an invocation of 'F' reading A, writing 'B'; then an invocation of 'G', waiting for 'F' to complete, reading B, writing C.
   [A]  --F-->[B] --G--> [C]


You want A and C in persistent memory, and 'B' to be an on-chip temporary, never going off chip. (Maybe the implementation would have a core running a program arbitrating between producers & consumers, with the data in 'B' arriving in packets.)
And the compiler could note: G only reads 'B', so it can actually start earlier, just not on elements of 'B' that haven't yet been written.

It seems to me this could be a data flow so long as you gave the API a hint that 'B' will never be read by the host or any subsequent kernels.

Also, although this said 'G' waits for 'F', what we really wanted was just a hint: "'G' only reads elements of 'B' after F has written them".
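A rough sketch of the difference, in plain C rather than OpenCL. The names f_op, g_op, TILE, run_buffered and run_streamed are made up for illustration, and the two operations are arbitrary stand-ins for kernels F and G. It assumes the simple case where element i of 'B' is written once by F and read once by G.

```c
#include <assert.h>
#include <stddef.h>

#define TILE ((size_t)4)   /* hypothetical on-chip capacity for B, in elements */

static float f_op(float x) { return x * 2.0f; }   /* stand-in for kernel F */
static float g_op(float x) { return x + 1.0f; }   /* stand-in for kernel G */

/* Queue-style execution: all of B is materialised, G starts after F ends. */
static void run_buffered(const float *a, float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; ++i) b[i] = f_op(a[i]);
    for (size_t i = 0; i < n; ++i) c[i] = g_op(b[i]);
}

/* Dataflow-style execution: B only ever exists as a TILE-sized temporary.
 * G consumes each element shortly after F has produced it. */
static void run_streamed(const float *a, float *c, size_t n)
{
    float b_tile[TILE];
    assert(a && c);
    for (size_t base = 0; base < n; base += TILE) {
        size_t len = (n - base < TILE) ? (n - base) : TILE;
        for (size_t i = 0; i < len; ++i) b_tile[i] = f_op(a[base + i]);
        for (size_t i = 0; i < len; ++i) c[base + i] = g_op(b_tile[i]);
    }
}
```

Both functions produce the same 'C', but run_streamed never needs storage for more than TILE elements of 'B'. On real hardware the two inner loops would presumably sit on different cores and overlap in time; here they simply alternate.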

(I think this also requires the ability to automatically figure out whether a kernel writes or reads an intermediate in a sufficiently simple manner, i.e. whether the read/write indices are a simple function of the invocation index. Even if it's dynamic, GPUs keep transformed vertices 'on chip' prior to a 'gather' by primitive assembly, so you might be able to figure out something similar for dynamic indexing to & from 'B'.)

Perhaps this could all be achieved with a 'clEnqueueDiscardBufferContents()', which could tell OpenCL explicitly to discard 'B' by a certain point in the command queue.
Or better still, give a hint prior to the call to 'F': clHintTemporaryBufferBegin(B,..) .. clHintTemporaryBufferEnd(B), from which it can then deduce that any kernel writing it should be compiled to emit its output to an on-chip stream, any kernel reading it takes that stream, and it will never be read by the host. (I think it also needs to be told that no data in 'B' is relevant before 'F' writes it.)

Perhaps the simplest way would be to hint on creation of 'B' (maybe "WRITE_ONCE_READ_ONCE") that every data element is only written once and invalidated when read (hence the invocation of a kernel reading it allows its contents to be discarded).
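To pin down what such a flag would promise, here is a small plain C model of the semantics. Nothing here is OpenCL API: woro_buf, woro_write, woro_read and WORO_CAP are invented names, and the per-element state is just one assumed way of tracking it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WORO_CAP ((size_t)8)   /* hypothetical buffer length, in elements */

typedef enum { SLOT_EMPTY, SLOT_FULL, SLOT_CONSUMED } slot_state;

typedef struct {
    float      data[WORO_CAP];
    slot_state state[WORO_CAP];
} woro_buf;

static void woro_init(woro_buf *b)
{
    assert(b);
    for (size_t i = 0; i < WORO_CAP; ++i) b->state[i] = SLOT_EMPTY;
}

/* Producer side: each element may be written exactly once. */
static bool woro_write(woro_buf *b, size_t i, float v)
{
    if (i >= WORO_CAP || b->state[i] != SLOT_EMPTY) return false;
    b->data[i] = v;
    b->state[i] = SLOT_FULL;
    return true;
}

/* Consumer side: a read succeeds only after the write, and invalidates
 * the element, so its storage could be released or reused. */
static bool woro_read(woro_buf *b, size_t i, float *out)
{
    if (i >= WORO_CAP || b->state[i] != SLOT_FULL) return false;
    *out = b->data[i];
    b->state[i] = SLOT_CONSUMED;
    return true;
}
```

A failed woro_read on an element that has not been written yet corresponds to the consumer kernel having to wait for the producer, which is the dependency 'G' really has on 'F'.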

I realise this would be rather complex to implement. But there are many devices on which OpenCL can run, so there might be other contexts where dataflow could be relevant (e.g. Xeon Phi? multi-chip GPU cards? HSA? FPGAs? or perhaps the ability to run OpenCL on supercomputers/clusters).

(What the world needs to enable adoption of something like the Epiphany, IMO, is a means of writing code that can also run reasonably efficiently on other common hardware, allowing people to migrate.)

Re: could openCL represent dataflow

Postby jar » Thu Dec 03, 2015 10:10 pm

Unless I'm misunderstanding you, OpenCL 2.0 already supports this with Pipes.

A1. In 2.0, it's very explicit with pipes, but yes.

A2. It was on their roadmap a while ago. Actual implementations may vary.

A3. I wouldn't expect many improvements to the Epiphany OpenCL stack anytime soon unless the community does it. The OpenCL device model just doesn't map well to the Epiphany architecture and others like it.

Re: could openCL represent dataflow

Postby dobkeratops » Thu Dec 03, 2015 10:46 pm

jar wrote:Unless I'm misunderstanding you, OpenCL 2.0 already supports this with Pipes.

OK, seems like that does indeed do it. Unfortunately OpenCL 2.0 support isn't widespread (Mac OS X, NVIDIA), but that's not their fault.

My problem remains that I can't use OpenCL 2.0 and run it on my Mac laptop or NVIDIA desktop GPU (it seems that's a problem NVIDIA creates).

I guess with some #ifdefs you could write a single code path that compiles to use pipes on a 2.0 device, or simulates them with a buffer elsewhere.
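Something along these lines, as an untested-on-device sketch. `__OPENCL_VERSION__` and `__OPENCL_C_VERSION__` are the predefined OpenCL C macros, and `write_pipe`/`read_pipe` are the OpenCL C 2.0 built-ins; everything else (packet_t, KERNEL, GLOBAL, B_OUT, B_IN, B_PUT, B_GET, host_gid) is made up for illustration. Each value is tagged with its index because a pipe does not tie packets to work-item ids. The plain C branch exists only so the buffer fallback can be exercised on a host.

```c
#ifdef __OPENCL_VERSION__            /* set by OpenCL C device compilers */
  #define KERNEL __kernel
  #define GLOBAL __global
#else                                /* plain C host build of the fallback */
  #include <assert.h>
  #include <stddef.h>
  #define KERNEL
  #define GLOBAL
  static size_t host_gid;            /* set by the host loop before each call */
  static size_t get_global_id(unsigned dim) { assert(dim == 0); return host_gid; }
#endif

typedef struct { int idx; float val; } packet_t;   /* value tagged with its index */

#if defined(__OPENCL_C_VERSION__) && (__OPENCL_C_VERSION__ >= 200)
  /* OpenCL C 2.0: B is a pipe. Both calls return 0 on success. */
  #define B_OUT        __write_only pipe packet_t b
  #define B_IN         __read_only pipe packet_t b
  #define B_PUT(i, p)  write_pipe(b, &(p))
  #define B_GET(i, p)  read_pipe(b, &(p))
#else
  /* Elsewhere: B is an ordinary buffer indexed by work-item id. */
  #define B_OUT        GLOBAL packet_t *b
  #define B_IN         GLOBAL const packet_t *b
  #define B_PUT(i, p)  (b[(i)] = (p), 0)
  #define B_GET(i, p)  ((p) = b[(i)], 0)
#endif

KERNEL void F(GLOBAL const float *a, B_OUT)
{
    size_t i = get_global_id(0);
    packet_t p;
    p.idx = (int)i;
    p.val = a[i] * 2.0f;             /* arbitrary stand-in for F's work */
    (void)B_PUT(i, p);               /* sketch: a failed pipe write is ignored */
}

KERNEL void G(B_IN, GLOBAL float *c)
{
    packet_t p;
    if (B_GET(get_global_id(0), p) == 0)
        c[p.idx] = p.val + 1.0f;     /* arbitrary stand-in for G's work */
}
```

The host side would still need its own #ifdef to create 'B' with clCreatePipe or clCreateBuffer. On the pipe path, a write fails if the pipe is full and a read fails if it is empty, so a real version would have to handle those cases rather than drop them.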


(What I have in mind is probably a lot harder to implement and would be nonstandard, but would allow similar behaviour from code that's written for OpenCL 1.2, with the compiler basically transforming specific buffer reads & writes into a pipe.)


jar wrote: The OpenCL device model just doesn't map well to the Epiphany architecture and others like it.


I know a kernel usually involves taking an invocation index and using it to read from global data, which the Epiphany can't really do. But I'd wondered if there are already any attempts in the compiler to recognize simple relationships between that index & the global reads, and hence 'drive' the kernel with DMA packets ("here's a chunk of your input buffer, holding this index range, suitable for servicing the reads done by global_indices ... , so call the kernel on those.."). I think the e-cores can do random writes better?
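For the easiest case, where the kernel reads in[i] and writes out[i] for invocation index i, the 'driving' could look roughly like this. It is a plain C model, not Epiphany or OpenCL code: memcpy stands in for a DMA transfer, and CHUNK, kernel_chunk and drive are invented names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK ((size_t)4)   /* hypothetical element count that fits in local store */

/* The kernel body only ever sees a local chunk and local indices. */
static void kernel_chunk(const float *in_local, float *out_local, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        out_local[i] = in_local[i] * in_local[i];   /* arbitrary per-element work */
}

/* The driver walks the global index range and feeds the kernel one chunk
 * at a time; each memcpy models a DMA packet to or from the core. */
static void drive(const float *in_global, float *out_global, size_t n)
{
    float in_local[CHUNK], out_local[CHUNK];
    assert(in_global && out_global);
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t count = (n - base < CHUNK) ? (n - base) : CHUNK;
        memcpy(in_local, in_global + base, count * sizeof(float));
        kernel_chunk(in_local, out_local, count);
        memcpy(out_global + base, out_local, count * sizeof(float));
    }
}
```

This only works because the read index equals the invocation index, so the driver knows which slice of the input each batch of invocations needs. Anything with data-dependent reads would need the compiler analysis (or the kernel splitting) discussed here.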

And what about future chips: if they have bigger memories, would it be possible to allocate some smaller buffers on chip, and try to run kernels on cores closer to the buffers they happen to read? 256 cores x 128k per core would give you 32MB on chip. I suppose you'd still have to use OpenCL in a very different way, i.e. with many small buffers, or again have a very complex compiler capable of splitting the kernel into several steps that can be run in different places ('first part calculates read indices, sends a message to the core holding the right fragment of the object along with any temporaries'.. a kernel with a lot of indexing would be split into multiple stages pushing locals downstream instead of gathering).
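The arithmetic, plus the kind of address-to-core lookup such a scheme would need, as a small plain C sketch. The 256-core, 128k-per-core chip is the hypothetical from the paragraph above, not an existing part, and the flat linear layout and the names placement and locate are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_CORES = 256, BYTES_PER_CORE = 128 * 1024 };   /* hypothetical chip */

/* 256 cores x 128 KiB = 32 MiB of on-chip memory in total. */
static size_t total_on_chip_bytes(void)
{
    return (size_t)NUM_CORES * BYTES_PER_CORE;
}

typedef struct { size_t core; size_t offset; } placement;

/* Which core holds a given byte of a buffer spread linearly across cores,
 * and at what offset inside that core's local memory. */
static placement locate(size_t byte_addr)
{
    assert(byte_addr < total_on_chip_bytes());
    placement p = { byte_addr / BYTES_PER_CORE, byte_addr % BYTES_PER_CORE };
    return p;
}
```

A runtime that knows locate() for a buffer could schedule a kernel stage on (or near) the core that owns the fragment it reads, instead of gathering the data to wherever the kernel happens to run.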