Cool. Good work. Maybe you could add this to the parallella-examples repository on GitHub with a pull request?
I perused your device code. It's barely OpenCL beyond determining the thread ID and using the float2/float4 data types. I don't mean that as a knock on the code, but rather to point out that you are using a complicated programming model and almost none of its features. Ray tracers are trivially parallel and have significant arithmetic complexity, so they don't really stress the memory hierarchy or inter-core memory movement. This is fine, but it means the code can't really stress the API enough to demonstrate why OpenCL doesn't match the architecture well. If that makes sense...