Have you looked at the Thrust library? ( https://github.com/thrust/thrust )
It uses templates for GPU programming.
It also has some backends for TBB and OpenMP.
The interesting case is the CUDA backend, since it has the same issue of the host and the target being different architectures.
Based on a cursory examination, it appears to rely on NVCC's ability to split a single source file into host and target pieces that are compiled separately, apparently driven by annotations in the code (CUDA marks functions with qualifiers such as __host__ and __device__). Maybe this could be replicated with appropriate preprocessor macros? The build step would require two compile passes - one with the ARM host compiler and one with the Epiphany target compiler - but that doesn't seem too bad.
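To make the macro idea a bit more concrete, here is a rough sketch of what a single source file compiled in two passes might look like. Everything in it is an assumption for illustration: BUILD_FOR_TARGET, HOST_ONLY, Scale, transform_in_place, run_scale and kernel_entry are made-up names, not part of Thrust or any Epiphany SDK, and the host side just runs the kernel locally instead of actually loading anything onto the target.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical macro: the build would pass -DBUILD_FOR_TARGET only in the
// Epiphany compile pass. No real toolchain defines it.
#if defined(BUILD_FOR_TARGET)
  #define HOST_ONLY 0
#else
  #define HOST_ONLY 1
#endif

// Shared code, compiled in both passes (roughly what a function marked
// __host__ __device__ is in CUDA).
template <typename T>
struct Scale {
    T factor;
    T operator()(T x) const { return x * factor; }
};

// Generic algorithm templated on the functor, also shared by both passes.
template <typename T, typename F>
void transform_in_place(T* data, std::size_t n, F f) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = f(data[i]);
    }
}

#if HOST_ONLY
// Host pass only. A real version would copy the data to the target and
// start the kernel there; this stand-in simply runs it on the host.
inline std::vector<float> run_scale(std::vector<float> v, float factor) {
    transform_in_place(v.data(), v.size(), Scale<float>{factor});
    return v;
}
#else
// Target pass only. A C-linkage entry point so the host side could refer
// to it by an unmangled name.
extern "C" void kernel_entry(float* data, std::size_t n, float factor) {
    transform_in_place(data, n, Scale<float>{factor});
}
#endif
```

The same file would then be built once with the ARM compiler and once with the Epiphany compiler plus -DBUILD_FOR_TARGET. On the host, run_scale({1.0f, 2.0f, 3.0f}, 2.0f) doubles each element. The part this sketch skips entirely is how the host finds and launches the target binary.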
The other issue is the overhead of using C++ on the target. I have no idea how much that might be.