Hi, and welcome.
Yes, this is very possible. I would think of the dual-core ARM as the host and the Epiphany chip as the device (or accelerator). Similar to how we program GPGPUs, you could run one MPI process on each Parallella board (on the host CPU, so, based on what you described, four MPI processes), and each process would then offload some computation to its matching Epiphany chip. Ideally you would try to keep both the host CPU and the Epiphany device busy (perhaps performing different calculations) for optimal performance, but this is not always possible. For the offload itself, you could either use the Epiphany API directly or a technology such as COPRTHR 2 (which is well documented here), which might suit you well.
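To make the host/device split a bit more concrete, here is a rough sketch of the pattern in plain Python. It is only a simulation: a thread pool stands in for the Epiphany chip and a simple loop stands in for the four MPI processes, so it uses neither MPI nor the Epiphany API. The names `device_kernel`, `host_work`, `run_rank` and `run_all` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_BOARDS = 4  # one process per Parallella board, as in the question


def device_kernel(chunk):
    """Hypothetical stand-in for work offloaded to the Epiphany chip."""
    return sum(x * x for x in chunk)


def host_work(chunk):
    """Hypothetical stand-in for a different calculation kept on the ARM host."""
    return sum(chunk)


def run_rank(rank, data):
    """What one process might do: offload, overlap host work, then collect."""
    chunk = data[rank::NUM_BOARDS]  # this rank's share of the data
    with ThreadPoolExecutor(max_workers=1) as device:
        pending = device.submit(device_kernel, chunk)  # start the "offload"
        host_result = host_work(chunk)                 # host stays busy meanwhile
        device_result = pending.result()               # wait for the "device"
    return host_result, device_result


def run_all(data):
    """Stand-in for launching NUM_BOARDS ranks and reducing their results."""
    partials = [run_rank(rank, data) for rank in range(NUM_BOARDS)]
    total_sum = sum(h for h, _ in partials)
    total_squares = sum(d for _, d in partials)
    return total_sum, total_squares


if __name__ == "__main__":
    print(run_all(list(range(8))))
```

In a real setup the rank would come from `MPI_Comm_rank`, the final combination would be something like an `MPI_Reduce`, and the `submit`/`result` pair would be replaced by whichever offload mechanism you pick (the Epiphany API or COPRTHR 2).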
Let us know how you get on.