I tried this one on the two ARM cores.
It seems very slow: 0.47 khash/sec per core.
The C code port (with simple gap recalculation) takes 20 KB of code space, which is really tight.
I don't have time to optimize it or try further, but I don't expect the per-core performance to be any better than on the ARM cores.