It should be possible to write code that issues one fmadd per clock cycle in an inner loop over as much data as fits on-core, but I don't think the C compiler can manage it.
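To illustrate the loop shape I mean, here is a minimal C sketch of a dot product with four independent accumulators. The name dot4 and the requirement that n be a multiple of 4 are assumptions made for the example, not anything taken from your code. Whether the compiler actually emits one fmadd per cycle for it is exactly the open question, so check the generated assembly.

```c
#include <stddef.h>

/* Dot product with four independent accumulators.
   Each statement has the multiply-then-add shape that a compiler may
   contract into a single fused multiply-add. The four sums do not depend
   on one another, so a pipelined FPU need not wait on the previous result.
   Assumption: n is a multiple of 4. dot4 is a hypothetical name. */
float dot4(const float *restrict a, const float *restrict b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Combine the partial sums once, outside the hot loop. */
    return (s0 + s1) + (s2 + s3);
}
```

The loads also have to be scheduled alongside the arithmetic, which is the part I would expect a C compiler to struggle with.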
The matmul example probably includes the host->core->host transfers, which add significant overhead.
Without seeing your code (and/or what the compiler does with it), it's hard to suggest improvements, or to tell whether you're timing the calculation in a way that could reasonably be expected to reach that fmadd rate.
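As a rough sketch of timing only the calculation, the C below starts the clock after the data is already in place and stops it before anything is copied back. The names mac_kernel and macs_per_second are hypothetical, and the standard clock() call is used only to keep the example portable; on the device itself a hardware cycle counter would presumably be the better choice.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel under test: a plain multiply-accumulate loop. */
static float mac_kernel(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Runs the kernel `reps` times over data that is already in place, so no
   transfer cost falls inside the timed region. Returns multiply-adds per
   second, or 0 if the clock did not advance. The last result is written
   to *out. Note that an optimizing compiler may still hoist or fold the
   repeated call, so inspect the assembly before trusting the number. */
double macs_per_second(const float *a, const float *b, size_t n,
                       unsigned reps, float *out)
{
    volatile float sink = 0.0f;   /* keeps the result live */

    clock_t t0 = clock();
    for (unsigned r = 0; r < reps; r++)
        sink = mac_kernel(a, b, n);
    clock_t t1 = clock();

    *out = sink;
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? ((double)n * (double)reps) / secs : 0.0;
}
```

Dividing the returned rate by the core clock frequency gives multiply-adds per cycle, which is the figure to compare against the one-fmadd-per-cycle target.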