Re: Questions about Performances
Posted:
Thu Jul 24, 2014 9:31 am
by notzed
It should be possible to write code that issues one fmadd per clock cycle in an inner loop over as much data as fits on-core, but I don't think the C compiler can manage it.
The matmul example probably includes the host->core->host transfers which add significant overheads.
Without seeing your code (and/or what the compiler is doing to it), it's hard to suggest how to improve what you have, or to tell whether you're timing the calculation in a way that could reasonably be expected to reach that fmadd rate.
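As a rough illustration of the kind of inner loop meant above, here is a minimal C sketch of a dot product with four independent accumulators. The function name dot_unrolled4 and the unroll factor of 4 are assumptions made for this example, not anything taken from the matmul example or from the original poster's code. The idea is only that independent accumulators give the compiler a chance to overlap successive multiply-adds; whether it actually emits fmadd instructions, let alone one per cycle, depends on the compiler and flags and would need to be checked in the generated assembly.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only: dot product of two float
 * arrays, written with four independent accumulators so that consecutive
 * multiply-adds do not depend on each other's results. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent multiply-adds per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    /* Tail: handle the remaining 0..3 elements. */
    for (; i < n; i++)
        acc0 += a[i] * b[i];

    /* Combine the partial sums once, outside the hot loop. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

If you time something like this, start and stop the timer around the loop itself, on data that is already on-core, so the host->core->host transfers aren't counted.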
Re: Questions about Performances
Posted:
Thu Jul 24, 2014 11:40 am
by notzed
I tried:
http://a-hackers-craic.blogspot.com.au/ ... -fadd.html
I wouldn't expect the C compiler to get very close to that, but I'm prepared to be surprised.