Thanks for the replies greytery and mhonman.
greytery:
Yes, you’re absolutely correct. I should have been more careful - I did note that the Mandelbrot set demo was “almost linear” when I watched it, but I guess I was sloppy in my post... obviously the difference is very important, though.
I will read N. Gunther, thanks for the tip.
That’s really interesting about your work on those large systems. I’ll read a bit about that too. I’m familiar with Amdahl’s Law but have never really got to run stuff on anything bigger than a small cluster.
mhonman:
you’re right about Turbo Boost and the performance drop off. It’s something I looked into before.
Also, my CPU has 4 cores but has hyper-threading, so 8 virtual cores. If I understand it correctly, when one virtual core needs to wait on a memory access, the physical core can process instructions from the other virtual core’s instruction stream. That stream will need a memory access at some stage, at which point the physical core switches back to the first virtual core’s stream, which by now has the data it needs, and so on. And I presume there’s a scheduling algorithm for the case where one virtual core is running a process so CPU-bound that it hardly ever waits on memory.
I have found that hyper-threading does actually give a decent boost, as it utilises cycles that would otherwise be dead while waiting on memory.
For reference, my CPU's base frequency is 2GHz and my Turbo bins are 6/6/8/9 for 4/3/2/1 active cores. So the steps are: 2.6, 2.6, 2.8, 2.9 GHz.
With Turbo Boost disabled, here are the numbers for the 2048x2048 matrix multiplication. The first two are ALMOST linear (see greytery, I promised I’d be more careful in future) but it quickly drops off: at 4 cores all physical cores are busy, yet the OS still has to run, so there will be context switches (one possible advantage of the Epiphany, since it’s bare metal), and at 8 cores there is a small improvement in speed but efficiency is way down.
Cores Seconds
1 123
2 64
4 42
8 34
I purposely didn’t disable Turbo Boost for my original test, as there are just too many differences to try to normalise away. For my thesis I thought about trying to normalise on clock speed, but then there’s pipeline length, cache size, number of ALUs etc. to consider. I think it’d be nearly impossible to normalise to the point where you could really compare directly and expect the same normalised result.
What I wanted to see was how the i7 (used as it is normally used - through an OS, possibly with OpenMP, and with all of its "fancy" features turned on) compares to an Epiphany, as it is normally used.
I had an idea about something I could do even without the board. I realised that I do actually have a runtime for a matrix multiplication on the board - it’s in all of the demo videos! In Shodruky’s video the 512x512 matrix multiplication takes 165.5 ms. On my i7, on 4 cores (which gives the best time), using OpenMP, it takes 184 ms. While doing this calculation my i7 consumes 14 W of power; the 16-core Epiphany consumes a max of 2 W. That’s a huge difference in power, and the Epiphany is faster.
I’m aware of startup time (but only the multiplication is being timed, so I don’t believe it’s an issue), and of the i7 not getting a chance to go into Turbo mode because the program execution is so short; perhaps with a longer run the i7 would come out way on top. But then what would happen with 64 cores (which apparently also consumes a max of 2 W - I’m getting these figures from the data sheets)?
It’s all really interesting stuff.
Thanks,
Eoghan.