Thanks for the replies greytery and mhonman.
greytery:
Yes, you’re absolutely correct. I should have been more careful - I did note that the Mandelbrot set demo was “almost linear” when I watched it, but I guess I was sloppy in my post... obviously the difference is very important, though.
I will read N. Gunther, thanks for the tip.
That’s really interesting about your work on those large systems. I’ll read a bit about that too. I’m familiar with Amdahl’s Law but have never really got to run stuff on anything bigger than a small cluster.
mhonman:
you’re right about Turbo Boost and the performance drop off. It’s something I looked into before.
Also, my CPU has 4 cores but has hyper-threading, so 8 virtual cores. If I understand it correctly, when one virtual core needs to wait on a memory access, the physical core can process instructions from the other virtual core’s instruction stream. That stream will need a memory access at some stage, at which point the physical core switches back to the first virtual core’s stream, which by now has the data it needs, and so on. And I presume there’s a scheduling algorithm for the case where one virtual core is running a process so CPU-bound that it hardly ever waits on memory.
I have found that hyper-threading does actually give a decent boost, as it utilises cycles that would otherwise be dead while waiting on memory.
For reference, my CPU's base frequency is 2GHz and my Turbo bins are 6/6/8/9 for 4/3/2/1 active cores. So the steps are: 2.6, 2.6, 2.8, 2.9 GHz.
With Turbo Boost disabled, here are the numbers for the 2048x2048 matrix multiplication. The first two are ALMOST linear (see greytery, I promised I’d be more careful in future) but it quickly drops off: at 4 cores all physical cores are busy, yet the OS still has to run, so there will be context switches (one possible advantage of the Epiphany, since it’s bare metal), and at 8 cores there is a small improvement in speed but efficiency is way down.
Cores Seconds
1 123
2 64
4 42
8 34
I purposely didn’t disable Turbo Boost for my original test, as there are just too many differences to try to normalise away. For my thesis I thought about trying to normalise on clock speed, but then there’s pipeline length, cache size, number of ALUs etc. to consider. I think it’d be nearly impossible to normalise to the point where you could really compare directly and expect the same normalised result.
What I wanted to see was how the i7 (used as it is normally used - through an OS, possibly with OpenMP, and with all of its "fancy" features turned on) compares to an Epiphany, as it is normally used.
I had an idea about something I could do even without the board. I realised that I do actually have a runtime for a matrix multiplication on the board - it’s in all of the demo videos! In Shodruky’s video the 512x512 matrix multiplication takes 165.5 ms. On my i7, on 4 cores (which gives the best time), using OpenMP, it takes 184 ms. While doing this calculation my i7 consumes 14 W of power; the 16-core Epiphany consumes a max of 2 W. That’s a huge difference in power, and the Epiphany is faster.
I’m aware of startup time (but only the multiplication is being timed, so I don’t believe it’s an issue), and of the i7 not getting a chance to go into Turbo mode because the program execution is so short; perhaps with a longer run the i7 would come out way on top. But then what would happen with 64 cores (which apparently also consumes a max of 2 W - I’m getting these figures from the data sheets)?
It’s all really interesting stuff.
Thanks,
Eoghan.