by piotr5 » Fri May 29, 2015 9:47 pm
when I first learned about the raspberry pi I experienced the same thing as you with parallella: not enough memory, problems with usb connections to hardware, lack of ready-made hardware components to plug in. now the pi has a camera and loads of commercially quite successful add-ons. so now I'm waiting for the same to happen for epiphany on the software side.
as for fft, it's nonsense to use epiphany for that alone. even if I had a graphics card instead, with simd to program, why would I want to restrict myself to using it for fft? for fft I'd feed it data at a certain rate and get the result back at the same rate. beyond a certain number of cores it doesn't matter how much more power the card has to offer: the transfer speed is always the bottleneck, with every co-processor chip. and the fft algorithm is so blazingly fast that with just a handful of cores this bottleneck already becomes relevant. fft is really better implemented in the main processor or in the fpga.

a multi-core co-processor is supposed to get small compressed input, unpack it on demand, and then return some small compressed data again. any other way your application won't scale well with a rising number of cores. the approach parallella is taking now is: create a lib, called pal, which runs on the epiphany chip, on each core, and which offers fft there! so, just like in the old days when the fpu or mmx/sse were introduced, programmers are needed to make use of it instead of relying on the compiler to apply some limited set of algorithms when needed.

in maths I've seen only one scalable application for multi-core: partial sums/products in logarithmic time! (of course the input could still be large and would need some compression, but the output can be a single number!) a rough sketch of that idea is below. but this doesn't need to be the only application, I'm curious to find other ideas...
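to make the "partial sums in logarithmic time" point concrete, here's a minimal sketch in plain C. it's not epiphany code: the core count P, chunk size N and the addition combine step are just assumptions for illustration, and the per-core work plus the pairwise combining tree are simulated sequentially on the host. on real hardware each core would reduce its own chunk in parallel and the combining would need core-to-core transfers (via e-lib or pal primitives), but the shape of the algorithm is the same: local reduction, then log2(P) combining steps.

[code]
#include <stdio.h>

#define P 8   /* assumed number of cores (power of two) */
#define N 4   /* assumed elements per core              */

int main(void)
{
    float data[P * N];
    float partial[P];

    /* made-up input data, just for the demonstration */
    for (int i = 0; i < P * N; ++i)
        data[i] = (float)(i + 1);

    /* phase 1: every core reduces its own chunk locally
     * (on real hardware these P loops run in parallel)   */
    for (int core = 0; core < P; ++core) {
        partial[core] = 0.0f;
        for (int i = 0; i < N; ++i)
            partial[core] += data[core * N + i];
    }

    /* phase 2: pairwise combining tree, log2(P) steps.
     * at each step a core absorbs the partial result of a
     * neighbour 'stride' cores away; after the last step
     * core 0 holds the total. swap += for *= to get the
     * partial-product variant.                            */
    for (int stride = 1; stride < P; stride *= 2)
        for (int core = 0; core + stride < P; core += 2 * stride)
            partial[core] += partial[core + stride];

    printf("total = %f\n", partial[0]);  /* 1+2+...+32 = 528 */
    return 0;
}
[/code]

the point of the tree is that phase 2 takes only log2(P) rounds instead of P-1 sequential additions, and the output that has to travel back to the host is a single number, which is exactly the "small output" situation where a co-processor scales well.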