Parallella Community

by **aolofsson** » Fri Jun 21, 2013 1:31 pm

We now have ~30 Parallella prototype systems in the field and will be shipping another 80 final form factor Parallella boards very soon to early access backers. Unfortunately, there are still a lot of skeptics out there who don't believe the "wimpy" Epiphany cores are any good. Amazingly, some folks still think that the Epiphany chip and the Parallella platform are vaporware!

We need your help! If you have used the Parallella board it would be great if you could reply to this post to summarize your experience so far. Please be honest and fair. :-)

Thanks!
Andreas

by **8l** » Fri Jun 21, 2013 4:21 pm

I will post when I get my board (64 cores!).

btw, I don't think they really believe it's vaporware.
On the contrary, they hope it is, because the stinky old x86 platform has become a red ocean...

by **tnt** » Sun Jun 23, 2013 9:49 pm

Hi,

So, I've had a board for a few months now, and although I didn't have as much time as I would have liked to play with it, I'll try to give a bit of history on what I did. To be clear, I'll talk about the complete experience with the prototype and not just the Epiphany-related part, so some of my feedback will not be applicable to the final Parallella.

So, I got a kit back at the very end of February, which was very exciting. It didn't start out very well though, since the ZedBoard died within a couple of minutes. Fortunately I managed to trace the issue to the FPGA core voltage switching regulator having fried (no idea why). I repaired it as best I could and the ZedBoard was running again; nothing else seemed damaged and the Epiphany examples were running without issues. A second issue with the ZedBoard that popped up not long after is that the SD card shipped with it by default sucks... it's slow and can corrupt things easily. Better get a good new one directly.

With that false start cleared up, I got started by testing the impact of accessing the external DRAM (EDRAM) both for instructions and data. Those tests pretty much confirmed what I suspected from just reading the architecture manuals: if you're going to read from the EDRAM often without DMA, you might as well run your code from the ARM, because the Epiphany is going to spend the majority of its time waiting for data. So you need to design your task to ship data to/from EDRAM using DMA while processing buffers in SRAM in parallel.
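To illustrate why overlapping DMA with processing matters, here is a rough timing model in Python. It is only a sketch under idealized assumptions (a fixed fetch time and a fixed compute time per buffer, no write-back, no bus contention); the function names and the numbers in the usage note are hypothetical and are not measurements from the board.

```python
def serial_time(n_buffers, t_fetch, t_compute):
    """Core fetches each buffer itself, then processes it (no overlap)."""
    return n_buffers * (t_fetch + t_compute)


def double_buffered_time(n_buffers, t_fetch, t_compute):
    """DMA fetches buffer i+1 while the core processes buffer i.

    Idealized two-stage pipeline: the first fetch and the last compute
    cannot be hidden; every step in between costs the slower of the two.
    """
    if n_buffers == 0:
        return 0
    return t_fetch + (n_buffers - 1) * max(t_fetch, t_compute) + t_compute


if __name__ == "__main__":
    # Hypothetical units: fetching a buffer takes 3, processing it takes 1.
    print(serial_time(4, 3, 1))           # 16
    print(double_buffered_time(4, 3, 1))  # 13
```

In this model the double-buffered version is limited by whichever of the two stages is slower, so when the fetch dominates, as it does for EDRAM, the core still ends up waiting unless each buffer gets enough work done on it.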

So with this in mind I tried looking at what I could do with this platform, mostly in the SDR domain. My first idea was to attach a radio front end, do things like channelization in the FPGA, then do packet demodulation / channel decoding in the Epiphany and pass the L2 data to the ARM. When I tried to port existing code I had, I was struck by just how small 32k is. 32k is the total amount of SRAM each core has, and in there you need to fit the code _and_ the buffers, including some double buffering if you want to use DMA to efficiently move data in/out. At 4 bytes per instruction that's only 8192 instructions (yes, there is a short 2-byte form, but it's so register constrained that in most cases it doesn't apply). And when working with complex float data, each sample is 8 bytes, meaning you can only fit 4096 samples. The other thing I realized then is just how bad gcc was at optimizing 'complex float' operations, meaning that for the code to be small and efficient in those tight loops, you pretty much had to use hand-crafted assembly.
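The budget above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 16 KB code/data split in the usage note is a hypothetical example, and the stack and any other per-core overhead are ignored.

```python
SRAM_BYTES = 32 * 1024   # local SRAM per core
INSTR_BYTES = 4          # long-form instruction size
SAMPLE_BYTES = 8         # complex float: two 32-bit floats


def max_instructions(code_bytes):
    """Number of 4-byte instructions that fit in code_bytes."""
    return code_bytes // INSTR_BYTES


def samples_per_buffer(code_bytes, n_buffers):
    """Complex-float samples per buffer once the code is placed.

    The remaining SRAM is split evenly between n_buffers
    (n_buffers=2 corresponds to simple double buffering).
    """
    data_bytes = SRAM_BYTES - code_bytes
    return data_bytes // (n_buffers * SAMPLE_BYTES)


if __name__ == "__main__":
    print(max_instructions(SRAM_BYTES))       # 8192 (all SRAM used for code)
    print(samples_per_buffer(0, 1))           # 4096 (all SRAM used for data)
    print(samples_per_buffer(16 * 1024, 2))   # 1024 (hypothetical 16 KB of code)
```

With half the SRAM taken by code and two buffers, each buffer holds only 1024 complex samples, which is why the code size of the inner loops matters so much.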

At that point I decided I was going to try something a bit easier and port over some simple blocks of GNURadio. First step was to actually just try to compile GNURadio and auxiliary stuff (like rtl-sdr). Given the lack of a good pre-made cross-compile environment for the Linaro image, I decided to just compile natively (I knew it would take some time, but just CPU time, not my time). That wasn't too hard; the trick was mostly to compile on an external HDD with some swap and disable the Python stuff (that just takes too much RAM).

So then I tried to look at some GR blocks and how to organize them. Some would be pretty easy to fit, like the FIR filters. Some would be more challenging. Again, the issue is mostly how to organize data inside the 32k, but those processing blocks are often simpler. One thing was clear for all of them though: the scheduling of data input/output and where to put the data was sometimes pretty tricky and would have to be handled by the cores themselves.

This is when I started testing the DMA and memory bandwidth. During those tests I encountered some issues, as discussed in the "DMA bug" thread. After a lot of time trying to find out exactly what happened, it eventually got narrowed down to some write transactions from the host getting lost in transit. This bug is AFAIK still not fixed and is a very serious bug for my application. To feed data to the blocks and between the blocks, I came up with a small 'fifo' style interface, and trying to write to this fifo from the host just has _horrible_ performance. Basically the buffer in the core is so small and consumed so fast that to avoid the core underflowing (and then just doing nothing waiting for data), the host has to be in a tight busy loop sending samples, and so it uses like 100% of the host CPU to keep up. And even in the best case (pure benchmark), the write performance from host to Epiphany is terrible (14 MB/s). Until this gets fixed it's pretty hard for me to do anything useful with it in the SDR domain.
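To put the 14 MB/s figure in SDR terms, here is a small Python sketch. It assumes 1 MB = 10^6 bytes and 8-byte complex float samples; the 1024-sample buffer and the 1 Msample/s consumption rate in the usage note are hypothetical values chosen only for illustration.

```python
SAMPLE_BYTES = 8  # complex float: two 32-bit floats


def max_sample_rate(bytes_per_second):
    """Upper bound on complex-float samples/s for a given write bandwidth."""
    return bytes_per_second // SAMPLE_BYTES


def buffer_drain_time(buffer_samples, consume_rate):
    """Seconds until a full buffer is emptied at consume_rate samples/s."""
    return buffer_samples / consume_rate


if __name__ == "__main__":
    # 14 MB/s host-to-Epiphany write bandwidth (assuming MB = 10**6 bytes).
    print(max_sample_rate(14_000_000))          # 1750000
    # Hypothetical: a 1024-sample buffer consumed at 1 Msample/s lasts ~1 ms.
    print(buffer_drain_time(1024, 1_000_000))
```

So even at the benchmark rate, the link tops out at 1.75 Msamples/s of complex float data, and with a buffer that drains in about a millisecond the host has very little slack between refills.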

More recently I decided to look at the other aspects of the boards, more specifically how I would interconnect the radio front end I intend to use: the MyriadRF.

To do so, the first step was just getting acquainted with the various bits of the hardware and rebuilding everything from scratch (FPGA bitstream, Linux kernel, FSBL, u-boot). I encountered some issues along the way, like the device tree issue that stuck the Ethernet in 1G mode or the recent u-boot version that just wouldn't actually boot. But eventually it all worked out and I could start modifying stuff. The first thing I attempted was to enable the SPI device of the Zynq PS so I could control the RFE. This turned out to take much more time than I anticipated, mostly because I wasn't familiar with all the aspects of the Zynq and made rookie mistakes (like not rebuilding the FSBL when I changed MIO assignments) and also because I forgot the first rule of hardware design: read the damn errata first! But eventually it all worked out and I have SPI access from Linux userspace so I can control the chip.

Next step is to get the samples into memory and this is what I'm working on right now. This mostly involves writing some AXI interfaces, both a slave for the various CSRs of the hardware and a master to directly DMA the samples into memory. It's probably going to take a bit of time, but hopefully I'll have it working before I get my final board in August.

What will I do next?

Well, obviously finish that RFE interface on the protoboard. I'm also looking forward to the final Parallella specs so I can design the carrier board to connect it to the radio front end. If it's still not solved by then, I might also look at that lost transaction issue or look for alternatives, like putting an AXI interconnect between the PS and the e-link interface so I can write directly from the RFE DMA unit into the Epiphany SRAM without going through the DDR at all (and without using e-link read transactions).

As you can see, my experience was a bit rough around the edges, and as one of the early users I was also faced with sometimes unclear documentation and constantly changing SDK interfaces/versions. But I still remain very hopeful for the platform and I can tell things are much better now than at the very beginning, so at least it's improving. One quibble about the SDK though is that the updates are just big "dumps" of all the changes in that update and not the nicely split, one-commit-per-change model used in most open source projects, so trying to follow the changes is not always easy.

One thing is clear though: the platform has a lot of horsepower, but exploiting it properly is far from easy. It's harder than I thought it would be (and knowing both FPGA and GPU programming, I already wasn't expecting it to be easy).

Cheers,

Sylvain

by **aolofsson** » Sun Jun 23, 2013 10:53 pm

Sylvain,

First of all, my apologies for all the issues you had with the board. It seemed like everything that could go wrong did go wrong :-( Most users would have given up after such a string of misfortunes. Thanks for your patience and tenacity over the last few months!!

Here are some updates regarding the issues you were having over the last few months:

1.) SD card: We have switched over to using an SDHC card from SanDisk and it's made a world of difference (speed and uptime). Unfortunately it's still possible to hose the RFS if you do a lot of power on/off, as I painfully discovered.

2.) The "epiphany write issue" is a high priority item. Unfortunately there are other board bring-up items that are still more urgent.

3.) For the memory size issue, we are working on something together with Embecosm that could make an immediate impact in terms of bringing in larger code. It will definitely be released before the end of the summer.

4.) Fixing the host-to-Epiphany write speed will require an AXI master interface and we will probably not get to this any time soon. In the meantime, our recommendation is to let the Epiphany cores read in buffers of data from DRAM on their own.

5.) Definitely agree with the comment about not updating the SDK repository often enough. Our long term mandate is to manage the SDK and Parallella hardware like a true open source project, with the work being incrementally pushed into mainline on a frequent (daily?) basis.

Really great news that you were able to implement the SPI in the zynq fpga logic, that is a great milestone for the Parallella platform. Can't wait to see how it hooks up to Myriad!

Your experiences and comments are all 100% accurate (and more than FAIR). It is thanks to your effort and the effort of other early users that we are now in much better shape for the next 100 boards (and then 1000's)!

Thanks,
Andreas

by **tnt** » Tue Jun 25, 2013 9:44 am

Hi Andreas,

Oh, no apologies necessary; when working with early platforms, especially one I wasn't too familiar with, some issues are to be expected. But I'd be curious about the experience of others.

wrt 3: along which lines? Something like a software mechanism to DMA code in/out, or more like putting some fast RAM (like the Zynq BRAM) right next to the e-link interface?

wrt 4: As you posted in the memory benchmark test, the read speed with "load" instructions without DMA is very slow as well.

Cheers,

Sylvain

So what have you done with your Parallella?