
Re: Minimal Vivado Project 2015.3

PostPosted: Fri Dec 18, 2015 11:12 am
by qrios
Thanks for the explanation. Until now I had no time to dive into the device tree topic.

I modified the exported devicetree.dts to use an interesting feature: you can confine the whole Linux system to one core of the Zynq. Passing "isolcpus=1" in bootargs keeps the scheduler from placing any task on the second core.

Code:
chosen {
      bootargs = "root=/dev/mmcblk0p2 rw earlyprintk rootfstype=ext4 rootwait isolcpus=1";
      linux,stdout-path = "/amba@0/serial@e0001000";
};

After booting with this parameter, the test app can be pinned to the second core by starting it with "sudo taskset -c 1 ./uio_mult_test". On my current setup it gives ~10% more performance and the runtime is more predictable. You could call it a "poor man's realtime system".
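The same pinning can also be done from inside a program instead of through taskset. The following is only a minimal sketch, assuming a Linux system; the helper names parse_isolcpus and pin_to_cpu are made up for illustration, and the parser only understands plain CPU lists and ranges such as "1" or "1-3,5".

```python
import os


def parse_isolcpus(cmdline):
    """Return the set of CPU ids named in an isolcpus= boot argument.

    Only plain lists and ranges ("1", "1,3", "1-3") are handled; any
    non-numeric entries are skipped.
    """
    cpus = set()
    for arg in cmdline.split():
        if not arg.startswith("isolcpus="):
            continue
        for part in arg[len("isolcpus="):].split(","):
            if "-" in part:
                lo, hi = part.split("-", 1)
                if lo.isdigit() and hi.isdigit():
                    cpus.update(range(int(lo), int(hi) + 1))
            elif part.isdigit():
                cpus.add(int(part))
    return cpus


def pin_to_cpu(cpu):
    """Restrict the calling process to a single CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)
```

On the board, one would read /proc/cmdline, pass its contents to parse_isolcpus, and call pin_to_cpu with one of the returned ids before starting the time-critical work.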

btw: Do you have any plans to update your project with a broader AXI implementation? Currently the bandwidth to/from memory is a real bottleneck in my project.

Re: Minimal Vivado Project 2015.3

PostPosted: Mon Dec 21, 2015 9:13 pm
by kirill
Interesting feature indeed, I'll definitely try using it at work when benchmarking, thanks for sharing.

As far as performance of the sample core goes -- it was not designed for high throughput. I do not expect it to be faster than a CPU-based implementation of the same computation. The sample core uses the AXI-Lite memory-mapped interface. It's the easiest one to get going and there are ready-made examples online from which I borrowed heavily, which is why it was chosen. This interface is good for low-volume data transfers: setting parameters, controlling low-bandwidth peripherals. For a typical accelerator design you would want to use AXI-Stream. Your custom block would have one AXI-Stream input (slave) and one AXI-Stream output (master), and maybe another AXI port for run-time configuration. This block consumes data from the input stream, transforms it in some way and writes it to the output stream. You then connect your block to DDR memory using AXI-DMA as described here: Just replace FIFO with your custom block.
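To make the stream-in/stream-out idea concrete, here is a purely behavioral software model, not HDL and not anything Vivado generates. It assumes each stream beat is a (data, last) pair, loosely mirroring the TDATA and TLAST signals of AXI-Stream; the names stream_block and packetize are hypothetical.

```python
def packetize(words):
    """Attach a `last` flag to the final word of a list (illustrative helper)."""
    for i, word in enumerate(words):
        yield word, i == len(words) - 1


def stream_block(in_stream, transform):
    """Behavioral model of a block with one stream input and one stream output.

    Each incoming (data, last) beat is transformed and passed on with its
    `last` flag unchanged, one beat at a time.
    """
    for data, last in in_stream:
        yield transform(data), last


if __name__ == "__main__":
    for beat in stream_block(packetize([1, 2, 3]), lambda x: x * 2):
        print(beat)
```

The point of the model is only the shape of the interface: the block never addresses memory itself, it just consumes beats and produces beats, which is what lets a DMA engine sit on either side of it.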

I have followed the example linked above on Parallella and tested it under Linux using a UIO driver to control the DMA block; this worked fine. I have yet to implement a custom block with a stream-in/stream-out interface. Vivado can generate a template for that kind of block, but obviously writing all the logic is up to you. It's not clear to me yet how much AXI-Stream understanding is required to write one. Another option is to use Vivado HLS; I think the trial license is 30 days.

I haven't yet found any written tutorial on creating an AXI-Stream custom block, so I'm slowly going through this video tutorial. I'm up to the middle of video 3.

Re: Minimal Vivado Project 2015.3

PostPosted: Tue Dec 22, 2015 12:13 am
by cmcconnell
kirill wrote:Another option is to use Vivado HLS, I think trial license is 30 days.

HLS is apparently now included for free in 2015.4.

Re: Minimal Vivado Project 2015.3

PostPosted: Wed Dec 23, 2015 11:30 am
by qrios

kirill wrote:I do not expect it to be faster than a CPU-based implementation of the same computation.

You are right if we only use one PL clock cycle between reading the inputs from the AXI bus and writing the results back to it. After some research I've found that there are at least 15 clock cycles on the PL side between the access frames to the DRAM. To measure this number I flip a bit in the first 32-bit word of the mapped memory from the ARM side. When the PL detects the flip it starts a counter. This counter is connected to the next memory block. If I then read this block from the ARM I get the value 15 most of the time (in about 10% of cases it is 16). The next 32-bit word needs 30 PL clock cycles.
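The ARM-side half of such a measurement could look roughly like the sketch below. The register layout is an assumption made for illustration (trigger bit in bit 0 of the first word, counter in the following word), and measure_cycles is a made-up name. On real hardware `regs` would be an mmap of the UIO device; whether struct then issues a single 32-bit bus access is not something this sketch guarantees.

```python
import struct

FLAG_OFFSET = 0     # assumed: first 32-bit word, bit 0 is the trigger
COUNTER_OFFSET = 4  # assumed: next 32-bit word holds the PL cycle count


def measure_cycles(regs):
    """Flip the trigger bit, then read back the counter word.

    `regs` is any writable buffer: an mmap of the device on the board,
    or a bytearray when trying the code without an FPGA.
    """
    (flag,) = struct.unpack_from("<I", regs, FLAG_OFFSET)
    struct.pack_into("<I", regs, FLAG_OFFSET, flag ^ 1)  # flip bit 0
    (count,) = struct.unpack_from("<I", regs, COUNTER_OFFSET)
    return count
```

With a plain bytearray the counter word of course never changes; the stand-in is only useful for checking the access pattern.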

This means that it is possible to build a pipeline of up to 15 clock cycles. Even better, we have a lot of logic available in these clocks. In my project I currently use more than 2400 signals and 30000+ signal assignments in 95 processes, all in fewer than 15 clock cycles.

Interestingly, the 15 clock cycles come from the ARM side. It seems that there is much more room to reduce the time between the memory updates. Next I'll try to use ARM assembly.

But still: thanks a lot for your work and sharing this project!

Re: Minimal Vivado Project 2015.3

PostPosted: Wed Dec 23, 2015 12:53 pm
by kirill
cmcconnell wrote:HLS is apparently now included for free in 2015.4.

This is awesome news. I just installed the Vivado HLx edition and sure enough vivado_hls is included; I haven't tried using it yet.


With the AXI-Lite interface, computation is likely to be constrained by communication bandwidth unless you are doing something really computationally intensive, like searching for hash collisions. Don't forget that the ARM also has NEON and L1/L2 caches, so you had better do a lot of computation on very little data for the FPGA to be faster in this kind of setup. Structuring your computation as a filter applied to a stream of data, and using an AXI DMA block to feed the PL logic with data from DDR without CPU involvement, is the way to get good throughput. It also frees up the CPU to do other useful work.
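A back-of-the-envelope estimate shows how tight that constraint is. It takes the roughly 15 PL cycles per 32-bit access reported earlier in the thread and assumes a 100 MHz PL clock, which is an assumption and not a figure from this thread; the function name is made up.

```python
def bytes_per_second(pl_clock_hz, cycles_per_word, bytes_per_word=4):
    """Upper bound on throughput when each word costs a fixed number of PL cycles."""
    return pl_clock_hz / cycles_per_word * bytes_per_word


PL_CLOCK_HZ = 100e6  # assumed PL clock, not taken from the thread

if __name__ == "__main__":
    mb_per_s = bytes_per_second(PL_CLOCK_HZ, 15) / 1e6
    print(f"about {mb_per_s:.1f} MB/s")
```

Under these assumptions the ceiling is roughly 26.7 MB/s, before counting any time spent on the computation itself.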

Honestly, I write these things down as a way to solidify my own understanding first, and to be able to come back to it in the future second, but of course it's nice to know that others find it useful too.

Re: Minimal Vivado Project 2015.3

PostPosted: Tue Jan 19, 2016 5:52 pm
by theover
In the pursuit of trying more complicated C programs coming from the vivado_hlx tool, I thought I would go through the tutorials from the second post in this thread and see whether the free (web) version would run all the steps required to arrive at the "minimal example" myself. Yesterday I tried making the multiplier block, and I chose the wrong language somewhere (I got VHDL code). I chose to ignore that and pressed "create ..." anyhow, but now I somehow didn't get a "hardware definition" file, so SDK doesn't start up from the completed IP project window (it does start up and allow project creation separately). I'm hoping to be able to use streaming examples from C code, even though the current C-code-to-bitstream compilation is far less efficient than what should be possible if more of the dedicated IP blocks from Vivado/ISE were used. It could be cool to compile the (small and not so fast) FFT C example and get it to work as a streaming AXI application from C code on the Parallella!

Maybe I'll take a look at those videos, too!

Theo V.

Re: Minimal Vivado Project 2015.3

PostPosted: Wed Jan 20, 2016 8:34 am
by kirill

Regarding the SDK problems: I had them too. It was a mistake in the build script, see this change: ... 84e44c2e25

You can either use the fixed version of the script or try to delete and re-create the HDL "top wrapper" for the block design inside an existing Vivado project.