Parallella Community

Posted: **Sun Nov 30, 2014 1:25 am**

If you use the exact code from my previous post, it does happen to be slower. Because the data size is always a multiple of 8 bytes, I modified it to use double-word (DWORD), 8-byte accesses. That makes it faster than the series of 1D DMAs, but the copy doesn't seem significant anymore (<10% of total runtime for this code). I believe it's getting something like 1.24 Gbps (~150 MB/s) on the off-chip copying. I'm not too concerned about it at the moment.

I didn't have any experience with 2D DMAs before. It makes a lot more sense now that I understand why a hardware designer would do it that way. A software developer thinks of strides differently than a hardware designer. Thanks again.

Posted: **Sun Nov 30, 2014 5:28 am**

The difference will be small if you're in a tight loop, since the DMA goes idle as soon as the last transfer is started and a new one takes only a few tens of cycles to start at most. The big difference with a 2D DMA (well, any DMA) is that you can be doing something else useful with the CPU while it's running, rather than having the CPU tied up in such a mundane task as copying memory.

Posted: **Tue Jan 31, 2017 5:54 am**

Posted: **Tue Jan 31, 2017 5:46 pm**

Hi Nick,

2D DMAs are a challenge to comprehend and the API isn't obvious.

The 4s are there because that is the size of a word (4 bytes) in my example. An 8-byte double word is the preferred transfer size since the network is 64-bit. However, sometimes that won't work out, as the size is data-dependent (alignment and sizes). Notzed's code assumes 4 bytes per word and aligned memory (he increments pointers to ints, which implicitly adds 4 bytes at a time). The outer stride is the big N (for the source) minus the little n (for the destination).

Most of the difficulty I had was with understanding the outer stride for the source and destination. The inner stride concept is easy. Because the inner stride of the destination was 4 and I wanted the block to copy to contiguous (linear) memory, the outer stride of the destination is also 4. This results in a dsta += 4 - 4 (nop) operation. Because the source pointer was already incremented n times in the inner loop, we have to subtract that off: src += 4*(N-n+1) - 4.
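The stride arithmetic above can be checked with a small host-side model. This is only a sketch under assumptions: `dma2d_model_t`, `dma2d_model_run` and `copy_block` are hypothetical names, not part of e-lib, and the model assumes the behaviour described in this post (the inner stride is added after every element, and `outer - inner` is added once more at the end of each row).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical software model of 2D DMA address stepping (NOT the e-lib API).
 * All strides are in bytes. */
typedef struct {
    size_t elem_size;               /* bytes per transfer, e.g. 4 for a word */
    size_t inner_count;             /* elements per row (n) */
    size_t outer_count;             /* number of rows */
    ptrdiff_t src_inner, src_outer; /* source strides */
    ptrdiff_t dst_inner, dst_outer; /* destination strides */
} dma2d_model_t;

void dma2d_model_run(const dma2d_model_t *d, void *dst, const void *src)
{
    const unsigned char *s = src;
    unsigned char *t = dst;
    ptrdiff_t soff = 0, doff = 0;   /* byte offsets from the start addresses */

    assert(d->elem_size > 0);
    for (size_t row = 0; row < d->outer_count; row++) {
        for (size_t col = 0; col < d->inner_count; col++) {
            memcpy(t + doff, s + soff, d->elem_size);
            soff += d->src_inner;   /* inner stride after every element */
            doff += d->dst_inner;
        }
        /* End of row: the last inner step is replaced by the outer stride. */
        soff += d->src_outer - d->src_inner;
        doff += d->dst_outer - d->dst_inner;
    }
}

/* Copy an n x n block of 4-byte words out of a row-major N x N matrix
 * into contiguous memory, using the stride values from the post. */
void copy_block(int32_t *dst, const int32_t *src, size_t N, size_t n)
{
    dma2d_model_t d = {
        .elem_size = 4, .inner_count = n, .outer_count = n,
        .src_inner = 4, .src_outer = 4 * (ptrdiff_t)(N - n + 1),
        .dst_inner = 4, .dst_outer = 4,   /* dst += 4 - 4: stays contiguous */
    };
    dma2d_model_run(&d, dst, src);
}
```

For a 4x4 matrix holding 0..15, calling `copy_block(out, &m[5], 4, 2)` gathers the 2x2 block starting at row 1, column 1, i.e. the elements 5, 6, 9 and 10.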

I wanted to use this to simply copy rectangular (or square) blocks. But you could probably use this generic interface to copy "parallelogram shapes of data", transpose data using a combination of negative strides, or copy just even columns. I would have to think about those some more.
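As a rough illustration of the transpose idea, here is a host-side sketch using the same stepping rule (inner stride after every element, `outer - inner` added at the end of each row). `strided_copy` and `transpose` are hypothetical helpers, not e-lib functions, and whether the real hardware accepts these stride values (the stride fields are narrow) is an assumption that would need testing on the device.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of 2D address stepping for 4-byte words (not e-lib).
 * Strides are byte counts and may be negative. */
void strided_copy(int32_t *dst, const int32_t *src,
                  size_t rows, size_t cols,
                  ptrdiff_t src_inner, ptrdiff_t src_outer,
                  ptrdiff_t dst_inner, ptrdiff_t dst_outer)
{
    ptrdiff_t s = 0, d = 0;             /* byte offsets from the start */
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            dst[d / 4] = src[s / 4];
            s += src_inner;
            d += dst_inner;
        }
        s += src_outer - src_inner;     /* outer stride replaces the last inner step */
        d += dst_outer - dst_inner;
    }
}

/* Transpose a row-major N x N matrix of 4-byte words. */
void transpose(int32_t *dst, const int32_t *src, size_t N)
{
    ptrdiff_t n = (ptrdiff_t)N;
    strided_copy(dst, src, N, N,
                 4, 4,                  /* read the source linearly */
                 4 * n,                 /* each element lands one destination row down */
                 4 - 4 * n * (n - 1));  /* negative: back to the top of the next column */
}
```

The negative destination outer stride is what walks the write pointer back from the bottom of one column to the top of the next.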

The "dma.inner_stride = 0x00010001 << shift;" is just for configuring the register that kicks off the DMA. 0x00010001 is byte alignment/transfers. 0x00080008 is double-word alignment/transfers. Using a shift operator makes this easy. You could also use other weird combinations like 0x00020004, but that gets harder to imagine.
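The shift trick can be spelled out as below. This is a sketch, assuming the 32-bit stride word holds the destination stride in the upper 16 bits and the source stride in the lower 16 bits; `pack_strides` and `unit_strides` are illustrative names, not e-lib functions.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit byte strides into one 32-bit stride word.
 * Assumption: destination stride in the upper half, source in the lower half. */
uint32_t pack_strides(uint16_t src_stride, uint16_t dst_stride)
{
    return ((uint32_t)dst_stride << 16) | src_stride;
}

/* Equal strides for both sides, sized to the transfer width.
 * shift = log2(bytes per transfer): 0 byte, 1 half-word, 2 word, 3 double word. */
uint32_t unit_strides(unsigned shift)
{
    assert(shift <= 3);
    return UINT32_C(0x00010001) << shift;
}
```

Under that assumption, `unit_strides(3)` gives 0x00080008 (double-word steps on both sides), and a mixed value such as 0x00020004 would step the source by 4 bytes and the destination by 2.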

Hope that helps.

Posted: **Wed Feb 01, 2017 2:59 am**

Thanks James,

I think I'll have to build a test case and see how it works.

I want to write another blog post so I'll try to cover the basics so that others don't have to nut it out from scratch.

nick

Posted: **Wed Feb 01, 2017 3:45 am**

That's eventually how I worked it out. Obviously, everyone has had difficulty explaining this to others. If you manage to make it simpler, or even provide code for common use cases, that would be useful. Good luck.

2D DMA
