A new blog post about broadcast strategies and benchmarking.

Moderator: dar

A new blog post about broadcast strategies and benchmarking.

Postby nickoppen » Sun Oct 09, 2016 10:24 am

Hi,

My latest blog post is about four different broadcast strategies using direct memory writes. Building on my last post on this topic I try to improve on the strategies that I've used since then.

To come to a conclusion about which is better I use the e_ctimer functions to measure execution time and the number of clock ticks that the cores have to wait due to mesh traffic.

Please find the post here:

http://nicksparallellaideas.blogspot.com/2016/10/benchmarking-broadcast-strategies.html

and comment below.

Thanks,

nick
Sharing is what makes the internet Great!
User avatar
nickoppen
 
Posts: 266
Joined: Mon Dec 17, 2012 3:21 am
Location: Sydney NSW, Australia

Re: A new blog post about broadcast strategies and benchmark

Postby jar » Mon Oct 10, 2016 2:00 pm

Nick,

Thanks for posting this. You mentioned the barrier (elib/eSDK?) dominates the time. This is a slow, linear algorithm. I've demonstrated a logarithmic time barrier that is 9x faster for all cores.
https://arxiv.org/pdf/1608.03545v1.pdf (section 3.6)

However, one should avoid collective barriers, particularly when only two cores really need synchronization in a message passing paradigm.

Also, "one to all" broadcast operations can be parallelized if each core is participating in the collective operation. The broadcast can occur in stages. For example, assuming core 0 is sending to all others:
stage 0: 0 -> 8
stage 1: 0 -> 4, 8 -> 12
stage 2: 0 -> 2, 4 -> 6, 8 -> 10, 12 -> 14
stage 3: 0 -> 1, 2 -> 3, 4 -> 5, 6 -> 7, 8 -> 9, 10 -> 11, 12 -> 13, 14 -> 15

By sending data "farthest first", it avoids some of the communication issues you discussed. For small broadcasts, the latency is overwhelmingly dominated by control flow and not network bottlenecks. Admittedly, my software design was optimized for large transfers, but it seems to do okay with small transfers. Using shmem_broadcast64 to broadcast 2 integers from one core to all 16 cores takes about 250 clock cycles (see figure 6). If I understand your figure correctly, it looks like your linear broadcast is around 500 clock cycles for 2 integers.

There is also a hardware broadcast feature that I never really explored because it's kind of broken on E3.
User avatar
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: A new blog post about broadcast strategies and benchmark

Postby nickoppen » Mon Oct 10, 2016 9:03 pm

Thanks JAR,

I think that we basically came to same overall conclusion. I'll have a read of your paper. I have not really considered a scenario where all the cores act cooperatively to share their data.

I really appreciate the feedback.

nick
Sharing is what makes the internet Great!
User avatar
nickoppen
 
Posts: 266
Joined: Mon Dec 17, 2012 3:21 am
Location: Sydney NSW, Australia


Return to OpenCL

Who is online

Users browsing this forum: No registered users and 3 guests

cron