Parallella Community

by **fdeutschmann** » Sun Sep 22, 2013 11:01 pm

Hi all,
I couldn't find this answer anywhere in the docs or the forums, so here's my question:

How do the eMesh router nodes handle message buffering / flow control (or is there a mechanism which obviates this)? If a node has multiple messages headed in the same outbound direction on a single mesh (e.g. relaying a message from the west node to the next-hop east, relaying a message from the north node to the next-hop east, and sending a message from the local core to the next-hop east, all on a given mesh), how is access to the outbound link arbitrated, and how much buffering is provided for the losing messages (or how are the losing messages flow controlled)? (I'm assuming that messages always make forward progress towards the destination and never route away to avoid traffic. Is that also a correct assumption?)

The Epiphany Architecture Reference mentions round-robin arbitration (sec 5.5) and fixed-priority arbitration (sec 6.3), but these aren't clear to me in terms of message buffering vs flow control, etc. Can anyone shed more light on this?

Also, there's a minor bug in that doc on page 25 (Rev 3.12.12.18): the column headings in Table 3 for Address-Row and Address-Column are swapped, relative to the contents of the table.

Thanks in advance
-frank

by **over9000** » Mon Sep 23, 2013 12:29 am

by **fdeutschmann** » Mon Sep 23, 2013 12:55 am

Thanks for the reply. (And yes, I did mean 'losing', as in the messages that lose the arbitration.)

Thinking about this more, and looking at the eLink description in the datasheet, I'm thinking:

1 - the routers have a 1 message buffer capacity on the inbound side of each NEWS direction, on each mesh
2 - the connected source gets flow controlled when this buffer is occupied
3 - flow control will cause an eCore stall if a load / store access to a mesh sees an occupied buffer
4 - the routers grant access to the outbound side of each directional mesh by round-robin between 5 links: [NEWS incoming and eCore link]

Correct?
Thanks.

by **hewsmike** » Mon Sep 23, 2013 5:00 am

Re priority : my understanding is that it works the way emergency vehicles ( should ) progress through traffic. Cars pull over to let the ambulance/fire-brigade run through, then everybody rejoins the flow, ie. back to 'normal', where that means a traffic cop serves all intersection approaches in turn. So 'pulling over' and 'rejoining flow' imply buffers are involved.

Hence if you have two/more fire-brigades wanting to cross the same intersection at the same time, then there is still round-robin resolution as it applies to any equal priority traffic. When all the fast movers pass through the ordinary traffic is serviced round-robin too. Rinse, repeat.

In the absence of higher priority traffic transiting a core's mesh node, then data wanting to enter the mesh is dealt with as per any data arriving at the node. That is : is there contention for a given routing direction out of the node ? One could conceive of four data packets all wanting to go North say, one each from East/South/West and one generated at the node to be launched into the flow. Of course that then implies that any delay upon clearing, say, East-side of one node has an implication for directing traffic to the West from an immediate East side node. Mutatis-mutandis through any chain of backlog.

[ Note that if there is no contention for which direction to leave a node, then there is nothing to resolve. You still only get one quantum of traffic from each of NEWS and the node itself. In that sense a mesh node is 5-way : 4 'through' and one 'dead-end'. ]

I guess the saving grace is that all traffic on a given mesh subset ( rMesh or cMesh or xMesh ) is handled independently of the other two. Add to that the protocol of routing decisions serving progress along one array axis before routing decisions affecting the orthogonal axis. Hence you eliminate read/write contentions b/w a given pair of cores as eg. read requests going along two sides of a quadrilateral are returned as write data along the other two sides of that same quadrilateral. Thus you can't have a read/write from coreA to coreB contesting with a write/read from coreB to coreA. So while the timing is not deterministic, ultimate delivery is still assured.

Cheers, Mike.

by **fdeutschmann** » Mon Sep 23, 2013 2:30 pm

Thanks for trying to be helpful, but it seems to me that you (and I) are making a lot of assumptions on areas where the documentation is not at all clear. I would really like to understand the actual operation better, particularly the buffering (if any) and the details of the round-robin / priority arbitration scheme.

This isn't just idle speculation: understanding these issues is key to working out placement rules for assigning tasks to eCores, and understanding how to write code that can operate in deterministic time.

Thanks
-frank

by **hewsmike** » Mon Sep 23, 2013 11:19 pm

by **notzed** » Tue Sep 24, 2013 5:36 am

Your 4 point list looks right to me.

There's no way that you can get 'outgoing' conflicts because there is only one 'incoming' source considered at a given time for each outgoing port and bus.

There is no buffering (>1) or priority handling in the e-mesh network apart from the implicit priority of having 3 separate networks.

The priority arbitration is part of the e-core network interface. Both for outgoing (section 6.1.5) and incoming (section 6.3) traffic.

This makes it very simple, deterministic and deadlock free ... but it has some strange characteristics that are exposed to the programmer.

e.g. 4 cores in a row (only) doing an external DMA at the same time will saturate the rmesh along that row, and the effective bandwidth of each will be 1/2, 1/4, 1/8, 1/8 rather than 1/4 each (whilst it is busy, since they won't finish at the same time). Each network node apart from the last one alternates between the local core and the incoming link, so the bandwidth is halved at every hop you need to traverse. But 4 cores in a column will get an equal share. If the 4 in the same row use software interlocks to take turns then they would get even bandwidth.

All 3 cases are 'deterministic', but only 2 are 'fair' as one would expect and only one will finish at the same time. So this is probably what's meant by using software to give a particular behaviour.

(This is based on some code testing on the device and trying to fit the observed results with the docs, but I'd like to know if I'm wrong on this.)

It's not clear from the manual as to whether the e-core will block if sending a transaction in a "clear" direction whilst blocking on another one on the same network. Unfortunately from the way it hooks onto the e-mesh router it looks like it would (e.g. figure 2).
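
A minimal sketch of the arithmetic behind the 1/2, 1/4, 1/8, 1/8 split described above. It assumes every core streams continuously and that each node alternates strictly between its local core and the single upstream link; `row_shares` is a made-up helper name, and this is a model of the observed behaviour, not something taken from the Epiphany docs.

```python
from fractions import Fraction

def row_shares(n_cores):
    """Share of the row's exit link for each core, nearest the exit first.

    Assumption: each node gives alternate slots to its local core and to
    the link coming from the next node further away from the exit.
    """
    shares = []
    remaining = Fraction(1)  # bandwidth of the link leaving the current node
    for i in range(n_cores):
        if i == n_cores - 1:
            # The farthest node has no upstream traffic to alternate with,
            # so its core gets everything that is left.
            shares.append(remaining)
        else:
            shares.append(remaining / 2)  # local core wins every other slot
            remaining /= 2                # the upstream link gets the rest
    return shares

print([str(s) for s in row_shares(4)])  # ['1/2', '1/4', '1/8', '1/8']
```

The shares always sum to 1, which matches the row being saturated while all four transfers are in flight.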

by **aolofsson** » Tue Sep 24, 2013 5:38 pm

Everyone,

Great discussion and I will try to feedback some of the questions into the manual!

More eMesh design details (much of this was already correctly guessed in the thread, but I figure it can't hurt to repeat it):

Arbitration:
On every outgoing link (south, east, west, north) there is basically a 4-to-1 mux whose inputs (one of which carries the traffic from the core at that node) compete for the outgoing link. The arbitration scheme is round robin. Fixed priority arbitration refers to everything that happens inside the eCore or at the interface between the eCore and the eMesh.
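
A rough sketch of a 4-input round-robin arbiter for one outgoing link. The class name and the rotation rule (the winner drops to lowest priority for the next cycle) are assumptions for illustration; the post only says the scheme is round robin, not how the hardware rotates.

```python
class RoundRobinArbiter:
    """Grants one requester per cycle, rotating priority after each grant."""

    def __init__(self, sources):
        self.sources = list(sources)
        self.next_idx = 0  # position that gets first look on the next cycle

    def grant(self, requesting):
        """Return the winning source for this cycle, or None if nobody asks."""
        n = len(self.sources)
        for offset in range(n):
            idx = (self.next_idx + offset) % n
            if self.sources[idx] in requesting:
                # Assumed policy: the winner becomes lowest priority next time.
                self.next_idx = (idx + 1) % n
                return self.sources[idx]
        return None

# East-going link: the inputs are the local core and the three other sides.
east_out = RoundRobinArbiter(["core", "west", "north", "south"])
requests = {"core", "west", "north", "south"}
winners = [east_out.grant(requests) for _ in range(8)]
print(winners)
```

With all four inputs requesting on every cycle, each one is granted the link once every four cycles; the inputs that do not win on a given cycle are the ones that get held off.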

Routing:
Packets always travel forward, along the row first and then in the column direction.
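
A small sketch of the row-first rule, assuming cores are addressed as (row, column) with columns increasing to the east and rows increasing to the south. The function names and the coordinate convention are assumptions, not taken from the manual.

```python
def next_hop(cur, dst):
    """Direction a packet at node `cur` takes towards `dst`, both (row, col)."""
    (row, col), (dst_row, dst_col) = cur, dst
    if col != dst_col:
        # Travel along the row first, i.e. fix the column coordinate.
        return "east" if dst_col > col else "west"
    if row != dst_row:
        # Only then travel in the column direction.
        return "south" if dst_row > row else "north"
    return "local"

def path(src, dst):
    """List of nodes visited from src to dst under the rule above."""
    step = {"east": (0, 1), "west": (0, -1), "south": (1, 0), "north": (-1, 0)}
    hops = [src]
    cur = src
    while True:
        direction = next_hop(cur, dst)
        if direction == "local":
            return hops
        d_row, d_col = step[direction]
        cur = (cur[0] + d_row, cur[1] + d_col)
        hops.append(cur)

print(path((0, 0), (2, 3)))  # request direction
print(path((2, 3), (0, 0)))  # reply direction
```

Under these assumptions the request and the reply share only their two end nodes, so they travel along opposite sides of the rectangle between the two cores, which is the point made earlier in the thread.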

Buffering:
There is a register at each output and each input to the mesh node. Each node is clocked on the opposite clock edge of the nodes around it.

Flow Control:
There are "wait" signals going in each direction that push back at the same pace that the buffers fill up, so there are never lost packets. The flow control propagates throughout the network and of course extends all the way back to any master generating traffic on the eMesh (including the off-chip links, DMAs, LDR/STR)
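
A toy model of the push-back behaviour, assuming a straight chain of single-entry registers and a sink that is only sometimes ready. It ignores the alternating clock edges and everything else about the real timing; `run_chain` and its parameters are hypothetical names used only for this sketch.

```python
def run_chain(n_stages, packets, sink_ready):
    """Push packets through a chain of one-entry registers with backpressure.

    sink_ready(cycle) -> bool says whether the far end accepts data that cycle.
    Returns (delivered_packets, cycles_taken).
    """
    stages = [None] * n_stages  # None means the register is empty
    pending = list(packets)
    delivered = []
    cycle = 0
    while pending or any(s is not None for s in stages):
        # Drain from the sink end first so a freed register can be refilled
        # by the stage behind it in the same cycle.
        if stages[-1] is not None and sink_ready(cycle):
            delivered.append(stages[-1])
            stages[-1] = None
        for i in range(n_stages - 1, 0, -1):
            if stages[i] is None and stages[i - 1] is not None:
                stages[i] = stages[i - 1]
                stages[i - 1] = None
        # The source effectively sees "wait" while the first register is full.
        if pending and stages[0] is None:
            stages[0] = pending.pop(0)
        cycle += 1
    return delivered, cycle

fast, fast_cycles = run_chain(3, range(10), lambda c: True)
slow, slow_cycles = run_chain(3, range(10), lambda c: c % 3 == 0)
print(fast == slow == list(range(10)), fast_cycles < slow_cycles)
```

In this model a slow sink only stretches the time taken: the stall works its way back to the source, and every packet still arrives, in order.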

Mesh Interaction:
Note that there are three completely independent meshes (read requests, on-chip write, and off-chip write). These meshes interact only in one place, at the "network interface" between the eMesh and the eCore. The arbitration priority should be described in the priority table 6.3?

Deterministic Latencies:
The intent of the sentence on page 27 was to explain that if the application can guarantee that there are never any round robin arbitration losers, then you would have deterministic delays and bandwidths. For some examples of latency and bandwidth tests that we have run, see:
https://github.com/adapteva/epiphany-examples

Blocking:
The mesh network is truly bidirectional and only "blocks" in the direction of the "losers" in the round robin arbitration, i.e. if core, west, north, and south all want to exit to the east and core wins the round robin arbitration, then west, north, and south will be told to wait.

Off-chip links:
As notzed pointed out, the arbitration is such that the bandwidth allocation becomes unfair in some scenarios. Something that may not be clear is that the current implementation of the eLink only reaches peak bandwidth when there is a stream of write transactions with addresses incrementing by 8 (i.e. 0x0, 0x8, 0x10, etc.). If there is random interleaving of transactions, the effective off-chip bandwidth drops to less than 1/3 of peak. For off-chip access, it's definitely recommended to use some kind of token mechanism (for now) to get best performance. This would take care of the unfair arbitration mechanism as well.
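
A sketch of why a token scheme helps the address pattern, assuming four cores that each write a contiguous buffer of 8-byte words. The base addresses are made up, and the word-by-word interleaving is a simplification of what the arbitration would actually produce; no bandwidth figures are implied.

```python
def interleaved(bases, n_words):
    """Transactions at the link when the cores alternate word by word."""
    return [base + 8 * i for i in range(n_words) for base in bases]

def token_passing(bases, n_words):
    """Each core writes its whole burst while it holds the token."""
    return [base + 8 * i for base in bases for i in range(n_words)]

def sequential_fraction(addrs):
    """Fraction of transactions whose address is the previous one plus 8."""
    hits = sum(1 for a, b in zip(addrs, addrs[1:]) if b == a + 8)
    return hits / (len(addrs) - 1)

# Hypothetical destination buffers, one per core, 0x1000 bytes apart.
bases = [0x0000, 0x1000, 0x2000, 0x3000]

print(sequential_fraction(interleaved(bases, 64)))    # 0.0
print(sequential_fraction(token_passing(bases, 64)))  # 252/255, about 0.988
```

With the token, the link sees one long run of +8 addresses per core and only breaks the pattern when the token changes hands.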

Andreas

by **tnt** » Fri Sep 27, 2013 8:36 pm

Also :
https://www.google.com/patents/US20100111088

by **fdeutschmann** » Thu Oct 03, 2013 9:55 pm

This is all fantastic; thanks very much for the detailed replies and info.
-frank

eMesh buffering?
