
mutex performance

Discussion about Parallella (and Epiphany) Software Development

Moderators: amylaar, jeremybennett, simoncook

mutex performance

Postby GreggChandler » Mon Feb 20, 2017 5:21 pm

Is it to be expected that "close" cores receive preferential performance when accessing (attempting to lock) a mutex?

I modified the sample "mutex_test" application found here: https://github.com/adapteva/epiphany-ex ... /cpu/mutex to use "e_mutex_trylock" rather than "e_mutex_lock" in "e_mutex_test", and spun on it, incrementing a local counter until the lock was acquired. I also modified the core application (e_mutex_test) to consume a random number of cycles after the lock was acquired and before it was released. Afterwards, I read the spin counts from the driving harness (mutex_test). To my surprise, one of the "close" cores always appears to acquire the lock first across multiple runs of the application. This looks statistically significant. I could modify the app more extensively to histogram the results over thousands of executions, but if my initial observation is correct, then in an application where multiple cores contend for the mutex, cores that are farther away could be starved out.
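
In outline, the modified core loop looked something like the sketch below (a reconstruction rather than the exact code, assuming e-lib's convention that e_mutex_trylock returns zero once the lock is acquired; the address used to publish the count is arbitrary):

[code]
#include <e_lib.h>

e_mutex_t m;        /* same local address in every core's binary; the
                       lock actually used is the copy on core (0,0) */

int main(void)
{
    unsigned spins = 0;

    while (e_mutex_trylock(0, 0, &m) != 0)   /* non-zero: not acquired */
        spins++;

    /* ... consume a pseudo-random number of cycles while holding it ... */

    e_mutex_unlock(0, 0, &m);

    /* publish the spin count at a fixed local address for the host
       harness to read back (the address is arbitrary here) */
    *(volatile unsigned *)0x6000 = spins;

    return 0;
}
[/code]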

Is this the expected behavior for your mutex implementation? If so, it should be documented, since there appears to be no way to prevent the starvation in a sufficiently parallelized application demanding high-frequency access to a mutex.

Re: mutex performance

Postby GreggChandler » Tue Feb 21, 2017 5:12 am


Re: mutex performance

Postby jar » Wed Feb 22, 2017 4:24 am

Gregg,

Your results aren't surprising. The mutex routines are essentially thin wrappers around the testset instruction, which must propagate through the network and then return to the originating core, so a load imbalance is expected. If you need a "fair" or ordered locking method, it can be implemented in software. That is the solution, but you pay a price in performance.
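
Roughly, such a wrapper amounts to the sketch below (from memory, not the e-lib source): TESTSET atomically writes a non-zero value to a word only if that word is currently zero, and returns the word's previous value, so the caller spins until it reads zero.

[code]
/* Sketch only, not the e-lib source.  testset_word() stands in for the
   TESTSET instruction applied to a word on the core that owns the lock:
   it atomically writes 'val' there only if the word is currently zero,
   and returns the word's previous value. */
extern unsigned testset_word(volatile unsigned *global_lock, unsigned val);

void spin_lock(volatile unsigned *global_lock, unsigned my_nonzero_id)
{
    while (testset_word(global_lock, my_nonzero_id) != 0)
        ;                            /* every retry is a mesh round trip */
}

void spin_unlock(volatile unsigned *global_lock)
{
    *global_lock = 0;                /* plain store releases the lock */
}
[/code]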

Regarding your comment in the other thread, the eSDK barrier isn't well optimized for performance. A dissemination barrier has been shown to be significantly faster. Additionally, there is a hardware WAND barrier if you're using all of the cores.

Re: mutex performance

Postby GreggChandler » Wed Feb 22, 2017 3:48 pm

jar, thank you for your response.

I have to think about this more, but I would expect that most, if not all, mutex implementations for SMP processors such as a multicore x86 are built around some type of atomic conditional read/modify/write instruction. In the case of the Epiphany, as the number of cores increases, the "fairness" of the mutex implementation can only get worse, i.e. it won't scale well as the range of distances to the mutex core grows. Furthermore, I would imagine that the location of the mutex core is another important factor, i.e. the center of the grid versus the grid origin as in the example program provided by Adapteva. (In my results, wrapping around the end, core [0,3] appears to be the most favored position.)

As these results appear to be consequences of fundamental architectural features/constraints of the Epiphany, it is not clear how any amount of software scaffolding around the test/set instruction could compensate for the design. From my admittedly inexperienced familiarity with the architecture, I would expect that moving the mutex to external memory would not reduce the bias that I measured, and would only serve to make everything much slower. Does each core access external memory with the same penalty, or do those closer to the edge of the chip see faster response times? (I seem to remember reading somewhere that external memory hangs off one of the edges of the chip.)

As I wrote previously, if this is the expected behavior of the eSDK, it should be documented. I don't believe that any of the other mutexes I have used (Unix, Linux, Microsoft Windows, Mac) exhibit such behavior. (Admittedly those are operating systems, not bare iron, and mostly running on x86, although the eSDK is not bare iron either.) Perhaps I need to retest them, however.

As you appear to have correctly anticipated, my next experiment was going to be integration with barriers, although perhaps you have saved me the time!

Re: mutex performance

Postby jar » Thu Feb 23, 2017 5:22 am

Gregg,

Epiphany, in this context, is not an SMP processor, so we shouldn't expect that fairness. You can implement a bakery algorithm to introduce fairness into your locking scheme. External memory does indeed have worse latency, and access to it is not symmetric.
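
In outline, the bakery idea looks like the sketch below (illustrative, not library code): each contender takes a ticket one higher than any it can see, and ties are broken by core index. It assumes the arrays live where every contender can see them and that writes become visible in order, which is exactly where Epiphany's weak memory ordering needs care.

[code]
#define NCORES 16   /* illustrative: one slot per contender */

/* Both arrays must be zero-initialized and visible to every contender,
   e.g. placed in external shared memory. */
volatile int choosing[NCORES];
volatile int number[NCORES];

void bakery_lock(int me)            /* me: this core's index, 0..NCORES-1 */
{
    int j, max = 0;

    choosing[me] = 1;               /* announce that we are taking a ticket */
    for (j = 0; j < NCORES; j++)
        if (number[j] > max)
            max = number[j];
    number[me] = max + 1;           /* one higher than any ticket we saw */
    choosing[me] = 0;

    for (j = 0; j < NCORES; j++) {
        while (choosing[j])         /* wait while j is mid-ticket-draw */
            ;
        while (number[j] != 0 &&    /* lower (ticket, index) pair goes first */
               (number[j] < number[me] ||
                (number[j] == number[me] && j < me)))
            ;
    }
}

void bakery_unlock(int me)
{
    number[me] = 0;                 /* retire our ticket */
}
[/code]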

I have not run into a fairness issue with the testset locking scheme in practice. I would be interested in any applications where it is problematic.

Regarding barriers, feel free to peruse some of the OpenSHMEM code written for them:
https://github.com/USArmyResearchLab/op ... rier_all.c
https://github.com/USArmyResearchLab/op ... _barrier.c
https://github.com/USArmyResearchLab/op ... rier_all.c

The routines have been optimized for performance. The WAND barrier requires inline assembly. The software dissemination (butterfly) barrier is "loud" in the sense that it creates more network traffic than a tree-based barrier would (not shown). There are a few algorithms for barriers; if you find one that works better on Epiphany than the code above, I am interested in hearing about it.
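
For orientation, the dissemination idea reduces to the following sketch (idealized, not the OpenSHMEM code above): in round r, each of n cores signals the core 2^r ahead of it and waits on the core 2^r behind it, completing in log2(n) rounds. On Epiphany, each core's flag slots would live in its own local memory, with partners writing into them through the mesh.

[code]
#define NCORES 16
#define ROUNDS 4                    /* log2(NCORES) */

/* flags[c][r]: the slot core c waits on in round r.  Idealized here as
   one shared block; on Epiphany each core's row would sit in its own
   local memory, written remotely by its partner.  Zero-initialized. */
volatile int flags[NCORES][ROUNDS];

void dissemination_barrier(int me)
{
    static int phase = 0;           /* per-core counter (each core runs its
                                       own binary, so 'static' is private
                                       to that core) */
    int r;

    phase++;
    for (r = 0; r < ROUNDS; r++) {
        int partner = (me + (1 << r)) % NCORES;
        flags[partner][r] = phase;          /* signal the core 2^r ahead  */
        while (flags[me][r] < phase)        /* wait on the core 2^r behind */
            ;
    }
}
[/code]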

Re: mutex performance

Postby GreggChandler » Wed Mar 22, 2017 5:38 pm

I apologize for the tardiness of this reply. I have been recovering from surgery. The Parallella was a distraction during my surgery preparation.

I looked up Lamport's bakery algorithm. My reading indicated it was far from a "fair" algorithm. In fact, it appears to be a highly prioritized algorithm: https://en.wikipedia.org/wiki/Lamport%2 ... _algorithm

Nonetheless, I coded it up with the data in external shared memory. Although it appears to work well, and may let me include the host processor in the exclusion, I have not exhaustively tested it yet. It is very prioritized (as suggested in the above link)--unless I have a bug. As with the eSDK algorithm, it won't scale well either--at least when there is significant contention for the mutex (think thousands or millions of cores). I suspect that the real answer I am looking for would require additional transistors between the cores rather than relying on the inter-core communication already present--and that probably won't happen.

I also need to think through the Parallella's weak memory ordering to see if it preserves the integrity of Lamport's original algorithm and his proof: http://lamport.azurewebsites.net/pubs/bakery.pdf A number of others appear to have tweaked the original in various ways to address the unbounded arithmetic; perhaps one of them fixed the "fairness"? Perhaps one of them generalizes beyond the memory ordering? Perhaps memory ordering is not really significant? All interesting questions.

Additionally, I modified my original program to support moving the location of the eSDK mutex between the various cores. As before, I was surprised by the results. I would have expected that the "closest" cores would have most favored access to the mutex. That does not appear to always be the case. There appears to be something in the interprocessor routing/access algorithm that I don't understand yet!

As for applications where the "fairness" is problematic, I am approaching the machine more as a multi-core general-purpose processor than as a dedicated co-processor or GPU. In the past, I have licensed software to the government (including the U.S. Army) and other organizations that was used to test large systems. Historically, that code ran, at times, on hundreds of processors (full boxed computers) used to load a server or test an application; the hundreds of boxes typically emulated thousands of simultaneous users. The coordination among those boxes implemented exclusion, barriers, and shared memory--although we did not use most of those terms, and most of the code was custom-written from scratch by one of my customers. In fact, one installation used a separate LAN to isolate the generated network traffic from the application traffic to ensure fairness. If the exclusion algorithms were unfair, they could bias the test results, since not all processors were executing identical code/scripts or testing identical parts of the system/application.

I am exploring replacing hundreds of boxes with hundreds/16 Parallellas. A significant problem, however, is tying each of the Parallella's cores to the network. Taking all of the data through the ARM would be a bottleneck; putting additional NICs in the FPGA might help.

Re: mutex performance

Postby jar » Thu Mar 23, 2017 2:11 am

Gregg,

No worries on the delay. I hope you're feeling better.

I discussed applications where locking fairness is problematic with a colleague; he said he has experienced it in some applications, but rarely, and that was with SMP x86. The problem isn't unique to Epiphany, and I'm optimistic that clever software solutions are possible. I can appreciate the concerns about scaling to thousands and millions of cores; you may be right that some extra transistors would be needed at that scale.

If you're willing to share, I am interested in taking a look at your Lamport implementation with the external shared memory. Your decision to use external shared memory is curious to me, as the ARM host can access the Epiphany core memory.

If you've run code on Army clusters, we probably know some of the same people.

Take care.

Re: mutex performance

Postby GreggChandler » Fri Mar 24, 2017 12:43 pm

The design of my Lamport bakery implementation should run in either core memory or external shared memory. I put a lightweight C++ class around a pointer to the data to be shared. The code executing on the host and on the core(s) is identical, with different initialized pointer values; thus my mutex can be locked from host or core code. I similarly implemented my own barrier that spans host and cores--although I have not tested the host side of the barrier yet. This design allows me to initialize my data structures on the host before the workgroup is started, with the initialization isolated to class member functions. The resulting applications exhibit a certain elegance and simplicity. I also wrote my own version of the "e-read" utility, which lets me inspect/debug my code/data from a host thread distinct from the driving host application.
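
Stripped of the C++ wrapping, the pattern is essentially the sketch below (hypothetical names; the attached code is the real version): host and core compile identical functions, and only the base pointer handed to the attach call differs.

[code]
/* Hypothetical sketch of the "same code, different base pointer" idea.
   On a core, 'base' would be the global address of the shared region;
   on the host, the address of its mapping of that same region. */
typedef struct {
    volatile int *choosing;   /* bakery "choosing" flags, one per contender */
    volatile int *number;     /* bakery ticket numbers, one per contender   */
} shm_mutex_t;

void shm_mutex_attach(shm_mutex_t *m, volatile int *base, int ncontenders)
{
    m->choosing = base;                 /* first ncontenders ints */
    m->number   = base + ncontenders;   /* next ncontenders ints  */
}
[/code]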

I chose to put my data in external shared memory rather than core memory because of my understanding of the system architecture as documented in the eSDK and architecture manuals I have access to. I have not read the eSDK source to determine how the loader works, but from the documentation it does not appear that the host can directly map core memory the way it can external memory. Using the e_read() and e_write() functions to access core memory, while possible, is less than elegant: I could copy a block of data to the host, access it, and copy it back--but without a mutex, that would be problematic (grin). I presume that the loader and the other eSDK functions that access core registers do so through the mesh network protocol via the FPGA and the link protocol, but I have not studied this yet. I could possibly write some assembly code to make direct access happen, but I am not that far yet.
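
For reference, the host-side pattern in question looks roughly like this under my reading of the eHAL API (offset and size illustrative): e_alloc maps a window of the external shared region, and e_read/e_write against that handle move data without touching core memory.

[code]
#include <stdio.h>
#include <e-hal.h>

int main(void)
{
    e_mem_t  mbuf;
    unsigned value;

    e_init(NULL);                                   /* default platform */
    e_alloc(&mbuf, 0x01000000, sizeof(unsigned));   /* window at an offset
                                                       into shared DRAM */
    e_read(&mbuf, 0, 0, 0, &value, sizeof(value));  /* row/col are unused
                                                       for an e_mem_t */
    printf("shared word = %u\n", value);
    e_free(&mbuf);
    e_finalize();
    return 0;
}
[/code]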

To test my code, I implemented a third shared-memory data structure that supports formatted printing from the cores through the host to the console. I guarded access to the data structures with my mutex from both the host and the core processors. I also created lightweight C++ classes layered over the eSDK definitions, to isolate me from eSDK changes and relieve the tedium of much of the code I found myself writing: one class wraps the eHAL functions, the other the eLib functions. Lastly, I implemented a core application that executes random waits to simulate work and prints core data/numbers interspersed with barriers to see it all work. It was thorough enough to flush out a bug or two in both the barrier and the mutex code; however, I would not claim they are exhaustively tested. The resulting host and core code does not look like any of the sample Epiphany applications I have read--although I have only read a few--but it does let me reuse algorithms and code more easily.

As for my code, I pretty much followed Lamport's pseudo-code as detailed in the link to his ACM article posted previously. It is not finished, and I have not analyzed performance yet, but it does appear to work. The downside of the algorithm is that thread number, or in my case core number, determines priority in locking the mutex. One can re-configure the priorities; however, as discovered by Lamport, thread/core 1/0 will always have priority over 16/15. (His algorithm indexes from one rather than zero.) The good news is that it is predictable; the bad news is that it is not fair. To implement fairness, one would probably need to add at least a primitive scheduler, as an operating system would. The implication of the prioritization is that, regardless of the order in which lock requests arrive, thread/core 1/0 will always win if it is waiting for the mutex.

P.S. My initial implementation dynamically allocated the buffers based upon the number of cores in the system, however, I found that executing a "new" or "malloc" brought in way too much library code and data. I am exploring alternatives.

Attachments:
e-test.cpp -- core application
bakery.h -- Lamport's Bakery Header
bakery.cpp -- Lamport's Bakery Implementation

Re: mutex performance

Postby GreggChandler » Sat Mar 25, 2017 9:08 pm

Attachments:
result1.txt

