by piotr5 » Sat Jun 13, 2015 9:59 pm
So, as I understand it, the only job of this Epiphany MMU in the FPGA is to translate a handful of consecutive e-core positions into actual physical memory. Well, my request here wasn't about the memory stuff; I was rather trying to spark some inspiration for a new Parallella board. Let's call it the scaleable Parallella.
The idea is: it would have only a network connector for debugging, no USB or HDMI, maybe not even an SD card (I'm thinking of net-boot, or booting from another Epiphany). There would still be a Zynq and RAM, but the memory structure would differ slightly between Parallella and Epiphany. The scaleable Parallella must have a third eLink connector, and this one should feature a counter-clockwise rotation of the address space. That is, a message heading west travels towards lower column numbers; as it exits the Epiphany, the destination address must be rotated around the coordinates of the chip it just left. A message that previously went west will then appear to go south, and conversely for messages travelling in the other direction. The coordinates of the Epiphany it just left stay roughly the same, although they too are rotated. This way, something addressed far off to the west keeps moving towards the south, and a message for something to the northwest will again leave through the west eLink of the next Parallella. The scaleable connector from the other thread could then keep the current Parallella in a fixed position, or maybe this rotation should be built in there too, if such a connector turns out to be too difficult to implement cheaply.
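To make the rotation concrete, here is a minimal C sketch of what I mean by rotating the destination on a west exit. The macros, the function name and the pivot handling are only my illustration, assuming the core ID sits in the top 12 address bits (6 bits row, 6 bits column); the real logic would of course live in the FPGA:

[code]
/* Minimal sketch of the proposed counter-clockwise rotation applied to a
 * destination address as it leaves through the west eLink.  Names and the
 * pivot handling are illustrative only; re-basing the coordinates on the
 * neighbouring board is omitted. */
#include <stdint.h>

#define ROW(addr)  (((addr) >> 26) & 0x3Fu)   /* core row, bits [31:26]    */
#define COL(addr)  (((addr) >> 20) & 0x3Fu)   /* core column, bits [25:20] */
#define MKADDR(r, c, off) \
    ((((uint32_t)(r) & 0x3Fu) << 26) | (((uint32_t)(c) & 0x3Fu) << 20) | ((off) & 0xFFFFFu))

/* Rotate the destination 90 degrees counter-clockwise around the chip the
 * message is leaving (pivot_r, pivot_c): west becomes south, north becomes
 * west, so "far to the west" keeps drifting towards the south. */
static uint32_t rotate_west_exit(uint32_t dest, int pivot_r, int pivot_c)
{
    int dr = (int)ROW(dest) - pivot_r;   /* positive = south of the pivot */
    int dc = (int)COL(dest) - pivot_c;   /* positive = east of the pivot  */
    int new_dr = -dc;                    /* west (dc < 0) -> south        */
    int new_dc =  dr;                    /* north (dr < 0) -> west        */
    return MKADDR(pivot_r + new_dr, pivot_c + new_dc, dest & 0xFFFFFu);
}
[/code]

Messages travelling in the other direction would get the inverse rotation applied on entry, as described above.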
The use case is again my tree-like cyclic network, but with an ordinary Parallella desktop or embedded in the centre. On the software side, the FPGA should map the current Epiphany into local RAM, and the neighbouring ones into the same space but with an offset -- hopefully the next generations of Epiphany will use only 256K or less. The RAM shadowed by such an assignment can then be used as cache memory for the 1G of RAM on the neighbouring Parallellas. Since each Epiphany core can only transfer as much as it has local memory (32K currently), this will be the size of each cache page being pre-buffered. One such page probably must be sacrificed for storing each page's address. The goal is to have all three neighbours and the board's own memory as a flat 4G virtual memory, where the current Parallella and a neighbour agree on the 2G the two of them store. Even the Epiphany chips of the neighbours are visible in that 4G, so that you can initiate a DMA transfer from the neighbour's neighbour's neighbour. Writes to the cache are dispatched immediately, and reads happen when the memory is actually requested. But then, maybe that part of the memory logic should be configurable at runtime...
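Just to illustrate the bookkeeping, here is a rough C sketch of the 32K cache pages with one page of the window given up for the tags. The window address, the number of slots and the dma_fetch() helper are all hypothetical, and the mapping itself would really be done by the FPGA; this only shows the arithmetic:

[code]
/* Hypothetical direct-mapped cache over a neighbour's DRAM, with 32K pages
 * matching an Epiphany core's local memory.  Write-through is implied by
 * the "writes are dispatched immediately" rule and not shown here. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   (32u * 1024u)            /* one core's local memory        */
#define NUM_SLOTS   255u                     /* 256-page window, one page kept
                                                aside for the tag table        */
#define CACHE_BASE  ((uint8_t *)0x30000000u) /* hypothetical shadowed window   */

static uint32_t page_tag[NUM_SLOTS];         /* which remote page each slot holds */

extern void dma_fetch(uint8_t *dst, uint32_t remote_addr, uint32_t len); /* assumed helper */

static void cache_init(void)
{
    memset(page_tag, 0xFF, sizeof page_tag); /* mark every slot as empty */
}

/* Return a local pointer for a remote address, filling the slot on a miss. */
static uint8_t *cache_lookup(uint32_t remote_addr)
{
    uint32_t page = remote_addr / PAGE_SIZE;
    uint32_t slot = page % NUM_SLOTS;        /* direct-mapped for simplicity */
    if (page_tag[slot] != page) {
        dma_fetch(CACHE_BASE + slot * PAGE_SIZE, page * PAGE_SIZE, PAGE_SIZE);
        page_tag[slot] = page;
    }
    return CACHE_BASE + slot * PAGE_SIZE + (remote_addr % PAGE_SIZE);
}
[/code]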
An interesting addition to such a design would be direct links between one FPGA and the FPGAs of the three neighbours, i.e. somewhere you could plug in three connectors. With HDMI and USB gone, this should be quite a good connection, maybe even better suited for feeding the cache.
I still need to think a bit about this design, but I guess the west connector would be used for the root direction of the tree. That is, to send a message down the tree towards the leaves, you go north or south; there are 8 Parallellas in each of those directions. When the message goes back to the root, you'd send it northwest at each hop. After all, the USB and the main SD card are at the root. But maybe there is some design which offers communication with fewer hops. After all, you can reach a total of 8x8 = 64 Parallella boards just by going northwest, plus the 3x8 = 24 boards accessible to the north, south and west; counting the current board, that's 89 Parallellas living on the fast lane...
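Expressed as code, the per-hop decision I have in mind might look roughly like this; the enum and the sign conventions are just my own sketch of the scheme above, not a fixed routing spec:

[code]
/* Rough sketch of the per-hop routing decision: boards are addressed by
 * their (row, column) offset from the current board, and each hop picks
 * one of the three eLinks.  With the west-exit rotation from the earlier
 * sketch, repeatedly leaving west corresponds to the "northwest at each
 * hop" path back towards the root. */
typedef enum { LINK_NORTH, LINK_SOUTH, LINK_WEST } elink_t;

static elink_t pick_link(int drow, int dcol)
{
    if (dcol < 0)                        /* anything to the west, incl. the root */
        return LINK_WEST;
    return (drow < 0) ? LINK_NORTH : LINK_SOUTH;  /* down the tree, to the leaves */
}
[/code]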
I should emphasize that in my design the memory, as seen from the Epiphany's point of view, is: 1.875G to the east, 900M to the south-west (located on the western neighbour), and Epiphany cores to the south, north, west and northwest. Seen from the ARM, nearly the whole 4G of main memory is available, from this Parallella and its neighbours, but only 10x16 Epiphany cores are visible, those on the neighbouring Epiphany chips and the current one. Therefore my design makes sense once you have more than 10 scaleable Parallellas. Maybe replacing the FPGA with something fixed could reduce production costs, so that people can actually afford those 89 Parallellas for which it scales up nicely?
The next step after the design is actually implementing my idea, just with the FPGA instead of the west connection. Before going into production this needs thorough testing, and as was said in this thread, such testing requires no new boards. So my request is that someone implements it and tests it with a few Parallellas. For example, there exists no RAID driver combining multiple SD cards located on Parallellas connected through eLink. Just imagine: two other SD cards at 2G/s each, in addition to the local one. And then maybe another such SD card accessible through an FPGA connected to the north eLink...
As for the casing, that's complicated. There is no case fit for holding a tree topology, even less so when the tree loops back through its leaves. But I suspect the case should be a triangle, with connectors on the edges or corners, flexibly tilted up or down...