Parallella Ubuntu ESDK 2016.3 released

Forum about Parallella boot process, linux kernel, distros, SD-cards, etc.

Parallella Ubuntu ESDK 2016.3 released

Postby olajep » Mon Apr 11, 2016 1:14 pm

Updated: 2016-05-12

Use the 2016.3.1 bug-fix release instead:

Forum post:
viewtopic.php?f=48&t=3682

https://github.com/parallella/pubuntu/r ... k-2016.3.1

Highlights

  • Upgraded to Ubuntu Vivid 15.04 (Linaro Nano base image)
  • Redesigned FPGA elink
  • Redesigned kernel driver
  • Updated toolchain
  • SREC support removed (use elf format instead)

SD card images and full release notes here:
https://github.com/parallella/pubuntu/r ... sdk-2016.3

Grab it while it's hot!

// Ola
Last edited by olajep on Thu May 12, 2016 9:54 pm, edited 3 times in total.
_start = 266470723;
olajep
 
Posts: 139
Joined: Mon Dec 17, 2012 3:24 am
Location: Sweden

Re: Parallella Ubuntu ESDK 2016.3 released

Postby pgater » Mon Apr 11, 2016 7:10 pm

Hi Ola,

Many thanks! :D

Regards,

Paul
pgater
 
Posts: 43
Joined: Mon Dec 17, 2012 3:25 am
Location: Nantwich, Cheshire, UK

Re: Parallella Ubuntu ESDK 2016.3 released

Postby sebraa » Mon Apr 11, 2016 8:30 pm

Thank you!
sebraa
 
Posts: 495
Joined: Mon Jul 21, 2014 7:54 pm

Re: Parallella Ubuntu ESDK 2016.3 released

Postby MiguelTasende » Tue Apr 12, 2016 12:21 pm

Great!
Will grab it today :P
MiguelTasende
 
Posts: 51
Joined: Tue Jun 30, 2015 12:44 pm

Re: Parallella Ubuntu ESDK 2016.3 released

Postby jar » Tue Apr 12, 2016 3:23 pm

Tables 6, 7, and associated text should be updated in the Epiphany architecture reference for the ABI changes.
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: Parallella Ubuntu ESDK 2016.3 released

Postby MiguelTasende » Thu Apr 14, 2016 5:24 pm

My old Epiphany-ARM communication method stopped working (PROBLEM SOLVED below; however, the e-bandwidth-test numbers are still not great... why?). I tested my old code with the new ESDK, and it doesn't work.
In particular:

EPIPHANY---------------------------------------------------------------------------------------------------------------------------------------------
This is the Epiphany code to wait for ARM "go":
Code: Select all
        *estado = 0;

        e_barrier_init(barriers, tgt_bars);

        // Wait for the start signal
        if (coreNum == 0){
            // Wait for the "go" signal from the ARM
            while (*estado == 0) {};
        }
        // All cores wait at the barrier for core 0 (which receives the ARM signal)
        e_barrier(barriers, tgt_bars);


And this is the one to signal to the ARM, "I'm done":

Code: Select all
        e_barrier(barriers, tgt_bars);

        if(coreNum == 0){
            *estado = 0;
        }


/EPIPHANY-------------------------------------------------------------------------------------------------------------------------------------------

"estado" is a variable located in Epiphany local memory, and the host code for "go" would be:

HOST--------------------------------------------------------------------------------------------------------------------------------------------------
Code: Select all
        int signalLocal;
        // Raise the Epiphany start signal
        signalLocal = 1;
        e_write(&dev, 0, 0, COMPARTIDA_DEV, &signalLocal, sizeof(signalLocal));


And the host code to wait until the Epiphany is done:
Code: Select all
        // Wait until the Epiphany finishes
        signalLocal = 1;
        while (signalLocal != 0){
            e_read(&dev, 0, 0, COMPARTIDA_DEV, &signalLocal, sizeof(signalLocal));
        }


/HOST-------------------------------------------------------------------------------------------------------------------------------------------------

That works with the old ESDK, but not with the new one. I saw that the new "hello_world" example uses functions that were new to me, like "e_shm_attach". I will try to adapt my code to the new examples I find, but...
EDIT: I just checked... those functions were already there in the old ESDK; I had just learned from a different "hello world", which is why I hadn't seen them.

- Has the ESDK API changed? I know the old API still compiles; the question is whether it still works with the new e-link, for example.

- Will there be any documentation (or an added appendix to the ESDK Reference Guide) about that?

Maybe the problem is elsewhere, and the behaviour of my old code changed for another reason (different timings? unlikely. Something else?)...

UPDATE: It looks like the problem was not in the code but probably in how I used the e-loader utility (I invoked it externally in one version). I had already switched from SREC to ELF, but something more was probably missing...


Edit: I also tested "e-bandwidth-test", which works out of the box, but it doesn't seem to show dramatic speed improvements (perhaps confirming that the old ESDK API may not be the one to use now...?):

(I don't like posting these numbers in case they make for bad advertisement :P... I will edit and erase them if something is wrong with the measurements, in case someone reads on the fly and gets the wrong idea...)

Old Parallella image:
Code: Select all
ARM Host    --> eCore(0,0) write speed       =   43.47 MB/s
ARM Host    --> eCore(0,0) read speed        =    5.25 MB/s
ARM Host    --> ERAM write speed             =   88.23 MB/s
ARM Host    <-- ERAM read speed              =  131.52 MB/s
ARM Host    <-> DRAM: Copy speed             =  353.73 MB/s
eCore (0,0) --> eCore(1,0) write speed (DMA) = 1242.38 MB/s
eCore (0,0) <-- eCore(1,0) read speed (DMA)  =  401.46 MB/s
eCore (0,0) --> ERAM write speed (DMA)       =  233.94 MB/s
eCore (0,0) <-- ERAM read speed (DMA)        =   87.71 MB/s


New Parallella image:
Code: Select all
------------------------------------------------------------
ARM Host    --> eCore(0,0) write speed       =   52.40 MB/s
ARM Host    --> eCore(0,0) read speed        =    8.10 MB/s
ARM Host    --> ERAM write speed             =   90.26 MB/s
ARM Host    <-- ERAM read speed              =   49.01 MB/s
ARM Host    <-> DRAM: Copy speed             =  432.94 MB/s
eCore (0,0) --> eCore(1,0) write speed (DMA) = 1270.67 MB/s
eCore (0,0) <-- eCore(1,0) read speed (DMA)  =  388.33 MB/s
eCore (0,0) --> ERAM write speed (DMA)       =  238.13 MB/s
eCore (0,0) <-- ERAM read speed (DMA)        =  161.07 MB/s
------------------------------------------------------------


The only visible improvement is eCore <-- ERAM (about 2x), while ARM <-- ERAM actually got slower.

Again, I don't know how to interpret this data... just asking.
MiguelTasende
 
Posts: 51
Joined: Tue Jun 30, 2015 12:44 pm

Re: Parallella Ubuntu ESDK 2016.3 released

Postby MiguelTasende » Thu Apr 14, 2016 8:02 pm

Comment: it looks like the leading "_" has to be removed from global function names in assembly (when mixing asm and C).
After that it works...

I could see some differences in transfer times, but I'm still testing...

Edit: Yep, it goes faster :) . For now just a small gain out of the box, but I saw real evidence that the new e-link is there.
MiguelTasende
 
Posts: 51
Joined: Tue Jun 30, 2015 12:44 pm

Re: Parallella Ubuntu ESDK 2016.3 released

Postby aolofsson » Thu Apr 14, 2016 9:35 pm

-ARM initiated transfers are software bound, nothing to do with the elink
-As you can see the ERAM read speed increased by 2X (that's significant!)
-If you run the matmul example, you can see a significant wall time speed up.

We'll look into the ARM-->ERAM read/write speed (it has nothing to do with the elink, but agreed that the slowdown is not good)
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Parallella Ubuntu ESDK 2016.3 released

Postby olajep » Fri Apr 15, 2016 9:37 am

MiguelTasende wrote:Old Parallella image:
Code: Select all
ARM Host    --> eCore(0,0) write speed       =   43.47 MB/s
ARM Host    --> eCore(0,0) read speed        =    5.25 MB/s
ARM Host    --> ERAM write speed             =   88.23 MB/s
ARM Host    <-- ERAM read speed              =  131.52 MB/s
ARM Host    <-> DRAM: Copy speed             =  353.73 MB/s
eCore (0,0) --> eCore(1,0) write speed (DMA) = 1242.38 MB/s
eCore (0,0) <-- eCore(1,0) read speed (DMA)  =  401.46 MB/s
eCore (0,0) --> ERAM write speed (DMA)       =  233.94 MB/s
eCore (0,0) <-- ERAM read speed (DMA)        =   87.71 MB/s


New Parallella image:
Code: Select all
------------------------------------------------------------
ARM Host    --> eCore(0,0) write speed       =   52.40 MB/s
ARM Host    --> eCore(0,0) read speed        =    8.10 MB/s
ARM Host    --> ERAM write speed             =   90.26 MB/s
ARM Host    <-- ERAM read speed              =   49.01 MB/s
ARM Host    <-> DRAM: Copy speed             =  432.94 MB/s
eCore (0,0) --> eCore(1,0) write speed (DMA) = 1270.67 MB/s
eCore (0,0) <-- eCore(1,0) read speed (DMA)  =  388.33 MB/s
eCore (0,0) --> ERAM write speed (DMA)       =  238.13 MB/s
eCore (0,0) <-- ERAM read speed (DMA)        =  161.07 MB/s
------------------------------------------------------------


The only visible improvement is eCore <-- ERAM (about 2x), while ARM <-- ERAM actually got slower.

Again, I don't know how to interpret this data... just asking.


First, we need to use the same version of e-bandwidth-test so we're comparing the same benchmark.

This seems to be mostly an issue with compiler optimization flags.
With the 2016.3 image, building with -O3 gives slightly better read performance than the 2015.1 image.
Strangely, on the 2015.1 image, optimization flags seem to decrease read performance by 2x.

ESDK 2015.1, epiphany-examples 2016.3
Code: Select all
------------------------------------------------------------
ARM Host    --> eCore(0,0) write speed       =   50.72 MB/s
ARM Host    --> eCore(0,0) read speed        =    5.27 MB/s
ARM Host    --> ERAM write speed             =   88.12 MB/s
ARM Host    <-- ERAM read speed              =  127.28 MB/s
ARM Host    <-> DRAM: Copy speed             =  351.01 MB/s
eCore (0,0) --> eCore(1,0) write speed (DMA) = 1242.38 MB/s
eCore (0,0) <-- eCore(1,0) read speed (DMA)  =  401.46 MB/s
eCore (0,0) --> ERAM write speed (DMA)       =  232.11 MB/s
eCore (0,0) <-- ERAM read speed (DMA)        =   87.72 MB/s
------------------------------------------------------------
TEST "e-bandwidth-test" PASSED

ESDK 2016.3, epiphany-examples 2016.3 -O3
Code: Select all
------------------------------------------------------------
ARM Host    --> eCore(0,0) write speed       =   50.69 MB/s
ARM Host    --> eCore(0,0) read speed        =    7.59 MB/s
ARM Host    --> ERAM write speed             =   87.88 MB/s
ARM Host    <-- ERAM read speed              =  134.81 MB/s
ARM Host    <-> DRAM: Copy speed             =  487.67 MB/s
eCore (0,0) --> eCore(1,0) write speed (DMA) = 1270.67 MB/s
eCore (0,0) <-- eCore(1,0) read speed (DMA)  =  388.33 MB/s
eCore (0,0) --> ERAM write speed (DMA)       =  238.13 MB/s
eCore (0,0) <-- ERAM read speed (DMA)        =  161.07 MB/s
------------------------------------------------------------
TEST "e-bandwidth-test" PASSED

Pushed a fix to epiphany-examples.

// Ola
_start = 266470723;
olajep
 
Posts: 139
Joined: Mon Dec 17, 2012 3:24 am
Location: Sweden

Re: Parallella Ubuntu ESDK 2016.3 released

Postby MiguelTasende » Tue Apr 19, 2016 2:26 pm

OK, after changing to -O3 I got the same results as you did.
Thanks for the answer.

-As you can see the ERAM read speed increased by 2X (that's significant!)
-If you run the matmul example, you can see a significant wall time speed up.


Yes, it is significant. I tested the epiphany-examples matmul, and it improves from about 1.5 GFLOPS to 2.0 GFLOPS.
My own matmul code (which I hope will be released soon; that doesn't depend on me at the moment) goes from 3.5 GFLOPS to 3.9 GFLOPS, and I am working on some improvements to take it to around 5 or 6 (if I'm really lucky).
I think I was just hoping for a super-e-link improvement that violated the laws of quantum mechanics and relativity to reach about 16 GFLOPS in matmul (and who knows how much in more complex applications)... :P (maybe next time)

-ARM initiated transfers are software bound, nothing to do with the elink


I am really interested in understanding what is going on with the host <-> shared RAM transfers... When transferring to other sections of the RAM, the speed is much better. So I came up with some hypotheses:

- Could the e-link, even if not directly used by the host to access the RAM, be interfering with the transfer? (Being a shared resource, I thought the e-link could add some overhead to ARM transfers, e.g. some logic in the path to ensure there are no concurrent attempts to access the same portion of RAM. I really haven't looked at the schematics/FPGA design, so I can't say; just asking...)
- Maybe sharing the RAM region requires changing some access configuration that has nothing to do with the e-link; maybe the way the ARM uses caches changed and that is the cause?
- Something else?

Maybe that belongs in another topic... (sorry)
MiguelTasende
 
Posts: 51
Joined: Tue Jun 30, 2015 12:44 pm
