Problems when running Epiphany tests

Postby ralphmcardell » Fri Jul 04, 2014 11:12 am

Hello all,

I have finally got to the stage of bringing up my Kickstarter mini-cluster of 4 Parallellas and am having problems when running tests on the Epiphany chips.

Any suggestions on things to check or change would be much appreciated.

Setup:
OS images created from:

    - ubuntu-14.04-140611.img.gz archive.
    - kernel-hdmi-default.tgz boot files (embedded tar file dated 05-Jun-2014)
Running headless via SSH terminal sessions - although have connected monitor successfully to boards initially to see if they booted OK and occasionally when having problems. Have not connected anything at all via the uUSB connector.

Boards have the provided heatsinks attached to their Zynq 7020 chip and are fan cooled by 120mm fans - one pushing and the other pulling air. No heatsinks were attached to the Epiphany chips initially.

The ztemp.sh script reports temperatures for the 7020's between about 48 deg. C and 55 deg. C

Power supplied by a mini PC PSU having 5V rated at 12A.
Voltage measured when running: 5.02V +- 0.01V

Problem:
All boards generally boot OK although network randomly fails to come up.
Once up they seem to work OK as ARM based Linux systems.

When initially testing the Epiphany chips by running the

    /home/linaro/epiphany-examples/scripts/TestEpiphany.pl
script with the LIST.E16 configuration file, 3 of the 4 Parallellas have always passed every test, even when run repeatedly - hooray!

The 4th board however usually fails on the first run, most often when running

    test/e-mem-test/test.sh
or
    test/e-matmul-test/test.sh
Occasionally it does not even complete

    test/e-reset/test.sh
Sometimes it passes all tests but will fail on a subsequent run.
Best result was 2 complete run throughs followed by failure in test/e-reset/test.sh.

The failure mode is that the test does not return and the whole SSH session becomes unresponsive. It seems the whole system locks up or at least the network is borked by the failure - a reset or power cycle is required to recover. If I have video output connected via HDMI then the login screen image remains stable after lock up.

Thinking this Epiphany might be running a bit hot (although it did not ever seem that hot whenever I checked using a finger!) I attached a heatsink and then re-ran the tests. There was no improvement.

As a second set of tests I have been running the

    /home/linaro/epiphany-examples/apps/matmul-16
run4ever.sh script on all 4 boards.

As expected the known dodgy board manages between 0 and about 85 iterations before getting stuck and ctrl-C has little effect (it might break out of the stuck iteration only to get stuck and become unresponsive on the next).

The other three boards also get stuck every once in a while, but in these cases ctrl-C will fail the stuck iteration and things then progress OK until the next sticking point, whereupon ctrl-C will start things going again.
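The manual ctrl-C recovery described above could be automated with a timeout wrapper. This is a minimal sketch, assuming each test iteration is an ordinary command that exits when it finishes; `run_with_timeout`, `soak` and the timeout values are illustrative names and numbers, not part of the epiphany-examples scripts. It only helps with the recoverable kind of sticking: a board that locks up the whole system will take the wrapper down with it.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd once; return 'pass', 'fail', or 'stuck' (killed after timeout_s)."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this exception.
        return "stuck"
    return "pass" if result.returncode == 0 else "fail"


def soak(cmd, iterations, timeout_s):
    """Repeat cmd and count how each iteration ended."""
    counts = {"pass": 0, "fail": 0, "stuck": 0}
    for _ in range(iterations):
        counts[run_with_timeout(cmd, timeout_s)] += 1
    return counts


if __name__ == "__main__":
    # Stand-in command; substitute the real test command on the board.
    print(soak([sys.executable, "-c", "pass"], 3, 5.0))
```

Note that if the command is a shell script, only the script itself is killed on timeout; any processes it spawned may need cleaning up separately.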

Thanks for reading.
Ralph
ralphmcardell
 
Posts: 12
Joined: Mon Dec 17, 2012 3:25 am
Location: London UK

Re: Problems when running Epiphany tests

Postby zmc » Fri Jul 04, 2014 5:59 pm

I have a similar issue: running matmul-16 fails pretty consistently. It fails in the while loop waiting for the Epiphany cores to finish; that loop runs for over 5 minutes, probably forever.

The interesting part is that this test ran flawlessly and repeatedly earlier. I started trying to run the John the Ripper FPGA implementation and create my own bitstream using Xilinx PlanAhead, and now it fails sometimes, even when using the Adapteva HDMI or headless bitstreams. If I control-c the process and restart it a few times, I will get a result and normal times from the Epiphany chip.

I'm just running the TestEpiphany.pl now, and it seems to be stuck on the fft2d test. If I can help debug in any way I would be glad.
[update]
It seems the only tests that fail sometimes are the matmul-16 and fft2d.
zmc
 
Posts: 24
Joined: Thu Jul 03, 2014 10:01 pm

Re: Problems when running Epiphany tests

Postby ralphmcardell » Sun Jul 06, 2014 3:36 pm

Hi zmc,

Thanks for the offer of debugging help, not sure what I could ask at this point, still getting up to speed on Parallella development & debugging. Had hoped to get the cluster stable before exploring such things too deeply...ho hum.

I found the same for the stuck matmul-16 executions - that they get stuck waiting for the Epiphany to signal it's done by setting Mailbox.core.go to 0.

Why the host does not see this signal I have not yet determined - whether the Epiphany writes the signal but the host never reads it, or whether the Epiphany does not complete the task for some reason (race condition maybe??).
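One way to turn the endless wait into a reportable failure is to bound the polling loop. This is a minimal sketch of the idea only, not the matmul-16 host code: `read_go` is a hypothetical callable standing in for whatever reads Mailbox.core.go from shared memory, and the timeout and poll interval are arbitrary.

```python
import time


def wait_for_done(read_go, timeout_s, poll_s=0.01):
    """Poll read_go() until it returns 0.

    Returns True if the done signal (0) was seen before timeout_s elapsed,
    False otherwise, so the caller can log the failure instead of hanging.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read_go() == 0:
            return True
        time.sleep(poll_s)
    # One last read in case the flag cleared during the final sleep.
    return read_go() == 0
```

On a timeout the host could then dump whatever state it can read back, which may help tell "signal written but never seen" apart from "task never completed".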

Time to look deeper at the code I suppose...

I have not made any progress as to why one board locks up unrecoverably (other than by reset or power cycle) when it gets stuck running tests / apps on the Epiphany. I did try the headless FPGA bitstream on the dodgy board and, once I finally managed to get it to boot with the network running OK, found no improvement. Annoyingly, once in a while it runs OK for a bit - several hours during Friday afternoon after my original post, only getting recoverably stuck running matmul-16. Then it reverts to its broken type, dashing my hopes :(

Re: Problems when running Epiphany tests

Postby zmc » Wed Jul 09, 2014 1:41 pm

Update: it seems that the problems I was having with matmul and fft had to do with the DMA engine. I was using a power supply rated at 1.5A; now I am powering off a huge USB hub rated at 3A, running with an HDMI monitor/FPGA bitstream, and it seems stable.

Later on I'm going to try switching back to the old power supply and see if I can reproduce the error now that I have my FTDI/TTL cable and can see exactly what's going on. The part I am most interested in is that in my case the problems only began after I changed the bitstream to a custom one. From what I understand, all the FPGA fabric configuration is completely volatile and no longer exists when the chip is powered down, so I'm not sure if there is anything hardware-wise that could hold any state. If anyone knows more about the Epiphany internals please let us know if this is possible.

Re: Problems when running Epiphany tests

Postby aolofsson » Wed Jul 09, 2014 7:24 pm

Thanks for the followup! Please do report back with your findings. Seems like the PSU was not strong enough to power the board.
Are you powering the board from the microUSB next to the RJ45?
Andreas
aolofsson
 
Posts: 1005
Joined: Tue Dec 11, 2012 6:59 pm
Location: Lexington, Massachusetts,USA

Re: Problems when running Epiphany tests

Postby zmc » Sat Jul 12, 2014 3:38 am

I am using the micro-USB near the RJ45 port to power the board. I did try using the barrel connector but the power supply that I used was only rated for 1A. The power supply I am using now is rated for 3A.

I just did some testing, and I can't reproduce the failures now. Using the old power supply and an HDMI-enabled bitstream, it was able to run the test perfectly every time. The power supply might have been a contributing factor but I'm not sure it was the cause.

When I first got the board the tests ran perfectly, the problem only started after I tried the custom bitstream.

What could possibly be holding state across the reboots?

Re: Problems when running Epiphany tests

Postby 9600 » Sat Jul 12, 2014 5:23 am

zmc wrote:I am using the micro-usb near the RJ45 port to power the board, I did try using the barrel connector but the power supply that I used was only rated for 1A. The power supply I am using now is rated for 3A.


The best solution is to use a 2A minimum power supply feeding the board via the barrel connector or mounting hole pads. If you are using the MicroUSB connector at present it may be worth making up a cable for the barrel connector or pads, the latter being a neat solution if you want to stack boards with metal hex spacers (although you then need an n x 2A PSU).
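The n x 2A rule of thumb can be worked through as simple arithmetic. This is a minimal sketch; the function names and the optional `extra_a` term (for anything else sharing the supply, such as a network switch) are illustrative assumptions, and the 2A per board figure is the minimum quoted above rather than a measured draw.

```python
def required_amps(n_boards, per_board_a=2.0, extra_a=0.0):
    """Minimum 5V supply rating for n stacked boards plus any extra load."""
    return n_boards * per_board_a + extra_a


def has_headroom(psu_rating_a, n_boards, per_board_a=2.0, extra_a=0.0):
    """True if the PSU rating meets or exceeds the rule-of-thumb requirement."""
    return psu_rating_a >= required_amps(n_boards, per_board_a, extra_a)
```

For example, four boards need at least 8A by this rule, so a 12A supply passes, while a single board on a 1.5A supply does not.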

Regards,

Andrew
Andrew Back (a.k.a. 9600 / carrierdetect)
9600
 
Posts: 997
Joined: Mon Dec 17, 2012 3:25 am

Re: Problems when running Epiphany tests

Postby ralphmcardell » Tue Jul 22, 2014 2:36 pm

Update:
-------

Hello all,

I am still having problems with my mini-cluster of Parallellas, however I have made some progress.

To make sure that power was not the issue I have been powering Parallellas one board at a time and have also tried a different PSU to power a single Parallella - a TDK Lambda 5V/10A affair - with no change in stability for any of the boards.

Finally I acted on a vague observation that the most stable runs seemed to be on very hot days (for the UK!) in the mid to late afternoon when it was hottest (note: Parallellas are in a quite small room with only a door and windows for air conditioning).

I let a board get hot by restricting the air flow around it. I have so far concentrated on the board that locks up the system (or possibly just crashes the network) when it fails as this is the more serious problem.

I find that if I let the board warm up so that the ztemp.sh script reports temperatures above 60 degrees C (but below the 70 degrees C recommended maximum) then this board magically starts to repeatedly pass all the Epiphany tests listed in the LIST.E16 file run by /home/linaro/epiphany-examples/scripts/TestEpiphany.pl.

I have been running this board hot over the last few days (and cooler, to get an idea of the range of temperatures involved). Today I have been running the board at 64 degrees C to 67 degrees C for over four and a half hours and 265+ iterations of the tests with no lock-ups, no failures and no interruptible stuck incidents. Note that I had no luck with an earlier setup that cooled the Zynq chip to ~57 degrees C.
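To log the temperature alongside each test iteration, the reading can be taken programmatically. This is a minimal sketch under the assumption that the Zynq temperature sensor is exposed through the Linux IIO sysfs interface, where degrees C = (raw + offset) * scale / 1000; the device directory, the `in_temp0` channel name and the function names are assumptions and may not match what ztemp.sh actually reads on a given image.

```python
from pathlib import Path


def iio_temp_celsius(raw, offset, scale):
    """Convert IIO raw/offset/scale values to degrees C (scale gives millidegrees)."""
    return (raw + offset) * scale / 1000.0


def read_iio_temp_celsius(device_dir, channel="in_temp0"):
    """Read one temperature sample from an IIO device directory.

    device_dir is an assumption, e.g. a directory under /sys/bus/iio/devices/.
    """
    d = Path(device_dir)
    raw = float((d / f"{channel}_raw").read_text())
    offset = float((d / f"{channel}_offset").read_text())
    scale = float((d / f"{channel}_scale").read_text())
    return iio_temp_celsius(raw, offset, scale)


def in_band(temp_c, low_c=60.0, high_c=70.0):
    """True if temp_c lies in the band where this board was seen to pass."""
    return low_c < temp_c < high_c
```

Recording the value returned by `read_iio_temp_celsius` with each pass, fail or stuck result would make it easier to see where the stable band starts and ends.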

I am not convinced it is the Zynq chip that needs to be this hot nor necessarily the Epiphany chip, but might well be some other component or components that of course get warm as well.

Any ideas as to what could cause such behaviour - i.e. Running Epiphany tests and examples causes board to fail unless something - or somethings - are warm enough - would be very much appreciated. As would any suggestions as to other things I could check such as voltage levels or frequencies (I have a multi-meter and an old analogue CRT 20MHz oscilloscope available).

Regards

Ralph

Re: Problems when running Epiphany tests

Postby mtimms2 » Fri Aug 15, 2014 2:16 pm

I also have a similar issue with a 7020 Parallella board imported to the UK and received in May 2014, with the following details:
SKU A101040
S/N 0001717

Every time I run the Epiphany or Parallella examples, approximately one second after starting the code, the system locks up on the serial port, X Windows terminal and SSH connection alike. The lockup occurs whenever the code tries to interact with the Adapteva processor. I have tried running the examples with X Windows disabled and see the same result.

I have loaded the current Linux firmware and have tried both the 7020 HDMI and headless Zynq images. I wonder if the 7020 FPGA code is missing any fixes which are included in the 7010 version??

I have a fan with airflow across the width of the board which keeps both the Adapteva and Zynq chips at a low temperature. I am using a 5V 4A supply connected to the barrel connector, so there should be enough supply current to ensure that there are no voltage drops across the board whilst the Adapteva cores are being loaded with code.

When only using the Zynq chip, the system and OS are stable and X Windows, Linux and other applications work OK. My intention however was to use the Parallella for a project that requires full use of the 16 Adapteva parallel cores.
mtimms2
 
Posts: 4
Joined: Fri Aug 15, 2014 12:53 pm

Re: Problems when running Epiphany tests

Postby ralphmcardell » Fri Aug 15, 2014 3:44 pm

@mtimms2,

Sorry to hear you also have a board that fails when using the Epiphany cores. On the other hand it seems it is not quite such an isolated incident as just my one board.

Have you tried allowing the board to run warmer as I mention in a previous post:

viewtopic.php?f=50&t=1438&sid=8f1a61350979fc73303884fb460b8ff2#p9488

?

I did manage to get my really dodgy board to execute tests for just over 24 hours by keeping the temperature reported for the Zynq chip above 65 degrees C but mostly below 70 degrees C (I think it spiked briefly at 73 degrees in the warm late afternoon before I could adjust the airflow!).

It still locked up eventually however. My brother, who many years ago used to work repairing microcomputers, mentioned that in his experience when a board works better when warm it usually indicates a bad joint which becomes less bad (as it were) under heat expansion. If this is the case then who knows if such a fault occurred at manufacture, transit or installation. On the other hand it could be some other fault - maybe signalling or timing. No one else, you might have noticed, has come forward to offer any possibilities or things that might be looked into to help track down the problem :cry:

I have now removed this board from my mini-cluster and am in the process of replacing it with a new 7010 based P1601 (it's the 2nd - the 1st lasted about 24 hours before dying and was replaced by RS - I am doing burn-in tests on the replacement before installing it in the cluster so if it fails I will not have made any changes such as bridging J15 for power via mounting pads).

The bad 7020 A101040 board I will use for non-Epiphany development and experiments - learning a bit about FPGA programming would seem an appropriate use. I did not try to RMA this board as firstly I would have to get it back to Adapteva in the USA from the UK, and secondly in trying to sort out what was wrong I blew the 4A fuse on the board while trying to check the 5V test point (which is surrounded by grounded things) and had to bridge the fuse - not as expertly as some, I just made a solder bridge.

However, all my boards, including the new 7010 P1601 incumbent, exhibit the 'soft' failures with matmul-16 and fft2d. The new board is being powered during 'acceptance testing' via the barrel connector using one of the wall-wart power supplies supplied by Adapteva as part of my mini-cluster reward - maybe it will improve when I connect it to the 5V, 12A supply shared by all mini cluster boards and their network switch, but I'll not be holding my breath...
