Data doesn't fit into the core

Discussion about Parallella (and Epiphany) Software Development


Data doesn't fit into the core

Postby Nightingale » Tue Oct 14, 2014 9:34 am

I am trying to get a face detection program running on the Epiphany. To detect faces it uses a cascade file, which takes 173 KB of space when loaded (technically it can be reduced to around 99 KB). How can I deal with that?
The only solution I could come up with is to store the picture in shared memory (a 512x512 grayscale picture should fit, right?) and then partition the cascade into smaller bits, so that they fit in the cores and each core is responsible for its own part. This approach means the program will do quite a bit of extra work, but I can't come up with anything else.
So my question is: is there a way for each core to access 99 KB of raw data relatively quickly and compare that data against the picture data?
Second question: what is the best way for the cores to access a 512x512 grayscale picture (or partitioned portions of it)?
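As a rough back-of-the-envelope sketch (plain C, no eSDK needed; the figures are assumptions taken from the numbers above: 99 KB cascade, 512x512 8-bit picture, 32 KB of local memory per core, cascade split into 8 parts, and a guessed ~16 KB overhead for code/stack/buffers):

```c
/* Assumed figures from the post above; none of these come from the eSDK. */
enum {
    CASCADE_BYTES = 99 * 1024,   /* reduced cascade size */
    PICTURE_BYTES = 512 * 512,   /* one byte per grayscale pixel: 256 KB */
    LOCAL_MEM     = 32 * 1024,   /* local memory per Epiphany core */
    PARTS         = 8            /* cascade split into 8 parts */
};

/* Bytes each core must hold if the cascade is split evenly (rounded up). */
static int part_bytes(void) {
    return (CASCADE_BYTES + PARTS - 1) / PARTS;   /* 12672 bytes, ~12.4 KB */
}

/* A part fits in local memory with room left for code, stack and buffers
 * (the ~16 KB overhead figure is an assumption, not a measurement). */
static int part_fits(void) {
    return part_bytes() + 16 * 1024 <= LOCAL_MEM;
}
```

So one eighth of the reduced cascade fits on a core, while the full 256 KB picture only fits in the shared DRAM region, from which cores would fetch tiles.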

EDIT:
volatile char *process_done = static_cast<char*>((void *)0x2000);
this line declares that the process_done variable is stored inside the core's local memory at a specific address, right? How do I access shared memory? Is this the correct way:
volatile float *pic = static_cast<float*>((void*)0x1e010000);
Nightingale
 
Posts: 11
Joined: Fri Sep 19, 2014 11:38 am

Re: Data doesn't fit into the core

Postby notzed » Wed Oct 15, 2014 7:48 am

I spent considerable effort (several full-time weeks' worth - I had a lot of spare time last year) on this problem on several different platforms including the Epiphany, and my conclusion is that the Viola-Jones algorithm is just a poor one for any type of modern hardware (particularly, but not limited to, parallel hardware). It takes too much data to process a tiny window of the image, summed-area tables aren't actually efficient unless you think a CPU still works like one made in 1980, and local parallelism (e.g. SIMD) is not possible.

On the epiphany, my best result involved:

o turning the cascade into a size- and performance-optimised stream format of fixed-size chunks. I can't remember the exact size I got it down to, but it was under 64K for a face detector.
o an assembly language inner loop
o working on as big a tile as would fit in memory (I forget the size, but something like 64x32)
o a fixed tile size, so the cascade format can be pre-calculated against the tile stride (and/or use shifts for multiplies)
o using asynchronous/double-buffered DMA to load the cascade stream
o storing the first N KB of the stream permanently on-core, so that at least some cases can be handled with no extra loads.
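The double-buffered streaming idea in the list above can be sketched in plain C, with memcpy standing in for the asynchronous e_dma_start()/e_dma_wait()-style calls (the chunk size, function names and the checksum "work" are all illustrative, not from the original code):

```c
#include <string.h>

#define CHUNK 2048                 /* illustrative fixed chunk size in bytes */

/* Stand-in for kicking off / waiting on an async DMA transfer; on the
 * Epiphany this would be an e_dma_* call rather than a blocking memcpy. */
static void fetch_chunk(char *dst, const char *src, int n) {
    memcpy(dst, src, n);
}

static int process_chunk(const char *buf, int n) {
    int sum = 0;                   /* dummy "work": checksum the chunk */
    for (int i = 0; i < n; i++) sum += (unsigned char)buf[i];
    return sum;
}

/* Stream 'len' bytes through two CHUNK-sized buffers: while one buffer is
 * being processed, the next chunk is (conceptually) being fetched. */
int stream_cascade(const char *shared, int len) {
    static char buf[2][CHUNK];
    int total = 0, cur = 0, off = 0;
    int n = len < CHUNK ? len : CHUNK;
    fetch_chunk(buf[cur], shared, n);            /* prime the first buffer */
    while (off < len) {
        int next_off = off + n;
        int next_n = (len - next_off) < CHUNK ? (len - next_off) : CHUNK;
        if (next_off < len)                      /* start "DMA" of next chunk */
            fetch_chunk(buf[1 - cur], shared + next_off, next_n);
        total += process_chunk(buf[cur], n);     /* work on current chunk */
        off = next_off;
        n = next_n;
        cur = 1 - cur;
    }
    return total;
}
```

With a real DMA engine the fetch of the next chunk overlaps the processing of the current one, which is the whole point of holding two buffers.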

Even with all this, just loading the cascade becomes a massive bottleneck once you have more than a couple of cores running, and load balancing at small scales is difficult due to the windowing. I gave up and then got sidetracked for months on writing my own Epiphany "driver" (actually that grew out of wanting a more complicated program to support the whole face detection pipeline, and realising the eSDK was going to make that really painful; I just never went back to the FD problem).

Subsequently I invented and refined an entirely new (I believe) face detection algorithm designed for parallel hardware, though I haven't tried it on the Epiphany. Several detector definitions will fit easily in on-core memory, though. I'm using the algorithm in a semi-production system to successfully replace a VJ cascade.

Can't help on the C++ question, sorry.

Oh, I knew I'd blogged about it, so I looked it up:

http://a-hackers-craic.blogspot.com.au/ ... date=false

If that link works it should start at 6/9/2013 and go backward in time, from the last time I posted much about it to the earlier work that got to that point. There is quite a lot of detail, but I usually wrote it after a late evening of hacking, so a lot of it isn't very readable.
notzed
 
Posts: 331
Joined: Mon Dec 17, 2012 12:28 am
Location: Australia

Re: Data doesn't fit into the core

Postby Nightingale » Mon Oct 20, 2014 9:22 am

Thanks for the tip. Right now I am trying to split the cascade into eight parts, so that each part is loaded into 2 cores. Now my problem seems to be the cores crashing when I run the code. Thus I have a few follow-up questions:

- Suppose I have 16 KB of raw data loaded into each core.
client side:
volatile int *cascade_data = static_cast<int*>((void*)0x4000);
host side:
int cascade_data[31528];
read_cascade(cascade_data);
for (row = 0; row < 4; row++) {
    for (col = 0; col < 4; col++) {
        e_write(dev, row, col, 0x4000,
                &cascade_data[3941 * ((col * 4 + row) / 2)],
                3941 * sizeof(int));
    }
}
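As a quick host-side sanity check of the indexing in that loop (pure C, no eSDK needed; the constants are taken from the snippet above): each of the 8 parts should go to exactly 2 cores, and every slice written should stay inside the 31528-int array.

```c
/* Constants from the host snippet: 3941 ints per part, 31528 ints total. */
enum { PART_INTS = 3941, TOTAL_INTS = 31528, NPARTS = 8 };

/* Part index chosen for core (row, col) - same formula as the e_write loop. */
static int core_part(int row, int col) {
    return (col * 4 + row) / 2;
}

/* Returns 1 if every part lands on exactly two cores and every slice
 * lies inside cascade_data[TOTAL_INTS]; 0 otherwise. */
static int mapping_ok(void) {
    int users[NPARTS] = {0};
    for (int row = 0; row < 4; row++)
        for (int col = 0; col < 4; col++) {
            int part = core_part(row, col);
            if (part < 0 || part >= NPARTS) return 0;
            if (PART_INTS * (part + 1) > TOTAL_INTS) return 0;
            users[part]++;
        }
    for (int p = 0; p < NPARTS; p++)
        if (users[p] != 2) return 0;    /* each part shared by 2 cores */
    return 1;
}
```

So the arithmetic itself is sound: 3941 ints is 15764 bytes, which fits at 0x4000 within a 32 KB core; the crashes are more likely the write landing on top of code, stack or other data at that address.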

Do I need to do something to protect that data from being overwritten during the execution of the program?

- Where does the core store the compiled program, so that I can avoid overwriting that area?
Nightingale
 
Posts: 11
Joined: Fri Sep 19, 2014 11:38 am

Re: Data doesn't fit into the core

Postby notzed » Mon Oct 20, 2014 11:33 pm

Use e-objdump on the ELF file to find out the layout of the code on the core. You can use this to find out where sections are loaded (I can't remember which flag; -h or -p, I think).

The safest way to use a fixed block is to use a linker script to reserve the space, but linker scripts are kind of obtuse and take a bit of learning. The examples should really be doing this, just to show how it's done. I don't remember the details off the top of my head, but look at internal.ldf or one of the others.
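A hypothetical fragment in the general style of GNU ld scripts like the eSDK's .ldf files, reserving one local-memory bank for the cascade (the region name, section name, and addresses here are illustrative, not copied from the real internal.ldf):

```
MEMORY
{
    /* illustrative: reserve 16 KB of local memory at 0x4000 */
    CASCADE_RAM (WXAI) : ORIGIN = 0x4000, LENGTH = 0x4000
}

SECTIONS
{
    /* anything the program puts in .cascade lands in the reserved bank */
    .cascade : { *(.cascade) } > CASCADE_RAM
}
```

On the C side you could then place the buffer with GCC's section attribute, e.g. `int cascade_data[3941] __attribute__((section(".cascade")));`, so the linker keeps code and stack out of that bank.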

You can also just use global variables and let the linker assign addresses (that's what it's for). I don't think worrying about bank alignment is all that important, especially if you're just trying to get something working.

Moving the stack to the end of the last bank of code can help free up more contiguous banks too.

BTW, you should put the cascade data in shared memory and have the cores read it themselves. Using 8-byte-aligned DMA will be faster.
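The 8-byte alignment condition can be checked cheaply before choosing a transfer path. A minimal sketch in plain C (the predicate name is made up; on the Epiphany the fast path would be an e-lib DMA copy, the fallback an ordinary copy):

```c
#include <stddef.h>
#include <stdint.h>

/* Doubleword DMA is only worthwhile when source, destination and size
 * are all 8-byte aligned; otherwise fall back to a plain copy. */
static int dma_ok(const void *dst, const void *src, size_t n) {
    return ((uintptr_t)dst % 8 == 0) &&
           ((uintptr_t)src % 8 == 0) &&
           (n % 8 == 0);
}
```

In practice that means padding each cascade part to a multiple of 8 bytes and placing it on an 8-byte boundary in shared memory.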
notzed
 
Posts: 331
Joined: Mon Dec 17, 2012 12:28 am
Location: Australia

