by over9000 » Tue Aug 06, 2013 2:59 am
Hi,
I'm not an expert on it, but I know generally how the BWT works. You're right that you can't process a full block on one of the cores, but I don't think it's a lost cause. With 16 cores at 32Kbytes each you've got nearly (less program space) 512Kbytes to work with, so say you have space for about half the maximum block size of 900K. If you assign each of the 16 cores ~32K of data (running something like quicksort on it), then do four passes of a merge-sort algorithm (each pass combining pairs of sorted lists: 16 runs -> 8 -> 4 -> 2 -> 1), you've got half the maximum block size sorted. With some careful interleaving (some cores sorting ~32K chunks, others doing the merge operations) and some work to minimise the amount of data transfer (which will presumably be the slowest part), there's no reason why 16 memory-limited cores shouldn't be able to sort through 900Kbytes of data with decent throughput.
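Just to sketch the arithmetic, here's a toy sequential Python stand-in for the chunk-sort-then-merge scheme (on the real hardware each sort and each merge would run on its own core; all names here are made up for illustration):

```python
from heapq import merge

def sort_in_chunks(data, chunk_size):
    # Stage 1: each core sorts its own ~32K slice independently.
    # Here we just split a toy list into chunks and sort each one.
    return [sorted(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]

def merge_pass(runs):
    # One merge pass: combine adjacent pairs of sorted runs.
    # An odd run out is carried forward unchanged.
    out = []
    for i in range(0, len(runs), 2):
        out.append(list(merge(*runs[i:i + 2])))
    return out

data = list(range(64, 0, -1))   # toy stand-in for the 512K working set
runs = sort_in_chunks(data, 4)  # 16 sorted chunks, "one per core"
passes = 0
while len(runs) > 1:
    runs = merge_pass(runs)
    passes += 1                 # 16 -> 8 -> 4 -> 2 -> 1: four passes
```

With 16 initial runs the loop takes four passes, which is where the pass count in the paragraph above comes from (log2 of the number of chunks).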
In practice, I think the BWT's sorting stage may need more memory than this. Does it have to store pointers to where the bytes originally were? If so, quadruple the 900K memory requirement (one byte for the data, three for the index, since 900K offsets fit comfortably in three bytes). Even so, this sort of merge-based sorting algorithm should be able to give good performance, provided main-memory reads/writes aren't overly expensive and/or we have cheap inter-core streaming to let us avoid some main-memory writes.
I'm sure other refinements of the basic recursive merge-sort algorithm are possible. Since writes back to main memory are unavoidable given the multi-pass nature of the algorithm (and the limited local memory), you could front-load a merge phase designed to sort, say, the first quarter or eighth of the list first, applying some cores to the quicksorts, some to the merges, and some to tracking which elements from the original list have been processed (one bit per element). The number of cores needed (roughly) halves with each pass, since a merge is much cheaper than a full quicksort. So if the algorithm is front-loaded to finish the start of the list early, and the BWT feeds into an adaptive (single-pass) compression stage, you can be producing compressed output before you've finished sorting the entire 900Kbyte buffer. This arrangement should also get much better utilisation of all the cores and, thanks to the bit vectors used in the mark-and-sweep stages, potentially fewer accesses to main memory. Of course an adaptive compression scheme won't compress as well as a two-pass one, but the benefit (increased throughput) may be worth it.
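To make the "output before the sort finishes" point concrete, here's a small Python sketch where a lazy final merge feeds a single-pass adaptive stage (move-to-front is just a stand-in for whatever adaptive coder actually follows the BWT; the runs and names are invented for illustration):

```python
from heapq import merge

def mtf_encode(symbols, alphabet_size=256):
    # Move-to-front: a simple single-pass adaptive transform. It consumes
    # its input lazily, one symbol at a time, so it can start emitting
    # codes as soon as the first merged elements are decided.
    table = list(range(alphabet_size))
    for s in symbols:
        r = table.index(s)
        yield r
        table.pop(r)
        table.insert(0, s)

# Sorted runs, as the per-core quicksorts would leave them. heapq.merge is
# a lazy generator, so elements flow into the coder while the merge is
# still in progress -- no need to wait for the whole buffer to be sorted.
runs = [[1, 4, 7], [2, 5, 8], [0, 3, 6]]
stream = mtf_encode(merge(*runs))
first_codes = [next(stream) for _ in range(3)]  # output before merge completes
```

The same shape applies on the real hardware: the core running the final merge streams its output straight to the core running the coder instead of writing it back to main memory first.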
HTH