by over9000 » Tue Aug 06, 2013 2:59 am
Hi,
I'm not an expert on it, but I know generally how the BWT works. You're right that you can't process a full block on one of the cores, but I don't think it's a lost cause. With 16 cores at 32Kbytes each you've got nearly (less program space) 512Kbytes to work with, so say you have space for about half the maximum block size of 900K. If you assign each of the 16 cores ~32K of data (running something like quicksort on it), then do four passes of a merge-sort algorithm (each pass combining pairs of sorted lists: 16 runs -> 8 -> 4 -> 2 -> 1), you've got half the maximum block size sorted. With some careful interleaving (some cores sorting ~32K chunks, others doing the merge operations) and some work to minimise the amount of data transfer (which will presumably be the slowest part), there's no reason why 16 memory-limited cores shouldn't be able to sort through 900Kbytes of data with decent throughput.
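Just to sketch the arithmetic, here's a toy sequential Python stand-in for the chunk-sort-then-merge scheme (on the real hardware each sort and each merge would run on its own core; all names here are made up for illustration):

```python
from heapq import merge

def sort_in_chunks(data, chunk_size):
    # Stage 1: each core sorts its own ~32K slice independently.
    # Here we just split a toy list into chunks and sort each one.
    return [sorted(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]

def merge_pass(runs):
    # One merge pass: combine adjacent pairs of sorted runs.
    # An odd run out is carried forward unchanged.
    out = []
    for i in range(0, len(runs), 2):
        out.append(list(merge(*runs[i:i + 2])))
    return out

data = list(range(64, 0, -1))   # toy stand-in for the 512K working set
runs = sort_in_chunks(data, 4)  # 16 sorted chunks, "one per core"
passes = 0
while len(runs) > 1:
    runs = merge_pass(runs)
    passes += 1                 # 16 -> 8 -> 4 -> 2 -> 1: four passes
```

With 16 initial runs the loop takes four passes, which is where the pass count in the paragraph above comes from (log2 of the number of chunks).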
In practice, I think the BWT's sorting stage may need more memory than this. Does it have to store pointers to where the bytes originally were? If so, quadruple the 900K memory requirement (one byte for the data, three for the index, since 900K offsets fit comfortably in three bytes). Even so, this sort of merge-based sorting algorithm should be able to give good performance, provided main-memory reads/writes aren't overly expensive and/or we have cheap inter-core streaming to let us avoid some main-memory writes.
I'm sure other refinements of the basic recursive merge-sort algorithm are possible. Since writes back to main memory are unavoidable given the multi-pass nature of the algorithm (and the limited local memory), you could front-load a merge phase designed to sort, say, the first quarter or eighth of the list first, applying some cores to the quicksorts, some to the merges, and some to tracking which elements from the original list have been processed (one bit per element). The number of cores needed (roughly) halves with each pass, since a merge is much cheaper than a full quicksort. So if the algorithm is front-loaded to finish the start of the list early, and the BWT feeds into an adaptive (single-pass) compression stage, you can be producing compressed output before you've finished sorting the entire 900Kbyte buffer. This arrangement should also get much better utilisation of all the cores and, thanks to the bit vectors used in the mark-and-sweep stages, potentially fewer accesses to main memory. Of course an adaptive compression scheme won't compress as well as a two-pass one, but the benefit (increased throughput) may be worth it.
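To make the "output before the sort finishes" point concrete, here's a small Python sketch where a lazy final merge feeds a single-pass adaptive stage (move-to-front is just a stand-in for whatever adaptive coder actually follows the BWT; the runs and names are invented for illustration):

```python
from heapq import merge

def mtf_encode(symbols, alphabet_size=256):
    # Move-to-front: a simple single-pass adaptive transform. It consumes
    # its input lazily, one symbol at a time, so it can start emitting
    # codes as soon as the first merged elements are decided.
    table = list(range(alphabet_size))
    for s in symbols:
        r = table.index(s)
        yield r
        table.pop(r)
        table.insert(0, s)

# Sorted runs, as the per-core quicksorts would leave them. heapq.merge is
# a lazy generator, so elements flow into the coder while the merge is
# still in progress -- no need to wait for the whole buffer to be sorted.
runs = [[1, 4, 7], [2, 5, 8], [0, 3, 6]]
stream = mtf_encode(merge(*runs))
first_codes = [next(stream) for _ in range(3)]  # output before merge completes
```

The same shape applies on the real hardware: the core running the final merge streams its output straight to the core running the coder instead of writing it back to main memory first.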
HTH