Thursday, May 24, 2007
Small forays into multiprocessing programming II
So, my attempt at parallelizing the Pythagoran Triples program. I rewrote it as a function to make it easier to parallelize:
struct params { double start; double end; int count; } int pythagorean_triple(void *data) { struct params *p = data; double a; double b; double c; for (p->count = 0 , c = p->start ; c <= p->end ; c++) { for (b = 1 ; b < c ; b++) { a = sqrt(c * c - b * b); if (trunc(a) == a) p->count++; } } return(0); }
That way, I could then do:
/*-------------------------------------- ; the reason for these structures ; is that threads created with clone() ; only get a single pointer parameter. ; This way, I can pass in more data ; using that one pointer. ; ; And yes, I'm using C. Nyah. ;-------------------------------------*/ struct param t1; struct param t2; t1.start = 2; t1.end = 16383; t2.start = 16384; t2.end = 32768; clone(pythagorean_triple,threadstack1,CLONE_VM,&t1); clone(pythagorean_triple,threadstack2,CLONE_VM,&t2);
And I was surprised that it only took one second less when the processing was split using this method than when it ran not split at all.
I ran it several times, and yes, both threads were being created, but one was finishing way sooner than I expected. It was then that I realized the reason—it takes less time to count to a thousand than it does to thirty thousand.
D'uh!
Obvious in hindsight.
Basically, thread one was doing:
for (c = 2 ; c <= 16383 ; c++) for (b = 1 ; b < c ; b++) ... ;
but thread two:
for (c = 16384 ; c <= 32768 ; c++) for (b = 1 ; b < c ; b++) ... ;
Now … how to split the load evenly between two (or more) processors? Obviously splitting the range in half wasn't cutting it, but is that the only way to split the workload? What if I were to have one processor do even numbers, and the other one odd numbers? Something like:
for (c = 2 ; c <= 32768 ; c += 2) for (b = 1 ; b < c ; b++) ... ; for (c = 3 ; c <= 32768 ; c += 2) for (b = 1 ; b < c ; b++) ... ;
Yeah, that might work.
And it did—the workload was evenly distributed across the processors. The code slightly changed though:
struct params { double start; double end; double delta; int count; } int pythagorean_triple(void *data) { struct params *p = data; double a; double b; double c; for ( p->count = 0 , c = p->start ; c <= p->end ; c += p->delta ) { for (b = 1 ; b < c ; b++) { a = sqrt(c * c - b * b); if (trunc(a) == a) p->count++; } } return(0); } { struct params t1; struct params t2; t1.end = t2.end = 32768; t1.delta = t2.delta = 2; t1.start = 2; t2.start = 3; clone(pythagorean_triples,threadstack1,CLONE_VM,&t1); clone(pythagorean_triples,threadstack1,CLONE_VM,&t2); }
This concurrent programming certainly requires a different way of thinking about things. And obvious ways of splitting up workloads don't always work out as expected.