Multicore program crashes or produces garbage when calculating large prime numbers
samuell
Posts: 554
I am designing a program that outputs the prime numbers between two given values. I already did the single-core version and it worked great. However, I've expanded it to multicore and it has issues. The output it produces is correct for small numbers, but the program crashes above a certain range. At first I was not using semaphores, but now I am and the program still crashes. I've attached the code. The single-core program is also attached, just for reference.
Notice that line 92 of PrimesMultiCog.c has a pause statement. It lets the program at least calculate small numbers, but it defeats the purpose. I can get it to calculate larger numbers if I lengthen this pause. This is exactly the behaviour I saw before I bolted the semaphores in. I really don't have a clue what is going on.
I need some insight into what may be happening. Any ideas?
By the way, I'm using SimpleIDE under Kubuntu Linux, if that matters.
Kind regards, Samuel Lourenço
Comments
Thanks, Electrodude and Heater.
I've checked that, and I'm only starting 7 cogs. If I use the cogstart() function that returns an int and print the returned values, I only see numbers from 1 to 7, never -1. The for loop runs from 0 up to, but not including, 7 (0, 1, 2, 3, 4, 5, 6). One friend suggested that I may have excessive lockset and lockclr pairs, causing a deadlock. Also, I'm not differentiating the instances of the function I load into each cog, since I call the same function 7 times and pass a NULL argument. Could that be it?
If I comment out the pause() in my init() function, my program won't start either. Is this a symptom of incorrect cog initiation (by passing NULL), since I'm passing the same function over and over again without differentiation?
Kind regards, Samuel Lourenço
It looks like you're using the same stack for all 7 cogs. They each need separate stacks. If that still doesn't fix it, then try making all of the stacks bigger (128 longs or so instead of only 44).
Hi Electrodude,
I guess separate stacks are in order, and that makes sense, especially if we are calculating large numbers. It explains why the program crashes and why the pause() mitigates the issue: it essentially minimises simultaneous use of the same stack when doing small numbers (with large numbers, two or more cogs are certain to be calculating at the same time, and likewise when the pause is removed).
You may have hit the nail on the head there. I have to try that. Is it prudent to declare an array of stacks? Say something like:
" unsigned int stack[7][44];"
and then:
" cogstart(isprime_cog, NULL, stack, sizeof(stack));"
inside the for cycle to start each cog?
Kind regards, Samuel Lourenço
My bad. You are absolutely right! Thanks for catching that typo.
I'll test this and post in a jiffy.
Here is the updated code. It is still convoluted and there are many things to improve.
Basically, I've created a two-dimensional array, which I've called stacks. I also used your size suggestion. There is no need to use pause() anymore, and the code is insanely fast compared to the single-core version. There are other bugs to catch, but now I know where to look.
Kind regards, Samuel Lourenço
Now, I think the reason the non-optimized code crashes when compiled as LMM is that the code is a tad large (it occupies almost the entire 32KB of RAM). However, by using the "Simple printf" optimization, I can generate a much smaller binary under LMM, but that doesn't solve the issue. The non-optimized version still crashes when compiled that way.
So, can this be due to scarce memory, or am I seeing a "deadlock" phenomenon?
Kind regards, Samuel Lourenço
Also, take a look at the printi function available in the Simple tools. It will save you even more space than __simple_printf.
Interestingly, if I enter a large enough number, it always stops at 1671 and hangs. I tried bumping the stack to 256 after the first time it froze and that didn't help.
Mind that too large a stack size is also detrimental and will halt execution. Normally, too small a stack size will prevent any further calculation of prime numbers. The instability when compiling as LMM is caused by the program being too large, which is indeed the case here.
Kind regards, Samuel Lourenço
Could you run the new code in the attachment? It runs well without my having to increase the stack size. I had it running overnight without any issues.
I've made some modifications, namely:
- sqrt() is no longer used, at the suggestion of a forum member. The code is faster and leaner;
- lockid is no longer volatile, although several cogs use it. This is because this variable is written only once and never changes afterwards.
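The sqrt()-free loop can be sketched as follows: the stopping condition `i * i <= n` replaces `i <= sqrt(n)`, and the product is formed in 64 bits so it cannot wrap for large candidates (`is_prime` here is an illustrative name using fixed-width types, not necessarily the attached code):

```c
#include <stdint.h>

/* Trial division without sqrt(): stop once i*i exceeds n. Promoting i
   to 64 bits keeps i*i from wrapping around for large candidates. */
static int is_prime(uint32_t n) {
    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;                 /* 2 is the only even prime */
    for (uint32_t i = 3; (uint64_t)i * i <= n; i += 2)
        if (n % i == 0)
            return 0;                  /* found a divisor */
    return 1;
}
```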
By the way, I've changed my mind and decided to release it in Portuguese, hence the translation.
Kind regards, Samuel Lourenço
Also, it's important to note that __simple_printf is not part of the "Simple" libraries. print and printi are different (and smaller) serial and formatting implementations than __simple_printf.
Definitely, I'll give it a try when I get home.
I've optimized the code a little bit.
changed to
The code is a bit faster and smaller. To note:
- The variable 'n' doesn't need to be a long long.
- I could have used unsigned int instead of unsigned long, since both are 32 bits long with this compiler. However, int may have different lengths on different systems, whereas the size of a long long is always double the size of a long.
- I've cast the first multiplier to unsigned long long, so the result is computed as an unsigned long long without losing information. Note that "(unsigned long long)n * n" is different from "(unsigned long long)(n * n)": both yield a long long, but only the first performs the multiplication in 64 bits and keeps all the bits; the second multiplies in 32 bits and merely widens the already-truncated result.
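The difference between the two casts can be demonstrated with fixed-width types (uint32_t standing in for this compiler's 32-bit unsigned long; the two helper names are illustrative):

```c
#include <stdint.h>

/* Promote one operand first: the multiplication happens in 64 bits. */
static uint64_t square64(uint32_t n) {
    return (uint64_t)n * n;
}

/* Multiply first in 32 bits, then widen: the high bits are already lost
   before the cast is applied. */
static uint64_t square32_then_cast(uint32_t n) {
    return (uint64_t)(n * n);
}
```

For n = 100000, the first form gives 10000000000, while the second gives 10000000000 mod 2^32 = 1410065408.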
The code is attached.
Kind regards, Samuel Lourenço
The next thing I tried was this, for comparison (essentially the "ancient" code, but using unsigned long for 'n'): it compiled to 27,656B. This code also took the longest to verify the same numbers, around 09:46 minutes.
For reference, the last code I uploaded compiles to 16,784B and takes 07:35 minutes. I should have thought of that solution earlier. It was a newbie mistake to assume that a for () cycle does not re-evaluate its condition on every iteration. The space the code takes matters when it is almost filling the whole memory, but speed is more important here.
I'll have to do a test with large numbers, but I guess that is a keeper.
After much testing, here is the updated code. I've decided to replace all short variables with chars.
The code is attached. Use -O1, -O2 or -Os optimization to compile it. It won't run without optimization (-O0).
Kind regards, Samuel Lourenço
See this very interesting discussion:
http://stackoverflow.com/questions/4930307/fastest-way-to-get-the-integer-part-of-sqrtn
Kind regards, Samuel Lourenço
I've done some more testing, and came to the conclusion that the version that uses sqrt() might have stability issues. I've tested this on a setup overclocked to 128MHz and with the supply voltage pushed up, but I believe it would also have problems at 80MHz. The code that doesn't use sqrt() has no such issues.
Kind regards, Samuel Lourenço
The faster program also runs well in the Activity board.
Hi David,
Are you having integer display issues when you run the code?
Meanwhile, I've revised the code and done some housecleaning. Since I can't find a suitable function to replace sqrt(), I've left it as it is. I've used the "Simple printf" option to compile it and generated a smaller binary (the speed seems to be the same). The code is attached. It runs well on my side. So far it is displaying the 695xxxth prime after three hours of testing.
Kind regards, Samuel Lourenço
I've found an interesting gem regarding sqrt() with integers:
http://www.codecodex.com/wiki/Calculate_an_integer_square_root
Meanwhile, I've given it a try. The code compiles to 16,912B (without the Math lib this time, but now I had to uncheck Simple printf too, or else it wouldn't print the numbers) and the program takes 7:21 minutes to calculate the numbers from 1 to 1000000. Definitely, sqrt() is faster.
Code is attached.
Kind regards, Samuel Lourenço
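One common shift-and-subtract variant of the integer square root routines collected on that page can be sketched like this (a portable sketch, not the exact code from the link):

```c
#include <stdint.h>

/* Bit-by-bit (shift-and-subtract) integer square root.
   Returns floor(sqrt(n)) using only shifts, adds and compares. */
static uint32_t isqrt(uint32_t n) {
    uint32_t res = 0;
    uint32_t bit = 1u << 30;          /* highest power of 4 that fits */

    while (bit > n)                   /* start at the top set bit pair */
        bit >>= 2;

    while (bit != 0) {
        if (n >= res + bit) {
            n -= res + bit;
            res = (res >> 1) + bit;   /* this bit belongs in the root */
        } else {
            res >>= 1;                /* this bit does not */
        }
        bit >>= 2;
    }
    return res;
}
```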
3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27 . . .
You could pre-test division by three like you do for division by two, then keep a second counter in your loop. Skip the modulo test for every 3rd value and just reset the counter back to zero.
If you're compiling as LMM or PASM it should be much faster to increment a number and test for == 3 than to do the division involved in the modulo. You could use the same trick for 5's, too, saving you some additional modulo tests, but at some point keeping track of the extra counters will have more overhead than just doing the division.
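The counter idea can be sketched as follows (`candidates` is a hypothetical helper, not from the attached code; it collects the odd candidates while skipping multiples of 3 with a wrap-around counter instead of an `n % 3` division):

```c
/* Collect the odd candidates from 5 up to limit, skipping multiples of
   3 with a counter: among 5, 7, 9, 11, 13, 15, ... every third value is
   a multiple of 3, so hitting count == 3 replaces the modulo test.
   Returns how many candidates were written to out[]. */
static int candidates(unsigned limit, unsigned out[]) {
    int k = 0;
    int count = 0;                 /* cycles 1, 2, 3, 1, 2, 3, ... */
    for (unsigned n = 5; n <= limit; n += 2) {
        if (++count == 3) {        /* n is a multiple of 3 */
            count = 0;             /* reset and skip, no division done */
            continue;
        }
        out[k++] = n;
    }
    return k;
}
```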
It is the same with 5 too, and with every other prime as a matter of fact; that is how the Sieve of Eratosthenes works. Ideally, the program should calculate prime numbers recursively, that is, start from 2, then 3, and then use the previously calculated primes to test the new numbers. Of course, it would need a lot of memory.
Your solution implies further complexity without any significant speed improvement. You would avoid dividing by those numbers, but on the other hand you would need to test whether the divisor was a multiple of three. That would imply two test divisions per cycle instead of just one.
Kind regards, Samuel Lourenço
I'll have to test that, but it will sure complicate the algorithm.
Thanks!
https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
If you calculate all primes beginning from 2 and store them to be able to skip their multiples, you'll get a similar size problem in the data section instead of the code space...
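The classic array-based sieve being discussed can be sketched like this, making the data-section cost explicit: one flag per number up to the limit (`count_primes` is an illustrative helper, not from the thread's attachments):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Sieve of Eratosthenes with one flag per number up to limit: this
   limit+1 byte allocation is exactly the memory cost discussed above.
   Returns the number of primes <= limit, or 0 on allocation failure. */
static unsigned count_primes(unsigned limit) {
    bool *composite = calloc((size_t)limit + 1, sizeof *composite);
    if (composite == NULL)
        return 0;

    for (unsigned long i = 2; i * i <= limit; i++)
        if (!composite[i])                          /* i is prime */
            for (unsigned long j = i * i; j <= limit; j += i)
                composite[j] = true;                /* strike multiples */

    unsigned count = 0;
    for (unsigned i = 2; i <= limit; i++)
        if (!composite[i])
            count++;

    free(composite);
    return count;
}
```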
Here is a Python example for the latter case... it is quite fast but needs large lists (abusing a dictionary as a sparse array) and sooner or later will make even systems with gigabytes of memory cry for more swap space: