Multicore program crashes or produces garbage when calculating large prime numbers
samuell
Posts: 554
I am designing a program that outputs the prime numbers between two given values. I already did the single-core version and it worked great. However, I've expanded it to multicore and it has issues. The output it produces is correct for small numbers, but the program crashes above a certain range. At first I was not using semaphores, but now I am and the program still crashes. I've attached the code. The single-core program is also attached, just for reference.
Notice that line 92 of PrimesMultiCog.c has a pause statement. It lets the program at least calculate small numbers, but it defeats the purpose. I can get it to calculate larger numbers if I lengthen this pause. This is exactly the behaviour I saw before I bolted the semaphores in. I really don't have a clue what is going on.
I need some insight into what may be happening. Any ideas?
By the way, I'm using SimpleIDE under Kubuntu Linux, if that matters.
Kind regards, Samuel Lourenço
Comments
Thanks, Electrodude and Heater.
I've checked that, and I'm only starting 7 cogs. If I use the cogstart() function that returns an int and print the returned values, I only see numbers from 1 to 7, never -1. The for loop runs from 0 up to, but not including, 7 (0, 1, 2, 3, 4, 5, 6). One friend suggested that I may have excessive lockset and lockclr pairs, causing a deadlock. Also, I'm not differentiating the instances of the function I load into each cog, since I call the same function 7 times and pass a NULL argument. Could that be it?
If I comment out the pause() in my init() function, my program won't start either. Is this a symptom of incorrect cog initiation (by passing NULL), since I'm passing the same function over and over again without differentiation?
Kind regards, Samuel Lourenço
It looks like you're using the same stack for all 7 cogs. They each need separate stacks. If that still doesn't fix it, then try making all of the stacks bigger (128 longs or so instead of only 44).
Hi Electrodude,
I guess separate stacks are in order, and that makes sense, especially if we are calculating large numbers. It explains why the program crashes and why the pause() mitigates the issue: it essentially minimises simultaneous use of the same stack when doing small numbers (with large numbers, two or more cogs are certain to be calculating at the same time, and likewise when the pause is removed).
You may have hit the nail on the head there. I have to try that. Is it prudent to declare an array of stacks? Say something like:
" unsigned int stack[7][44];"
and then:
" cogstart(isprime_cog, NULL, stack, sizeof(stack));"
inside the for cycle to start each cog?
Kind regards, Samuel Lourenço
My bad. You are absolutely right! Thanks for catching that typo.
I'll test this and post in a jiffy.
Here is the updated code. It is still convoluted and there are many things to improve.
Basically, I've created a two-dimensional array, which I've called stacks. I also used your size suggestion. There is no need to use pause() anymore, and the code is insanely fast compared to the single-core version. There are other bugs to catch, but now I know where to look.
Kind regards, Samuel Lourenço
Now, I think the reason the non-optimized code crashes when compiled as LMM is that the code is a tad large (it occupies almost the entire 32KB of RAM). However, by using the "Simple printf" optimization, I can generate a much smaller binary under LMM, but that doesn't solve the issue. The non-optimized version still crashes when compiled that way.
So, can this be due to scarce memory, or am I seeing a "deadlock" phenomenon?
Kind regards, Samuel Lourenço
Also, take a look at the printi function available in the Simple tools. It will save you even more space than __simple_printf.
Interestingly, if I enter a large enough number, it always stops at 1671 and hangs. I tried bumping the stack to 256 after the first time it froze and that didn't help.
Mind that too large a stack size is also detrimental and will halt execution. Normally, too small a stack size will prevent any further calculation of prime numbers. The instability when compiling as LMM is caused by the program being too large, which is indeed the case here.
Kind regards, Samuel Lourenço
Could you run the new code in the attachment? It runs well without my having to increase the stack size. I had it running overnight without any issues.
I've made some modifications, namely:
- sqrt() is no longer used, at the suggestion of a forum member. The code is faster and leaner;
- lockid is no longer volatile, although several cogs use it. This is because this variable is written only once and never changes afterwards.
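The sqrt()-free loop can be sketched as follows: the stopping condition `i * i <= n` replaces `i <= sqrt(n)`, and the product is formed in 64 bits so it cannot wrap for large candidates (`is_prime` here is an illustrative name using fixed-width types, not necessarily the attached code):

```c
#include <stdint.h>

/* Trial division without sqrt(): stop once i*i exceeds n. Promoting i
   to 64 bits keeps i*i from wrapping around for large candidates. */
static int is_prime(uint32_t n) {
    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;                 /* 2 is the only even prime */
    for (uint32_t i = 3; (uint64_t)i * i <= n; i += 2)
        if (n % i == 0)
            return 0;                  /* found a divisor */
    return 1;
}
```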
By the way, I've changed my mind and decided to release it in Portuguese, hence the translation.
Kind regards, Samuel Lourenço
Also, it's important to note that __simple_printf is not part of the "Simple" libraries. print and printi are different (and smaller) serial and formatting implementations than __simple_printf.
Definitely, I'll give it a try when I get home.
I've optimized the code a little bit.
changed to
The code is a bit faster and smaller. To note:
- The variable 'n' doesn't need to be a long long.
- I could have used unsigned int instead of unsigned long, since both are 32 bits long with this compiler. However, int may have different lengths on different systems, whereas the size of a long long is always double the size of a long.
- I've cast the first multiplier to unsigned long long, so the result is computed as an unsigned long long without losing information. Note that "(unsigned long long)n * n" is different from "(unsigned long long)(n * n)": both yield a long long, but only the first performs the multiplication in 64 bits and keeps all the bits; the second multiplies in 32 bits and merely widens the already-truncated result.
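The difference between the two casts can be demonstrated with fixed-width types (uint32_t standing in for this compiler's 32-bit unsigned long; the two helper names are illustrative):

```c
#include <stdint.h>

/* Promote one operand first: the multiplication happens in 64 bits. */
static uint64_t square64(uint32_t n) {
    return (uint64_t)n * n;
}

/* Multiply first in 32 bits, then widen: the high bits are already lost
   before the cast is applied. */
static uint64_t square32_then_cast(uint32_t n) {
    return (uint64_t)(n * n);
}
```

For n = 100000, the first form gives 10000000000, while the second gives 10000000000 mod 2^32 = 1410065408.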
The code is attached.
Kind regards, Samuel Lourenço
The next thing I tried was this, for comparison (essentially the "ancient" code, but using unsigned long for 'n'): it compiled to 27,656B. This code also took the longest to verify the same numbers, around 09:46 minutes.
For reference, the last code I uploaded compiles to 16,784B and takes 07:35 minutes. I should have thought of that solution earlier. It was a newbie mistake to assume that a for () cycle does not re-evaluate its condition on every iteration. The space the code takes matters when it is almost filling the whole memory, but speed is more important here.
I'll have to do a test with large numbers, but I guess that is a keeper.
After much testing, here is the updated code. I've decided to replace all short variables with chars.
The code is attached. Use -O1, -O2 or -Os optimization to compile it. It won't run without optimization (-O0).
Kind regards, Samuel Lourenço
See this very interesting discussion:
http://stackoverflow.com/questions/4930307/fastest-way-to-get-the-integer-part-of-sqrtn
Kind regards, Samuel Lourenço
I've done some more testing, and came to the conclusion that the version that uses sqrt() might have stability issues. I've tested this on a setup overclocked to 128MHz and with the supply voltage pushed up, but I believe it would also have problems at 80MHz. The code that doesn't use sqrt() has no such issues.
Kind regards, Samuel Lourenço
The faster program also runs well in the Activity board.
Hi David,
Are you having integer display issues when you run the code?
Meanwhile, I've revised the code and done some housecleaning. Since I can't find a suitable function to replace sqrt(), I've left it as it is. I've used the "Simple printf" option to compile it and generated a smaller binary (the speed seems to be the same). The code is attached. It runs well on my side. So far it is displaying the 695xxxth prime after three hours of testing.
Kind regards, Samuel Lourenço
I've found an interesting gem regarding sqrt() with integers:
http://www.codecodex.com/wiki/Calculate_an_integer_square_root
Meanwhile, I've given it a try. The code compiles to 16,912B (without the Math lib this time, but now I had to uncheck Simple printf too, or else it wouldn't print the numbers) and the program takes 7:21 minutes to calculate the numbers from 1 to 1000000. Definitely, sqrt() is faster.
Code is attached.
Kind regards, Samuel Lourenço
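One common shift-and-subtract variant of the integer square root routines collected on that page can be sketched like this (a portable sketch, not the exact code from the link):

```c
#include <stdint.h>

/* Bit-by-bit (shift-and-subtract) integer square root.
   Returns floor(sqrt(n)) using only shifts, adds and compares. */
static uint32_t isqrt(uint32_t n) {
    uint32_t res = 0;
    uint32_t bit = 1u << 30;          /* highest power of 4 that fits */

    while (bit > n)                   /* start at the top set bit pair */
        bit >>= 2;

    while (bit != 0) {
        if (n >= res + bit) {
            n -= res + bit;
            res = (res >> 1) + bit;   /* this bit belongs in the root */
        } else {
            res >>= 1;                /* this bit does not */
        }
        bit >>= 2;
    }
    return res;
}
```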
3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27 . . .
You could pre-test division by three like you do for division by two, then keep a second counter in your loop. Skip the modulo test for every 3rd value and just reset the counter back to zero.
If you're compiling as LMM or PASM it should be much faster to increment a number and test for == 3 than to do the division involved in the modulo. You could use the same trick for 5's, too, saving you some additional modulo tests, but at some point keeping track of the extra counters will have more overhead than just doing the division.
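The counter idea can be sketched as follows (`candidates` is a hypothetical helper, not from the attached code; it collects the odd candidates while skipping multiples of 3 with a wrap-around counter instead of an `n % 3` division):

```c
/* Collect the odd candidates from 5 up to limit, skipping multiples of
   3 with a counter: among 5, 7, 9, 11, 13, 15, ... every third value is
   a multiple of 3, so hitting count == 3 replaces the modulo test.
   Returns how many candidates were written to out[]. */
static int candidates(unsigned limit, unsigned out[]) {
    int k = 0;
    int count = 0;                 /* cycles 1, 2, 3, 1, 2, 3, ... */
    for (unsigned n = 5; n <= limit; n += 2) {
        if (++count == 3) {        /* n is a multiple of 3 */
            count = 0;             /* reset and skip, no division done */
            continue;
        }
        out[k++] = n;
    }
    return k;
}
```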
It is the same with 5 too, and with every other prime as a matter of fact; that is how the Sieve of Eratosthenes works. Ideally, the program should calculate prime numbers recursively, that is, start from 2, then 3, and then use the previously calculated primes to test the new numbers. Of course, it would need a lot of memory.
Your solution implies further complexity without any significant speed improvement. You would avoid dividing by those numbers, but on the other hand you would need to test whether the divisor was a multiple of three. That would imply two test divisions per cycle instead of just one.
Kind regards, Samuel Lourenço
I'll have to test that, but it will sure complicate the algorithm.
Thanks!
https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes
If you calculate all primes beginning from 2 and store them to be able to skip their multiples, you'll get a similar size problem in the data section instead of the code space...
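The classic array-based sieve being discussed can be sketched like this, making the data-section cost explicit: one flag per number up to the limit (`count_primes` is an illustrative helper, not from the thread's attachments):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Sieve of Eratosthenes with one flag per number up to limit: this
   limit+1 byte allocation is exactly the memory cost discussed above.
   Returns the number of primes <= limit, or 0 on allocation failure. */
static unsigned count_primes(unsigned limit) {
    bool *composite = calloc((size_t)limit + 1, sizeof *composite);
    if (composite == NULL)
        return 0;

    for (unsigned long i = 2; i * i <= limit; i++)
        if (!composite[i])                          /* i is prime */
            for (unsigned long j = i * i; j <= limit; j += i)
                composite[j] = true;                /* strike multiples */

    unsigned count = 0;
    for (unsigned i = 2; i <= limit; i++)
        if (!composite[i])
            count++;

    free(composite);
    return count;
}
```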
Here is a Python example for the latter case... it is quite fast but needs large lists (abusing a dictionary as a sparse array) and sooner or later will make even systems with gigabytes of memory cry for more swap space: