bsr or (int)log2 in Prop C / PASM help too (Optimization Problem)
mikeologist
Posts: 337
I need a bsr type function for integers in Prop C.
In the past I've used (int)log2(arg) because it works quickly on systems with FPUs. I assume this uses the processor's log tables to perform log(arg)/log(2).
Since the propeller has no FPU I am concerned about how long this takes, so I'm trying to test it.
Here is my code so far:
and the results:
I get the expected results including a negative value for undefined: log2(0).
There are a few issues:
No log2 in this gcc?
How do I read the PC so I can test the speed in clocks? Is this a valid method of timing?
Is there a better(faster) way than either of these? I read through the propeller manual and I did not find anything on log2 or bsr.
In the past I've used (int)log2(arg) because it works quickly on systems with FPUs. I assume this uses the processor's log tables to perform log(arg)/log(2).
Since the propeller has no FPU I am concerned about how long this takes, so I'm trying to test it.
Here is my code so far:
#include "simpletools.h" // Include simple tools //#include <math.h> int main() { // Entry point int index; // location within the binary integer int pow2a; // Characteristic sans mantissa, first test int pow2b; // Characteristic sans mantissa, second test int num_in; // log2 argument unsigned int startA; unsigned int endA; unsigned int startB; unsigned int endB; while(1) { // Primary loop print("Provide an integer: "); // Prompt user scan("%d\n", &num_in); // Collect input //startA = __current_pc(); index = 32; // length of int in bits pow2a = -1; // flag value while((pow2a == -1) & (index-- > 0)) { // While no characteristic and bits remain, decrement position if(num_in & (1 << index)) { // if the current bit is on pow2a = index; // record its location as the log2 characteristic break; // and exit } } //endA = __current_pc(); //startB = __current_pc(); // pow2b = (int)log2(num_in); // use log tables? log(num_in)/log(2)? cast to integer pow2b = (int)(log(num_in)/log(2)); // guess that answers ^ //endB = __current_pc(); //print("Result A: %d time: %d Result B: %d time: %d \n", pow2a, (endA-startA), pow2b, (endB-startB)); // display the results print("Result A: %d Result B: %d \n", pow2a, pow2b); // display the results } }
and the results:
Provide an integer: 8675309 Result A: 23 Result B: 23 Provide an integer: 0 Result A: -1 Result B: -2147483648 Provide an integer: 1 Result A: 0 Result B: 0 Provide an integer: 24601 Result A: 14 Result B: 14 Provide an integer: 2 Result A: 1 Result B: 1 Provide an integer: 3 Result A: 1 Result B: 1 Provide an integer: 4 Result A: 2 Result B: 2 Provide an integer: 90210 Result A: 16 Result B: 16
I get the expected results including a negative value for undefined: log2(0).
There are a few issues:
No log2 in this gcc?
How do I read the PC so I can test the speed in clocks? Is this a valid method of timing?
Is there a better(faster) way than either of these? I read through the propeller manual and I did not find anything on log2 or bsr.
Comments
The propeller doesn't have a BSR instruction. Finding the most significant 1 bit has to be done in software. The prop does have a log lookup table, but I think it's only accurate to 12 bits. So if that accuracy is OK you could use the PASM code that is in the Prop manual to implement an integer version of log2.
The propeller system counter is accessible through the CNT variable. Include propeller.h to get the definition for it.
Oh so slow:
Does CNT give an accurate reading here? If so, What a difference!
Would this be faster in PASM? I think the answer is almost certainly yes. If so I can figure it out and use this guide to inject the PASM directly.
Thanks for your help. Any pitfalls that you can see or recommended reading, please?
from:
Which is pretty much the same as mine.
I think bitwise encoding -1 in PASM is likely to be the fastest way. Now I just have to figure out how to fill in the TODO.
You don't mention the scope of your calculation... that is how many potential populations are you interested in. I found that for 8 bit images... at least for measuring information... it is possible to do the frequency based calculations once and store the results... multiply them out to remove the decimals and then put the results in a file, which is read in at run time. That reduces the problem to a sum over the population... about a 8-12 clocks per calculation before parsing the result into a final float.
I have many uses for this in my float-free math library that I'm trying to cram into a propeller.
One specific use is converting integers to array indexes for my partial predictive factoring method.
In this the highest power of two points to 1 of 32 columns in a 2D array of ints, where the row is selected by the int's hamming weight mod(4), which is also a simple bitwise opertion.
I'm not opposed to a file. I'm looking for the smallest fastest option; could be code, could be a table.
Assuming a 32bit word, with bits numbered 1-32 I think this might work.
Add the operand to ff:ff:00:00. Update C but not Z
If it sets carry, then the first nonzero bit is at location 17 or more so add 16 to your result (do not update c or z)
If c was not set then shift the operand left by 16 places. (And update z)
Add ff:00:00:00 to the already shifted operand and update c but not z
If it sets carry, add 8 to result (don't update c or z)
If not, rotate left by 8
Add f0:00:00:00 and update c but not z
If c, then add 4 to result
If not shift left 4
Add c0:00:00:00 and update c but not z
If c then add two to result
If not, shift left by two
Add 80:00:00:00 and update c but not z
If c then add 1 to result
If not then if z is not set (from the first shift, that's why we don't update it) then add one more.
Unless I've made some error in logic, that should get you the index of the first nonzero bit, or zero if they are all zero and should take something like 3*log2(n) instructions on a propeller. (32 bits is 2^5, so I takes about 5 rounds or 15 instructions to find it.
https://graphics.stanford.edu/~seander/bithacks.html#IntegerLog
Looks like they could be done in PASM pretty neatly. The first one could do with some loop unrolling.
Better than brute force shifting through 32 bits.
@Heater
Thanks a bunch, that's the stuff I was looking for. Now I've got plenty to work on over Christmas :-D
Results to follow
-Phil
But don't we have to AND the masks with x rather than TEST?
TEST will leave the x unchanged. Which means any lower bits it has will still be there for later TESTs. Which will then be non-zero when they should not be.
Sadly I'm way out in the Finnish forest far away from any Propellers to test this with.
Thank you very much @Phil
I got it to work inline and it's nice and fast, here are the results:
Results are a bit off, though
and here is the code:
In my inexperience I couldn't get "_ffff0000 long $ffff0000" and the rest to work, compiler said unknown command, but I adapted. Does that impact the performance?
Either way it's a lot better than mine :cool:
Thanks again
With a functioning example I think I can move forward.
Could you try changing those "test" instructions to "and".
That will overwrite y so you will need to save that first. Or copy it to a temporary "t" and work on that.
Move forward as in fixing it :thumb:
I'll make a table tomorrow and apply your suggestions, thank you.
I should have interpreted "functioning example" as "compiles and running but malfunctioning example"
Hmm, yes quite
x = $00018000
The correct answer is 16. But that second one bit screws things up, causing an additional 8, 4, 2, 1, and 1 to be added to y. Dang!
Try this instead:
This is actually the program I started with. But I thought it would be less tricky-looking to use add and if_nz. I neglected to consider that z and nz are not symmetrical opposites.
-Phil
Andy
Thanks a bunch, Andy;
This taught me a good bit more about C
I put it in directly, it's fast and accurate:
Case 8192 has me curious.
Here's the updated code: Challenge accepted
Switch to LMM will cause the compiler to generate hardware assembly and run much faster but use more memory.
Mike
Thank you.
Reading more on the subject here. Is there any more reading that you'd recommend? Specifically a sample of C that compiles into the COG memory mode and any theory.
-Phil
Excellent, thank you
I look forward to testing it.
I'm also making a binary search model for the same goal just to see what I can see
I'll keep posting code and results
Merry Christmas
-Phil
That will isolate the computation from the parameter acquisition, should they become more separated.
-Phil
I made the BST method fully unrolled to begin with and tucked the '0' case way down in and I think the nesting of logic has a major speed impact: (This would need to be condensed to a loop in order to be size and speed effective)
It's the fastest that I had made up to that point. Then I looked at @Ariba code again and removed the nested logic and it sped up by a factor of 10 and turned it into a big theta, I like big theta. So far the modified Ariba is the winner. My next step is to test the code that you were kind enough to provide. I will share my results
-Phil