fast int sqrt for the prop
lonesock
Posts: 917
Hi, All.
I came up with a quick & simple integer sqrt function targeting the propeller (just got my 1st protoboard!!). It does not require multiplication, using only bit-shifts (for those not familiar with C notation, a<<1 = a*2, and a>>2 = a/4) and addition and minimal branching. I can't write assembly yet, so I'll just post the C version I used for testing on my desktop. This may already be a standard way of computing integer square roots, but I did not find anything related when doing a quick Google search. Enjoy!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
I came up with a quick & simple integer sqrt function targeting the propeller (just got my 1st protoboard!!). It does not require multiplication, using only bit-shifts (for those not familiar with C notation, a<<1 = a*2, and a>>2 = a/4) and addition and minimal branching. I can't write assembly yet, so I'll just post the C version I used for testing on my desktop. This may already be a standard way of computing integer square roots, but I did not find anything related when doing a quick Google search. Enjoy!
/* Jonathan "lonesock" Dummer Successive Approximation Integer Square Root 2008-05-21-10.24 Features: * simple * scalable input range * no multiply, just bitshift, add, sub, and minimal branching * developed for the Parallax Propeller chip */ const int sqrt_input_bits = 16; int ls_sqrt( int n ) { /* my eventual answer */ int x = 0; /* start with the highest (output) bit and work down */ for( int b = (sqrt_input_bits>>1)-1; b >= 0; --b ) { /* the difference between x^2 now, and if I add (1<<b) to x is: (x + (1<<b))^2 - x^2 x^2 + 2*x*(1<<b) + (1<<b)^2 - x^2 ((x+x)<<b) + (1<<(b+b)) (x+x+(1<<b))<<b */ int dx2 = ((1<<b)+x+x)<<b; /* is n large enough to handle the delta in x^2? */ if( n >= dx2 ) { /* if so, remove it from n */ n -= dx2; /* and add this bit into x */ x += 1 << b; } } /* and the winner is... */ return x; }
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
Comments
Azlan
Post Edited (Praxis) : 5/21/2008 7:59:16 PM GMT
I did want to mention, since it is probably not obvious from my code, that the loop is only executed once per output bit. So if you have a 16-bit input range, the whole loop will only be executed 8 times. For a 32-bit integer, 16 times, etc. This should give a nice quick execution even for extended input ranges, and the inner loop isn't too complicated.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
It takes at most 15 iterations (when suitably modified for a 32-bit input).
Azlan's code is a simpler algorithm, but it is much slower; it takes as many iterations as the return
value, which for a 32-bit signed input can be as high as 46,340.
Note that Spin *has* a sqrt function already so at least in Spin neither of these is needed.
Leon
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM
Suzuki SV1000S motorcycle
' y = 32-bit unsigned input
' x = 16-bit square root output
sqrt··················· mov···· x,#0
······················· mov···· t1,#0
······················· mov···· t2,#16
:loop·················· shl···· y,#1··········· wc
······················· rcl···· t1,#1
······················· shl···· y,#1··········· wc
······················· rcl···· t1,#1
······················· shl···· x,#2
······················· or····· x,#1
······················· cmpsub· t1,x··········· wc
······················· shr···· x,#2
······················· rcl···· x,#1
······················· djnz··· t2,#:loop
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
@Chip: I will definitely be playing with that code today, thanks!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
btw, just working from your example, Chip, I think this is my version rendered to asm
It is shorter than the posting you listed, but I'm sure I've made some hideous mistakes Note that my version could be further shortened if:
1) instead the 1st two commands inside :loop (mov, shl) I could combine them somehow, maybe with a MUX type command?
2) I could work with 2*x the whole time, and then just do a SHR at the end (this would save one of the "add t1,x" commands)
Am I way off base?
thanks,
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
To calculate a square root in Spin you use the operator: ^^
Example: n := ^^ 49
(this operator calls the ASM routine which Chip has posted above, inside the Interpreter cog).
Andy
OK, there _were_ some hideous mistakes in my assembly code posted above. First, I needed my loop to go from 15 down to 0, not 16 to 1. And secondly I need the incremental values of x each iteration, so I can't use the "rcl x,#1" trick. Here is the working assembly version of my implementation:
It runs ~20% faster than the asm version inside spin, but with a higher instruction count, and one extra temporary variable. As you can see, I compute bits 15 down to 1 inside the loop, then afterward I compute bit 0. The bit 0 section seems awkward to me...can any asm guru suggest a more elegant way to set bit 0 of x if (2x+1 <= y)?
I'm attaching a zip file of my Spin test files. "validate_sqrt.spin" compares my value against the spin assembly value in a loop, then lights a LED if everything matched (you will need to change the pin, I'm not using the demo board. At 80MHz the propeller can validate about 66k sqrts per second, so be prepared to wait depending on the initial count value you use). "test_sqrt.spin" starts up 3 cogs, 1 for straight Spin, 1 for the original Spin assembly version, and one for the new version of sqrt. Every 10,000 sqrts the cog will toggle an LED, for a visual speed reference. You will probably need to change these pins too.
thanks,
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
Post Edited (lonesock) : 5/28/2008 4:36:07 PM GMT
Latest updated version:
This version is 25% faster than the Spin assembly version (so the same speed as my previous version), but now it uses the same number of temporary variables, and less overall code (making it a suitable replacement for the sqrt inside Spin...wink wink). It can be sped up even more, at the expense of instruction count, by semi-unrolling the loop; specifically the 1st iteration where x = 0, and the final iteration where bit2 = 0 can be performed explicitly saving about 6 executed operations overall, but adding another 8(ish) instructions to the function.
For anybody following along, this version looks radically different from the previous incarnation. Here's what changed:
* modified the bit counter to be 2*bit, starting at 32 but then immediately decrementing by 2 so my loop goes from 30 down to 0 in steps of -2
* x uses the rcl trick, instead of placing the bit in the correct location right off the bat, which means...
* I need to modify x inside the loop to regain it's proper alignment. Instead of delta_x^2 = ((1<<bits)+2x)<<bits), now delta_x^2 = (4x+1)<<(2*bits)
Timing on my prop at 80MHz
* just Spin with ^^, no asm: 35,000 sqrts/sec
* asm Spin version: 120,250 sqrts/sec
* my version: 150,600 sqrts/sec
* unrolled: 157,600 sqrts/sec
Does anyone see a way to further optimize / shorten the code?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
Post Edited (lonesock) : 5/28/2008 4:42:25 PM GMT
That looks fantastic. I love it when things get smaller and faster. Good Job!
Who knows, it might even be possible to make it smaller, yet, but I can't see how.
Do you think a hardware implementation of sqrt would be good for the next Propeller if it took, say, 16 clocks to execute, and was implemented as a register r/w rather than an instruction (no holding things up)? You could write the 32-bit value and 16+ clocks afterwards read back the 16-bit root.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
I looked at your code and wondered if the·root-in-progress could be maintained in-situ for the CMPSUB test. I was able to shave two instructions·out of·the loop! SUMC·works like magic here:
······· mov···· root,#0········ 'reset root
······· mov···· mask,h40000000· 'reset mask (constant·in register)
:loop
······· or····· root,mask······ 'set trial bit
······· cmpsub· input,root· wc· 'subtract root from input if fits
······· sumnc·· root,mask······ 'cancel trial bit, set root bit if fit
······· shr···· root,#1········ 'shift root down
······· shr···· mask,#2········ 'shift mask down
······· tjnz··· mask,#:loop···· 'loop until mask empty
This could be straight-line coded in only 4 instructions per iteration for 312,500 roots per second:
······· mov···· root,h40000000
······· cmpsub· input,root· wc
······· sumnc·· root,h40000000
······· shr···· root,#1······
······· or····· root,h10000000
······· cmpsub· input,root· wc
······· sumnc·· root,h10000000
······· shr···· root,#1·······
······· or····· root,h04000000
······· cmpsub· input,root· wc
······· sumnc·· root,h04000000
······· shr···· root,#1·······
······· or····· root,h01000000
······· cmpsub· input,root· wc
······· sumnc·· root,h01000000
······· shr···· root,#1·······
······· or····· root,h00400000
······· cmpsub· input,root· wc
······· sumnc·· root,h00400000
······· shr···· root,#1·······
······· or····· root,h00100000
······· cmpsub· input,root· wc
······· sumnc·· root,h00100000
······· shr···· root,#1·······
······· or····· root,h00040000
······· cmpsub· input,root· wc
······· sumnc·· root,h00040000
······· shr···· root,#1·······
······· or····· root,h00010000
······· cmpsub· input,root· wc
······· sumnc·· root,h00010000
······· shr···· root,#1·······
······· or····· root,h00004000
······· cmpsub· input,root· wc
······· sumnc·· root,h00004000
······· shr···· root,#1·······
······· or····· root,h00001000
······· cmpsub· input,root· wc
······· sumnc·· root,h00001000
······· shr···· root,#1·······
······· or····· root,h00000400
······· cmpsub· input,root· wc
······· sumnc·· root,h00000400
······· shr···· root,#1·······
······· or····· root,#$100
······· cmpsub· input,root· wc
······· sumnc·· root,#$100
······· shr···· root,#1·······
······· or····· root,#$40
······· cmpsub· input,root· wc
······· sumnc·· root,#$40
······· shr···· root,#1·······
······· or····· root,#$10
······· cmpsub· input,root· wc
······· sumnc·· root,#$10
······· shr···· root,#1·······
······· or····· root,#$4
······· cmpsub· input,root· wc
······· sumnc·· root,#$4
······· shr···· root,#1·······
······· or····· root,#$1
······· cmpsub· input,root· wc
······· sumnc·· root,#$1
······· shr···· root,#1·······
I·always thought that the Spin interpreter's square-root routine was as small as possible, but you found a much better way, which·got me to think about it differently, which produced even another improvement. How much better could the whole interpreter be, I wonder?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
And Chip got an even better solution by using sumnc. Well somehow that add/andn felt a bit odd [noparse];)[/noparse]
Brilliant, both of you!
@ Chip:
That is blindingly fast, especially unrolled! (You do lose space due to the h* constants, on top of the instruction count, correct?) But I'm happy with the looped version, thanks!
Regarding having a native sqrt, I would love to have it, but priority-wise it is certainly lower than hardware MUL and DIV instructions. I'd hate to be lynched by mobs of prop-fans ("We were going to include a HW MUL, but that lonesock guy demanded a sqrt, and we just didn't have time to do both..."). (btw, I'm using sqrt to compute standard deviations, playing with some audio compression methods, so I'm computing it per block, not per sample, so it's certainly fast enough for my needs now!)
Regarding optimizing the entire Spin interpreter, there's always room for improvements Maybe you could hold little mini-forum-contests, one function at a time (Protoboard prizes or something would be fun!) A nice goal would be shrinking the entire interpreter down to a size where you could call an actual inline assembly function if the whole thing fit in, say, 16 longs. What do you think?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
It's almost small enough to consider putting it in the LMM kernel, hmm... I wonder, may be what we will do in the compiler is to have a "Helper COG." All the extra useful functions that don't fit in the LMM kernel COG can live there. e.g. string functions, fast sqrt, the low level floating point primitives etc. This COG will be similar to providing a shared library version of these routines....
The only CON is that we will use up one more COG, but that still leaves 6....
I know we all suffer from paranoia over simple stuff like this, but let's·cultivate some freedom here. Let's enjoy engineering!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Graham
Fast squares plus fast square root = fast hypotenuse.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 5/29/2008 10:54:11 AM GMT
btw, ImageCraft, I'm really enjoying your C compiler (only played with the first Beta, still need to try out #4!). Thanks for the great work. If there is anything I can do to contribute, just let me know.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
www.piclist.com/techref/microchip/math/sq/index.htm
(of course it would be fun to work it out from scratch )
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
http://www.indoor.flyer.co.uk/kinematics.htm
Scroll down a little way and see how simple the kinematics of some machines can be, just need squares and roots.
Graham
Here's how it was (note that 'h40000000' is a register containing $40000000):
······· mov···· root,#0········ 'reset root
······· mov···· mask,h40000000· 'reset mask (constant·in register)
:loop·· or····· root,mask······ 'set trial bit
······· cmpsub· input,root· wc· 'subtract root from input if fits
······· sumnc·· root,mask······ 'cancel trial bit, set root bit if fit
······· shr···· root,#1········ 'shift root down
······· shr···· mask,#2········ 'shift mask down
······· tjnz··· mask,#:loop···· 'loop until mask empty
Here's the smaller version (now 'mask' is a register containing $40000000 which gets used directly and rotated all the way back to its original value):
······· mov···· root,#0········ 'reset root
·:loop· or····· root,mask······ 'set trial bit
······· cmpsub· input,root· wc· 'subtract root from input if fits
······· sumnc·· root,mask······ 'cancel trial bit, set root bit if fit
······· shr···· root,#1········ 'shift root down
······· ror···· mask,#2·····wc· 'shift mask down (wraps on last iteration)
·if_nc· jmp···· #:loop······ ·· 'loop until mask restored
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
I want to get the bits I have done shrunk working before I post it. Chip has given me permission to post the code - Thanks Chip
I have a spin debugger semi working and a single-stepper nearly working (uses 8 longs in the interpreter and the code resides in hub - there are a few caveats because the Interpreter can execute in more that 1 cog concurrently and stumped me for a while!). I need this to verify the Ram Interpreter.
I have a question: Does anyone know if the Prop I ignores the upper 16 bits in a pointer to hub ram ?
hptr long $5555_7000
rdlong x,hptr
i.e. Does the $5555_0000 get ignored or must it be $0000_7000 ($7000 is the hub ram address)? (I am away and no hardware to test it on).
Posteditted: I'll add the improved sqrt·if that's ok. I haven't looked at multiply or divide, so any suggestions would be good.
Post Edited (Cluso99) : 8/27/2008 8:09:44 AM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
You have made me rethink onvernight. I am going to tidy the Ram Interpreter and post it untested with a statement of what I know works and where it is up to, so watch out for the thread. Hopefully I might get that done on the plane today.
I will also start· a thread on spin and asm debuggers which·I have sort of working, as it may inspire the input like this thread has.
As long as you're making the gesture (and maybe now regretting it ), a very useful set of register-based hardware macro functions would be:
····32 * 32 = 64 (unsigned multiply)
····64 / 32 = 32 rem 32 (unsigned divide)
····sqrt 64 = 32 (square root)
This would make it possible to perform most 32-bit math without concerns for overflow or scaling issues.
One very useful function to have in integer math is the trinary scaling operation: x * y / z, which requires the full precision of the product to be retained for the divide, thus effecting multiplication by a rational scaling factor y/z. The above operations would accommodate that, as well as Pythagorean distances in n-space to (nearly) 32-bit precision.
Thanks,
Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!
That's a lot of transistors, but I sure see your point. I imagine x*y would have to be computed before the divide-by-z could begin.
One thing I made recently was a ratio computation circuit where you supply A and B, where B>A, and it returns a 32-bit fractional value. For example, 1/2 would yield $8000_0000. It's really great for computing·FRQ values for CTRs. Say you want to determine FRQ for 3.579545MHz, while you're main clock is 80MHz. You'd just put in 3_579_545 / 80_000_000 and get $0B74_5CFE out. Does this do anything for you? It's cheap in terms of silicon.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Chip Gracey
Parallax, Inc.
Post Edited (Chip Gracey (Parallax)) : 8/27/2008 10:37:27 PM GMT
The frequency computation issue is actually what spurred my thoughts on this topic. So, yes, a circuit dedicated to fractional division would be a huge help.
If the generalized 64:32 operations require too many transistors (and I keep forgetting: that number has to be multiplied by the number of cogs), would it be possible to create simpler hardware building blocks for doing these unsigned operations, which the programmer could invoke iteratively? For example, 32 * 32 = 32 upper (unsigned), and 32 * 32 = 32 lower (unsigned). In the case of division, instead of 64 / 32 = 32 rem 32 (unsigned), maybe 32 rem 32 / 32 = 32 rem 32 (unsigned), which would permit division of multi-precision numerators if applied enough times. (I'm not sure either of these help the transistor count, though.)
-Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs left!