PDA

View Full Version : fast int sqrt for the prop



lonesock
05-22-2008, 12:56 AM
Hi, All.

I came up with a quick & simple integer sqrt function targeting the propeller (just got my 1st protoboard!!). It does not require multiplication, using only bit-shifts (for those not familiar with C notation, a<<1 = a*2, and a>>2 = a/4) and addition and minimal branching. I can't write assembly yet, so I'll just post the C version I used for testing on my desktop. This may already be a standard way of computing integer square roots, but I did not find anything related when doing a quick Google search. Enjoy!



/*
Jonathan "lonesock" Dummer

Successive Approximation Integer Square Root

2008-05-21-10.24

Features:
* simple
* scalable input range
* no multiply, just bitshift, add, sub, and minimal branching
* developed for the Parallax Propeller chip
*/
const int sqrt_input_bits = 16;
int ls_sqrt( int n )
{
/* my eventual answer */
int x = 0;
/* start with the highest (output) bit and work down */
for( int b = (sqrt_input_bits>>1)-1; b >= 0; --b )
{
/*
the difference between x^2 now, and if I add (1<<b) to x is:
(x + (1<<b))^2 - x^2
x^2 + 2*x*(1<<b) + (1<<b)^2 - x^2
((x+x)<<b) + (1<<(b+b))
(x+x+(1<<b))<<b
*/
int dx2 = ((1<<b)+x+x)<<b;
/* is n large enough to handle the delta in x^2? */
if( n >= dx2 )
{
/* if so, remove it from n */
n -= dx2;
/* and add this bit into x */
x += 1 << b;
}
}
/* and the winner is... */
return x;
}

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

Praxis
05-22-2008, 02:43 AM
Here is another way, this is a spin recode of a 9S12 assembly language routine i use.





PRI FIND_SQR (VALUE) :ROOT | SUBT,COUNT

COUNT := 0
SUBT := 0
VALUE := VALUE - 1

REPEAT
VALUE := VALUE - SUBT
IF VALUE < 0
RETURN COUNT
SUBT += 2
COUNT += 1
IF VALUE == 0
RETURN COUNT





Azlan

Post Edited (Praxis) : 5/21/2008 7:59:16 PM GMT

lonesock
05-22-2008, 03:04 AM
Thanks, Praxis, that is definitely simpler than my offering, also without multiplication! http://forums.parallax.com/images/smilies/wink.gif

I did want to mention, since it is probably not obvious from my code, that the loop is only executed once per output bit. So if you have a 16-bit input range, the whole loop will only be executed 8 times. For a 32-bit integer, 16 times, etc. This should give a nice quick execution even for extended input ranges, and the inner loop isn't too complicated.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

rokicki
05-22-2008, 03:09 AM
The original poster's algorithm is the standard by-hand square root algorithm, similar to long division.
It takes at most 15 iterations (when suitably modified for a 32-bit input).

Azlan's code is a simpler algorithm, but it is much slower; it takes as many iterations as the return
value, which for a 32-bit signed input can be as high as 46,340.

Note that Spin *has* a sqrt function already so at least in Spin neither of these is needed.

Leon
05-22-2008, 07:37 AM
I used the Newton-Raphson technique once for a 16-bit fixed-point square root on a DSP. It was very fast - three iterations, IIRC.

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM
Suzuki SV1000S motorcycle

cgracey
05-22-2008, 08:14 AM
Here's some code from the Spin interpreter that computes square root:

' y = 32-bit unsigned input
' x = 16-bit square root output

sqrt··················· mov···· x,#0
······················· mov···· t1,#0
······················· mov···· t2,#16

:loop·················· shl···· y,#1··········· wc
······················· rcl···· t1,#1
······················· shl···· y,#1··········· wc
······················· rcl···· t1,#1
······················· shl···· x,#2
······················· or····· x,#1
······················· cmpsub· t1,x··········· wc
······················· shr···· x,#2
······················· rcl···· x,#1
······················· djnz··· t2,#:loop

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

lonesock
05-22-2008, 10:14 PM
Thanks for the feedback & instruction, everybody.

@Chip: I will definitely be playing with that code today, thanks!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

lonesock
05-22-2008, 11:28 PM
I need some noob help: I can't figure out how to call an assembly function from Spin. I see plenty of examples on starting some asm in a new cog...is that the only way?

btw, just working from your example, Chip, I think this is my version rendered to asm


' my version
sqrt
mov x,#0
mov t2,#16
:loop
mov t1,#1
shl t1,t2
add t1,x
add t1,x
shl t1,t2
cmpsub y,t1 wc
rcl x,#1
djnz t2,#:loop


It is shorter than the posting you listed, but I'm sure I've made some hideous mistakes http://forums.parallax.com/images/smilies/wink.gif Note that my version could be further shortened if:
1) instead the 1st two commands inside :loop (mov, shl) I could combine them somehow, maybe with a MUX type command?
2) I could work with 2*x the whole time, and then just do a SHR at the end (this would save one of the "add t1,x" commands)

Am I way off base?

thanks,

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

Paul Baker
05-22-2008, 11:34 PM
Yes, when spin is running the interpreter fully occupies the entire cog. Theres no room to execute any user assembly. It's an either/or situation, a cog runs either Spin or assembly but never both.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker (mailto:pbaker@parallax.com)
Propeller Applications Engineer
[/url][url=http://www.parallax.com] (http://www.parallax.com)
Parallax, Inc. (http://www.parallax.com)

Ariba
05-23-2008, 03:12 AM
Only to be shure that you don't miss this:

To calculate a square root in Spin you use the operator: ^^
Example: n := ^^ 49

(this operator calls the ASM routine which Chip has posted above, inside the Interpreter cog).

Andy

lonesock
05-26-2008, 01:21 AM
<edit>updated the zip to reflect the code changes in the last post</edit>

OK, there _were_ some hideous mistakes in my assembly code posted above. First, I needed my loop to go from 15 down to 0, not 16 to 1. And secondly I need the incremental values of x each iteration, so I can't use the "rcl x,#1" trick. Here is the working assembly version of my implementation:


sqrt_lonesock
mov y,x2
mov x,#0
mov bit,#15
mov bmask,#1
shl bmask,#15
:loop
mov t1,bmask
add t1,x
add t1,x
shl t1,bit
cmpsub y,t1 wc
muxc x,bmask
shr bmask,#1
djnz bit,#:loop
shl x,#1
add x,#1
cmpsub y,x wc
shr x,#1
muxc x,#1


It runs ~20% faster than the asm version inside spin, but with a higher instruction count, and one extra temporary variable. As you can see, I compute bits 15 down to 1 inside the loop, then afterward I compute bit 0. The bit 0 section seems awkward to me...can any asm guru suggest a more elegant way to set bit 0 of x if (2x+1 <= y)?

I'm attaching a zip file of my Spin test files. "validate_sqrt.spin" compares my value against the spin assembly value in a loop, then lights a LED if everything matched (you will need to change the pin, I'm not using the demo board. At 80MHz the propeller can validate about 66k sqrts per second, so be prepared to wait depending on the initial count value you use). "test_sqrt.spin" starts up 3 cogs, 1 for straight Spin, 1 for the original Spin assembly version, and one for the new version of sqrt. Every 10,000 sqrts the cog will toggle an LED, for a visual speed reference. You will probably need to change these pins too.

thanks,

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

Post Edited (lonesock) : 5/28/2008 4:36:07 PM GMT

lonesock
05-28-2008, 10:46 AM
<edit>added timing info</edit>

Latest updated version:


sqrt_lonesock
mov x,#0
mov bit2,#32
:loop
sub bit2,#2
mov t1,x
shl t1,#2
add t1,#1
shl t1,bit2
cmpsub y,t1 wc
rcl x,#1
tjnz bit2,#:loop


This version is 25% faster than the Spin assembly version (so the same speed as my previous version), but now it uses the same number of temporary variables, and less overall code (making it a suitable replacement for the sqrt inside Spin...wink wink). It can be sped up even more, at the expense of instruction count, by semi-unrolling the loop; specifically the 1st iteration where x = 0, and the final iteration where bit2 = 0 can be performed explicitly saving about 6 executed operations overall, but adding another 8(ish) instructions to the function.

For anybody following along, this version looks radically different from the previous incarnation. Here's what changed:
* modified the bit counter to be 2*bit, starting at 32 but then immediately decrementing by 2 so my loop goes from 30 down to 0 in steps of -2
* x uses the rcl trick, instead of placing the bit in the correct location right off the bat, which means...
* I need to modify x inside the loop to regain it's proper alignment. Instead of delta_x^2 = ((1<<bits)+2x)<<bits), now delta_x^2 = (4x+1)<<(2*bits)

Timing on my prop at 80MHz
* just Spin with ^^, no asm: 35,000 sqrts/sec
* asm Spin version: 120,250 sqrts/sec
* my version: 150,600 sqrts/sec
* unrolled: 157,600 sqrts/sec

Does anyone see a way to further optimize / shorten the code?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

Post Edited (lonesock) : 5/28/2008 4:42:25 PM GMT

cgracey
05-29-2008, 06:44 AM
Lonesock,

That looks fantastic. I love it when things get smaller and faster. Good Job!

Who knows, it might even be possible to make it smaller, yet, but I can't see how.

Do you think a hardware implementation of sqrt would be good for the next Propeller if it took, say, 16 clocks to execute, and was implemented as a register r/w rather than an instruction (no holding things up)? You could write the 32-bit value and 16+ clocks afterwards read back the 16-bit root.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

cgracey
05-29-2008, 08:57 AM
Lonesock,

I looked at your code and wondered if the·root-in-progress could be maintained in-situ for the CMPSUB test. I was able to shave two instructions·out of·the loop! SUMC·works like magic here:

······· mov···· root,#0········ 'reset root
······· mov···· mask,h40000000· 'reset mask (constant·in register)
:loop
······· or····· root,mask······ 'set trial bit
······· cmpsub· input,root· wc· 'subtract root from input if fits
······· sumnc·· root,mask······ 'cancel trial bit, set root bit if fit
······· shr···· root,#1········ 'shift root down
······· shr···· mask,#2········ 'shift mask down
······· tjnz··· mask,#:loop···· 'loop until mask empty

This could be straight-line coded in only 4 instructions per iteration for 312,500 roots per second:

······· mov···· root,h40000000
······· cmpsub· input,root· wc
······· sumnc·· root,h40000000
······· shr···· root,#1······

······· or····· root,h10000000
······· cmpsub· input,root· wc
······· sumnc·· root,h10000000
······· shr···· root,#1·······

······· or····· root,h04000000
······· cmpsub· input,root· wc
······· sumnc·· root,h04000000
······· shr···· root,#1·······

······· or····· root,h01000000
······· cmpsub· input,root· wc
······· sumnc·· root,h01000000
······· shr···· root,#1·······

······· or····· root,h00400000
······· cmpsub· input,root· wc
······· sumnc·· root,h00400000
······· shr···· root,#1·······

······· or····· root,h00100000
······· cmpsub· input,root· wc
······· sumnc·· root,h00100000
······· shr···· root,#1·······

······· or····· root,h00040000
······· cmpsub· input,root· wc
······· sumnc·· root,h00040000
······· shr···· root,#1·······

······· or····· root,h00010000
······· cmpsub· input,root· wc
······· sumnc·· root,h00010000
······· shr···· root,#1·······

······· or····· root,h00004000
······· cmpsub· input,root· wc
······· sumnc·· root,h00004000
······· shr···· root,#1·······

······· or····· root,h00001000
······· cmpsub· input,root· wc
······· sumnc·· root,h00001000
······· shr···· root,#1·······

······· or····· root,h00000400
······· cmpsub· input,root· wc
······· sumnc·· root,h00000400
······· shr···· root,#1·······

······· or····· root,#$100
······· cmpsub· input,root· wc
······· sumnc·· root,#$100
······· shr···· root,#1·······

······· or····· root,#$40
······· cmpsub· input,root· wc
······· sumnc·· root,#$40
······· shr···· root,#1·······

······· or····· root,#$10
······· cmpsub· input,root· wc
······· sumnc·· root,#$10
······· shr···· root,#1·······

······· or····· root,#$4
······· cmpsub· input,root· wc
······· sumnc·· root,#$4
······· shr···· root,#1·······

······· or····· root,#$1
······· cmpsub· input,root· wc
······· sumnc·· root,#$1
······· shr···· root,#1·······

I·always thought that the Spin interpreter's square-root routine was as small as possible, but you found a much better way, which·got me to think about it differently, which produced even another improvement. How much better could the whole interpreter be, I wonder?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

kuroneko
05-29-2008, 09:06 AM
lonesock said...
Does anyone see a way to further optimize / shorten the code?



sqrt xor x, x
mov p, storage

:loop or x, p
cmpsub y, x wc

IF_C add x, p
andn x, p

shr x, #1
shr p, #2 wz
IF_NZ jmp #:loop

storage long %01 << 30 ' x = 0, (x * 4 + 1) << 30
x res 1
p res 1



And Chip got an even better solution by using sumnc. Well somehow that add/andn felt a bit odd ;)

lonesock
05-29-2008, 10:16 AM
@ Chip and kuroneko:
Brilliant, both of you!

@ Chip:
That is blindingly fast, especially unrolled! (You do lose space due to the h* constants, on top of the instruction count, correct?) But I'm happy with the looped version, thanks!

Regarding having a native sqrt, I would love to have it, but priority-wise it is certainly lower than hardware MUL and DIV instructions. I'd hate to be lynched by mobs of prop-fans ("We were going to include a HW MUL, but that lonesock guy demanded a sqrt, and we just didn't have time to do both..."). (btw, I'm using sqrt to compute standard deviations, playing with some audio compression methods, so I'm computing it per block, not per sample, so it's certainly fast enough for my needs now!)

Regarding optimizing the entire Spin interpreter, there's always room for improvements http://forums.parallax.com/images/smilies/wink.gif Maybe you could hold little mini-forum-contests, one function at a time (Protoboard prizes or something would be fun!) A nice goal would be shrinking the entire interpreter down to a size where you could call an actual inline assembly function if the whole thing fit in, say, 16 longs. What do you think?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

ImageCraft
05-29-2008, 01:51 PM
Can we use it in the C compiler http://forums.parallax.com/images/smilies/smile.gif?

It's almost small enough to consider putting it in the LMM kernel, hmm... I wonder, may be what we will do in the compiler is to have a "Helper COG." All the extra useful functions that don't fit in the LMM kernel COG can live there. e.g. string functions, fast sqrt, the low level floating point primitives etc. This COG will be similar to providing a shared library version of these routines....

The only CON is that we will use up one more COG, but that still leaves 6....

cgracey
05-29-2008, 03:03 PM
ImageCraft said...
Can we use it in the C compiler http://forums.parallax.com/images/smilies/smile.gif?


Well, I'm not going to claim any ownership, and I don't think anyone else would, either. I think we'd all be happy if you used it, as we're all excited about a better way to do something.

I know we all suffer from paranoia over simple stuff like this, but let's·cultivate some freedom here. Let's enjoy engineering!

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Graham Stabler
05-29-2008, 04:04 PM
Is there a fast square as well as a root, i.e. faster than a straight multiply by the same number?

Graham

cgracey
05-29-2008, 05:13 PM
Graham Stabler said...
Is there a fast square as well as a root, i.e. faster than a straight multiply by the same number?

Graham
I'm sure there is. We'd have to reverse the process of computing the root. If the root can be determined in 4 instructions per bit, the square should be determinable in 4 instruction per 2 bits. Hey, a straight multiply takes 2 instructions per bit, so there it is. Perhaps a square can be computed at the rate of 2 bits per three instructions, or in·48 instructions total.

Fast squares plus fast square root = fast hypotenuse.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Post Edited (Chip Gracey (Parallax)) : 5/29/2008 10:54:11 AM GMT

lonesock
05-29-2008, 06:15 PM
Chip Gracey (Parallax) said...

ImageCraft said...

Can we use it in the C compiler http://forums.parallax.com/images/smilies/smile.gif?

Well, I'm not going to claim any ownership, and I don't think anyone else would, either. I think we'd all be happy if you used it, as we're all excited about a better way to do something.

I know we all suffer from paranoia over simple stuff like this, but let's cultivate some freedom here. Let's enjoy engineering!

Exactly! In the spirit of Towel Day (belated), "Share and Enjoy!". For the most part I think patents on algorithms qualify as 'silly', especially for something as trivial as a standard mathematical operation. So, just for the record, all of the code I posted is released to the public domain. I can't comment on the license of the original sqrt assembly that Chip shared with us, though.

btw, ImageCraft, I'm really enjoying your C compiler (only played with the first Beta, still need to try out #4!). Thanks for the great work. If there is anything I can do to contribute, just let me know.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

lonesock
05-29-2008, 06:23 PM
Graham Stabler said...
Is there a fast square as well as a root, i.e. faster than a straight multiply by the same number?

Graham

There are some interesting thoughts here:
www.piclist.com/techref/microchip/math/sq/index.htm (http://www.piclist.com/techref/microchip/math/sq/index.htm)
(of course it would be fun to work it out from scratch http://forums.parallax.com/images/smilies/wink.gif)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

Graham Stabler
05-29-2008, 07:52 PM
Chip, you read my mind, check out my site on parallel kinematics:

http://www.indoor.flyer.co.uk/kinematics.htm

Scroll down a little way and see how simple the kinematics of some machines can be, just need squares and roots.

Graham

cgracey
08-27-2008, 02:06 PM
I was looking at this thread to·recall the·improved square-root algorithm for the new Spin interpreter and I found a way to shave another long off the looped version.

Here's how it was (note that 'h40000000' is a register containing $40000000):

······· mov···· root,#0········ 'reset root
······· mov···· mask,h40000000· 'reset mask (constant·in register)
:loop·· or····· root,mask······ 'set trial bit
······· cmpsub· input,root· wc· 'subtract root from input if fits
······· sumnc·· root,mask······ 'cancel trial bit, set root bit if fit
······· shr···· root,#1········ 'shift root down
······· shr···· mask,#2········ 'shift mask down
······· tjnz··· mask,#:loop···· 'loop until mask empty



Here's the smaller version (now 'mask' is a register containing $40000000 which gets used directly and rotated all the way back to its original value):

······· mov···· root,#0········ 'reset root
·:loop· or····· root,mask······ 'set trial bit
······· cmpsub· input,root· wc· 'subtract root from input if fits
······· sumnc·· root,mask······ 'cancel trial bit, set root bit if fit
······· shr···· root,#1········ 'shift root down
······· ror···· mask,#2·····wc· 'shift mask down (wraps on last iteration)
·if_nc· jmp···· #:loop······ ·· 'loop until mask restored

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Cluso99
08-27-2008, 02:45 PM
I have enough space to use an LMM style inline routine or a small overlay section in the ram interpreter - I have over 40 longs available (and counting). So don't presume this cannot be done http://forums.parallax.com/images/smilies/smile.gif Unfortunately my testing ground to a halt due to a few other things here. I have done a quick calc and it will be quite a bit faster!

I want to get the bits I have done shrunk working before I post it. Chip has given me permission to post the code - Thanks Chip http://forums.parallax.com/images/smilies/smile.gif

I have a spin debugger semi working and a single-stepper nearly working (uses 8 longs in the interpreter and the code resides in hub - there are a few caveats because the Interpreter can execute in more that 1 cog concurrently and stumped me for a while!). I need this to verify the Ram Interpreter.

I have a question: Does anyone know if the Prop I ignores the upper 16 bits in a pointer to hub ram ?

hptr long $5555_7000
rdlong x,hptr

i.e. Does the $5555_0000 get ignored or must it be $0000_7000 ($7000 is the hub ram address)? (I am away and no hardware to test it on).

Posteditted: I'll add the improved sqrt·if that's ok. I haven't looked at multiply or divide, so any suggestions would be good.

Post Edited (Cluso99) : 8/27/2008 8:09:44 AM GMT

cgracey
08-27-2008, 03:49 PM
Cluso99 said...

I have a question: Does anyone know if the Prop I ignores the upper 16 bits in a pointer to hub ram ?

hptr long $5555_7000
rdlong x,hptr

i.e. Does the $5555_0000 get ignored or must it be $0000_7000 ($7000 is the hub ram address)? (I am away and no hardware to test it on).
Yes, those top 16 bits get ignored. Just the lower 16 bits are used.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Ale
08-27-2008, 11:29 PM
Rotating the mask is the same method I used in the BCD addition and subtraction that I posted in the Math article. very powerful to save space and cycles.

Cluso99
08-28-2008, 04:09 AM
Chip said...
Do you think a hardware implementation of sqrt would be good for the next Propeller if it took, say, 16 clocks to execute, and was implemented as a register r/w rather than an instruction (no holding things up)? You could write the 32-bit value and 16+ clocks afterwards read back the 16-bit root.

Wow Chip - that would be fantastic. Maybe the same for divide??- one for result, one for remainder??? Also love your sqrt http://forums.parallax.com/images/smilies/smile.gif

You have made me rethink onvernight. I am going to tidy the Ram Interpreter and post it untested with a statement of what I know works and where it is up to, so watch out for the thread. Hopefully I might get that done on the plane today.

I will also start· a thread on spin and asm debuggers which·I have sort of working, as it may inspire the input like this thread has. http://forums.parallax.com/images/smilies/cool.gif

Phil Pilgrim (PhiPi)
08-28-2008, 04:40 AM
Chip,

As long as you're making the gesture (and maybe now regretting it http://forums.parallax.com/images/smilies/smile.gif ), a very useful set of register-based hardware macro functions would be:

····32 * 32 = 64 (unsigned multiply)
····64 / 32 = 32 rem 32 (unsigned divide)
····sqrt 64 = 32 (square root)

This would make it possible to perform most 32-bit math without concerns for overflow or scaling issues.

One very useful function to have in integer math is the trinary scaling operation: x * y / z, which requires the full precision of the product to be retained for the divide, thus effecting multiplication by a rational scaling factor y/z. The above operations would accommodate that, as well as Pythagorean distances in n-space to (nearly) 32-bit precision.

Thanks,
Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs (http://forums.parallax.com/showthread.php?p=729096) left!

cgracey
08-28-2008, 05:10 AM
Phil Pilgrim (PhiPi) said...
Chip,

As long as you're making the gesture (and maybe now regretting it http://forums.parallax.com/images/smilies/smile.gif ), a very useful set of register-based hardware macro functions would be:

····32 * 32 = 64 (unsigned multiply)
····64 / 32 = 32 rem 32 (unsigned divide)
····sqrt 64 = 32 (square root)

This would make it possible to perform most 32-bit math without concerns for overflow or scaling issues.

One very useful function to have in integer math is the trinary scaling operation: x * y / z, which requires the full precision of the product to be retained for the divide, thus effecting multiplication by a rational scaling factor y/z. The above operations would accommodate that, as well as Pythagorean distances in n-space to (nearly) 32-bit precision.

Thanks,
Phil



Phil,

That's a lot of transistors, but I sure see your point. I imagine x*y would have to be computed before the divide-by-z could begin.

One thing I made recently was a ratio computation circuit where you supply A and B, where B>A, and it returns a 32-bit fractional value. For example, 1/2 would yield $8000_0000. It's really great for computing·FRQ values for CTRs. Say you want to determine FRQ for 3.579545MHz, while you're main clock is 80MHz. You'd just put in 3_579_545 / 80_000_000 and get $0B74_5CFE out. Does this do anything for you? It's cheap in terms of silicon.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Post Edited (Chip Gracey (Parallax)) : 8/27/2008 10:37:27 PM GMT

Phil Pilgrim (PhiPi)
08-28-2008, 06:01 AM
Chip,

The frequency computation issue is actually what spurred my thoughts on this topic. So, yes, a circuit dedicated to fractional division would be a huge help.

If the generalized 64:32 operations require too many transistors (and I keep forgetting: that number has to be multiplied by the number of cogs), would it be possible to create simpler hardware building blocks for doing these unsigned operations, which the programmer could invoke iteratively? For example, 32 * 32 = 32 upper (unsigned), and 32 * 32 = 32 lower (unsigned). In the case of division, instead of 64 / 32 = 32 rem 32 (unsigned), maybe 32 rem 32 / 32 = 32 rem 32 (unsigned), which would permit division of multi-precision numerators if applied enough times. (I'm not sure either of these help the transistor count, though.)

-Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
'Still some PropSTICK Kit bare PCBs (http://forums.parallax.com/showthread.php?p=729096) left!

VIRAND
08-28-2008, 07:02 AM
I'm surprised no one used the LOG Tables!

Tracy Allen
08-28-2008, 09:06 AM
Chip, I'd very much like the ratio computation, right up there on a priority level with sqrt. You did that in hardware as a separate circuit?

The way I have been doing it in pasm is the following and I wonder in the spirit of this thread if the code can be shortened.



' convert ratio y / x, where y < x, into a binary fraction f
' enter with y, x:{y > x, x < 2^31, y <= 2^31}
' exit with binary fraction, f, where f is the numerator of the approximation, y /x = f / (2^n).
'
fractyxb ' enter here , n contains number of bits, usually will be 32 bits for frqx calc.
mov f,#0 ' result
:loop
shl y,#1 ' double the value of y
cmpsub y, x wc,wr ' if y>=x then y:=y-x, carry:=1, else a unchanged, carry:=0
rcl f,#1 ' shift in carry, double the estimate of f, appoximation f/(2^n) =~ y/x
djnz n,#:loop ' do all bits
fractyxb_ret ret




Here is the equivalent in spin, which can then be used with the ** operator when the number of bits=32 for calculating the trinary
result = z * y / x. (result := f ** z)





PRI fractyxzb (y, x, b) : f
'' calculate f = y/x * 2^nb
'' enter with y < x, and b=number of bits in approximation
'' b is number of bits
'' same as f: y/x = f / (2^nb) < y/x < (f+1)/(2^nb),
'' that is, f over a power of two is a close appoximation to the original fraction.
repeat b
y <<= 1
f <<= 1
if y => x '
y -= x
f++

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Tracy Allen
www.emesystems.com (http://www.emesystems.com)

cgracey
08-28-2008, 10:57 AM
Tracy,

I think the only way that routine can get sped up is by straight-lining each iteration. In the next Propeller you could do this:

··········· rep···· [32,3]····· 'repeat 3 instructions 32 times

··········· nop················ 'must execute two instructions here

··········· nop

··········· shl···· x,#1······· 'begin 3-instruction block

··········· cmpsub· x,y····· wc

··········· rcl···· q,#1······· 'total cycles = 3 + 3*32 = 99



With the divider circuit, you'd do this:

··········· setdivx x·········· 'set x

··········· setdivy y,[32]····· 'set y, begin 32 iterations

··········· (nop)·············· 'you could execute up to 31 instructions here

··········· getdivq q·········· 'get result (waits), total cycles = 2 + 31 + 1 = 34

So, it's ~3 times faster, but works in the background, so in a way it only takes 3 instructions (33x faster). This is the way the CORDIC works, too.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


Chip Gracey
Parallax, Inc.

Post Edited (Chip Gracey (Parallax)) : 8/28/2008 4:03:09 AM GMT

Ale
08-28-2008, 03:31 PM
In a way how many FPUs work, I mean in the background. If we could access external memory fast enough, I mean at 160 MHz we could safely use SDRAM maybe at 66 or 80 MHz, that could easily compete with some multimedia optimized processors in the type or ARM or similar. There are many multimedia players with limited resources that decode video at 160x128 or 320x240 and 24 to 30 fps (No idea if it is doable).

The SH-1 (and I think SH-2) have an iterative division that takes 32 or so instructions and cycles. but htis way is better because you have 31 instructions to do something.

In a not so related note, is there any help for pointers ?, scale, auto-post or -pre decrement/increment ? (Maybe it is moot with the speed).

What happens with consumption ? is there an estimate or something like that ?

Cluso99
08-29-2008, 10:52 AM
Hey... Did anyone notice this...


Chip said...

··········· rep···· [32,3]····· 'repeat 3 instructions 32 times


WOW !!!!!

Ale
08-29-2008, 05:26 PM
I did. The newest way to do a loop without variable and without jump instruction. Pretty neat. I do not see the hour this chip shows up at my desk !!.
Between this and the other thread I lost the count on how many improvements this prop 2 has, many more than what it was outlined in that long thread from last year We need it !. At last something we want will be created, just for us (and some other customers :) ).