Shop OBEX P1 Docs P2 Docs Learn Events
fast int sqrt for the prop — Parallax Forums

fast int sqrt for the prop

lonesocklonesock Posts: 917
edited 2008-08-29 10:26 in Propeller 1
Hi, All.

I came up with a quick & simple integer sqrt function targeting the propeller (just got my 1st protoboard!!). It does not require multiplication, using only bit-shifts (for those not familiar with C notation, a<<1 = a*2, and a>>2 = a/4) and addition and minimal branching. I can't write assembly yet, so I'll just post the C version I used for testing on my desktop. This may already be a standard way of computing integer square roots, but I did not find anything related when doing a quick Google search. Enjoy!

/*
    Jonathan "lonesock" Dummer

    Successive Approximation Integer Square Root

    2008-05-21-10.24

    Features:
    * simple
    * scalable input range
    * no multiply, just bitshift, add, sub, and minimal branching
    * developed for the Parallax Propeller chip
*/
const int sqrt_input_bits = 16;
int ls_sqrt( int n )
{
    /* my eventual answer */
    int x = 0;
    /* start with the highest (output) bit and work down */
    for( int b = (sqrt_input_bits>>1)-1; b >= 0; --b )
    {
        /*
            the difference between x^2 now, and if I add (1<<b) to x is:
            (x + (1<<b))^2 - x^2
            x^2 + 2*x*(1<<b) + (1<<b)^2 - x^2
            ((x+x)<<b) + (1<<(b+b))
            (x+x+(1<<b))<<b
        */
        int dx2 = ((1<<b)+x+x)<<b;
        /* is n large enough to handle the delta in x^2? */
        if( n >= dx2 )
        {
            /* if so, remove it from n */
            n -= dx2;
            /* and add this bit into x */
            x += 1 << b;
        }
    }
    /* and the winner is... */
    return x;
}

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
«1

Comments

  • PraxisPraxis Posts: 333
    edited 2008-05-21 19:43
    Here is another way, this is a spin recode of a 9S12 assembly language routine i use.

    
    PRI FIND_SQR (VALUE) :ROOT | SUBT,COUNT
    
      COUNT := 0
      SUBT := 0
      VALUE := VALUE - 1
    
      REPEAT
        VALUE := VALUE - SUBT
        IF VALUE < 0
          RETURN COUNT
        SUBT += 2
        COUNT += 1
        IF VALUE == 0
          RETURN COUNT
    
    
    



    Azlan

    Post Edited (Praxis) : 5/21/2008 7:59:16 PM GMT
  • lonesocklonesock Posts: 917
    edited 2008-05-21 20:04
    Thanks, Praxis, that is definitely simpler than my offering, also without multiplication! wink.gif

    I did want to mention, since it is probably not obvious from my code, that the loop is only executed once per output bit. So if you have a 16-bit input range, the whole loop will only be executed 8 times. For a 32-bit integer, 16 times, etc. This should give a nice quick execution even for extended input ranges, and the inner loop isn't too complicated.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    lonesock
    Piranha are people too.
  • rokickirokicki Posts: 1,000
    edited 2008-05-21 20:09
    The original poster's algorithm is the standard by-hand square root algorithm, similar to long division.
    It takes at most 15 iterations (when suitably modified for a 32-bit input).

    Azlan's code is a simpler algorithm, but it is much slower; it takes as many iterations as the return
    value, which for a 32-bit signed input can be as high as 46,340.

    Note that Spin *has* a sqrt function already so at least in Spin neither of these is needed.
  • LeonLeon Posts: 7,620
    edited 2008-05-22 00:37
    I used the Newton-Raphson technique once for a 16-bit fixed-point square root on a DSP. It was very fast - three iterations, IIRC.

    Leon

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Amateur radio callsign: G1HSM
    Suzuki SV1000S motorcycle
  • cgraceycgracey Posts: 14,246
    edited 2008-05-22 01:14
    Here's some code from the Spin interpreter that computes square root:

    ' y = 32-bit unsigned input
    ' x = 16-bit square root output

    sqrt··················· mov···· x,#0
    ······················· mov···· t1,#0
    ······················· mov···· t2,#16

    :loop·················· shl···· y,#1··········· wc
    ······················· rcl···· t1,#1
    ······················· shl···· y,#1··········· wc
    ······················· rcl···· t1,#1
    ······················· shl···· x,#2
    ······················· or····· x,#1
    ······················· cmpsub· t1,x··········· wc
    ······················· shr···· x,#2
    ······················· rcl···· x,#1
    ······················· djnz··· t2,#:loop

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • lonesocklonesock Posts: 917
    edited 2008-05-22 15:14
    Thanks for the feedback & instruction, everybody.

    @Chip: I will definitely be playing with that code today, thanks!

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    lonesock
    Piranha are people too.
  • lonesocklonesock Posts: 917
    edited 2008-05-22 16:28
    I need some noob help: I can't figure out how to call an assembly function from Spin. I see plenty of examples on starting some asm in a new cog...is that the only way?

    btw, just working from your example, Chip, I think this is my version rendered to asm
    '    my version
    sqrt
            mov        x,#0
            mov        t2,#16
    :loop
            mov        t1,#1
            shl        t1,t2
            add        t1,x
            add        t1,x
            shl        t1,t2
            cmpsub    y,t1    wc
            rcl        x,#1
            djnz    t2,#:loop
    


    It is shorter than the posting you listed, but I'm sure I've made some hideous mistakes wink.gif Note that my version could be further shortened if:
    1) instead the 1st two commands inside :loop (mov, shl) I could combine them somehow, maybe with a MUX type command?
    2) I could work with 2*x the whole time, and then just do a SHR at the end (this would save one of the "add t1,x" commands)

    Am I way off base?

    thanks,

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    lonesock
    Piranha are people too.
  • Paul BakerPaul Baker Posts: 6,351
    edited 2008-05-22 16:34
    Yes, when spin is running the interpreter fully occupies the entire cog. Theres no room to execute any user assembly. It's an either/or situation, a cog runs either Spin or assembly but never both.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Paul Baker
    Propeller Applications Engineer

    Parallax, Inc.
  • AribaAriba Posts: 2,690
    edited 2008-05-22 20:12
    Only to be shure that you don't miss this:

    To calculate a square root in Spin you use the operator: ^^
    Example: n := ^^ 49

    (this operator calls the ASM routine which Chip has posted above, inside the Interpreter cog).

    Andy
  • lonesocklonesock Posts: 917
    edited 2008-05-25 18:21
    <edit>updated the zip to reflect the code changes in the last post</edit>

    OK, there _were_ some hideous mistakes in my assembly code posted above. First, I needed my loop to go from 15 down to 0, not 16 to 1. And secondly I need the incremental values of x each iteration, so I can't use the "rcl x,#1" trick. Here is the working assembly version of my implementation:
    sqrt_lonesock
            mov y,x2
            mov x,#0
            mov bit,#15
            mov bmask,#1
            shl bmask,#15
    :loop
            mov t1,bmask
            add t1,x
            add t1,x
            shl t1,bit
            cmpsub y,t1 wc
            muxc x,bmask
            shr bmask,#1
            djnz bit,#:loop
            shl x,#1
            add x,#1
            cmpsub y,x wc
            shr x,#1
            muxc x,#1
    


    It runs ~20% faster than the asm version inside spin, but with a higher instruction count, and one extra temporary variable. As you can see, I compute bits 15 down to 1 inside the loop, then afterward I compute bit 0. The bit 0 section seems awkward to me...can any asm guru suggest a more elegant way to set bit 0 of x if (2x+1 <= y)?

    I'm attaching a zip file of my Spin test files. "validate_sqrt.spin" compares my value against the spin assembly value in a loop, then lights a LED if everything matched (you will need to change the pin, I'm not using the demo board. At 80MHz the propeller can validate about 66k sqrts per second, so be prepared to wait depending on the initial count value you use). "test_sqrt.spin" starts up 3 cogs, 1 for straight Spin, 1 for the original Spin assembly version, and one for the new version of sqrt. Every 10,000 sqrts the cog will toggle an LED, for a visual speed reference. You will probably need to change these pins too.

    thanks,

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    lonesock
    Piranha are people too.

    Post Edited (lonesock) : 5/28/2008 4:36:07 PM GMT
  • lonesocklonesock Posts: 917
    edited 2008-05-28 03:46
    <edit>added timing info</edit>

    Latest updated version:
    sqrt_lonesock
          mov x,#0
          mov bit2,#32
    :loop
          sub bit2,#2
          mov t1,x
          shl t1,#2
          add t1,#1
          shl t1,bit2
          cmpsub y,t1 wc
          rcl x,#1
          tjnz bit2,#:loop
    


    This version is 25% faster than the Spin assembly version (so the same speed as my previous version), but now it uses the same number of temporary variables, and less overall code (making it a suitable replacement for the sqrt inside Spin...wink wink). It can be sped up even more, at the expense of instruction count, by semi-unrolling the loop; specifically the 1st iteration where x = 0, and the final iteration where bit2 = 0 can be performed explicitly saving about 6 executed operations overall, but adding another 8(ish) instructions to the function.

    For anybody following along, this version looks radically different from the previous incarnation. Here's what changed:
    * modified the bit counter to be 2*bit, starting at 32 but then immediately decrementing by 2 so my loop goes from 30 down to 0 in steps of -2
    * x uses the rcl trick, instead of placing the bit in the correct location right off the bat, which means...
    * I need to modify x inside the loop to regain it's proper alignment. Instead of delta_x^2 = ((1<<bits)+2x)<<bits), now delta_x^2 = (4x+1)<<(2*bits)

    Timing on my prop at 80MHz
    * just Spin with ^^, no asm: 35,000 sqrts/sec
    * asm Spin version: 120,250 sqrts/sec
    * my version: 150,600 sqrts/sec
    * unrolled: 157,600 sqrts/sec

    Does anyone see a way to further optimize / shorten the code?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    lonesock
    Piranha are people too.

    Post Edited (lonesock) : 5/28/2008 4:42:25 PM GMT
  • cgraceycgracey Posts: 14,246
    edited 2008-05-28 23:44
    Lonesock,

    That looks fantastic. I love it when things get smaller and faster. Good Job!

    Who knows, it might even be possible to make it smaller, yet, but I can't see how.

    Do you think a hardware implementation of sqrt would be good for the next Propeller if it took, say, 16 clocks to execute, and was implemented as a register r/w rather than an instruction (no holding things up)? You could write the 32-bit value and 16+ clocks afterwards read back the 16-bit root.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • cgraceycgracey Posts: 14,246
    edited 2008-05-29 01:57
    Lonesock,

    I looked at your code and wondered if the·root-in-progress could be maintained in-situ for the CMPSUB test. I was able to shave two instructions·out of·the loop! SUMC·works like magic here:

    ······· mov···· root,#0········ 'reset root
    ······· mov···· mask,h40000000· 'reset mask (constant·in register)
    :loop
    ······· or····· root,mask······ 'set trial bit
    ······· cmpsub· input,root· wc· 'subtract root from input if fits
    ······· sumnc·· root,mask······ 'cancel trial bit, set root bit if fit
    ······· shr···· root,#1········ 'shift root down
    ······· shr···· mask,#2········ 'shift mask down
    ······· tjnz··· mask,#:loop···· 'loop until mask empty

    This could be straight-line coded in only 4 instructions per iteration for 312,500 roots per second:

    ······· mov···· root,h40000000
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,h40000000
    ······· shr···· root,#1······

    ······· or····· root,h10000000
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,h10000000
    ······· shr···· root,#1·······

    ······· or····· root,h04000000
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,h04000000
    ······· shr···· root,#1·······

    ······· or····· root,h01000000
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,h01000000
    ······· shr···· root,#1·······

    ······· or····· root,h00400000
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,h00400000
    ······· shr···· root,#1·······

    ······· or····· root,h00100000
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,h00100000
    ······· shr···· root,#1·······

    ······· or····· root,h00040000
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,h00040000
    ······· shr···· root,#1·······

    ······· or····· root,h00010000
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,h00010000
    ······· shr···· root,#1·······

    ······· or····· root,h00004000
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,h00004000
    ······· shr···· root,#1·······

    ······· or····· root,h00001000
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,h00001000
    ······· shr···· root,#1·······

    ······· or····· root,h00000400
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,h00000400
    ······· shr···· root,#1·······

    ······· or····· root,#$100
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,#$100
    ······· shr···· root,#1·······

    ······· or····· root,#$40
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,#$40
    ······· shr···· root,#1·······

    ······· or····· root,#$10
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,#$10
    ······· shr···· root,#1·······

    ······· or····· root,#$4
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,#$4
    ······· shr···· root,#1·······

    ······· or····· root,#$1
    ······· cmpsub· input,root· wc
    ······· sumnc·· root,#$1
    ······· shr···· root,#1·······

    I·always thought that the Spin interpreter's square-root routine was as small as possible, but you found a much better way, which·got me to think about it differently, which produced even another improvement. How much better could the whole interpreter be, I wonder?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • kuronekokuroneko Posts: 3,623
    edited 2008-05-29 02:06
    lonesock said...
    Does anyone see a way to further optimize / shorten the code?
    sqrt            xor     x, x
                    mov     p, storage
    
    :loop           or      x, p
                    cmpsub  y, x wc
    
            IF_C    add     x, p
                    andn    x, p
    
                    shr     x, #1
                    shr     p, #2 wz
            IF_NZ   jmp     #:loop
    
    storage         long    %01 << 30       ' x = 0, (x * 4 + 1) << 30
    x               res     1
    p               res     1
    



    And Chip got an even better solution by using sumnc. Well somehow that add/andn felt a bit odd [noparse];)[/noparse]
  • lonesocklonesock Posts: 917
    edited 2008-05-29 03:16
    @ Chip and kuroneko:
    Brilliant, both of you!

    @ Chip:
    That is blindingly fast, especially unrolled! (You do lose space due to the h* constants, on top of the instruction count, correct?) But I'm happy with the looped version, thanks!

    Regarding having a native sqrt, I would love to have it, but priority-wise it is certainly lower than hardware MUL and DIV instructions. I'd hate to be lynched by mobs of prop-fans ("We were going to include a HW MUL, but that lonesock guy demanded a sqrt, and we just didn't have time to do both..."). (btw, I'm using sqrt to compute standard deviations, playing with some audio compression methods, so I'm computing it per block, not per sample, so it's certainly fast enough for my needs now!)

    Regarding optimizing the entire Spin interpreter, there's always room for improvements wink.gif Maybe you could hold little mini-forum-contests, one function at a time (Protoboard prizes or something would be fun!) A nice goal would be shrinking the entire interpreter down to a size where you could call an actual inline assembly function if the whole thing fit in, say, 16 longs. What do you think?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    lonesock
    Piranha are people too.
  • ImageCraftImageCraft Posts: 348
    edited 2008-05-29 06:51
    Can we use it in the C compiler smile.gif?

    It's almost small enough to consider putting it in the LMM kernel, hmm... I wonder, may be what we will do in the compiler is to have a "Helper COG." All the extra useful functions that don't fit in the LMM kernel COG can live there. e.g. string functions, fast sqrt, the low level floating point primitives etc. This COG will be similar to providing a shared library version of these routines....

    The only CON is that we will use up one more COG, but that still leaves 6....
  • cgraceycgracey Posts: 14,246
    edited 2008-05-29 08:03
    ImageCraft said...
    Can we use it in the C compiler smile.gif?

    Well, I'm not going to claim any ownership, and I don't think anyone else would, either. I think we'd all be happy if you used it, as we're all excited about a better way to do something.

    I know we all suffer from paranoia over simple stuff like this, but let's·cultivate some freedom here. Let's enjoy engineering!

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Graham StablerGraham Stabler Posts: 2,510
    edited 2008-05-29 09:04
    Is there a fast square as well as a root, i.e. faster than a straight multiply by the same number?

    Graham
  • cgraceycgracey Posts: 14,246
    edited 2008-05-29 10:13
    Graham Stabler said...
    Is there a fast square as well as a root, i.e. faster than a straight multiply by the same number?

    Graham
    I'm sure there is. We'd have to reverse the process of computing the root. If the root can be determined in 4 instructions per bit, the square should be determinable in 4 instruction per 2 bits. Hey, a straight multiply takes 2 instructions per bit, so there it is. Perhaps a square can be computed at the rate of 2 bits per three instructions, or in·48 instructions total.

    Fast squares plus fast square root = fast hypotenuse.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

    Post Edited (Chip Gracey (Parallax)) : 5/29/2008 10:54:11 AM GMT
  • lonesocklonesock Posts: 917
    edited 2008-05-29 11:15
    Chip Gracey (Parallax) said...
    ImageCraft said...

    Can we use it in the C compiler smile.gif?
    Well, I'm not going to claim any ownership, and I don't think anyone else would, either. I think we'd all be happy if you used it, as we're all excited about a better way to do something.

    I know we all suffer from paranoia over simple stuff like this, but let's cultivate some freedom here. Let's enjoy engineering!
    Exactly! In the spirit of Towel Day (belated), "Share and Enjoy!". For the most part I think patents on algorithms qualify as 'silly', especially for something as trivial as a standard mathematical operation. So, just for the record, all of the code I posted is released to the public domain. I can't comment on the license of the original sqrt assembly that Chip shared with us, though.

    btw, ImageCraft, I'm really enjoying your C compiler (only played with the first Beta, still need to try out #4!). Thanks for the great work. If there is anything I can do to contribute, just let me know.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    lonesock
    Piranha are people too.
  • lonesocklonesock Posts: 917
    edited 2008-05-29 11:23
    Graham Stabler said...
    Is there a fast square as well as a root, i.e. faster than a straight multiply by the same number?

    Graham
    There are some interesting thoughts here:
    www.piclist.com/techref/microchip/math/sq/index.htm
    (of course it would be fun to work it out from scratch wink.gif)

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    lonesock
    Piranha are people too.
  • Graham StablerGraham Stabler Posts: 2,510
    edited 2008-05-29 12:52
    Chip, you read my mind, check out my site on parallel kinematics:

    http://www.indoor.flyer.co.uk/kinematics.htm

    Scroll down a little way and see how simple the kinematics of some machines can be, just need squares and roots.

    Graham
  • cgraceycgracey Posts: 14,246
    edited 2008-08-27 07:06
    I was looking at this thread to·recall the·improved square-root algorithm for the new Spin interpreter and I found a way to shave another long off the looped version.

    Here's how it was (note that 'h40000000' is a register containing $40000000):

    ······· mov···· root,#0········ 'reset root
    ······· mov···· mask,h40000000· 'reset mask (constant·in register)
    :loop·· or····· root,mask······ 'set trial bit
    ······· cmpsub· input,root· wc· 'subtract root from input if fits
    ······· sumnc·· root,mask······ 'cancel trial bit, set root bit if fit
    ······· shr···· root,#1········ 'shift root down
    ······· shr···· mask,#2········ 'shift mask down
    ······· tjnz··· mask,#:loop···· 'loop until mask empty



    Here's the smaller version (now 'mask' is a register containing $40000000 which gets used directly and rotated all the way back to its original value):

    ······· mov···· root,#0········ 'reset root
    ·:loop· or····· root,mask······ 'set trial bit
    ······· cmpsub· input,root· wc· 'subtract root from input if fits
    ······· sumnc·· root,mask······ 'cancel trial bit, set root bit if fit
    ······· shr···· root,#1········ 'shift root down
    ······· ror···· mask,#2·····wc· 'shift mask down (wraps on last iteration)
    ·if_nc· jmp···· #:loop······ ·· 'loop until mask restored

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Cluso99Cluso99 Posts: 18,069
    edited 2008-08-27 07:45
    I have enough space to use an LMM style inline routine or a small overlay section in the ram interpreter - I have over 40 longs available (and counting). So don't presume this cannot be done smile.gif Unfortunately my testing ground to a halt due to a few other things here. I have done a quick calc and it will be quite a bit faster!

    I want to get the bits I have done shrunk working before I post it. Chip has given me permission to post the code - Thanks Chip smile.gif

    I have a spin debugger semi working and a single-stepper nearly working (uses 8 longs in the interpreter and the code resides in hub - there are a few caveats because the Interpreter can execute in more that 1 cog concurrently and stumped me for a while!). I need this to verify the Ram Interpreter.

    I have a question: Does anyone know if the Prop I ignores the upper 16 bits in a pointer to hub ram ?

    hptr long $5555_7000
    rdlong x,hptr

    i.e. Does the $5555_0000 get ignored or must it be $0000_7000 ($7000 is the hub ram address)? (I am away and no hardware to test it on).

    Posteditted: I'll add the improved sqrt·if that's ok. I haven't looked at multiply or divide, so any suggestions would be good.

    Post Edited (Cluso99) : 8/27/2008 8:09:44 AM GMT
  • cgraceycgracey Posts: 14,246
    edited 2008-08-27 08:49
    Cluso99 said...

    I have a question: Does anyone know if the Prop I ignores the upper 16 bits in a pointer to hub ram ?

    hptr long $5555_7000
    rdlong x,hptr

    i.e. Does the $5555_0000 get ignored or must it be $0000_7000 ($7000 is the hub ram address)? (I am away and no hardware to test it on).
    Yes, those top 16 bits get ignored. Just the lower 16 bits are used.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • AleAle Posts: 2,363
    edited 2008-08-27 16:29
    Rotating the mask is the same method I used in the BCD addition and subtraction that I posted in the Math article. very powerful to save space and cycles.
  • Cluso99Cluso99 Posts: 18,069
    edited 2008-08-27 21:09
    Chip said...
    Do you think a hardware implementation of sqrt would be good for the next Propeller if it took, say, 16 clocks to execute, and was implemented as a register r/w rather than an instruction (no holding things up)? You could write the 32-bit value and 16+ clocks afterwards read back the 16-bit root.
    Wow Chip - that would be fantastic. Maybe the same for divide??- one for result, one for remainder??? Also love your sqrt smile.gif

    You have made me rethink onvernight. I am going to tidy the Ram Interpreter and post it untested with a statement of what I know works and where it is up to, so watch out for the thread. Hopefully I might get that done on the plane today.

    I will also start· a thread on spin and asm debuggers which·I have sort of working, as it may inspire the input like this thread has. cool.gif
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-08-27 21:40
    Chip,

    As long as you're making the gesture (and maybe now regretting it smile.gif ), a very useful set of register-based hardware macro functions would be:

    ····32 * 32 = 64 (unsigned multiply)
    ····64 / 32 = 32 rem 32 (unsigned divide)
    ····sqrt 64 = 32 (square root)

    This would make it possible to perform most 32-bit math without concerns for overflow or scaling issues.

    One very useful function to have in integer math is the trinary scaling operation: x * y / z, which requires the full precision of the product to be retained for the divide, thus effecting multiplication by a rational scaling factor y/z. The above operations would accommodate that, as well as Pythagorean distances in n-space to (nearly) 32-bit precision.

    Thanks,
    Phil

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    'Still some PropSTICK Kit bare PCBs left!
  • cgraceycgracey Posts: 14,246
    edited 2008-08-27 22:10
    Phil Pilgrim (PhiPi) said...
    Chip,

    As long as you're making the gesture (and maybe now regretting it smile.gif ), a very useful set of register-based hardware macro functions would be:

    ····32 * 32 = 64 (unsigned multiply)
    ····64 / 32 = 32 rem 32 (unsigned divide)
    ····sqrt 64 = 32 (square root)

    This would make it possible to perform most 32-bit math without concerns for overflow or scaling issues.

    One very useful function to have in integer math is the trinary scaling operation: x * y / z, which requires the full precision of the product to be retained for the divide, thus effecting multiplication by a rational scaling factor y/z. The above operations would accommodate that, as well as Pythagorean distances in n-space to (nearly) 32-bit precision.

    Thanks,
    Phil

    Phil,

    That's a lot of transistors, but I sure see your point. I imagine x*y would have to be computed before the divide-by-z could begin.

    One thing I made recently was a ratio computation circuit where you supply A and B, where B>A, and it returns a 32-bit fractional value. For example, 1/2 would yield $8000_0000. It's really great for computing·FRQ values for CTRs. Say you want to determine FRQ for 3.579545MHz, while you're main clock is 80MHz. You'd just put in 3_579_545 / 80_000_000 and get $0B74_5CFE out. Does this do anything for you? It's cheap in terms of silicon.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.

    Post Edited (Chip Gracey (Parallax)) : 8/27/2008 10:37:27 PM GMT
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-08-27 23:01
    Chip,

    The frequency computation issue is actually what spurred my thoughts on this topic. So, yes, a circuit dedicated to fractional division would be a huge help.

    If the generalized 64:32 operations require too many transistors (and I keep forgetting: that number has to be multiplied by the number of cogs), would it be possible to create simpler hardware building blocks for doing these unsigned operations, which the programmer could invoke iteratively? For example, 32 * 32 = 32 upper (unsigned), and 32 * 32 = 32 lower (unsigned). In the case of division, instead of 64 / 32 = 32 rem 32 (unsigned), maybe 32 rem 32 / 32 = 32 rem 32 (unsigned), which would permit division of multi-precision numerators if applied enough times. (I'm not sure either of these help the transistor count, though.)

    -Phil

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    'Still some PropSTICK Kit bare PCBs left!
Sign In or Register to comment.