Shop OBEX P1 Docs P2 Docs Learn Events
Square root methods — Parallax Forums

Square root methods

Peter JakackiPeter Jakacki Posts: 10,193
edited 2015-05-26 07:27 in Propeller 1
I have just optimized a simple square root routine that I have in Tachyon Forth in high-level bytecode so it's not the fastest. But I'm curious as to what speeds to expect with these simple routines. I know when I introduce the floating point cog in V3 that this shouldn't matter too much, for FP methods that is.

A square root of 12257 is taking 1.4ms at present. Are there any methods you know of that aren't too complicated?
{
Usage: 12257 SQRT . 111 ok
Timing: 
12257 LAP SQRT LAP .LAP SPACE .  1.400ms 111 ok
125 LAP SQRT LAP .LAP SPACE .  139.400us 11 ok
1234 LAP SQRT LAP .LAP SPACE .  413.400us 35 ok
1234 DUP * LAP SQRT LAP .LAP SPACE .  14.600ms 1234 ok
}
pub SQRT ( value -- sqrt )
    0 0 ROT 1-               --- COUNT SUBT VALUE-1 
    BEGIN
       OVER -                --- VALUE-SUBT
      DUP 31 >> NOT          --- IF VALUE => 0 ( check sign )
    WHILE
      ROT 1+                 --- COUNT+1
      ROT 2+                 --- SUBT+2
      ROT                    --- restore order
    REPEAT  
    2DROP                    --- return with COUNT
     ;

Comments

  • Heater.Heater. Posts: 21,230
    edited 2015-05-20 09:27
    If my Forth studies serve me well what you are doing there is the square root by difference table technique (or whatever it is called). Subtract ever increasing odd numbers until you cannot subtract any more. The number of subtractions is the root. Took me years to realize why that algorithm worked after I first saw it. Despite having done the maths behind it years before in school! Of course it gets badly slower for bigger numbers.

    Anyway, have a look here: http://www.codecodex.com/wiki/Calculate_an_integer_square_root#C. No Forth examples but many other languages you can adapt from. I guess you need to find an algorithm that does not use multiply or divide.

    Why not do this as a word defined in PASM ?
  • MJBMJB Posts: 1,235
    edited 2015-05-20 11:45
    Heater. wrote: »
    Why not do this as a word defined in PASM ?
    This might be a good application for a LMM style kernel extension.
    If there is no more kernel space and loading a RUNMOD also has some overhead
    the penalty of LMM execution might not be so bad in comparison.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2015-05-20 14:28
    Here's another technique for sqrt(x):
    d := 1
    q := x
    repeat until (||(d - q) =< 1)
      q := x / d
      d := (q + d) >> 1
    return d
    

    It should converge fairly quickly. Of course, there is that nasty division, but that's atomic, right?

    -Phil
  • Tracy AllenTracy Allen Posts: 6,664
    edited 2015-05-20 14:48
    I'd bookmarked this thread, fast-int-sqrt-for-the-prop. Especially post #14.
  • evanhevanh Posts: 16,083
    edited 2015-05-20 15:52
    Bookmarked! That's a pretty cool thread Tracy. Chip certainly had lots of ideas already brewing for the Prop2. It clearly inputed to the Prop2 feature set and can be seen echoing in Chip's post about the recently finished Hub CORDIC engine.
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-05-20 18:45
    Heater. wrote: »
    If my Forth studies serve me well what you are doing there is the square root by difference table technique (or whatever it is called). Subtract ever increasing odd numbers until you cannot subtract any more. The number of subtractions is the root. Took me years to realize why that algorithm worked after I first saw it. Despite having done the maths behind it years before in school! Of course it gets badly slower for bigger numbers.

    Anyway, have a look here: http://www.codecodex.com/wiki/Calculate_an_integer_square_root#C. No Forth examples but many other languages you can adapt from. I guess you need to find an algorithm that does not use multiply or divide.

    Why not do this as a word defined in PASM ?

    I didn't know what the technique was called but it's one of the methods I've seen that didn't involve division (that's atomic right Phil?) but it does get badly slower for bigger numbers. It's a tight bit of code that's suitable for some applications but PASM involves loading a cog and since there is no more room in the Tachyon cog then that either needs to run as a RUNMOD in Tachyon or as a math's cog I guess.

    @Tracy - thanks for the BM, good material there and some of it looks familiar too.
  • Hal AlbachHal Albach Posts: 747
    edited 2015-05-20 19:10
    I didn't know what the technique was called but it's one of the methods I've seen that didn't involve division (that's atomic right Phil?) but it does get badly slower for bigger numbers. It's a tight bit of code that's suitable for some applications but PASM involves loading a cog and since there is no more room in the Tachyon cog then that either needs to run as a RUNMOD in Tachyon or as a math's cog I guess.

    @Tracy - thanks for the BM, good material there and some of it looks familiar too.

    That method is called "The Sum of Odd Integers". If you add sequential odd integers starting with 1 the total will always be a square and the number of integers added will be the square root. The process can be speeded up by grouping the number into pairs of integers from the right. Then do the subtraction until underflow on each pair starting at the left. This was the method used by a Friden EC132 desktop electronic calculator built in the 60's.
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-05-20 19:49
    I found another method in the link that Heater gave me and tried one of them which seems to work quite well and it's simple too!
    { 
    Square root method based on this Tristan Muntsinger algorithm which uses multiplication:
        unsigned int c = 0x8000;  
        unsigned int g = 0x8000;  
      
        for(;;) {  
            if(g*g > n)  
                g ^= c;  
            c >>= 1;  
            if(c == 0)  
                return g;  
            g |= c;  
        }  
    
    Timing:
    1234 DUP * LAP SQRT LAP .LAP SPACE .  296.800us 1234 ok
    200000000 LAP SQRT LAP .LAP SPACE .  300.200us 14142 ok
    355.000000 113 / LAP SQRT LAP .LAP SPACE .  294.600us 1772 ok
    }
    
    pub SQRT ( n -- sqrt )
    	$8000 DUP ( n c g )
    	BEGIN
    	  DUP DUP * 4TH >			--- if (g*g > n)
    	    IF OVER XOR THEN			--- g ^= c
    	  SWAP 2/ ( n g c )			--- c >>= 1
    	  ?DUP					--- while c <> 0 
    	WHILE
    	  SWAP ( n c g ) OVER OR		--- g |= c
    	REPEAT
    	NIP					--- leave only g on stack
    	;
    

    The good thing about this one is that it converges quickly and only requires multiplication which is still a fast operation in Tachyon.
  • KeithEKeithE Posts: 957
    edited 2015-05-20 20:20
    I recall that some of the Forth CPUs (e.g. Novix) had a square root step instruction, and I sort of recall talk of cube root. But I can't remember how they worked, so it might only apply to the Verilog Propeller. Does anyone have any information for these? I believe that they produced a bit per clock in hardware - sort of like some of the more trivial hardware dividers.
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-05-20 22:21
    My other immediate option is to make this function a RUNMOD which is really a small PASM module that gets loaded into the Tachyon cog as needed since there isn't much room. So I took some of Chip's code and made that into a RUNMOD.
    org             _RUNMOD
    
    SQRT          mov       X,#0
    :loop         or        X,smask
                  cmpsub    tos,X wc
                  sumnc     X,smask
                  shr       X,#1
                  ror       smask,#2 wc
            if_nc jmp       #:loop
                  mov       tos,X
                  jmp       unext
    smask         long      $4000_0000
    


    So assuming we have executed [SQRT] first to load in that module we can just RUNMOD or alias that as SQRT and presto!

    355.000000 113 / LAP SQRT LAP .LAP SPACE . 5.600us 1772 ok
    1234 DUP * LAP SQRT LAP .LAP SPACE . 5.600us 1234 ok
    2000000000 LAP SQRT LAP .LAP SPACE . 5.600us 44721 ok


    So a square root always takes 5.6us and another 20us to load the module, so a once off takes under 26us, not bad really.

    EDIT: the LOADMOD only really has to load in 10 longs so I modified [SQRT] which now takes 14us to load and modified the RUNMOD instruction which reduces to a single byte opcode so that the RUNMOD for SQRT takes 5.2us or 19.2us for a once off SQRT.
  • MJBMJB Posts: 1,235
    edited 2015-05-21 07:14
    So assuming we have executed [SQRT] first to load in that module we can just RUNMOD or alias that as SQRT and presto!

    355.000000 113 / LAP SQRT LAP .LAP SPACE . 5.600us 1772 ok
    1234 DUP * LAP SQRT LAP .LAP SPACE . 5.600us 1234 ok
    2000000000 LAP SQRT LAP .LAP SPACE . 5.600us 44721 ok


    So a square root always takes 5.6us and another 20us to load the module, so a once off takes under 26us, not bad really.

    EDIT: the LOADMOD only really has to load in 10 longs so I modified [SQRT] which now takes 14us to load and modified the RUNMOD instruction which reduces to a single byte opcode so that the RUNMOD for SQRT takes 5.2us or 19.2us for a once off SQRT.

    so what do you think about running kernel extensions in LMM mode.
    In this case with LMM tking 4* cycles it should be around 20us as well, but without consuming the RUNMOD.
    And such a mechanism might be handy for PASM routines above the RUNMOD size.
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-05-21 07:34
    MJB wrote: »
    so what do you think about running kernel extensions in LMM mode.
    In this case with LMM tking 4* cycles it should be around 20us as well, but without consuming the RUNMOD.
    And such a mechanism might be handy for PASM routines above the RUNMOD size.

    I think I was considering this for V3 as I also wanted to have inline assembler support hence another reason for vocabularies. However LMM can't run as fast as a RUNMOD and while it does take 14us to load the module the thing is that we would normally be doing more than a single SQRT operation at only 5.2us each time. But taking all things into consideration V3 will have a maths cog anyway although there is a latency with passing parameters to it and waiting for a response. Also bear in mind that there are these "coglets", objects that can be loaded into a cog from EEPROM or SD at runtime, even replaced etc.

    P.S. This is really a little OT for this thread, it really belongs on the Tachyon thread.
  • RaymanRayman Posts: 14,793
    edited 2015-05-21 10:33
    Any reason not to just use the Prop's log and anti-log tables in ROM? Seems like that would be fastest...

    Although, now that I think about it, the code would probably be longer. Especially if you wanted to interpolate for best accuracy.
    Still, if speed is the main concern it seems this could be faster.
  • redheadedrodredheadedrod Posts: 78
    edited 2015-05-21 17:11
    Seems like we had to do this in my Assembly class the first semester.. I should look up my notes and see how we did it.
  • dbpagedbpage Posts: 217
    edited 2015-05-25 17:44
    I just want to mention one thing. I was tasked with developing an algorithm to calculate the root mean square from triaxial accelometers. I spent considerable amount of time developing fast square root methods when I realized that the possible answers fell within the range of 2.5 to 10.0 Gs. I immediately abandoned my efforts and created a table from 25 to 100, a mere 76 values, to lookup the answer. That saved code space, processing time and code maintainability.

    I recommend that you take a look at the number of possible answers in your application to determine if it's really necessary to actually calulate the square root.
  • evanhevanh Posts: 16,083
    edited 2015-05-26 04:20
    dbpage wrote: »
    ... I immediately abandoned my efforts and created a table from 25 to 100, a mere 76 values, to lookup the answer. That saved code space, processing time and code maintainability.

    Had a similar moment coding on a PLC the other day. T'was fine tuning a hand-coded PWM routine that ran at around 10Hz carrier and decided to check the overall PLC loop time and realised I had made it increase from something like 1.5ms to 2.5ms with only 10 instructions or so!

    Thinking about what the operators would want I realised a mere 10 - 20 levels of adjustment is plenty. The modulation could be far coarser. So I downed the sample rate ten fold ... Hmm, that reminds me, I should probably test it.
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-05-26 07:27
    As mentioned back in post #11 I got this down to 5.2us including opcode fetch once the module has been loaded or 19.2us for a once off including loading the module.

    I have now added a method that returns the result to 5 decimal places for that extra bit of precision.
    4,000,000,000 UMSQRT 5 .DP 63,245.55320 ok

    Of course this takes quite a bit longer at around 90us but it's the next best thing (or better) than floating point and as has been pointed out, in practice the values we encounter in most applications are far more limited, hence scaled integers, even 64-bits.
Sign In or Register to comment.