Square root methods
Peter Jakacki
Posts: 10,193
I have just optimized a simple square root routine that I have in Tachyon Forth in high-level bytecode so it's not the fastest. But I'm curious as to what speeds to expect with these simple routines. I know when I introduce the floating point cog in V3 that this shouldn't matter too much, for FP methods that is.
A square root of 12257 is taking 1.4ms at present. Are there any methods you know of that aren't too complicated?
A square root of 12257 is taking 1.4ms at present. Are there any methods you know of that aren't too complicated?
{ Usage: 12257 SQRT . 111 ok Timing: 12257 LAP SQRT LAP .LAP SPACE . 1.400ms 111 ok 125 LAP SQRT LAP .LAP SPACE . 139.400us 11 ok 1234 LAP SQRT LAP .LAP SPACE . 413.400us 35 ok 1234 DUP * LAP SQRT LAP .LAP SPACE . 14.600ms 1234 ok } pub SQRT ( value -- sqrt ) 0 0 ROT 1- --- COUNT SUBT VALUE-1 BEGIN OVER - --- VALUE-SUBT DUP 31 >> NOT --- IF VALUE => 0 ( check sign ) WHILE ROT 1+ --- COUNT+1 ROT 2+ --- SUBT+2 ROT --- restore order REPEAT 2DROP --- return with COUNT ;
Comments
Anyway, have a look here: http://www.codecodex.com/wiki/Calculate_an_integer_square_root#C. No Forth examples but many other languages you can adapt from. I guess you need to find an algorithm that does not use multiply or divide.
Why not do this as a word defined in PASM ?
If there is no more kernel space and loading a RUNMOD also has some overhead
the penalty of LMM execution might not be so bad in comparison.
It should converge fairly quickly. Of course, there is that nasty division, but that's atomic, right?
-Phil
I didn't know what the technique was called but it's one of the methods I've seen that didn't involve division (that's atomic right Phil?) but it does get badly slower for bigger numbers. It's a tight bit of code that's suitable for some applications but PASM involves loading a cog and since there is no more room in the Tachyon cog then that either needs to run as a RUNMOD in Tachyon or as a math's cog I guess.
@Tracy - thanks for the BM, good material there and some of it looks familiar too.
That method is called "The Sum of Odd Integers". If you add sequential odd integers starting with 1 the total will always be a square and the number of integers added will be the square root. The process can be speeded up by grouping the number into pairs of integers from the right. Then do the subtraction until underflow on each pair starting at the left. This was the method used by a Friden EC132 desktop electronic calculator built in the 60's.
The good thing about this one is that it converges quickly and only requires multiplication which is still a fast operation in Tachyon.
So assuming we have executed [SQRT] first to load in that module we can just RUNMOD or alias that as SQRT and presto!
355.000000 113 / LAP SQRT LAP .LAP SPACE . 5.600us 1772 ok
1234 DUP * LAP SQRT LAP .LAP SPACE . 5.600us 1234 ok
2000000000 LAP SQRT LAP .LAP SPACE . 5.600us 44721 ok
So a square root always takes 5.6us and another 20us to load the module, so a once off takes under 26us, not bad really.
EDIT: the LOADMOD only really has to load in 10 longs so I modified [SQRT] which now takes 14us to load and modified the RUNMOD instruction which reduces to a single byte opcode so that the RUNMOD for SQRT takes 5.2us or 19.2us for a once off SQRT.
so what do you think about running kernel extensions in LMM mode.
In this case with LMM tking 4* cycles it should be around 20us as well, but without consuming the RUNMOD.
And such a mechanism might be handy for PASM routines above the RUNMOD size.
I think I was considering this for V3 as I also wanted to have inline assembler support hence another reason for vocabularies. However LMM can't run as fast as a RUNMOD and while it does take 14us to load the module the thing is that we would normally be doing more than a single SQRT operation at only 5.2us each time. But taking all things into consideration V3 will have a maths cog anyway although there is a latency with passing parameters to it and waiting for a response. Also bear in mind that there are these "coglets", objects that can be loaded into a cog from EEPROM or SD at runtime, even replaced etc.
P.S. This is really a little OT for this thread, it really belongs on the Tachyon thread.
Although, now that I think about it, the code would probably be longer. Especially if you wanted to interpolate for best accuracy.
Still, if speed is the main concern it seems this could be faster.
I recommend that you take a look at the number of possible answers in your application to determine if it's really necessary to actually calulate the square root.
Had a similar moment coding on a PLC the other day. T'was fine tuning a hand-coded PWM routine that ran at around 10Hz carrier and decided to check the overall PLC loop time and realised I had made it increase from something like 1.5ms to 2.5ms with only 10 instructions or so!
Thinking about what the operators would want I realised a mere 10 - 20 levels of adjustment is plenty. The modulation could be far coarser. So I downed the sample rate ten fold ... Hmm, that reminds me, I should probably test it.
I have now added a method that returns the result to 5 decimal places for that extra bit of precision.
4,000,000,000 UMSQRT 5 .DP 63,245.55320 ok
Of course this takes quite a bit longer at around 90us but it's the next best thing (or better) than floating point and as has been pointed out, in practice the values we encounter in most applications are far more limited, hence scaled integers, even 64-bits.