32-bit CORDIC algorithm in Spin

Here is a CORDIC algorithm written in Spin which can generate very precise sines and cosines, rotate x,y points around 0,0, and convert x,y to polar form (angle, length).
This code uses the Propeller Demo Board and a TV for viewing text. The main point of this, though is just the 'pri cordic' method, the 27-long lookup table, and the three a,x,y long variables. You can pull those out and implement them into your own code
Kwabena (Kye) is using this algorithm to process GPS coordinates (converted into 32-bit-angles) to determine distances between points on the earth with a numerical accuracy of a few inches.
Chip
This code uses the Propeller Demo Board and a TV for viewing text. The main point of this, though is just the 'pri cordic' method, the 27-long lookup table, and the three a,x,y long variables. You can pull those out and implement them into your own code
Kwabena (Kye) is using this algorithm to process GPS coordinates (converted into 32-bit-angles) to determine distances between points on the earth with a numerical accuracy of a few inches.
Chip
Comments
This routine is very elegant, and I'm sure Kye will produce great results.
Thanks again,
Massimo
True, but navigation algorithms are very complex, and if a great circle routine is so good you only error is the one from the GPS. I think this is a great achievement.
Massimo
Bruce
If it could actually be more precise, I'd be impressed...
The Propeller's sine rom has 2^13 samples, full circle, that are 16 bits in range.
This CORDIC algorithm has about 2^32 samples, full circle, that are about 30 bits in range.
So, it's got 2^19 better phase resolution and 2^14 better range resolution.
If you think about the earth being 24,900 miles in circumference, 2^13 discrete angles resolve to 3.04 miles each, while 2^32 discrete angles resolve to 0.367 inches each.
Considering that the earth's radius is about 3,960 miles, a 16-bit range yields 319 feet per count, while a 30-bit range yields 0.448 inches per count.
Lawson
I still recon that the earth is flat :-)
Probably Kye is considering also the ellipsoid... ;-)
Now, how high was that bridge? and the mast above the water? Oops!
Massimo
I'd like to see a graph or table that shows the improvement over using the table.
The main modification is to scale up small input values to get full (or almost full) resolution.
I changed it to use PST (I don't have TV at work)
I expanded the corlut table to 1 (this may not make any difference, I don't know)
Comment welcome
Bean
' Original program by Chip Gracey ' ' Modification by Terry Hitt (Bean) ' Changed from TV to PST (I don't have a TV available) ' Scale up inputs for better resolution with small input values ' Added 3 values to corlut to finish table to zero con _clkmode = xtal1 + pll16x _xinfreq = 5_000_000 obj PST : "Parallax Serial Terminal" var long a, x, y pub go PST.start(115200) ' Setup PST for 115200 baud (Bean) WaitCnt(cnt+ClkFreq*5) ' Give user 5 seconds to start PST (Bean) a := 0 x := -4 y := -3 show 'show original x,y and angle to rotate by cordic(1) 'rotate x,y show 'show cordic(0) 'convert to polar show 'show, should be very close to original repeat 99 'do a bunch more to see cumulative error cordic(1) cordic(0) show pri show ' Changed from TV to PST (Bean) PST.str(string("A ")) PST.dec(a) PST.char(13) PST.str(string(" X ")) PST.dec(x) PST.char(13) PST.str(string(" Y ")) PST.dec(y) PST.char(13) PST.char(13) pri cordic(atan2) | negate, i, da, dx, dy, shift ' CORDIC algorithm ' ' if atan2 = 0: x,y are rotated by angle in a ' if atan2 = 1: x,y are converted from cartesian to polar with angle->a, length->x ' ' - angle range: $0000_0000-$FFFF_FFFF = 0-359.9999999 degrees ' - hypotenuse of x,y must be within ±1_300_000_000 to avoid overflow ' - algorithm works best if x,y are kept large: ' example: x=$40000000, y=0, a=angle, cordic(0) performs cos,sin into x,y ' Scale up values for better resolution (Bean) shift := 0 repeat while (||x < 400_000_000) and (||y < 400_000_000) x *= 2 y *= 2 shift += 1 if atan2 'if atan2 mode, reset a a~ negate := x < 0 'check quadrant 2 | 3 for either atan2 or rotate mode else negate := (a ^ a << 1) & $80000000 if negate 'if quadrant 2 | 3, (may be outside ±106° convergence range) a += $80000000 '..add 180 degrees to angle -x '..negate x -y '..negate y repeat i from 0 to 29 'do CORDIC iterations (added 3 values to original values) da := corlut[i] dx := y ~> i dy := x ~> i if atan2 negate := y < 0 'if atan2 mode, drive y towards 0 else negate := a => 0 'if rotate mode, drive a towards 0 if negate -da -dx -dy a += da x += dx y -= dy 'remove CORDIC gain by multiplying by ~0.60725293500888 i := $4DBA76D4 x := (x ** i) << 1 + (x * i) >> 31 y := (y ** i) << 1 + (x * i) >> 31 ' Undo scaling of original data (Bean) x := (x + 1 << (shift-1)) ~> shift y := (y + 1 << (shift-1)) ~> shift dat corlut long $20000000 'CORDIC angle lookup table long $12E4051E long $09FB385B long $051111D4 long $028B0D43 long $0145D7E1 long $00A2F61E long $00517C55 long $0028BE53 long $00145F2F long $000A2F98 long $000517CC long $00028BE6 long $000145F3 long $0000A2FA long $0000517D long $000028BE long $0000145F long $00000A30 long $00000518 long $0000028C long $00000146 long $000000A3 long $00000051 long $00000029 long $00000014 long $0000000A long $00000005 ' 3 extra values added (Bean) long $00000003 long $00000001
You would have to replace the lookup table, start the iterations at i=1, instead of at i=0, then repeat every third iteration without changing i. The next Propeller has both circular and hyperbolic CORDIC in hardware.
Today my computer is dead with a virus, but I hope to have it reborn by noon. I will do it then and post it.
Chip
Good idea. I found through experimentation that if you provide as many sub-bits as you have whole bits, you will be lossless. For example, if you have 32-bit A,X,Y's, give them each 32-sub bits and not only is there no need to normalize them upwards, but they will be as accurate as possible. You cannot do any better than that.
I had also read, and verified through experimentation, that for 32-bit CORDIC variables, 27 iterations is optimal. Beyond that, you get too much truncation in the X and Y deltas and it amounts to a few LSBs' worth of noise in the results. It just fidgits around.
It does circular and hyperbolic functions etc, it stalled because of some small detail or other, I wonder if together we can whip it in to shape. It was in assembly too so very fast and might make a nice object, if I remember correctly it might have been 16 bit but could be made 32bit or switch-able.
Graham
Chip, thanks for the comments.
By "sub-bits" do you mean like a 32.32 fixed point number ? That might make more sense as it would allow fractional values to be used.
Bean
There is a PASM routine from Beau doing atan2... it could be a nice starting point.
Thanks Kye!
With the distance already done the course is just a little step away...
Did you test it for close points (say less than a mile)?
Massimo
Thanks too, Bean, for the scaling algorithm and the PST output.
It might be possible to condense this
shift := 0 repeat while (||x < 400_000_000) and (||y < 400_000_000) x *= 2 y *= 2 shift += 1
to something like,' enter with ||x =< $2000_0000 and ||y =< $2000_0000 ' (overly strict to keep hypotenuse<1300000000) shift := 30 - ( >| (||x <# ||y) ) ' encode the biggest of x and y into 32 bits x << shift ' left justify the largest into the field msb at $2000_0000 y << shift ' both shifted equally, angle unchanged
Sometimes one might not want to restore the shift directly at the end. For example, if x=1 and y=1 at entry, you want an angle of 45 degrees and a hypotenuse of 1.41421, and the fractional part is coded in the extra bits and would be lost by scaling back.My code is in PASM. I will upload it later.
Graham
But, Prop 2 will have a multiply instruction, so it seems you could do it faster with an infinite series approximation there...
Or, what am I missing?
I limited myself to calculating cis(theta) - ie sines and cosines rather than atan2 or arbitrary rotation, which means the multiply factor isn't needed (you just start with a scaled down value for x).
Anyway the great thing is that the SUMNC instruction comes into its own for the conditional negation and I've managed to get the loop down to 13 instructions
:loop movs :loadins, corindex ' overwrite source register with table entry regnum add corindex, #1 ' can't use loadins straight away, remember :loadins mov da, corlut ' this modified instruction indexes corlut mov dx, yy sar dx, shift ' dx = yy >> ii neg dy, xx sar dy, shift ' dy = -xx >> ii add shift, #1 cmps aa, #0 wc ' drive angle to zero sumnc aa, da ' sumnc instruction very handy for this. sumnc xx, dx sumnc yy, dy djnz nn, #:loop ' 52 cycles per loop, 27 loops = 1404 cycles
Series approximations don't always converge quickly. For some functions, Tchebychev polynomials provide a reasonable compromise, but for the amount of work the Prop II would have to get the same accuracy, I think my money would still be on the CORDIC method.
You do raise an interesting point, though. For example, the transcendental function library for the IBM 1130 (a 16-bitter w/ multiply) used Tchebychev polynomials. But I don't know if CORDIC had been thought of yet (1968).
-Phil
http://www.ganssle.com/approx/approx.pdf
It can do a arbitrarily good job with just a few multiplications...
The guru for polynomial approximations is Jack Crenshaw, whose Math Toolkit for Real-Time Programming is referenced in the nice Genssle pdf. Crenshaw goes into depth on computing one function after another, for the most part assuming there is an FPU available. One has to be struck by the many different tricks and mathematics that are involved with the unique personality of each function and the effort to find a polynomial or an algorithm that will converge quickly. The results are very good and can certainly whip CORDIC, again provided there is an FPU available, and provided there is enough memory to hold all the unique algorithms.
And actually, multiplication with floating point is fast even without an FPU.
But, it is very slow without a hardware multiply.
I guess that's my question... Wikipedia (not that it's always right) says that CORDIC is used primarily when you don't have a hardware multiply. But, the Prop 2 will have a hardware multiply. So, why use CORDIC on Prop2?
I think I remember something about the Prop2 having special CORDIC circuitry though... Maybe that's the reason...
But, it does look like CORDIC is the best option for Prop1 to do better than the ROM table.
(Although, I do wonder if an external table, like on SD card or other flash device, would be faster...)