Overlapping CORDIC commands to maximize throughput
cgracey
I did a test today where I overlapped CORDIC commands and result fetches for the single most complex CORDIC instruction:
        setq    y               '2 clocks
        qrotate x,angle         '2
        getqy   y               '2
        getqx   x               '2 = 8 clocks (minimum)
SETQ+QROTATE rotates a signed 32-bit (x,y) coordinate pair by a 32-bit angle, then returns the resultant (x,y) via GETQX and GETQY.
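For anyone who wants to check results on a PC, here is a rough floating-point stand-in for what SETQ+QROTATE computes. It is only a sketch: the name `qrotate_model` is made up for illustration, it is not bit-exact with the hardware CORDIC, and it assumes a counter-clockwise rotation with the 32-bit angle spanning one full turn.

```python
import math

def qrotate_model(x, y, angle):
    """Floating-point model of rotating (x, y) by a 32-bit angle.

    Assumptions (not taken from the hardware): counter-clockwise rotation,
    round-to-nearest results, no CORDIC gain or truncation effects.
    """
    # Treat the 32-bit angle as a fraction of a full circle (2**32 = 360 degrees).
    theta = 2.0 * math.pi * (angle & 0xFFFFFFFF) / 2**32
    c, s = math.cos(theta), math.sin(theta)
    return round(x * c - y * s), round(x * s + y * c)

# A quarter turn moves a point on the +x axis to the +y axis.
print(qrotate_model(0x7F000000, 0, 0x40000000))
```

With y starting at 0, the returned y is x*sin(angle), which is why the demo can use the y outputs directly as sine samples.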
With minimum timing, achievable by overlapping command issuing and result fetching, a cog in an 8-cog P2 implementation can do one of these operations every 8 clocks. There is no time for hub memory accesses or indirection for SETQ+QROTATE+GETQY+GETQX. All instructions must be hard-coded with the register addresses from which inputs are read and to which outputs are stored.
In my test, I did 32 operations, of which the middle 16 were completely overlapped. It took 312 clocks to complete (32*8 + 56). That comes to 9.75 clocks per operation (312/32). The more operations you overlap, the closer you get to 8 clocks per operation.
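The arithmetic above can be sketched in a few lines of Python. This assumes the 56-clock fill/drain overhead from the 32-operation measurement stays fixed for longer batches, which is an extrapolation and not something that was measured; the function names are illustrative only.

```python
CLOCKS_PER_OP = 8        # minimum spacing per operation quoted above (8-cog P2)
PIPELINE_OVERHEAD = 56   # clocks, from the 32-operation test (assumed constant)

def total_clocks(n_ops):
    """Estimated clocks for a batch of n_ops overlapped SETQ+QROTATE operations."""
    return CLOCKS_PER_OP * n_ops + PIPELINE_OVERHEAD

def clocks_per_op(n_ops):
    """Average clocks per operation for a batch of n_ops."""
    return total_clocks(n_ops) / n_ops

# Longer batches amortize the fixed overhead and approach 8 clocks per operation.
for n in (32, 64, 256, 1024):
    print(n, total_clocks(n), clocks_per_op(n))
```

For n = 32 this reproduces the measured 312 clocks and 9.75 clocks per operation.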
The program I wrote rotates 32 sets of (x,y) coordinate pairs by ascending angle values and outputs the y values to 16-bit DACs on P0..P31. I'm running the P2 at 250MHz off a 20MHz crystal.
P0..P31 output 990-ohm 3.3V sine waves from 10.0KHz..13.1KHz in 100Hz increments. The whole loop takes 2.33us, so the update rate is 428KHz.
Here is the code:
'
' CORDIC overlapping command demo
'
' - outputs 32 sine waves of increasing frequency on P0..P31 using 990-ohm DACs
' - uses SETQ+QROTATE+GETQY+GETQX, the most-in/out-intensive CORDIC command
'
con             f = $0010C400           'base frequency = ~10KHz, 100Hz steps

dat             org

                hubset  ##%1_000001_0000011000_1111_10_00       'enable crystal+PLL, stay in 20MHz+ mode
                waitx   ##20_000_000/100                        'wait ~10ms for crystal+PLL to stabilize
                hubset  ##%1_000001_0000011000_1111_10_11       'now switch to PLL running at 250MHz

                rep     @.r,#32                 'ready to set P0..P31 to DAC mode
                wrpin   dacmode,i               'set 16-bit noise-dither DAC mode
                wxpin   #1,i                    'set period to 1 for anytime updating
                dirh    i                       'enable smart pin
                incmod  i,#31                   'inc index, wrap to 0
.r
'
'
' Rotate 32 sets of (x,y) coordinates at different rates
' by overlapping CORDIC commands and result fetches
'
'                                               'clk    sum
'                                               'w=wait !=cordic tick
'
loop            setq    y+00                    '2      ?       begin first 8 commands
                qrotate x+00,a+00               '?w+2   2!
                setq    y+01                    '2      4
                qrotate x+01,a+01               '4w+2   10!
                setq    y+02                    '2      12
                qrotate x+02,a+02               '4w+2   18!
                setq    y+03                    '2      20
                qrotate x+03,a+03               '4w+2   26!
                setq    y+04                    '2      28
                qrotate x+04,a+04               '4w+2   34!
                setq    y+05                    '2      36
                qrotate x+05,a+05               '4w+2   42!
                setq    y+06                    '2      44
                qrotate x+06,a+06               '4w+2   50!
                setq    y+07                    '2      52
                qrotate x+07,a+07               '4w+2   58!     result 00 is ready at 54!!!

                getqy   y+00                    '2      60      get result 00, no waiting!!!
                getqx   x+00                    '2      62
                setq    y+08                    '2      64      begin overlapping commands and results
                qrotate x+08,a+08               '2      66!
                getqy   y+01                    '2      68
                getqx   x+01                    '2      70
                setq    y+09                    '2      72
                qrotate x+09,a+09               '2      74!
                getqy   y+02                    '2      76
                getqx   x+02                    '2      78
                setq    y+10                    '2      80
                qrotate x+10,a+10               '2      82!
                getqy   y+03                    '2      84
                getqx   x+03                    '2      86
                setq    y+11                    '2      88
                qrotate x+11,a+11               '2      90!
                getqy   y+04                    '2      92
                getqx   x+04                    '2      94
                setq    y+12                    '2      96
                qrotate x+12,a+12               '2      98!
                getqy   y+05                    '2      100
                getqx   x+05                    '2      102
                setq    y+13                    '2      104
                qrotate x+13,a+13               '2      106!
                getqy   y+06                    '2      108
                getqx   x+06                    '2      110
                setq    y+14                    '2      112
                qrotate x+14,a+14               '2      114!
                getqy   y+07                    '2      116
                getqx   x+07                    '2      118
                setq    y+15                    '2      120
                qrotate x+15,a+15               '2      122!
                getqy   y+08                    '2      124
                getqx   x+08                    '2      126
                setq    y+16                    '2      128
                qrotate x+16,a+16               '2      130!
                getqy   y+09                    '2      132
                getqx   x+09                    '2      134
                setq    y+17                    '2      136
                qrotate x+17,a+17               '2      138!
                getqy   y+10                    '2      140
                getqx   x+10                    '2      142
                setq    y+18                    '2      144
                qrotate x+18,a+18               '2      146!
                getqy   y+11                    '2      148
                getqx   x+11                    '2      150
                setq    y+19                    '2      152
                qrotate x+19,a+19               '2      154!
                getqy   y+12                    '2      156
                getqx   x+12                    '2      158
                setq    y+20                    '2      160
                qrotate x+20,a+20               '2      162!
                getqy   y+13                    '2      164
                getqx   x+13                    '2      166
                setq    y+21                    '2      168
                qrotate x+21,a+21               '2      170!
                getqy   y+14                    '2      172
                getqx   x+14                    '2      174
                setq    y+22                    '2      176
                qrotate x+22,a+22               '2      178!
                getqy   y+15                    '2      180
                getqx   x+15                    '2      182
                setq    y+23                    '2      184
                qrotate x+23,a+23               '2      186!
                getqy   y+16                    '2      188
                getqx   x+16                    '2      190
                setq    y+24                    '2      192
                qrotate x+24,a+24               '2      194!
                getqy   y+17                    '2      196
                getqx   x+17                    '2      198
                setq    y+25                    '2      200
                qrotate x+25,a+25               '2      202!
                getqy   y+18                    '2      204
                getqx   x+18                    '2      206
                setq    y+26                    '2      208
                qrotate x+26,a+26               '2      210!
                getqy   y+19                    '2      212
                getqx   x+19                    '2      214
                setq    y+27                    '2      216
                qrotate x+27,a+27               '2      218!
                getqy   y+20                    '2      220
                getqx   x+20                    '2      222
                setq    y+28                    '2      224
                qrotate x+28,a+28               '2      226!
                getqy   y+21                    '2      228
                getqx   x+21                    '2      230
                setq    y+29                    '2      232
                qrotate x+29,a+29               '2      234!
                getqy   y+22                    '2      236
                getqx   x+22                    '2      238
                setq    y+30                    '2      240
                qrotate x+30,a+30               '2      242!
                getqy   y+23                    '2      244
                getqx   x+23                    '2      246
                setq    y+31                    '2      248
                qrotate x+31,a+31               '2      250!

                getqy   y+24                    '2      252     get 8 trailing results
                getqx   x+24                    '2      254
                getqy   y+25                    '4w+2   260
                getqx   x+25                    '2      262
                getqy   y+26                    '4w+2   268
                getqx   x+26                    '2      270
                getqy   y+27                    '4w+2   276
                getqx   x+27                    '2      278
                getqy   y+28                    '4w+2   284
                getqx   x+28                    '2      286
                getqy   y+29                    '4w+2   292
                getqx   x+29                    '2      294
                getqy   y+30                    '4w+2   300
                getqx   x+30                    '2      302
                getqy   y+31                    '4w+2   308
                getqx   x+31                    '2      310
'
'
' Output y[00..31] (sines) to P0..P31
'
                rep     @.r,#32                 'ready to update 32 DACs
                alts    i,#y                    'get y[00..31] into next s and inc i
                getword j,0,#1                  'get upper word of y
                bitnot  j,#15                   'convert signed word to unsigned word for DAC output
                wypin   j,i                     'update DAC output value
                incmod  i,#31                   'inc index, wrap to 0
.r
                drvnot  #32                     'toggle P32 on each iteration

                jmp     #loop
'
'
' Data
'
dacmode         long    %10100_00000000_01_00010_0      '990-ohm DAC + randomly-dithered 16-bit DAC mode

i               long    0                       'index
j               long    0                       'misc

x               long    $7F000000[32]           'initial (x,y) coordinates
y               long    $00000000[32]

a               long    100*f,101*f,102*f,103*f,104*f,105*f,106*f,107*f 'ascending frequencies
                long    108*f,109*f,110*f,111*f,112*f,113*f,114*f,115*f
                long    116*f,117*f,118*f,119*f,120*f,121*f,122*f,123*f
                long    124*f,125*f,126*f,127*f,128*f,129*f,130*f,131*f
EDIT: Found a bug in the program where P0 DAC was not updating. Fixed it.
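The GETWORD+BITNOT step in the output loop turns each signed 32-bit y into an unsigned 16-bit DAC value. A small Python model of that conversion, as I read the two instructions (the helper name `dac_word` is just for illustration):

```python
def dac_word(y):
    """Model of GETWORD j,0,#1 followed by BITNOT j,#15 for a signed 32-bit y."""
    upper = (y >> 16) & 0xFFFF   # keep the upper word of the 32-bit value
    return upper ^ 0x8000        # flip the sign bit: signed word -> offset-binary word

# Zero lands at mid-scale; the +/- $7F000000 peaks land near the two ends of the range.
for y in (-0x7F000000, 0, 0x7F000000):
    print(hex(dac_word(y)))
```

So a sine swinging between -$7F000000 and +$7F000000 maps to DAC codes from $0100 to $FF00, centered on $8000.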
Comments
Most importantly here, by batching up CORDIC operations in any mixture, a cog can approach a mere 8 clocks per operation, for even the most data-intensive CORDIC function.