Overlapping CORDIC commands to maximize throughput

I did a test today where I overlapped CORDIC commands and result fetches for the single most complex CORDIC instruction:
setq y '2 clocks
qrotate x,angle '2
getqy y '2
getqx x '2 = 8 clocks (minimum)
SETQ+QROTATE rotates a signed 32-bit (x,y) coordinate pair by a 32-bit angle, then returns the resultant (x,y) via GETQX and GETQY.
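For anyone who wants to sanity-check values on a PC, here is a rough floating-point model of what the rotation computes. It is only a sketch: the function names are made up for illustration, it assumes a positive angle rotates counter-clockwise with the full 32-bit range mapping to one turn, and it ignores the rounding of the actual CORDIC hardware, so it will not be bit-exact.

```python
import math

def to_signed32(v):
    """Interpret a 32-bit register value as a signed integer."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v & 0x80000000 else v

def qrotate_model(x, y, angle):
    """Floating-point stand-in for SETQ y / QROTATE x,angle / GETQX / GETQY.

    Assumption: angle is an unsigned 32-bit fraction of a full turn and
    positive angles rotate counter-clockwise. Returns signed (x, y).
    """
    theta = (angle & 0xFFFFFFFF) / 2**32 * 2.0 * math.pi
    xs, ys = to_signed32(x), to_signed32(y)
    xr = xs * math.cos(theta) - ys * math.sin(theta)
    yr = xs * math.sin(theta) + ys * math.cos(theta)
    return round(xr), round(yr)

# A quarter turn moves the starting point ($7F000000, 0) onto the y axis.
print(qrotate_model(0x7F000000, 0, 0x40000000))
```

Starting from (x,0) and rotating by the same angle over and over makes y trace out a sine, which is what the demo program relies on.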
With minimum timing, achieved by overlapping command issue with result fetches, a cog in an 8-cog P2 implementation can do one of these operations every 8 clocks. That leaves no time for hub memory accesses or indirection within the SETQ+QROTATE+GETQY+GETQX sequence: every instruction must be hard-coded with the register addresses its inputs are read from and its outputs are written to.
In my test, I did 32 operations, of which the middle 16 were completely overlapped. It took 312 clocks to complete (32*8 + 56). That comes to 9.75 clocks per operation (312/32). The more operations you overlap, the closer you get to 8 clocks per operation.
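The arithmetic above is easy to play with. This is a minimal sketch that assumes the 56-clock pipeline fill/drain overhead seen in the 32-operation test stays fixed for other batch sizes. That is an extrapolation, not something measured, and the function names are just for illustration.

```python
def batch_clocks(n_ops, clocks_per_op=8, overhead=56):
    """Total clocks for a batch, assuming a fixed fill/drain overhead."""
    return n_ops * clocks_per_op + overhead

def clocks_per_operation(n_ops):
    """Average clocks per CORDIC operation for a batch of n_ops."""
    return batch_clocks(n_ops) / n_ops

# The 32-operation case reproduces the figures from the test: 312 and 9.75.
for n in (32, 64, 256, 1024):
    print(n, batch_clocks(n), clocks_per_operation(n))
```

Under that assumption the overhead is amortized over the batch, so the average falls toward 8 clocks as the batch grows.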
The program I wrote rotates 32 sets of (x,y) coordinate pairs by ascending angle values and outputs the y values to 16-bit DACs on P0..P31. I'm running the P2 at 250MHz off a 20MHz crystal.
P0..P31 output 3.3V sine waves through 990-ohm DACs, from 10.0 kHz to 13.1 kHz in 100 Hz increments. The whole loop takes 2.33 µs, so the update rate is 428 kHz.
Here is the code:
'
' CORDIC overlapping command demo
'
' - outputs 32 sine waves of increasing frequency on P0..P31 using 990-ohm DACs
' - uses SETQ+QROTATE+GETQY+GETQX, the most-in/out-intensive CORDIC command
'
con f = $0010C400 'base frequency = ~10KHz, 100Hz steps
dat org
hubset ##%1_000001_0000011000_1111_10_00 'enable crystal+PLL, stay in 20MHz+ mode
waitx ##20_000_000/100 'wait ~10ms for crystal+PLL to stabilize
hubset ##%1_000001_0000011000_1111_10_11 'now switch to PLL running at 250MHz
rep @.r,#32 'ready to set P0..P31 to DAC mode
wrpin dacmode,i 'set 16-bit noise-dither DAC mode
wxpin #1,i 'set period to 1 for anytime updating
dirh i 'enable smart pin
incmod i,#31 'inc index, wrap to 0
.r
'
'
' Rotate 32 sets of (x,y) coordinates at different rates
' by overlapping CORDIC commands and result fetches
'
' 'clk sum
' 'w=wait !=cordic tick
'
loop setq y+00 '2 ? begin first 8 commands
qrotate x+00,a+00 '?w+2 2!
setq y+01 '2 4
qrotate x+01,a+01 '4w+2 10!
setq y+02 '2 12
qrotate x+02,a+02 '4w+2 18!
setq y+03 '2 20
qrotate x+03,a+03 '4w+2 26!
setq y+04 '2 28
qrotate x+04,a+04 '4w+2 34!
setq y+05 '2 36
qrotate x+05,a+05 '4w+2 42!
setq y+06 '2 44
qrotate x+06,a+06 '4w+2 50!
setq y+07 '2 52
qrotate x+07,a+07 '4w+2 58! result 00 is ready at 54!!!
getqy y+00 '2 60 get result 00, no waiting!!!
getqx x+00 '2 62
setq y+08 '2 64 begin overlapping commands and results
qrotate x+08,a+08 '2 66!
getqy y+01 '2 68
getqx x+01 '2 70
setq y+09 '2 72
qrotate x+09,a+09 '2 74!
getqy y+02 '2 76
getqx x+02 '2 78
setq y+10 '2 80
qrotate x+10,a+10 '2 82!
getqy y+03 '2 84
getqx x+03 '2 86
setq y+11 '2 88
qrotate x+11,a+11 '2 90!
getqy y+04 '2 92
getqx x+04 '2 94
setq y+12 '2 96
qrotate x+12,a+12 '2 98!
getqy y+05 '2 100
getqx x+05 '2 102
setq y+13 '2 104
qrotate x+13,a+13 '2 106!
getqy y+06 '2 108
getqx x+06 '2 110
setq y+14 '2 112
qrotate x+14,a+14 '2 114!
getqy y+07 '2 116
getqx x+07 '2 118
setq y+15 '2 120
qrotate x+15,a+15 '2 122!
getqy y+08 '2 124
getqx x+08 '2 126
setq y+16 '2 128
qrotate x+16,a+16 '2 130!
getqy y+09 '2 132
getqx x+09 '2 134
setq y+17 '2 136
qrotate x+17,a+17 '2 138!
getqy y+10 '2 140
getqx x+10 '2 142
setq y+18 '2 144
qrotate x+18,a+18 '2 146!
getqy y+11 '2 148
getqx x+11 '2 150
setq y+19 '2 152
qrotate x+19,a+19 '2 154!
getqy y+12 '2 156
getqx x+12 '2 158
setq y+20 '2 160
qrotate x+20,a+20 '2 162!
getqy y+13 '2 164
getqx x+13 '2 166
setq y+21 '2 168
qrotate x+21,a+21 '2 170!
getqy y+14 '2 172
getqx x+14 '2 174
setq y+22 '2 176
qrotate x+22,a+22 '2 178!
getqy y+15 '2 180
getqx x+15 '2 182
setq y+23 '2 184
qrotate x+23,a+23 '2 186!
getqy y+16 '2 188
getqx x+16 '2 190
setq y+24 '2 192
qrotate x+24,a+24 '2 194!
getqy y+17 '2 196
getqx x+17 '2 198
setq y+25 '2 200
qrotate x+25,a+25 '2 202!
getqy y+18 '2 204
getqx x+18 '2 206
setq y+26 '2 208
qrotate x+26,a+26 '2 210!
getqy y+19 '2 212
getqx x+19 '2 214
setq y+27 '2 216
qrotate x+27,a+27 '2 218!
getqy y+20 '2 220
getqx x+20 '2 222
setq y+28 '2 224
qrotate x+28,a+28 '2 226!
getqy y+21 '2 228
getqx x+21 '2 230
setq y+29 '2 232
qrotate x+29,a+29 '2 234!
getqy y+22 '2 236
getqx x+22 '2 238
setq y+30 '2 240
qrotate x+30,a+30 '2 242!
getqy y+23 '2 244
getqx x+23 '2 246
setq y+31 '2 248
qrotate x+31,a+31 '2 250!
getqy y+24 '2 252 get 8 trailing results
getqx x+24 '2 254
getqy y+25 '4w+2 260
getqx x+25 '2 262
getqy y+26 '4w+2 268
getqx x+26 '2 270
getqy y+27 '4w+2 276
getqx x+27 '2 278
getqy y+28 '4w+2 284
getqx x+28 '2 286
getqy y+29 '4w+2 292
getqx x+29 '2 294
getqy y+30 '4w+2 300
getqx x+30 '2 302
getqy y+31 '4w+2 308
getqx x+31 '2 310
'
'
' Output y[00..31] (sines) to P0..P31
'
rep @.r,#32 'ready to update 32 DACs
alts i,#y 'point next instruction's s at y[i] (i is advanced by incmod below)
getword j,0,#1 'get upper word of y
bitnot j,#15 'convert signed word to unsigned word for DAC output
wypin j,i 'update DAC output value
incmod i,#31 'inc index, wrap to 0
.r
drvnot #32 'toggle P32 on each iteration
jmp #loop
'
'
' Data
'
dacmode long %10100_00000000_01_00010_0 '990-ohm DAC + randomly-dithered 16-bit DAC mode
i long 0 'index
j long 0 'misc
x long $7F000000[32] 'initial (x,y) coordinates
y long $00000000[32]
a long 100*f,101*f,102*f,103*f,104*f,105*f,106*f,107*f 'ascending frequencies
long 108*f,109*f,110*f,111*f,112*f,113*f,114*f,115*f
long 116*f,117*f,118*f,119*f,120*f,121*f,122*f,123*f
long 124*f,125*f,126*f,127*f,128*f,129*f,130*f,131*f
EDIT: Found a bug in the program where P0 DAC was not updating. Fixed it.
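The signed-to-unsigned step in the DAC output loop (GETWORD followed by BITNOT) can be checked on a PC with a few lines. This is only a sketch of those two instructions; dac_word is a made-up name, and y stands for the 32-bit register value.

```python
def dac_word(y):
    """Model of: getword j,y,#1 / bitnot j,#15."""
    upper = ((y & 0xFFFFFFFF) >> 16) & 0xFFFF  # upper word of y
    return upper ^ 0x8000                      # flip bit 15: signed -> offset binary

# Zero lands at mid-scale, the most negative value at 0,
# and the initial amplitude $7F000000 near full scale.
for y in (0x80000000, 0x81000000, 0x00000000, 0x7F000000):
    print(hex(y), hex(dac_word(y)))
```

Flipping bit 15 is the same as adding $8000 modulo 65536, which maps the signed range -32768..32767 onto the 0..65535 range the DAC expects.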
Comments
Most importantly here, by batching up CORDIC operations in any mixture, a cog can approach a mere 8 clocks per operation, for even the most data-intensive CORDIC function.