Here is how I see it.
The FIFO was reduced from 16 longs to 8 longs when 16 cogs didn't fit; otherwise I agree with Evan. The block size for RDFAST/WRFAST was kept at 64 bytes so that 16 cogs will work most easily in the future. Fast block moves with SETQ don't use the one-and-only FIFO because they don't need to - these transfers happen at one long per clock, once the correct slice is selected and provided the FIFO is not accessing hub RAM for its burst reads/writes.
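The slice selection mentioned here can be sketched with a toy model. This is a hedged illustration, not silicon-accurate: it assumes an 8-cog part, slice = long address mod 8, and an arbitrary rotation phase. The point it demonstrates is that once a cog lines up with the starting slice, consecutive long addresses land on consecutive slices, so a SETQ block move can stream one long per clock with no further waits.

```python
# Toy model of the P2 "egg-beater" hub (assumptions: 8 cogs/slices,
# slice = long address mod 8, rotation phase chosen for illustration).
COGS = 8

def wait_for_slice(cog, long_addr, clock):
    """Clocks until `cog` lines up with the slice holding `long_addr`."""
    slice_index = long_addr % COGS
    # Assume cog c faces slice (clock + c) % COGS on a given clock.
    return (slice_index - (clock + cog)) % COGS

# Initial alignment may cost up to 7 clocks...
aligned_clock = wait_for_slice(0, 100, 0)
# ...but the next sequential long is free on the very next clock.
follow_up = wait_for_slice(0, 101, aligned_clock + 1)
```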
Yes, where the data is vague is what happens when that slot is taken?
If data travels in order, the fast block move must defer to +8, and from there run until the next deferral, or until it is done.
Q: Can a fast block move transfer data out-of-order? (The data doesn't say so?)
The instruction data bursts can be interrupted by the FIFO bursts, possibly multiple times in a single opcode if the instruction is copying a large number of longwords.
HubRAM burst data order for both the FIFO and instructions is always sequential - incrementing consecutive longword addresses.
PS: Minimum RDLONG execution time is the burst length + 8 clocks, so longer burst lengths have a throughput advantage within the cog. This won't limit the FIFO's access to hub RAM.
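As a quick check of that rule, a throughput model (the burst-length + 8 figure is taken from the post; the rest is plain arithmetic showing how longer bursts amortise the fixed overhead):

```python
# Per the post: minimum RDLONG block execution time is burst length + 8.
def block_move_clocks(n_longs):
    return n_longs + 8

def clocks_per_long(n_longs):
    return block_move_clocks(n_longs) / n_longs

# 8-long bursts cost 2.0 clocks/long; 40-long bursts only 1.2 clocks/long.
```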
If it takes around 1900 clocks to do the 1280 transfers in the non-pixel-doubled 32bpp mode, then for a pixel-doubled variant, which only reads in 320 longs and writes out 640 longs, I'm anticipating it will be just 75% of this, or 1425 clocks. The optimized unrolled copy routine I came up with then uses up 1280 more cycles for pixel doubling, so 1425 + 1280 = 2705, leaving ~500 clocks for any loop overheads, the time to dynamically build the 64 copy instructions in COG RAM, and per-scanline housekeeping. I think we can still do it but it will be tight!
The only thing I see which might be different is that in this mode I might burst-transfer 32 longs at a time instead of 40 (for space reasons), which may interfere with the streamer in a different way. If that doesn't work I could try moving down to 20 at a time instead. I can't do more than 32 at a time as I don't have room for the unrolled instructions. Building the unrolled instruction sequence actually takes a while though.
I think I will need something like this loop, which itself takes 320 clocks! Does anyone know a faster way to do it without reading it dynamically (which I could do, but that's sort of cheating)?
UPDATE: I've added things up for what I'd consider to be a heavily optimized case for doing this processing and get these numbers...
1) hub transfers: ~1425 clocks (anticipated based on measured values but could change)
2) pixel doubling unrolled loop setup: 320 clocks
3) pixel doubling: 1280 clocks
4) outer call overhead + scanline housekeeping: 210 clocks
Total = 3235 clocks.
The total is just over my desired 3200-clock budget but may still be okay. To put this in perspective, a TERC4 packet encode of 2 audio samples plus housekeeping appears to take ~3142 clocks when I counted it up, and I am allowing 3200 for that...
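The UPDATE figures above can be totted up mechanically (all numbers are the post's own; the hub-transfer figure is anticipated rather than measured):

```python
# Sum of the per-scanline cost estimates from the UPDATE list.
parts = {
    "hub transfers (anticipated)": 1425,
    "unrolled loop setup":          320,
    "pixel doubling":              1280,
    "call overhead + housekeeping": 210,
}
total = sum(parts.values())
over_budget = total - 3200
# total -> 3235, i.e. 35 clocks over the 3200-clock budget
```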
I'm measuring FIFOs to be 19 longwords deep. This will have a bearing on the high and low water marks, which in turn has bearing on which slot/slice is the refilling position. So not as simple as just the start address, but still deterministic.
Good point about high and low water marks, do you know what those are on the streamer ?
I was basing my spacing/interactions on the streamer filling, and then taking a long when available.
If it instead has hysteresis, and takes more longs, then it needs to do so much less often, and that makes a big difference to blocking effects.
e.g. a hysteresis of 8 means it reads 8 longs every 80 cycles, giving ~70 free adjacent slots to any fast block move, and now the impact on the fast block move is much reduced.
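A rough model of that hysteresis argument, assuming the rate implied by the 8-longs-per-80-cycles example (one streamer long consumed per 10 clocks on average, refills at one long per clock). Both assumptions are read off the post, and the naive model comes out slightly above the ~70 estimate:

```python
# With hysteresis h the FIFO refills h longs back-to-back, one per
# clock, then leaves the hub alone until the next refill.
def free_run(h, period=10):
    cycle_clocks = h * period   # clocks between refill bursts
    burst_clocks = h            # clocks the refill burst occupies
    return cycle_clocks - burst_clocks

# free_run(1) -> 9 adjacent free slots; free_run(8) -> 72 (the post
# rounds this to ~70) - a much friendlier gap for a fast block move.
```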
When I talk about the streamer doing DMA ops, the FIFO is always in between. From the prop2 doc:
There are three ways the hub FIFO interface can be used, and it can only be used for one of these at a time:
Hub execution (when the PC is $00400..$FFFFF)
Streamer usage (background transfers from hub RAM → pins/DACs, or from pins → hub RAM)
Software usage (fast sequential-reading or sequential-writing instructions)
The FIFO's 8-clock (one rotation) bursts are how the streamer gets its data. Presumably the FIFO's high-water mark is 19 and the low-water mark is 11.
I want to believe, but I don't quite get it yet: how exactly does it move the 64 WRLUT instructions into the 64-long buffer space in COG RAM for later execution? ALTI sometimes gets confusing.
The ALTI instructions entirely take over the addressing of both cogram and lutram. Register "ptrs" holds both working addresses.
The lutram address (the S field of the WRLUTs) is incremented in ptrs by both ALTI instructions, whereas the cogram address (the D field of the WRLUTs) is incremented in ptrs only by the second ALTI.
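An illustrative Python model (not PASM2) of that increment pattern may help: both ALTIs bump the S (lutram) address, only the second also bumps the D (cogram) address, so each cogram long lands in two consecutive lutram cells - which is exactly the pixel doubling.

```python
# Each call models one ALTI+ALTI+WRLUT+WRLUT group: the redirected
# WRLUT writes are recorded as (cogram D, lutram S) pairs.
def alti_pair(d, s, writes):
    writes.append((d, s)); s += 1           # first ALTI: S++ only
    writes.append((d, s)); s += 1; d += 1   # second ALTI: S++ and D++
    return d, s

writes = []
d = s = 0
for _ in range(4):
    d, s = alti_pair(d, s, writes)
# writes -> [(0,0),(0,1),(1,2),(1,3),(2,4),(2,5),(3,6),(3,7)]
# i.e. cogram long 0 doubled into lutram 0 and 1, long 1 into 2 and 3...
```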
I have some mouse sprite code coming together... it's bloating a little more than I'd like, but I might still squeeze it in.
It's still incomplete as I still need to compute the X offset and clip the sides, accounting for the X hotspot etc. Hopefully not too many more instructions... then time to test.
do_mouse
getword a, mouse_xy, #1 'get mouse y co-ordinate
getnib b, mouseptr, #7 'get y hotspot (0-15)
sub a, b 'compensate for it
subr a, scanline wc 'check if sprite is on scanline
if_nc cmpr a, #15 wc 'check if sprite is on scanline
if_c ret wcz 'mouse sprite not in range, exit
alts bpp, #bittable 'bpp starts out as index from 0-5
mov bitmask, 0-0 'get table entry
mul a, bitmask 'multiply mouse row by its length
shr bitmask, #16 wz 'get mask portion
if_z not bitmask 'fix up the 32 bpp case
mov bpp, bitmask
ones bpp 'convert to real bpp
add a, mouseptr 'add offset to base mouse address
setq2 #17-1 'read 17 longs max, mask is first
rdlong $120, a '..followed by mouse image data
setq2 #16-1 'read 16 scanline longs into LUT
rdlong $110, linebufaddr ' TODO: determine X offset/clipping etc
mov ptra, #$110 'ptra used for source image data
mov ptrb, #$120 'ptrb used for mouse image data
rdlut c, ptrb++ 'the mouse mask is stored first
mul offset, bpp 'calculate mouse offset (0-30)
mov mask, bitmask 'setup mask for pixel's data size
rol mask, offset 'align first pixel
rdlut a, ptra 'get original scanline pixel data
testb bitmask, #0 wc 'set c initially to read sprite in
rep @endmouse, #16 'repeat loop for 16 pixels
if_c rdlut b, ptrb++ 'get next mouse sprite pixel(s)
if_c rol b, offset 'align with the source input data
shr c, #1 wc 'get mask bit 1=set, 0=transparent
if_c setq mask 'apply bit mask to muxq mask
if_nc setq #0 'setup a transparent mouse pixel
muxq a, b 'select original or mouse pixel
rol mask, bpp wc 'advance mask by 1,2,4,8,16,32
if_c wrlut a, ptra++ 'write back data once we filled it
if_c rdlut a, ptra '...and read next source pixel(s)
rol bitmask, bpp wc 'rotate mouse data refill tracker
endmouse
setq2 #16-1
wrlong $110, linebufaddr 'write LUT image data back to hub
ret
bittable
long $0001_00_08 '1bpp
long $0003_00_08 '2bpp
long $000F_00_0C '4bpp
long $00FF_00_14 '8bpp
long $FFFF_00_24 '16bpp
long $0000_00_44 '32bpp
bitmask long 0 'TODO eventually share other register names to eliminate
bpp long 0
offset long 0
mask long 0
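For anyone following along, here is a behavioural sketch in plain Python of what the REP loop computes. It works per pixel rather than on the packed longs the SETQ/MUXQ pair operates on, and the function name and 4bpp example are illustrative only:

```python
# Each mask bit selects the mouse pixel when set, else keeps the
# original scanline pixel - the muxq-style merge, modelled per pixel.
def composite(scanline, mouse, mask_bits, bpp):
    pixel_mask = (1 << bpp) - 1
    out = []
    for i, (bg, sp) in enumerate(zip(scanline, mouse)):
        opaque = (mask_bits >> i) & 1
        out.append(sp & pixel_mask if opaque else bg)
    return out

# 4bpp example: mask 0b0101 takes mouse pixels 0 and 2, keeps
# background pixels 1 and 3.
result = composite([1, 2, 3, 4], [9, 9, 9, 9], 0b0101, 4)  # [9, 2, 9, 4]
```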
The ALTI instructions entirely take over the addressing of both cogram and lutram. Register "ptrs" holds both working addresses.
The lutram address (the S field of the WRLUTs) is incremented in ptrs by both ALTI instructions, whereas the cogram address (the D field of the WRLUTs) is incremented in ptrs only by the second ALTI.
Yeah, but it doesn't construct new unrolled code anywhere, does it? If this is for moving data to LUT, it takes 4 clocks per long vs the unrolled loop's 2 clocks per long.
Correct. There is still a trade-off: space and construction time for the unrolled 2-clocks-per-longword copy vs the compact 4-clocks-per-longword one. Better than the earlier loop speeds though.
I'd like to create the unrolled loop code using indirection if that could get the loop down to 4 instructions instead of 5. I wonder if that is possible. It would buy another 64 clocks.
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
BTW: I don't know if it's useful but he's freed up the PTRA register too.
Brian's code copies longs from a COGRAM buffer to LUT at 4 clocks per long using indirection, with 4 instructions in the loop. Yes, it is faster than my current duplication code, which takes a little under 6400 clocks, but it is still not fast enough for a 3200-cycle budget. To hit that lower timing budget I found I will need an unrolled loop of 64 COGRAM instructions, which copies longs into LUT RAM at a rate of one long per 2 clocks (i.e. 2x faster than Brian's code).
The problem is that the code to create this unrolled loop dynamically itself takes 5 instructions per pair of WRLUT instructions generated, or 5 x 2 x 32 = 320 clocks in total. If I can shrink that creation loop down to 4 instructions somehow, perhaps with indirection, it will drop to 256 clocks and buy me a bit more of a safety buffer.
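That arithmetic as a one-liner, for playing with the numbers:

```python
# 32 WRLUT pairs are generated; each generator instruction costs 2 clocks.
def creation_clocks(instr_per_pair, pairs=32, clocks_each=2):
    return instr_per_pair * clocks_each * pairs

saving = creation_clocks(5) - creation_clocks(4)
# creation_clocks(5) -> 320, creation_clocks(4) -> 256, saving -> 64
```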
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
That's pretty clever. Can we do something like that for the unrolled loop creation code?
The problem is that the actual code to create this unrolled loop dynamically is itself taking 5 instructions to create each pair of WRLUT instructions, or 5 x 2 x 32 = 320 clocks in total.
Oh! Must be time to drop the construction method then. See my latest rolled-up version above.
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
That's pretty clever. Can we do something like that for the unrolled loop creation code?
I don't see how the dynamically generated unrolled loop can beat evanh's code, and you can do 40-long bursts like the other routines.
Read 320 pixels into COG at 66% efficiency.
Generate 640 pixels in LUT, 3 instructions for each of the 320 source pixels = 960 instructions x 2 clocks = 1920
Write 640 pixels from LUT to hub at 66% efficiency.
320 * 3/2 + 1920 + 640 * 3/2 = 3360 cycles, not accounting for loop overheads and other things.
The unrolled version is still a bit faster.
320 * 3/2 + 1280 + 640 * 3/2 + 320 (for unrolled code loop creation) = 3040. Still 320 cycles less.
Edit: updated earlier compensation factors from 4/3 to 3/2.
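Both estimates, written out as formulas (the 3/2 factor is the ~66% hub efficiency above; instructions executing from cogram cost 2 clocks each):

```python
HUB = 3 / 2  # hub transfer cost factor, clocks per long at ~66% efficiency

def rolled_loop():
    # read 320 longs, 3 instructions per source pixel, write 640 longs
    return 320 * HUB + 3 * 320 * 2 + 640 * HUB

def unrolled_loop(setup=320):
    # read 320 longs, 2 clocks per output long (640), write 640 longs,
    # plus the one-off cost of generating the unrolled code
    return 320 * HUB + 640 * 2 + 640 * HUB + setup

# rolled_loop() -> 3360.0, unrolled_loop() -> 3040.0: 320 cycles apart.
```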
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
Very cool evanh! Plus LUTRAM may work out okay for this special optimized 32bpp doubler, because the 256-entry palette is not required in this 32bpp mode and I can reclaim it. Also it would let me use more of the COGRAM as the initial buffer area, to get 40- or 80-long transfer loops. If it works, I like! You might have just saved me a decent number of additional clocks.
Read 320 pixels into COG at 66% efficiency.
Generate 640 pixels in LUT, 3 instructions for each of the 320 source pixels = 960 instructions x 2 clocks = 1920
Write 640 pixels from LUT to hub at 66% efficiency.
320 * 3/2 + 1920 + 640 * 3/2 = 3360 cycles, not accounting for loop overheads and other things.
The unrolled version is still a bit faster.
320 * 3/2 + 1280 + 640 * 3/2 + 320 (for unrolled code loop creation) = 3040. Still 320 cycles less.
Edit: updated earlier compensation factors from 4/3 to 3/2.
For some reason I had it in my head that you were dynamically generating the code before each burst, rather than each line.
As you aren't likely to be switching modes from one line to the next, could you set a flag somewhere when the code is generated, and skip the generation if it is already there?
Then the generation happens once at start of frame, and it collapses to a simple flag check each line.
If you do switch modes from line to line, you regenerate as necessary, with the code that overwrites the routine clearing the flag.
Addit: Or use self-modifying code to switch between the generation routine and the generated routine as needed.
Seeing there was some concern here about FIFO metrics, I just added this to the documentation:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously until (cogs+6) stages are filled, after which 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows under all potential scenarios.
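That paragraph, parameterised by cog count (the shipped P2 has 8 cogs, hence the 19-deep measurement earlier in the thread):

```python
# The documented FIFO metrics as functions of cog count.
def fifo_stages(cogs):
    return cogs + 11          # total FIFO depth

def fill_threshold(cogs):
    return cogs + 6           # loads continuously until this many filled

# 8 cogs: 19 stages, loads until 14 are filled, then up to 5 more
# longs can stream in.  A 16-cog variant would be 27 stages deep.
```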
Comments
Maybe using register indirection can help.
Kind of amusing how there are 36 unused machine code bits.
That's a nifty use of the pipeline there Evan! Very cool.
And that's powerful partly because there is a recent history of the output I can look back on and compare with what I got on the previous attempts.