Here is how I see it.
The FIFO was reduced from 16 longs to 8 longs when 16 cogs didn't fit, otherwise I agree with Evan. Block size for RDFAST/WRFAST was kept at 64 bytes so that 16 cogs will work most easily in the future. Fast block moves with setq don't use the one-and-only FIFO because they don't need to - these transfers happen in one clock cycle, once correct slice is selected and provided FIFO is not accessing hub RAM for its burst reads/writes.
Yes, where the data is vague is what happens when that slot is taken?
If data travels in order, the fast block move must defer to +8, and from there run until the next defer, or until it is done.
Q: Can a fast block move transfer data out-of-order? (The data does not say so.)
The instruction's data bursts can be interrupted by the FIFO's bursts, possibly multiple times in a single opcode if the instruction is copying a large number of longwords.
HubRAM burst data order for both the FIFO and instructions is always sequential - incrementing consecutive longword addresses.
PS: Minimum RDLONG execution time is the burst length + 8 clocks. So longer burst lengths have a throughput advantage within the cog. This won't limit the FIFO's access to hub RAM.
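That "burst length + 8 clocks" rule can be sanity-checked with a few lines of Python - a quick sketch of clocks-per-long for various burst sizes (the rule itself is from the post above; the burst sizes are just examples):

```python
# Sketch: effective clocks per long moved by one SETQ+RDLONG fast block
# move, assuming the stated minimum of (burst length + 8) clocks.
def clocks_per_long(burst_longs: int) -> float:
    """Total clocks divided by longs moved in one burst."""
    return (burst_longs + 8) / burst_longs

for n in (8, 16, 32, 40):
    print(n, clocks_per_long(n))
```

So the 8-clock startup overhead is amortized better by longer bursts: 8-long bursts cost 2 clocks per long, while 40-long bursts approach 1.2 clocks per long.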
If it takes around 1900 clocks to do the 1280 transfers in the non-pixel doubled 32bpp mode, then for a pixel doubled variant which only reads in 320 longs and writes out 640 longs I'm anticipating it will be just 75% of this, or 1425 clocks. The optimized unrolled copy routine I came up with then uses up 1280 more cycles for pixel doubling, so 1425 + 1280 = 2705 leaving ~500 clocks for any loop overheads and the time to dynamically build the 64 copy instructions in COG RAM plus per scanline housekeeping. I think we can still do it but it will be tight!
The only thing I see which might be different is that in this mode I might burst-transfer 32 longs at a time instead of 40 (for space reasons), which may interfere with the streamer in a different way. I guess if that doesn't work I could try moving down to 20 at a time instead. I can't do over 32 at a time as I don't have room for the unrolled instructions. Building the unrolled instruction sequence actually takes a while though.
I think I will need something like this loop...which itself takes 320 clocks! Anyone know a faster way to do it without reading it dynamically (which I could do but is sort of cheating)?
UPDATE: I've added things up for what I'd consider to be a heavily optimized case for doing this processing and get these numbers...
1) hub transfers: ~1425 clocks (anticipated based on measured values but could change)
2) pixel doubling unrolled loop setup: 320 clocks
3) pixel doubling: 1280 clocks
4) outer call overhead + scanline housekeeping: 210 clocks
Total = 3235 clocks.
The total is just over my desired 3200-clock budget but may still be okay. To put this in perspective, a TERC4 packet encode of 2 audio samples plus housekeeping appears to take ~3142 clocks when I counted it up, and I am allowing 3200 for that too...
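The budget items above add up as claimed - a trivial check (all numbers are the post's own estimates, not measurements):

```python
# Scanline clock budget from the post; the values are estimates.
budget = {
    "hub transfers": 1425,
    "pixel doubling unrolled loop setup": 320,
    "pixel doubling": 1280,
    "outer call overhead + scanline housekeeping": 210,
}
total = sum(budget.values())
print(total)          # 3235
print(total - 3200)   # 35 clocks over the 3200 target
```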
I'm measuring FIFOs to be 19 longwords deep. This will have a bearing on the high and low water marks, which in turn have a bearing on which slot/slice is the refilling position. So it's not as simple as just the start address, but still deterministic.
Good point about high and low water marks - do you know what those are on the streamer?
I was basing my spacing/interactions on the streamer filling, and then taking a long when available.
If it instead has hysteresis, and takes more longs, then it needs to do so much less often, and that makes a big difference to blocking effects.
e.g. a hysteresis of 8 means it reads 8 longs every 80 cycles, giving 70 free adjacent slots to any fast block move, and the impact on the fast block move is much reduced.
When I talk about the streamer doing DMA ops, the FIFO is always in between. From the prop2 doc:
There are three ways the hub FIFO interface can be used, and it can only be used for one of these, at a time:
Hub execution (when the PC is $00400..$FFFFF)
Streamer usage (background transfers from hub RAM → pins/DACs, or from pins → hub RAM)
Software usage (fast sequential-reading or sequential-writing instructions)
The FIFO's 8-clock (one rotation) bursts are how the streamer gets its data. Presumably, the FIFO's high water is 19 and the low water is 11.
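Those presumed water marks can be played with in a toy model - entirely hypothetical numbers, just to illustrate the burst cadence they would imply (a refill of 8 longs whenever the level drains to the guessed low mark of 11, topping back up to the guessed high mark of 19):

```python
# Toy model of the presumed FIFO water marks; purely illustrative.
HIGH, LOW, BURST = 19, 11, 8

level = HIGH
bursts = 0
for _ in range(64):        # streamer pops 64 longs, one at a time
    level -= 1
    if level <= LOW:       # drained to the low water mark...
        level += BURST     # ...one 8-clock rotation refills 8 longs
        bursts += 1

print(bursts, level)
```

With HIGH - LOW equal to the burst size, the refill cadence is exactly periodic: one 8-long burst per 8 longs consumed.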
I want to believe but I don't quite get it yet - how exactly does it move the 64 WRLUT instructions into the 64-long buffer space in COG RAM for later execution? ALTI sometimes gets confusing.
The ALTI instructions entirely take over the addressing of both cogram and lutram. Register "ptrs" holds both working addresses.
Lutram addresses (the S field of the WRLUTs) are incremented in ptrs by both ALTI instructions, whereas the cogram address (the D field of the WRLUTs) is incremented in ptrs only by the second ALTI.
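The increment pattern described above can be emulated loosely in Python - this is not real ALTI semantics, just an illustration of the address sequence the two-ALTI loop produces (register names and the base address are made up):

```python
# Loose emulation of the ALTI-pair address stepping: S (lutram dest)
# advances on both ALTIs, D (cogram source) only on the second, so each
# cogram long lands in two consecutive lutram slots.
def alti_pairs(cog_base: int, count: int):
    d, s = cog_base, 0          # D = cogram source, S = lutram destination
    out = []
    for _ in range(count):
        out.append((d, s)); s += 1          # first ALTI: bump S only
        out.append((d, s)); s += 1; d += 1  # second ALTI: bump S and D
    return out

print(alti_pairs(0x100, 3))
```

Each source register is written twice to adjacent LUT addresses - which is exactly the pixel-doubling pattern being constructed.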
I have some mouse sprite code coming together...it's bloating a little bit more than I'd like but I might still squeeze it in...
It's still incomplete as I still need to get the X offset computed and clip sides, accounting for X hotspot etc. Hopefully not too many more instructions...then time to test.
do_mouse
        getword a, mouse_xy, #1         'get mouse y co-ordinate
        getnib  b, mouseptr, #7         'get y hotspot (0-15)
        sub     a, b                    'compensate for it
        subr    a, scanline wc          'check if sprite is on scanline
if_nc   cmpr    a, #15 wc               'check if sprite is on scanline
if_c    ret     wcz                     'mouse sprite not in range, exit
        alts    bpp, #bittable          'bpp starts out as index from 0-5
        mov     bitmask, 0-0            'get table entry
        mul     a, bitmask              'multiply mouse row by its length
        shr     bitmask, #16 wz         'get mask portion
if_z    not     bitmask                 'fix up the 32 bpp case
        mov     bpp, bitmask
        ones    bpp                     'convert to real bpp
        add     a, mouseptr             'add offset to base mouse address
        setq2   #17-1                   'read 17 longs max, mask is first
        rdlong  $120, a                 '..followed by mouse image data
        setq2   #16-1                   'read 16 scanline longs into LUT
        rdlong  $110, linebufaddr
        ' TODO: determine X offset/clipping etc
        mov     ptra, #$110             'ptra used for source image data
        mov     ptrb, #$120             'ptrb used for mouse image data
        rdlut   c, ptrb++               'the mouse mask is stored first
        mul     offset, bpp             'calculate mouse offset (0-30)
        mov     mask, bitmask           'setup mask for pixel's data size
        rol     mask, offset            'align first pixel
        rdlut   a, ptra                 'get original scanline pixel data
        testb   bitmask, #0 wc          'set c initially to read sprite in
        rep     @endmouse, #16          'repeat loop for 16 pixels
if_c    rdlut   b, ptrb++               'get next mouse sprite pixel(s)
if_c    rol     b, offset               'align with the source input data
        shr     c, #1 wc                'get mask bit 1=set, 0=transparent
if_c    setq    mask                    'apply bit mask to muxq mask
if_nc   setq    #0                      'setup a transparent mouse pixel
        muxq    a, b                    'select original or mouse pixel
        rol     mask, bpp wc            'advance mask by 1,2,4,8,16,32
if_c    wrlut   a, ptra++               'write back data once we filled it
if_c    rdlut   a, ptra                 '...and read next source pixel(s)
        rol     bitmask, bpp wc         'rotate mouse data refill tracker
endmouse
        setq2   #16-1
        wrlong  $110, linebufaddr       'write LUT image data back to hub
        ret
bittable
        long    $0001_00_08             '1bpp
        long    $0003_00_08             '2bpp
        long    $000F_00_0C             '4bpp
        long    $00FF_00_14             '8bpp
        long    $FFFF_00_24             '16bpp
        long    $0000_00_44             '32bpp
bitmask long    0       'TODO eventually share other register names to eliminate
bpp     long    0
offset  long    0
mask    long    0
The ALTI instructions entirely take over the addressing of both cogram and lutram. Register "ptrs" holds both working addresses.
Lutram addresses (the S field of the WRLUTs) are incremented in ptrs by both ALTI instructions, whereas the cogram address (the D field of the WRLUTs) is incremented in ptrs only by the second ALTI.
Yeah but it doesn't construct new unrolled code anywhere does it? If this is for moving data to LUT it takes 4 clocks per long vs unrolled loop of 2 clocks per long.
Correct. There is still some trade-off: space and construction time for the unrolled 2 clocks per longword vs the compact 4 clocks per longword. Better than the earlier loop speeds though.
I'd like to create the unrolled loop code using indirection if it could get the loop to 4 instructions instead of 5. I wonder if that is possible. It would buy another 64 clocks.
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
BTW: I don't know if it's useful but he's freed up the PTRA register too.
Brian's code copies longs from a COGRAM buffer to LUT taking 4 clocks per long with indirection, with 4 instructions in the loop. Yes, it is faster than my current duplication code, which takes less than 6400 clocks, but still not fast enough for a 3200 cycle budget. To hit that lower timing budget I found I will need to use an unrolled loop of 64 COGRAM instructions, which then copies longs at a rate of one long per 2 clocks into LUT RAM (ie. 2x faster than Brian's code).
The problem is that the actual code to create this unrolled loop dynamically is itself taking 5 instructions to create each pair of WRLUT instructions, or 5 x 2 x 32 = 320 clocks in total. If I can shrink that creation loop down to 4 instructions somehow with indirection perhaps, it will then drop to 256 clocks and buy me a bit more of a safety buffer.
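The creation-loop arithmetic is easy to parameterize - a quick check of the clock cost per instruction count in the generator loop (5 is the current loop, 4 is the hoped-for indirection version, 3 appears later in the thread):

```python
# Clocks to build the 64-instruction unrolled doubler: N generator
# instructions per WRLUT pair, 2 clocks each, 32 pairs.
def build_clocks(instructions_per_pair: int, pairs: int = 32) -> int:
    return instructions_per_pair * 2 * pairs

print(build_clocks(5))  # 320
print(build_clocks(4))  # 256
print(build_clocks(3))  # 192
```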
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
That's pretty clever. Can we do something like that for the unrolled loop creation code?
The problem is that the actual code to create this unrolled loop dynamically is itself taking 5 instructions to create each pair of WRLUT instructions, or 5 x 2 x 32 = 320 clocks in total.
Oh! Must be time to not do the construction method then. See my latest rolled up version above.
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
That's pretty clever. Can we do something like that for the unrolled loop creation code?
I don't see how the dynamically generated unrolled loop can beat evanh's code, and you can do 40 long bursts like the other routines.
Read 320 pixels into COG at 66% efficiency.
Generate 640 pixels in LUT, 3 instructions for each of the 320 source pixels = 960 instructions x 2 clocks = 1920
Write 640 pixels from LUT to hub at 66% efficiency.
320 * 3/2 + 1920 + 640 * 3/2 = 3360 cycles, not accounting for loop overheads and other things.
The unrolled version is still a bit faster.
320 * 3/2 + 1280 + 640 * 3/2 + 320 (for unrolled code loop creation) = 3040. Still 320 cycles less.
Edit: updated earlier compensation factors from 4/3 to 3/2.
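TonyB_'s comparison above can be re-run in a few lines of Python (the 3/2 factor is the 66% hub-transfer efficiency compensation mentioned in the post):

```python
# Rolled 3-instruction doubler vs unrolled doubler, per scanline.
hub = 320 * 3 // 2 + 640 * 3 // 2   # read 320 longs + write 640 longs
rolled = hub + 320 * 3 * 2          # 3 instructions x 2 clocks per source pixel
unrolled = hub + 1280 + 320         # 2 clocks per output long + code creation

print(rolled, unrolled, rolled - unrolled)
```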
Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
Very cool evanh! Plus LUTRAM may work out ok for this special optimized 32bpp code doubler because the 256-entry palette is not going to be required for this 32bpp mode and I can reclaim it. Also it would let me use more of the COGRAM as the initial buffer area, to get 40- or 80-long transfer loops. If it works, I like! You might have just saved me about another 128 clocks.
For some reason I had it in my head that you were dynamically generating the code before each burst, rather than each line.
As you aren't likely to be switching modes from one line to the next, could you set a flag somewhere when the code is generated, and skip the generation if it is already there?
Then the generation happens once at start of frame, and it collapses to a simple flag check each line.
If you do switch modes from line to line, you regenerate as necessary, with the code that overwrites the routine clearing the flag.
Addit: Or use self-modifying code to switch between the generation routine and the generated routine as needed.
Seeing there was some concern here about FIFO metrics, I just added this to the documentation:
The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously until (cogs+6) stages are filled, after which 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows under all potential scenarios.
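The quoted metrics are a simple function of cog count - with 8 cogs this reproduces the 19-longword depth measured earlier in the thread:

```python
# FIFO metrics from the documentation excerpt above.
def fifo_stages(cogs: int) -> int:
    """Total FIFO depth in longword stages."""
    return cogs + 11

def fill_threshold(cogs: int) -> int:
    """Stages filled before the final 5 longs stream in."""
    return cogs + 6

print(fifo_stages(8), fill_threshold(8))    # 8-cog P2: 19 stages, fills to 14
print(fifo_stages(16), fill_threshold(16))  # 16-cog variant: 27 stages, fills to 22
```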
Comments
The only thing I see which might be different is that this mode I might burst transfer 32 longs at a time instead of 40 (for space reasons), which may interfere with the streamer in a different way. I guess if that doesn't work I could try to move down to 20 at a time instead. I can't do over 32 at a time as I don't have room for the unrolled instructions. To build the unrolled instruction sequence actually takes a while though.
To generate this unrolled sequence:
        wrlut   buffer, ptra++
        wrlut   buffer, ptra++
        wrlut   buffer+1, ptra++
        wrlut   buffer+1, ptra++
        wrlut   buffer+2, ptra++
        wrlut   buffer+2, ptra++
        ...
I think I will need something like this loop...which itself takes 320 clocks! Anyone know a faster way to do it without reading it dynamically (which I could do but is sort of cheating)?

        rep     #5, #32
        mov     0-0, instr
        add     $-1, _d0
        mov     1-0, instr
        add     $-1, _d0
        add     instr, _d0
        ...
instr   wrlut   buffer, ptra++
_d0     long    $200
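In higher-level terms, the generated WRLUT sequence just performs a pixel doubling - each source long is written twice to consecutive destination slots. The equivalent operation, sketched in Python:

```python
# What the unrolled WRLUT pairs accomplish: duplicate every source long.
def pixel_double(src):
    return [p for p in src for _ in range(2)]

print(pixel_double([0xAA, 0xBB, 0xCC]))
```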
Maybe using register indirection can help.
        setd    ptrs, #buffer
        sets    ptrs, #0
        rep     #4, #32
        alti    ptrs, #%100_111
        wrlut   0-0, #0-0
        alti    ptrs, #%111_111
        wrlut   0-0, #0-0
Kind of amusing how there are 36 unused machine code bits.
Yeah but it doesn't construct new unrolled code anywhere does it? If this is for moving data to LUT it takes 4 clocks per long vs unrolled loop of 2 clocks per long.
BTW: I don't know if it's useful but he's freed up the PTRA register too.
'copy buff to lutram with longword doubling
        setd    .dptr, #buff            'cogram address
        mov     ptra, #0                'lutram address
        rep     @.rend, #LEN
        alti    .dptr, #%111_000
        wrlut   0-0, ptra++
.dptr   wrlut   0, ptra++
.rend
Brian's code copies longs from a COGRAM buffer to LUT taking 4 clocks per long with indirection, with 4 instructions in the loop. Yes, it is faster than my current duplication code, which takes less than 6400 clocks, but still not fast enough for a 3200 cycle budget. To hit that lower timing budget I found I will need to use an unrolled loop of 64 COGRAM instructions, which then copies longs at a rate of one long per 2 clocks into LUT RAM (ie. 2x faster than Brian's code).
The problem is that the actual code to create this unrolled loop dynamically is itself taking 5 instructions to create each pair of WRLUT instructions, or 5 x 2 x 32 = 320 clocks in total. If I can shrink that creation loop down to 4 instructions somehow with indirection perhaps, it will then drop to 256 clocks and buy me a bit more of a safety buffer.
That's a nifty use of the pipeline there Evan! Very Cool
And that's powerful partly because there is a recent history of the output I can look back on and compare what I'd got on the previous attempts.
'construct in lutram an unrolled - copy buff to lutram with longword doubling
        setd    .instr, #buff                   'reset cogram address
        mov     ptra, #(LUT_end - LUT_code)     'lutram address
        rep     @.rend, #LEN
        wrlut   .instr, ptra++                  'copy instruction to lutram
        wrlut   .instr, ptra++                  ' ditto, double up
        alti    .instr, #%011_000               'increment buff pixel address in D field
.rend
        ...
.instr  wrlut   buff, ptra++