TonyB_, I'd say 8 unrolls buys a bigger budget but still seems a fair bit riskier for COG RAM use at this point. Four unrolls will thankfully still probably work unless we discover some as yet unknown limit there. So we need some sample mouse sprite code worked through to get a feel for how much of the remaining cycle budget and COG RAM space it will consume in COG-A, and whether 8 unrolls is going to be realistic or required. There also still needs to be enough COG space left to support all the per scan line and per frame housekeeping, the extra code to read back in any updated palette, mouse state and new mode for the next frame, and whatever other vertical blanking code needs to run if some of its COG space has been temporarily reallocated to the 256 entry table and the larger unrolled loop code in this 256 colour mode for both COGs.
IMHO once a single mouse cursor sprite and any other required housekeeping code is added, I wouldn't be too worried if there is not much of the cycle budget left for extra functions or further sprites in this colour mode, though others may disagree if they have special graphics features/effects in mind that might still get squeezed in. Enabling 720 pixel wide lines might be useful too for wider HDTV compatibility. By the way, the budget is already bigger there because that width uses a 27MHz pixel clock, so perhaps it should just be coded as a slightly different driver with that fixed resolution and clock budget in mind.
It seems likely you would tack on more COGs in front of this graphics display pipeline to feed into the COGA/COGB pair if you ever wanted to fully support multiple sprites/tiles with full scrolling etc, certainly with this very demanding 256 colour mode. Basically I think we just need something able to display an array of scanlines in memory holding a byte per pixel, up to 480 lines for a full frame buffer, or alternatively a much smaller set of scanlines that are dynamically constructed by other COGs on the fly, plus the ability to display our own single transparent mouse sprite at some particular screen co-ordinates for some GUI support. And this contained within outer code that can dynamically select the other modes on a per frame basis to enable 16 colour graphics and colour text with dynamic fonts and a programmable text cursor. That combination of features in a bitbang driver is already going to be pretty versatile and I would hope should be sufficient until rev B silicon comes along with better support for HDMI. So I wouldn't see the need for too much more to come in this driver. Plug in HDMI audio support etc could/would be done by other COGs that can interoperate and synchronize with this driver.
rogloh, 720 pixels is a bleak prospect with four unrolls as TMDS encoding time is 8280 cycles and the total is 8580. We could use the lower half of the LUT RAM for code probably, to replace the first 256 longs of cog RAM overwritten by the palette. I've started to think about the cursor. One possibility is a couple of 32x32 single colour+transparent sprites at the same x,y coordinates. It takes twice as many instructions to create a byte mask from a bit pattern, compared to a nibble mask (four, not two). A single 256 colour sprite might be easier.
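For anyone wanting to see how tight that is, here is a quick Python sketch (PC side, not P2 code) that only uses the two figures quoted in the post above. The function names are made up for illustration, and treating everything that is not TMDS encoding as "spare" is my simplifying assumption.

```python
# Figures quoted in the post above for a 720 pixel line with four unrolls.
TOTAL_CYCLES_PER_LINE = 8580   # total cycle budget per scan line
TMDS_ENCODE_CYCLES = 8280      # cycles spent TMDS encoding the 720 pixels
PIXELS_PER_LINE = 720


def spare_cycles(total=TOTAL_CYCLES_PER_LINE, tmds=TMDS_ENCODE_CYCLES):
    """Cycles left per scan line for everything that is not TMDS encoding."""
    return total - tmds


def encode_cycles_per_pixel(tmds=TMDS_ENCODE_CYCLES, pixels=PIXELS_PER_LINE):
    """Average TMDS encoding cost per visible pixel."""
    return tmds / pixels


if __name__ == "__main__":
    print("spare cycles per line:", spare_cycles())
    print("encode cycles per pixel:", encode_cycles_per_pixel())
    print("spare as % of line:", round(100 * spare_cycles() / TOTAL_CYCLES_PER_LINE, 1))
```

So any mouse sprite code, palette reload and other per line housekeeping would all have to fit in that leftover at 720 pixels, which is why the width looks bleak with four unrolls.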
Yeah that budget for doing 720 pixels does seem bleak, though that width is probably not the prime focus right now and can hopefully be re-evaluated later with future optimizations if possible.
I thought of a few things that might be possible with the mouse sprite if eventually required for space/time reasons:
-we could just do 16x16 instead of 32x32 to save both its storage space and HUB/LUT read/write instructions
-do colour only instead of monochrome if that saves cycles (mono can be emulated with the colour version). I still have to check which type of mouse actually needs more processing work, but it sort of seems like colour may be faster.
-compute the coloured mouse sprite's different byte rotations within its storage words for all its scanlines in advance at blanking time for the mouse's current X co-ordinate, storing this back into HUB RAM, and just select one of these scanlines for burst reading by comparing the mouse's Y co-ordinates against the current scanline to see if it falls within the right range to be displayed.
I added some baseline code below to get a feel for how many cycles this mouse sprite may take. I have implemented a colour 32x32 mouse and assumed some further work is done during blanking to compute addresses etc. This code takes 139 clock cycles to do the 32x32 mouse and I think probably 94 clock cycles for 16x16. It might be possible to shave it down slightly with more optimizations. Untested, so there may be bugs/omissions here.
Update: found an optimization to shave 4 clock cycles and 2 COG registers off the numbers above by replacing the first two instructions with the pointer computation below them, i.e. just do this:
mov ptr, y
sub ptr, mouseYstart wc
instead of the first two instructions below.
Update2: found another one instruction saving. If you really need another 2 clocks you could replicate the if_nc condition all the way through the code after the mouseYend comparison then you can omit the line:
if_c jmp #tmds
' Sample mouse sprite code supporting a 32x32 transparent mouse pixel image in 256 colour mode for bitbang HDMI driver
' total execution time = 139 clock cycles with unrolled loop, a smaller sized rolled up version takes 211 clock cycles!
' If a smaller 16x16 mouse was used it instead would save 9+4+32 = 45 clock cycles, making only 94 clock cycles for the mouse.
' COGRAM use is 85 longs for the unrolled version, and some 28 of these COG RAM registers can also be reused later for other purposes.
' Rolled version takes 61 COG longs.
cmp y, mouseYstart wc ' I assume y is already tracking the scanline number in the frame
if_nc cmp mouseYend, y wc
if_c jmp #tmds ' don't execute mouse code if it's not on this scanline
' compute mouse image/mask start address in hub RAM for this scanline
mov ptr, y
sub ptr, mouseYstart
mul ptr, #18*4
add ptr, mousebaseaddr
' read mouse image and mask for this scanline, data is already pre-rotated for current mouse X position
' read 18 longs = holding 36 bytes of pixels followed by 36 bytes of masks
setq #18-1
rdlong mousedatabuf, ptr ' assume 27 clock cycle execution for burst transfer
' compute original pixel buffer address at the current mouse position on screen
mov ptr, mouseXoffset
add ptr, scanlinebufaddr
' read 9 longs = 36 pixels from pixel buffer of scanline around the mouse write starting position
setq #9-1
rdlong origpixelbuf, ptr ' assume 18 clock cycles worst case for burst transfer
' prepare LUT RAM write address
mov ptr, lutbufptr
' repeat these next 4 instructions 9 times with adjusted register offsets each iteration,
' or expand loop to use selfmodifying code and use rep - but that is much slower as it
' requires 3 more register address update instructions inside the loop
' unrolled loop version takes (4*9 - 1)*2 = 70 clock cycles and consumes 35 registers,
' while rep version takes (7*9 + 2 + ~6)*2 ~= 142 cycles and need about 11 COG RAM registers
' including some extras for the first initialization - so I'd choose an unrolled version
and origpixelbuf, mousemaskbuf ' and with mouse mask at long offset 9 from mouse pixel data
or origpixelbuf, mousedatabuf ' modify pixels
wrlut origpixelbuf, ptr ' write to LUT
add ptr,#1
' .... repeat above 9 times increasing the static buffer addresses as required by 1 each iteration
tmds
' rest of TMDS code follows after mouse sprite
' mouse storage follows
scanlinebufaddr long 0 ' holds start address of scanline pixel buffer in hub RAM
mousebaseaddr long 0 ' holds start address of mouse image in hub RAM
mouseYstart long 0 ' holds first screen scanline of mouse
mouseYend long 0 ' holds last screen scanline of mouse
mouseXoffset long 0 ' holds byte offset of mouse on scanline (but long aligned)
lutbufptr long 0 ' holds lut RAM mouse start address
y long 0 ' holds current scanline
' working regs
ptr long 0
mousedatabuf long 0[9] ' this space could be shared with other working registers only used once at start of line
mousemaskbuf long 0[9]
origpixelbuf long 0[9]
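The clock counts quoted in the listing comments can be re-added with a few lines of Python. This is only a sketch of the arithmetic, assuming (as the post does) 2 clocks per ordinary instruction, 27 and 18 clocks for the two hub bursts, and 12 fixed instructions outside the loop (my count from the listing). The function names are hypothetical.

```python
CLOCKS_PER_INSTR = 2          # assumed cost of an ordinary cog instruction
FIXED_INSTRUCTIONS = 12       # range check (3), address calc (4), 2x setq, pixel ptr (2), LUT ptr (1)
MOUSE_BURST_CLOCKS = 27       # assumed: 18 long burst of mouse pixels + masks
PIXEL_BURST_CLOCKS = 18       # assumed worst case: 9 long burst of background pixels


def unrolled_loop_clocks(iterations=9, instr_per_iter=4):
    """Unrolled and/or/wrlut/add body; the final 'add ptr,#1' is not needed."""
    return (instr_per_iter * iterations - 1) * CLOCKS_PER_INSTR


def rep_loop_clocks(iterations=9):
    """Rolled up version: 7 instructions per pass plus setup overhead, per the listing comment."""
    return (7 * iterations + 2 + 6) * CLOCKS_PER_INSTR


def total_clocks(loop_clocks):
    """Fixed instructions plus both hub bursts plus the merge loop."""
    fixed = FIXED_INSTRUCTIONS * CLOCKS_PER_INSTR
    return fixed + MOUSE_BURST_CLOCKS + PIXEL_BURST_CLOCKS + loop_clocks


if __name__ == "__main__":
    print("32x32 unrolled:", total_clocks(unrolled_loop_clocks()))
    print("32x32 rolled  :", total_clocks(rep_loop_clocks()))
    # 16x16 estimate: the post's quoted savings of 9 + 4 burst clocks plus 32 loop clocks
    print("16x16 unrolled:", total_clocks(unrolled_loop_clocks()) - (9 + 4 + 32))
```

The 32 clock loop saving for a 16x16 mouse also falls out of the loop formula directly, since 5 iterations instead of 9 gives (4*5 - 1)*2 = 38 clocks rather than 70.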
Why is the background pixel pattern loaded separately? This seems to be doing double duty as it needs to be loaded anyway for conversion.
Your loop also fails to update the mousedatabuf pointer, so another instruction would need to be added.
If you ensure that the mouse pointer transparent colour is 0, and for 256 colour mode all other colours used have non-zero nibbles in all positions, you could use MUXNIBS to perform the masking and update in one operation, also saving the need for separate mask longs.
So you would simply step through the 9 longs for the pre shifted and clipped mouse pointer image with something like:
For a monochrome pointer you might investigate the use of SKIPF with a single long per scanline. This works LSB first, so the order of instructions needs to match the order of pixels in the long; flip the sequence if I've got it backwards. More code space than the coloured version.
AJL, I loaded the background pattern again, because the first load goes into LUT and it's slow to read it back out of there for the number of longs needed - I think it's actually faster to go read a second copy from hub which I admit is unintuitive.
The mousedatabuf in COGRAM gets loaded using the ptr register which is computed for the current scanline and base address of all the mouse's scanlines in the hub RAM. I think it is okay. If you meant in the unrolled loop portion, I already assumed the code would just be modified for each copy to account for the different offset in the next long registers being accessed/modified.
Yeah I also thought about SKIPF if a monochrome mouse is desired instead. Trick there might be dealing with rotated shifted pixels, you sort of need 36 instead of 32 pixels, so the mask doesn't quite fit. My unrolled loop with colour does one pixel per instruction.
Will have to look into your muxnibs to see how that might help. I'm not yet familiar with all these new P2 instructions. Update: I think muxnibs needs nibble oriented pixel data, this won't be suitable for the 256 colour modes with 8bpp, but could be for the 16 colour graphics mode.
This is all great, we have a workable mouse sprite baseline and if other people here can contribute even faster ways to do this we should check them all out. Show us your code!
So, if you read my previous post more carefully, you'll see that I state the conditions necessary for it to work with 8bpp colour: no zero nibbles in the selected mouse pointer colours, so colours $01-$10, $20, $30, $40, $50, $60, $70, $80, $90, $A0, $B0, $C0, $D0, $E0, and $F0 wouldn't work. This restriction doesn't seem particularly onerous to me.
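To make the nibble restriction concrete, here is a small Python model of MUXNIBS as I understand it (each non-zero nibble of S replaces the matching nibble of D, zero nibbles leave D alone). It is a PC side sketch, not tested against hardware, and the function name and sample pixel values are made up for illustration.

```python
def muxnibs(d, s):
    """Model of MUXNIBS D,S: every non-zero nibble in s overwrites that nibble in d."""
    result = d
    for n in range(8):                      # 8 nibbles in a 32 bit long
        nib = (s >> (4 * n)) & 0xF
        if nib:
            result = (result & ~(0xF << (4 * n))) | (nib << (4 * n))
    return result & 0xFFFFFFFF


def usable_mouse_colours():
    """8bpp colours with no zero nibble, i.e. safe to use with the MUXNIBS trick."""
    return [c for c in range(256) if (c & 0x0F) and (c & 0xF0)]


if __name__ == "__main__":
    # Four background pixels (one per byte) overlaid with mouse colours $FF and $FE, 0 = transparent
    print(hex(muxnibs(0x44332211, 0x00FE00FF)))
    # A colour with a zero nibble ($10) only replaces half of the background byte $AB
    print(hex(muxnibs(0x443322AB, 0x00000010)))
    print(len(usable_mouse_colours()), "usable colours")
```

The second example shows the failure case: colour $10 over background $AB comes out as $1B, which is neither the mouse colour nor the background.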
Also, while I'm prepared to be corrected, I believe the MUXNIBS instruction can work directly on LUTRAM by specifying an address between $00200 and $003FF by preceding it with an AUGD, so no need for a second load or copying it back. I don't have a means to test, but I think the direct LUTRAM access approach will be quicker and smaller.
Not yet Chip, but once I get a new P2D2 board and my own PCB board design finished, back and fitted with an HDMI connector, test setup etc hopefully in a few more weeks perhaps I will certainly be hoping to try it out.
At the moment most of this is all just design discussions between TonyB_, myself and few others interested in this to see if we have a chance to get it to work on the rev A P2 in the timing budget available. I do think there is hope for it to actually work including a graphics mode with 256 colours. I'd expect perhaps yourself or ozpropdev would be most likely to get something running quickly based on some of this stuff with your own ideas and P2 knowledge and your equipment, before I can, I'm maybe a few weeks away. It doesn't matter who gets this going first but I like the idea we can have something usable for others to play with at some point early in the release. A text based driver using HDMI/DVI will be pretty handy for console use and debug etc. Of course there is always standard analog VGA, but trying out HDMI/DVI with P2 should be fun, given the P2 was never really intended to do it, plus it's a good way to get your hands dirty and learn some new P2 stuff too. That's why I'm so interested in it, plus I did a custom Verilog thing recently with P1V that outputs over HDMI and a couple of my own sprite drivers for P1 so I now know a little bit about what types of features can come in handy, at least for me, and hopefully others.
AJL, yes if the restriction is put in place with non-zero nibbles and a subset of the colour range is supported for mouse colours it would appear that muxnibs is a possibility too. Whether that subset of colours concept fits in with the colour range people want I guess is TBD. For implementing monochrome pointers it would be fine with only two colours needed anyway. You could even fix them to 0xFF and 0xFE or something. The zero colour index in the palette is probably a good choice to always reserve as the transparent colour for sprite related things too.
I didn't realise that AUGD could also be used to identify LUTRAM, I was under the impression you had to use RDLUT/WRLUT only to access that RAM, but I am still learning as I go. Any further cycle penalty I wonder given it normally takes 3 cycles to access LUTRAM compared to two for normal COG RAM? Perhaps that can be utilised in other places in our code to reduce cycles, will have to keep it in mind for the future if it works the way you say. All these cool new tricks now with P2 to learn about. I think TonyB_ also mentioned you can execute from LUTRAM, that sounds interesting too.
Update: Applying the muxnibs stuff AJL identified to my code above, with no other changes to how original pixel buffer memory gets read twice from hub and no use of potential AUGD etc (until we know that is real), but including both the optimizations I identified earlier in my original code and a new one, I seem to find we can shave off up to 49 cycles from the mouse sprite code. This brings the total from the original 139 clocks down to 90 clocks. It also saves many more COG longs as well in the code (now down to 55 best case of which 19 are reusable registers). Pretty major improvement actually. Only actual limitation as such is we need to avoid certain colours with zero nibbles in the mouse sprite (256 total colours is not possible) and reserve 0 for the mouse's "transparent" colour. This avoids calculating additional shifted masks too in the vertical blanking code.
I also again looked at reading the LUTRAM directly to get the 9 long pixel buffer before masking, but it takes 3*9 = 27 clocks to get it back out of LUT to COGRAM in the best case with self modifying LUT addresses updated at blanking time, while my original approach of just burst loading 9 longs assuming worst case hub transfer latency and the prior address setup computation appears to just take 24 clock cycles and may be even faster if the hub window aligns. If AUGD works with LUTRAM however this result could potentially be changed in its favour.
Modified code attached:
' Sample mouse sprite code supporting a 32x32 transparent mouse pixel image in 256 colour mode for bitbang HDMI
' total = 90 clock cycles, 55 COG RAM registers
mov ptr, y ' I assume y is already tracking the scanline number in the frame
sub ptr, mouseYstart wc
if_nc cmp mouseYend, y wc
' compute mouse image/mask start address in hub RAM for this scanline
mul ptr, #9*4
add ptr, mousebaseaddr
' read mouse image for this scanline (no separate mask needed with muxnibs), data is already pre-shifted for current mouse X position
' read 9 longs = holding 36 bytes of mouse pixels
setq #9-1
rdlong mousedatabuf, ptr ' assume 18 clock cycle execution for burst transfer
' compute original pixel buffer address at the current mouse position on screen
mov ptr, mouseXoffset
add ptr, scanlinebufaddr
' read 9 longs = 36 pixels from pixel buffer of scanline around the mouse write starting position
setq #9-1
rdlong origpixelbuf, ptr ' assume 18 clock cycles worst case for burst transfer
muxnibs origpixelbuf+0, mousedatabuf+0
muxnibs origpixelbuf+1, mousedatabuf+1
muxnibs origpixelbuf+2, mousedatabuf+2
muxnibs origpixelbuf+3, mousedatabuf+3
muxnibs origpixelbuf+4, mousedatabuf+4
muxnibs origpixelbuf+5, mousedatabuf+5
muxnibs origpixelbuf+6, mousedatabuf+6
muxnibs origpixelbuf+7, mousedatabuf+7
muxnibs origpixelbuf+8, mousedatabuf+8
if_nc wrlut origpixelbuf+0, #0-0 ' patched LUT address value based on X coordinate of mouse
if_nc wrlut origpixelbuf+1, #0-0
if_nc wrlut origpixelbuf+2, #0-0
if_nc wrlut origpixelbuf+3, #0-0
if_nc wrlut origpixelbuf+4, #0-0
if_nc wrlut origpixelbuf+5, #0-0
if_nc wrlut origpixelbuf+6, #0-0
if_nc wrlut origpixelbuf+7, #0-0
if_nc wrlut origpixelbuf+8, #0-0
tmds
' rest of TMDS code follows after mouse sprite
' mouse storage follows
scanlinebufaddr long 0 ' holds start address of scanline pixel buffer in hub RAM
mousebaseaddr long 0 ' holds start address of mouse image in hub RAM
mouseYstart long 0 ' holds first screen scanline of mouse
mouseYend long 0 ' holds last screen scanline of mouse
mouseXoffset long 0 ' holds byte offset of mouse on scanline (but long aligned)
lutbufptr long 0 ' holds lut RAM mouse start address
y long 0 ' holds current scanline
' working regs
ptr long 0
mousedatabuf long 0[9] ' this space could be shared with other working registers as it is only needed once at the start of line for mouse
origpixelbuf long 0[9]
Fair enough - I thought it was too good to be true. Thanks ozpropdev. Looks like we have a new benchmark of 90 clocks at best right now, for a mouse sprite in 256 colour mode (with minor limitations). This is for 32x32 pixel image. For a 16x16 mouse you could save 24 more clocks and get to 66 clocks. This smaller mouse would save another 16 COG longs too.
You can execute from LUTRAM but the registers in the operands must be cog registers.
Both cog ram and lut ram are dual ported, but the second port on the lut has some restrictions as this is also used for the streamer etc. Not sure of all the ramifications.
AUGD and AUGS don't work to change registers from cog to lut though, because the lut's second port is not in parallel with the cog's second port. Again, not sure of all ramifications.
We could use SETQ+WMLONG to write 255 colour + transparent sprites on top of scan line data in hub RAM. Read pixels that sprite might overwrite first, read sprite, write sprite, do TMDS, then write back original pixels.
In our case with a mouse, I think it is best not to touch hub RAM. I know you could do the mouse that way, but it's better hidden inside the COG IMO. Other COGs might be reading/writing the frame buffer, doing graphics block transfers etc and they don't want to see any temporary mouse pointer data in the buffer at random times.
I don't like the MUXNIBS limitation. Clipping won't be that tricky, simply adjust the sprite pointer and WMLONG block size for partial sprites. Also if hub RAM is shared with VGA output, sprite will appear there too. This is the intended application of WMLONG.
EDIT:
I accept the point about affecting the frame buffer.
I know what you mean about the colour limitation, it's a bit restrictive. It's possibly workable for a monochrome mouse only but not ideal and has palette implications.
Anyway I thought of another method that might be a reasonable compromise, it will increase the cycle count again though by 27 clocks.
instead of this optimized code for each mouse long
Ok I think we could avoid corrupting the full original frame buffer yet still use the method of WMLONG with bursts by doing this...
- read mouse image data from the scanline we are on (again 9 longs) with SETQ + RDLONG
- write back 9 longs from LUTRAM around the cursor X pos to a small hub scratch buffer with SETQ2 + WRLONG
- use WMLONG to write this mouse to the scratch buffer in hub RAM which will apply the mask for us
- use another RDLONG to read back new data from scratch into LUT RAM on top of original data read before mouse applied
This will only take 4 small additional burst reads/writes of 9 longs and just require 0 as the transparent colour. Far fewer COG instructions now for the mouse.
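The four steps above can be sketched in Python to show what ends up in the LUT. This is a PC side model only, assuming WMLONG writes just the non-zero bytes of each source long (which is how I read its description). The buffer and function names are hypothetical.

```python
def wmlong(dest, src):
    """Model of WMLONG on one long: only non-zero bytes of src are written over dest."""
    out = dest
    for b in range(4):
        byte = (src >> (8 * b)) & 0xFF
        if byte:
            out = (out & ~(0xFF << (8 * b))) | (byte << (8 * b))
    return out & 0xFFFFFFFF


def overlay_mouse_line(lut_pixels, mouse_line):
    """lut_pixels: 9 background longs around the cursor; mouse_line: 9 pre-shifted sprite longs."""
    scratch = list(lut_pixels)                                     # SETQ2+WRLONG: LUT -> hub scratch
    scratch = [wmlong(d, s) for d, s in zip(scratch, mouse_line)]  # SETQ+WMLONG: sprite over scratch
    return scratch                                                 # SETQ2+RDLONG: scratch -> LUT


if __name__ == "__main__":
    background = [0x44332211] * 9
    mouse = [0x00000010, 0x00FE00FF] + [0] * 7      # 0 = transparent, $10 is now a legal colour
    merged = overlay_mouse_line(background, mouse)
    print([hex(v) for v in merged[:3]])
```

Unlike the nibble based masking, any non-zero colour index works here, and the frame buffer itself is never modified because only the scratch copy is written.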
Now just 90 clocks worst case and 19 persistent COG RAM addresses, plus 10 temporary COG registers that can be reclaimed, or even just 1 temporary COG register if LUTRAM gets used to hold the mouse scanline data. In fact it's probably fewer clocks given we are doing back to back hub bursts, perhaps 2 egg beater hub cycles per burst access of 9 longs = 16 clocks each burst once the first transfer is done, so maybe no more than 84 cycles in the real world, perhaps even less with the last aligned burst transfer completing sooner or if the first burst is already hub aligned from prior work?
This is a very good way to go. Any edge clipping can be handled with the LUT RAM being wider than the scanline, and gross min/max mouse X position enforcement done during vertical blanking. I think this is the best one so far...
' Sample mouse sprite code supporting a 32x32 transparent mouse pixel image in 256 colour mode for bitbang HDMI
' total = 90 clock cycles worst case, more like 84 perhaps
mov ptr, y ' I assume y is already tracking the scanline number in the frame
sub ptr, mouseYstart wc
if_nc cmp mouseYend, y wc
' compute mouse image/mask start address in hub RAM for this scanline
mul ptr, #9*4
add ptr, mousebaseaddr
' read mouse image for this scanline, data is already pre-shifted for current mouse X position
' read 9 longs = holding 36 bytes of pixels
setq #9-1
rdlong mousedatabuf, ptr ' assume 18 clock cycle execution for burst transfer
' write a temporary pixel buffer in hub for the mouse
setq2 #9-1
wrlong lutbufptr, hubscratchptr
' overlay the mouse data
setq #9-1
wmlong mousedatabuf, hubscratchptr
' read data into LUT
setq2 #9-1
if_nc rdlong lutbufptr, hubscratchptr
tmds
' rest of TMDS code follows after mouse sprite
' mouse storage follows
mousebaseaddr long 0 ' holds start address of mouse image in hub RAM
mouseYstart long 0 ' holds first screen scanline of mouse
mouseYend long 0 ' holds last screen scanline of mouse
lutbufptr long 0 ' holds lut RAM mouse start address
hubscratchptr long 0 ' address of a scratch buffer in RAM for mouse overlay
y long 0 ' holds current scanline
' working regs
ptr long 0
mousedatabuf long 0[9] ' this space could be shared with other working registers as it is only needed once at the start of line for mouse
rogloh, here's my idea from earlier today:
Cog A                                     Instr          Longs  From  To
1. Read sprite line to sprite buffer      SETQ2+RDLONG   8      HUB   LUT
2. Read scan line to line buffer          SETQ2+RDLONG   160    HUB   LUT
3. Write pixels to temp buffer            SETQ2+WRLONG   9      LUT   HUB
4. Write sprite to temp buffer            SETQ2+WMLONG   9      LUT   HUB
5. Read sprite+pixels from temp buffer    SETQ2+RDLONG   <=9    HUB   LUT
During 2-3 cog B shifts the sprite data according to sprite_x[1:0] and there is just enough time to do this. Therefore we only need one lot of sprite data.
TonyB_, sounds interesting with only one copy of the sprite needed, can you show the code for your sprite shifting so we see how fast it is?
My latest sample mouse code doesn't need to do shifting during active scanlines as I was just planning to shift the sprite data once during vertical blanking - so there were only going to be 2 copies of the mouse data, the reference image, and the shifted image (per scanline). Maybe you've found a sufficiently fast way to do it on the fly?
' Cog B sprite data shifter
'
mov sprite_ptr,#sprite_buf
mov shift,sprite_x_y ' x coordinate in low bits
and shift,#3 wz ' shift = 0/1/2/3
shl shift,#3 ' shift = 0/8/16/24
'
if_z mov mask,#0 ' $00000000 when sprite_x[1:0]=00
if_nz not mask,#0
if_nz bitl mask,shift
if_nz signx mask,shift ' $000000FF/$0000FFFF/$00FFFFFF when sprite_x[1:0]=01/10/11
'
mov old_x,#0
rep @.end,#8
'
rdlut x,sprite_ptr
rol x,shift
mov new_x,x
andn new_x,mask
or new_x,old_x
wrlut new_x,sprite_ptr
add sprite_ptr,#1
mov old_x,x
and old_x,mask
.end
wrlut old_x,sprite_ptr
A look-up table could generate mask more quickly but I've opted to save longs here. I think sprite (x,y) should be in a single long and read at the start of every line, so that the sprite could appear on screen multiple times.
EDIT:
Sprite data shifting is not required when using WMLONG - see later posts.
Still working out the logic in your code but the cycle count looks good, 19*8 + 11*2 = 174 clock cycles for the mouse sprite shifting, so this could overlap the time taken to read in the 160 longs plus prepare the temporary buffer pretty nicely, as that would already take at least 164 + 13 = 177 cycles or so anyway in the fastest possible case.
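For anyone else working through the logic, here is a Python model of the shifter (PC side, not P2 code). It assumes pixel 0 sits in the low byte of each long, and it models the NOT/BITL/SIGNX mask trick and the ROL/ANDN/OR carry chain as I read them from the listing. The pack/unpack helpers and function names are my own additions.

```python
MASK32 = 0xFFFFFFFF


def rol(x, bits):
    """32 bit rotate left."""
    bits &= 31
    return ((x << bits) | (x >> (32 - bits))) & MASK32


def make_mask(shift):
    """shift = 0/8/16/24 gives $00000000/$000000FF/$0000FFFF/$00FFFFFF."""
    if shift == 0:
        return 0
    mask = MASK32 & ~(1 << shift)        # not mask,#0 then bitl mask,shift
    return mask & ((1 << shift) - 1)     # signx from the cleared bit zero-fills everything above it


def shift_sprite_line(longs, sprite_x):
    """Shift 8 sprite longs by sprite_x[1:0] pixels, producing 9 longs."""
    shift = (sprite_x & 3) << 3
    mask = make_mask(shift)
    out, old_x = [], 0
    for x in longs:                      # body of the rep loop
        x = rol(x, shift)
        out.append((x & ~mask & MASK32) | old_x)
        old_x = x & mask                 # bytes carried into the next long
    out.append(old_x)                    # final wrlut after the loop
    return out


def pack(pixels):
    return [int.from_bytes(bytes(pixels[i:i + 4]), "little") for i in range(0, len(pixels), 4)]


def unpack(longs):
    return [b for v in longs for b in v.to_bytes(4, "little")]


def shifter_clocks(loop_passes=8):
    """19 clocks per loop pass (one 3 clock rdlut + eight 2 clock instructions) plus 11 others."""
    return 19 * loop_passes + 11 * 2


if __name__ == "__main__":
    line = list(range(1, 33))            # 32 dummy pixel values
    print(unpack(shift_sprite_line(pack(line), 1))[:6])
    print("clocks:", shifter_clocks())
```

In this model a shift of k pixels simply moves the 32 sprite bytes k positions along the 36 byte output with zeros (transparent) filling the gaps, which is what the mouse overlay needs.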
It's a great use of spare time to make COG B pull its weight a little more while COG-A is doing some other work for it. Good job again.
You know, it is rather interesting to see that the sprite shifting work takes a COG quite a long time relative to the other work, especially considering we now have this new WMLONG instruction. I think that means an external sprite driver would run at its fastest if the sprite image data was already available as 4 copies in hub, one for each shift position. Then a single COG would be able to do many more sprites per scanline, assuming you want to be able to display them on single pixel boundaries. The same goes for tiles with scrolling.
This 4 copy idea is worth considering in the future if you want lots more sprites, but it would obviously burn hub RAM much faster. Alternatively you just use burst byte transfers instead of longs and live with the fact that is 4x slower to transfer all the data, however maybe with this extra shifting overhead the byte transfers really start to make sense if you want to conserve hub RAM.
Update: I trust WMLONG can be used to just do single masked byte transfers too on byte boundaries else it may not be quite so fast if restricted to longs, as that may still require some pre shifting per byte. I think the P2 long alignment is different now anyway to make this still possible.
On a P2 what actually happens if you write a long across a long boundary?
Assume writing long AABBCCDD to hub address 0x1
does the hub memory then look like this
Byte address Contents
0x00 AA
0x01 DD
0x02 CC
0x03 BB
or is another byte of the long written into address 0x4 with a second bus transfer, or something else altogether? I know the P1 would just do this
Byte address Contents
0x00 DD
0x01 CC
0x02 BB
0x03 AA
but I think P2 is different now, right? Documentation just says there are no special alignment rules for words and longs in hub RAM, but I'm wondering what this means for the data results.
Would WMLONG in a SETQ/SETQ2 burst writing to a non-aligned long boundary then take ~2x as long to complete compared to if it were aligned? Documentation mentions up to 20 clocks for single transfer but I am wondering about the burst transfers.
Comments
IMHO once a single mouse cursor sprite and any other required housekeeping code is added, personally I wouldn't be too worried if there is not that much of the cycle budget left for many other extra functions or further sprites etc in this colour mode, though others may disagree if they have special graphics features/effects in mind that might still get squeezed in. Maybe enabling 720 pixel wide lines would useful too for wider HDTV compatibility perhaps. By the way the budget is already bigger there because that width uses a 27MHz pixel clock so perhaps it should be just coded as a slightly different driver with that fixed resolution and clock budget in mind.
It seems likely you would tack on more COGs in front of this graphics display pipeline to feed into the COGA/COGB pair if you ever wanted to fully support multiple sprites/tiles with full scrolling etc, certainly with this very demanding 256 colour mode. Basically I think we just would need something to be able to display an array of scanlines in memory holding a byte per pixel, up to 480 lines for full frame buffer, or alternatively a much smaller set of scanlines that are dynamically constructed by other COGs on the fly, plus have the ability to display our own single transparent mouse sprite at some particular screen co-ordinates for some GUI support. And this contained within outer code that can dynamically select the other modes on a per frame bases to enable 16 colour graphics and colour text with dynamic fonts and a programmable text cursor. That combination of features in a bitbang driver is already going to be pretty versatile and I would hope should be sufficient until rev B silicon comes along with better support for HDMI. So I wouldn't see the need for too much more to come in this driver. Plug in HDMI audio support etc could/would be done by other COGs that can interoperate and synchronize with this driver.
I thought of a few things that might be possible with the mouse sprite if eventually required for space/time reasons:
-we could just do 16x16 instead of 32x32 to save both its storage space and HUB/LUT read/write instructions
-do colour only instead of monochrome if that saves cycles (as mono can be emulated away with colour version) have to check which type of mouse actually increases the processing work, not sure yet, sort of seems like colour may be faster.
-compute the coloured mouse sprite's different byte rotations within its storage words for all its scanlines in advance at blanking time for the mouse's current X co-ordinate, storing this back into HUB RAM, and just select one of these scanlines for burst reading using the mouse's Y co-ordinates compared against the current scanline to see it it falls within the right range to be displayed.
I added some baseline code below to get a feel for how many cycles this mouse sprite may take. I have implemented a colour 32x32 mouse and assumed some further work is done during blanking to compute addresses etc. This code takes 139 clock cycles to do the 32x32 mouse and I think probably 94 clock cycles for 16x16. It might be possible to shave it down slightly with more optimizations. Untested, so there may be bugs/omissions here.
Update: found an optimization to shave 4 clock cycles and 2 COG registers off the numbers above by replacing the first two instructions with the pointer computation below. ie. just do this:
mov ptr, y
sub ptr, mouseYstart wc
instead of the first two instructions below.
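For anyone following the logic rather than the instructions, this is roughly what that range check works out to, written as a Python model. The names `mouse_row` and `scanline_address`, the 32-line height and the 9-longs-per-row layout are assumptions for illustration only.

```python
def mouse_row(y, mouse_y_start, height=32):
    """Return the sprite row shown on scanline y, or None if the mouse is not on it."""
    row = y - mouse_y_start
    if row < 0:          # corresponds to the borrow (C set) from SUB ... WC
        return None
    if row >= height:    # corresponds to the later compare against mouseYend
        return None
    return row


def scanline_address(base, row, longs_per_row=9):
    """Hub address of the pre-shifted sprite data for this row (assumed layout)."""
    return base + row * longs_per_row * 4


if __name__ == "__main__":
    for y in (99, 100, 131, 132):
        print(y, mouse_row(y, mouse_y_start=100))
```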
Update2: found another one instruction saving. If you really need another 2 clocks you could replicate the if_nc condition all the way through the code after the mouseYend comparison then you can omit the line:
if_c jmp #tmds
Your loop also fails to update the mousedatabuf pointer, so another instruction would need to be added.
If you ensure that the mouse pointer transparent colour is 0, and for 256 colour mode all other colours used have non-zero nibbles in all positions, you could use MUXNIBS to perform the masking and update in one operation, also saving the need for separate mask longs.
So you would simply step through the 9 longs for the pre shifted and clipped mouse pointer image with something like:
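As a behavioural model only (Python, not PASM2), the per-long step can be pictured like this. It assumes MUXNIBS copies each non-zero nibble of the source into the destination and leaves the other destination nibbles alone, which is how I understand the instruction. The function name `muxnibs` is just a label for the model.

```python
def muxnibs(d, s):
    """Model of MUXNIBS D,S: every non-zero nibble of s replaces that nibble of d."""
    for n in range(8):
        shift = 4 * n
        nibble = (s >> shift) & 0xF
        if nibble:
            d = (d & ~(0xF << shift)) | (nibble << shift)
    return d & 0xFFFFFFFF


if __name__ == "__main__":
    background = 0x11223344
    sprite = 0x00AB0000          # one opaque pixel ($AB), three transparent ($00)
    print(f"${muxnibs(background, sprite):08X}")
    # A colour with a zero nibble leaks background through:
    print(f"${muxnibs(0x37, 0x10):02X}")
```

The second call shows why colours containing a zero nibble are a problem: only half of the background byte gets replaced.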
For a monochrome pointer you might investigate the use of SKIPF with a single long per scanline. This works LSB first, so the order of instructions needs to match the order of pixels in the long; flip the sequence if I've got it backwards. More code space than the coloured version.
Combining SKIPF with MUXNIBS for the coloured version could give some further savings, and save the buffer overrun also.
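A quick Python model of the monochrome SKIPF idea, just to pin down the bit order. It assumes one "write pixel" instruction per pixel and that a 1 bit in the pattern skips the matching instruction, LSB first. The function name, the line buffer and the single pointer colour are hypothetical.

```python
def draw_mono_row(line, x, skip_pattern, colour):
    """Model of 32 unrolled pixel writes gated by a SKIPF-style pattern.

    Bit i of skip_pattern (LSB first) set   -> instruction i is skipped (transparent).
    Bit i of skip_pattern clear             -> pixel x+i is written with colour.
    """
    for i in range(32):
        if not (skip_pattern >> i) & 1:
            line[x + i] = colour


if __name__ == "__main__":
    line = bytearray(40)
    draw_mono_row(line, 4, 0xFFFFFFFC, 0x55)   # only the first two pixels are opaque
    print(line[:8].hex())
```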
The mousedatabuf in COGRAM gets loaded using the ptr register which is computed for the current scanline and base address of all the mouse's scanlines in the hub RAM. I think it is okay. If you meant in the unrolled loop portion, I already assumed the code would just be modified for each copy to account for the different offset in the next long registers being accessed/modified.
Yeah I also thought about SKIPF if a monochrome mouse is desired instead. Trick there might be dealing with rotated shifted pixels, you sort of need 36 instead of 32 pixels, so the mask doesn't quite fit. My unrolled loop with colour does one pixel per instruction.
Will have to look into your muxnibs to see how that might help. I'm not yet familiar with all these new P2 instructions.
Update: I think muxnibs needs nibble oriented pixel data, this won't be suitable for the 256 colour modes with 8bpp, but could be for the 16 colour graphics mode.
This is all great, we have a workable mouse sprite baseline and if other people here can contribute even faster ways to do this we should check them all out. Show us your code!
So, if you read my previous post more carefully, you'll see that I state the conditions necessary for it to work with 8bpp colour: No zero nibbles in the selected mouse pointer colours, so colours $01-$10, $20, $30, $40, $50, $60, $70, $80, $90, $A0, $B0, $C0, $D0, $E0, and $F0 wouldn't work. This restriction doesn't seem particularly onerous to me.
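Counting it out in Python, under the stated rule that a pointer colour must have both nibbles non-zero (with $00 reserved as transparent):

```python
def usable_pointer_colours():
    """8bpp colour values whose high and low nibbles are both non-zero."""
    return [c for c in range(256) if (c >> 4) and (c & 0xF)]


if __name__ == "__main__":
    colours = usable_pointer_colours()
    print(len(colours), "usable colours out of 256")
    excluded = [c for c in range(1, 256) if c not in colours]
    print(" ".join(f"${c:02X}" for c in excluded))
```

So 15 x 15 palette entries remain available to the pointer, and the excluded non-zero ones are exactly the $0x and $x0 values listed above.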
Also, while I'm prepared to be corrected, I believe the MUXNIBS instruction can work directly on LUTRAM by specifying an address between $00200 and $003FF by preceding it with an AUGD, so no need for a second load or copying it back. I don't have a means to test, but I think the direct LUTRAM access approach will be quicker and smaller.
At the moment most of this is all just design discussions between TonyB_, myself and a few others interested in this to see if we have a chance to get it to work on the rev A P2 in the timing budget available. I do think there is hope for it to actually work including a graphics mode with 256 colours. I'd expect perhaps yourself or ozpropdev would be most likely to get something running quickly based on some of this stuff with your own ideas and P2 knowledge and your equipment, before I can, as I'm maybe a few weeks away. It doesn't matter who gets this going first but I like the idea that we can have something usable for others to play with at some point early in the release. A text based driver using HDMI/DVI will be pretty handy for console use and debug etc. Of course there is always standard analog VGA, but trying out HDMI/DVI with P2 should be fun, given the P2 was never really intended to do it, plus it's a good way to get your hands dirty and learn some new P2 stuff too. That's why I'm so interested in it, plus I did a custom Verilog thing recently with P1V that outputs over HDMI and a couple of my own sprite drivers for P1, so I now know a little bit about what types of features can come in handy, at least for me, and hopefully others.
I didn't realise that AUGD could also be used to identify LUTRAM, I was under the impression you had to use RDLUT/WRLUT only to access that RAM, but I am still learning as I go. Any further cycle penalty I wonder given it normally takes 3 cycles to access LUTRAM compared to two for normal COG RAM? Perhaps that can be utilised in other places in our code to reduce cycles, will have to keep it in mind for the future if it works the way you say. All these cool new tricks now with P2 to learn about. I think TonyB_ also mentioned you can execute from LUTRAM, that sounds interesting too.
Update: Applying the muxnibs stuff AJL identified to my code above, with no other changes to how original pixel buffer memory gets read twice from hub and no use of potential AUGD etc (until we know that is real), but including both the optimizations I identified earlier in my original code and a new one, I seem to find we can shave off up to 49 cycles from the mouse sprite code. This brings the total from the original 139 clocks down to 90 clocks. It also saves many more COG longs as well in the code (now down to 55 best case of which 19 are reusable registers). Pretty major improvement actually. Only actual limitation as such is we need to avoid certain colours with zero nibbles in the mouse sprite (256 total colours is not possible) and reserve 0 for the mouse's "transparent" colour. This avoids calculating additional shifted masks too in the vertical blanking code.
I also again looked at reading the LUTRAM directly to get the 9 long pixel buffer before masking, but it takes 3*9 = 27 clocks to get it back out of LUT to COGRAM in the best case with self modifying LUT addresses updated at blanking time, while my original approach of just burst loading 9 longs assuming worst case hub transfer latency and the prior address setup computation appears to just take 24 clock cycles and may be even faster if the hub window aligns. If AUGD works with LUTRAM however this result could potentially be changed in its favour.
Modified code attached:
It cannot be used with AUGD.
Both cog ram and lut ram are dual ported, but the second port on the lut has some restrictions as this is also used for the streamer etc. Not sure of all the ramifications.
AUGD and AUGS don't work to change registers from cog to lut though, because the lut's second port is not in parallel with the cog's second port. Again, not sure of all ramifications.
EDIT:
Clipping might be tricky.
EDIT:
I accept the point about affecting the frame buffer.
Anyway I thought of another method that might be a reasonable compromise, though it will increase the cycle count again by 27 clocks.
instead of this optimized code for each mouse long
it would do this for each long, and we go back to reading in a full 18 longs of mouse data again including its 9 longs of mask
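The listing isn't reproduced here, so the following is only the generic mask-and-merge idea in Python, not the actual instruction sequence. It assumes each sprite long comes with a mask long that has $FF in every opaque byte; the names `blend_long` and `blend_scanline` are made up.

```python
def blend_long(background, sprite, mask):
    """Keep background where mask is 0, take sprite where mask is 1."""
    return ((background & ~mask) | (sprite & mask)) & 0xFFFFFFFF


def blend_scanline(background_longs, sprite_longs, mask_longs):
    """Apply blend_long across the 9 sprite longs and their 9 mask longs."""
    return [blend_long(b, s, m)
            for b, s, m in zip(background_longs, sprite_longs, mask_longs)]


if __name__ == "__main__":
    print(f"${blend_long(0x11223344, 0x00AB0000, 0x00FF0000):08X}")
    # With an explicit mask, a colour such as $10 is no longer a problem:
    print(f"${blend_long(0x37, 0x10, 0xFF):02X}")
```

The trade-off is the one mentioned above: the colour restriction goes away, but the mask longs have to be stored and read as well.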
Did we not make an option for WRBYTE and friends to inhibit the writes of value 00? That should get rid of the masking steps for a full color cursor.
Edit: I see Tony mentioned it.
- read mouse image data from the scanline we are on (again 9 longs) with SETQ + RDLONG
- write back 9 longs from LUTRAM around the cursor X pos to a small hub scratch buffer with SETQ2 + WRLONG
- use WMLONG to write this mouse to the scratch buffer in hub RAM which will apply the mask for us
- use another RDLONG to read back new data from scratch into LUT RAM on top of original data read before mouse applied
This will only take 4 small additional bursts reads/writes of 9 longs and just require 0 as the transparent colour. Far fewer COG instructions now for the mouse.
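Here is the same sequence as a Python model, with hub RAM as a bytearray and the LUT line buffer as a list of longs. It assumes WMLONG writes only the non-zero bytes of the long, and that the sprite longs have already been burst-read from hub. All the names (`wmlong`, `overlay_mouse`, `scratch`) are placeholders, not driver symbols.

```python
def wmlong(hub, addr, value):
    """Model of WMLONG: write only the non-zero bytes of value to hub[addr..addr+3]."""
    for i in range(4):
        byte = (value >> (8 * i)) & 0xFF
        if byte:
            hub[addr + i] = byte


def overlay_mouse(lut, first_long, sprite_longs, hub, scratch):
    """Overlay pre-shifted sprite longs onto the scanline held in lut."""
    n = len(sprite_longs)
    # 1. Write the background longs around the cursor out to the hub scratch buffer.
    for i in range(n):
        hub[scratch + 4 * i: scratch + 4 * i + 4] = lut[first_long + i].to_bytes(4, "little")
    # 2. Masked write of the sprite on top; zero (transparent) bytes are not written.
    for i, sprite in enumerate(sprite_longs):
        wmlong(hub, scratch + 4 * i, sprite)
    # 3. Read the merged longs back over the original scanline data.
    for i in range(n):
        lut[first_long + i] = int.from_bytes(hub[scratch + 4 * i: scratch + 4 * i + 4], "little")


if __name__ == "__main__":
    lut = [0x11111111] * 12
    hub = bytearray(64)
    overlay_mouse(lut, 2, [0x00AA0000, 0x00000000, 0x000000BB], hub, scratch=16)
    print(" ".join(f"${v:08X}" for v in lut[1:6]))
```

In this model any non-zero colour works for the pointer, since the masking is per byte rather than per nibble.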
Now just 90 clocks worst case and 19 persistent COG RAM addresses, plus 10 temporary COG registers that can be reclaimed, or even just 1 temporary COG register if LUTRAM gets used to hold the mouse scanline data. In fact it's probably fewer clocks given we are doing back to back hub bursts, perhaps 2 egg beater hub cycles per each burst access of 9 longs = 16 clocks each burst once the first transfer is done, so maybe no more than 84 cycles in the real world, perhaps even less with the last aligned burst transfer completing sooner or if the first burst is already hub aligned from prior work?
This is a very good way to go. Any edge clipping can be handled with the LUT RAM being wider than the scanline, and gross min/max mouse X position enforcement done during vertical blanking. I think this is the best one so far...
rogloh, here's my idea from earlier today:
During 2-3 cog B shifts the sprite data according to sprite_x[1:0] and there is just enough time do this. Therefore only need one lot of sprite data.
My latest sample mouse code doesn't need to do shifting during active scanlines as I was just planning to shift the sprite data once during vertical blanking - so there were only going to be 2 copies of the mouse data, the reference image, and the shifted image (per scanline). Maybe you've found a sufficiently fast way to do it on the fly?
Please check this code:
A look-up table could generate the mask more quickly but I've opted to save longs here. I think sprite (x,y) should be in a single long and read at the start of every line, so that the sprite could appear on screen multiple times.
EDIT:
Sprite data shifting is not required when using WMLONG - see later posts.
It's a great use of spare time to make COG B pull its weight a little more while COG-A is doing some other work for it. Good job again.
This 4 copy idea is worth considering in the future if you want lots more sprites, but it would obviously burn hub RAM much faster. Alternatively you just use burst byte transfers instead of longs and live with the fact that it is 4x slower to transfer all the data, however maybe with this extra shifting overhead the byte transfers really start to make sense if you want to conserve hub RAM.
Update: I trust WMLONG can be used to just do single masked byte transfers too on byte boundaries else it may not be quite so fast if restricted to longs, as that may still require some pre shifting per byte. I think the P2 long alignment is different now anyway to make this still possible.
Assume writing long AABBCCDD to hub address 0x1
does the hub memory then look like this or is another byte of the long written into address 0x4 with a second bus transfer, or something else altogether? I know the P1 would just do this but I think P2 is different now, right? Documentation just says there are no special alignment rules for words and longs in hub RAM, but I'm wondering what this means for the data results.
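For reference, this is what the first of those outcomes would look like, modelled in Python. It assumes hub RAM is byte-addressed and little-endian and that an unaligned long write simply lands on four consecutive byte addresses, which is how I read "no special alignment rules". That is an assumption here and has not been checked on hardware in this thread.

```python
def write_long(hub, addr, value):
    """Model of an unaligned little-endian long write into byte-addressed hub RAM."""
    hub[addr:addr + 4] = value.to_bytes(4, "little")


if __name__ == "__main__":
    hub = bytearray(8)
    write_long(hub, 1, 0xAABBCCDD)
    for address, byte in enumerate(hub):
        print(f"{address:#x}: ${byte:02X}")
```

Under that assumption the low byte $DD lands at address 0x1 and the high byte $AA at address 0x4, with the bytes at 0x0 and 0x5 onwards left untouched.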