HDMI added to Prop2 - Page 19 — Parallax Forums

HDMI added to Prop2

Comments

  • rogloh Posts: 5,122
    edited 2018-11-09 20:55
    TonyB_, I'd say the 8 unrolls buys a bigger budget but still seems a fair bit riskier with COGRAM use at this point. Four unrolls will still thankfully probably work unless we discover some as yet unknown limit there. So we sort of need some sample mouse sprite code worked through to get a feel for how much of the remaining cycle budget and COG RAM space will be consumed by it in COG-A and whether 8 unrolls is going to be realistic / required. There also still needs to be more COG space left to support all the per scan line & frame housekeeping and extra code to read back in any updated palette, mouse stuff and new mode for the next frame, and whatever other vertical blanking code needs to get run if some of its COG space has been temporarily reallocated to the 256 entry table and larger unrolled loop code in this 256 colour mode for both COGs.

    IMHO, once a single mouse cursor sprite and any other required housekeeping code is added, I wouldn't be too worried if there is not much of the cycle budget left for extra functions or further sprites in this colour mode, though others may disagree if they have special graphics features/effects in mind that might still get squeezed in. Enabling 720 pixel wide lines might be useful too, for wider HDTV compatibility. By the way, the budget is already bigger there because that width uses a 27MHz pixel clock, so perhaps it should just be coded as a slightly different driver with that fixed resolution and clock budget in mind.

    It seems likely you would tack on more COGs in front of this graphics display pipeline to feed into the COGA/COGB pair if you ever wanted to fully support multiple sprites/tiles with full scrolling etc, certainly with this very demanding 256 colour mode. Basically I think we would just need something to be able to display an array of scanlines in memory holding a byte per pixel, up to 480 lines for a full frame buffer, or alternatively a much smaller set of scanlines that are dynamically constructed by other COGs on the fly, plus have the ability to display our own single transparent mouse sprite at some particular screen co-ordinates for some GUI support. All of this would be contained within outer code that can dynamically select the other modes on a per frame basis to enable 16 colour graphics and colour text with dynamic fonts and a programmable text cursor. That combination of features in a bitbang driver is already going to be pretty versatile and I would hope should be sufficient until rev B silicon comes along with better support for HDMI. So I wouldn't see the need for too much more to come in this driver. Plug-in HDMI audio support etc could/would be done by other COGs that can interoperate and synchronize with this driver.
  • rogloh, 720 pixels is a bleak prospect with four unrolls as TMDS encoding time is 8280 cycles and the total is 8580. We could use the lower half of the LUT RAM for code probably, to replace the first 256 longs of cog RAM overwritten by the palette. I've started to think about the cursor. One possibility is a couple of 32x32 single colour+transparent sprites at the same x,y coordinates. It takes twice as many instructions to create a byte mask from a bit pattern, compared to a nibble mask (four, not two). A single 256 colour sprite might be easier.
  • rogloh Posts: 5,122
    edited 2018-11-10 02:50
    Yeah that budget for doing 720 pixels does seem bleak, though that width is probably not the prime focus right now and can hopefully be re-evaluated later with future optimizations if possible.

    I thought of a few things that might be possible with the mouse sprite if eventually required for space/time reasons:
    -we could just do 16x16 instead of 32x32 to save both its storage space and HUB/LUT read/write instructions
    -do colour only instead of monochrome if that saves cycles (mono can be emulated with the colour version). I have to check which type of mouse actually takes more processing work; not sure yet, but it seems like colour may be faster.
    -compute the coloured mouse sprite's different byte rotations within its storage words for all its scanlines in advance at blanking time for the mouse's current X co-ordinate, storing this back into HUB RAM, and just select one of these scanlines for burst reading using the mouse's Y co-ordinates compared against the current scanline to see if it falls within the right range to be displayed.

    I added some baseline code below to get a feel for how many cycles this mouse sprite may take. I have implemented a colour 32x32 mouse and assumed some further work is done during blanking to compute addresses etc. This code takes 139 clock cycles to do the 32x32 mouse and I think probably 94 clock cycles for 16x16. It might be possible to shave it down slightly with more optimizations. Untested, so there may be bugs/omissions here.

    Update: found an optimization to shave 4 clock cycles and 2 COG registers off the numbers above by replacing the first two instructions with the pointer computation below them, i.e. just do this:

    mov ptr, y
    sub ptr, mouseYstart wc

    instead of the first two instructions below.

    Update2: found another one-instruction saving. If you really need another 2 clocks, you could replicate the if_nc condition all the way through the code after the mouseYend comparison; then you can omit the line:

    if_c jmp #tmds
            ' Sample mouse sprite code supporting a 32x32 transparent mouse pixel image in 256 colour mode for bitbang HDMI driver
    
            ' total execution time = 139 clock cycles with unrolled loop, a smaller sized rolled up version takes 211 clock cycles!
            ' If a smaller 16x16 mouse was used it instead would save 9+4+32 = 45 clock cycles, making only 94 clock cycles for the mouse.
            ' COGRAM use is 85 longs for the unrolled version, and some 28 of these COG RAM registers can also be reused later for other purposes.  
            ' Rolled version takes 61 COG longs.
    
            cmp     y, mouseYstart wc  ' I assume y is already tracking the scanline number in the frame
      if_nc cmp     mouseYend, y wc
      if_c  jmp     #tmds  ' don't execute mouse code if it's not on this scanline
    
            ' compute mouse image/mask start address in hub RAM for this scanline
            mov     ptr, y 
            sub     ptr, mouseYstart
            mul     ptr, #18*4
            add     ptr, mousebaseaddr
            
            ' read mouse image and mask for this scanline, data is already pre-rotated for current mouse X position
            ' read 18 longs = holding 36 bytes of pixels followed by 36 bytes of masks
            setq    #18-1 
            rdlong  mousedatabuf, ptr ' assume 27 clock cycle execution for burst transfer
    
            ' compute original pixel buffer address at the current mouse position on screen
            mov     ptr, mouseXoffset
            add     ptr, scanlinebufaddr
    
            ' read 9 longs = 36 pixels from pixel buffer of scanline around the mouse write starting position
            setq    #9-1 
            rdlong  origpixelbuf, ptr  ' assume 18 clock cycles worst case for burst transfer
    
            ' prepare LUT RAM write address
            mov     ptr, lutbufptr
    
            ' repeat these next 4 instructions 9 times with adjusted register offsets each iteration, 
            ' or expand loop to use selfmodifying code and use rep - but that is much slower as it 
            ' requires 3 more register address update instructions inside the loop
            ' unrolled loop version takes (4*9 - 1)*2 = 70 clock cycles and consumes 35 registers, 
            ' while rep version takes (7*9 + 2 + ~6)*2 ~= 142 cycles and need about 11 COG RAM registers 
            ' including some extras for the first initialization - so I'd choose an unrolled version
            
            and     origpixelbuf, mousemaskbuf ' and with mouse mask at long offset 9 from mouse pixel data
            or      origpixelbuf, mousedatabuf  ' modify pixels
            wrlut   origpixelbuf, ptr  ' write to LUT
            add     ptr,#1
            ' .... repeat above 9 times increasing the static buffer addresses as required by 1 each iteration
    
    tmds
            ' rest of TMDS code follows after mouse sprite
    
    
    ' mouse storage follows
    
    scanlinebufaddr long 0  ' holds start address of scanline pixel buffer in hub RAM
    mousebaseaddr   long 0  ' holds start address of mouse image in hub RAM
    mouseYstart     long 0  ' holds first screen scanline of mouse
    mouseYend       long 0  ' holds last screen scanline of mouse
    mouseXoffset    long 0  ' holds byte offset of mouse on scanline (but long aligned)
    lutbufptr       long 0  ' holds lut RAM mouse start address
    y               long 0  ' holds current scanline
    
    ' working regs
    ptr             long 0  
    mousedatabuf    long 0[9]  ' this space could be shared with other working registers only used once at start of line 
    mousemaskbuf    long 0[9]
    origpixelbuf    long 0[9]
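
The cycle estimates in the comments above can be re-derived with a few lines of arithmetic. The following is only a sketch of that bookkeeping in Python, not driver code; it assumes 2 clocks per instruction as the post does, and the function names and the `setup_instrs`/`init_instrs` parameters are made up for illustration (the latter is the post's "~6").

```python
# Re-derive the loop cycle estimates quoted in the post.
# Assumption (from the post): each instruction costs 2 clocks.
CLOCKS_PER_INSTR = 2
LONGS = 9  # 36 pixels = 9 longs per mouse scanline

def unrolled_clocks(longs=LONGS):
    # 4 instructions per long (and, or, wrlut, add); the final add is not needed
    return (4 * longs - 1) * CLOCKS_PER_INSTR

def rep_clocks(longs=LONGS, setup_instrs=2, init_instrs=6):
    # 4 + 3 address-update instructions per iteration, plus setup overhead
    return (7 * longs + setup_instrs + init_instrs) * CLOCKS_PER_INSTR

print(unrolled_clocks())         # 70
print(rep_clocks())              # 142
print(139 - (9 + 4 + 32))        # 94, the 16x16 estimate from the post
```

On these numbers the unrolled loop costs roughly half the clocks of the REP version, at the price of about 24 more COG longs (35 versus about 11).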
    
    
  • Why is the background pixel pattern loaded separately? This seems to be doing double duty as it needs to be loaded anyway for conversion.

    Your loop also fails to update the mousedatabuf pointer, so another instruction would need to be added.

    If you ensure that the mouse pointer transparent colour is 0, and for 256 colour mode all other colours used have non-zero nibbles in all positions, you could use MUXNIBS to perform the masking and update in one operation, also saving the need for separate mask longs.

    So you would simply step through the 9 longs for the pre shifted and clipped mouse pointer image with something like:
      muxnibs ptr, mousedatabuf
      muxnibs ptr+1, mousedatabuf+1
      muxnibs ptr+2, mousedatabuf+2
      muxnibs ptr+3, mousedatabuf+3
      muxnibs ptr+4, mousedatabuf+4
      muxnibs ptr+5, mousedatabuf+5
      muxnibs ptr+6, mousedatabuf+6
      muxnibs ptr+7, mousedatabuf+7
      muxnibs ptr+8, mousedatabuf+8
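
To make the nibble restriction concrete, here is a small Python model of what MUXNIBS does to one long (each non-zero nibble of the source replaces the matching nibble of the destination). It is a sketch for reasoning about colour choices only; the function name and the sample values are illustrative assumptions.

```python
def muxnibs(d, s):
    """Model of MUXNIBS D,S: each non-zero nibble of S replaces that nibble of D."""
    out = d
    for shift in range(0, 32, 4):
        nib = (s >> shift) & 0xF
        if nib:
            out = (out & ~(0xF << shift)) | (nib << shift)
    return out & 0xFFFFFFFF

background = 0x11223344
ok_sprite = 0x00AB00CD    # $AB and $CD have no zero nibble; $00 is transparent
bad_sprite = 0x00000010   # $10 has a zero low nibble

print(hex(muxnibs(background, ok_sprite)))   # 0x11ab33cd
print(hex(muxnibs(background, bad_sprite)))  # 0x11223314
```

The second result shows the failure mode: the sprite colour $10 comes out as $14 because the background's low nibble leaks through.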
    

    For a monochrome pointer you might investigate the use of SKIPF with a single long per scanline. This works LSB first, so the order of instructions needs to match the order of pixels in the long; flip the sequence if I've got it backwards. More code space than the coloured version.
      skipf mousedata
      setbyte ptr+4, mousecolour, 3
      setbyte ptr+4, mousecolour, 2
      setbyte ptr+4, mousecolour, 1
      setbyte ptr+4, mousecolour, 0
      setbyte ptr+3, mousecolour, 3
      setbyte ptr+3, mousecolour, 2
      setbyte ptr+3, mousecolour, 1
      setbyte ptr+3, mousecolour, 0
      setbyte ptr+2, mousecolour, 3
      setbyte ptr+2, mousecolour, 2
      setbyte ptr+2, mousecolour, 1
      setbyte ptr+2, mousecolour, 0
      setbyte ptr+1, mousecolour, 3
      setbyte ptr+1, mousecolour, 2
      setbyte ptr+1, mousecolour, 1
      setbyte ptr+1, mousecolour, 0
      setbyte ptr, mousecolour, 3
      setbyte ptr, mousecolour, 2
      setbyte ptr, mousecolour, 1
      setbyte ptr, mousecolour, 0
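
The SKIPF idea can be checked on paper with a small Python model. This is a sketch under stated assumptions: bit i of the skip pattern controls instruction i (LSB first, 1 = skip), the instruction order matches the listing above, and `draw_mono_line` and `ORDER` are names invented for the example.

```python
def setbyte(value, colour, n):
    """Model of SETBYTE D,S,#N: replace byte N of a 32-bit value."""
    shift = 8 * n
    return (value & ~(0xFF << shift) & 0xFFFFFFFF) | ((colour & 0xFF) << shift)

# Instruction order from the listing: long 4 down to 0, byte 3 down to 0.
ORDER = [(idx, n) for idx in range(4, -1, -1) for n in range(3, -1, -1)]

def draw_mono_line(longs, skip_bits, colour, order=ORDER):
    out = list(longs)
    for i, (idx, n) in enumerate(order):
        if (skip_bits >> i) & 1:
            continue  # a 1 bit skips this SETBYTE, leaving the pixel untouched
        out[idx] = setbyte(out[idx], colour, n)
    return out

# Only the first and the last of the 20 instructions execute here.
line = draw_mono_line([0] * 5, 0x7FFFE, 0xFE)
print([hex(v) for v in line])
# ['0xfe', '0x0', '0x0', '0x0', '0xfe000000']
```

Note that a 1 bit means "transparent" in this model, so a mouse bitmap stored with 1 = opaque would need inverting before use.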
    

    Combining SKIPF with MUXNIBS for the coloured version could give some further savings, and save the buffer overrun also.
  • rogloh Posts: 5,122
    edited 2018-11-10 03:26
    AJL, I loaded the background pattern again, because the first load goes into LUT and it's slow to read it back out of there for the number of longs needed - I think it's actually faster to go read a second copy from hub which I admit is unintuitive.

    The mousedatabuf in COGRAM gets loaded using the ptr register which is computed for the current scanline and base address of all the mouse's scanlines in the hub RAM. I think it is okay. If you meant in the unrolled loop portion, I already assumed the code would just be modified for each copy to account for the different offset in the next long registers being accessed/modified.

    Yeah I also thought about SKIPF if a monochrome mouse is desired instead. Trick there might be dealing with rotated shifted pixels, you sort of need 36 instead of 32 pixels, so the mask doesn't quite fit. My unrolled loop with colour does one pixel per instruction.

    Will have to look into your muxnibs to see how that might help. I'm not yet familiar with all these new P2 instructions.
    Update: I think muxnibs needs nibble oriented pixel data, this won't be suitable for the 256 colour modes with 8bpp, but could be for the 16 colour graphics mode.

    This is all great, we have a workable mouse sprite baseline and if other people here can contribute even faster ways to do this we should check them all out. Show us your code! :smile:
  • cgracey Posts: 14,133
    Rogloh, are you actually outputting HDMI, yet?
  • rogloh wrote: »
    AJL, I loaded the background pattern again, because the first load goes into LUT and it's slow to read it back out of there for the number of longs needed - I think it's actually faster to go read a second copy from hub which I admit is unintuitive.

    The mousedatabuf in COGRAM gets loaded using the ptr register which is computed for the current scanline and base address of all the mouse's scanlines in the hub RAM. I think it is okay. If you meant in the unrolled loop portion, I already assumed the code would just be modified for each copy to account for the different offset in the next long registers being accessed/modified.

    Yeah I also thought about SKIPF if a monochrome mouse is desired instead. Trick there might be dealing with rotated shifted pixels, you sort of need 36 instead of 32 pixels, so the mask doesn't quite fit. My unrolled loop with colour does one pixel per instruction.

    Will have to look into your muxnibs to see how that might help. I'm not yet familiar with all these new P2 instructions.
    Update: I think muxnibs needs nibble oriented pixel data, this won't be suitable for the 256 colour modes with 8bpp, but could be for the 16 colour graphics mode.

    This is all great, we have a workable mouse sprite baseline and if other people here can contribute even faster ways to do this we should check them all out. Show us your code! :smile:

    So, if you read my previous post more carefully, you'll see that I state the conditions necessary for it to work with 8bpp colour: no zero nibbles in the selected mouse pointer colours, so colours $01-$10, $20, $30, $40, $50, $60, $70, $80, $90, $A0, $B0, $C0, $D0, $E0, and $F0 wouldn't work. This restriction doesn't seem particularly onerous to me.
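
As a quick check of how much of the palette survives the MUXNIBS restriction, the colours can simply be enumerated. This is a Python sketch; the helper name is made up, and it counts $00 among the excluded values because that index is reserved as transparent.

```python
def usable_with_muxnibs(colour):
    """True if both nibbles of an 8-bit colour index are non-zero."""
    return (colour & 0x0F) != 0 and (colour & 0xF0) != 0

usable = [c for c in range(256) if usable_with_muxnibs(c)]
excluded = [c for c in range(256) if not usable_with_muxnibs(c)]

print(len(usable), len(excluded))          # 225 31
print([f"${c:02X}" for c in excluded[:4]])  # ['$00', '$01', '$02', '$03']
```

So 225 of the 256 palette entries remain available for the pointer: the 30 colours listed above are lost, plus $00 for transparency.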

    Also, while I'm prepared to be corrected, I believe the MUXNIBS instruction can work directly on LUTRAM by specifying an address between $00200 and $003FF by preceding it with an AUGD, so no need for a second load or copying it back. I don't have a means to test, but I think the direct LUTRAM access approach will be quicker and smaller.
  • Not yet, Chip. Once I get a new P2D2 board and my own PCB design is finished, back, and fitted with an HDMI connector, with a test setup etc., hopefully in a few more weeks, I will certainly be hoping to try it out. :smile:

    At the moment most of this is just design discussion between TonyB_, myself and a few others interested in this to see if we have a chance to get it to work on the rev A P2 in the timing budget available. I do think there is hope for it to actually work including a graphics mode with 256 colours. I'd expect perhaps yourself or ozpropdev would be most likely to get something running quickly based on some of this stuff with your own ideas and P2 knowledge and your equipment, before I can, I'm maybe a few weeks away. It doesn't matter who gets this going first but I like the idea we can have something usable for others to play with at some point early in the release. A text based driver using HDMI/DVI will be pretty handy for console use and debug etc. Of course there is always standard analog VGA, but trying out HDMI/DVI with P2 should be fun, given the P2 was never really intended to do it, plus it's a good way to get your hands dirty and learn some new P2 stuff too. That's why I'm so interested in it, plus I did a custom Verilog thing recently with P1V that outputs over HDMI and a couple of my own sprite drivers for P1 so I now know a little bit about what types of features can come in handy, at least for me, and hopefully others.
  • rogloh Posts: 5,122
    edited 2018-11-10 12:49
    AJL, yes if the restriction is put in place with non-zero nibbles and a subset of the colour range is supported for mouse colours it would appear that muxnibs is a possibility too. Whether that subset of colours concept fits in with the colour range people want I guess is TBD. For implementing monochrome pointers it would be fine with only two colours needed anyway. You could even fix them to 0xFF and 0xFE or something. The zero colour index in the palette is probably a good choice to always reserve as the transparent colour for sprite related things too.

    I didn't realise that AUGD could also be used to identify LUTRAM, I was under the impression you had to use RDLUT/WRLUT only to access that RAM, but I am still learning as I go. Any further cycle penalty I wonder given it normally takes 3 cycles to access LUTRAM compared to two for normal COG RAM? Perhaps that can be utilised in other places in our code to reduce cycles, will have to keep it in mind for the future if it works the way you say. All these cool new tricks now with P2 to learn about. I think TonyB_ also mentioned you can execute from LUTRAM, that sounds interesting too.

    Update: Applying the muxnibs stuff AJL identified to my code above, with no other changes to how original pixel buffer memory gets read twice from hub and no use of potential AUGD etc (until we know that is real), but including both the optimizations I identified earlier in my original code and a new one, I seem to find we can shave off up to 49 cycles from the mouse sprite code. This brings the total from the original 139 clocks down to 90 clocks. It also saves many more COG longs as well in the code (now down to 55 best case of which 19 are reusable registers). Pretty major improvement actually. Only actual limitation as such is we need to avoid certain colours with zero nibbles in the mouse sprite (256 total colours is not possible) and reserve 0 for the mouse's "transparent" colour. This avoids calculating additional shifted masks too in the vertical blanking code.

    I also again looked at reading the LUTRAM directly to get the 9 long pixel buffer before masking, but it takes 3*9 = 27 clocks to get it back out of LUT to COGRAM in the best case with self modifying LUT addresses updated at blanking time, while my original approach of just burst loading 9 longs assuming worst case hub transfer latency and the prior address setup computation appears to just take 24 clock cycles and may be even faster if the hub window aligns. If AUGD works with LUTRAM however this result could potentially be changed in its favour.

    Modified code attached:
    
            ' Sample mouse sprite code supporting a 32x32 transparent mouse pixel image in 256 colour mode for bitbang HDMI
            ' total = 90 clock cycles, 55 COG RAM registers
    
            mov     ptr, y ' I assume y is already tracking the scanline number in the frame
            sub     ptr, mouseYstart wc
      if_nc cmp     mouseYend, y wc
    
            ' compute mouse image/mask start address in hub RAM for this scanline
            mul     ptr, #9*4
            add     ptr, mousebaseaddr
            
            ' read mouse image for this scanline (no separate mask with muxnibs), data is already pre-shifted for current mouse X position
            ' read 9 longs = holding 36 bytes of mouse pixels
            setq    #9-1 
            rdlong  mousedatabuf, ptr ' assume 18 clock cycle execution for burst transfer
    
            ' compute original pixel buffer address at the current mouse position on screen
            mov     ptr, mouseXoffset
            add     ptr, scanlinebufaddr
    
            ' read 9 longs = 36 pixels from pixel buffer of scanline around the mouse write starting position
            setq    #9-1 
            rdlong  origpixelbuf, ptr  ' assume 18 clock cycles worst case for burst transfer
    
            muxnibs origpixelbuf+0, mousedatabuf+0
            muxnibs origpixelbuf+1, mousedatabuf+1
            muxnibs origpixelbuf+2, mousedatabuf+2
            muxnibs origpixelbuf+3, mousedatabuf+3
            muxnibs origpixelbuf+4, mousedatabuf+4
            muxnibs origpixelbuf+5, mousedatabuf+5
            muxnibs origpixelbuf+6, mousedatabuf+6
            muxnibs origpixelbuf+7, mousedatabuf+7
            muxnibs origpixelbuf+8, mousedatabuf+8
    
    if_nc   wrlut   origpixelbuf+0, #0-0 ' patched LUT address value based on X coordinate of mouse
    if_nc   wrlut   origpixelbuf+1, #0-0
    if_nc   wrlut   origpixelbuf+2, #0-0
    if_nc   wrlut   origpixelbuf+3, #0-0
    if_nc   wrlut   origpixelbuf+4, #0-0
    if_nc   wrlut   origpixelbuf+5, #0-0
    if_nc   wrlut   origpixelbuf+6, #0-0
    if_nc   wrlut   origpixelbuf+7, #0-0
    if_nc   wrlut   origpixelbuf+8, #0-0
     
    tmds
            ' rest of TMDS code follows after mouse sprite
    
    
    ' mouse storage follows
    scanlinebufaddr long 0  ' holds start address of scanline pixel buffer in hub RAM
    mousebaseaddr   long 0  ' holds start address of mouse image in hub RAM
    mouseYstart     long 0  ' holds first screen scanline of mouse
    mouseYend       long 0  ' holds last screen scanline of mouse
    mouseXoffset    long 0  ' holds byte offset of mouse on scanline (but long aligned)
    lutbufptr       long 0  ' holds lut RAM mouse start address
    y               long 0  ' holds current scanline
    
    ' working regs
    ptr             long 0  
    mousedatabuf    long 0[9]  ' this space could be shared with other working registers as it is only needed once at the start of line for mouse
    origpixelbuf    long 0[9]
    
    
  • MUXNIBS can only operate on cogram.
    It cannot be used with AUGD.

  • rogloh Posts: 5,122
    edited 2018-11-10 13:01
    Fair enough - I thought it was too good to be true. Thanks ozpropdev. Looks like we have a new benchmark of 90 clocks at best right now, for a mouse sprite in 256 colour mode (with minor limitations). This is for 32x32 pixel image. For a 16x16 mouse you could save 24 more clocks and get to 66 clocks. This smaller mouse would save another 16 COG longs too.
  • Cluso99 Posts: 18,066
    You can execute from LUTRAM but the registers in the operands must be cog registers.

    Both cog ram and lut ram are dual ported, but the second port on the lut has some restrictions as this is also used for the streamer etc. Not sure of all the ramifications.

    AUGD and AUGS don't work to change registers from cog to lut though due to the lack of the second port not being in parallel with the cog second port. Again, not sure of all ramifications.
  • TonyB_ Posts: 2,108
    edited 2018-11-10 13:23
    We could use SETQ+WMLONG to write 255 colour + transparent sprites on top of scan line data in hub RAM. Read pixels that sprite might overwrite first, read sprite, write sprite, do TMDS, then write back original pixels.

    EDIT:
    Clipping might be tricky.
  • In our case with a mouse, I think it is best not to touch hub RAM. I know you could do the mouse that way, but it's better hidden inside the COG IMO. Other COGs might be reading/writing the frame buffer, doing graphics block transfers etc and they don't want to see any temporary mouse pointer data in the buffer at random times.
  • TonyB_ Posts: 2,108
    edited 2018-11-10 13:38
    I don't like the MUXNIBS limitation. Clipping won't be that tricky, simply adjust the sprite pointer and WMLONG block size for partial sprites. Also if hub RAM is shared with VGA output, sprite will appear there too. This is the intended application of WMLONG.

    EDIT:
    I accept the point about affecting the frame buffer.
  • I know what you mean about the colour limitation, it's a bit restrictive. It's possibly workable for a monochrome mouse only but not ideal and has palette implications.

    Anyway I thought of another method that might be a reasonable compromise, it will increase the cycle count again though by 27 clocks.

    instead of this optimized code for each mouse long
    muxnibs origpixelbuf, mousedatabuf
    wrlut origpixelbuf, #0-0
    

    it would do this for each long, and we go back to reading in a full 18 longs of mouse data again including its 9 longs of mask
    setq mousemaskbuf
    muxq origpixelbuf, mousedatabuf
    wrlut origpixelbuf,#0-0
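
For comparison with the MUXNIBS version, the SETQ + MUXQ pair can be modelled as a plain bitwise select, D = (D AND NOT Q) OR (S AND Q). The Python sketch below assumes the mask long holds $FF in every opaque sprite byte (note this is the opposite polarity to the AND mask in the first listing); the names and sample values are illustrative only.

```python
def muxq(d, s, q):
    """Model of SETQ q / MUXQ d,s: bits of s replace bits of d wherever q is 1."""
    return ((d & ~q) | (s & q)) & 0xFFFFFFFF

background = 0x11223344
sprite = 0x00100000   # colour $10 in byte 2, which MUXNIBS could not draw
mask = 0x00FF0000     # $FF marks the opaque sprite byte

print(hex(muxq(background, sprite, mask)))  # 0x11103344
```

With an explicit mask any of the 256 colours can be used in the sprite, including $00, which is what the extra 9 longs and 27 clocks buy.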
    
  • potatohead Posts: 10,253
    edited 2018-11-10 16:43
    For clipping, just make the working buffer bigger than the display. Doing that is by far the cheapest.

    Did we not make an option for WRBYTE and friends to inhibit the writes of value 00? That should get rid of the masking steps for a full color cursor.

    Edit: I see Tony mentioned it.





  • Shortest way to get from S to D preserving S? Z if S=0
    '  S, D
    '  0, $FF_FF_FF_FF or $00_00_00_00
    '  8, $FF_FF_FF_00 or $00_00_00_FF
    ' 16, $FF_FF_00_00 or $00_00_FF_FF
    ' 24, $FF_00_00_00 or $00_FF_FF_FF
    
  • rogloh Posts: 5,122
    edited 2018-11-10 23:14
    Ok I think we could avoid corrupting the full original frame buffer yet still use the method of WMLONG with bursts by doing this...

    - read mouse image data from the scanline we are on (again 9 longs) with SETQ + RDLONG
    - write back 9 longs from LUTRAM around the cursor X pos to a small hub scratch buffer with SETQ2 + WRLONG
    - use WMLONG to write this mouse to the scratch buffer in hub RAM which will apply the mask for us
    - use another RDLONG to read back new data from scratch into LUT RAM on top of original data read before mouse applied

    This will only take 4 small additional burst reads/writes of 9 longs and just require 0 as the transparent colour. Far fewer COG instructions now for the mouse.

    Now just 90 clocks worst case and 19 persistent COG RAM addresses, plus 10 temporary COG registers that can be reclaimed, or even just 1 temporary COG register if LUTRAM gets used to hold the mouse scanline data. In fact it's probably fewer clocks given we are doing back-to-back hub bursts: perhaps 2 egg beater hub cycles per burst access of 9 longs = 16 clocks per burst once the first transfer is done, so maybe no more than 84 cycles in the real world, perhaps even fewer if the last aligned burst transfer completes sooner or if the first burst is already hub aligned from prior work?

    This is a very good way to go. Any edge clipping can be handled with the LUT RAM being wider than the scanline, and gross min/max mouse X position enforcement done during vertical blanking. I think this is the best one so far...
    
            ' Sample mouse sprite code supporting a 32x32 transparent mouse pixel image in 256 colour mode for bitbang HDMI
            ' total = 90 clock cycles worst case, more like 84 perhaps 
    
            mov     ptr, y ' I assume y is already tracking the scanline number in the frame
            sub     ptr, mouseYstart wc
      if_nc cmp     mouseYend, y wc
    
            ' compute mouse image/mask start address in hub RAM for this scanline
            mul     ptr, #9*4
            add     ptr, mousebaseaddr
            
            ' read mouse image for this scanline, data is already pre-shifted for current mouse X position
            ' read 9 longs = holding 36 bytes of pixels
            setq    #9-1 
            rdlong  mousedatabuf, ptr ' assume 18 clock cycle execution for burst transfer
    
            ' write a temporary pixel buffer in hub for the mouse 
            setq2   #9-1
            wrlong  lutbufptr, hubscratchptr
    
            ' overlay the mouse data 
            setq    #9-1
            wmlong  mousedatabuf, hubscratchptr
    
            ' read data into LUT
            setq2   #9-1
      if_nc rdlong  lutbufptr, hubscratchptr
    
    tmds
            ' rest of TMDS code follows after mouse sprite
    
    
    ' mouse storage follows
    mousebaseaddr   long 0  ' holds start address of mouse image in hub RAM
    mouseYstart     long 0  ' holds first screen scanline of mouse
    mouseYend       long 0  ' holds last screen scanline of mouse
    lutbufptr       long 0  ' holds lut RAM mouse start address
    hubscratchptr   long 0  ' address of a scratch buffer in RAM for mouse overlay
    y               long 0  ' holds current scanline
    
    ' working regs
    ptr             long 0  
    mousedatabuf    long 0[9] ' this space could be shared with other working registers as it is only needed once at the start of line for mouse
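
The data flow of the scratch-buffer approach above can be modelled in Python to check that it leaves only the opaque mouse bytes on top of the original pixels. This is a sketch, not a timing model: hub RAM is a `bytearray`, WMLONG is modelled as "write every byte except $00", and all function names plus the scratch address 16 are assumptions made for the example.

```python
def wrlong_burst(hub, addr, longs):
    """Model of a SETQ/SETQ2 + WRLONG burst: write longs little-endian."""
    for i, value in enumerate(longs):
        hub[addr + 4 * i: addr + 4 * i + 4] = value.to_bytes(4, "little")

def wmlong_burst(hub, addr, longs):
    """Model of a SETQ + WMLONG burst: like WRLONG but $00 bytes are not written."""
    for i, value in enumerate(longs):
        for b, byte in enumerate(value.to_bytes(4, "little")):
            if byte:
                hub[addr + 4 * i + b] = byte

def rdlong_burst(hub, addr, count):
    """Model of a SETQ/SETQ2 + RDLONG burst: read count longs."""
    return [int.from_bytes(hub[addr + 4 * i: addr + 4 * i + 4], "little")
            for i in range(count)]

def overlay_mouse(lut_pixels, mouse_longs, hub, scratch):
    wrlong_burst(hub, scratch, lut_pixels)    # LUT pixels -> hub scratch
    wmlong_burst(hub, scratch, mouse_longs)   # mouse on top, $00 = transparent
    return rdlong_burst(hub, scratch, len(lut_pixels))  # scratch -> LUT

hub = bytearray(64)
pixels = [0x11223344, 0x55667788]
mouse = [0x00AB0010, 0x00000000]
print([hex(v) for v in overlay_mouse(pixels, mouse, hub, 16)])
# ['0x11ab3310', '0x55667788']
```

Unlike the MUXNIBS version, a colour such as $10 survives intact here; only $00 is unavailable to the sprite.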
    
    


  • TonyB_ Posts: 2,108
    edited 2018-11-11 00:10
    rogloh wrote: »
    Ok I think we could avoid corrupting the full original frame buffer yet still use the method of WMLONG with bursts by doing this...

    - read mouse image data from the scanline we are on (again 9 longs) with SETQ + RDLONG
    - write back 9 longs from LUTRAM around the cursor X pos to a small hub scratch buffer with SETQ2 + WRLONG
    - use WMLONG to write this mouse to the scratch buffer in hub RAM which will apply the mask for us
    - use another RDLONG to read back new data from scratch into LUT RAM on top of original data read before mouse applied

    This will only take 4 small additional bursts reads/writes of 9 longs and just require 0 as the transparent colour. Far fewer COG instructions now for the mouse.

    rogloh, here's my idea from earlier today:
    Cog A                                     Instr          Longs   From   To
    
    1. Read sprite line to sprite buffer      SETQ2+RDLONG       8   HUB    LUT
    2. Read scan line to line buffer          SETQ2+RDLONG     160   HUB    LUT
    3. Write pixels to temp buffer            SETQ2+WRLONG       9   LUT    HUB
    4. Write sprite to temp buffer            SETQ2+WMLONG       9   LUT    HUB
    5. Read sprite+pixels from temp buffer    SETQ2+RDLONG     <=9   HUB    LUT
    

    During 2-3 cog B shifts the sprite data according to sprite_x[1:0] and there is just enough time to do this. Therefore only one lot of sprite data is needed.
  • TonyB_, sounds interesting with only one copy of the sprite needed, can you show the code for your sprite shifting so we see how fast it is?

    My latest sample mouse code doesn't need to do shifting during active scanlines as I was just planning to shift the sprite data once during vertical blanking - so there were only going to be 2 copies of the mouse data, the reference image, and the shifted image (per scanline). Maybe you've found a sufficiently fast way to do it on the fly?
  • TonyB_ Posts: 2,108
    edited 2018-11-12 01:37
    Roger, we are clearly on the same wavelength! :)

    Please check this code:
    ' Cog B sprite data shifter
    '
    	mov 	sprite_ptr,#sprite_buf
    	mov	shift,sprite_x_y	' x coordinate in low bits
    	and	shift,#3	wz	' shift = 0/1/2/3
    	shl	shift,#3		' shift = 0/8/16/24
    '
     if_z	mov	mask,#0			' $00000000 when sprite_x[1:0]=00 
     if_nz	not	mask,#0
     if_nz	bitl	mask,shift
     if_nz	signx	mask,shift		' $000000FF/$0000FFFF/$00FFFFFF when sprite_x[1:0]=01/10/11
    '
    	mov	old_x,#0
    	rep	@.end,#8
    '
    	rdlut	x,sprite_ptr
    	rol	x,shift
    	mov	new_x,x
    	andn	new_x,mask
    	or	new_x,old_x
    	wrlut	new_x,sprite_ptr
    	add	sprite_ptr,#1
    	mov	old_x,x
    	and	old_x,mask
    .end
    	wrlut	old_x,sprite_ptr
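
One way to sanity-check the shifter logic above is to model it in Python and compare the output with a plain byte shift. This is a sketch only: the mask is computed directly as the low `shift` bits rather than through the NOT/BITL/SIGNX sequence, the LUT is replaced by a Python list, and the function names are made up for the example.

```python
MASK32 = 0xFFFFFFFF

def rol(x, n):
    """Rotate a 32-bit value left by n bits (0 <= n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def shift_sprite_line(longs, sprite_x):
    """Shift 8 sprite longs right on screen by sprite_x[1:0] pixels, giving 9 longs."""
    shift = (sprite_x & 3) * 8
    mask = (1 << shift) - 1            # 0, $FF, $FFFF or $FFFFFF
    out, old_x = [], 0
    for x in longs:
        x = rol(x, shift)
        out.append((x & ~mask & MASK32) | old_x)  # andn new_x,mask / or new_x,old_x
        old_x = x & mask                          # bytes carried into the next long
    out.append(old_x)                             # final wrlut after the loop
    return out

line = bytes(range(1, 33))                        # 32 pixels with values 1..32
longs = [int.from_bytes(line[i:i + 4], "little") for i in range(0, 32, 4)]
shifted = shift_sprite_line(longs, sprite_x=1)
out_bytes = b"".join(v.to_bytes(4, "little") for v in shifted)
print(out_bytes == bytes([0]) + line + bytes(3))  # True
```

The transparent $00 bytes padding both ends of the 36-byte result are what let the shifted line be written at a long-aligned address.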
    

    A look-up table could generate mask more quickly but I've opted to save longs here. I think sprite (x,y) should be in a single long and read at the start of every line, so that the sprite could appear on screen multiple times.

    EDIT:
    Sprite data shifting is not required when using WMLONG - see later posts.
  • Still working out the logic in your code but the cycle count looks good, 19*8 + 11*2 = 174 clock cycles for the mouse sprite shifting, so this could overlap the time taken to read in the 160 longs plus prepare the temporary buffer pretty nicely, as that would already take at least 164 + 13 = 177 cycles or so anyway in the fastest possible case.

    It's a great use of spare time to make COG B pull its weight a little more while COG-A is doing some other work for it. Good job again. :smile:
  • rogloh Posts: 5,122
    edited 2018-11-11 02:35
    You know, it is rather interesting to see that the sprite shifting work takes a COG quite a long time relative to the other work, especially considering we now have this new WMLONG instruction. I think that means an external sprite driver would run at its fastest if your sprite image data was already available as 4 copies in hub for the different shift positions. A single COG would then be able to do many more sprites per scanline, assuming you want to be able to display them on single pixel boundaries; the same goes for tiles with scrolling.

    This 4 copy idea is worth considering in the future if you want lots more sprites, but it would obviously burn hub RAM much faster. Alternatively you just use burst byte transfers instead of longs and live with the fact that it is 4x slower to transfer all the data; however, with this extra shifting overhead, maybe the byte transfers really start to make sense if you want to conserve hub RAM.

    Update: I trust WMLONG can be used to just do single masked byte transfers too on byte boundaries else it may not be quite so fast if restricted to longs, as that may still require some pre shifting per byte. I think the P2 long alignment is different now anyway to make this still possible.
  • rogloh Posts: 5,122
    edited 2018-11-11 02:48
    On a P2 what actually happens if you write a long across a long boundary?

    Assume writing long AABBCCDD to hub address 0x1

    does the hub memory then look like this
    Byte address    Contents    
    0x00               AA
    0x01               DD
    0x02               CC
    0x03               BB
    
    or is another byte of the long written into address 0x4 with a second bus transfer, or something else altogether? I know the P1 would just do this
    Byte address    Contents    
    0x00               DD
    0x01               CC
    0x02               BB
    0x03               AA
    
    but I think P2 is different now, right? Documentation just says there are no special alignment rules for words and longs in hub RAM, but I'm wondering what this means for the data results.
  • P2 would give this result
    Byte address    Contents    
    0x00               ??
    0x01               DD
    0x02               CC
    0x03               BB
    0x04               AA
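
The result above is simply a little-endian write starting at the given byte address, with no wrap-around inside the long. A Python sketch of the same write, with hub RAM modelled as a `bytearray` (the `wrlong` helper is illustrative, not a real API):

```python
def wrlong(hub, addr, value):
    """Model of an unaligned P2 long write: 4 little-endian bytes from addr."""
    hub[addr:addr + 4] = value.to_bytes(4, "little")

hub = bytearray(8)             # address 0 keeps whatever it held before (here 00)
wrlong(hub, 1, 0xAABBCCDD)
print(hub[:5].hex(" "))        # 00 dd cc bb aa
```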
    
  • Ok thanks ozpropdev, so WMLONG probably works the same and we can make use of this to perhaps avoid some of this sprite shifting work.
  • rogloh Posts: 5,122
    edited 2018-11-11 03:07
    Would WMLONG in a SETQ/SETQ2 burst writing to a non-aligned long boundary then take ~2x as long to complete compared to if it were aligned? Documentation mentions up to 20 clocks for single transfer but I am wondering about the burst transfers.