HDMI added to Prop2 - Page 20 — Parallax Forums

HDMI added to Prop2


Comments

  • Roger
    I use WMLONG for my sprites in my invaders demo. Works great! :)
  • ozpropdev, if you get a spare moment and want to try this, it would be really good to know how many clock cycles a WMLONG burst takes when the initial long address is not aligned. Perhaps a single hub access just before the timed code would keep the egg-beater starting conditions reasonably consistent during the test. I'm wondering whether WMLONG bursting has to fall back to one or two hub windows for each unaligned transfer that writes into two hub locations, or whether this case is optimised and can still sustain one long write per clock even when unaligned. In theory it could be, if the write data is held temporarily during the burst; I just don't know.

    Eg, time the number of P2 clocks for doing say 500 transfers from LUTRAM with each of these four cases to see if the first one is the fastest and by how much, I'd guess the write timing itself is LUT data content independent:

    setq2 #500
    wmlong 0, #$400

    setq2 #500
    wmlong 0, #$401

    setq2 #500
    wmlong 0, #$402

    setq2 #500
    wmlong 0, #$403

    Cheers
    Roger

  • Roger
    Here's clock count for the 4 different address offsets.
    I add a random waitx to mix things up a bit every 4 passes.
    Looks pretty consistent.
    512
    506
    513
    507
    
    508
    506
    511
    511
    
    512
    508
    510
    506
    
    508
    506
    506
    511
    
    505
    506
    506
    512
    
    512
    507
    511
    506
    
    505
    506
    507
    508
    
    512
    508
    510
    513
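To make the spread in those counts easier to see, here is a minimal Python sketch that groups them by address offset. It assumes that each group of four values lists offsets +0 to +3 in order; the post does not say so explicitly.

```python
# Clock counts copied from the post above, one row per pass.
# Assumption: the four values in each row are offsets +0, +1, +2, +3 in that order.
passes = [
    [512, 506, 513, 507],
    [508, 506, 511, 511],
    [512, 508, 510, 506],
    [508, 506, 506, 511],
    [505, 506, 506, 512],
    [512, 507, 511, 506],
    [505, 506, 507, 508],
    [512, 508, 510, 513],
]


def per_offset_range(rows):
    """Return a list of (min, max) clock counts, one tuple per address offset."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


ranges = per_offset_range(passes)
overall_min = min(lo for lo, _ in ranges)
overall_max = max(hi for _, hi in ranges)

for offset, (lo, hi) in enumerate(ranges):
    print(f"offset +{offset}: {lo}..{hi} clocks")
print(f"overall spread: {overall_max - overall_min} clocks")
```

Under that assumption every offset stays within 505..513 clocks, so nothing here suggests the roughly 2x penalty that was being asked about.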
    
    
  • Cluso99 Posts: 18,069
    edited 2018-11-11 10:13
    I did something similar last night, Roger. There are a few unusual setup times for the setq+rdlong sequence.
            rdlong  data,ptra                 ' syncs to hub
    '       ------------------------------------------------
            getct1  t1                        ' 2  \
            waitx   #3                        ' 5  | =16
            rdlong  data,ptra                 ' 9  /
    '       ------------------------------------------------
            getct1  t2
    
    
    '       ------------------------------------------------
            getct1  t1                        ' 2  \
            waitx   #3                        ' 5  | =25
            setq    #1                        ' 2  |
            rdlong  data,ptra                 ' 16 /
    '       ------------------------------------------------
    
    '       ------------------------------------------------
            getct1  t1                        ' 2  \
                                              ' 0  | =17
            setq    #1                        ' 2  |
            rdlong  data,ptra                 ' 13 /
    '       ------------------------------------------------
    
    '       ------------------------------------------------
            getct1  t1                        ' 2  \
            waitx   #1                        ' 3  | =17
            setq    #1                        ' 2  |
            rdlong  data,ptra                 ' 10 /
    '       ------------------------------------------------
    
    '       ------------------------------------------------
            getct   t1                        '  2  \
            nop                               '  2  | =115
            setq    #100-1                    '  2  |
            rdlong  data,ptra                 ' 109 /
    '       ------------------------------------------------
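One way to read these timings is as a single hub read plus one clock for each additional long. The Python sketch below checks that reading against the figures in the listings. The model itself is an assumption fitted to this post, and the 9..16 clock range for a single RDLONG is the one quoted later in the thread, not something measured here.

```python
# Assumption: a burst of n longs costs one single-long hub read plus one clock
# for every additional long. This is a model fitted to the posted numbers,
# not a statement taken from the P2 documentation.
SINGLE_READ_MIN = 9   # best-case single RDLONG (range quoted later in the thread)
SINGLE_READ_MAX = 16  # worst-case single RDLONG (range quoted later in the thread)


def burst_read_range(n_longs):
    """Return (best, worst) clocks for a SETQ+RDLONG burst of n_longs longs."""
    extra = n_longs - 1
    return SINGLE_READ_MIN + extra, SINGLE_READ_MAX + extra


# (longs transferred, clocks listed against the RDLONG) from the listings above.
# SETQ #1 transfers two longs, SETQ #100-1 transfers one hundred.
measured = [(1, 9), (2, 16), (2, 13), (2, 10), (100, 109)]

results = []
for n_longs, clocks in measured:
    best, worst = burst_read_range(n_longs)
    fits = best <= clocks <= worst
    results.append(fits)
    print(f"{n_longs:3d} longs: measured {clocks}, model {best}..{worst}, fits={fits}")
```

All five measurements fall inside the modelled range. The same model gives a worst case of 195 clocks for a 180-long read, which is the figure used further down the thread.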
    
  • rogloh Posts: 5,573
    edited 2018-11-11 10:26
    That's an awesome result, thanks guys! Looks like we don't have to shift any of the sprites, we can just write to an unaligned address and won't incur any significant penalty for doing so in longer bursts. I'm learning some good things about this P2. It's going to fly!
  • Cluso99 Posts: 18,069
    Here's what you asked for:
                    wmlong  0, adr0  
    '+-----------------------------------------------------------------------------+
                    getct   t1                             
                    setq2   #500
                    wmlong  0, adr0           ' $1FC = 508
    '+-----------------------------------------------------------------------------+
                    getct   t2                             
                    setq2   #500
                    wmlong  0, adr1           ' $201 = 513
    '+-----------------------------------------------------------------------------+
                    getct   t3                             
                    setq2   #500
                    wmlong  0, adr2           ' $200 = 512
    '+-----------------------------------------------------------------------------+
                    getct   t4                             
                    setq2   #500
                    wmlong  0, adr3           ' $200 = 512
    '+-----------------------------------------------------------------------------+
                    getct   t5                             
    
  • The egg-beater makes it trickier to figure out the hub sweet spot. Cluso, were you able to find the minimum gaps that get RDLONG down to the 3 clocks listed in the documentation?
  • Cluso99 Posts: 18,069
    edited 2018-11-11 10:44
    This was the sweet spot
            rdlong  data,ptra                 ' syncs to hub
    '       ------------------------------------------------
            getct1  t1                        ' 2  \
            waitx   #3                        ' 5  | =16
            rdlong  data,ptra                 ' 9  /
    '       ------------------------------------------------
            getct1  t2
    

    With waitx #4 it takes another 8 clocks.
  • TonyB_ Posts: 2,163
    edited 2018-11-11 14:45
    rogloh wrote: »
    Would WMLONG in a SETQ/SETQ2 burst writing to a non-aligned long boundary then take ~2x as long to complete compared to if it were aligned? Documentation mentions up to 20 clocks for single transfer but I am wondering about the burst transfers.

    The extra time for non-aligned transfers is only one cycle, as indicated by * in the Instructions PDF.
    rogloh wrote: »
    That's an awesome result, thanks guys! Looks like we don't have to shift any of the sprites, we can just write to an unaligned address and won't incur any significant penalty for doing so in longer bursts. I'm learning some good things about this P2. It's going to fly!

    Shifting does take an age and I should have realised that it's not needed, never mind. A dedicated cog could handle loads of 256-colour sprites.

    Other things to look at:

    1. Cog A adding and removing vertical sync.

    2. Cog A reading palette into LUT, best done just before first active line, copied by both cogs to cog RAM $00-FF.

    3. Cog A reading sprite (x,y) every (active) line. Could cog B do something else that's useful when cog A is dealing with the sprite?

    4. Cog B starting streamer.
  • This thread will make an awesome case study for coding with the P2!

    J
  • rogloh Posts: 5,573
    edited 2018-11-11 20:39
    So after seeing Cluso's results it became evident to me that I've been assuming hub write and hub read timing are the same on P2, which is not correct. I'd read 3..10 clocks as the range for a hub transfer and used that, but that is only the range for hub writes. Hub reads apparently vary from 9..16 clocks, so my earlier mouse sprite numbers are not accurate. I now think the values are up to 12 clocks larger than my earlier ones, accounting for both burst reads, so up to 102 clocks instead of the 90 I had before. These numbers will vary slightly anyway depending on the addresses read, so it is difficult to pinpoint precisely what they will be just by looking at my sample code. P2 is going to be quite a bit more difficult in that respect: if you are writing very deterministic code that accesses the hub, you will either need to use deterministic hub address sequences or find another way to resync the code after your hub accesses. That's probably the only real downside of the egg-beater from a programming perspective, but the performance is great.
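The correction from 90 to 102 clocks can be reproduced with a couple of lines of arithmetic. This Python sketch assumes, as the paragraph implies, that the old estimate charged each of the two burst reads at the write range and that both ends of the range shift by the same amount; the function name is made up for illustration.

```python
WRITE_RANGE = (3, 10)  # clocks for a hub write, per the post
READ_RANGE = (9, 16)   # clocks for a hub read, per the post


def corrected_estimate(old_clocks, n_reads):
    """Add the read-versus-write difference once for every burst read."""
    per_read_penalty = READ_RANGE[0] - WRITE_RANGE[0]
    return old_clocks + n_reads * per_read_penalty


new_estimate = corrected_estimate(90, 2)
print(new_estimate)  # 102
```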

    Good news is I think this is still fast enough to do a mouse sprite in 256 colour mode with a bitbang HDMI driver on rev A silicon at VGA resolution, and I have high confidence it should leave enough cycles for the remaining per scan line setup/housekeeping work. Without other optimisations, though, it won't fit a 720 pixel wide screen with 4 loop unrolls. I think TonyB_ mentioned earlier that we had 300 clocks left over, and all of these cycles would be needed for the mouse, a 180 long pixel buffer burst read (up to 195 clocks) and some remaining housekeeping. That's not quite enough yet, but with effort it's probably doable one way or another if ever required.
  • Cluso99 Posts: 18,069
    What I found unusual is what must be a setup time as I found the sweet spot by varying the WAITX value.
    So yes, the hub address will be critical to hit successive sweet spots.

    As for your mouse pointer, what if the screen width is 720 but you allowed say a 10px border all around. Ie only use 700px with black border that lets the mouse extend to the real 720 edge. While the usable screen px would be smaller, would make the maths easier.
  • TonyB_ Posts: 2,163
    edited 2018-11-12 02:05
    Yes, we have been underestimating block read times. Here is some of what cog A must do before the TMDS encoding:
    Cog A operation                           Instr          Longs

    1. Read sprite line to sprite buffer      SETQ2+RDLONG       8
    2. Read scan line to line buffer          SETQ2+RDLONG     160
    3. Write pixels to temp buffer            SETQ2+WRLONG       9
    4. Write sprite to temp buffer            SETQ2+WMLONG       8 (9 writes if not long-aligned)
    5. Read sprite+pixels from temp buffer    SETQ2+RDLONG       9
    

    The start addresses for ops 1-3 & 5 could be eight-long aligned, but will vary for op 4. I think we should assume worst-case time gaps of 16 cycles between these five block operations.
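For a rough upper bound on those five block operations, the Python sketch below adds them up. It assumes one clock per long inside a burst and charges the suggested 16-cycle worst-case gap once per operation. Both are simplifications made for this estimate only, not measured values.

```python
GAP_WORST = 16  # worst-case gap per block operation, as suggested in the post

# (description, longs transferred). Op 4 is counted as 9 longs because a
# WMLONG burst that is not long-aligned needs nine hub writes.
ops = [
    ("read sprite line", 8),
    ("read scan line", 160),
    ("write pixels to temp buffer", 9),
    ("write sprite to temp buffer", 9),
    ("read sprite+pixels back", 9),
]


def worst_case_clocks(operations, gap=GAP_WORST):
    """Assume one clock per long in a burst plus a fixed worst-case gap per op."""
    return sum(longs + gap for _, longs in operations)


total = worst_case_clocks(ops)
print(f"estimated worst case: {total} clocks")
```

With these assumptions the five operations come to 275 clocks before any other per-line instructions are counted.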
  • TonyB_ Posts: 2,163
    edited 2018-11-12 02:07
    Cluso99 wrote: »
    As for your mouse pointer, what if the screen width is 720 but you allowed say a 10px border all around. Ie only use 700px with black border that lets the mouse extend to the real 720 edge. While the usable screen px would be smaller, would make the maths easier.

    I think the simplest and quickest way to overlay a 32-pixel width mouse pointer/sprite is to have a 32+640+32 = 704 pixel line buffer in the LUT, or 176 longs for 8bpp. Only the middle 160 longs are read by the TMDS encoder and the contents of the first and last eight do not matter. If there is a need to save space the first eight could contain the 32-pixel sprite line.

    The sprite will be on the screen if its leftmost x-coordinate is [-31 ... +639]* or adding 32 [+1 ... +671]. The latter divided by four gives the address in the LUT line buffer of nine longs (36 pixels) to write to a temp buffer in hub RAM. The low two bits of sprite(x) provide the byte offset into the temp buffer to write the eight sprite longs using WMLONG. The resulting nine longs in the buffer are read back to the same LUT address as the nine that were written.

    * Assuming sprite(y) = [-31 ... +479] for 32x32 sprite.
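The coordinate mapping described above can be sketched in Python as follows. The function and constant names are hypothetical; only the arithmetic (add 32, divide by four for the LUT long, low two bits for the byte offset) comes from the post.

```python
SPRITE_WIDTH = 32     # pixels
MARGIN = 32           # off-screen pixels on each side of the line buffer
ACTIVE_WIDTH = 640    # visible pixels per line
PIXELS_PER_LONG = 4   # 8bpp, so four pixels per long

LINE_BUFFER_LONGS = (MARGIN + ACTIVE_WIDTH + MARGIN) // PIXELS_PER_LONG


def sprite_placement(sprite_x):
    """Map the leftmost sprite x to (LUT long index, byte offset).

    Returns None when no part of the sprite is on the visible line.
    """
    if not -(SPRITE_WIDTH - 1) <= sprite_x <= ACTIVE_WIDTH - 1:
        return None
    buffer_x = sprite_x + MARGIN  # now in the range 1..671
    return buffer_x // PIXELS_PER_LONG, buffer_x % PIXELS_PER_LONG


for x in (-32, -31, 0, 5, 639, 640):
    print(x, sprite_placement(x))
```

Even at x = 639 the nine longs touched (167 to 175) stay inside the 176-long buffer, which is why the two 8-long margins are enough.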
  • rogloh Posts: 5,573
    edited 2018-11-12 08:24
    @TonyB_ yeah we should just assume worst case for now, it's just easier that way.

    Cluso, border could help shave a few cycles and could work. I see it as cheating a little though. Ultimately 8 unrolls plus some LUT RAM execution might be the trick for that 720 res. Though it does add complexity because it turns out that 720 pixels is actually 22.5 iterations of the 8x unrolled loop that is generating 32 pixels at a time and this is not an integer. You'd probably have to write 16 extra bogus pixels on the line buffer in hub memory that get cleaned up afterwards - not good as that would waste another 50 or so cycles to clean it up.
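The 22.5 figure falls out of simple division, and the same calculation shows which unroll counts divide each line width evenly. This Python sketch takes four pixels per unrolled copy, which follows from the 8x loop producing 32 pixels per iteration; the function name is made up.

```python
from fractions import Fraction

PIXELS_PER_UNROLL = 4  # an 8x unroll yields 32 pixels per iteration, per the post


def iterations(width, unroll):
    """Return the exact number of loop iterations needed for one line."""
    return Fraction(width, PIXELS_PER_UNROLL * unroll)


for width in (640, 720):
    for unroll in (4, 6, 8):
        n = iterations(width, unroll)
        note = "whole" if n.denominator == 1 else "not whole"
        print(f"{width} px, {unroll}x unroll: {float(n):g} iterations ({note})")
```

For 720 pixels the 4x and 6x unrolls give whole iteration counts (45 and 30) while 8x gives 22.5; for 640 pixels it is 6x that does not divide evenly.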

    A 6x unroll for the 720 width implementation is a good fit for the budget (it takes 7920 best case cycles for TMDS stuff, leaving 660 for the mouse sprite, initial pixel buffer read and housekeeping, that's enough) but makes the timing pretty tricky for COG-B to follow as it creates new slice write addresses each iteration, introducing other delays which would affect these numbers. Might be somehow workable if this delay pattern is known/computable by COG-B, not sure.

    UPDATE: 6x unroll looks interesting because it just alternates between 2 different hub window separations for each 60 longs transferred because every second iteration starts on the same slice - 120 longs is divisible by 8. This is a simpler pattern to follow.

    UPDATE2: the hub window pattern might just be 264 clocks = 33 hub cycles for the first iteration for processing 60 longs, then 34 hubs for the next 60 longs, then 33, then 34 etc. So you might need to add 15x8=120 clocks to the 7920 to account for this, making 8040 and COG-B would have to introduce some delays accordingly to remain in sync. Still leaves 540 clocks at the end of this for all the other work. I think this might be the way to go for a 720 pixel wide bitbang driver...and it saves the COG RAM space too.
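As a quick check of the UPDATE2 numbers, this Python sketch sums the alternating 33/34 hub-cycle pattern over a scanline. The 30 iterations and the 8580-clock line total are derived from the figures in this post (7920 / 264 and 7920 + 660) and are not stated directly.

```python
HUB_WINDOW = 8            # clocks per hub cycle
ITERATIONS = 30           # 7920 clocks / 264 clocks per iteration
LINE_CLOCKS = 7920 + 660  # clocks per scanline implied by the post


def tmds_clocks(iterations=ITERATIONS):
    """Sum an alternating 33/34 hub-cycle pattern over all iterations."""
    total = 0
    for i in range(iterations):
        hub_cycles = 33 if i % 2 == 0 else 34
        total += hub_cycles * HUB_WINDOW
    return total


used = tmds_clocks()
print(f"TMDS loop: {used} clocks, left over: {LINE_CLOCKS - used} clocks")
```

This reproduces the 8040 clocks for the TMDS loop and the 540 clocks left for everything else.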
  • rogloh Posts: 5,573
    edited 2018-11-12 02:13
    It is normal to allow sprites to go left of the start of the screen and only be partially visible. You can have negative co-ordinates represent this or you can just offset everything on the screen by the size of the biggest supported sprite. Both schemes are workable. Probably just optimize it for speed, though it can make the application side software developers suffer slightly more head scratching when they start needing to use negative numbers etc.
  • TonyB_ wrote: »
    I think the simplest and quickest way to overlay a 32-pixel width mouse pointer/sprite is to have a 32+640+32 = 704 pixel line buffer in the LUT, or 176 longs for 8bpp. Only the middle 160 longs are read by the TMDS encoder and the contents of the first and last eight do not matter. If there is a need to save space the first eight could contain the 32-pixel sprite line.

    The sprite will be on the screen if its leftmost x-coordinate is [-31 ... +639]* or adding 32 [+1 ... +671]. The latter divided by four gives the address in the LUT line buffer of nine longs (36 pixels) to write to a temp buffer in hub RAM. The low two bits of sprite(x) provide the byte offset into the temp buffer to write the eight sprite longs using WMLONG. The resulting nine longs in the buffer are read back to the same LUT address as the nine that were written.

    * Assuming sprite(y) = [-31 ... +479] for 32x32 sprite.

    Getting apps to add 32 to sprite(x,y) is easiest all round, probably. Least work for Cog A and sprite can be disabled by setting x or y to zero.
  • For mouse use, from the application side, the pointer hot spot is probably what you want to set between 0 and 639. So the low level mouse driver that passes the co-ordinates to the HDMI engine COG-A via hub, is going to want to account for the hotspot offset inside the mouse sprite image too. Or it can be done in the HDMI engine if that has knowledge of the hotspot offsets.
  • rogloh Posts: 5,573
    edited 2018-11-13 09:25
    Another table maybe useful for our timing computations...
    [Image attachment: hubtmds.png]

    Update:
    We could put in a variable delay that alternates between say 4 or 8 clock cycles every iteration in the outer loop and this might force the hub to remain aligned for the 6x unroll case I would hope, perhaps meaning 34 hub cycles / iteration = 8160 clocks per scanline in this case. It would always need to round up to the next window though. Something like...
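    ' rep ...
    '...unrolled tmds encode parts...
    waitx   delay           ' variable wait before the burst (2 + delay clocks)
    xor     delay, #4       ' toggle bit 2 so the wait alternates each iteration
    setq2   #60-1           ' next hub write becomes a 60-long burst from LUT RAM
    wrlong  data, ptra      ' write the burst to hub RAM at ptra
    add     ptra, #240      ' advance the hub pointer by 60 longs (240 bytes)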
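The snippet does not show the starting value of delay, so here is a small Python sketch of what the XOR toggle produces for two hypothetical starting values. It relies on WAITX taking 2 + D clocks, which matches the waitx #3 = 5 and waitx #1 = 3 timings posted earlier in the thread.

```python
def waitx_clocks(d):
    """WAITX takes 2 + D clocks (consistent with timings posted earlier)."""
    return 2 + d


def delay_sequence(initial, passes):
    """Return the clocks spent in WAITX on each pass of the outer loop."""
    delay = initial
    clocks = []
    for _ in range(passes):
        clocks.append(waitx_clocks(delay))
        delay ^= 4  # models: xor delay, #4
    return clocks


# The initial values below are hypothetical; the post does not give one.
print(delay_sequence(2, 6))  # [4, 8, 4, 8, 4, 8]
print(delay_sequence(0, 6))  # [2, 6, 2, 6, 2, 6]
```

Starting from 2 gives the alternating 4 and 8 clock waits mentioned in the text. Any starting value produces two waits that differ by exactly 4 clocks.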
    

    Update2: I'm rethinking this delay for the 6x unroll. I have a feeling that the alternating hub slice starting point will just induce a 4 clock delay each iteration. So for 6 unrolls it would be always taking 33.5 hub cycles on each burst once they get going meaning 8040 clocks total with the 30 iterations. To compensate for this and remain aligned COG-B should just add 4 extra cycles each iteration of the outer loop on top of its normal work beyond the 264 clocks. I think this is the case, but might still be wrong.
  • TonyB_ Posts: 2,163
    edited 2018-11-13 02:03
    Thanks for the table.

    I've been working on cog A code for active display lines. The current worst-case timing for everything after a new line starts and before the TMDS encoding is 312 cycles for 32 instructions. Two longs of cursor/sprite attributes are read from hub RAM every line: top-left (x,y) and pattern address. If the high bit of the latter is set the sprite will flash on and off at ~1 Hz.

    We can arrange the line buffer data in hub RAM the way that suits us best. Blanking could come first then active display, or vice-versa. TMDS data could lead the streamer if blanking is first.
  • After the TMDS encoding there is only a conditional jump back to the start of the line or a fall-through to the blanking code, so four unrolls should be enough for 640 pixel lines. Cog B has nothing to do after the start of line until the TMDS code and despite its extra patch instructions cog A will require more code overall.
  • rogloh Posts: 5,573
    edited 2018-11-13 06:35
    Yeah it is a pity we can't do more mouse sprite handling in COG-B, COG-A pretty much needs to do all of this because of its free access to the hub RAM. But that is just the way it is with the bitbang driver. Anyway it looks like 640 pixel wide screens are a go for timing with a mouse sprite using 4x unroll. The 720 pixel variant would need a 6x approach, which I'd expect is hopefully achievable too now. So it's all looking pretty good at the moment. As long as the TMDS encode disparity issue is ok with DVI/HDMI monitors, I really think this is going to work.

    Your approach of reading x,y each scan line is intriguing me TonyB_. For a standard mouse or even overlaid flashing cursor use it is not required or probably desirable, as you won't want to see multiple mice on the same screen when its co-ordinate long is asynchronously modified by other COGs, but for other effects with a single dynamic sprite that gets modified synchronously at the horizontal blanking time I guess it could be of some use (e.g. vertical split screen mode with text+cursor in one part, and graphics+mouse in another). However if two sprites were possible on a line, you could have a graphics mode and flashing text cursor anywhere on the same screen without software touching the frame buffer. This might be possible to squeeze into the timing budget if the second sprite remains very small and its image is very simply constructed/computed dynamically, e.g. flashing single pixel vertical line for a text cursor at a given x pixel offset on the line. I could see that being very useful for GUI. Maybe a simple vertical line (or block) cursor is something COG-B could do.

    Speaking of sync, it makes sense to have a long somewhere in hub updated per scanline, and also per frame that other COGs can use to synchronize their own updates to HSYNC and VSYNC etc. This is certainly needed for sprite mode too so the external COG(s) generating their scanline buffers can remain locked with the output. I hope you considered that in your housekeeping overhead.

    I briefly looked at sprite timing for external COGs and the P2 capabilities. Making full use of the handy WMLONG and with 16 pixel wide tiles and 16 pixel wide sprites in 256 colour mode at 640x480 res, my first rough estimate is that you should hopefully be able to get somewhere from about 64-100 or so sprites per scanline per sprite rendering COG depending on how memory is structured and how well optimized the code is. If lines are doubled, eg. 320x240 res, this is increased beyond a factor of 2. I am also trying to figure out how easy it is to pixel double horizontally dynamically as a mode without needing another COG-A/COG-B restart upsetting HDMI lock. Having lower resolution modes supported is still useful and it saves hub RAM too.

    Update: By the way you could also support quite a lot more sprites than 100 per screen if no more than some fixed limit of say 64 (or perhaps more) could ever be on a single line and your sprite code stops once this limit is reached, because some savings can be had when the sprite image is not read/written. I'm seeing approximately 8 hub cycle windows (64 clocks) of processing required per rendered 16 pixel sprite after the list is loaded, of which you'd know within 2-3 hub windows if the sprite is required on this scanline or not. LUT RAM makes a great space to hold the full sprite list. With a 64 displayed sprite limit you might be talking ~200 sprites per screen, or, if you read a new sprite list for each new group of scanlines instead of once per frame, you can do a lot more than this again. Gauntlet type gaming with lots and lots of sprites shown at once.
  • rogloh Posts: 5,573
    edited 2018-11-13 10:23
    I think I've found a very simple way to support a double wide pixel mode for enabling 320 or 360 screen widths using the same base TMDS loop code for both regular and double wide pixel modes ... we just need to convert the code from the left side to the right side via patching instructions (or offset registers) at vertical frame sync time if the mode changes. As long as the unroll count is a multiple of 2, such as 4 or 6, this works. You'd just need to read half the source data into LUT RAM (or you could even keep it as is), and also adjust the frame buffer pointer by 320 instead of 640 per scanline. Even though I patched the actual constants in the sample below for our COG-A, it probably makes sense to make all these constants registers instead, so the amount of patching needed is minimized with the other unrolled loop copies sharing it. You'd just need 7 (actually only 6) registers modified then for the whole unrolled loop, and these can be altered with simple xor's when the mode width changes. In comparison to this the vertical line doubling is even simpler: just increment the scanline pointer every second line.

    Done!

    Edit: the #0 in getbyte is not an S register, so looks like we would have to patch the actual instruction. Still workable though.
    	rdlut	pixels,pixeladdr               rdlut	pixels,pixeladdr
    	add	pixeladdr,#1                   add	pixeladdr,#0   ' patched so we repeat the same long read lower down
                                                   
    	getbyte	x,pixels,#0                    getbyte	x,pixels,#0
    	altd	x,#palette                     altd	x,#palette
    	wrlut	0-0,#buf+1                     wrlut	0-0,#buf+1
                                                                         
    	getbyte	x,pixels,#1                    getbyte	x,pixels,#0 ' patched
    	alts 	x,#palette                     alts	x,#palette
    	setword	patch,0-0,#1                   setword	patch,0-0,#1
    	wrlut	patch,#buf+3                   wrlut	patch,#buf+3
                                                   
    	getbyte	x,pixels,#2                    getbyte	x,pixels,#1  ' patched
    	altd	x,#palette                     altd	x,#palette
    	wrlut	0-0,#buf+6                     wrlut	0-0,#buf+6
                                                                         
    	getbyte	x,pixels,#3                    getbyte	x,pixels,#1 ' patched
    	alts 	x,#palette                     alts	x,#palette
    	setword	patch,0-0,#1                   setword	patch,0-0,#1
    	wrlut	patch,#buf+8                   wrlut	patch,#buf+8
                                                                         
    	rdlut	pixels,pixeladdr               rdlut	pixels, pixeladdr
    	add	pixeladdr,#1                   add	pixeladdr,#1
                                                   
    	getbyte	x,pixels,#0                    getbyte	x,pixels,#2 ' patched
    	altd	x,#palette                     altd	x,#palette
    	wrlut	0-0,#buf+11                    wrlut	0-0,#buf+11
                                                                         
    	getbyte	x,pixels,#1                    getbyte	x,pixels,#2  ' patched
    	alts 	x,#palette                     alts	x,#palette
    	setword	patch,0-0,#1                   setword	patch,0-0,#1
    	wrlut	patch,#buf+13                  wrlut	patch,#buf+13
                                                   
    	getbyte	x,pixels,#2                    getbyte	x,pixels,#3 ' patched
    	altd	x,#palette                     altd	x,#palette
    	wrlut	0-0,#buf+16                    wrlut	0-0,#buf+16
                                                                         
    	getbyte	x,pixels,#3                    getbyte	x,pixels,#3
    	alts 	x,#palette                     alts	x,#palette
    	setword	patch,0-0,#1                   setword	patch,0-0,#1
    	wrlut	patch,#buf+18                  wrlut	patch,#buf+18
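The effect of the patched byte indices can be modelled in a few lines of Python. This sketch only mimics which source byte feeds each of the eight palette lookups; the palette lookup and TMDS encoding are left out, and the helper names are made up.

```python
def getbyte(value, n):
    """Return byte n (0 = least significant) of a 32-bit long, like GETBYTE."""
    return (value >> (8 * n)) & 0xFF


def fetch_eight_pixels(longs, double_wide):
    """Return the eight 8bpp pixel values consumed by one pass of the listing."""
    if double_wide:
        # Patched code: one long feeds all eight outputs, each byte used twice.
        return [getbyte(longs[0], k // 2) for k in range(8)]
    # Original code: two longs feed eight outputs, each byte used once.
    return [getbyte(longs[k // 4], k % 4) for k in range(8)]


source = [0x44332211, 0x88776655]
normal = fetch_eight_pixels(source, double_wide=False)
doubled = fetch_eight_pixels(source, double_wide=True)
print([hex(p) for p in normal])
print([hex(p) for p in doubled])
```

In normal mode the pass consumes both longs, while in double-wide mode it consumes only the first and emits every pixel twice. That is why only half the source data is needed per line.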
    
  • rogloh wrote: »
    Your approach of reading x,y each scan line is intriguing me TonyB_. For a standard mouse or even overlaid flashing cursor use it is not required or probably desirable, as you won't want to see multiple mice on the same screen when its co-ordinate long is asynchronously modified by other COGs, but for other effects with a single dynamic sprite that gets modified synchronously at the horizontal blanking time I guess it could be of some use (e.g. vertical split screen mode with text+cursor in one part, and graphics+mouse in another). However if two sprites were possible on a line, you could have a graphics mode and flashing text cursor anywhere on the same screen without software touching the frame buffer. This might be possible to squeeze into the timing budget if the second sprite remains very small and its image is very simply constructed/computed dynamically, e.g. flashing single pixel vertical line for a text cursor at a given x pixel offset on the line. I could see that being very useful for GUI. Maybe a simple vertical line (or block) cursor is something COG-B could do.

    Speaking of sync, it makes sense to have a long somewhere in hub updated per scanline, and also per frame that other COGs can use to synchronize their own updates to HSYNC and VSYNC etc. This is certainly needed for sprite mode too so the external COG(s) generating their scanline buffers can remain locked with the output. I hope you considered that in your housekeeping overhead.

    There are line and frame counter words in the same long that could be written to hub RAM at worst-case +10 cycles, which would make elapsed time after start of line before sprite handling ~200 cycles. One sprite 32 pixels wide takes roughly 100 cycles, so it should be possible to have two.
  • rogloh wrote: »
    I think I've found a very simple way to support a double wide pixel mode for enabling 320 or 360 screen widths using the same base TMDS loop code for both regular and double wide pixel modes ... we just need to convert the code from the left side to the right side via patching instructions (or offset registers) at vertical frame sync time if the mode changes...
    	rdlut	pixels,pixeladdr               rdlut	pixels,pixeladdr
    	add	pixeladdr,#1                   add	pixeladdr,#0   ' patched so we repeat the same long read lower down
    	...
    

    This is a very good idea, rogloh. We could use two registers to add to pixeladdr after the rdlut's to avoid patching.

    We need to set the 640 display pixels to one of two special control values during blanking, for vsync on or off. The TMDS encoder could be modified slightly to do this for us during blanking lines. Set pixels to 0 or 1 and palette entries 0 and 1 to the control values and make the rdlut's conditional so they are not executed, or something along those lines.

    I didn't really get much done today. I'll try again tomorrow.
  • Two general independent sprites will be great for GUI use for clicking and typing in text boxes with a cursor with the mouse still visible. I like it.

    Also I noticed you were using ALTSW instead of ALTS in your COG-A TMDS stuff, I have modified it in my most recent code post above to use ALTS. Not sure if you had wanted ALTSW for some other reason but I think we probably need to use ALTS there.
  • TonyB_ Posts: 2,163
    edited 2018-11-14 02:31
    I've been thinking about different sprite widths (and heights): 8x8 or 8x16, 16x16 or 16x32 and 32x32. Narrower sprites will need smaller read/write blocks which might hit the sweet spot for the inter-block gaps and allow more sprites per line.

    Another question is can we have a cursor sprite in the CGA-style text mode? A zero nibble would be transparent for the sprite but not for foreground and background. I've forgotten how much time we have to spare.

    P.S.
    Yes, ALTSW was an error.
  • rogloh Posts: 5,573
    edited 2018-11-14 03:28
    I found on the P1, at least for my sprite engine, that you can achieve some performance gain by checking whether the second half of the sprite is zero and skipping it if so. This would work for 4 pixels at a time if checking a long against zero. On the P2 I'd say it is not worth it, due to the extra checking overhead vs the very small incremental number of burst transfer clocks needed on P2 for the larger sprite (2 clocks for 4 extra pixels in/out). Smaller height doesn't save performance, just some hub RAM.

    Update: You could possibly encode the sprite size in the sprite list as some flags and then have different dedicated routines for each size. Again it is diminishing returns on P2 but it might hit the sweet spot for a particular size and let you have a lot of sprites (at that size). Once you mix sizes to take advantage of this timing gain you will find you sort of need to track a timing budget so you can stop doing more sprites before you are about to run out of time on the scanline, otherwise scanline corruption can occur and look ugly. This also gets tricky for the developer as the number of sprites per line becomes more variable. Easier with a known fixed limit.

    I believe we can easily do the text cursor in text mode using COG-B, without using all the sprite stuff. We just overwrite the written data in LUT-RAM with the cursor colours at the cursor position. Also with some double buffering in LUT RAM this work can potentially be done at the end of the line too. For 8 pixel characters, it is just a single long write in 16 colour mode at the cursor offset. Pretty simple. You can make it flash as well and define start/end scanlines for underscore/block type cursors.
  • Cluso99 Posts: 18,069
    edited 2018-11-15 14:20
    I am generating VGA at 1920x1080 :smiley:

    There are speed and memory issues, but the generation can be done using a 180MHz clock.

    see the VGA thread
  • cgracey Posts: 14,134
    I had to optimize the HDMI generator for faster timing, which added an extra clock cycle to its output. Two days spent down this rabbit hole, but I've got it thoroughly proven now.

    The only thing left on my plate is updating the streamer to accommodate the scope modes.