Depending on the loop unroll and burst write overhead, it's borderline whether 640x480x256 colours is achievable with 2 COGs, at least using my approach below. I calculate 41 clocks to convert each group of four pixels. This code needs to run 160 times, taking at least 26.24us at 250MHz, or 26.05us at 251.75MHz with the "proper" VGA timing. That does not include reads and writes back to HUB from LUT RAM, nor the per-line setup overheads, mouse sprite handling, etc. Here's the sample critical loop code, which can be unrolled to help amortize the burst write setup overhead. I think REP loops should be used, and once the 256-entry table gets copied out to COG RAM on the streamer COG side, it frees lots of LUT RAM to hold large buffers for bursting to hub more efficiently. E.g. I think you could accumulate 320 longs (128 pixels) at a time in the LUT RAM and then write to hub. You'll need 160 longs in the LUT for all the source pixels, which still leaves 32 for other signalling uses etc.
The key would be writing only 4 of the 5 longs to hub per pixel pair, to buy time. 1600 writes is 6.4us @250MHz, which we don't have remaining in our budget, but 1280 writes (all you need if the line buffer is already pre-populated with the static data) is only 5.12us (5.08us), which we do have left. The problem with these small writes from LUT RAM is they may not allow back-to-back bursts over the egg-beater high-speed bus to hub, which are essential for performance, and there won't be much time for anything else after reading in the 160 longs from hub, even if we get 100% bus usage in this time.
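A quick sanity check of the write-budget arithmetic above, assuming (as the figures imply) a hub burst can sustain one long per system clock in the best case:

```python
# Sanity check of the hub write budget, assuming 1 long per sysclock in a burst.
def burst_us(longs, sysclk_mhz):
    """Microseconds to write `longs` longs at 1 long per clock."""
    return longs / sysclk_mhz

full  = burst_us(1600, 250)     # all 5 TMDS longs per pixel pair (320 pairs)
part  = burst_us(1280, 250)     # only 4 of the 5 longs per pair
part2 = burst_us(1280, 251.75)  # same, at the "proper" VGA-derived clock

print(f"{full:.2f} {part:.2f} {part2:.2f}")  # 6.40 5.12 5.08
```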
Damn... seems we need to find another optimization here. Any way to use a counter to save on all those write pointer updates, or to leverage other things like the table's base being at offset 0, etc.?
' COGA and COGB below are run in parallel fully synchronized and take exactly the same time to execute which is very nice.
' Someone probably needs to check my code is right and I've not overlooked something major that adds cycles in the critical loop.
'COGA:
rdlut pixels, pixeladdr ' read in next 4 pixels from LUT RAM buffer
add pixeladdr,#1
getbyte x,pixels,#0 ' compute even[7:4]
altd x,#table
wrlut 0-0,wrlutaddr
add wrlutaddr,#2
getbyte x,pixels,#1 ' compute odd[5:2]
alts x,#table
movw patch,0-0,#1 ' MSW patched into fixed constant LSW
wrlut patch,wrlutaddr
add wrlutaddr,#3
getbyte x,pixels,#2 ' compute even[7:4]
altd x,#table
wrlut 0-0,wrlutaddr
add wrlutaddr,#2
getbyte x,pixels,#3 ' compute odd[5:2]
alts x,#table
movw patch,0-0,#1 ' MSW patched into fixed constant LSW
wrlut patch,wrlutaddr
add wrlutaddr,#3
' 20 instructions per four pixels, 41 clocks = 26.24us @ 250MHz for 640 pixels
'COGB:
rdlut pixels,pixeladdr ' read in next 4 pixels from LUT RAM buffer
add pixeladdr,#1
getbyte x,pixels,#0 ' compute even[9:8]
alts x,#table
movw patch, 0-0,#0
wrlut patch,wrlutaddr
add wrlutaddr,#2
getbyte x,pixels,#1 ' compute odd[9:6]
altd x,#table
wrlut 0-0,wrlutaddr
add wrlutaddr,#3
getbyte x,pixels,#2 ' compute even[9:8]
alts x,#table
movw patch, 0-0,#0
wrlut patch,wrlutaddr
add wrlutaddr,#2
getbyte x,pixels,#3 ' compute odd[9:6]
altd x,#table
wrlut 0-0,wrlutaddr
add wrlutaddr,#3
' 20 instructions per four pixels, 41 clocks = 26.24us @ 250MHz for 640 pixels
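The 41-clock figure in both loop comments can be checked from the instruction mix, assuming standard P2 timing of 2 clocks per ALU/WRLUT instruction and 3 clocks for RDLUT:

```python
# Cycle count for the 20-instruction, four-pixel loop body above.
# Assumes 2-clock instructions and a 3-clock RDLUT (standard P2 timing).
RDLUT_CYCLES = 3
OTHER_CYCLES = 2

instructions = 20                       # per four pixels; one is the RDLUT
cycles = RDLUT_CYCLES + (instructions - 1) * OTHER_CYCLES
line_cycles = cycles * (640 // 4)       # 160 iterations per active line

print(cycles, line_cycles / 250, line_cycles / 251.75)  # 41, 26.24us, ~26.05us
```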
There may still be some way to do this, but this particular mode is quite a bit more challenging. Still, it would be very cool if doable, even if only supporting a single mouse pointer sprite for GUI in our COG pair.
That mode would be resolution limited anyway. Falling back on tiles is an option well worth looking at.
Way more can be done with tiles, and colors than most people realize. The P1 tile driver Chip did has color indirection, and a very flexible HUB memory mapping. One can do windows, partial buffers of all kinds, and if one color is reserved, pointers and other goodies are simple bitmap ops, and some processing during blanking periods.
Say those tiles are 16 color, nibble per pixel. Reserve one or two, and given some per line, or per tile palette flexibility (the latter being much better if it can be done P1 style), and some amazing displays can be made.
Where those can be partially buffered, by using tiles where needed, stacking them in regions, etc... people won't even know pointers aren't hardware. Just do them during blanking.
@potatohead, I think the frame buffer might still be doable at 640x480x256 (no sprites, maybe a single mouse one). Tiles and sprites is really another type of driver altogether compared to plain bitmap, but perhaps different code for it could be read in dynamically at frame sync time.
Just realized if I unroll the two COGA/COGB loops a fair bit in my most recent code, it might be possible to just hard code the LUT write addresses to fixed addresses. This buys us a reduction of 4 instructions (8 clocks) per 4-pixel iteration, which is great, and saves a whopping 5 microseconds per scanline, not counting extra loop overhead. Yeah baby!
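The 5us claim checks out, assuming the four removed ADDs are 2 clocks each:

```python
# Saving from hard-coding the WRLUT addresses: the four ADD instructions
# (2 clocks each) drop out of every four-pixel group.
adds_removed = 4
clocks_saved = adds_removed * 2            # 8 clocks per 4-pixel group
per_line     = clocks_saved * (640 // 4)   # 160 groups per active line

print(per_line, per_line / 250)            # 1280 clocks = 5.12us at 250MHz
```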
You'd just have to delay the second COG (streamer COG) while the writes to hub take place. This could get tricky unless hub is already pre-synced before we begin the loop and the second COG delays the exact amount needed every time. That's really the only way it can work.
Hmm, a good 240p sprite driver on P1 needs 4 rendering cogs @80MHz.
P2 instructions take only two clocks and 250MHz is roughly 3*80, therefore P2 is roughly 6 times faster (more likely 8 times).
So even if you did a straight port of a P1 sprite driver (i.e. not making use of the fancier P2 instructions), you'd end up needing only one cog for 240p (need to scan out each line twice) and two for full 480p.
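The "roughly 6 times" ratio follows directly from instruction timing: P1 instructions take 4 clocks, P2 instructions take 2.

```python
# P1 -> P2 per-cog speedup used above: 4-clock instructions at 80 MHz
# versus 2-clock instructions at 250 MHz.
p1_mips = 80 / 4      # 20 MIPS per cog on P1
p2_mips = 250 / 2     # 125 MIPS per cog on P2

print(p2_mips / p1_mips)  # 6.25x, before counting the new P2 instructions
```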
I also think it is going to be possible to do an independent sprite driver that can work alongside a 640x480x256 bitbang HDMI driver if required. It could take at least one additional COG, however. It could render a single scanline in advance of it being processed by the HDMI COG pair, and this would save needing the full frame buffer memory - it's a perfect mode for games and scrolling etc. Not sure how many sprites per line it could do but probably quite a lot. In a limited sixteen colour mode, you could likely have some small number of sprites per line overlayed by the HDMI COG pair driver itself. An alternative scheme where the sprite and tile palette data is already mapped into the 10B encoded form may also be possible, though that will involve a lot of render COGs due to the overhead of writing all the data, and more memory for storing palette data (not really worth it IMO). For bitbang HDMI I think the sprite rendering COGs should probably just sit before the HDMI COG pair being talked about here, using as many COGs as needed for the sprite count - may not be that many given the P2's new instructions for dealing with pixel data.
Ideally the screen's mode is dynamic and just read in at the start of the frame, so the driver could be told to do either a 16 colour text mode, 16 or 256 colour bitmap graphics modes, or tiles & sprites mode, on any new frame. And I suspect 16 colour text and 16 colour graphics could get mixed into the same frame too if required, on row boundaries for some type of split screen use. That's always handy for console information with graphics shown as well.
Once we get revB silicon things change, and with any luck it may help reduce the number of COGs needed for sprite modes over HDMI a little. Analog VGA is a different beast and will allow many other resolutions beyond what the P2 will do with HDMI, which is going to be more restrictive and likely limited to 640x480 due to the P2's operational frequency range, unless someone overclocks to 400MHz perhaps, or runs at frame refresh rates lower than 60Hz. HDTVs may not like that, but general-purpose DVI monitors might be happy enough.
Not sure how many sprites per line it could do but probably quite a lot.
For 16x16x4, I think at least 16 or 20? For 16x16x256 (= no bit twiddling: read byte, write byte if_nz, loop) my educated guess is 64 (extrapolated from what JETEngine can do on P1, thus assuming 256x224 resolution).
If one actually uses 4 cogs for rendering, all sorts of fun could be had - affine transformations, copper-style effects, multiple playfields and such things.
I think that VGA, except for text-only drivers, will likely end up being mostly used at 640x480, so the same code/assets can be used for HDMI and NTSC (might need a deflicker filter).
I think there is enough time to do 640x480x256, if all WRLUT addresses are hard-coded. Four pixels need 16 instructions and 33 cycles, therefore a whole line could be encoded and written to hub RAM in ~7000 cycles, leaving ~1000 cycles for reading 160 longs of pixel data and adding a cursor or some sprites. I prefer to use cycles as this could apply to 720 pixel lines as well.
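Here's the arithmetic behind those figures, under the assumption of a nominal 25 MHz pixel clock at a 250 MHz sysclock (10 cycles per pixel, 800-pixel total line including blanking); the encode and write totals overlap in practice, so this is only a bound:

```python
# Cycle budget behind the "~7000 of ~8000 cycles per line" figures.
# Assumes 25 MHz pixel clock, 250 MHz sysclock, 800 total pixels per line.
line_budget = 800 * (250 // 25)     # 8000 sysclocks per scanline
encode      = (640 // 4) * 33       # 16 instrs / 33 cycles per 4 pixels
hub_writes  = (640 // 2) * 5        # 5 TMDS longs per pixel pair

print(line_budget, encode + hub_writes)  # 8000 vs 6880: ~1000+ cycles spare
```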
As both cogs' LUTs are set for sharing, simple software handshaking could keep the cogs in sync, e.g. cog B says "I've finished encoding this block", then cog A writes it to hub RAM and says "I've finished writing this block", rinse and repeat. The palette must be in cog RAM starting at address 0, with pixel and TMDS buffers in LUT RAM. Pixel value of zero could be transparent and would not be written to hub RAM when using WMLONG.
Was just thinking that we may be able to use the unencoded version of this to send data to a 24-bit color LCD...
So, there are LCD boards now with the TFP401 DVI decoder onboard, but that chip costs $10 and would use a lot of pins (although a lot less than 24).
But, it looks to me like we could use three 74VHC595 8-bit shift registers ($0.48 each) to send the color data to the LCDs using just 3 pins. We'd need one smartpin to clock the data in at 125 MHz (within rating) and a HDMI clock pin to latch bytes. Perhaps the other HDMI clock pin could go to the LCD.
This would feed pixels at 15.6 MHz, just a hair over the 9.2-15.0 MHz range of the one I have. So, maybe drop the P2 clock from 250 MHz to 200 MHz or so...
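The pixel rate falls straight out of the shift-register arrangement: three registers in parallel, one byte of R, G and B each, latched every 8 shift clocks.

```python
# Pixel rate through three parallel 8-bit shift registers: one 24-bit
# pixel is latched every 8 serial shift clocks.
shift_mhz = 125                   # serial clock from the P2 smartpin
pixel_mhz = shift_mhz / 8         # one pixel per 8 shifts

print(pixel_mhz)                  # 15.625 MHz, just over the 9.2-15.0 range

# Dropping the P2 from 250 to 200 MHz scales the shift clock the same way:
print(200 / 250 * shift_mhz / 8)  # 12.5 MHz, comfortably in range
```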
But, we still need a pin or two for sync... So, we're not really saving many pins, but are saving some $$
Of course, will be nice to just use 24 pins for color, since P2 has so many...
But, it looks to me like we could use three 74VHC595 8-bit shift registers ($0.48 each) to send the color data to the LCDs using just 3 pins. We'd need one smartpin to clock the data in at 125 MHz (within rating) and a HDMI clock pin to latch bytes. Perhaps the other HDMI clock pin could go to the LCD.
There are Asian LCD modules around now, with HC595, and CPLD shifters that spec 128MHz shift speeds into the LCD.
(search eBay for lcd raspberry pi 128m spi)
These have RaspPi SPI pinouts, which is one reason I suggested P2 boards include a Pi-header. Existing infrastructure is done.
One of those would be good to connect to P2, to confirm 128MHz SPI operation.
rogloh, I've modified your code a little - please check.
Note that ROLWORD rotates registers that should have a constant word, which has two implications for cog B: (1) it must have two patch registers for every loop unrolling and (2) these registers must be restored after the loop. The time required for (2) will always be less than the time it takes to write the TMDS longs to hub RAM, therefore cogs A and B should be line buffer and streamer, respectively.
Just looked at your code TonyB_ and now I see why ROLWORD is required. Seems I had a bug in my COG B code above: it could only work that way if there were two different tables for the odd/even components with the order reversed, which was not going to fit in COG RAM. Your use of the ROLWORD opcode solves that. I also really like how the patch words can (just) be restored during the TMDS data writes to hub by the other COG, and there is enough time to do it. It will take 4 clocks to update two patch registers, and when we write the data to hub in the other COG it already takes at least 5 clocks for the 5 longs we just generated - so it's a perfect opportunity to go fix the patch regs, and a great use of probably otherwise wasted time. Very nice!
Only thing we need to figure out is the number of unrolls to make it fit in the budget allotted. I am still not sure as to the whole wrlong burst setup latency and the hub windows on the P2. If the worst case hub write delay is assumed on every batch (which may not be realistic anyway if we can code to maximize hub window opportunity), are there still going to be enough cycles left to comfortably do a mouse sprite before the initial processing begins I wonder. If so I think this 640x480x256 mode is golden.
I guess we need to know for the burst transfer from LUT RAM after the setq2 (which takes 2 cycles), how many cycles the wrlong takes to complete the full "N" number of longs requested. N is going to be some multiple of 5 from the TMDS encoding process, and ideally I suspect also wants to be a multiple of 8 (or 16?) as well for better hub transfer efficiency, but I don't know this. 40 is an obvious number for N, but 40 means 8 unrolls x 20 instructions (including the two patches and their fixes) making 160 longs used up in the COG for containing it. This won't leave much space remaining after the 256 entry palette lookup table in COG RAM as well.
So perhaps N=20 is better for COG RAM use, but then what will be the number of clock cycles required for doing repeated 20-long burst transfers that are always separated by 16*4 instructions in the unrolled loop, plus one preceding SETQ2 instruction (i.e. 130 clocks), before the WRLONG instruction and the burst is triggered? That needs to be known to work out the time needed for the entire TMDS table computation and the write back of this data to hub, so we know how much time will be left for all the other work. The documentation says WRLONG takes 3..10 clocks, but I don't know if this is the initial latency to which you then add the number of transfers, or if that is something else just for single transfers, with different numbers applying for bursts. Gut feeling tells me these loops will begin to automatically self-align to the next hub window once they get going like the P1 does, but perhaps the new egg-beater may not work that way.
Update: If you add 10 clocks for WRLONG to 130 clocks before it and 20 clocks for the transfer, that happens to equal 160 clocks, so this is possibly the number we will see for each unrolled loop iteration. If so that would mean it takes 40 iterations x 160/250 = 25.6us at 250MHz, and that should be fine for the other work, including a decent mouse sprite overlay prior to beginning the translation lookups.
Say those tiles are 16 color, nibble per pixel. Reserve one or two, and given some per line, or per tile palette flexibility (the latter being much better if it can be done P1 style), and some amazing displays can be made.
Where those can be partially buffered, by using tiles where needed, stacking them in regions, etc... people won't even know pointers aren't hardware. Just do them during blanking.
Yeah, I like tile modes myself, given that bitmap is going to eat a lot of HUB RAM at 640x480 and above.
The 16 color limit isn't the end of the world, if you can change the palette to a custom one, or even give each tile's colors its own CLUT. That would allow for some really sharp looking graphics.
Either way whatever you guys come up with for HDMI, I'll be happy with.
Each loop generates 10 longs for two pairs of pixels, therefore cog B has plenty of time for patching. As the palette must be copied from LUT to cog RAM before the TMDS encoding, we have only 256 longs for instructions in cog RAM although we could use part of the LUT RAM for code. As timing is tight, I think the unrolling would have to be at least x4 and probably x5.
It would be good to know what is the best gap between two SETQ2+WRLONGs when starting at the same RAM slice. Is it an exact multiple of eight, or two fewer as suggested by the spreadsheet timings?
Yeah, this gap is what I don't understand, TonyB_. We are getting close to the edge of this not working if there are unknown extra cycles somewhere for burst transfers. In my numbers above, I had (again!) incorrectly assumed cycles = instructions*2 but missed accounting for RDLUT being 3 cycles, so my numbers above are incorrect. I had also assumed the WRLONG burst using a PTR register would have the PTR register updated by the number of transfers; again it seems that's not the case now. With this in mind, I think the unrolled loop we need for the line buffer COG in 256 colour mode is now going to be something like this:
Each iteration of the unrolled loop (COG-A form) assuming 4 unrolls = 16 pixels get processed each iteration
33 cycles (pixels 0..3)
33 cycles (pixels 4..7)
33 cycles (pixels 8..11)
33 cycles (pixels 12..15)
2 cycles for SETQ2 prior to burst 40 longs (we create 5 longs per pixel pair)
X cycles for WRLONG (X = 3..10 for single transfers, but for 40 longs would it then become 42..49 when extrapolating from the single transfer case?)
2 cycles for PTR update (which we will also need now in RevA anyway, probably RevB too I suspect unless that is a simple fix)
0 cycles for outer loop restarting assuming REP loop takes no extra cycles. If this is not possible some DJNZ overhead is needed.
Total = 4*33 + 4 + X
= 136 + X
With 4 unrolls and 16 pixels processed each iteration, we need to do this loop 40 times for the 640 pixels (or 45 for 720 pixels). This then equals (136 + X) * 40 cycles. With worst case X as 49 every time, unrealistic but at least conservative, we get a total of 7400 cycles, or 29.6us at 250MHz for 640x480. That still fits the budget for this resolution if X's upper range really is 49 cycles for 40-long transfers, but there's not as much time left on the scanline now for other work. Hopefully it would be sufficient for the other features like the mouse sprite.
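The breakdown above reduces to a short calculation; taking the worst-case X = 49 on every burst:

```python
# Conservative per-line total for the 4-unroll loop, assuming the
# worst-case X = 49 cycles for every 40-long burst write.
per_iter_cycles = 4 * 33 + 4      # four 4-pixel groups + SETQ2 + PTR add
iterations      = 640 // 16       # 16 pixels per unrolled iteration
worst_x         = 49

total = (per_iter_cycles + worst_x) * iterations
print(total, total / 250)         # 7400 cycles = 29.6us at 250 MHz
```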
Unrolling further can help but the main problem as I see it with 5 unrolls is that the hub timing is going to jump around each iteration because the starting slice does too and the execution timing will likely vary, meaning COG B will need to follow it precisely. [UPDATE - I now see you included a COGATN for that, interesting]. If the unroll count could get up to 8 this probably won't happen, because we will begin writing to the same slice address each iteration and the timing should then become far more consistent/easy to figure out I suspect. Eight unrolls is a real COG RAM hog though - with any luck it might still fit.
rogloh,
I agree with 136 cycles, which is exactly 17 rotations of the egg beater and leaves no setup time for WRLONG. As the first long is always written to the same slice with blocks of 40 longs, I think it's safe to assume there will be 18+5=23 rotations between the 40 block writes and the total time will be 7360 cycles max, leaving 640 free. Auto-incrementing RDLUT would save 320 cycles. We need over 160 cycles to write the pixel line to the LUTs. As you say, the timing for five unrolls and 50 long blocks is harder to calculate.
If SETQ2+WRLONG incremented PTRx correctly, the elimination of one instruction for each of the 40 blocks might mean there are 22 rotations between block writes and the overall cycle saving would be 312 or only two (after the last block).
Yeah timing is getting really borderline right now for 4 loop unrolls. Maybe 8 is still doable. Accurate HUB ram burst timing knowledge is very critical to all this, and I don't feel I have a full handle on that yet.
With 8 unrolls, COG RAM use in COG-B is high. Effectively 20 registers x 8, plus a few extras for the loop, and the 256-entry TMDS palette table, so let's say about 420 registers. This leaves only about 76 COG RAM registers for dealing with video mode changes during vertical blanking (i.e. reading in new COG code for the new video mode, or updating the palette for the 256 colour mode), and the background HDMI HUB streaming-to-pin function. The streaming code itself should not take a lot of space once everything is fully initialized and running, I would expect, and could hopefully just be a subroutine somewhere convenient in COG RAM.
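Roughing out that budget (the `extras = 20` figure below is just a guess to cover loop setup and handshaking; the 512-long cog RAM size and the 20x8 loop body come from the post above):

```python
# Rough cog RAM budget for COG-B with eight unrolls, per the estimate above.
COG_RAM   = 512          # longs of register space per cog
loop_code = 20 * 8       # 20 registers per unroll, 8 unrolls
palette   = 256          # TMDS palette table copied from LUT RAM
extras    = 20           # loop setup, handshaking etc. (assumed figure)

left = COG_RAM - loop_code - palette - extras
print(left)              # ~76 registers left over
```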
In COG-A, COG RAM use is also high, though not by quite so much, and this COG would need to do more functions, like dealing with blanking and the mouse sprite. I really wonder now if it could all fit in the limited remaining space.
I want to point out that a sprites & tiles mode with scrolling etc. can and should leverage a soft HDMI driver like we are discussing here. Nothing we have designed will preclude it as one of the supported modes, and it would reduce the memory footprint significantly and open up lots of new applications as well. Instead of a full 480-line frame buffer, you could run with just a few lines (one per sprite COG). Sprites and tiles would be computed in external COGs and just feed into the scanline as a byte per pixel, getting converted and then output by the HDMI pair. I think any sprite mode would very much benefit from a 256 colour palette from a range of 4096 colours. Sprites and tiles being restricted to 16 colours per screen is pretty limiting, as I found when I wrote my old sprite driver, but being able to display 256 colours on the same screen at once is really nice, so it is important we try to support 256 colour mode with soft HDMI.
Also note, doing it this way by using a common scanline buffer prior to HDMI encoding will allow multiple output displays simultaneously too. E.g. displaying analog VGA and HDMI at the same time from the same source. This will be of benefit for devices with multiple output port types.
We sort of now need to figure out some fast mouse sprite code, to gauge how many additional cycles it would take to run and update the LUT RAM before the main TMDS table conversion stage starts. I think I mentioned a 32x32 pixel mouse image. Is it normally that big, or more like 16x16, I wonder? For now let's assume 32 pixels for maximum size/flexibility. We need a fast way to: calculate and read in the correct mouse cursor image (a byte per pixel for a full colour mouse, or one long for monochrome) and its mask for the current scanline into COG RAM; read in the original pixel data corresponding to the mouse's current X position and extent into COG RAM; and patch the combined image/mouse pixels by masking, then writing this data into the LUT RAM at the correct long addresses for the X co-ordinate of the mouse on the scanline, while dealing appropriately with edge conditions and the mouse hot spot within the image. Optimizations are strongly desired. I'll have a look, but others are welcome to come up with tight code for it...
4 longs would hold one line of a 32 x 32 pixel 4bpp mouse pointer image in un-encoded form, or 8 longs per line for 8bpp. I'd suggest that conditional load logic would take more space and mess with timing, so perhaps consider a single load operation, rather than a per-block load.
Another long could contain the current X,Y position of the mouse pointer, updated only once per frame.
If the mouse pointer image is patched into the line buffer prior to TMDS conversion, rather than the frame buffer in HUB RAM, there's no need to restore the background; it happens for free. With one of the mouse pointer colours representing transparent, there should be no need for extra operations for masking, simply muxing.
It seems to me the process would be to load the block for processing, check whether mouse pointer overlay is necessary, mux in the mouse pointer where relevant and then convert the block.
On the next frame, if the mouse pointer has moved it gets patched into the new location.
I'm not sure how edge conditions and hot spot need to be addressed by this code; I'd think it would be handled by the mouse driver and cursor logic code.
Hi AJL,
the mouse cursor near the screen edges does affect the number of mouse pixels drawn near the edges of the scanline, so the line buffer COG needs to cope with these special cases in order to not overflow its internal RAM buffers by writing in the wrong place.
One simple way to deal with that (at the expense of some additional unused LUT RAM) is to have a little extra space already reserved at the scanline edges internally, and just allow some of the mouse image pixels to be written outside of the 640 active pixel portion of the line buffer in the LUT RAM (yet still within some enforced min/max clipping limits). We just don't process these extra pixels when we translate and write them back to hub. That keeps the mouse sprite code simple/fast/consistent and limits testing for the two corner cases before we go do the work.
The key work for 256 colour mouse images is extracting and constructing the mouse image byte by byte from its original long storage format, given that each long holds 4 8bpp pixels and that we can only ever write 32-bit data into the LUT RAM. We have plenty of existing and fancy new instructions handy for this type of work though. A monochrome mouse variant is a little different: it needs to assemble bytes of fg/bg colour into LUT RAM based on the mouse image and its mask, and apply these over the existing longs in LUT RAM at the correct offset addresses in the pixel buffer. I'm still thinking about some baseline implementation that can be used to determine an initial cycle count estimate and then get optimized later, or reimplemented more efficiently with fewer cycles. Be nice to keep all this mouse stuff within something like 50-100 instructions or less, as we may not have more than this remaining in the budget in 256 colour mode if we want to leave some extra cycles for other per-line overhead etc.
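To pin down the logic before writing tight PASM2, here's a high-level Python model of the per-scanline overlay being described: patch one row of a 32-pixel-wide 8bpp cursor into the line buffer, treating colour 0 as transparent (per AJL's muxing suggestion) and clipping at both screen edges. The function name and the list-based buffer are illustrative only; the real code would work on packed longs in LUT RAM.

```python
# Model of the per-scanline mouse overlay: colour 0 = transparent,
# hot spot offset applied, writes clipped to the visible line.
WIDTH = 640

def overlay_cursor_row(line, cursor_row, x, hot_x=0):
    """line: WIDTH pixel bytes; cursor_row: 32 cursor bytes for this scanline."""
    x -= hot_x                            # shift so hot spot lands at x
    for i, px in enumerate(cursor_row):
        dst = x + i
        if px != 0 and 0 <= dst < WIDTH:  # skip transparent, clip at edges
            line[dst] = px
    return line

line = [7] * WIDTH                    # background colour 7
cur  = [0] * 30 + [1, 2]              # mostly transparent cursor row
overlay_cursor_row(line, cur, x=-30)  # cursor mostly off the left edge
print(line[0:4])                      # [1, 2, 7, 7]
```

The hardware version would avoid the per-pixel branch by reserving slack space at the line edges, as described above, so the inner loop never needs the bounds test.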
Yeah timing is getting really borderline right now for 4 loop unrolls. Maybe 8 is still doable. Accurate HUB ram burst timing knowledge is very critical to all this, and I don't feel I have a full handle on that yet.
I think COGATN should go, if possible.
What we really need is for cogs A and B to be in sync with each other and with the eggbeater. The hub writes by cog A would then be deterministic, and cog B would know in advance how long they take and do a fixed wait to stay in sync with cog A.
What we really need is for cogs A and B to be in sync with each other and with the eggbeater. The hub writes by cog A would then be deterministic, and cog B would know in advance how long they take and do a fixed wait to stay in sync with cog A.
Fully agree. That is why 4 or 8 unrolls is desirable, as the hub slice doesn't change for each loop iteration. Some initial hub access is needed to sync COG-A to the hub window (probably as part of the mouse stuff or some other per-scanline housekeeping work) and a way to signal the start of processing to COG-B (maybe just once with a COGATN), then they should remain in lockstep until the end of the TMDS table processing. This would be the way to go.
I suspect getting rid of COGATN would not make any difference with four unrolls. The gap between WRLONG blocks would be 136 cycles or exactly 17 eggbeater revolutions, which would be 18 in practice allowing for setup time. I think the ideal gap is n+½ revs, to guarantee n+1 in practice. Removing COGATN changes the gap for eight unrolls from 270 cycles (33¾ revs) to 268 cycles (33½ revs). n+¾ might result in n+1, but it's marginal.
Total cycles for TMDS encoding would be 7360 for four unrolls and 7040 for eight. Putting COGATN back might increase the latter to 7200.
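Those totals can be cross-checked with a quick model of the rounding rule above. This is a sketch under my assumptions (8-clock eggbeater revolutions, 33 cycles per 4 pixels, SETQ2 plus a pointer update between blocks, and a gap that always rounds up to the next whole revolution), not verified on silicon:

```python
# Hypothetical cadence model for the figures above: the gap between
# WRLONG blocks rounds up to the next whole eggbeater revolution, and a
# burst of L longs then costs L/8 further revolutions (the first long
# always hits the same hub slice).
CLOCKS_PER_REV = 8                       # 8 cogs, one window each per rev

def tmds_line_cycles(unrolls, pixels=640):
    iters = pixels // (unrolls * 4)      # 4 pixels encoded per unroll
    longs = unrolls * 10                 # 5 longs per pixel pair
    gap = unrolls * 33 + 4               # encode + SETQ2 + pointer update
    revs = gap // CLOCKS_PER_REV + 1 + longs // CLOCKS_PER_REV
    return iters * revs * CLOCKS_PER_REV

print(tmds_line_cycles(4))               # four unrolls -> 7360
print(tmds_line_cycles(8))               # eight unrolls -> 7040
```

With the "exact multiple still costs one extra rev" rule, four unrolls give 23 revolutions per block and eight give 44, matching the 7360 and 7040 cycle figures above.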
That mode would be resolution limited anyway. Falling back on tiles is definitely an option worth looking at.
Way more can be done with tiles, and colors than most people realize. The P1 tile driver Chip did has color indirection, and a very flexible HUB memory mapping. One can do windows, partial buffers of all kinds, and if one color is reserved, pointers and other goodies are simple bitmap ops, and some processing during blanking periods.
Say those tiles are 16 color, nibble per pixel. Reserve one or two, and given some per line or per tile palette flexibility (the latter being much better if it can be done P1 style), some amazing displays can be made.
Where those can be partially buffered, by using tiles where needed, stacking them in regions, etc... people won't even know pointers aren't hardware. Just do them during blanking.
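As a sketch of that per-tile palette idea (plain Python, not P2 code; all names and sizes here are illustrative): each tile row packs nibble pixels into a long and indexes its own 16-entry CLUT, so the 16-colour-per-tile limit still allows far more colours on one scanline.

```python
# Illustrative sketch (not P2 code) of per-tile palettes: 8-pixel tile
# rows packed 4bpp into one long each, with each tile choosing its own
# 16-entry CLUT. Function and variable names here are hypothetical.
def render_tile_row(tile_longs, palettes, palette_ids):
    """tile_longs: one packed long per tile (8 nibble pixels, LSB first);
    palettes: list of 16-entry colour tables; palette_ids: per-tile index."""
    line = []
    for t, packed in enumerate(tile_longs):
        clut = palettes[palette_ids[t]]
        for px in range(8):
            line.append(clut[(packed >> (px * 4)) & 0xF])
    return line

# same pixel data, two palettes -> different colours per tile
pal_a = list(range(0x00, 0x10))
pal_b = list(range(0x10, 0x20))
row = render_tile_row([0x76543210, 0x76543210], [pal_a, pal_b], [0, 1])
```

The two tiles above hold identical nibble data yet come out in sixteen different colours, which is the whole appeal of per-tile indirection.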
Just realized if I unroll the two COGA/COGB loops a fair bit in my most recent code it might be possible to just hard code the LUT write addresses to fixed addresses. This buys us a reduction of 4 instructions (8 clocks) per 4 pixel iteration, which is great and saves a whopping 5 microseconds per scanline, not counting extra loop overhead. Yeah baby!
You'd just have to delay the second COG (streamer COG) while the writes to hub take place. This could get tricky unless hub is already pre-synced before we begin the loop and the second COG delays the exact amount needed every time. That's really the only way it can work.
P2 instructions take only two clocks and 250MHz is roughly 3*80MHz. Therefore, P2 is roughly 6 times faster (more likely 8 times).
So even if you did a straight port of a P1 sprite driver (i.e. not making use of the fancier P2 instructions), you'd end up needing only one cog for 240p (needs to scan out each line twice) and two for full 480p.
Ideally the screen's mode is dynamic and just read in at the start of the frame, so the driver could be told to do either a 16 colour text mode, 16 or 256 colour bitmap graphics modes, or tiles & sprites mode, on any new frame. And I suspect 16 colour text and 16 colour graphics could get mixed into the same frame too if required, on row boundaries for some type of split screen use. That's always handy for console information with graphics shown as well.
Once we get revB silicon things change and it may help reduce the number of COGs needed for sprite modes over HDMI a little with any luck. Analog VGA is a different beast and it will allow many other resolutions compared to what the P2 will do with HDMI, which is going to be more restrictive and likely limited to 640x480 due to the P2's operational frequency range, unless someone overclocks to 400MHz perhaps, or runs at frame refresh rates lower than 60Hz. HDTVs may not like that, but general purpose DVI monitors might be happy enough.
If one actually uses 4 cogs for rendering, all sorts of fun could be had - affine transformations, copper-style effects, multiple playfields and such things.
I think that VGA, except for text-only drivers, will likely end up being mostly used at 640x480, so the same code/assets can be used for HDMI and NTSC (might need a deflicker filter).
I think there is enough time to do 640x480x256, if all WRLUT addresses are hard-coded. Four pixels need 16 instructions and 33 cycles, therefore a whole line could be encoded and written to hub RAM in ~7000 cycles, leaving ~1000 cycles for reading 160 longs of pixel data and adding a cursor or some sprites. I prefer to use cycles as this could apply to 720 pixel lines as well.
As both cogs' LUTs are set for sharing, simple software handshaking could keep the cogs in sync, e.g. cog B says "I've finished encoding this block" then cog A writes it to hub RAM and says "I've finished writing this block", rinse and repeat. The palette must be in cog RAM starting at address 0, with pixel and TMDS buffers in LUT RAM. Pixel value of zero could be transparent and would not be written to hub RAM when using WMLONG.
EDIT:
Mistakes corrected.
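For reference, here is the 8b/10b algorithm (per the DVI 1.0 spec) that each entry of the 256-entry table would precompute, sketched in Python rather than P2 code; the running disparity `cnt` is why a single pixel byte can map to more than one 10-bit symbol:

```python
# Sketch of DVI TMDS 8b/10b encoding (the DVI 1.0 algorithm), i.e. what
# the 256-entry lookup table would hold for each disparity state.
def tmds_encode(d, cnt):
    """Encode pixel byte d with running disparity cnt.
    Returns (10-bit symbol, new cnt)."""
    n1 = bin(d).count("1")
    use_xnor = n1 > 4 or (n1 == 4 and (d & 1) == 0)
    # stage 1: transition-minimised intermediate value q_m
    q = prev = d & 1
    for i in range(1, 8):
        prev = (prev ^ ((d >> i) & 1)) ^ (1 if use_xnor else 0)
        q |= prev << i
    q8 = 0 if use_xnor else 1            # q_m[8] flags XOR vs XNOR
    n1q = bin(q).count("1")
    n0q = 8 - n1q
    # stage 2: DC balance by optionally inverting the low 8 bits
    if cnt == 0 or n1q == n0q:
        invert = (q8 == 0)
        cnt += (n0q - n1q) if invert else (n1q - n0q)
    elif (cnt > 0 and n1q > n0q) or (cnt < 0 and n0q > n1q):
        invert = True
        cnt += 2 * q8 + (n0q - n1q)
    else:
        invert = False
        cnt += (n1q - n0q) - 2 * (1 - q8)
    low = (q ^ 0xFF) if invert else q
    return ((1 if invert else 0) << 9) | (q8 << 8) | low, cnt

def tmds_decode(sym):
    """Inverse mapping, handy for verifying generated table contents."""
    q = sym & 0xFF
    if sym & 0x200:                      # bit 9 set -> low bits inverted
        q ^= 0xFF
    use_xor = (sym >> 8) & 1
    d = q & 1
    for i in range(1, 8):
        b = ((q >> i) ^ (q >> (i - 1))) & 1
        d |= (b if use_xor else 1 - b) << i
    return d
```

Round-tripping all 256 byte values through encode/decode is an easy sanity check on a generated table.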
So, there are LCD boards now with TFP401 DVI decoder onboard, but that chip costs $10 and would use a lot of pins (although a lot less than 24).
But, it looks to me like we could use three 74VHC595 8-bit shift registers ($0.48 each) to send the color data to the LCDs using just 3 pins. We'd need one smartpin to clock the data in at 125 MHz (within rating) and a HDMI clock pin to latch bytes. Perhaps the other HDMI clock pin could go to the LCD.
This would feed pixels at 15.6 MHz, just a hair over the 9.2-15.0 MHz range of the one I have. So, maybe drop the P2 clock from 250 MHz to 200 MHz or so...
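The rates work out as follows (a rough sketch; the sysclk/2 shift clock is my assumption):

```python
# Rough rate check for the '595 scheme. Assumption: the smartpin shifts
# serial data at sysclk/2, each latched byte is 8 shifted bits, and
# three '595s cover R, G and B in parallel.
def pixel_rate_mhz(sysclk_mhz, bits_per_latch=8, clocks_per_bit=2):
    return sysclk_mhz / (clocks_per_bit * bits_per_latch)

print(pixel_rate_mhz(250))   # 15.625 MHz, just over the panel's 15.0 MHz max
print(pixel_rate_mhz(200))   # 12.5 MHz, inside the 9.2-15.0 MHz window
```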
But, we still need a pin or two for sync... So, we're not really saving many pins, but are saving some $$
Of course, will be nice to just use 24 pins for color, since P2 has so many...
Note that ROLWORD rotates registers that should have a constant word, which has two implications for cog B: (1) it must have two patch registers for every loop unrolling and (2) these registers must be restored after the loop. The time required for (2) will always be less than the time it takes to write the TMDS longs to hub RAM, therefore cogs A and B should be line buffer and streamer, respectively.
There are Asian LCD modules around now, with HC595, and CPLD shifters that spec 128MHz shift speeds into the LCD.
(search eBay for lcd raspberry pi 128m spi)
These have RaspPi SPI pinouts, which is one reason I suggested P2 boards include a Pi-header. Existing infrastructure is done.
One of those would be good to connect to P2, to confirm 128MHz SPI operation.
I think that's the main market focus, as that fits neatly over a Pi. There are also 4" ones, and above 4" the trend seems to be for HDMI for the Pi.
Just looked at your code TonyB_ and now I see why ROLWORD is required. Seems I had a bug in my COG B code above and it could only work that way if there were two different tables for odd/even components with the order reversed, which was not going to fit in COG RAM. Your use of the ROLWORD opcode solves that. I also really like how the patch words can (just) be restored during the TMDS data writes to hub by the other COG, and there is enough time to do it. It will take 4 clocks to update two patch registers, and when we write the data to hub in the other COG it already takes at least 5 clocks for the 5 longs we just generated - so it's a perfect opportunity to go fix the patch regs and a great use of probably otherwise wasted time. Very nice!
Only thing we need to figure out is the number of unrolls to make it fit in the budget allotted. I am still not sure as to the whole wrlong burst setup latency and the hub windows on the P2. If the worst case hub write delay is assumed on every batch (which may not be realistic anyway if we can code to maximize hub window opportunity), are there still going to be enough cycles left to comfortably do a mouse sprite before the initial processing begins I wonder. If so I think this 640x480x256 mode is golden.
So perhaps N=20 is better for COG RAM use, but then what will be the number of clock cycles required for doing repeated 20 LONG burst transfers that are always separated by 16*4 instructions in the unrolled loop plus one preceding SETQ2 instruction (i.e. 130 clocks) before the WRLONG instruction triggers the burst? That needs to be known to work out the time needed for the entire TMDS table computation and the write back of this data to hub, so we know how much time will be left for all the other work. The documentation says WRLONG takes 3..10 clocks but I don't know if this is the initial latency, to which you then add the number of transfers, or if that applies to single transfers only and different numbers apply for bursts. Gut feeling tells me these loops will begin to automatically self-align to the next hub window once they get going like the P1 does, but perhaps the new egg-beater may not work that way.
Update: If you add 10 clocks for WRLONG to 130 clocks before it and 20 clocks for the transfer, that happens to equal 160 clocks, so this is possibly the number we will see for each unrolled loop iteration. If so that would mean it takes 40 iterations x 160/250 = 25.6us at 250MHz, and that should be fine for the other work, including a decent mouse sprite overlay prior to beginning the translation lookups.
Yeah I like tile modes myself, given that a bitmap is going to eat a lot of HUB RAM at 640x480 and above.
16 color limit isn't the end of the world. If you can set a custom palette, or even give each tile its own CLUT, that would allow for some really sharp looking graphics.
Either way whatever you guys come up with for HDMI, I'll be happy with.
Each loop generates 10 longs for two pairs of pixels, therefore cog B has plenty of time for patching. As the palette must be copied from LUT to cog RAM before the TMDS encoding, we have only 256 longs for instructions in cog RAM although we could use part of the LUT RAM for code. As timing is tight, I think the unrolling would have to be at least x4 and probably x5.
It would be good to know what is the best gap between two SETQ2+WRLONGs when starting at the same RAM slice. Is it an exact multiple of eight, or two fewer as suggested by the spreadsheet timings?
There are six cycles outside of the TMDS block between end of previous wrlong and start of next. Missing the hub RAM slot could have a big effect.
Each iteration of the unrolled loop (COG-A form) assuming 4 unrolls = 16 pixels get processed each iteration
33 cycles (pixels 0..3)
33 cycles (pixels 4..7)
33 cycles (pixels 8..11)
33 cycles (pixels 12..15)
2 cycles for SETQ2 prior to burst 40 longs (we create 5 longs per pixel pair)
X cycles for WRLONG (X = 3..10 for single transfers, but for 40 longs would it then become 42..49 when extrapolating from the single transfer case?)
2 cycles for PTR update (which we will also need now in RevA anyway, probably RevB too I suspect unless that is a simple fix)
0 cycles for outer loop restarting assuming REP loop takes no extra cycles. If this is not possible some DJNZ overhead is needed.
Total = 4*33 + 4 + X
= 136 + X
With 4 unrolls and 16 pixels processed each iteration we need to do this loop 40 times for the 640 pixels (or 45 for 720 pixels). This then equals (136 + X) * 40 cycles. With worst case X as 49 every time, unrealistic but at least conservative, we get a total of 7400 cycles or 29.6us at 250MHz for 640x480. Still fits the budget ok for this resolution if the X value's upper range really is 49 cycles for 40 long transfers, but not as much time left on the scanline now for other work. Hopefully it would be sufficient for the other features like the mouse sprite.
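As a quick cross-check of what's left over, assuming sysclk is exactly 10x a 25 MHz pixel clock so an 800-pixel line is 8000 cycles:

```python
# Budget check for the 4-unroll case: (136 + X) cycles per iteration,
# 40 iterations per line, against an assumed 8000-cycle scanline
# (sysclk = 10x a 25 MHz pixel clock, 800 total pixel clocks per line).
LINE_CYCLES = 800 * 10

def leftover_cycles(x_wrlong, iters=40, fixed=136):
    return LINE_CYCLES - (fixed + x_wrlong) * iters

print(leftover_cycles(49))   # worst-case X for a 40-long burst -> 600
print(leftover_cycles(42))   # best-case X -> 880
```

Even the conservative X=49 case leaves a few hundred cycles for per-line housekeeping, which is why the mouse sprite budget is the real question.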
Unrolling further can help but the main problem as I see it with 5 unrolls is that the hub timing is going to jump around each iteration because the starting slice does too and the execution timing will likely vary, meaning COG B will need to follow it precisely. [UPDATE - I now see you included a COGATN for that, interesting]. If the unroll count could get up to 8 this probably won't happen, because we will begin writing to the same slice address each iteration and the timing should then become far more consistent/easy to figure out I suspect. Eight unrolls is a real COG RAM hog though - with any luck it might still fit.
rogloh,
I agree with 136 cycles, which is exactly 17 rotations of the egg beater and leaves no setup time for WRLONG. As the first long is always written to the same slice with blocks of 40 longs, I think it's safe to assume there will be 18+5=23 rotations between the 40 block writes and the total time will be 7360 cycles max, leaving 640 free. Auto-incrementing RDLUT would save 320 cycles. We need over 160 cycles to write the pixel line to the LUTs. As you say, the timing for five unrolls and 50 long blocks is harder to calculate.
With 8 unrolls, COG RAM use in COG-B is high. Effectively 20 registers x 8, plus a few extras for the loop and the 256 entry TMDS palette table, so let's say about 420 registers. This leaves only about 76 COG RAM registers for dealing with video mode changes during vertical blanking (i.e. reading in new COG code for the new video mode, or to update the palette for the 256 colour mode), and the background HDMI HUB streaming to pin function. The streaming code itself should not take a lot of space once everything is fully initialized and running I would expect, and could hopefully just be a subroutine somewhere convenient in COG RAM memory.
In COG-A, COG RAM use is also high, though not by quite so much, but this COG would need to do more functions like deal with blanking and the mouse sprite. I really do wonder now if it could fit there with the limited remaining space.
I want to point out that a sprites & tiles mode with scrolling etc can and should leverage a soft HDMI driver like we are discussing here. Nothing we have designed will preclude it as one of the supported modes, and it would reduce the memory footprint significantly and open up lots of new applications as well. Instead of a full 480 line frame buffer, you could run with just a few lines (one per sprite COG). Sprites and tiles would be computed in external COGs and just feed into the scanline as a byte per pixel and get converted and then output by the HDMI pair. I think any sprite mode would very much benefit from a 256 colour palette from a range of 4096 colours. Sprites and tiles being restricted to 16 colours per screen is pretty limiting as I found when I wrote my old sprite driver, but being able to display 256 colours on the same screen at once is really nice, so it is important we try to support 256 colour mode with soft HDMI.
Also note, doing it this way by using a common scanline buffer prior to HDMI encoding will allow multiple output displays simultaneously too. Eg. displaying analog VGA and HDMI at the same time from the same source. This will be of benefit for devices with multiple output port types.
4 longs per line would hold a 32 x 32 pixel 4bpp mouse pointer image in un-encoded form, or 8 longs per line for 8bpp. I'd suggest that conditional load logic would take more space and mess with timing, so perhaps consider a single load operation, rather than a per block load.
Another long could contain the current X,Y position of the mouse pointer, updated only once per frame.
If the mouse pointer image is patched into the line buffer prior to TMDS conversion, rather than the frame buffer in HUB RAM, there's no need to restore the background; it happens for free. With one of the mouse pointer colours representing transparent, there should be no need for extra operations for masking, simply muxing.
It seems to me the process would be to load the block for processing, check whether mouse pointer overlay is necessary, mux in the mouse pointer where relevant and then convert the block.
On the next frame, if the mouse pointer has moved it gets patched into the new location.
I'm not sure how edge conditions and hot spot need to be addressed by this code; I'd think it would be handled by the mouse driver and cursor logic code.
The mouse cursor near screen edges does affect the number of mouse pixels drawn near the edges of the scanline, so the line buffer COG needs to cope with these special cases in order to not overflow its internal RAM buffers by writing at the wrong place.
One simple way to deal with that (at the expense of some additional unused LUT RAM) is to have a bit more extra space already reserved at these scan line edges internally and just allow some of the mouse image pixels to be written outside of the 640 active pixel portion of the line buffer in the LUT RAM (yet still within some enforced min/max clipping limits). We just don't process these extra pixels when we translate and write them back to hub. That keeps the mouse sprite code simple/fast/consistent and limits testing for the two corner cases before we go do the work.
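A minimal sketch of that guard-band scheme (illustrative names and sizes; colour 0 assumed transparent as suggested earlier):

```python
# Illustrative sketch of the guard-band clipping: the LUT line buffer
# reserves GUARD spare pixels either side of the 640 active ones, so the
# mouse blit can run at full width near the edges with no per-pixel
# clipping tests; only the active window is translated back to hub.
GUARD = 32                    # >= mouse width; spare LUT RAM per side
ACTIVE = 640

def blit_mouse(line, mouse_row, mouse_x):
    """mouse_x may be partly off-screen; colour 0 is transparent."""
    base = GUARD + mouse_x    # guard band absorbs a negative x safely
    for i, px in enumerate(mouse_row):
        if px:                # transparency is a simple mux, no masking
            line[base + i] = px
    return line[GUARD:GUARD + ACTIVE]   # only this part goes to hub

line = [0] * (ACTIVE + 2 * GUARD)
visible = blit_mouse(line, [7, 7, 7, 7], -2)   # half off the left edge
```

The off-screen half of the cursor lands harmlessly in the guard band, so the inner loop stays branch-free.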