HDMI discussion
This discussion was created from comments split from: Console Emulation.
Replying in your thread to avoid over-polluting MXXs thread...
It might help, but you would need sufficient clocks for all the text work. My old comments in my code mention it can render 80 characters of 8-pixel-wide text with independent FG/BG colours & optional flashing in 2760 P2 clocks, so that's about 4.3 P2 clocks per pixel. With 10 clocks per pixel in HDMI mode there should be time to render text and do quite a bit of other work. The bigger issue perhaps is that that solution uses LUT RAM to temporarily buffer the scan line output, plus another 64 COG longs for a per-scanline font. Another solution involving HUB RAM instead might reduce that usage but require more access cycles, which also have to compete with the streamer access.
Are you considering putting text into your own NeoVGA driver, or do you think I should try to add HDMI to mine?
I was comparing what our different video drivers offer recently and there is quite a lot of overlap now, but differences still exist. My driver is more general purpose overall, while yours is currently still primarily aimed at emulators but could be expanded over time with more capabilities.
p2videodrv:
NeoVGA:
You don't need to buffer the rendered line, you just feed one character sliver at a time into the streamer. It's essentially the same as pixel doubling. The font line buffer is needed though, RDBYTE per character is too much. Problem is that you can't byte-index LUT. (Though do we really need 256 characters? 128 is probably fine for basic debug/system output and if you want actually nice text you'd render it yourself (with proportional fonts, unicode, whatever) - ASCII only has 95 printable characters, so there's some space to squeeze umlauts).
Uhhh good question. In general more independent-ish HDMI implementations would be the good and right thing.
The big difference IMO is that the ***VGA driver (both the horrid old version and the new clean one) is really designed for hacking in whatever feature-of-the-week is needed. It's really currently not something that can just be dropped into a new project. Though the new version tries to be better about it...
PSRAM mailbox action and other color modes are trivial to add in, of course (I did it before, though I think that never got released). The new version reserves the first half of LUT so 8bit mode can be added easily. It's just that I've found that having some other cog render stuff out to RGB24 line buffers has been the correct™ way of doing things in most of the projects I do (Tempest 2000 uses framebuffer rendering, but multiple buffers need to be composited and the 16 bit CRY color format is of course not native to the P2 - so mikoobj.spin2 is what ends up doing the PSRAM request list, compositing all the CRY buffers together and converting the result to RGB24 - oops, same video driver model again).
Also, big feature: the driver loads itself fully into cog/lut, so you can nuke the hub data after it gets going. This would be really pertinent if I could add overlays to flexspin. NeoYume's menu-overwriting trick is a bit brute-force...
That's what my driver was intended to do too. It loads everything into the COG exec space (plus LUT RAM), and the total ~4kB HUB memory footprint can then be reused as a dual scan line buffer area, making HUB usage very low overall. Also, with no HUB exec required, any corruption of HUB RAM after the video driver COG is spawned can't really crash the driver, which is useful if you are using it to debug the problem - although it can obviously still corrupt the screen output itself.
One other thing is I do wonder how high a resolution your driver can be operated at? I know mine can reach 1920x1200 in VGA mode, not sure about 2048x1536 - I think that exceeds my buffering capacity with the mouse sprite stealing 16 longs in LUT leaving only 240 for the temporary line buffer for doubling and 256 for the palette RAM. What would limit the resolution in your driver? Is it just the P2 clock speed for outputting doubled pixels or other things like streamer ops?
The old driver could do 5x scaling from 384x240 to 1920x1200. No reason it couldn't do this with 1x native data. Haven't readded such hi-res modes with the new one, but I guess it should work. Remember that software pixel doubling is only needed for DVI/HDMI (where there is always time for it because of the fixed clock ratio), for analog I can just set a slower clock and have the normal streamer DMA do the work.
Aha, yeah pixel doubling via HW streamer DMA is nice. I guess I expected I couldn't do that due to the fact that different regions could exist with/without doubling present, unless I somehow tweak streamer clocks on the fly per region - hmm maybe I could use setq to change the streamer rate per region if it were cleaned up a bit.
Doing HW pixel doubling and/or the scanfunc trick could open up lots more LUT longs in my driver and free space needed for things like HDMI and frequent audio polling. I should think about this if I want to restructure it.
This is all very interesting as I can see you've probably learned a few things from what I did in my driver code (at least what parts to avoid and how to improve) and I can learn some from your approach too.
Something that takes the best parts of both our drivers could be pretty nice - probably closer to your overlay framework. I do sort of like the simplicity of basic frame buffers for graphics work, and single COG operation with some built-in text support is nice too, with full per-character bg/fg colour like real VGA - otherwise you need to burn more COGs. Multi-region is probably not as frequently needed, but it can really help if you want to have pull-down screens with different COG applications running in each, not needing to really share a single screen. Like a debug console that can be scrolled/hidden on demand, running alongside some sprite based game or another GUI on-screen with a mouse cursor, which don't need to know about this console and can simply continue writing to their own buffers independently.
What would the DVI encoder do if operated with a pixel clock other than 1/10 sysclock? A couple possibilities.
I don't think it works correctly with anything but a 10:1 ratio (as you need 10 serial clocks to stream out all the serial data with the TMDS/TERC4 encoded data). If you ran at 20:1 ratio it might do something interesting with every second pixel, but with 25MHz minimum DVI pixel clock you'd technically need a 500MHz P2 to do this. I guess it should be tried out (didn't we try this once already - can't recall, but fairly confident it didn't double for us otherwise I'd be making use of it).
Was thinking about this... I currently use altgb to just read the byte array of font data from COGRAM and this pulls out the correct byte from the long for you automatically in the following getbyte instruction. But it could be read from LUT with this, assuming this snippet executes from COGRAM (and is easily patchable) and flags can be used. This only adds 3 extra instructions per font byte (7 clocks with the RDLUT) vs the existing altgb method. These 7 extra clocks do thankfully get amortized over all 8 pixels so it's not too much slower in the 80 clock budget total for DVI/HDMI.
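Roughly along these lines (a sketch only - register names are placeholders and it's not the exact 3-instruction sequence counted above; the flags here just pick which byte of the rdlut'd long is wanted):

        ' sketch: fetch this character's font byte from the scanline font held in LUT RAM
        ' (char, fontbase, addr, fbits are placeholder names)
        mov     addr, char              ' character code 0..255
        shr     addr, #2                ' four font bytes packed per LUT long
        add     addr, fontbase          ' fontbase = LUT address of the scanline font buffer
        rdlut   fbits, addr             ' long holding this character's font byte (3 clocks)
        testb   char, #0 wc             ' C/Z = low two bits of the char code,
        testb   char, #1 wz             '  i.e. which byte of the long we want
if_nc_and_nz getbyte fbits, fbits, #0   ' byte 0
if_c_and_nz  getbyte fbits, fbits, #1   ' byte 1
if_nc_and_z  getbyte fbits, fbits, #2   ' byte 2
if_c_and_z   getbyte fbits, fbits, #3   ' byte 3 -> fbits = 8 font pixels for this char row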
Yes, SETQ+XCONT. You really want to set it for the active video only and change it back for the front porch.
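I.e. roughly like this (sketch only - the command longs and NCO values are placeholders for whatever the driver already sets up, e.g. the divide-by-10 and divide-by-20 values):

        ' per-region rate change: a SETQ right before the streamer command supplies the new NCO value
        setq    nco_div20               ' halve the pixel rate for this region
        xcont   m_active, #0            ' active-region streamer command runs at the new rate
        ' ...
        setq    nco_div10               ' restore the normal rate for the front porch/sync timing
        xcont   m_porch, #0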
🎵Live and Learn, from the works of yesterdaaaaay!🎵
Agree, I usually bust out p2videodrv for simple demos or tests and such. Which reminds me that I never posted any of the texture mapping research code... (and I was going to add it here as a bonus but flexspin's project zipping just sharted itself)
Yeah
Not sure if that's worth it...
The pixel doubling code is:
The task code gets 22 useful cycles out of 40. Also note that it leaves the flags alone (not sure if I rely on that currently).
There may be a problem with doing both pixel doubling and text with the same scantask. My code uses 76 of these 22-cycle time slots for CRC+TERC4 encoding, so if you did 80 calls into a text function, there wouldn't be any time left. You could alternate it so that 160 calls produce 80 characters. Can certainly do monochrome text, but color seems like a pipe dream with this.
May look a bit like this:
So just putting aside the doubling for now: if you are sending data out in immediate mode using the streamer you don't have to issue an XCONT per pixel, but can do it per group of 8 pixels when running 16 colour text (a la VGA). This greatly reduces the frequency of calling the scanfunc to just 40 times per line on a 640x480 sized screen, for example. A lot of my earlier budget involved looking up the LUT RAM for character & colour data and writing back. Some of this goes away or is reduced when reading the FIFO and issuing XCONTs directly. Looking at my current text code, the basic character loop is something like this, where the last two instructions after the rep loop label are actually included because of skipf processing which skips two instructions inside the loop:
In your own case, the scanfunc needs to be fully buffered up with 2 XCONTs to gain cycles to go do other work. This means 2 x 8 pixel characters or 4 total bytes with the colour attribute bytes included can be read from the FIFO and worked on at a time. Once loaded with the next pair of characters it would only have to happen on average every 160 clocks (16 pixels).
Sounds like your code uses 76x22 = 1672 total cycles during the active portion to complete the other TERC4/audio polling work. As long as you get enough total cycles on the line to do your work it probably doesn't matter precisely how they are distributed, assuming audio is still polled often enough to not miss the sample and that you don't ever underrun the streamer and you can fit these scanfunc calls somewhere useful in your own code.
With this in mind I expect a variant of the above could be coded which is optimized for reading source buffer data from the FIFO and its font data out of LUT RAM (although you do need 64+ extra housekeeping clocks per line to fill this before the active data begins). It's mostly a matter of breaking this work into appropriately sized chunks so you get some free cycles for your operations. Something like this maybe?
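(Very rough sketch of the shape only - placeholder names such as fontbase and m_8pix, no flashing attribute, not cycle-exact, and the font-bit-to-nibble expansion is just one way to do it; depending on font bit order vs streamer nibble order a REV may also be wanted somewhere.)

textfunc
        rep     @.tend, #2              ' two characters per call = 16 output pixels
        rfword  cdat                    ' char (low byte) + attribute (high byte) from FIFO
        getbyte char, cdat, #0
        getbyte attr, cdat, #1
        mov     addr, char
        shr     addr, #2
        add     addr, fontbase          ' scanline font buffer sits in LUT
        rdlut   mask, addr              ' long holding this char's font byte
        testb   char, #0 wc
        testb   char, #1 wz
if_nc_and_nz getbyte mask, mask, #0     ' pick byte char & 3
if_c_and_nz  getbyte mask, mask, #1
if_nc_and_z  getbyte mask, mask, #2
if_c_and_z   getbyte mask, mask, #3
        movbyts mask, #%%0000           ' font byte into all four byte lanes
        mergew  mask
        mergew  mask                    ' each font bit now fills one nibble of the mask
        getnib  pix, attr, #1           ' background colour index (VGA-style attribute)
        mul     pix, ##$1111
        movbyts pix, #%%1010            ' bg nibble replicated into all 8 nibbles
        getnib  fgc, attr, #0           ' foreground colour index
        mul     fgc, ##$1111
        movbyts fgc, #%%1010
        setq    mask
        muxq    pix, fgc                ' fg where font bit set, bg elsewhere
        xcont   m_8pix, pix             ' 8 pixels of immediate 4bpp (LUT-mapped) data
.tend   ret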
You have to repeat the work twice in the scanfunc before returning, using the rep at the start, and call this snippet 40 times per line. As counted it would take 70 clocks + 6 call overhead = 76 clocks every 160 P2 clocks, leaving 40 x (160-76) = 3360 cycles remaining for TERC4/audio, which seems like it should be sufficient if you only needed 1672 before. However you do have to read the 64 long font for the scan line into LUT RAM before it can be used, taking 64+ more clocks elsewhere in the horizontal blanking portion, or maybe at the end of the prior line if needed. Also as coded, this loop consumes 4 extra clocks per character for the flashing text attribute, but it's good to keep that.
Now for the pixel doubling case, I'd expect that it should only increase the CPU time available between calls, as only half the number of characters need to be read in and processed (40 vs 80, for example, with VGA). You can then duplicate the computed 8x4-bit nibbles across two 32 bit longs by replicating the nibble colour indices into bytes. So the generated 8x4-bit nibbles in the non-doubled colour long above, such as N7_N6_N5_N4_N3_N2_N1_N0, then become two longs, N7_N7_N6_N6_N5_N5_N4_N4 and N3_N3_N2_N2_N1_N1_N0_N0, to be sent to the streamer in the 40 iterations, and the rep loop part wouldn't be needed. This extra work is less than processing a whole other character and can save cycles using those handy dandy SPLIT/MERGE style instructions which assist here. Something like this below may work, which amounts to 51 clocks + 6 call overhead = 57 clocks total instead of 76.
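(Again only a sketch of one way the expansion might go, not cycle-counted and probably not the slickest trick: each half of the font byte is stretched so every font bit covers a whole byte of the mask, then the same muxq is applied; fgc/bgc are the replicated fg/bg colour longs as computed above, fbyte is the already-fetched font byte.)

        getnib  mask, fbyte, #1         ' high 4 font bits -> first 8 doubled pixels
        call    #stretch8
        mov     pix, bgc
        setq    mask
        muxq    pix, fgc
        xcont   m_8pix, pix
        getnib  mask, fbyte, #0         ' low 4 font bits -> second 8 doubled pixels
        call    #stretch8
        mov     pix, bgc
        setq    mask
        muxq    pix, fgc
        xcont   m_8pix, pix
        ret

stretch8                                ' expand 4 bits so each bit fills one byte of the mask
        setword mask, mask, #1          ' duplicate low word into the high word...
        mergew  mask                    ' ...so MERGEW doubles every bit
        setword mask, mask, #1
        mergew  mask                    ' x4
        setword mask, mask, #1
 _ret_  mergew  mask                    ' x8: one full byte of mask per font bit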
Actually there's another problem which you might be considering and I only just realized. If the outer calling code is written to call fine-grained scanfuncs for RGB24 individual pixel-pair use, and this common code also needs to be shared with text scanfuncs, then the calls will be happening far more frequently than we need. So maybe the text scanfunc needs to know this and only truly execute once every xx actual calls, based on the number of pixels it generates, returning early otherwise - so you can still share the audio+TERC calling code for both types of scanfunc. I see that issue now. An incmod counter, #limit wz instruction and early return may help (but that will still burn 10 P2 clocks per fake call, which might still be ok, not sure). It may mess up your accounting scheme for how many total pixels have been sent, unless the scanfunc counts it for you perhaps.
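I.e. something along these lines at the entry of the text scanfunc (placeholder names, and the 1-in-4 ratio is just an example):

        incmod  fakecnt, #3 wz          ' wraps to zero (Z set) every 4th call
if_nz   ret                             ' otherwise bail out early: ~10 clocks including the call overhead
        ' ...the real 16-pixel text work continues here...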
That's what I was on about earlier. The task code will run for 0<=x<=22 cycles for each scanfunc call. There's essentially two somewhat related problems here:
So if you have a 76cy function generate 16px (160cy) of streamer data each time, you only get 40 timeslots.
If you add the incmod dealio, the base cost increases to 80.
The incmod idea just kinda sucks, changing the function pointer uses 2 less cycles per call, so 78+8+8+22+22+22 = 160. That might work out exactly! Though introducing 3/16 as a factor somewhere is nasty of course.
Yeah the incmod isn't ideal in the scanfunc as it involves the call overhead every time - I knew that, it was just a starting idea. Best to be done somehow on the caller side, maybe with a conditional test that's only true once every 8 times for example. Sort of like the idea of an incmod on the caller side, but it will burn extra instruction space.
I expect there will be some way to chop it up to make it work, just need to think more. There's enough total clocks per line for the work itself.
Did you see that post way earlier? You can just change the function pointer and that only takes 2 cycles, which are inside the scanfunc. You do need enough
_ret_ mov scanfunc,#.something
lines to get to the desired division level, but for 1/3 calls that's fine. It can also be used to balance out the workload.

If your call to the scanfunc had a preceding incmod before each call, how many extra instructions would be taken up in the COG's memory, I wonder? If the accounting instructions were done in the scanfunc, maybe it would free up room too.
E.g. move the sub scantimecomp, #1 into the scanfunc for each pixel generated, and replace it with this...
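Roughly, with 'slot' being a hypothetical wrap counter at each call site:

        incmod  slot, #3 wc             ' 2 clocks; wraps (C=1) every 4th slot
if_c    call    #scanfunc               ' only the wrapping slot actually calls (2 clocks when skipped)
        ' (the scantimecomp accounting then lives inside the scanfunc itself)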
Then basically keep the scanfunc call spacing the same as you have and most of the time the call is not made which effectively gives you many more time slots than real scanfunc calls.
I see you'd probably only get 80 clocks per slot as the text code takes the other half.
So you would get to execute 3 * 22 clocks (your current spacing) for every scanfunc call, and also incur a 4 clock overhead penalty for the 3 out of 4 slots where the call is not made.
(22+4) * 3 = 78. This gives 120 per line.
When the call is made it's going to take the 2 cycle incmod test + 6 call overhead cycles plus 78 working cycles for 16 pixels (including the extra accounting instruction)
This is 86 clocks.
So we have 78+86 = 164 clock cycles in 160 budget. Damn not enough, unless flashing attribute is yanked or other optimizations are found. You could unroll the rep loop in the scanfunc for two clocks saved.
Actually it's not as bad. I double counted the overhead. It was only 76 before including the call overhead so it's now 78 plus the prior 2 cycle incmod = 80. 78+80 will just fit the budget, but it's tight.
Update: One hassle with this is there are no free cycles left for any cursors to be done on the fly. You could do a soft cursor with a flashing block or underscore written into the buffer, but it's not the same and will destructively kill the character it is on. Maybe that's the final icing on top if there's a way. It's mostly a compare of a counter value plus an overriding mov instruction to do a simple cursor - 4 clocks would just about do it.
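E.g. something like this per character (purely illustrative names):

        cmp     charcnt, cursorpos wz   ' at the cursor column?
if_z    mov     fbits, #$FF             ' then override the glyph with a solid block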
Update 2: One more problem in text mode. As the pixels need to be sent out right after the video guard symbols, we'd need an initial setup call just like the scanfunc, but one that saves off the first 16 pixels into two longs instead of sending them to the streamer, so they are ready to load into the streamer's buffer right after the video guard pixels go out. It's probably solvable, just yet another annoyance to deal with. Text mode is always a bit of a pain.
Update 3: If the 22 clock budget usually included the accounting overhead instruction before, we could remove that - dropping the slots down to 20 cycles and freeing up 6 clocks (3 instructions' worth) per real scanfunc call, which could then be added back into the scanfunc budget, possibly to help do a cursor or two.
You can't do that, it'll mess up the timing for pixel doubling. There's a reason the caller subtracts from scantimecomp - it can do it in batches, since it only needs to be right at the end.

Please re-read my earlier posts. With changing the function pointer I calculated it down to 160 cycles; it should work exactly like that and need no extra change to the scantask code (just need to change the initial value of scantimecomp).
Not actually a problem. Remember the streamer has a command buffer, so the guard band can be entered while the back porch is still going. The opposite is a problem, going from short immediate commands to border/blanking, but that's a lot easier to solve.
As mentioned, accounting for scantimecomp is done in huge batches. It only needs to have the correct value at the very end.
Pixel doubling for non-text modes can be maintained I think if the incmod limit variable is set to zero, so it always calls your scanfunc in that case just like before. The extra incmod does of course need to introduce some adjustments to the sequence.
When pixel doubling for text, there would still be 40 real calls made and the scanfuncs would be able to work the same in the 160 clock budget, just half as much source data is read from the fifo and 16 total pixels would still go out like before.
So then are you mostly worried that the existing 22 clock maximum interval between scanfunc calls can't ever be messed with any more due to existing critical fragments? Maybe that's the irritation, if it was really hard to get this timing correct...? I was envisioning that some small adjustments to the places where the calls are done could still be made (which may increase total slot usage of the 120 available slightly), so long as you still get to call it every 40 clock cycles total (or faster, in which case it'll wait in the scanfunc for the streamer buffer to be ready before returning), keeping the spacing intact. Maybe you have some code that needs all 22 clock cycles for some processing work in some cases, and reducing down to 20 cycles is really difficult or impossible.
Or are you worried it won't fit in COG RAM with the extra prior incmods needed before each scanfunc call, given your accounting is often done in batches, so there won't be enough reclaimed COG RAM space to make up for the extra incmods?
Or maybe the incmod's use of flags, killing either Z or C around each scanfunc call, is a showstopper?
The use of scantimecomp as it stands seems mainly to determine the end-of-line condition in the line loop. I was thinking that if we accounted for all these pixels sent out within the scanfunc itself, then a benefit is that this counter could do double duty and be used to reference the character index for a cursor position check... though you'd also need to put the same accounting operation back into the rgb24 and any other pixel doubling variants, which will use two more cycles and obviously force adjustments to your code's call slot sequences. That's really the only reason I like the incmod stuff put on the caller side and moving the scantimecomp accounting into the scanfunc, as it can free a few more clocks by avoiding the call overhead altogether (4 instead of 8), though I can see it will consume more instruction space. I counted 28 calls to scanfunc and 8 adjustments to scantimecomp. This means it would consume 20 more instruction longs in COG/LUT, which is a downside of course if space is really limited.
Yeah I know how your approach works - it'd just keep all the scanfunc calls as-is and adjust the next scanfunc function address dynamically in a chain, creating a few shortened calls followed by the real call. I do rather like it for the small impact it creates, but the main downside is that there is no more budget in the text loop code pasted above, so it won't ever be able to support a cursor. Minor gripe, but I am looking for any solutions that resolve that... although perhaps my solutions are consuming too many resources for the minor budget increase they provide.
Yeah that's true, the guard band sends two identical pixels, so it'll probably be able to issue just a single immediate channel command for that while the back porch is running. I was thinking each pixel needed to get its own channel command and then the command fifo would be blocked.
Yeah, not all 22 cycle batches as coded change the scantimecomp variable, so we can't bank on gaining those extra 6 cycles every time. If we stole them by dropping down to 20 useful cycles between the scanfunc calls (from 22), then slot usage has to increase slightly - hopefully only about 10% or so going from 22 clocks down to 20. If you only used 96 slots before then hopefully 120 is still enough to finish before the end of the line...? It's tight.
What exactly is this 3/16 thing you mentioned earlier?
16 pixels over 3 calls -> nasty factor to get from 640 to 120
Adding a cursor could be done as-is if flashing is dropped. Is that really an important feature? (ZX Spectrum programmers hate it!)
I'm getting a minor headache off this I feel. Maybe I should just prototype an actual impl instead of dilly-dallying.
Ok I guess that's due to the accounting outside the scanfunc.
Yeah, I thought the same if it came down to it. Even VGA has a bit to ignore it and just use it for more background colours. Maybe I'll come up with more instructions saved in the text loop somehow. Last night I thought of a way that the scantimecomp use could be avoided for the cursor index counter: instead, once at the start of the line, just configure an underflowing counter which hits zero and sets the Z flag exactly when we reach the cursor position. It'd be nice to have a secondary mouse cursor though, for TextUI use.
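Something like this, illustrative only (the initial value would need adjusting for wherever in the loop the test sits):

        ' once per line:
        mov     curscnt, cursorcol      ' character cells until the cursor
        ' per character:
        sub     curscnt, #1 wz          ' Z set exactly once, at the cursor cell
if_z    mov     fbits, #$FF             ' drop a block cursor over that character's font bits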
Well that's one way to shut down the argument. Yeah, probably best to just start coding it and things will work themselves out. I'm also looking for ways that my own features could potentially work in all this discussion, though I'm not too sure my model of always being one scan line ahead would fit with it, which probably also affects the PSRAM, mouse sprite and multi-region stuff badly.
What was that SpaceX quip again? Something like: it's just easier to go to the moon than to convince NASA we can go to the moon!
I mean you could probably make it work without any of this scanfunc headache then. If you always pre-render the next scanline into a buffer, you can run the audio task without timing headaches. Can also buffer the packets into hubRAM, which makes streaming them out nicer.
Agree. But how does doubling ever work then with HDMI? I think I'd have to give it up in HDMI mode, which I guess is a sacrifice I'd be willing to make. Plus the extra HDMI code needs space and would use the 240 LUT longs instead of how I use them today, to both double pixels and render text into before writing it out. HDMI with audio is especially hard, as you've discovered. HDMI video framing alone is okay.
Well if we say the HDMI audio needs some ~2200 cycles in a scanline, there is a lot left over to copy and double up data for the next scanline.
Yeah, apart from 24 bpp mode, which was a real killer IIRC. I'll have to dig up the actual cycle counts in the old DVI thread. It's different for the different depths. Lower bpp is harder to double, but more packed bits get processed at a time and share the same longs for transfers. The 24bpp mode was quite hard on the COG, because for each pixel doubled you need one HUB long read into LUT, one read from LUT, two writes to LUT and two writes back to HUB, plus the loop overhead to fit in bursts over the 240 LUT longs. That's already 1+3+2*2+2 clocks, or 10 clocks per doubled input pixel, which is half the pixel budget - and it shares the memory bandwidth with the streamer too. At best that's probably gonna leave around 320x10 = 3200 cycles minus overheads at 640x480 resolution, with 6400 clocks in the active portion of the scanline. Might just still fit the budget. But another problem is the free LUT space and existing usage. How much extra code space does a basic HDMI TERC encode + audio resampler take, I wonder? It's certainly a challenge. Maybe there's scope to dynamically swap out TERC encoder code on the fly and reuse the LUT space afterwards for doubling, though that breaks my model of a self-contained rock solid video COG independent of HUB RAM corruption. All sorts of tricks may be needed.
I found some info about the DVI encoder. There is Verilog near the bottom of the page. https://forums.parallax.com/discussion/comment/1450072/#Comment_1450072
Question:
Does the HDMI logic support pixel repetition? E.g. TMDS clock = 250MHz and pixel clock = 12.5MHz for 320x240 source.
Yes. If a new 10-clock frame starts and there's no new data, it repeats the last value, updating the ones' accumulator for 8b video data.
Seems like we can't quote from other threads.
Quoting old messages seems to only quote the last replier's response. Nested quotes can be done manually, by quoting the earlier message, replacing > with >> and merging the two quotes. I notice that old messages say "X wrote:" and newer ones say "X said:". I don't know when that change happened, probably years ago now!
EDIT:
1. I also had to manually do the here link.
2. I really enjoyed the P2 design era and I miss it, although I showed up late in the process.
@SaucySoliton said:
This comment from Chip sounds promising. We'll have to retest this, it could save us a lot of problems if the DVI HW will repeat the pixels for us and open up the streamer use again.
Just tried patching my code so it outputs half the active pixels at half the rate in DVI mode, with a setq using the divide-by-20 value before the active pixels begin, restored back to divide-by-10 at the end of the active portion so the other sync timing stuff still all works. Couldn't get an image on screen. Might still be something I'm doing wrong, or maybe this repetition only works in immediate mode...? My code is using the streamer.
Update: I just verified that this same approach with the extra setq instructions, using VGA instead of DVI in my driver, does double the pixels on screen, so it would appear my patched code is okay. It's just that DVI output doesn't like it. I wonder if the streamer's NCO value still somehow affects the DVI encoder block's timing and messes with it, or whether we have to do some special initial priming for it to work.