@Wuerfel_21 , I had a look at your driver in terms of the COG space it might need for a port to my driver for HDMI audio.
I counted up:
185 longs for build-audio without scanfunc overhead - this would go into my spare LUTRAM
COG RAM use:
9 longs for packet buffers
16 longs for TERC4 encoding table
3 longs for TMDS mask, spdif parity
2 longs for spdif chnl status
1 long audioperiod
2 longs for spleft/right
24 longs for misc hdmi audio/terc encoding state
6 longs for audio poll loop
The known COG RAM use amounts to about 63 longs, which would fit nicely inside my freed-up font area in COG RAM. Some extra COG RAM instructions are also needed for transmitting HDMI sync/island output - this would vary a bit from yours as I'd need to integrate it into my own code's framework, but hopefully not too much extra code is required given it would stream from HUB data. I could move the region handling code and/or mouse sprite code to an overlay and free 54(+),63 longs respectively for any extra HDMI sync logic that may be needed above what I can free by removing unnecessary PAL/NTSC/VGA stuff. Should be plenty of space.
I now think this approach has legs and would likely fit in my driver's space and hopefully timing budget even when you factor in the extra overhead of reading these overlays into LUT while pixels are streaming.
E.g. consider the worst case: pixel doubling in 32 bpp mode during an active scan line of 6400 P2 clocks.
It takes about 3648 P2 clocks for pixel doubling.
It would take 185*1.25 = 231 P2 clocks to read in your packet encoding code from "build_audio" to end of encoding and at the end of this a 32 byte packet would be created in LUT with approximately 50 extra clocks to output it to HUB RAM.
This still leaves 6400-3648-231-50 = 2471 P2 clocks for a mouse sprite, any PSRAM mailbox request, and your TERC4 packet encoder to execute.
A mouse sprite takes ~78 clocks to read in the code and 480 clocks to execute.
This leaves 2471-78-480 = 1913 P2 clocks remaining in the active portion for packet encode, several audio polls on the scan line, and a PSRAM mailbox call in case some of that's not already doable during hblank. Hopefully it would suffice. I think you mentioned earlier your code takes 1398 cycles to CRC+TERC encode the packet. This leaves about ~515 for audio resampling and to populate the subpacket data in the best case. Your code uses ~180 per sample to resample and ~24 clocks each time you slide the audio input history buffer down. If done twice on a scan line it'll just fit, but 3 samples per scan line might be too hard. Hopefully we don't ever need 3 samples encoded at 48kHz given a 640x480 minimum resolution. Or I could look at simpler resampling.
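To make the budget above easier to audit, here it is as a quick script (all constants are the rough estimates quoted in this thread, not measured values):

```python
# Worst-case scan line budget sketch: 32bpp pixel-doubled mode,
# 6400 P2 clocks per active scan line. Constants are thread estimates.
LINE_CLOCKS   = 6400   # active scan line
PIXEL_DOUBLE  = 3648   # 32bpp pixel doubling
OVERLAY_LOAD  = 231    # read in packet encoder overlay (185 longs * 1.25)
PACKET_TO_HUB = 50     # copy finished 32-byte packet to HUB
MOUSE_LOAD    = 78     # read in mouse sprite overlay
MOUSE_EXEC    = 480    # mouse sprite execution
CRC_TERC      = 1398   # CRC + TERC4 encode of one packet
RESAMPLE      = 180    # cubic resample, per sample
HIST_SLIDE    = 24     # slide audio history buffer, per sample

remaining = LINE_CLOCKS - PIXEL_DOUBLE - OVERLAY_LOAD - PACKET_TO_HUB
remaining -= MOUSE_LOAD + MOUSE_EXEC
audio_left = remaining - CRC_TERC

print(remaining)    # clocks left for packet encode + audio work
print(audio_left)   # clocks left for resampling + subpacket fill
print(audio_left - 2 * (RESAMPLE + HIST_SLIDE) >= 0)  # 2 samples/line fits
print(audio_left - 3 * (RESAMPLE + HIST_SLIDE) >= 0)  # 3 samples does not
```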
You never need more than 2 samples per scanline, ever, unless you're doing 15kHz video (yes, 480i over HDMI is valid) or hi-res 96kHz audio. Think about it: all the valid sample rates (32, 44.1, 48) divided by 31kHz are 1.something. I think I actually have a limit in place that stops it from encoding more than 2 per line.
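To spell that out as a quick check (31.5kHz being the nominal VGA-style line rate):

```python
import math

# Worst-case audio samples per scan line = sample rate / line rate, rounded up.
h_freq = 31_500                        # Hz, ~31.5kHz VGA-style line rate
for fs in (32_000, 44_100, 48_000):
    print(fs, math.ceil(fs / h_freq))  # all the standard rates give 2

# The exceptions: 15kHz (interlaced) video, or 96kHz hi-res audio
print(math.ceil(48_000 / 15_734))      # samples/line at 480i
print(math.ceil(96_000 / 31_500))      # samples/line for 96kHz audio
```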
In a pinch you can drop to linear resampling, but that has awful artifacts to it.
You can actually do HDMI+VGA, so better not get rid of the VGA sync code. If you use the normal 252MHz VGA mode, it will even work properly! Maybe useful if someone wants to use a DVI-I connector.
@Wuerfel_21 said:
Good job counting all those cycles!
Yes, tedious and easy to miss something.
You never need more than 2 samples per scanline, ever, unless you're doing 15khz video (yes 480i over HDMI is valid) or hi-res 96kHz audio. Think about it, all the valid sample rates (32, 44.1, 48) divided by 31khz are 1.something. I think I actually have a limit in place that stops it from encoding more than 2 per line.
Yeah, that's right. It could still potentially work with more samples per line but maybe just not during pixel doubling which is unfortunately what you may want if you are doing 480i anyway. It might still be useful for PSRAM based non-pixel doubled true colour setups. There are lots of extra cycles available in graphics modes without doubling enabled for more samples to be packed into the packet, although not many for text regions. Need to mull that over.
In a pinch you can drop to linear resampling, but that has awful artifacts to it.
Good to know.
You can actually do HDMI+VGA, so better not get rid of the VGA sync code. If you use the normal 252MHz VGA mode, it will even work properly! Maybe useful if someone wants to use a DVI-I connector.
Yep true, and I'll try to include if it fits, but it's not the main priority to have multi-outputs right now. I will ultimately need to make good use of the sync time for various things my driver does. Some parts of the mouse sprite stuff can probably go there. I think the audio is still gonna be real tight in the worst case scenario, but it'd be great use of the time available in a single P2 COG if it all fits in the end, especially given all the various features contained in my current P2 video driver. Be real nice to have all existing features working without too many compromises.
One thing I want to do is to have a master volume control setting in the final sample output for left/right channels going out to HDMI. One of my per display longs updated per frame will suffice to hold dual 16 bit multiplier values - this could be used for panning effects too. Most display devices can also already control volume, and it's best to do it at the amp, but it'd be nice to be able to do it from the P2 as well even just for fading in/out etc. You do of course lose quality by scaling the signal range down, I know.
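Something like this is what I have in mind (a sketch in Python; the name and the packing are my assumption: left volume in the low word, right volume in the high word, $FFFF as near-unity gain):

```python
def apply_volume(sample_l, sample_r, vol_long):
    """Scale a signed 16-bit stereo pair by dual volume multipliers packed
    into one long: left volume in the low word, right volume in the high
    word. 0xFFFF is near-unity gain, 0x0000 is mute."""
    vol_l = vol_long & 0xFFFF
    vol_r = (vol_long >> 16) & 0xFFFF
    # (sample * vol) >> 16; Python's >> floors, like an arithmetic shift
    return (sample_l * vol_l) >> 16, (sample_r * vol_r) >> 16

print(apply_volume(20000, -20000, 0xFFFF_FFFF))  # near unity on both channels
print(apply_volume(20000, -20000, 0x0000_8000))  # half volume left, mute right
```

On the P2 each channel's multiply should map onto a couple of instructions (a signed multiply plus a shift), so doing it per output sample ought to be cheap.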
@Wuerfel_21 said:
You never need more than 2 samples per scanline, ever, unless you're doing 15khz video (yes 480i over HDMI is valid) or hi-res 96kHz audio. Think about it, all the valid sample rates (32, 44.1, 48) divided by 31khz are 1.something. I think I actually have a limit in place that stops it from encoding more than 2 per line.
For 44.1/48kHz audio and 31.5kHz video, do you output 1 or 2 audio samples every line including vertical blanking & sync?
Just realized that in any 480i/576i case you'll have 2x the time per scan line (scan line frequency is halved) so there would be extra time to do more audio packet processing. This should work out okay if you need to encode 3-4 samples per line (15.75kHz * 4 > 48kHz). As you can still fit 4 stereo samples per audio packet you only need to encode a single packet so it won't mess with the data island arrangement. For text in 40 column output (double wide mode) you'd have to pixel quadruple however. Again there is more time to do this processing so it scales accordingly.
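Checking those interlaced numbers:

```python
import math

# At 480i/576i the line rate halves, so more samples accrue per scan line,
# but each line also has twice the time available for processing.
for h_freq, label in ((15_734, "480i"), (15_625, "576i")):
    print(label, math.ceil(48_000 / h_freq))  # worst-case samples per line

# An HDMI audio sample packet carries up to 4 stereo samples, so a single
# packet per scan line still covers the worst case:
print(math.ceil(48_000 / 15_734) <= 4)
```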
@Wuerfel_21 , I've been looking at this code for a while and am wondering if there are any ways to optimize it further. It seems like a lot of data movement, but it's probably about as good as it's gonna get without unrolling loops, which won't buy much. Nice use of splitw and mergeb by the way.
I still sort of wonder if something could be done during the prior CRC encode stage of subpackets when you are already working with nibbles. Maybe they could be split/merged then during that sequence in the rep loops? Something with ROLNIB perhaps accumulating across channels. The interleaving of red/green channel data is an added complication, isn't it.
UPDATE: Actually on second thoughts it's probably pretty fast code as is. If you add anything more into the CRC REP it will multiply up pretty quickly and likely exceed this code's timing.
' Assume PTRB is an appropriate LUT pointer to deposit 32 longs of encoded TERC4
.serloop
sub scantimecomp,#3 + (3*4)
call scanfunc
' serialize ch0
cmp serbyte,#0 wz
if_z rolbyte serch0,packet_extra,#2 ' FE for first packet in island, FF otherwise
if_nz rolbyte serch0,#$FF,#0
altgb serbyte,#packet_header
rolbyte serch0
'rolbyte serch0,packet_vsync,#0 ' VSYNC on(?)
'rolbyte serch0,#$FF,#0 ' HSYNC on
rolword serch0,packet_extra,#0 ' Set sync status
altgw serbyte,#packet_data3
rolword temp1
call scanfunc
altgw serbyte,#packet_data2
rolword temp1
splitw temp1
rolword serch1,temp1,#0
rolword serch2,temp1,#1
altgw serbyte,#packet_data1
rolword temp1
altgw serbyte,#packet_data0
rolword temp1
splitw temp1
call scanfunc
rolword serch1,temp1,#0
rolword serch2,temp1,#1
mergeb serch0
mergeb serch1
mergeb serch2
'debug("TERC bytes ",uhex_long(serch0,serch1,serch2))
mov temp4,#4
.byteenc
getnib temp1,serch0,#0
alts temp1,#terctable
mov temp2,0-0 ' This also copies HSync bit for VGA
getnib temp1,serch0,#1
call scanfunc
alts temp1,#terctable
mov temp3,0-0 ' This also copies HSync bit for VGA
getnib temp1,serch1,#0
setq tmds_ch1_mask
alts temp1,#terctable
muxq temp2,0-0
getnib temp1,serch1,#1
alts temp1,#terctable
muxq temp3,0-0
call scanfunc
getnib temp1,serch2,#0
setq tmds_ch2_mask
alts temp1,#terctable
muxq temp2,0-0
getnib temp1,serch2,#1
alts temp1,#terctable
muxq temp3,0-0
wrlut temp2,ptrb++
wrlut temp3,ptrb++
'debug("packet enc ",uhex_(temp2))
'debug("packet enc ",uhex_(temp3))
call scanfunc
shr serch0,#8
shr serch1,#8
shr serch2,#8
djnz temp4,#.byteenc
incmod serbyte,#3 wc
if_nc jmp #.serloop
sub scantimecomp,#1
jmp scanfunc ' tail call
' TERC packet pre-encoding buffers
packet_header res 1
packet_data0 res 2
packet_data1 res 2
packet_data2 res 2
packet_data3 res 2
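While studying it I wrote a little Python model of the terctable lookup. The 16-entry table below is my own transcription of the TERC4 symbol table from the HDMI spec, so treat it as unverified and check it against the official document before relying on it:

```python
# TERC4 maps each 4-bit value to a fixed 10-bit symbol (transcribed from
# the HDMI spec's TERC4 coding table; double-check before relying on it).
TERC4 = [
    0b1010011100, 0b1001100011, 0b1011100100, 0b1011100010,
    0b0101110001, 0b0100011110, 0b0110001110, 0b0100111100,
    0b1011001100, 0b0100111001, 0b0101100011, 0b1010001110,
    0b1001110001, 0b0101100100, 0b1011000110, 0b1010110001,
]

def terc4_encode_byte(b):
    """Encode one byte as two 10-bit TERC4 symbols, low nibble first,
    mirroring the getnib/alts/terctable lookups in the PASM above."""
    return TERC4[b & 0xF], TERC4[(b >> 4) & 0xF]

# Sanity check: all 16 symbols are distinct 10-bit codes
print(len(set(TERC4)) == 16 and all(s < 1024 for s in TERC4))
```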
You can save some longs by re-rolling the .byteenc loop, I guess. Other than that, well, I didn't find any way to make it faster, but I also didn't lose sleep over it, since it's already quick enough for what I need. There is a theoretical trick I considered - if you only ever send 2 samples, the upper two bits of ch1/ch2 are not needed and can be zero, so you can reduce the processing that way.
The CRC loop is separate because it is a separate thing really. The CRC bytes are just normal data to the TERC encoder.
OT: Friggin vodafone is having a normal one, currently only have internetworking on my thinkpad (knew that LTE modem would come in handy one day), might run out of LTE quota and disappear for a while.
Today I was able to migrate my code to free the longs in anticipation/preparation of attempting some HDMI capability in my driver @Wuerfel_21
So far I installed your audio resampler and TMDS encoder into 193 longs of LUT space including room for 32 more longs for the temporary encoded packet storage before writing it into HUB.
I also moved the text font into the LUT which freed space in COG RAM and I've limited the text handling to 170 characters per line down from 240 (for 1920 pixel VGA mode). For the pure DVI/HDMI modes using text, rendering 1920 pixels was too excessive anyway, however after this change the driver can still support text rendering up to 1360 pixels. I moved the mouse code to now use a LUT RAM based overlay scheme as well.
I just tested that DVI still works so these memory layout changes didn't break the code so far which is a good sign.
With all the extra HDMI processing stuff and audio state added, COG RAM is now down to ~52 longs free. This will be needed for new streamer commands, sync handling and data island generation logic. Basically the differences between what DVI sends out and what HDMI does need to go here, and I'll get to replace some of the existing DVI instructions, which frees a few more longs in COG RAM too.
If I can't fit those extra changes in the space available I can always try to migrate pixel doubling to an overlay also which would probably net another 27 more longs but perhaps cost 50 clocks worth of extra time to load in on the fly.
I also need to add some calls to poll audio in various places and would need to ensure it gets sampled sufficiently often to not miss a sample. Allowing up to a 96kHz input rate, you'd need to sample within 10us of the last time. At 270MHz this is every 2700 TMDS or P2 clocks, or 2520 at regular VGA frequencies. Approximately 4 times per scan line should work. Finer-grained polling would be better for input frequencies above 96kHz but complicates the code in large tight loops. It'd be neat to accept inputs as high as 192kHz and resample back to 48kHz, but not sure yet what will fit.
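The polling arithmetic, spelled out (the 10us figure is the 96kHz sample period rounded down to a safe budget):

```python
# Clocks available between polls = sysclk / max input sample rate.
def poll_budget_clocks(sysclk_hz, max_sample_hz):
    return sysclk_hz // max_sample_hz

print(poll_budget_clocks(270_000_000, 96_000))  # clocks per 96kHz period
print(poll_budget_clocks(252_000_000, 96_000))  # same at VGA-ish sysclk
# Polling ~4x over a 6400-clock scan line (every 1600 clocks) stays
# comfortably inside one 96kHz sample period:
print(6400 // 4 < poll_budget_clocks(252_000_000, 96_000))
```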
I don't think the cubic resampler works properly above 96kHz (=2x ratio) as-is, really (you would get something, but it'd be full of artifacts) Of course you could downsample 192 to 96, that would work. But sample rates higher than 96 are just more expensive to deal with for approximately zero benefit, so why even bother?
When we go from any DVI device to HDMI, we use a passive cable.
The difference is only the audio, I think. But I had a Dreambox satellite decoder with DVI and embedded audio. They used DVI on the box to bypass the HDMI copy-protection rules.
We used a passive DVI to HDMI cable on the Dreambox and had audio.
I guess, if you get HDMI working with audio, DVI will be the same.
So on a PCB we can use HDMI, and if we need DVI it is just a matter of a passive HDMI to DVI cable.
That's what we're already talking about, DVI vs HDMI signalling, where the difference really is the embedded audio/auxiliary data. I think we're both just using the normal HDMI adapter that Parallax makes. Though I adapt it to DVI for capture on my VisionRGB card.
I don't think the cubic resampler works properly above 96kHz (=2x ratio) as-is, really (you would get something, but it'd be full of artifacts) Of course you could downsample 192 to 96, that would work. But sample rates higher than 96 are just more expensive to deal with for approximately zero benefit, so why even bother?
That's good to hear. I was worried I'd need to try to go higher than 96kHz. Supporting 96 is more straightforward with just 4 audio polls per line. Of course any potential 480i mode if that comes down the track would need to do more polling per line but there's more time then for that anyway. I think budget wise this thing looks good. Just packing it into the COG is the main trick right now.
I always make a stand and say that IMO 32kHz is sufficient for casual listening (i.e. as a non-audiophile delivery format). I think most people cannot hear up to that 16kHz Nyquist limit (I certainly can't and never could, even as a child), and even if they do, the sensitivity is very low. It does require better filters than 44.1/48, and this is really where the extra benefit of 96 lies. In terms of bit depth you can't go wrong with 16, but I think DAT/DV style 12 bit companding is probably transparent (esp. if you use appropriate noise shaping after companding - I think most encoders for this format take 16 bit input only, when you'd really want to go from 24 or float).
Yeah, in most cases a driver will probably want to resample up from a 32 or 44.1kHz source rate to a fixed 48kHz output rate, just to avoid dropping the audio link when switching between source material at different rates etc. My consideration is mainly for figuring out how many input polling samples I'd need to do per scan line in the worst case. I'm thinking 96kHz is a reasonable input limit to target that is still achievable by the code; it would be overkill to go higher than that. There's not a lot of benefit going higher unless you can actually output at that same rate to hifi gear, or couple it with 24 bit samples, which is probably a different application to this. Plus 16 bit stereo samples fit into a single repo pin anyway, which simplifies the data polling.
I didn't test specifically, but does your resampler work for both scaling the input rate up and down to 48kHz @Wuerfel_21 ? Or only one way?
It should work both ways but I didn't really test upsampling. Upsampling should work fine at any ratio, even really low ones.
If you have 32 or 44.1 source, you should really just send it as-is for ideal quality. These three rates (32/44.1/48) are all required, so they'll always work.
@Wuerfel_21 said:
It should work both ways but I didn't really test upsampling. Upsampling should work fine at any ratio, even really low ones.
If you have 32 or 44.1 source, you should really just send it as-is for ideal quality. These three rates (32/44.1/48) are all required, so they'll always work.
Yeah ideally you should but if you want some general video driver that works after init with whatever you may throw at it, then resampling everything you send is still good. Changing dynamically requires extra APIs and manipulation of timing parameters, recalc of CTS/N etc, updated infoframe. All doable yes but just more complex for now.
Oh no, now I'm scrambling! I missed the skipf inside the REP loop in my table of pixel doubling execution times when determining the clocks, so it actually takes longer to execute: 9 instead of 7 clocks per doubled input pixel. I'll now have to find a way to code a special 24bpp doubling mode path with this code copied somewhere else without the skipf. Or maybe I could copy the REP into the skipf instruction with adjusted skip masks for 24bpp gfx mode. The second REP should cancel the first.
Only the executed instructions are shown for 24bpp:
doubleloop rep #0-0, pb 'patched pixel doubling loop count
skipf pattern '1 2 4 8 16 32 '<<<<<< I don't want this line to execute
rdlut a, ptra++ '* * * * * *
wrlut a, ptrb++ ' *
wrlut a, ptrb++ '* * * * * *Short 32bpp loop
@Wuerfel_21 said:
You need to recalculate the resampler's parameters if the input rate changes, anyways.
Ok good to know. Maybe that's easier than changing the HDMI output stuff though. Will need to figure out a pathway to make changes on the fly. Some long in HUB could be polled in the encoder to see if things need changing perhaps. I still have about 15 longs free in the LUT RAM to add a little bit more audio code. Hopefully that would suffice for supporting such parameter changes. Maybe one or two need to be reserved for audio poll calls and a chain loader if needed.
No, the first thing I need to do is the read before I write back.
Here's the full related code including the skip patterns and lengths that get patched in - I might be able to extend the skip pattern length by 4 or 5 more bits though, if I treat instcount as a nibble instead of a byte.
' __instr_skip_patterns____instcount
doublebits long %01111001101110110110110_000001001
doublenits long %01110111001111011100010_000001010
doublenibs long %00001110100001111010100_000001110
doublebytes long %01111110101111111010110_000000111
doublewords long %01111101101111110110110_000000111
doublelongs long %10000000001111101111110_000000100
' Code to double pixels in all the different colour depths
doublepixels
push ptra 'preserve current pointers
push ptrb
mov ptra, #$188 'setup pointers for LUT accesses
mov ptrb, #$110
doubleloop rep #0-0, pb 'patched pixel doubling loop count
skipf pattern '1 2 4 8 16 32
rdlut a, ptra++ '* * * * * *
setq nibblemask ' *
splitw a ' *
mov b, a '* * * * *
movbyts a, #%%2020 ' *
movbyts a, #%%1100 ' * *
movbyts a, #%%1010 '* *
wrlut a, ptrb++ ' *
mergeb a ' *
mergew a '*
mov pb, a ' *
shl pb, #4 ' *
muxq a, pb ' *
wrlut a, ptrb++ '* * * * * *Short 32bpp loop
movbyts b, #%%3131 ' * |returns here but
movbyts b, #%%3322 ' * * |falls through at
movbyts b, #%%3232 '* * |end after its REP
mergew b '* |block completes.
mergeb b ' * |
mov pb, b ' * |
shl pb, #4 ' * |
muxq b, pb ' * |
wrlut b, ptrb++ '* * * * * |<skipped for 32bpp
pop ptrb
pop ptra
ret wcz
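For reference, the packed pattern longs above can be picked apart like this (field split as used here: 9-bit instruction count in the low bits, SKIPF pattern above it):

```python
# Decode the packed doubling-control longs: the low 9 bits are the REP
# instruction count patched into 'doubleloop', the rest is the SKIPF pattern.
def unpack_double(value):
    count = value & 0x1FF    # instruction count field
    pattern = value >> 9     # skip pattern, LSB = first instruction after skipf
    return count, pattern

doublebits  = 0b01111001101110110110110_000001001
doublelongs = 0b10000000001111101111110_000000100

print(unpack_double(doublebits))   # 1bpp path: 9 instructions
print(unpack_double(doublelongs))  # 32bpp path: 4 instructions
```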
I can probably branch out at the first 32 bpp only instruction (change to a jump and make the skipf mask skip everything until then), then the jump target can be made fully 32bpp specific code, only overhead is one extra branch and rep/skipf incurred once.
Update: Here's the plan. Bonus: I can probably also extend the skip pattern length to skip the faster32bpp code at the end for the other depths, though I need to check it's doable - should be:
doublelongs long %1111111_000000000
doublepixels
push ptra 'preserve current pointers
push ptrb
mov ptra, #$188 'setup pointers for LUT accesses
mov ptrb, #$110
doubleloop rep #0-0, pb 'patched pixel doubling loop count
skipf pattern '1 2 4 8 16 32
rdlut a, ptra++ '* * * * *
setq nibblemask ' *
splitw a ' *
mov b, a '* * * * *
movbyts a, #%%2020 ' *
movbyts a, #%%1100 ' * *
movbyts a, #%%1010 '* *
jmp #faster32bpp ' *
mergeb a ' *
mergew a '*
mov pb, a ' *
shl pb, #4 ' *
muxq a, pb ' *
wrlut a, ptrb++ '* * * * *
movbyts b, #%%3131 ' *
movbyts b, #%%3322 ' * *
movbyts b, #%%3232 '* *
mergew b '*
mergeb b ' *
mov pb, b ' *
shl pb, #4 ' *
muxq b, pb ' *
wrlut b, ptrb++ '* * * * *
faster32bpp rep #3, pb ' *
rdlut a, ptra++ ' *
wrlut a, ptrb++ ' *
wrlut a, ptrb++ ' *
pop ptrb
pop ptra
ret wcz
p.s. in fact I can see another optimization of the final wrlut if I share it with the prior rep loop code.
@rogloh said:
Yeah ideally you should but if you want some general video driver that works after init with whatever you may throw at it, then resampling everything you send is still good. Changing dynamically requires extra APIs and manipulation of timing parameters, recalc of CTS/N etc, updated infoframe. All doable yes but just more complex for now.
Why do you need to support changing the audio sample rate without losing the audio link? Nobody cares about briefly losing the video signal when the video resolution changes.
Someone does care, the latest HDMI spec has the ability to switch resolutions without dropout, but I think that requires FRL shenanigans? The 2.x spec isn't public... cowards!
@Electrodude said:
Why do you need to support changing the audio sample rate without losing the audio link? Nobody cares about briefly losing the video signal when the video resolution changes.
Yeah it'd be good to keep audio running independent of the source material sample rate change. That's why keeping it fixed at 48kHz the whole time is not a bad idea.
If it's just a local sample rate change in the P2, where COGs now send at a different rate into the repo while keeping the output rate the same, Ada mentions resampling should work at rates below 48kHz, so it's probably just a change between 32/44.1/48kHz rates by the audio generator COG, and the HDMI COG wouldn't need to know much about it.
If however it's an output (HDMI source) sample rate change between 32/44.1/48kHz, I'd expect you'd need to recalc the CTS/N clock regen frame parameters and re-encode a new clock regen packet (updated encoding could be done by a caller API), and alter the SPDIF channel status. Probably just a few other local variables used by the overlay would change. Things such as resample_phase_frac, resample_ratio_int in Wuerfel_21's code. Not quite sure yet on the full extent but it's probably reasonably limited and simple to update the COG params. There might still be some minor click or pop while it changes at the output if these changes are not perfectly synchronized with the audio sample packets being sent, probably also depends a bit on how the HDMI sink device deals with this sort of operation.
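For the CTS/N part, the HDMI relation is fs = f_TMDS * N / (128 * CTS), so the recompute on a rate switch might look like the sketch below. The N values are the spec's recommended defaults, the names are my own, and the right N/CTS pair should still be verified against the spec tables for the exact TMDS clock in use:

```python
# HDMI audio clock regeneration: the sink recovers
#   fs = f_tmds * N / (128 * CTS)
# Recommended N per sample rate (HDMI spec defaults; verify for your clock).
N_FOR_RATE = {32_000: 4096, 44_100: 6272, 48_000: 6144}

def clock_regen_params(f_tmds_hz, fs_hz):
    n = N_FOR_RATE[fs_hz]
    cts = round(f_tmds_hz * n / (128 * fs_hz))  # average CTS value to send
    return n, cts

# At the 25.2MHz TMDS character (pixel) clock of standard 640x480:
print(clock_regen_params(25_200_000, 48_000))   # (6144, 25200)
print(clock_regen_params(25_200_000, 44_100))   # (6272, 28000)
print(clock_regen_params(25_200_000, 32_000))   # (4096, 25200)
```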
Forget about smoothly changing audio rates, you will not get it to happen without a glitch. You'd need to re-calculate the resampler rate at exactly the right moment. On top of that, the generator cog also needs to switch precisely at the right moment without a glitch. (Due to buffering, this may not be the same moment).
To avoid this, you'd need a system where the HDMI cog determines the sample clock. With the current mailbox system, the generator determines the clock, but the HDMI cog needs to know it exactly to generate the correct 48k output.
Maybe some samples could be nulled out / muted while changing, but I agree that probably one more or one less sample might be sent during the rate switchover, and this might affect the output device if it can't handle that condition.
UPDATE: It could be another reason to add a volume control param to the output - which I plan to do.
UPDATE2: Hmm those 15 free LUT longs are starting to look a bit more precious now.
It wouldn't go to static or anything like that, the sink should be designed to avoid that.
Also don't feel married to the one design that I came up with (pin mailbox->resampler->encoder) specifically to cater to the existing sound chip emulators (which produce sound in real-time, at odd rates and generally don't have a lot of memory to spare). There's obviously a bigger design space there.
I wonder what sort of other end-of-the-line DSP could be done if one skips out on resampling. Maybe I could generate 3D sound as WXYZ channels in ambisonics B-Format (I've been reading up on that recently) and have the video cog do the decoding to the selected speaker layout (to save on memory and CPU time elsewhere).
@Wuerfel_21 said:
It wouldn't go to static or anything like that, the sink should be designed to avoid that.
Also don't feel married to the one design that I came up with (pin mailbox->resampler->encoder) specifically to cater to the existing sound chip emulators (which produce sound in real-time, at odd rates and generally don't have a lot of memory to spare). There's obviously a bigger design space there.
Doesn't need to be long term, really just doing some amount of experimenting for now to see if I can get something to work at least and may fit within the constraints of my own model/COG resource limits.
I wonder what sort of other end-of-the-line DSP could be done if one skips out on resampling. Maybe I could generate 3D sound as WXYZ channels in ambisonics B-Format (I've been reading up on that recently) and have the video cog do the decoding to the selected speaker layout (to save on memory and CPU time elsewhere).
Sounds good too. Seems weird to put a lot of audio burden in the video COG but HDMI and audio is coupled somewhat closely together. The other way to go is to have some sort of pull model where the HDMI pulls out audio samples at its own rate from some small FIFO and the audio source needs to monitor this FIFO's fill/drain levels and feed it with samples. Might create more jitter this way though as it would need to adapt its output rate slightly to keep the level from draining/filling too quickly. Probably messy to manage the buffer.
@rogloh said:
For 44.1/48kHz audio and 31.5kHz video, do you output 1 or 2 audio samples every line including vertical blanking & sync?
Yes. I always send it during HSync.
Just realized that in any 480i/576i case you'll have 2x the time per scan line (scan line frequency is halved) so there would be extra time to do more audio packet processing. This should work out okay if you need to encode 3-4 samples per line (15.75kHz * 4 > 48kHz). As you can still fit 4 stereo samples per audio packet you only need to encode a single packet so it won't mess with the data island arrangement. For text in 40 column output (double wide mode) you'd have to pixel quadruple however. Again there is more time to do this processing so it scales accordingly.
@Wuerfel_21 , I've been looking at this code for a while and am wondering if there are any ways optimize it further. It seems like a lot of data movement but it's probably about as good as it's gonna get without unrolling loops which won't buy much. Nice use of splitw and mergeb by the way.
I still sort of wonder if something could be done during the prior CRC encode stage of subpackets when you are already working with nibbles. Maybe they could be split/merged then during that sequence in the rep loops? Something with ROLNIB perhaps accumulating across channels. The interleaving of red/green channel data is an added complication, isn't it.
UPDATE: Actually on second thoughts it's probably pretty fast code as is. If you add anything more into the CRC REP it will multiply up pretty quickly and likely exceed this code's timing.
You can save some longs by re-rolling the
.byteenc
loop, I guess. Other than that, well I didn't find any way to make it faster, but I also didn't lose sleep over it, since it's already quick enough for what I need. There is a theoretical trick I considered - if you only ever send 2 samples, the upper two bits of ch1/ch2 are not needed and can be zero, so you can reduce the processing that way. The CRC loop is separate because it is a separate thing really. The CRC bytes are just normal data to the TERC encoder.
OT: Friggin vodafone is having a normal one, currently only have internetworking on my thinkpad (knew that LTE modem would come in handy one day), might run out of LTE quota and disappear for a while.
Today I was able to migrate my code to free the longs in anticipation/preparation of attempting some HDMI capability in my driver @Wuerfel_21
So far I installed your audio resampler and TMDS encoder into 193 longs of LUT space including room for 32 more longs for the temporary encoded packet storage before writing it into HUB.
I also moved the text font into the LUT which freed space in COG RAM and I've limited the text handling to 170 characters per line down from 240 (for 1920 pixel VGA mode). For the pure DVI/HDMI modes using text, rendering 1920 pixels was too excessive anyway, however after this change the driver can still support text rendering up to 1360 pixels. I moved the mouse code to now use a LUT RAM based overlay scheme as well.
I just tested that DVI still works so these memory layout changes didn't break the code so far which is a good sign.
With all the extra HDMI processing stuff and audio state added, COG RAM is now down to ~52 longs free. This will be needed for new streamer commands, sync handling, and data island generation logic. Basically the differences between what DVI sends out and what HDMI does need to go here, and I'll get to replace some of the existing DVI instructions, which frees a few more longs in COG RAM too.
If I can't fit those extra changes in the space available I can always try to migrate pixel doubling to an overlay also which would probably net another 27 more longs but perhaps cost 50 clocks worth of extra time to load in on the fly.
I also need to add some calls to poll audio in various places and would need to ensure it gets sampled sufficiently often to not miss a sample. Allowing up to a 96kHz input rate you'd need to sample within ~10us of the last time. With a 27MHz pixel clock that's every 2700 TMDS (= P2) clocks, or 2520 at regular VGA frequencies. Approximately 4 times per scan line should work. Finer grained would be better for input frequencies above 96kHz but complicates the code in large tight loops. It'd be neat to accept inputs as high as 192kHz and resample back to 48kHz, but not sure yet what will fit.
Good job!
I don't think the cubic resampler works properly above 96kHz (=2x ratio) as-is, really (you would get something, but it'd be full of artifacts). Of course you could downsample 192 to 96, that would work. But sample rates higher than 96 are just more expensive to deal with for approximately zero benefit, so why even bother?
Thinking,
When we go from any DVI device to HDMI, we use a passive cable.
The difference is only the audio, I think. But I had a Dreambox satellite decoder with a DVI port and embedded audio. They used DVI on the box to bypass the HDMI copy-protection rules.
We used a passive DVI to HDMI cable on the Dreambox and had audio.
I guess, if you get HDMI working with audio, DVI will be the same.
So on a PCB we can use an HDMI connector; if we need DVI it is just a matter of a passive HDMI to DVI cable.
RS stock no.: 852-5279
That's what we're already talking about, DVI vs HDMI signalling, where the difference really is the embedded audio/auxiliary data. I think we're both just using the normal HDMI adapter that Parallax makes. Though I adapt it to DVI for capture on my VisionRGB card.
That's good to hear. I was worried I'd need to try to go higher than 96kHz. Supporting 96 is more straightforward with just 4 audio polls per line. Of course any potential 480i mode if that comes down the track would need to do more polling per line but there's more time then for that anyway. I think budget wise this thing looks good. Just packing it into the COG is the main trick right now.
I always make a stand and say that IMO 32kHz is sufficient for casual listening (i.e. as a non-audiophile delivery format). I think most people cannot hear up to that 16kHz Nyquist limit (I certainly can't and never could, even as a child) and even if they do, the sensitivity is very low. It does require better filters than 44.1/48 and this is really where the extra benefit of 96 lies. In terms of bit-depth you can't go wrong with 16, but I think DAT/DV style 12 bit companding is probably transparent (esp. if you use appropriate noise shaping after companding - I think most encoders for this format take 16 bit input only, when you'd really want to go from 24 bit or float).
Yeah in most cases a driver will probably want to rescale up from 32 or 44.1kHz source sampling rate to a fixed 48kHz rate for actual output just to keep from dropping the audio link between different played back source material at different rates etc. My consideration is mainly for figuring out how many input polling samples I'd need to do per scan line in the worst case. I'm thinking 96kHz is a reasonable input limit to target that is still achievable by the code and would be overkill to go higher than that. There's not a lot of benefit to be had going higher unless you can actually output at that same rate as well to hifi gear or couple with 24bit samples which is probably a different application to this. Plus 16 bit stereo samples fit into a single repo pin anyway which simplifies the data polling.
I didn't test specifically but does your resampler work for both scaling up and down the input rate to 48kHz @Wuerfel_21 ? Or only one way?
It should work both ways but I didn't really test upsampling. Upsampling should work fine at any ratio, even really low ones.
If you have 32 or 44.1 source, you should really just send it as-is for ideal quality. These three rates (32/44.1/48) are all required, so they'll always work.
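For anyone following along, the general shape of a phase-accumulator cubic resampler (this is just the textbook Catmull-Rom idea in Python, not Wuerfel_21's actual PASM code — it works for both ratios below and above 1, subject to the ~2x downsampling limit mentioned above since there's no pre-filtering):

```python
def catmull_rom(y0, y1, y2, y3, t):
    # Catmull-Rom cubic interpolation between y1 and y2, 0 <= t < 1
    a = -0.5 * y0 + 1.5 * y1 - 1.5 * y2 + 0.5 * y3
    b = y0 - 2.5 * y1 + 2.0 * y2 - 0.5 * y3
    c = -0.5 * y0 + 0.5 * y2
    return ((a * t + b) * t + c) * t + y1

def resample(samples, ratio):
    # ratio = input_rate / output_rate; <1 upsamples, >1 downsamples
    out = []
    phase = 0.0
    while int(phase) + 3 < len(samples):
        i = int(phase)          # integer part indexes the history window
        t = phase - i           # fractional part drives the interpolation
        out.append(catmull_rom(samples[i], samples[i + 1],
                               samples[i + 2], samples[i + 3], t))
        phase += ratio
    return out
```

The integer/fractional phase split here corresponds to the `resample_ratio_int` / `resample_phase_frac` style of state mentioned later in the thread.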
Yeah ideally you should but if you want some general video driver that works after init with whatever you may throw at it, then resampling everything you send is still good. Changing dynamically requires extra APIs and manipulation of timing parameters, recalc of CTS/N etc, updated infoframe. All doable yes but just more complex for now.
You need to recalculate the resampler's parameters if the input rate changes, anyways.
Oh no, now I'm scrambling! I missed the skipf inside the REP loop in my table of pixel doubling execution times when determining the clocks, which means it takes longer to execute: 9 instead of 7 clocks per doubled input pixel. Will now have to find a way to code a special 24bpp doubling mode path with just this code copied somewhere else without the skipf. Or maybe I could copy the REP into the skipf instruction with adjusted skip masks for 24bpp gfx mode. The second rep should cancel the first.
Only executed instruction shown for 24bpp:
Ok good to know. Maybe that's easier than changing the HDMI output stuff though. Will need to figure out a pathway to make changes on the fly. Some long in HUB could be polled in the encoder to see if things need changing perhaps. I still have about 15 longs free in the LUT RAM to add a little bit more audio code. Hopefully that would suffice for supporting such parameter changes. Maybe one or two need to be reserved for audio poll calls and a chain loader if needed.
deleted
No, the first thing I need to do is the read before I write back.
Here's the full related code including the skip patterns and lengths that get patched in - I might be able to extend the skip pattern length to have 4 or 5 more bits, though, if I treat instrcount as a nibble instead of a byte.
I can probably branch out at the first 32 bpp only instruction (change to a jump and make the skipf mask skip everything until then), then the jump target can be made fully 32bpp specific code, only overhead is one extra branch and rep/skipf incurred once.
Update: Here's the plan, bonus is I can probably just extend the skip pattern length to skip the faster32bpp code at the end for other depths though I need to check it's doable - should be:
p.s. in fact I can see another optimization of the final wrlut if I share it with the prior rep loop code.
Why do you need to support changing the audio sample rate without losing the audio link? Nobody cares about briefly losing the video signal when the video resolution changes.
Someone does care, the latest HDMI spec has the ability to switch resolutions without dropout, but I think that requires FRL shenanigans? The 2.x spec isn't public... cowards!
Yeah it'd be good to keep audio running independent of the source material sample rate change. That's why keeping it fixed at 48kHz the whole time is not a bad idea.
If it's just a local sample rate change in the P2 where COGs now send at a different rate into the repo while keeping the output rate the same, Ada mentions resampling should work at rates below 48kHz, so it's probably a change between 32/44.1/48kHz output rates by the audio generator COG only and the HDMI COG wouldn't need to know much about it.
If however it's an output (HDMI source) sample rate change between 32/44.1/48kHz, I'd expect you'd need to recalc the CTS/N clock regen frame parameters and re-encode a new clock regen packet (updated encoding could be done by a caller API), and alter the SPDIF channel status. Probably just a few other local variables used by the overlay would change. Things such as resample_phase_frac, resample_ratio_int in Wuerfel_21's code. Not quite sure yet on the full extent but it's probably reasonably limited and simple to update the COG params. There might still be some minor click or pop while it changes at the output if these changes are not perfectly synchronized with the audio sample packets being sent, probably also depends a bit on how the HDMI sink device deals with this sort of operation.
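For reference, the CTS/N relationship itself is simple — the HDMI spec defines fs = f_TMDS × N / (128 × CTS) and recommends fixed N values for the three baseline rates, so the recalc on a rate switch is just (a sketch; real drivers may alternate CTS values when the division isn't exact, e.g. 44.1kHz on most clocks):

```python
# recommended N values from the HDMI spec for the baseline rates
N_TABLE = {32000: 4096, 44100: 6272, 48000: 6144}

def cts_for(tmds_hz, sample_rate_hz):
    # fs = tmds_hz * N / (128 * CTS)  =>  CTS = tmds_hz * N / (128 * fs)
    n = N_TABLE[sample_rate_hz]
    return (tmds_hz * n) // (128 * sample_rate_hz)

# at a 25.2 MHz TMDS clock, both 48 kHz and 32 kHz give CTS = 25200
```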
Forget about smoothly changing audio rates, you will not get it to happen without a glitch. You'd need to re-calculate the resampler rate at exactly the right moment. On top of that, the generator cog also needs to switch precisely at the right moment without a glitch. (Due to buffering, this may not be the same moment).
To avoid this, you'd need a system where the HDMI cog determines the sample clock. With the current mailbox system, the generator determines the clock, but the HDMI cog needs to know it exactly to generate the correct 48k output.
Maybe some samples could be nulled out / muted while changing, but agree that probably one more or one less sample might be sent during the rate switchover and this might affect the output device if it can't handle that condition.
UPDATE: It could be another reason to add a volume control param to the output - which I plan to do.
UPDATE2: Hmm those 15 free LUT longs are starting to look a bit more precious now.
It wouldn't go to static or anything like that, the sink should be designed to avoid that.
Also don't feel married to the one design that I came up with (pin mailbox->resampler->encoder) specifically to cater to the existing sound chip emulators (which produce sound in real-time, at odd rates and generally don't have a lot of memory to spare). There's obviously a bigger design space there.
I wonder what sort of other end-of-the-line DSP could be done if one skips out on resampling. Maybe I could generate 3D sound as WXYZ channels in ambisonics B-Format (I've been reading up on that recently) and have the video cog do the decoding to the selected speaker layout (to save on memory and CPU time elsewhere).
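For the curious, first-order horizontal B-Format decoding to an arbitrary speaker ring really is cheap per output channel — this is the basic "velocity" decode (a sketch of the general idea only, ignoring Z and any shelf filtering or layout weighting a real decoder would apply):

```python
import math

def decode_bformat(w, x, y, azimuths_deg):
    """Decode W/X/Y to one gain-weighted feed per speaker azimuth:
    speaker = 0.5 * (sqrt(2)*W + X*cos(az) + Y*sin(az))."""
    outs = []
    for az in azimuths_deg:
        a = math.radians(az)
        outs.append(0.5 * (math.sqrt(2) * w
                           + x * math.cos(a)
                           + y * math.sin(a)))
    return outs
```

Per speaker that's two multiply-accumulates plus a constant term, which is why pushing the decode into the video cog and keeping only WXYZ upstream could be attractive.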
Doesn't need to be long term, really just doing some amount of experimenting for now to see if I can get something to work at least and may fit within the constraints of my own model/COG resource limits.
Sounds good too. Seems weird to put a lot of audio burden in the video COG but HDMI and audio is coupled somewhat closely together. The other way to go is to have some sort of pull model where the HDMI pulls out audio samples at its own rate from some small FIFO and the audio source needs to monitor this FIFO's fill/drain levels and feed it with samples. Might create more jitter this way though as it would need to adapt its output rate slightly to keep the level from draining/filling too quickly. Probably messy to manage the buffer.
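The pull model could look something like this (a minimal sketch with made-up names — in reality it would be a small HUB ring buffer, with the source cog adjusting its rate based on the fill level):

```python
from collections import deque

class AudioFifo:
    """Tiny pull-model buffer: the HDMI cog pops at its fixed 48 kHz pace,
    the source cog watches fill_ratio() and nudges its own output rate."""

    def __init__(self, depth=16):
        self.buf = deque(maxlen=depth)
        self.depth = depth

    def push(self, sample):
        self.buf.append(sample)

    def pop(self):
        # underrun: emit silence rather than stall the HDMI cog
        return self.buf.popleft() if self.buf else 0

    def fill_ratio(self):
        return len(self.buf) / self.depth
```

The jitter concern above comes from that feedback loop: the source has to keep `fill_ratio()` near 0.5, so its effective rate wobbles around the nominal one.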