So after hacking up some example code for HDMI-compatible audio, the good news is that there will be sufficient time to TERC4 encode two data packets per scan line. I found a way to get down to around 2710 clock cycles to encode each data island packet, which is 271 pixels' worth of the scan line. So during the time we are actively streaming 640 RGB pixels at VGA resolution and the COG is otherwise unoccupied, there is enough time to construct and encode up to two packets and accumulate stereo audio samples at rates <=96kHz, plus do other required housekeeping. Each audio sample actually processed adds about 84 more clocks (~9 pixels) to this total. This would allow audio to be present on every scan line (which is important to help keep latency down) and still leave enough time to dynamically encode another type of packet such as other infoframes etc.
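To put some rough numbers on that budget (purely illustrative arithmetic, assuming 10 sysclks per pixel, which is why 2710 clocks works out to ~271 pixels):

/* Rough per-scan-line clock budget sketch using the figures above.
   Assumes 10 sysclks per pixel, i.e. sysclk = 10x the 25.2MHz pixel clock. */
#include <stdio.h>

int main(void)
{
    const int budget          = 640 * 10;  // free sysclks during active video
    const int packet_cost     = 2710;      // ~clocks to TERC4 encode one packet
    const int per_sample_cost = 84;        // ~extra clocks per audio sample

    // Worst case: two packets plus four samples (96kHz needs 3-4 per line)
    int worst = 2 * packet_cost + 4 * per_sample_cost;

    printf("budget=%d worst=%d margin=%d sysclks\n", budget, worst, budget - worst);
    return 0;
}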
To expand on the last sentence, my reading of the spec is that only one data packet is needed per line for <= 96kHz stereo audio and 640x480 or 720x480/576 video. This packet could contain up to four audio samples but only one or two samples are required for 44.1 kHz or 48 kHz audio as line frequency is 31.5 kHz. Is that correct?
That is my understanding, yes. Each 32 pixel wide audio packet within the data island has sufficient capacity to send from 0 to 4 24-bit stereo audio samples. We are sending video lines at 31.5kHz so there is the bandwidth for streaming up to 126,000 samples per second, which therefore supports from 32kHz up to 96kHz audio (at the standard audio sampling rates of 32/44.1/48kHz etc), but not the next 4x oversampled rates of 128/176.4/192kHz.
If we send two of these audio packets in the island per scan line during horizontal blanking we can get up to 192kHz stereo audio. Alternatively that capacity can be used for multichannel audio of up to 8 channels (at lower rates). But sending two audio packets on every line becomes problematic because there won't be enough time to encode a third packet dynamically, at least given the way my code is currently constructed. I'm thinking that if we needed 3 packets per line we could possibly send pre-encoded static packets for the third packet, already stored in hub. Or we could occasionally skip an audio packet, substitute another dynamically encoded packet type in its place, and catch the audio up in later packets, because we do have a little bit of bandwidth left over. As long as this is only done infrequently and the downstream device has an audio buffer of its own to compensate for the variation, nothing should be lost.
Right now in my example I use an audio sample fifo in hub RAM. After each access I write back the current read (tail) pointer, plus an 8 bit incrementing counter packed into the same long, where all other COGs can see it. My COG will take either one or two samples from this fifo on each scan line according to an accumulator I maintain, adding to and subtracting from it, with its value indicating when I am allowed to read and send another sample. To remain in sync, COGs that write existing pre-sampled audio content (like from an SD card, or generated by something like SIDCog etc) just need to keep the fifo fed so it doesn't underflow, which is fairly easy. However this may be more problematic for COGs that actually want to sample audio on the fly from IO pins' ADCs synchronously at 44.1kHz etc and send it via HDMI; they would need to use the 25.2/27MHz HDMI clock being output on the pin and divide it down themselves appropriately for their own clocking needs. Unfortunately only 48kHz divides evenly into 25.2MHz; the other standard rates are not integer ratios. Another audio timing model, where the HDMI audio transmit decision gets made by another COG, might be needed for that type of application, but that can be looked at later.
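In case it helps explain the accumulator idea, here's a minimal sketch of the per-line decision (names and the fifo layout are just illustrative, not the actual driver code):

/* Fractional-rate accumulator sketch: decide how many samples to pull
   from the hub fifo on each scan line so the long-term average matches fs.
   Illustrative only, not the real driver's layout. */
#include <stdint.h>

#define LINE_RATE   31500u      // scan lines per second (VGA/480p timing)
#define SAMPLE_RATE 44100u      // audio sample rate being streamed

static uint32_t accum;          // running remainder

// Returns how many stereo samples to read from the fifo this scan line
// (1 or 2 for 44.1/48kHz; 0..4 in general).
int samples_this_line(void)
{
    int n = 0;
    accum += SAMPLE_RATE;           // one line's worth of audio is now owed
    while (accum >= LINE_RATE) {    // pay it back one sample at a time
        accum -= LINE_RATE;
        n++;
    }
    return n;
}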
Update: Just realised that we could use the core clock too, which gives 10x finer resolution than the HDMI pixel clock, and this also allows precise access to 32kHz, which is a factor of 252MHz (divide by 7875) but not of 25.2MHz. 270MHz is a perfect multiple of 48kHz too, which is nice. Extracting 44.1kHz precisely is not possible from 252/270MHz when using a simple counter without a PLL, but we can get close: 44102.2Hz (252MHz/5714) and 44103.2Hz (270MHz/6122) in VGA and 480p/576p modes respectively, and we could try to adjust the N and CTS fields in the clock regeneration packet accordingly. Needs more thought...
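A quick divider search reproduces those numbers (again, just illustrative arithmetic):

/* Find the closest integer sysclk divider for a target audio rate and
   report the actual rate and error. */
#include <stdio.h>

static void nearest(double sysclk, double target)
{
    long div = (long)(sysclk / target + 0.5);   // round to nearest divider
    double actual = sysclk / div;
    printf("%.0fMHz / %ld = %.1fHz (%+.1f ppm)\n",
           sysclk / 1e6, div, actual, (actual - target) / target * 1e6);
}

int main(void)
{
    nearest(252e6, 44100.0);   // VGA mode sysclk  -> 252MHz/5714 = 44102.2Hz
    nearest(270e6, 44100.0);   // 480p/576p sysclk -> 270MHz/6122 = 44103.2Hz
    nearest(252e6, 48000.0);   // exact: 252MHz/5250
    nearest(252e6, 32000.0);   // exact: 252MHz/7875
    return 0;
}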
@rogloh
Have you tried PulseView? It mentions decode support for a few different USB layers and it looks as though it supports a couple different Saleae models.
I did look at it briefly once some time back on my Linux machine IIRC. Their home page lists TMDS as a possible candidate for a future protocol decoder. If they had it right now I'd probably go back to trying it.
HDMI sinks can generate the audio clock from the video clock using the following formula:
fs = (fTMDS * N) / (128 * CTS)
where
fs = audio sampling frequency
fTMDS = TMDS pixel frequency = P2 sysclk/10
N and CTS are two integers
The HDMI v1.4 spec has tables of N and CTS starting at actual page 137 and parts are reproduced below:
[table of recommended N and CTS values for TMDS clocks of 25.2, 25.2/1.001 and 27 MHz]
N and CTS above result in perfect fs values, thus a P2 sysclk of 252 or 270 MHz can output 32/44.1/48 kHz audio (and multiples of the latter two) with no error.
I have included 25.2/1.001 as that is 25.175, but I have omitted 27*1.001.
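As a sanity check of the formula, here are a few of the commonly quoted pairs (quoted from memory, so please verify against the spec tables) plugged in:

/* fs = (fTMDS * N) / (128 * CTS) -- check a few N/CTS pairs.
   The N/CTS values below are the commonly quoted spec recommendations,
   listed here from memory; verify against the HDMI 1.4 tables. */
#include <stdio.h>

int main(void)
{
    struct { double ftmds; unsigned n, cts; } t[6] = {
        { 25.2e6, 4096, 25200 },   // 32 kHz
        { 25.2e6, 6272, 28000 },   // 44.1 kHz
        { 25.2e6, 6144, 25200 },   // 48 kHz
        { 27.0e6, 4096, 27000 },   // 32 kHz
        { 27.0e6, 6272, 30000 },   // 44.1 kHz
        { 27.0e6, 6144, 27000 },   // 48 kHz
    };
    for (int i = 0; i < 6; i++) {
        double fs = t[i].ftmds * t[i].n / (128.0 * t[i].cts);
        printf("fTMDS=%.1fMHz N=%u CTS=%u -> fs=%.1fHz\n",
               t[i].ftmds / 1e6, t[i].n, t[i].cts, fs);
    }
    return 0;
}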
Yeah, I know those numbers and use them in the audio clock regeneration packet. The problem is deriving the clock for sampling on the P2. We can send data at the right audio rate precisely over the TMDS bus, but how do you sample at exactly 44.1kHz when you only have a 270MHz or 252MHz P2? We have to sample at almost 44100Hz, but not quite (44102 etc), and somehow compensate by using different N/CTS numbers accordingly. The output device would also have to play back slightly fast to match our rate or there will eventually be a buffer overflow.
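To put a number on the mismatch (illustrative arithmetic only):

/* Drift if we source samples at 44102.2Hz while the sink consumes at
   exactly 44100Hz (numbers from the VGA-mode divider above). */
#include <stdio.h>

int main(void)
{
    double source_rate = 252e6 / 5714.0;            // ~44102.2 Hz
    double sink_rate   = 44100.0;
    double surplus     = source_rate - sink_rate;   // ~2.2 samples/s

    printf("surplus %.2f samples/s, a 1000-sample buffer fills in ~%.0f s\n",
           surplus, 1000.0 / surplus);
    return 0;
}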
When it comes to sample rate, the exact rate is not anywhere near as important as people tend to think. What's really important is stability. If it's crystal locked then you're good to go. Choose a suitable nearby frequency and use that.
This principle can be even looser for video frame rate. Exact numbers are entirely unimportant ... for human presentation.
It's a good question on how the receiver/TV/monitor is supposed to handle things. It must have to resample everything it receives. The audio clocks are never going to match.
They could if the output audio clock is derived from the incoming TMDS clock on the receiver. The receiver knows exactly how many samples arrive in a given number of TMDS clocks based on the N and CTS parameters it is provided. If it synthesizes a clock using a PLL with appropriate dividers from this information it should be able to regenerate the original clock (or close to it) and track the incoming audio rate quite closely. At least that's how I imagine it might work.
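If that's really how the sink tracks it, then in principle we could even pick an N/CTS pair that describes our actual rate rather than the nominal one. Just a sketch, and whether real sinks accept such off-table values is untested:

/* Illustrative: an N/CTS pair describing the actual P2 sample rate
   (252MHz/5714 = 44102.2Hz) instead of the nominal 44100Hz, assuming the
   sink regenerates its audio clock from fTMDS*N/(128*CTS). */
#include <stdio.h>

int main(void)
{
    double ftmds     = 25.2e6;          // TMDS clock in VGA mode
    double actual_fs = 252e6 / 5714.0;  // what the P2 divider really produces

    unsigned n = 6400, cts = 28570;     // 6400/28570 == 1280/5714 exactly
    double regen_fs = ftmds * n / (128.0 * cts);

    printf("actual %.3fHz, regenerated %.3fHz\n", actual_fs, regen_fs);
    return 0;
}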