Testing new P2 EVAL RevB board with glob-top chip - Page 3 — Parallax Forums

Testing new P2 EVAL RevB board with glob-top chip


Comments

  • TonyB_TonyB_ Posts: 2,196
    edited 2019-09-25 10:06
    rogloh wrote: »
    @garryj we should also check USB works with 270MHz operation too as that allows HDTV resolutions over HDMI instead of VGA.

    Yes, TonyB and JRetSapDoog, I agree that 252MHz would be better than 250MHz for sending VGA over DVI/HDMI. 250MHz happened to be the frequency set up in Chip's original HDMI demo code and it worked on my setup, though officially VGA is meant to use a pixel rate of 25.175MHz. Being as close as possible to official VGA timing should help P2s support a larger range of display devices, in particular HDTVs.

    As an aside, 252 = 2 x 2 x 3 x 3 x 7 and therefore has a good range of factors: /2, /3, /4, /6, /7, /9, /12.
    rogloh wrote: »
    Today I played around with the frequencies of the revision B P2 and managed to drive my HDTV over an HDMI cable with it clocking at 270MHz. The resolution is 720x480@60Hz (480p60) which is an official TV resolution with a pixel rate of 27MHz. My Pioneer plasma recognized it and scaled accordingly.

    Somebody did an FPGA emulation of the ZX Spectrum with HDMI output at 280MHz, i.e. a pixel clock of 28MHz, so that all the original video timings could be multiplied by exactly four. I assume the HDMI timings were based on 270MHz with extra horizontal blanking.

    Chip has already tested 1280x720p24 at 297MHz successfully and I've been thinking what other HDMI modes might be possible. Small 800x480 LCD displays have 525 total lines at 60Hz with typically 1056 pixels/line using a pixel clock of 33.264MHz. An HDMI clock of 333.3MHz should work and there is probably latitude for slower clocks with less blanking.
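
    The clock figures in this post come from simple arithmetic: pixel clock = total pixels per line x total lines x refresh rate, and the P2 then runs at ten sysclocks per pixel for HDMI. A minimal Python sketch of that arithmetic, using only the totals quoted in this thread (the function names are made up for illustration):

```python
def pixel_clock_hz(h_total, v_total, refresh_hz):
    """Pixel clock from the total (active + blanking) frame size."""
    return h_total * v_total * refresh_hz

def hdmi_sysclock_hz(pixel_hz):
    """Ten sysclocks per pixel, one per bit of each 10-bit symbol."""
    return 10 * pixel_hz

# 800x480 LCD timing quoted above: 1056 pixels/line, 525 lines, 60Hz
lcd = pixel_clock_hz(1056, 525, 60)
# 720p24 inside a 1650x750 container, as tried later in the thread
p24 = pixel_clock_hz(1650, 750, 24)

print(lcd / 1e6, hdmi_sysclock_hz(lcd) / 1e6)  # pixel MHz, sysclock MHz
print(p24 / 1e6, hdmi_sysclock_hz(p24) / 1e6)
```

    The same two functions can be used to see how much blanking has to be trimmed to bring a mode under a given sysclock.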
  • roglohrogloh Posts: 5,852
    edited 2019-09-20 14:16
    TonyB_ wrote: »
    Chip has already tested 1280x720p24 at 297MHz successfully and I've been thinking what other HDMI modes might be possible.
    Hmm, might have to try that out too. I know my plasma's scaler is meant to be able to support taking in 1080p24 over HDMI so with some luck it might also support 720p24 as well. This model is natively a 1368x768 panel. Worth a shot for even higher resolution. :smile:

    Update: well I couldn't get it to work at 720p24 after a few attempts. I might possibly have messed something up in the code or this particular HDTV doesn't like it. It just displays "This signal is not supported". I tried outputting 1280x720 using an outer "container" of 1650x750 pixels at 24Hz with the P2 clocked at 297MHz using a variety of sync widths and positions but it just didn't seem to like it.
  • evanhevanh Posts: 16,070
    edited 2019-09-20 16:50
    24 Hz doesn't work on monitors. It sucks to use interactively anyway.

    PS: Here's a link from some source code I just found in my collection - http://forums.parallax.com/discussion/comment/1463078/#Comment_1463078
  • TonyB_TonyB_ Posts: 2,196
    edited 2019-09-20 19:23
    rogloh wrote: »
    TonyB_ wrote: »
    Chip has already tested 1280x720p24 at 297MHz successfully and I've been thinking what other HDMI modes might be possible.
    Hmm, might have to try that out too. I know my plasma's scaler is meant to be able to support taking in 1080p24 over HDMI so with some luck it might also support 720p24 as well. This model is natively a 1368x768 panel. Worth a shot for even higher resolution. :smile:

    Update: well I couldn't get it to work at 720p24 after a few attempts. I might possibly have messed something up in the code or this particular HDTV doesn't like it. It just displays "This signal is not supported". I tried outputting 1280x720 using an outer "container" of 1650x750 pixels at 24Hz with the P2 clocked at 297MHz using a variety of sync widths and positions but it just didn't seem to like it.

    Here is Chip's code:
    http://forums.parallax.com/discussion/comment/1475209/#Comment_1475209
    http://forums.parallax.com/discussion/comment/1475226/#Comment_1475226

    EDIT:
    Note the monster monitor.
  • RaymanRayman Posts: 14,789
    I hope somebody figures out how to do audio over HDMI...
  • @TonyB_ Thanks for the links. It still didn't work on my Pioneer HDTV with Chip's timing, same result. I guess 720p24 is not always going to be possible and will be monitor dependent. It's certainly not a "standard" video mode.

    I'll go back to the easier 480/576p modes with HDMI for now. We can always use component for any real HD resolution modes, and that will allow far higher pixel clocks.
  • Rayman wrote: »
    I hope somebody figures out how to do audio over HDMI...
    Yes that would be awesome on the P2. I've done it in an FPGA in the past by mostly using someone else's Verilog code and got it working nicely on the FleaFPGA Ohm board with the P1V using HDMI. It will be interesting to see if it's fully doable in the P2 in software. You do need to compute some ECC stuff dynamically IIRC and get all the infoframe data format right. Timing is everything.
  • evanhevanh Posts: 16,070
    Since audio is transported during horizontal blanking, the timing, at least, should be easy to fit.
  • It would be rather nice if an HDMI COG could do some of the audio processing work in the vertical blanking time, when the streamer is not active and the COG gets full access to hub memory and can fill the LUT with audio sample data, if that memory space is also available. Then, while it is streaming out full scanlines of video from the hub, it could also potentially compute ECCs and build up data islands to be sent. How much of it fits there in terms of execution time, and how much additional latency that method adds to audio, needs to be examined; I'm not sure yet if it's doable. Certainly there is plenty of COG RAM space free in a simple frame buffer type of COG for other functions and for holding small look up tables etc. It might not all fit in from an execution time perspective but it's certainly worth a look to see if it could save a COG. Otherwise a separate COG could be allocated for audio data processing and just write all its data islands into hub RAM for the HDMI COG to also stream out during horizontal blanking.
  • One thing I was wondering... does the HDMI version of video support reduced blanking like the VESA spec for VGA does?
    Or does it need that horizontal blanking time for audio transmission?
  • I did a null data island on the ES1. forums.parallax.com/discussion/comment/1475526/#Comment_1475526 Decided it would be a waste of effort to do audio on ES1. With the ES2, the cog can form the audio packets while the streamer handles video, as Roger mentioned.

    HDMI can't load the FIFO much. The max is 1 long every 10 clocks. Block move yields to the FIFO.
  • evanhevanh Posts: 16,070
    Thinking about the amount of processing for the island packet, it's no more than 64 bits of audio data per scan line. That can't take much to bundle up.

    Saucy,
    How is the data island handled around hsync? Is there a proper sync any longer or is blanking doing the real work here?

  • evanhevanh Posts: 16,070
    HDMI can't load the FIFO much. The max is 1 long every 10 clocks. Block move yields to the FIFO.
    That's right, plenty of hubRAM bandwidth for the cog to process on the fly and still place everything back in hubRAM for the streamer to seamlessly pick up later.

  • roglohrogloh Posts: 5,852
    edited 2019-09-21 08:23
    Tubular wrote: »
    One thing I was wondering... does the HDMI version of video support reduced blanking like the VESA spec for VGA does?
    Or does it need that horizontal blanking time for audio transmission?

    It does need to leave some horizontal blanking for audio packets to be sent. You need to be able to send at least one data island packet of 32 pixels duration, plus around 30 more pixels of overhead for guard bands, preambles etc. In the higher resolution modes, which have large amounts of extra blanking space on each scanline, I think it should be possible to support some amount of reduced blanking and still send some audio packets.
    evanh wrote: »
    Thinking about the amount of processing for the island packet, it's no more than 64 bits of audio data per scan line. That can't take much to bundle up.

    You'd think so but there is some bitwise interleaving across RGB channels going on in both time and space, plus dynamic error correction codes plus 4b to 10b conversion, plus packet header computation and ECC on that too. Some of the packet header fields change, some are fixed. It's quite a bit of work to do but hopefully a packet can be generated in the 31.7us scan line budget for 480p. The bitwise stuff will certainly slow things down - perhaps some look up tables may help a little there for translation. You need to take two bits at a time from each of the four 7 byte sub packets in a packet sent, plus compute an additional parity byte per sub packet, and then use them to feed different bit lanes of each nibble over two of the colour channels. Repeat for all 4 sub packets to build up the complete nibbles to be sent on each channel, translate from 4b to 10b codes using TERC4 encoding and also compute an ECC at the same time on these nibble bit lanes. There is also a 24 bit packet header sent over a bit of the third channel in parallel to this, which also includes the HSYNC and VSYNC bits in other lanes, and includes 8 more parity bits from another ECC on top. Simple and clear as mud. :tongue: Oh and don't forget to shift those 192 bit channel status word bits generated per audio block serially in the audio packet header before you compute the parity ECC.
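
    A rough Python sketch of just the interleaving step described above, to make the "two bits at a time from each of the four sub packets" idea concrete. It assumes each sub packet is already a 64-bit value (7 data bytes plus the parity byte), that bits are taken LSB first, and that even bits go to one channel and odd bits to the other; those orderings and the function name are assumptions for illustration and should be checked against the HDMI spec. The parity/ECC computation and the TERC4 step are not shown.

```python
def interleave_subpackets(subpackets):
    """Spread four 64-bit sub packets over 32 pixel clocks.

    Returns a list of 32 (ch1_nibble, ch2_nibble) pairs. Bit lane n of
    each nibble carries sub packet n (assumed ordering, see lead-in).
    """
    if len(subpackets) != 4:
        raise ValueError("expected exactly four sub packets")
    out = []
    for t in range(32):                      # 32 pixel clocks per packet
        ch1 = ch2 = 0
        for lane, sp in enumerate(subpackets):
            ch1 |= ((sp >> (2 * t)) & 1) << lane      # even bit
            ch2 |= ((sp >> (2 * t + 1)) & 1) << lane  # odd bit
        out.append((ch1, ch2))
    return out

# Sub packet 0 all ones, the rest zero: only lane 0 is ever set.
pairs = interleave_subpackets([(1 << 64) - 1, 0, 0, 0])
print(pairs[:4])
```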

    @cgracey, did we get a TERC4 encoder in the P2 rev B as well as 8b to 10b? I can't recall if you added that too. No drama if we didn't; we can use a small look up table to map the TERC4 bits to 10b symbols.
    I did a null data island on the ES1. forums.parallax.com/discussion/comment/1475526/#Comment_1475526 Decided it would be a waste of effort to do audio on ES1. With the ES2, the cog can form the audio packets while the streamer handles video, as Roger mentioned.

    HDMI can't load the FIFO much. The max is 1 long every 10 clocks. Block move yields to the FIFO.

    Ok that is great that there is some hub bandwidth left in this HDMI COG, should be enough to read an audio buffer for new samples dynamically rather than all done once at the start of the frame, helping to reduce latency. I was thinking there might not be any hub access during active video scanlines as it would interfere with streaming, but it thankfully appears I am wrong. I think I was just remembering the bit bang HDMI where the COG couldn't touch the hub. I hope it all fits into a single COG in the end, makes a lot of sense to try.
  • evanhevanh Posts: 16,070
    rogloh wrote: »
    ... but hopefully a packet can be generated in the 31.7us scan line budget for 480p.
    At 250 MHz clock rate, that makes 7925 sysclocks per line. You could write it in Python. :P

  • roglohrogloh Posts: 5,852
    edited 2019-09-21 12:01
    LOL evanh. Well I think there should be sufficient time but not in Python. :smile:

    Excluding the audio buffer sample test and read overhead, which is low, some perspective...

    There will be 56 bit gets + 56 bit sets to create 14 nibbles for ECC computation purposes per bit lane = 112 instructions, x 4 bit lanes = 448 instructions = 896 sysclocks

    Then 24 bit gets and sets for header row construction on the third channel = 96 clocks.
    Insertion of VSYNC/HSYNC flag state over 2x32 bits = ?? clocks. This pattern should remain fixed for each packet in the island if they start at the same scanline offset every time which will help.

    Then another 32 bit sets * 12 lanes = 384 instructions = 768 clocks to fill in 4b TERC values on three HDMI channels, using data extracted in prior step to avoid second bit get operations.

    Then 9 x 31 nibble oriented ECC/CRC generation steps, likely via a LUT approach, though it could also be bit oriented if that method has any chance to save cycles and avoid "double handling" of bits. Each LUT based iteration needs a COG/LUT indexed read and XOR and shifting etc, probably 4-5 instructions per iteration there x 279 times, so at least 2232 sysclocks and could be higher - this is the big one.

    Then 4b to 10b lookup table translation * 96 times and insertion into 32 3x10b longs written into HUB or COG RAM as immediates. If Chip has added TERC4 encoding this step might not be needed and we can work with the 4 bit values instead. I have a feeling it was requested and may be in the P2 already.

    I am not accounting for any loop register shift overhead or dynamic code patching in the loop and there won't be enough space to do it fully unrolled in COG RAM, though maybe some hub exec could help us there if we can run that while still streaming HDMI.

    Gut feeling tells me audio should probably fit in the time budget (we have almost 125 MIPS at our disposal!) but it will still take a reasonable amount of time to do because it is bit-oriented. I need to hack up some sample code to see how much it is. Some optimizations like preloading static values where possible will help too.

    EDIT: adjusted values above, as only 9 of the 12 bit lanes require ECC. By the way all this was just for a single packet, though there may be cases where you need to send a second type of packet on the same scanline as the audio data, such as AVI info frames, clock regen packets etc, or when multichannel or high sample rate audio is present. That will obviously need more CPU cycles again, and multichannel (8 channel) audio or high sample rates >96kHz may well push us over the limit. AVI info frames probably only need to be sent once per frame and may be rather static for each resolution used. The clock regen packets will need some attention as they need to be sent more frequently.

    EDIT2: realised some of the ECC stuff may still be very wrong above, I think the ECC can actually be done on the raw data bytes before encoding all the packets bits into different TERC4 nibbles so that should save a lot of these cycles. I'm too rusty on this but I'll figure it out eventually...
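
    For anyone wanting to play with the numbers, here is a small Python tally of the estimate above against the scan line budget. It only adds up the steps that have a clock count in this post (the sync flag insertion and the 4b to 10b lookup are left out because no figure is given), assumes 2 sysclocks per instruction as used above, and takes 4 instructions per ECC iteration as the low end of the guess. Per EDIT2 the ECC figure may well be too high.

```python
SYSCLK_MHZ = 250        # P2 clock for VGA/480p-style output
LINE_US = 31.7          # scan line time quoted for 480p
CLOCKS_PER_INSTR = 2    # assumption used throughout the estimate

steps = {
    "sub packet bit gets/sets": (56 + 56) * 4 * CLOCKS_PER_INSTR,
    "header row gets/sets": 96,
    "TERC4 nibble bit sets": 32 * 12 * CLOCKS_PER_INSTR,
    "ECC/CRC iterations": 279 * 4 * CLOCKS_PER_INSTR,
}

total = sum(steps.values())
budget = round(LINE_US * SYSCLK_MHZ)

for name, clocks in steps.items():
    print(f"{name:26s} {clocks:5d}")
print(f"{'total':26s} {total:5d} of {budget} sysclocks per line")
```

    On these figures the counted steps use roughly half of the line, which matches the gut feeling that it should fit but without a huge margin once loop overhead is added.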
  • evanhevanh Posts: 16,070
    edited 2019-09-21 10:35
    Bear in mind the streamer hogs the FIFO so hubexec is off limits.

    Spare lutRAM is best used for code, keep the data in cogRAM. ALTxx byte, nibble and bit indexing ops are cool. ALTB bit indexing with the TESTxx and BITxx instructions is particularly flexible.

  • evanhevanh Posts: 16,070
    edited 2019-09-21 15:11
    Certainly sounds like quite a lot to make it all come together. Better suited to custom hardware I guess.

    On the matter of features, plain PCM stereo is more than enough. In fact, even that is only because the HDMI cog isn't likely to be doing much else. If it can be put to another use, eg: sprites, then forget audio on HDMI and just have analogue outs.

  • evanh wrote: »
    Thinking about the amount of processing for the island packet, it's no more than 64 bits of audio data per scan line. That can't take much to bundle up.

    Saucy,
    How is the data island handled around hsync? Is there a proper sync any longer or is blanking doing the real work here?

    The sync values are still sent during a data island. 2 bits of the channel 0 TERC4 are reserved for the sync. Of course, in the code I positioned the data island in a place where the sync would not change.

    hamsterworks.co.nz/mediawiki/index.php/Minimal_HDMI
  • cgraceycgracey Posts: 14,231
    edited 2019-09-21 16:16
    In our HDMI implementation, 8-bit video data goes through 8-to-10-bit TMDS conversion, while any custom data must be expressed as raw 10-bit values.

    I didn't know about TERC4 when I wrote the Verilog code.
  • evanhevanh Posts: 16,070
    edited 2019-09-21 16:44
    The sync values are still sent during a data island. 2 bits of the channel 0 TERC4 are reserved for the sync. Of course, in the code I positioned the data island in a place where the sync would not change.
    Hmm, that doesn't mean a thing to me. Is the sync still timing sensitive or just a marker somewhere in this island?

    And, reading what Chip just wrote, does that mean the existing implementation uses some mechanism other than this TERC4?
  • evanh wrote: »
    The sync values are still sent during a data island. 2 bits of the channel 0 TERC4 are reserved for the sync. Of course, in the code I positioned the data island in a place where the sync would not change.
    Hmm, that doesn't mean a thing to me. Is the sync still timing sensitive or just a marker somewhere in this island?

    And, reading what Chip just wrote, does that mean the existing implementation uses some mechanism other than this TERC4?

    Timing still matters. The sync is just encoded in a different format. Although it might be necessary to position the island so that sync signals don't change during a guard band.

    The hardware only does TMDS encoding. Everything else, including syncs, must be sent as raw 10 bit data. One of the things that would have made HDMI audio on RevA so annoying is that the streamer wouldn't take 3 bits at a time. So sending 3 independent channels would mean running the streamer 4 bits at a time. That means that only 8 bits per channel can be packed into a long. So the 10 bit words would need to be repacked to 8 bits.

    We don't need a TERC4 hardware encoder, just a look-up-table to convert 4 bits to 10 bits. Then we pack 3 of the 10 bit words into 1 long.
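
    As a sketch of the look-up-table idea in Python: the 16 TERC4 code words below are written from memory of the HDMI specification and should be verified against it before use. The way three 10 bit words are placed in the 32 bit long (channel 0 in the low bits) is purely a placeholder layout for illustration, not the streamer's actual raw data format, and the function names are made up.

```python
# TERC4: 4 bit value -> 10 bit code word (verify against the HDMI spec)
TERC4 = [
    0b1010011100, 0b1001100011, 0b1011100100, 0b1011100010,
    0b0101110001, 0b0100011110, 0b0110001110, 0b0100111100,
    0b1011001100, 0b0100111001, 0b0110011100, 0b1011000110,
    0b1010001110, 0b1001110001, 0b0101100011, 0b1011000011,
]

def pack3(ch0, ch1, ch2):
    """Pack three 10 bit words into one long (placeholder bit layout)."""
    return ((ch2 & 0x3FF) << 20) | ((ch1 & 0x3FF) << 10) | (ch0 & 0x3FF)

def encode_island_pixel(n0, n1, n2):
    """Look up one nibble per channel and pack the three code words."""
    return pack3(TERC4[n0], TERC4[n1], TERC4[n2])

print(f"{encode_island_pixel(0b0011, 0b1000, 0b1111):030b}")
```

    On the P2 the table would sit in COG or LUT RAM, so each conversion is one indexed read per channel.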
  • evanh wrote: »
    The sync values are still sent during a data island. 2 bits of the channel 0 TERC4 are reserved for the sync. Of course, in the code I positioned the data island in a place where the sync would not change.
    Hmm, that doesn't mean a thing to me. Is the sync still timing sensitive or just a marker somewhere in this island?

    And, reading what Chip just wrote, does that mean the existing implementation uses some mechanism other than this TERC4?
    For DVI there are 4 possible combinations of sync bits. Each of these maps to a 10 bit raw sequence. So there are four 10 bit words that signal syncs.

    TERC4 maps 4 bits to 10. 4 bits input is 16 possible combinations. 2 of the bits on channel 0 are used for sync. So each sync state now has 4 possible 10 bit sequences to represent it instead of one unique 10 bit sequence. The specific sequence used depends on the other 2 bits, which come from the data island.
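
    To illustrate the counting, a tiny Python sketch that groups the 16 possible channel 0 nibbles by their sync bits. It assumes HSYNC is bit 0 and VSYNC is bit 1 of the nibble, which is an assumption here and worth checking against the spec; the function name is made up.

```python
from collections import defaultdict

def group_nibbles_by_sync():
    """Map each (VSYNC, HSYNC) state to the nibbles that can carry it."""
    groups = defaultdict(list)
    for nibble in range(16):
        sync = nibble & 0b11          # assumed: bit 0 HSYNC, bit 1 VSYNC
        groups[sync].append(nibble)   # upper 2 bits come from the island
    return dict(groups)

groups = group_nibbles_by_sync()
for sync, nibbles in sorted(groups.items()):
    print(f"sync={sync:02b}: nibbles {[f'{n:04b}' for n in nibbles]}")
```

    Each of the four sync states ends up with four candidate nibbles, so four different 10 bit sequences once TERC4 encoded.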
  • So after hacking up some example for HDMI compatible audio the good news is that there will be sufficient time to TERC4 encode two data packets per scan line. I found a way to get down to around 2710 clock cycles to encode each data island packet which is 271 pixels worth of the scan line. So during the time we are actively streaming 640 RGB pixels at VGA resolution and are otherwise unoccupied we would have enough time there to go construct and encode up to two packets and accumulate stereo audio samples at rates <=96kHz, plus do other required housekeeping. Each audio sample actually processed adds about 84 more clocks (~9 pixels) to this number. This would allow audio to be present on every scan line (which is important to help keep latency down) and still have enough time to dynamically encode another type of packet such as other info frames etc.
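
    A quick Python check of that timing claim, using only the figures in the paragraph above (2710 clocks per packet, 84 clocks per audio sample, 10 clocks per pixel, 640 active pixels). The choice of 4 samples in the example call is just an illustrative worst case, not a measured number, and housekeeping overhead is not included.

```python
CLOCKS_PER_PIXEL = 10
PACKET_CLOCKS = 2710     # encode one data island packet
SAMPLE_CLOCKS = 84       # extra cost per audio sample processed
ACTIVE_PIXELS = 640      # streamer busy, COG otherwise free

def encode_clocks(packets, samples):
    """Sysclocks needed to build the packets for one scan line."""
    return packets * PACKET_CLOCKS + samples * SAMPLE_CLOCKS

def fits_in_active_video(packets, samples):
    """True if the work fits while the active pixels are streaming."""
    return encode_clocks(packets, samples) <= ACTIVE_PIXELS * CLOCKS_PER_PIXEL

print(encode_clocks(2, 4), fits_in_active_video(2, 4))
print(encode_clocks(3, 0), fits_in_active_video(3, 0))
```

    Two packets fit with room to spare; a third packet per line would not fit inside the active region at this encoding cost.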

    The approach I used consumed 256 table entries in the LUTRAM and ~410 COG longs for completely supporting a framebuffer or line buffer type of HDMI compatible COG including audio support. The free COG space left may be handy to fit a few other features. I expect some optimizations could be found to shrink things further in size but probably not reduce the packet encoding time too much more. There are 256 remaining LUT RAM entries which could be used to hold a palette in 8 bit colour modes, but ideally this could ultimately also be made to interoperate with some type of external HyperRAM arbiter COG and help lessen some of the HUB RAM use for doing full graphic framebuffers.

    Bad news is I am still learning some P2 stuff to get this going but I think it is almost completed and I hope to have it running soon...

    The "crcbit" instruction was a lifesaver, hope that instruction fully works on RevB or this entire method is dead in the water.


  • evanhevanh Posts: 16,070
    Wow, quick work!
  • Amazing, Roger!

    That crcbit instruction was the one Cluso was so keen on, wasn't it?
  • roglohrogloh Posts: 5,852
    edited 2019-09-24 14:45
    evanh wrote: »
    Wow, quick work!
    I've spent a lot of hours on it already and am somewhat lucky as I had a Verilog HDMI compatible audio example as a reference that I got working with P1V in the past year and I already know works on my HDTV. The bonus here is that the P2 recompile/download cycle should be measured in several seconds instead of more than 15 minutes each time with compiling and downloading Lattice and Altera FPGAs as I went through before. That was painfully slow.

    If I slow down the P2 video output clock and feed the TMDS pins into my logic analyzer I hope to be able to find any problems. Though it would be really nice to be able to decode the 10b symbols in the Saleae Logic software instead of hunting for start of each pattern manually. Maybe ozpropdev's inbuilt logic analyzer could be useful for this too at some point.
    Tubular wrote: »
    That crcbit instruction was the one Cluso was so keen on, wasn't it?
    Possibly, I think it could be useful for plenty of things that require CRCs and bit shifting like data comms etc.
  • rogloh wrote: »
    Tubular wrote: »
    That crcbit instruction was the one Cluso was so keen on, wasn't it?
    Possibly, I think it could be useful for plenty of things that require CRCs and bit shifting like data comms etc.
    That little beauty got USB CRC calculations at 80MHz without lookup tables :smiley:
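
    For readers without a P2 handy, here is a Python model of the table-free, one-bit-per-step approach. The `crcbit` function mirrors my understanding of the instruction (shift the CRC right by one and XOR in the polynomial when the incoming bit differs from the CRC's low bit); treat that as an assumption and check the P2 documentation. The USB CRC16 parameters used are the standard ones: reflected polynomial 0xA001, initial value 0xFFFF, result inverted.

```python
def crcbit(crc, bit, poly):
    """One bit-serial CRC step (software model, see assumptions above)."""
    if (crc ^ bit) & 1:
        return (crc >> 1) ^ poly
    return crc >> 1

def crc16_usb(data):
    """USB data packet CRC16, processed LSB first, no lookup table."""
    crc = 0xFFFF
    for byte in data:
        for i in range(8):
            crc = crcbit(crc, (byte >> i) & 1, 0xA001)
    return crc ^ 0xFFFF

print(hex(crc16_usb(b"123456789")))
```

    The same step function works for other CRC widths by swapping in a different reflected polynomial and initial value.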
  • Cluso99Cluso99 Posts: 18,069
    Great work there Roger. :smiley:
    You’ve certainly been busy..
    I knew the crcbit instruction would come in useful in other things besides USB.
  • rogloh wrote: »
    If I slow down the P2 video output clock and feed the TMDS pins into my logic analyzer I hope to be able to find any problems. Though it would be really nice to be able to decode the 10b symbols in the Saleae Logic software instead of hunting for start of each pattern manually. Maybe ozpropdev's inbuilt logic analyzer could be useful for this too at some point.

    @rogloh
    Have you tried PulseView? It mentions decode support for a few different USB layers and it looks as though it supports a couple different Saleae models. I've never used one of Saleae's LA's, though I remember seeing some recorded video of their software when the first models came out...definitely to this day looks slick, but maybe this is a viable alternative for things their software just doesn't do (yet?).