HDMI added to Prop2


Comments

  • Roger
    Ignore my posted code, it's wrong!
    Here's your code with some editing
    getnib evennibble,data,#0
    alts evennibble, #table1
    wrlut 0-0,xx
    alts evennibble, #table2
    wrlut 0-0,yy
    getnib oddnibble,data,#1
    alts oddnibble, #table3
    wrlut 0-0,zz
    
    We still need to inc xx,yy,zz tho, and allow for hub read too. :(

    Melbourne, Australia
  • I don't follow why the 160 wrlong back to hub needs to be spaced only every 5 longs. Can someone explain a little better please.
    Not sure if you can catch the window with the write backs. A hub stall would be a killer if you cannot catch the window.
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • rogloh Posts: 976
    edited 2018-11-02 - 07:59:06
    Yeah ozpropdev, I found the proper instruction set document that shows the correct WRLUT usage, I had it backwards above and I think I need altd instead of alts because of it.
    eachscanline:
    ' read in 80 longs into LUT containing the 640 pixels.
    
    loop: ' repeat this block of 8 pixels 80 times
    
    rdlut data, lutpixelbufaddr
    add lutpixelbufaddr, #1
    
    getnib evennibble,data,#0
    altd evennibble, #table1
    wrlut 0-0, #1
    altd evennibble, #table2  ' do I need another getnib if evennibble gets trashed above by altd, or does it stay intact ?
    wrlut 0-0, #2
    getnib oddnibble,data,#1
    altd oddnibble, #table3
    wrlut 0-0, #4
    
    getnib evennibble,data,#2
    altd evennibble, #table1
    wrlut 0-0, #6
    altd evennibble, #table2  
    wrlut 0-0, #7
    getnib oddnibble,data,#3
    altd oddnibble, #table3
    wrlut 0-0, #9
    
    getnib evennibble,data,#4
    altd evennibble, #table1
    wrlut 0-0, #11
    altd evennibble, #table2
    wrlut 0-0, #12
    getnib oddnibble,data,#5
    altd oddnibble, #table3
    wrlut 0-0, #14
    
    getnib evennibble,data,#6
    altd evennibble, #table1
    wrlut 0-0, #16
    altd evennibble, #table2  
    wrlut 0-0, #17
    getnib oddnibble,data,#7
    altd oddnibble, #table3
    wrlut 0-0, #19
    
    ' here you would write 20 longs from LUT to HUB for 8 pixels, or unroll further to gain more budget for setup overheads
    ' repeat this loop 80 times
    
    

    I don't think you need to increment the LUT address constants if you write each group of 20 longs within this loop. Only 12 of the 20 longs need to be overwritten; the other 8 (at offsets 0,3,5,8,10,13,15,18 in the LUT write buffer) are fixed constants. If the overhead to set up the write of the 20 longs from the LUT to hub is small, I think it can be done in this small batch of 8 pixels, or you could just keep unrolling further. The only thing I'm unsure about is whether the altd instruction also trashes its D register as well as altering the D field of the next instruction. If it does, then either an "NR" or a duplicate "getnib evennibble" from the original source data might be required (adding another instruction, which makes 9 instead of 8 per pixel pair, still okay). The documentation seems to imply it remains intact, so we can keep the 8 instructions per pixel pair, which would be nice.
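The offsets quoted above can be checked with a short Python model of the unrolled loop. This is only a sketch: it assumes what the listing shows (each pixel pair owns 5 consecutive longs of the 20-long buffer, and the three WRLUT writes land on longs 1, 2 and 4 of each group). The names are illustrative and not taken from any P2 tool.

```python
LONGS_PER_PAIR = 5            # 2 pixels -> 5 longs of encoded data
PAIRS_PER_BLOCK = 4           # one source long holds 8 pixels = 4 pairs
WRITTEN_IN_PAIR = (1, 2, 4)   # longs hit by the three WRLUTs per pair


def block_offsets():
    """Return (written, fixed) long offsets within one 20-long block."""
    written = [pair * LONGS_PER_PAIR + w
               for pair in range(PAIRS_PER_BLOCK)
               for w in WRITTEN_IN_PAIR]
    total = LONGS_PER_PAIR * PAIRS_PER_BLOCK
    fixed = [i for i in range(total) if i not in written]
    return written, fixed


written, fixed = block_offsets()
print("written:", written)
print("fixed:  ", fixed)
```

The written offsets match the `#1, #2, #4, #6, ...` constants in the listing, and the remainder are the 8 fixed longs.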
  • rogloh Posts: 976
    edited 2018-11-02 - 08:25:12
    Cluso99 wrote: »
    I don't follow why the 160 wrlong back to hub needs to be spaced only every 5 longs. Can someone explain a little better please.
    Not sure if you can catch the window with the write backs. A hub stall would be a killer if you cannot catch the window.

    The way I see it, each two pixels generate 5 longs' worth of encoded data, 2 of which are always constant. In this group of 5 longs, longs 1,2,4 are written while longs 0,3 are fixed and could already be set up in the line buffer in hub. So in theory you only need to write 3 longs of data to hub, but it is faster just to write contiguous blocks of longs in a burst, and probably better still in some multiple of 16 longs. So perhaps we should unroll 4 groups of my code above, generating 80 longs, or 5x16 transfers.
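The "4 groups, 80 longs, 5x16" figure is just the least common multiple of the 20-long group and a 16-long burst. A minimal Python check, assuming 16 longs really is the preferred burst size (the post only suggests it):

```python
from math import gcd

LONGS_PER_GROUP = 20   # 8 pixels -> 4 pixel pairs x 5 longs
BURST_LONGS = 16       # assumed preferred burst size, per the post above


def smallest_aligned_block(group_longs, burst_longs):
    """Smallest block that is a whole number of groups and of bursts."""
    return group_longs * burst_longs // gcd(group_longs, burst_longs)


block = smallest_aligned_block(LONGS_PER_GROUP, BURST_LONGS)
groups = block // LONGS_PER_GROUP
bursts = block // BURST_LONGS
print(block, groups, bursts)
```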
  • rogloh Posts: 976
    edited 2018-11-02 - 11:40:02
    Can't help thinking there is some scope to have the COG controlling the HDMI pin streaming also do some work per scanline and perhaps transfer data in/out from the shared LUT to the second COG building the scanline, because once the streamer is setup for the scanline and started what is this COG doing otherwise? Waiting for the end of the line?

    Reading more about the P2 block moves in the documentation, it would appear we don't really need to use the LUT RAM to achieve the same result in my algorithm above as we can also transfer in/out of COG ram at burst rate of one long per clock, so this LUT RAM could be fully freed up and then possibly shared by the other HDMI output COG. With both COGs doing work maybe there is even scope for holding a small font table (say 8 scanlines) in some COGRAM or LUTRAM and somehow generating real text on the fly with just a 2 COG setup, now that would be pretty cool...challenge is on.

    Concept:
    COG1 is the display COG, which manages its pin streamer per scanline to stream in the 800 pixels per scanline (contained within 2000 longs) including the horizontal blanking sync region
    COG2 is the render COG, that writes 640 pixels worth of data (1600 longs) into the scanline buffer held in hub RAM to be consumed by COG1 pin streamer for each scanline

    Basic sequence is:
    COG2 - per every 8 scanlines, reads from HUB the 80 character array for the row, perhaps in monochrome (byte oriented) or hopefully colour (word oriented) into the shared LUT RAM
    COG1 - reads character data from shared LUT and uses it to look up pixels from a font table in its COG RAM, writes back 640 resulting pixel colours (FG or BG) as nibbles held in 80 longs
    COG2 - reads from shared LUT the line's pixel buffer and then reads from the three 16-long tables in COG RAM to build, in COG RAM, the scanline buffer data to be transferred
    COG2 - writes buffer in COG RAM to hub RAM in appropriately sized bursts
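The buffer sizes in the concept follow from the 5-longs-per-pixel-pair figure used earlier in the thread. A small Python sketch, with a hypothetical helper name:

```python
LONGS_PER_PIXEL_PAIR = 5   # figure from earlier in the thread


def longs_for(pixels):
    """Longs of encoded line-buffer data needed for a run of pixels."""
    assert pixels % 2 == 0, "encoding works on pixel pairs"
    return pixels // 2 * LONGS_PER_PIXEL_PAIR


total = longs_for(800)      # full scanline including blanking
active = longs_for(640)     # visible pixels written by the render COG
print(total, active, total - active)
```

The difference, 400 longs, is the blanking and sync region that the render COG never has to rewrite.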

    Have to check if it's feasible to transfer and do font lookups in the budget available, or if there is some other partitioning of work that is better, or maybe even having the font in hub RAM and using COG2 to do some of the conversion. Be awesome if this was doable.
  • TonyB_ Posts: 1,106
    edited 2018-11-03 - 00:46:45
    Wow, lots of comments overnight which I'll study later.
    Formerly known as TonyB
  • Roger
    Be aware that LUT sharing inhibits streamer use in the shared cogs.
    Melbourne, Australia
  • Ok, pity, thanks ozpropdev. Didn't realize that. May not be able to do text without the 3rd COG then.
  • Hi ozpropdev and rogloh

    A word as strong as "inhibits" could be read as "precludes", but when LUT sharing is enabled the docs say: "lookup RAM writes from the other Cog are implemented on the 2nd port of the lookup RAM, which port is shared by the streamer in DDS/LUT modes. If an external write (from the other Cog) occurs on the same clock as a streamer read, the external write gets priority. It is not intended that external writes would be enabled at the same time the streamer is in DDS/LUT mode".

    I've not been closely following all the discussion on this topic but, IMHO, since streamer and cog activity is fully deterministic, perhaps in some cases one could take advantage of an interleaving scheme between the two.

    Another idea, on which perhaps Chip could add some information: it's not clear whether any "data forwarding" is possible, i.e. if both accesses (streamer read and paired-cog write) are exactly simultaneous, can the data being written by the paired cog be forwarded to, and read by, the streamer? In the end they are accessing the LUT RAM at the same time, just with differing "intentions".

    If the addresses differ but the data can still be passed through safely, it can be left to the code designer to guarantee that the streamer doesn't collect garbage or some unintended value.

    Henrique
  • @Yanomani. Was originally sort of hoping that the HDMI COG's streamer is just outputting raw byte sequence to pins and not needing to involve the LUT, so it could be potentially used for other purposes. Perhaps it doesn't work that way though. Not sure as I've not played with P2 yet, real or FPGA. Just kicking around ideas right now.
  • Yanomani Posts: 707
    edited 2018-11-02 - 14:26:57
    @roghlo @rogloh

    We've both been sailing on the same ocean.

    All of the P2 I can "reach" is inside my mind and in the references, in case it (my internal memory) fails.

    And that's enough for the moment, since setting up time-consuming tests can be a burden.

    Sure, I'd love to test some concepts on the "real thing", mainly with HyperRAM and 100-133 MHz QSPI-enabled 8-pin flash/RAM devices.

    Time will come, sooner or later.

    Henrique
  • evanh Posts: 6,341
    edited 2018-11-02 - 14:59:28
    Decided to look that one up ... lutram sharing can be used in all streamer modes. Certainly no issue with hubram exclusive modes. Here's the full text:
    LOOKUP RAM SHARING BETWEEN PAIRED COGS
    Adjacent cogs whose ID numbers differ by only the LSB (cogs 0 and 1, 2 and 3, 4 and 5, etc.) can each allow their lookup RAMs to be written by the other cog via its local lookup RAM writes. This allows adjacent cogs to share data very quickly through their lookup RAMs.

    The ‘SETLUTS D/#’ instruction is used to enable the lookup RAM to receive writes from the adjacent cog:
    SETLUTS #0 'disallow writes from other cog (default)
    SETLUTS #1 'allow writes from other cog

    Lookup RAM writes from the other cog are implemented on the 2nd port of the lookup RAM, which port is shared by the streamer in DDS/LUT modes. If an external write occurs on the same clock as a streamer read, the external write gets priority. It is not intended that external writes would be enabled at the same time the streamer is in DDS/LUT mode.

    In order to find and start two adjacent cogs with which this write-sharing scheme can be used, the COGINIT instruction has a mechanism for finding an even/odd pair and then starting them both with the same parameters. It will be necessary for the program to differentiate between even and odd cogs and possibly restart one, or both, with the final, intended program. To have COGINIT find and start two adjacent cogs, use %x_1_xxx1 for the D/# operand.

    To facilitate handshaking between cogs sharing lookup RAM, the SETSE1...4 instructions can be used to set up lookup RAM read and write events.

    My interpretation, if there is a simultaneous read and write, is the streamer will just read the new data that is present on that lutram port. This works, critically, because the streamer has no ability to write lutram.

    EDIT: Oops, no it won't work for streamer lutram modes. The address bus is screwed. If anything the streamer will see whatever is being written no matter what address it is trying to read. Or it will be zero.

    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • Hehe, rereading my last comment and the correction, they pretty much say the same thing. I just had differing understanding when I wrote each paragraph.

    In the first attempt I was only pondering what happens if the addresses are identical. But without having stated it that way, what I wrote can be viewed as disparate addresses and therefore glitchy fetches for the streamer. Same as the edit.

    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • TonyB_ Posts: 1,106
    edited 2018-11-03 - 00:35:34
    I've added a new table (second code block) to the description, showing the 20 bytes/5 longs that are repeated throughout the line buffer:
    http://forums.parallax.com/discussion/comment/1451704/#Comment_1451704

    The line buffer cog could disable LUT sharing just before it streams TMDS longs from LUT to hub RAM and enable it again afterwards, therefore LUT sharing and streamer contention is a non-issue, I think. Streaming the TMDS data might require double line buffering, depending on the exact timing and size of streamed blocks. I'll look into this a bit more.
    Formerly known as TonyB
  • @evanh
    Just confirming,
    If I enable LUT sharing with one of my video drivers and from the shared cog write to LUT addresses other than the ones used by the streamer it still "glitches" the video output.

    One bus and two users, someone has to step aside.
    Melbourne, Australia
  • rogloh Posts: 976
    edited 2018-11-02 - 23:01:50
    So what happens if you stream directly to pins from hub RAM with LUT sharing, (i.e. not a LUT RAM mode), does it still glitch the pin output? I'd expect/hope not and this is the operating condition I was wondering about. Can't the 250MHz byte stream with the HDMI 10 bit codes be sent to the pins directly on the HDMI output COG at this rate or does it need to go through the LUT RAM for some reason?

    If LUT sharing is possible during streaming the video line I was computing things roughly overnight in my head and I am starting to believe there may be sufficient bandwidth to have the HDMI COG do the font lookup on text data it is sent and then pass the pixel result back to the render COG in the shared LUT. I think we could squeeze in an 8 scanline font for 128 characters and have it stored in COGRAM of the HDMI output COG. The upper bit of the 8 bit character would have to be used for something else, like an attribute, inverse/flash/underline etc or just ignored. The reason is there is not enough space for all 256 characters in the 512 entry COG RAM with an 8 scanline font as some of it needs to be used for real COG code plus the registers. But 128 characters would still work nicely for a text mode.

    TonyB_, my basic idea is that your streamer COG is just streaming, and there is just a single line buffer COG required filling the line buffer. If the streaming COG is mostly idle after it renews the streaming condition for each scanline, then it hopefully can use LUT sharing to do font lookups in this spare time per scanline and pass the colour pixel nibbles back to the line buffer COG which just does the table lookups to compute the 10bit codes to fill the line buffer in hub RAM. We just don't know if LUT sharing works while streaming bytes to pins. I hope it does.
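The font sizing argument above is easy to check. A minimal Python sketch, assuming an 8-pixel-wide, 1-bit-per-pixel font packed four rows per long (the packing is an assumption; the post only gives the character and scanline counts):

```python
COG_RAM_LONGS = 512   # "512 entry COG RAM" from the post


def font_longs(chars, scanlines=8, width_bits=8):
    """Longs needed for a 1bpp font, assuming rows are packed 4 per long."""
    return chars * scanlines * width_bits // 32


for chars in (128, 256):
    used = font_longs(chars)
    print(chars, "chars:", used, "longs,", COG_RAM_LONGS - used, "left for code and registers")
```

With 128 characters half the cog RAM remains; with 256 characters nothing is left, which is the reason given for dropping the upper bit.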
  • Tubular Posts: 3,429
    edited 2018-11-02 - 23:30:15
    .
  • My understanding is lut sharing was made possible by using the cogs streamer bus as a port between cogs.
    Only one thing can use the bus at a time, WRLUT wins over streamer function.
    Melbourne, Australia
  • But, couldn't the accesses be time-interleaved, ensuring that only one of the two is trying to access the Lut at any given moment/sysclock cycle?
  • The thing that gives me hope that LUT sharing might still be possible is the highlighted wording below:

    "Lookup RAM writes from the other cog are implemented on the 2nd port of the lookup RAM, which port is shared by the streamer in DDS/LUT modes. If an external write occurs on the same clock as a streamer read, the external write gets priority. It is not intended that external writes would be enabled at the same time the streamer is in DDS/LUT mode."

    I was hoping to use the 8 bit RFBYTE mode for the pin streaming, not the DDS or LUT modes. Maybe Chip can tell us if this works with LUT sharing and at the sustained 250MHz rate without drops. Using RFBYTE was exactly how he got his HDMI gradient demo working...according to the line in his code
    line_mod	long	$1080<<16 + linebytes	'RFBYTE streamer mode
    
  • rogloh wrote: »
    The thing that gives me hope that LUT sharing might still be possible is the highlighted wording below:

    "Lookup RAM writes from the other cog are implemented on the 2nd port of the lookup RAM, which port is shared by the streamer in DDS/LUT modes. If an external write occurs on the same clock as a streamer read, the external write gets priority. It is not intended that external writes would be enabled at the same time the streamer is in DDS/LUT mode."

    I was hoping to use the 8 bit RFBYTE mode for the pin streaming, not the DDS or LUT modes. Maybe Chip can tell us if this works with LUT sharing and at the sustained 250MHz rate without drops. Using RFBYTE was exactly how he got his HDMI gradient demo working...according to the line in his code
    line_mod	long	$1080<<16 + linebytes	'RFBYTE streamer mode
    

    The RFBYTE streaming mode doesn't even involve the LUT, so there would be no conflict.
  • jmg Posts: 13,059
    Yanomani wrote: »
    But, couldn't the accesses be time-interleaved, ensuring that only one of the two is trying to access the Lut at any given moment/sysclock cycle?

    Yes, some form of time interleaving would be needed. How many cycles are there between streamer reads in this HDMI test?
  • TonyB_ Posts: 1,106
    edited 2018-11-03 - 00:40:16
    rogloh wrote: »
    So what happens if you stream directly to pins from hub RAM with LUT sharing, (i.e. not a LUT RAM mode), does it still glitch the pin output? I'd expect/hope not and this is the operating condition I was wondering about. Can't the 250MHz byte stream with the HDMI 10 bit codes be sent to the pins directly on the HDMI output COG at this rate or does it need to go through the LUT RAM for some reason?

    If LUT sharing is possible during streaming the video line I was computing things roughly overnight in my head and I am starting to believe there may be sufficient bandwidth to have the HDMI COG do the font lookup on text data it is sent and then pass the pixel result back to the render COG in the shared LUT. I think we could squeeze in an 8 scanline font for 128 characters and have it stored in COGRAM of the HDMI output COG. The upper bit of the 8 bit character would have to be used for something else, like an attribute, inverse/flash/underline etc or just ignored. The reason is there is not enough space for all 256 characters in the 512 entry COG RAM with an 8 scanline font as some of it needs to be used for real COG code plus the registers. But 128 characters would still work nicely for a text mode.

    TonyB_, my basic idea is that your streamer COG is just streaming, and there is just a single line buffer COG required filling the line buffer. If the streaming COG is mostly idle after it renews the streaming condition for each scanline, then it hopefully can use LUT sharing to do font lookups in this spare time per scanline and pass the colour pixel nibbles back to the line buffer COG which just does the table lookups to compute the 10bit codes to fill the line buffer in hub RAM. We just don't know if LUT sharing works while streaming bytes to pins. I hope it does.

    rogloh,

    Thank you for taking the time to write the code here. Having studied it better, I think it proves that only one cog is sufficient to fill the line buffer for a 640x480x4bpp bit-mapped graphics mode, with no need to use the spare capacity in the streamer cog. The loop could indeed be larger, as you say, and there is the option of reading the pixel row of 80 longs into cog RAM instead of LUT RAM, albeit at the cost of an extra 80 cycles per line.

    The streamer cog cannot use hub RAM once streaming has started, so any data between the cogs must pass via the LUTs and sharing would need to be enabled both ways at times, but disabled in the line buffer cog when it is streaming from LUT to hub RAM. I'm not entirely clear about your additional text mode, but the line buffer cog won't have much spare time for extras.
    Formerly known as TonyB
  • I'm assuming everyone now understands that not all streamer modes involve lutram. But, yeah, stream via lookup, together with sharing, will glitch the streamer output. It doesn't matter what lutram addresses are being read or written.

    Interleaving would halve the max stream rate and also add jitter to WRLUT. Would ruin any chance of HDMI and one of the details that Cluso wanted from cog pairing was no jitter.

    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • TonyB_ wrote: »
    The streamer cog cannot use hub RAM once streaming has started, ...

    Good point. For HDMI's 250 MHz, the streamer will be max'ing out that cog's hubram slices.

    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • Now I can see! A bit more reading and the fault is mine, for sure.

    During RFLONG LUT modes (1/2/4/8 bits per iteration), the streamer will be doing (bit/twit/nibble/byte)-indexed reads of LUT RAM every sys_clock, though its NCO rollover would be occurring every 32/16/8 or 4 sys_clocks, depending on the selected indexing mode.

    Thus, if RFLONGs are tightly coupled, one after another, there will be no opportunity for a write from the other cog to come in and succeed.

    I believe LUT sharing doesn't need to be disabled at all; it suffices not to let the routines write to the paired cog's LUT while that cog's streamer is actively using it.

    Though, If the address/data buses are wired-or'ed, writing a data value of zero, at address zero, should produce no deleterious effects.
  • TonyB_ Posts: 1,106
    edited 2018-11-03 - 01:18:05
    640x480x4bpp HDMI can be done with two cogs. The loop for 8 pixels uses 34 instructions and 69 cycles. Unrolled 10 times, the code will fit in cog RAM comfortably. 80 pixels means writing 200 longs to hub RAM with SETQ+WRLONG, making a sub-total of ~900 cycles per 80 pixels. Multiplying that by 8 gives ~7200 cycles in total, and there are 8000 cycles per line.
    Formerly known as TonyB
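The ~900 cycle sub-total can be roughly reconstructed in Python. This is a sketch under stated assumptions: the burst write is taken as one clock per long (the rate mentioned earlier in the thread for block moves), and setup overhead and the read of the 80 source longs are left out, which is presumably why the post rounds up.

```python
CYCLES_PER_8_PIXELS = 69    # from the post
UNROLL = 10                 # 10 x 8 pixels = 80 pixels per block
LONGS_PER_PIXEL_PAIR = 5
BLOCKS_PER_LINE = 8         # 8 x 80 = 640 active pixels
CYCLES_PER_LINE = 8000      # from the post


def block_cycles(write_overhead=0):
    """Estimated cycles to encode and write one 80-pixel block."""
    pixels = 8 * UNROLL
    compute = CYCLES_PER_8_PIXELS * UNROLL
    longs_out = pixels // 2 * LONGS_PER_PIXEL_PAIR   # 200 longs
    burst = longs_out                                # assumed 1 clock per long
    return compute + burst + write_overhead


per_block = block_cycles()
per_line = per_block * BLOCKS_PER_LINE
print(per_block, per_line, CYCLES_PER_LINE - per_line)
```

The estimate lands just under the rounded figures in the post, with several hundred cycles per line left for the overheads not modelled here.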
  • rogloh Posts: 976
    edited 2018-11-03 - 01:28:35
    Ok, couple of things..

    First, it's awesome that Chip confirms we can use LUT sharing during RFBYTE pin streaming with no conflicts, this opens up other possibilities.

    Second, the streaming of data in the line buffer COG to/from hub does not need to be done from LUT RAM; it can be COG RAM. I think the following sequence can occur in this line buffer COG:

    In a graphics bitmap data mode, (with no need for font stuff):
    1) for each scanline, read in 80 longs of scanline data from hub holding 640 pixels formatted as 4bpp colour data into COG RAM
    2) iterate through this data in blocks of 8 or more pixels and use 3 lookup tables held in COG RAM and indexed by these pixel nibbles to write 3 long lookup results into a temporary holding buffer in COG RAM. This holding buffer is also preallocated with the two constant longs that do not change for each odd/even pixel pair
    3) write these results from COG RAM temporary holding buffer back to hub RAM into a 2000 long line buffer for the HDMI COG
    4) iterate through groups of 8 or 16 output longs at a time, whatever suits optimal burst transfers to hub in step 3
    5) iterate through each scanline of frame, and generate blanking lines with VSYNC appropriately - these will be constant data lines so can already be sitting around in hub, or generated by the COG dynamically
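Steps 1 to 3 of the graphics mode can be modelled in Python to confirm the data sizes. The table contents and the two constant longs below are placeholders, NOT real TMDS data; only the indexing pattern (two lookups on the even nibble, one on the odd nibble, written to longs 1, 2 and 4 of each group of 5) follows the loop posted earlier in the thread.

```python
# Placeholder tables: tagged values so each write is identifiable.
TABLE1 = [0x100 + n for n in range(16)]
TABLE2 = [0x200 + n for n in range(16)]
TABLE3 = [0x300 + n for n in range(16)]
FIXED0, FIXED3 = 0xAAAA, 0xBBBB   # hypothetical constant longs 0 and 3


def getnib(value, n):
    """Nibble n of a 32-bit value, nibble 0 being the least significant."""
    return (value >> (4 * n)) & 0xF


def encode_long(data):
    """Expand one long of eight 4bpp pixels into 20 output longs."""
    out = []
    for pair in range(4):
        even = getnib(data, 2 * pair)
        odd = getnib(data, 2 * pair + 1)
        out += [FIXED0, TABLE1[even], TABLE2[even], FIXED3, TABLE3[odd]]
    return out


def encode_line(pixel_longs):
    """Encode a whole scanline of packed 4bpp pixel longs."""
    line = []
    for data in pixel_longs:
        line += encode_long(data)
    return line


line = encode_line([0x76543210] * 80)
print(len(line))
```

80 source longs expand to the 1600 longs of active line data; the remaining 400 longs of the 2000-long buffer are blanking.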


    For a potential text mode, using the LUT sharing method between two COGs:
    1) Line buffer COG reads in 80 words containing the 8 bit FG/BG colour nibbles and 8 bit character data for the current row into COG RAM
    2) Line buffer COG writes these 80 words into the shared LUT (or combine this with step 1, directly if possible if it doesn't upset the sharing)
    3) HDMI streamer COG reads these 80 words from its LUT, looks up FONT table in its COG RAM to get 8 pixels for each character and scanline. Font table holds 8 scanlines for 128 characters, using 256 locations (probably best in lower 256 COG RAM addresses for direct addressing). It does this simultaneously with the pin streaming work going on to output the HDMI data from the 2000 long line buffer in HUB.
    4) HDMI streamer COG writes 8 pixels worth of either the FG or BG nibble data per pixel using the font result back into the shared LUT RAM as a packed long per character
    5) Line buffer COG reads these 80 long results back from shared LUT (which now looks just like the graphics scanline data in graphics mode above)
    6) Line buffer COG now does steps 2,3,4,5 from the graphics mode above using this source data taken from the LUT
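Steps 3 and 4 of the text mode amount to expanding one font row into eight colour nibbles. A Python sketch of that expansion follows; the word layout (character in the low byte, BG then FG nibble in the high byte), the bit order and the byte-per-row font array are all assumptions made for illustration, since the post does not fix them.

```python
def expand_char_row(font_row, fg, bg):
    """Turn one 8-bit font row into a long of eight 4bpp pixels."""
    packed = 0
    for x in range(8):
        lit = (font_row >> x) & 1              # assumed: bit 0 is the leftmost pixel
        colour = fg if lit else bg
        packed |= (colour & 0xF) << (4 * x)    # assumed: pixel x lives in nibble x
    return packed


def render_cell(word, font, scanline):
    """Look up one character cell for one scanline (hypothetical layout)."""
    char = word & 0x7F           # top bit of the character byte ignored
    bg = (word >> 8) & 0xF
    fg = (word >> 12) & 0xF
    return expand_char_row(font[char * 8 + scanline], fg, bg)


font = [0] * (128 * 8)           # 128 characters x 8 scanlines, one byte each
font[65 * 8 + 3] = 0xFF          # make scanline 3 of character 65 fully lit
word = (0xA << 12) | (0x2 << 8) | 65
print(hex(render_cell(word, font, 3)), hex(render_cell(word, font, 0)))
```

The result has the same packed 4bpp format as a graphics-mode source long, which is what lets step 6 reuse the graphics path unchanged.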

    Because much of the work is repeated between both modes above, I would think it might also be possible to develop a scheme that can do either text or graphics or both, switching at least on a character row boundary and certainly on a per-frame basis, if not on individual scanlines. Then you could mix and match text and VGA graphics (4 bpp). With 8 scanline fonts this gives some decent 80x60 text in a 480 line mode, but you could also output just 400 active lines of the 525 sent to get a more conventional 50 row mode (or the standard 25 with scanline doubling), whatever suits. Monitors should take it.

    I now want to hurry up and make up my board for P2D2 with an HDMI connector to try some ideas out. Or maybe ozpropdev or Chip could even get something going as well with these ideas if they have the time.

    ps. TonyB_, yes it should be doable. I think there is a risk that it might blow to 9 instructions per pixel pair in the critical loop if evennibble needs to be read twice but even then it should (just) still fit in. Unrolling further buys a bit more.
  • Yanomani wrote: »
    Though, If the address/data buses are wired-or'ed, writing a data value of zero, at address zero, should produce no deleterious effects.

    Wouldn't really help. But here's the actual rule, from docs: "If an external write occurs on the same clock as a streamer read, the external write gets priority."

    External meaning the paired cog. So, streamer addressing is ignored when there is a clash. This produces glitches in streaming data.

    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
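The priority rule quoted from the docs can be illustrated with a toy Python model. It only identifies the clocks on which a streamer LUT read loses the shared port to an external write; what the streamer actually sees on those clocks is not modelled, since the thread itself is unsure. The read-every-clock pattern for LUT modes is taken from the discussion above, not from a measurement.

```python
def glitched_reads(streamer_read_clocks, external_write_clocks):
    """Clocks on which a streamer LUT read collides with a paired-cog write."""
    return sorted(set(streamer_read_clocks) & set(external_write_clocks))


writes = [5, 6, 20]              # hypothetical WRLUT clocks from the paired cog

lut_mode_reads = range(32)       # assumed: LUT-mode streamer reads every clock
rfbyte_mode_reads = []           # RFBYTE mode does not read the LUT at all

print(glitched_reads(lut_mode_reads, writes))
print(glitched_reads(rfbyte_mode_reads, writes))
```

In this model every write collides in a LUT streamer mode regardless of address, while an RFBYTE stream never collides, matching the conclusions reached in the thread.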
  • TonyB_ wrote: »
    640x480x4bpp HDMI can be done with two cogs. The loop for 8 pixels uses 34 instructions and 69 cycles. Unrolled 10 times, the code will fit in cog RAM comfortably. 80 pixels means writing 200 longs to hub RAM with SETQ+WRLONG, making a sub-total of ~900 cycles per 80 pixels. Multiplying that by 8 gives ~7200 cycles in total, and there are 8000 cycles per line.

    720x480x4bpp and 720x576x4bpp @ 27MHz would also be possible. ~8100 cycles in total and there are 8580 cycles per line.
    Formerly known as TonyB
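The two budgets can be compared side by side with a small Python helper. It uses only the rounded figures from the posts (~900 cycles per 80-pixel block, and the stated cycles per line); the function name is illustrative.

```python
BLOCK_PIXELS = 80
BLOCK_CYCLES = 900    # rounded per-block figure from the post


def line_budget(active_pixels, cycles_per_line):
    """Return (cycles used, cycles spare) for one scanline."""
    blocks = active_pixels // BLOCK_PIXELS
    used = blocks * BLOCK_CYCLES
    return used, cycles_per_line - used


print(line_budget(640, 8000))   # (7200, 800)
print(line_budget(720, 8580))   # (8100, 480)
```

The 720-pixel modes still fit, but with noticeably less headroom per line than 640x480.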