All PASM2 gurus - help optimizing a text driver over DVI?


Comments

  • TonyB_ wrote: »
    Cluso99 wrote: »
    This section
    setq ##$33333333 (might be backwards, didn't check)
    muxq x,a
    setq ##$CCCCCCCC
    muxq y,b
    
    could become
    setq ##$33333333 (might be backwards, didn't check)
    muxq x,a
    ror b,#2
    muxq y,b
    rol b, #2
    
    It's the same footprint and clocks but maybe you can do something with this to save a rotate???
    Remember the setq/muxq retains the last setq value.

    For a time-critical loop, better to use the following?
    setq reg_33333333 (might be backwards, didn't check)
    muxq x,a
    setq reg_CCCCCCCC
    muxq y,b
    

    No, it's the same time and same code space.
    The setq is actually 2 instructions for 4 clocks since it's loading a full long - it's a combo AUGS and SETQ.
    My Prop boards: P8XBlade2 , RamBlade , CpuBlade , TriBlade
    P1 Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    P1: Tools (Index) , Emulators (Index) , ZiCog (Z80)
    P2: Tools & Code , Tricks & Traps
  • Brain fade there Cluso. Tony's one isn't immediate data.
    We have the vastness of the internet and yet billions of people decided to spend most of their time within a horribly designed, fake-news emporium of a website that sucks every possible piece of personal information out of you so it can sell it to others. And they see nothing wrong with that.
  • AJLAJL Posts: 215
    edited 2019-10-10 - 01:46:27
    An untested revision of the original code that doesn't address pixel doubling but shaves space and a little time:
    testb   modedata, #0 wz      'flashing or high intensity bg?
            mov skmask, skstandard 'load standard skipmask
     if_z  and skmask, skipflash   'no flash; set-up skipping of test
       
       
       
            rep     @.endloop-2, #80       'repeat loop 80 times for 80 chars; adjust loop length for skips
            skipf skmask 
            mov qmask, qstandard 
            rdlut first, ptra++            'skip every even loop
            getword second, first, #1     'skip every even loop
            mov first, second             'skip every odd loop
            nop                             'skip every odd loop, keep loop lengths equal
            setbyte fontaddr, first, #0 
            bitl first, #15 wcz 
            rdbyte pixels, fontaddr 
     if_c   mov qmask, qflash 
            movbyts first, #%01010101 'becomes BF_BF_BF_BF      2 cycles
            mov temp, first         '2 cycles
            rol temp, #4             'temp becomes FB_FB_FB_FB     2 cycles
            setq qmask              'if flash, first becomes BB_BB_BB_BB   2 cycles
            muxq first, temp       'else first becomes FF_FB_BF_BB      2 cycles
            movbyts first, pixels  'select pixel colours
            wrlut   first, ptrb++ 'save to LUT memory
            xor skmask, even_odd 'flip skipmask
    .endloop   
           pop     ptrb 
           pop     ptra 
           setq2   #80-1 'write 640 nibble pixels to HUB RAM
           wrlong  $100+40, ptrb++ 
          ret 
       
       
    skstandard  long    %0000000000000000000110 
    skipflash   long    %0000000000000100000000 
    qstandard   long    $F0FF000F 
    qflash      long    $0F0F0F0F 
    even_odd    long    %0000000000000000011110 
    
  • roglohrogloh Posts: 1,629
    edited 2019-10-10 - 04:53:13
    ersmith wrote: »
    For graphics this would definitely be a big win. For text, I don't think it's worth sweating over...
    Agree. I want this in there for the graphics from HUB, having 320 pixel text is just a bonus for those retro applications given that you are already running a VGA screen.

    I plan to hook this driver into accessing HyperRAM memory as well; that's also on the cards now and makes sense for graphics modes specifically. I've been thinking about that, and the 320 pixel doubling process is problematic there, as you may need to ask for the line data two scan lines prior if you want to pixel double it. My current processing model computes a line of pixels (including graphics pixel doubling on the fly), then writes it out to hub so that it is ready to be output on the next scan line, at which point it begins on the next scan line's contents. A mouse cursor can also get rendered on it later once the whole line is complete. The problem here is that there may not be a line of data ready from HyperRAM in time to fully read in and process, unless you can start to work on it as it arrives in memory (and no faster), before notification of the transfer being complete. Worst case is that HyperRAM sourced graphics data might not be pixel doubled, or I can look into modifying the sequence to issue the command 2 scan lines prior if it doesn't fit the existing pipeline. I don't like that latter solution, because one day I may wish to support arbitrary screen modes per scanline with a region list and I won't be set up for the new mode until the next line...
    It's easily do-able. As you say, the whole font scanline is only ~64 cycles to read in, and it saves a random read from HUB which is potentially 16 cycles per character; the replacement lookup using altgb is only 4 cycles per character, so with an 80 character wide screen that's a total savings of 80*(16-4) - 64 = 896 cycles per scan line, which is pretty significant.
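
    A quick sanity check of that arithmetic, as a sketch only. The constant and function names are made up for illustration; the cycle counts are simply the figures quoted above, not measurements.

```python
# Figures quoted in the post above; the names are illustrative only.
CHARS_PER_LINE = 80
HUB_READ_CYCLES = 16      # potential cost of a random hub read per character
ALTGB_LOOKUP_CYCLES = 4   # altgb-based lookup from COG RAM per character
FONT_PRELOAD_CYCLES = 64  # one-off read of the whole font scanline


def cycles_saved(chars=CHARS_PER_LINE):
    """Net cycles saved per scan line by preloading the font scanline."""
    per_char = HUB_READ_CYCLES - ALTGB_LOOKUP_CYCLES
    return chars * per_char - FONT_PRELOAD_CYCLES


print(cycles_saved())    # 80 column text
print(cycles_saved(40))  # 40 column text
```

    The 40 column case is included only to show that the saving roughly halves when there are fewer characters to look up, since the preload cost stays fixed.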

    I thought about it late last night after adding it all up as well and found those clock savings very tempting. I'd been preserving the upper LUT half for a potential TERC4 table to be put there some day, but looking more at it I found I could compress that 256-entry table down to 16 longs in COG RAM instead and free up the entire upper LUT RAM for a LOT more code! This table reduction ended up costing me only about 180 clocks, which I think is well worth it and still (just) remains within the budget. So I plan to change the layout over time, which will free up more COG RAM. This should allow the font table to fit into 64 longs of COG RAM as you suggested. The bonus is that this font buffer can be shared for other uses too.

    The only thing I don't like the look of so far with respect to its overall execution time is this 64 bit text format with direct RGB24 selection. I think that could have some future limitations as it would probably take in the vicinity of over half an active line to compute everything in 80 column mode, 40 columns might be okay. With pure DVI you can use the entire active scanline plus a bit more so even that is probably still possible, maybe even with flashing and inverse/underline effects etc. If I can fit it in it would be quite good to be compatible with your ANSI driver. Perhaps a 16bpp driver can also be done in case that helps...or I could try to use this rgbsqz instruction. Needs some more thought...
    I would guess the TMDS encoder just captures the streamer output every 10 clocks. If that is the case, then just run the streamer at sysclock/20 for pixel doubling.
    That's my hope too, and I'd guess that might be what it does. I just don't know yet if it is true, or how easy it is to switch over its frequency on the precise clock edge needed if you do it on the fly. It might not be achievable unless it is buffered with the xcont instructions. I'm working on the software doubling approach in the meantime.

    Cluso99 and TonyB_.
    Yeah all the sneaky little inner loop savings add up and I should certainly use setq with a register.

    I hope to work on this a bit more later tonight with some of these new ideas.
    AJL wrote: »
    An untested revision of the original code that doesn't address pixel doubling but shaves space and a little time:
    Interesting use of SKIPF. Yeah I was considering using some type of skip code in the nibble and two-bit modes pixel doubling, since the code is so similar, it should save code space, plus using SKIPF could reduce the additional execution overhead where the code differs. Am hoping SKIP/SKIPF fully works with REP and could even skip the rep instruction itself if such a sequence is ever desired, but have never tried using it yet.
  • I haven't seen any docs on how to use that hardware.
  • roglohrogloh Posts: 1,629
    edited 2019-10-10 - 04:18:21
    A couple of the locals here in Oz have already had HyperRAM working with the P2, and ozpropdev has the new board working I believe. I have a board ready and am getting rather keen to try out soon myself with my DVI video driver. Ideally a COG could be developed that would function as an arbiter to this shared memory, giving a video COG's transfer requests priority and allowing it larger, streaming type of memory transfers, whilst also giving other COGs access to this shared RAM in the other times and also not starving refresh. It's an interesting scheduling type of problem to solve. I think a video COG could help it out by advertising when it expects its next access will likely be coming after this one (typically one scan line later, but could be more during vertical refresh time), and when it would like the results to be ready and the HyperRAM COG could use this information to allocate and divide any of the remaining free time up amongst its other COG requests while not starving them out for too long.

    If you can provide larger burst sizes to the other non-video COGs you can get reasonable bandwidth, and if you break up the video transfer request into slightly smaller bursts (which helps refresh), you can reduce the latency for small accesses as well if you interleave them in. It's quite a tradeoff to solve... and a weighted deficit round robin or some other sort of scheduler algorithm may allow fairness for non-video COGs. This type of stuff reminds me of my old packet network switching days, LOL. Also in cases where there are only ever two COGs sharing the memory, which is probably going to be quite common, it becomes simpler to solve than if it is all 8 COGs vying for it, so if the HyperRAM driver COG adapts dynamically to its request load and can adjust its strategy accordingly that could be good too.

    Allowing "reasonably" sized larger bursts to non-video COGs will make this shared memory work better. Just giving single 32 bit long size memory accesses will be far too slow for video writes to the whole screen, for example. Better to allow transfers of something in the 64-128 byte order or higher, which is on the 1us sort of timescale. A second non-video COG could then be given a handful (like 4-8) of these opportunities per scanline and hopefully achieve transfer rates in the vicinity of 16MB/s itself while still sharing the memory with a VGA resolution driver operating at the highest colour depths. Lower colour depths and resolutions would only increase the available bandwidth and access opportunities for other COGs further.
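
    To put rough numbers on that estimate, here is a small sketch. It assumes a line rate of about 31.468 kHz (the usual 640x480 figure) and ignores all HyperRAM command and latency overhead, so treat the results as upper bounds. The function name is hypothetical.

```python
LINE_RATE_HZ = 31_468  # assumed VGA 640x480 line rate


def bandwidth_mb_per_s(bursts_per_line, burst_bytes, line_rate_hz=LINE_RATE_HZ):
    """Payload bandwidth for a COG given N bursts of a fixed size per scan line."""
    return bursts_per_line * burst_bytes * line_rate_hz / 1e6


for bursts, size in [(4, 64), (4, 128), (8, 64), (8, 128)]:
    rate = bandwidth_mb_per_s(bursts, size)
    print(f"{bursts} x {size:3d} bytes/line -> {rate:5.1f} MB/s")
```

    Under these assumptions, four 128 byte bursts (or eight 64 byte bursts) per scan line land at roughly 16 MB/s, which lines up with the figure mentioned above.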
  • You can skip a rep, and you can rep a skip. The caveats are listed in the v33 revision of the Parallax Propeller 2 document.

    @cgracey posted the documentation and instruction sheet links in the first post of the New P2 silicon thread
  • evanhevanh Posts: 8,261
    edited 2019-10-10 - 06:52:40
    Oh, I was meaning the TMDS encoding hardware. Config/use of. Only info I've come across is handcrafted streamer patterns for the revA chips I think.

    EDIT: Found it in the list of links that Saucy put up. Undocumented CMOD bit 8 enables it.
  • TonyB_TonyB_ Posts: 1,346
    edited 2019-10-10 - 18:38:48
    If all of the video (graphics + text) needs pixel doubling or more, e.g. due to resolution lower than HDMI minimum, simply set PixelRepetition (PR0:PR3) in the AVI InfoFrame.

    EDIT:
    Wrong - see below.
    Formerly known as TonyB
  • I believe even then you still need to send the pixels twice TonyB_.
  • As far as I can tell from my armchair, one could just use the center 320 pixels and set the InfoFrame to reflect that, which should cause the display to stretch it out for you. Although I wouldn't count on that being supported properly by every display.
  • TonyB_TonyB_ Posts: 1,346
    edited 2019-10-10 - 18:39:08
    rogloh wrote: »
    TonyB_ wrote: »
    If all of the video (graphics + text) needs pixel doubling or more, e.g. due to resolution lower than HDMI minimum, simply set Pixel Repetition (PR0:PR3) in the AVI InfoFrame.
    I believe even then you still need to send the pixels twice TonyB_.

    Roger, I could be completely wrong on this and I apologise if so. The HDMI spec v1.4 does give the strong impression that Pixel Repetition operates more like Pixel Decimation, where the sink skips pixels repeated by the source
    ... but ...
    page 130 of 197 (actual page 145 of 425 in the PDF) shows that pixel doubling increases the number of audio channels. 640x480p/576p or 720x480p/576p have 800 or 858 pixels/line with same line frequency. Thus horizontal blanking is 800-640=160 or 858-720=138 pixels, hence 138 clocks at top of page. However, with pixel doubling, blanking period is 276 clocks, leaving only 800-276=524 or 858-276=582 for video.

    EDIT:
    Wrong - see below.
  • I think the way it works is that transmitting twice as many pixels per line increases the audio channel capacity, because it also gives a longer horizontal blanking period for sending extra audio packets at a given pixel frequency. The line rate is effectively halved in this case, e.g. 480i or 288p can be sent with 2x pixel repetition, keeping the total pixel rate constant at 27 Mpixels/sec. So you would send out 1440 pixels on the line (duplicating an original 720 pixels and indicating this in the info frame), but the line frequency would now become 27MHz/(858*2) = 15.734kHz instead of the usual 31.468kHz, for example. The TV would then only display half of these pixels on the screen because it sees this repetition information in the info frame. This way you can maintain the minimum TMDS link rate of 250 Mbit/s per channel. That seems to be at least one of the purposes of it, but perhaps there is more to it...
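
    The line rate figures can be checked with a few lines of arithmetic. This is only a sketch using the 27 MHz pixel clock and 858 total pixels per line from the posts above; the names are illustrative.

```python
PIXEL_CLOCK_HZ = 27_000_000
TOTAL_PIXELS_PER_LINE = 858  # 720 active + 138 blanking


def line_rate_khz(repetition=1):
    """Line frequency when every pixel is sent `repetition` times at a fixed pixel clock."""
    return PIXEL_CLOCK_HZ / (TOTAL_PIXELS_PER_LINE * repetition) / 1e3


print(f"{line_rate_khz(1):.3f} kHz")  # no repetition
print(f"{line_rate_khz(2):.3f} kHz")  # 2x pixel repetition, same pixel clock
```

    Keeping the line rate unchanged instead would need the pixel clock to double, which is a different trade-off from the one modelled here.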
  • Thanks for the explanation. Small correction, I think, line frequency is unchanged but pixel clock must be increased to get extra audio channels, from 27 to 54 MHz for 2x repetition in this example.
  • Monitors won't go below the arbitrary bottom limit of about 30 kHz line rate under any circumstances. It's what defines the minimum DVI clock rate. TVs do, but only for very specific modes. It's all a tad sucky really.
  • evanh wrote: »
    Monitors won't go below the arbitrary bottom limit of about 30 kHz line rate under any circumstances.

    LCD-based DVI/HDMI monitors still enforce this limit? (which was/is a physical constraint of VGA CRT monitors)
    That sucks indeed.
  • SaucySolitonSaucySoliton Posts: 264
    edited 2019-10-10 - 19:09:27
    cgracey wrote: »
    I worked it into the streamer. It takes the upper 3 bytes of the output long for R/G/B and the lower byte is used for command codes. In lieu of the normal 32 bits, it outputs {24'b0, r, !r, g, !g, b, !b, clk, !clk}. You have to pace your pixels at 1/10th the chip clock using 'SETXFRQ ##$0CCCCCCC'. To enable HDMI encoding: 'SETCMOD #$80'. It's maybe only good for 640x480 @60Hz, but gets us an HDMI output that will work with modern displays.
    forums.parallax.com/discussion/comment/1449549/#Comment_1449549
    Looks like the CMOD setting changed.

    On P1 I don't think there is any way to reset the chroma phase. So I would think that there is no way to reset the TMDS phase on the streamer because there is no need for it. As long as the streamer runs at sysclock/10, they will be fine. It might add some delay as the streamer may output data to the TMDS encoder for several clocks until the encoder needs it.
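
    As a rough check on the NCO values involved, here is a sketch that assumes $8000_0000 corresponds to one streamer transfer per system clock. That scaling is my assumption; it is consistent with the quoted ##$0CCCCCCC for sysclock/10 but is not confirmed in this thread. Whether the TMDS encoder behaves at sysclock/20 is a separate, still open question.

```python
NCO_UNITY = 0x8000_0000  # assumed SETXFRQ value for one transfer per system clock


def xfrq_for_divider(divider):
    """Truncated NCO increment for one streamer transfer every `divider` clocks."""
    return NCO_UNITY // divider


print(f"sysclock/10: ${xfrq_for_divider(10):08X}")
print(f"sysclock/20: ${xfrq_for_divider(20):08X}")
```

    Neither divider goes evenly into the NCO range, so both values are truncated, as the quoted constant already is.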


    For software pixel doubling, I think it would be best to duplicate the 1 bpp data from the font instead of the 4bpp data from the color encoder. :)
       mov        dpix,##$FFCC_3300
                                   ' pixelbyte = 76543210
       movbyts    dpix,pixelbyte   ' dpix = 77667766_55445544_33223322_11001100
       and        dpix,##$F00F_F00F  ' dpix = 7766zzzz_zzzz5544_3322zzzz_zzzz1100
       mov        tmp,dpix
       shr        tmp,#8           ' tmp  = zzzzzzzz_7766zzzz_zzzz5544_3322zzzz
       or         dpix,tmp         ' dpix = 7766zzzz_77665544_33225544_33221100 
    
       mov        tmp,FFFB_BFBB
       movbyts    FFFB_BFBB,dpix
       shr        dpix,#16 
       movbyts    tmp,dpix
    
       ' 4 color mode 
       mov        dpix,##$FFAA_5500
                                   ' pixelbyte = 76543210
       movbyts    dpix,pixelbyte   ' dpix = 76767676_54545454_32323232_10101010
       and        dpix,##$F00F_F00F' dpix = 7676zzzz_zzzz5454_3232zzzz_zzzz1010
       mov        tmp,dpix
       shr        tmp,#8           ' tmp  = zzzzzzzz_7676zzzz_zzzz5454_3232zzzz
       or         dpix,tmp         ' dpix = 7676zzzz_76765454_32325454_32321010 
    

    That movbyts make a clever little look-up-table.
    James https://github.com/SaucySoliton/

    Invention is the Science of Laziness
  • AJLAJL Posts: 215
    edited 2019-10-10 - 22:20:08

    I think you need to review how movbyts works; you seem to have misunderstood its function (perhaps you've got the operands backwards).

    Only the low 8 bits of the source register (or immediate value) are used as a series of 'twits' to control how the bytes of D are selected.

    The format of the instruction is
    MOVBYTS D,{#}S
    

    Giving a pattern for pixelbyte to work through your first snippet as an example:
       mov        dpix,##$FFCC_3300
                                   ' pixelbyte = %00_01_11_10
       movbyts    dpix,pixelbyte   ' dpix = $00_33_FF_CC
       and        dpix,##$F00F_F00F  ' dpix = $00_03_F0_0C
       mov        tmp,dpix
       shr        tmp,#8           ' tmp  = $0C_00_03_F0
       or         dpix,tmp         ' dpix = $0C_03_F3_FC 
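
    For anyone without hardware to hand, the sequence can be modelled on a PC. This is a sketch of my reading of MOVBYTS (each 2-bit field of the low byte of S picks a byte of D), not a verified emulation. It uses a true shift right, so the top byte can differ from a rotate-based trace, but only in bits that get thrown away.

```python
def movbyts(d, s):
    """Model of MOVBYTS D,S: each 2-bit field of S[7:0] selects a byte of D."""
    src = [(d >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        sel = (s >> (2 * i)) & 3
        out |= src[sel] << (8 * i)
    return out


def double_8_pixels(pixelbyte):
    """Follow the movbyts/and/shr/or sequence; return the full 32-bit dpix."""
    dpix = movbyts(0xFFCC3300, pixelbyte)
    dpix &= 0xF00FF00F
    dpix |= dpix >> 8
    return dpix


def useful_bits(dpix):
    """Bytes 2 and 0 of dpix hold the 16 doubled pixels."""
    return (((dpix >> 16) & 0xFF) << 8) | (dpix & 0xFF)


dpix = double_8_pixels(0b00_01_11_10)
print(f"dpix = ${dpix:08X}, doubled = %{useful_bits(dpix):016b}")
```

    With pixelbyte = %00_01_11_10 the model gives dpix = $0003F3FC, and bytes 2 and 0 together hold each input bit twice.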
    
    

  • From my armchair, I think doubling 8 bits of 1BPP data should be as simple as
    ' pixelbyte contains 8 mono pixels
    mergew pixelbyte
    mul pixelbyte,#3
    

    Or am I having a brain fart? (I don't have a P2, so this is ofc just theory from looking at the opcode sheet)
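
    The two-instruction idea can be checked off-chip with a small model. This is a sketch based on my reading of the instruction sheet: MERGEW is taken to interleave the bits of the two words (low word to even bit positions), and MUL is assumed to be an unsigned 16 x 16 multiply. Neither is verified on hardware here.

```python
def mergew(d):
    """Model of MERGEW D: low-word bit i -> bit 2i, high-word bit i -> bit 2i+1."""
    lo, hi = d & 0xFFFF, (d >> 16) & 0xFFFF
    out = 0
    for i in range(16):
        out |= ((lo >> i) & 1) << (2 * i)
        out |= ((hi >> i) & 1) << (2 * i + 1)
    return out


def mul16(d, s):
    """Model of MUL D,S, assumed to multiply the low 16 bits of each operand."""
    return ((d & 0xFFFF) * (s & 0xFFFF)) & 0xFFFFFFFF


def double_mono(pixelbyte):
    """Spread 8 mono pixels to the even bit positions, then copy each bit up by one."""
    return mul16(mergew(pixelbyte & 0xFF), 3)


print(f"%{double_mono(0b00011110):016b}")
```

    Multiplying by 3 works because the spread bits sit two positions apart, so x + 2x never generates a carry.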
  • AJL wrote: »
    I think you need to review how movbyts works; you seem to have misunderstood its function (perhaps you've got the operands backwards).

    Only the low 8 bits of the source register (or immediate value) are used as a series of 'twits' to control how the bytes of D are selected.

    The format of the instruction is
    MOVBYTS D,{#}S
    

    Giving a pattern for pixelbyte to work through your first snippet as an example:
       mov        dpix,##$FFCC_3300
                                   ' pixelbyte = %00_01_11_10
       movbyts    dpix,pixelbyte   ' dpix = $00_33_FF_CC
       and        dpix,##$F00F_F00F  ' dpix = $00_03_F0_0C
       mov        tmp,dpix
       shr        tmp,#8           ' tmp  = $0C_00_03_F0
       or         dpix,tmp         ' dpix = $0C_03_F3_FC 
    
    
    I used the bit numbers for pixelbyte in my notes.

    pixelbyte in: %00_01_11_10
    dpix out: %xxxx_xxxx_0000_0011_xxxx_xxxx_1111_1100
    The goal was to duplicate the bits in pixelbyte. Looks like it worked. You did ror instead of shr, but not a problem since it only affects the don't care bits.
  • Wuerfel_21 wrote: »
    From my armchair, I think doubling 8 bits of 1BPP data should be as simple as
    ' pixelbyte contains 8 mono pixels
    mergew pixelbyte
    mul pixelbyte,#3
    

    Or am I having a brain fart? (I don't have a P2, so this is ofc just theory from looking at the opcode sheet)

    Winner! Looks good.

  • roglohrogloh Posts: 1,629
    edited 2019-10-11 - 01:33:42
    Yes I have had some success with combining ideas from various smart people here including ozpropdev, AJL, ersmith and Saucy, etc.

    I've managed to put together the following code block for 16 colour text that shares the same code loop for both 40 and 80 column modes and it fits the budget I wanted. Just tried it and it works nicely. 80 column mode falls through to some excess instructions at the end into the longer 40 column mode loop but I ensure only temporary variables are affected. A conditional jump at the beginning of the 40 column stuff fixes it but adds another long and 80 more clocks to the 40 column code.
                setq    #40-1                   'get next 80 chars with colours
                rdlong  scratch, ptra           'read 40 longs from HUB into COGRAM
    
                testb   modedata, #0 wz         'flashing / high intensity bg test
        if_nz   mov     test1, instr1
        if_z    mov     test1, #0 
    
                testb   modedata, #2 wz         '40/80 column mode test
                mov     a, #0 wc
    textloops
    if_z        rep     #25, #40                '2000 clocks for 40 column mode
    if_nz       rep     #16, #80                '2560 clocks for 80 column mode
                altgw   a, #scratch
                getword first, 0-0, #0          'get next character and colours
                getbyte fontaddr, first, #0     'determine font lookup address
                altgb   fontaddr, #font         'extract font offset
                getbyte pixels, 0-0, #0         'get font information for character
    test1       bitl    first, #15 wcz          'test (and clear) flashing bit
        if_c    and     pixels, flash           'make it all background if flashing
                movbyts first, #%01010101       'becomes BF_BF_BF_BF
                mov     b, first                'grab a copy 
                rol     b, #4                   'b becomes FB_FB_FB_FB
                setq    fgbg_mask               'mask is #$F0FF000F                
                muxq    first, b                'first becomes FF_FB_BF_BB
                testb   modedata, #2 wz         'repeat test again, z gets trashed
    if_nz       movbyts first, pixels           'select pixel colours
    if_nz       wrlut   first, ptrb++           'instrs. skipped if pixel doubled
                add     a, #1                   'repeat or fall through for 40 cols
                setword pixels, pixels, #1      'replicate words
                mergew  pixels                  'double pixels
                mov     b, first                'save a copy before we lose colours
                movbyts first, pixels           'compute 4 lower colours of char
    if_z        wrlut   first, ptrb++           'save it
                ror     pixels, #8              'get upper pixels
                movbyts b, pixels               'compute 4 higher colours of char
    if_z        wrlut   b, ptrb++               'save it
    endloop 
    ....
    scratch    long 0[40]
    font       long 0[64]
    
    


    Wuerfel_21, I knew something like that was going to be possible with those instructions. I think I can leverage it for the nibble replication too. I've been thinking about it all night... my original twit and nibble multipliers are my only ugly slow routines left until I fix them.

    Update: There's another variant of the above that works with LUTRAM buffer in case I don't have enough COGRAM for holding the font and the characters. It only adds one clock cycle per loop iteration because RDLUT takes 3 clocks. You save on the counter and altgw, but need to do a rotate and xor after the RDLUT so overall it adds 1 clock which is still okay I suspect.
  • Wuerfel_21 wrote: »
    evanh wrote: »
    Monitors won't go below the arbitrary bottom limit of about 30 kHz line rate under any circumstances.

    LCD-based DVI/HDMI monitors still enforce this limit? (which was/is a physical constraint of VGA CRT monitors)
    That sucks indeed.

    It wasn't a physical constraint of CRTs either. That was a marketing myth. It has always been arbitrary.

  • AJL wrote: »
    I think you need to review how movbyts works; you seem to have misunderstood its function (perhaps you've got the operands backwards).

    Only the low 8 bits of the source register (or immediate value) are used as a series of 'twits' to control how the bytes of D are selected.

    The format of the instruction is
    MOVBYTS D,{#}S
    

    Giving a pattern for pixelbyte to work through your first snippet as an example:
       mov        dpix,##$FFCC_3300
                                   ' pixelbyte = %00_01_11_10
       movbyts    dpix,pixelbyte   ' dpix = $00_33_FF_CC
       and        dpix,##$F00F_F00F  ' dpix = $00_03_F0_0C
       mov        tmp,dpix
       shr        tmp,#8           ' tmp  = $0C_00_03_F0
       or         dpix,tmp         ' dpix = $0C_03_F3_FC 
    
    
    I used the bit numbers for pixelbyte in my notes.

    pixelbyte in: %00_01_11_10
    dpix out: %xxxx_xxxx_0000_0011_xxxx_xxxx_1111_1100
    The goal was to duplicate the bits in pixelbyte. Looks like it worked. You did ror instead of shr, but not a problem since it only affects the don't care bits.

    Yes, I did ror instead of shr. My mistake.

    I see, if you throw away bytes 3 and 1 it produces the result you are after, as the full result is actually
    dpix out: %0000_0000_0000_0011_1111_0011_1111_1100


    But, of course there's the mergew, mul sequence that is best :-)



  • Wuerfel_21 wrote: »
    From my armchair, I think doubling 8 bits of 1BPP data should be as simple as
    ' pixelbyte contains 8 mono pixels
    mergew pixelbyte
    mul pixelbyte,#3
    

    Or am I having a brain fart? (I don't have a P2, so this is ofc just theory from looking at the opcode sheet)

    This works for up to 16 pixels provided high word is zero. If not:
    	rolword	pixels,pixels,#0
    	mergew	pixels
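
    A quick model of that two-instruction variant, as a sketch. ROLWORD is assumed here to shift D left by a word and bring in the selected word of S, and MERGEW to interleave the two words bit by bit; both are my reading of the instruction sheet and not checked on silicon.

```python
def rolword(d, s, n):
    """Model of ROLWORD D,S,#N: D = {D[15:0], word N of S}."""
    return ((d & 0xFFFF) << 16) | ((s >> (16 * n)) & 0xFFFF)


def mergew(d):
    """Model of MERGEW D: low-word bit i -> bit 2i, high-word bit i -> bit 2i+1."""
    lo, hi = d & 0xFFFF, (d >> 16) & 0xFFFF
    out = 0
    for i in range(16):
        out |= ((lo >> i) & 1) << (2 * i)
        out |= ((hi >> i) & 1) << (2 * i + 1)
    return out


def double_16_pixels(pixels):
    """Copy the low word into both halves, then interleave to double every bit."""
    return mergew(rolword(pixels, pixels, 0))


print(f"${double_16_pixels(0x8001):08X}")
```

    Because the low word is copied into both halves first, whatever was in the high word of the input does not matter.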
    

    Copied from:
    http://forums.parallax.com/discussion/comment/1409553/#Comment_1409553
  • roglohrogloh Posts: 1,629
    edited 2019-10-11 - 04:21:20
    We need to find a way to replicate nibbles from a starting word like 0000DCBA into a pattern such as DDCCBBAA, where "A"-"D" here represent the 4 bit nibbles in a 32 bit long. I still can't quite see it yet, but I'm almost 100% convinced it is doable with some clever use of mergeb/mergew/splitb/splitw/setbyts/ror/or/mul etc. My own way still takes 8 instructions to do this, which is slow.

  • What about following Wuerfel's lead and getting to 0D0C0B0A, then multiplying by 17?
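
    The multiply step on its own is easy to check in plain 32-bit arithmetic. This sketch only covers that final step and says nothing about how to reach the spread form cheaply; note too that if the P2 MUL is a 16 x 16 multiply (my assumption), a single instruction would not cover all 32 bits.

```python
def double_spread_nibbles(spread):
    """Given 0D0C0B0A, multiplying by 17 (x + 16x) copies each nibble up by four bits."""
    return (spread * 17) & 0xFFFFFFFF


print(f"${double_spread_nibbles(0x0D0C0B0A):08X}")
```

    No carries occur because every other nibble of the spread value is zero.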
  • roglohrogloh Posts: 1,629
    edited 2019-10-11 - 05:22:49
    Yes Lachlan, that's sort of what I've been trying to do. I must be missing something, but I think his method is good for pixel doubling, not getting to 0D0C0B0A which is required first. You can then do the multiply trick as you suggest to get the nibbles doubled.

    I've looked at what patterns can be reached using split/merge etc on the original data, and what can be used in the final step with split/merge/mul etc, but I can't see a fast way (yet) that they can be connected via intermediate instructions. I've convinced myself it must be doable somehow.

    I can do it in 8 instructions, but ideally if it can be done in less than this, that would be great.
  • Roger, this is the best I can think of at the moment.
    		mov	pa,##$dcba
    
    		movbyts	pa,#%%1100
    		mov	pb,pa
    		shl	pb,#4		
    		and	pb,mask
    		andn	pa,mask	
    		or	pa,pb
    ...	
    mask		long	$0ff00ff0
    
    Melbourne, Australia
  • roglohrogloh Posts: 1,629
    edited 2019-10-11 - 05:46:16
    Cool. I think this can be shaved by one instruction with a setq, muxq sequence.
    movbyts pa, #%%1100
    mov pb, pa
    shl pb,#4
    setq mask
    muxq pa, pb
    
    etc
    

    Five is better than 8 I had before. Will code this up.
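
    For reference, the five-instruction sequence can be modelled off-chip like this. It is a sketch under my reading of MOVBYTS and SETQ/MUXQ, with helper names of my own; %%1100 is taken as the base-4 selector 1,1,0,0.

```python
MASK32 = 0xFFFFFFFF
NIBBLE_MASK = 0x0FF00FF0  # the `mask` long from the posts above


def movbyts(d, s):
    """Model of MOVBYTS D,S: each 2-bit field of S[7:0] selects a byte of D."""
    src = [(d >> (8 * i)) & 0xFF for i in range(4)]
    return sum(src[(s >> (2 * i)) & 3] << (8 * i) for i in range(4))


def muxq(d, s, q):
    """Model of SETQ + MUXQ: where Q has a 1 bit, take the bit from S, else keep D."""
    return (d & ~q & MASK32) | (s & q)


def double_nibbles(word):
    """0000DCBA -> DDCCBBAA via movbyts, shl #4 and a masked merge."""
    pa = movbyts(word & 0xFFFF, 0b01_01_00_00)  # %%1100 -> DC_DC_BA_BA
    pb = (pa << 4) & MASK32                     # CD_CB_AB_A0
    return muxq(pa, pb, NIBBLE_MASK)


print(f"${double_nibbles(0xDCBA):08X}")
```

    The masked merge keeps the outer nibble of each byte pair from pa and takes the inner ones from the shifted copy, which is why the separate and/andn/or steps collapse into one muxq.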