All PASM2 gurus - help optimizing a text driver over DVI? - Page 4 — Parallax Forums

All PASM2 gurus - help optimizing a text driver over DVI?

Comments

  • cgracey Posts: 14,206
    evanh wrote: »
    cgracey wrote: »
    Beware that interrupt routines may disrupt Q's value. So, it's safest to always do a fresh SETQ.
    That's not much of an argument for an object that is running the whole cog - without interrupts. And debuggers can preserve Q using MUXQ and SETQ. And, the optimised, but buggy, example above actually is protected from interrupts anyway. The SETQ is followed immediately by a REP.

    Good idea about using MUXQ to read Q.
  • evanh Posts: 16,032
    edited 2019-10-11 13:58
    Huh, I guess that can be used inside any ISR as well. Putting them back in limits for this.
  • rogloh wrote: »
    Here is the working code, pixels doubled properly, 320 on screen.
    doublenibbles
                rep     #14, readburst
                rdlut   b, ptra++
                getword a, b, #1                
                movbyts b, #%%1100
                mov     pb, b
                shl     pb, #4
                setq    nibblemask
                muxq    b, pb
                wrlut   b, ptrb++
                movbyts a, #%%1100
                mov     pb, a
                shl     pb, #4
                setq    nibblemask
                muxq    a, pb
                wrlut   a, ptrb++
                ret
    

    Is the second setq nibblemask needed?
  • evanh Posts: 16,032
    edited 2019-10-11 15:12
    Good point, a singular SETQ can be placed anywhere between the RDLUT and first MUXQ.
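A minimal Python sketch of the doublenibbles data path, to show why one SETQ is enough. The helper names (`movbyts`, `muxq`, `double_nibbles`) are made up for illustration; they model the instructions as described in this thread (MUXQ copies S bits into D wherever Q has a 1, MOVBYTS picks each result byte with a 2-bit selector). Q is treated as a plain variable that RDLUT overwrites, so the SETQ has to come after the RDLUT, and nothing between the two MUXQs writes Q again.

```python
MASK32 = 0xFFFFFFFF
NIBBLEMASK = 0x0FF00FF0  # same constant as nibblemask in the posted code

def movbyts(d, sel):
    """Model of MOVBYTS: result byte i is the byte of d chosen by sel bits [2i+1:2i]."""
    out = 0
    for i in range(4):
        src = (sel >> (2 * i)) & 3
        out |= ((d >> (8 * src)) & 0xFF) << (8 * i)
    return out

def muxq(d, s, q):
    """Model of MUXQ: where q has a 1 bit take the bit from s, otherwise keep d."""
    return ((d & ~q) | (s & q)) & MASK32

def double_nibbles(src):
    """Return the two output longs for one input long holding eight 4-bit pixels."""
    q = src                # RDLUT leaves the value it read in Q ...
    q = NIBBLEMASK         # ... so the single SETQ must come after the RDLUT
    lo = movbyts(src, 0b01_01_00_00)       # %%1100: bytes 1,1,0,0
    lo = muxq(lo, (lo << 4) & MASK32, q)   # first MUXQ
    hi = movbyts(src, 0b11_11_10_10)       # %%3322: bytes 3,3,2,2
    hi = muxq(hi, (hi << 4) & MASK32, q)   # second MUXQ reuses Q, no new SETQ
    return lo, hi

lo, hi = double_nibbles(0x12345678)
print(f"${lo:08X} ${hi:08X}")
```

For the input $12345678 the low output long carries nibbles 5,5,6,6,7,7,8,8 and the high one 1,1,2,2,3,3,4,4, so every 4-bit pixel appears twice.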
  • cgracey Posts: 14,206
    edited 2019-10-11 18:29
    I updated the v33 documentation to cover Q issues:

    SETQ CONSIDERATIONS

    The SETQ and SETQ2 instructions write to the Q register and are intended to precede a companion instruction. The value written to the Q register by SETQ/SETQ2 will persist until any of these events occur:

    * XORO32 executes - Q is set to the XORO32 result.
    * RDLUT executes - Q is set to the data read from the lookup RAM.
    * GETXACC executes - Q is set to the Goertzel sine accumulator value.
    * CRCNIB executes - Q gets shifted left by four bits.
    * COGINIT/QDIV/QFRAC/QROTATE executes without a preceding SETQ instruction - Q is set to zero.

    It is possible to retrieve the current Q value by the following sequence:
    MOV	qval,#0			'reset qval
    MUXQ	qval,##$FFFFFFFF	'for each '1' bit in Q, set the same bit in qval
    

    SETQ/SETQ2 shields the next instruction from interruption to prevent an interrupt service routine from inadvertently altering Q before the intended instruction can utilize its value.
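The two-instruction read-back of Q can be checked with a small Python model. This is only a sketch: `muxq` and `read_q` are illustrative names, and the model assumes MUXQ computes D = (D AND NOT Q) OR (S AND Q), which is how the instruction is described in this thread.

```python
MASK32 = 0xFFFFFFFF

def muxq(d, s, q):
    """Model of MUXQ: for each 1 bit in q, copy that bit of s into d."""
    return ((d & ~q) | (s & q)) & MASK32

def read_q(q):
    """Mirror of the documented sequence for retrieving Q."""
    qval = 0                       # MOV  qval,#0
    qval = muxq(qval, MASK32, q)   # MUXQ qval,##$FFFFFFFF
    return qval

print(hex(read_q(0x0FF00FF0)))

# Without the reset, stale 1 bits in qval survive wherever Q is 0.
print(hex(muxq(0xFFFF0000, MASK32, 0x0000000F)))
```

The second print shows why the MOV matters: MUXQ only touches the bit positions selected by Q, so a non-zero starting value would leak into the result.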
  • TonyB_ wrote: »
    Is the second setq nibblemask needed?
    evanh wrote: »
    Good point, a singular SETQ can be placed anywhere between the RDLUT and first MUXQ.
    Had the same thought last night and will change the code. We're all on the same page it seems.

    I'm glad this prompted figuring out what changes Q too. That could be a gotcha for people otherwise.
  • evanh Posts: 16,032
    Thanks for the update Chip. That is a number of instructions alright. COGINIT is the only one not likely to cause confusion. The RDLUT case, is that because of Xbyte feature?
  • evanh Posts: 16,032
    Hehe, this has only come to light because MUXQ uses the Q register as an ordinary operand, without any SETQ prefixing. Every other prior documented use of Q required SETQ as a prefixing instruction to trigger a modal change.

  • Peter Jakacki Posts: 10,193
    edited 2019-10-12 00:17
    @rogloh - thanks for all your hard work, I'm looking forward to being able to incorporate your video driver when I get my new P2D2 motherboards done with HDMI. I really hate the size of the HDMI connector though and I will look at alternatives such as mini HDMI or displayport since you only need a passive HDMI cable.

    I haven't played with the new P2 instruction set much yet, but I now know about movbyts and muxq :) I even created definitions for these instructions so I could interact with them like this:
    TAQOZ# $12345678 := Q ---  ok
    TAQOZ# Q .L --- $1234_5678 ok
    TAQOZ# Q 0 MOVBYTS .L --- $7878_7878 ok
    TAQOZ# Q %00011011 MOVBYTS .L --- $7856_3412 ok
    TAQOZ# Q %01001110 MOVBYTS .L --- $5678_1234 ok
    
    (I think it would be good to create a %% twit prefix for my numbers)
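The TAQOZ session above can be reproduced with a short Python model of MOVBYTS. This is a sketch with made-up helper names (`movbyts`, `quad`); it assumes each pair of selector bits, counted from the least significant pair, chooses the source byte for the matching result byte. `quad` shows what a `%%` base-4 prefix would mean for the selectors used elsewhere in this thread.

```python
def movbyts(d, sel):
    """Model of MOVBYTS: result byte i is the byte of d chosen by sel bits [2i+1:2i]."""
    out = 0
    for i in range(4):
        src = (sel >> (2 * i)) & 3
        out |= ((d >> (8 * src)) & 0xFF) << (8 * i)
    return out

def quad(digits):
    """Parse a base-4 literal such as the '1100' in %%1100."""
    return int(digits, 4)

q = 0x12345678
for sel in (0b00000000, 0b00011011, 0b01001110):
    print(f"%{sel:08b} MOVBYTS -> ${movbyts(q, sel):08X}")

print(f"%%1100 = %{quad('1100'):08b} -> ${movbyts(q, quad('1100')):08X}")
```

The three selectors give $78787878, $78563412 and $56781234, matching the session, and %%1100 is the same selector as %01010000.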

    When I was getting my P2 video player to work I was working with 320x240 bitmap source frames and I wanted to double-up the pixels and lines for full screen. This is the kludgy and ugly hubexec/lut routine I did that is called for each line of source.
    ' 3ms per frame FULL-SCREEN upscaler -
    ' ( src scr --  src+640 scr )
    DWIDTH			setq2	#32-1		' copy into LUT
    			rdlong	lutfree,##DWLUT
    			jmp	#lutfree+$200	' and run
    
    DWLUT			setq2	#80-1
    			rdlong	srcbuf,tos1	' read all of source
    '
    			mov	r3,#srcbuf	' set lut src
    			mov	r1,#lmbuf	' set lut dst
    
    			'rep	@.l0,#80	' (note - bug here as rep dest is 17 instead of 18)
    			rep	#18,#80
    			rdlut	acc,r3		' read 4 bytes of source
    			add	r3,#1
    
    			getbyte	X,acc,#0	' get byte
    			setbyte	X,acc,#1	' double it up
    			shr	acc,#8		' next byte
    			setbyte	X,acc,#2	' set it
    			setbyte	X,acc,#3	' double it up
    			wrlut	X,r1		' save in lut
    			shr	acc,#8
    			add	r1,#1		' update dst ptr
    
    			getbyte	X,acc,#0	' get byte
    			setbyte	X,acc,#1	' double it up
    			shr	acc,#8		' next byte
    			setbyte	X,acc,#2	' set it
    			setbyte	X,acc,#3	' double it up
    			wrlut	X,r1		' save in lut
    			shr	acc,#8
    			add	r1,#1		' update dst ptr
    
    .l0			setq2	#160-1		' write out result to screen dst
    			wrlong	lmbuf,tos
    			add	tos,##640	' double lines as well'
    			setq2	#160-1		' write out result to screen dst
    			wrlong	lmbuf,tos
    			add	tos,##640
    			ret
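As a cross-check, the GETBYTE/SETBYTE/SHR sequence inside the REP loop can be modelled in Python and compared with the MOVBYTS selectors %%1100 and %%3322 used for byte doubling elsewhere in this thread. This is a sketch only: the helper names are invented, and it models one source long at a time, not the LUT pointers or the hub bursts.

```python
MASK32 = 0xFFFFFFFF

def setbyte(d, s, n):
    """Model of SETBYTE D,S,#n: replace byte n of d with the low byte of s."""
    shift = 8 * n
    return (d & ~(0xFF << shift) & MASK32) | ((s & 0xFF) << shift)

def movbyts(d, sel):
    """Model of MOVBYTS: result byte i is the byte of d chosen by sel bits [2i+1:2i]."""
    out = 0
    for i in range(4):
        out |= ((d >> (8 * ((sel >> (2 * i)) & 3))) & 0xFF) << (8 * i)
    return out

def double_bytes(acc):
    """Follow the posted loop body: two output longs per source long."""
    out = []
    for _ in range(2):
        x = acc & 0xFF           # getbyte X,acc,#0
        x = setbyte(x, acc, 1)   # double it up
        acc >>= 8                # next byte
        x = setbyte(x, acc, 2)
        x = setbyte(x, acc, 3)   # double it up
        out.append(x)            # wrlut X,r1
        acc >>= 8
    return tuple(out)

src = 0x12345678
print([f"${v:08X}" for v in double_bytes(src)])
print(f"${movbyts(src, 0b01010000):08X} ${movbyts(src, 0b11111010):08X}")
```

Both routes produce the same pair of longs for this input, which suggests the eight-instruction groups could be replaced by one MOVBYTS each. Whether that fits the rest of the routine is not something this model can show.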
    
    
  • rogloh Posts: 5,837
    edited 2019-10-12 01:42
    No worries Peter, I think it will be quite useful for people even if it limits you to situations where you operate the P2 at minimum rates of around 250MHz or higher. This first version is pure DVI initially over HDMI which will get us going faster, I'll have to see about incorporating any HDMI work later...

    The bonus with the approach taken here is that this driver or future derivatives from it should be able to accept scanline data from other COGs and display them, as well as having its own text rendering and graphics mode capabilities built in. So you could simply spawn external sprite or other video streaming cog(s) each rendering into their scanline buffers and this driver would be able to display it and also pixel double it. I think the scanline data could potentially indicate its own colour mode too. I have ideas of adding a region list to the inbuilt line sequencing too so some groups of scanlines could be text, others graphics, etc, all in the same frame which allows split screens and other fun things. If the "mode" can be changed dynamically per scanline that allows lots of possibilities and if some screen data could come from hub and some from HyperRAM that could be rather good too. That would obviously require further work and probably dynamically loadable video COG code instead of running fully self contained as it does right now, but it should be achievable.

    With respect to your comments on MOVBYTS and in your Forth, it is nice you have discovered this beauty too and it's becoming one of my favorites for graphics (I really like that incmod too). I could see that having instructions like this presented in a nice interactive Forth environment would have been a good way for me to experiment around and check the MERGEB/SPLITB/MERGEW/SPLITW behaviour to see what combinations could shuffle bits in interesting ways. Are those in there too? Doing it on paper does your head in and I gave up after some time, yet I was tempted to write some code environment to play around with these (is there any interactive P2 simulator anywhere yet?). I still reckon there is probably some clever way to replicate nibbles using those new P2 bit shuffle instructions combined with other instructions, however whether it can be done in fewer than 4 instructions (which is the benchmark now) I don't know.
  • rogloh Posts: 5,837
    edited 2019-10-12 03:27
    I've looked at what AJL said earlier and copied the optimized current code sequences for doubling pixels to a spreadsheet. The doubling of longs was not included here (though it could be too) because I will probably treat those as a special case anyway for performance reasons.

    Good thing is the biggest skip jump seems to be 6 instructions so it should be pretty fast and unwanted instructions can be skipped instead of cancelled I think. I think these sequences will allow the skipf compression, though ideally it's better if the skipf works from outside a rep loop. I need to check that. Otherwise it would need to also be in the loop which will add overhead of 2 clocks per iteration which I don't like.

    One bonus I see here is that needing only 22 bits of skip information is handy because it leaves over 9 bits in a per mode skip table to hold a jump pointer. :smile:

    A spreadsheet is the best tool I found for doing this. Each instruction can be dragged around in its cell, and rows can be added or moved such that each row contains the same type of instruction. Now the skip patterns can easily be figured out...

    Make sure your browser screen is wide enough or reduce its font or copy it out to a file to be able to see this properly column aligned.
        doublebits              doublenits              doublenibbles           doublebytes             doublewords
        ---------------------   ---------------------   ---------------------   ---------------------   ---------------------
    1   rep     #8, readburst   rep     #11, readburst  rep     #13, readburst  rep     #6, readburst   rep     #6, readburst
    2   rdlut   a, ptra++       rdlut   a, ptra++       rdlut   a, ptra++       rdlut   a, ptra++       rdlut   a, ptra++
    3                                                   setq    nibblemask                              
    4                           getword b, a, #1                                                        
    5                           getword a, a, #0                                                        
    6                           mergew  a                                                               
    7                           mul     a, #3                                                           
    8   mov     b, a                                    mov     b, a            mov     b, a            mov     b, a
    9   movbyts a, #%%1010                              movbyts a, #%%1100      movbyts a, #%%1100      movbyts a, #%%1010
    10                                                  mov     pb, a                                   
    11  mergew  a               mergew  a                                                               
    12                                                  shl     pb, #4                                  
    13                                                  muxq    a, pb                                   
    14  wrlut   a, ptrb++       wrlut   a, ptrb++       wrlut   a, ptrb++       wrlut   a, ptrb++       wrlut   a, ptrb++
    15  movbyts b, #%%3232                              movbyts b, #%%3322      movbyts b, #%%3322      movbyts b, #%%3232
    16  mergew  b               mergew  b                                                               
    17                          mul     b, #3                                                           
    18                          mergew  b                                                               
    19                                                  mov     pb, b                                   
    20                                                  shl     pb, #4                                  
    21                                                  muxq    b, pb                                   
    22  wrlut   b, ptrb++       wrlut   b, ptrb++       wrlut   b, ptrb++       wrlut   b, ptrb++       wrlut   b, ptrb++
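One way to turn the columns above into skip masks is sketched below in Python. It rests on two assumptions that are not spelled out in the table: the SKIPF sits immediately before the RDLUT in row 2, and the usual SKIPF convention applies (bit 0 controls the first instruction after SKIPF, and a 1 bit means skip). The row sets are read straight off the table; the names `COLUMNS`, `skip_mask` and `longest_skip` are illustrative.

```python
FIRST_ROW, LAST_ROW = 2, 22   # rows 2..22 are the instructions after the REP

# Rows that each mode actually executes, read off the table above.
COLUMNS = {
    "doublebits":    {2, 8, 9, 11, 14, 15, 16, 22},
    "doublenits":    {2, 4, 5, 6, 7, 11, 14, 16, 17, 18, 22},
    "doublenibbles": {2, 3, 8, 9, 10, 12, 13, 14, 15, 19, 20, 21, 22},
    "doublebytes":   {2, 8, 9, 14, 15, 22},
    "doublewords":   {2, 8, 9, 14, 15, 22},
}

def skip_mask(rows):
    """Set bit (row - FIRST_ROW) for every row the mode does not use."""
    mask = 0
    for row in range(FIRST_ROW, LAST_ROW + 1):
        if row not in rows:
            mask |= 1 << (row - FIRST_ROW)
    return mask

def longest_skip(mask):
    """Length of the longest run of consecutive skipped instructions."""
    run = best = 0
    for bit in range(LAST_ROW - FIRST_ROW + 1):
        run = run + 1 if (mask >> bit) & 1 else 0
        best = max(best, run)
    return best

for name, rows in COLUMNS.items():
    m = skip_mask(rows)
    print(f"{name:14s} count={len(rows):2d} skip=%{m:020b} longest={longest_skip(m)}")
```

The instruction counts come out as 8, 11, 13, 6 and 6, the same as the REP counts in row 1, and the longest run of skipped instructions is 6. The masks computed for doublenits, doublenibbles and doublebytes agree with the 20-bit skip fields rogloh posts later in the thread.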
    
  • cgracey Posts: 14,206
    evanh wrote: »
    Thanks for the update Chip. That is a number of instructions alright. COGINIT is the only one not likely to cause confusion. The RDLUT case, is that because of Xbyte feature?

    RDLUT takes three clocks. I couldn't do it in two because of compounded LUT-access and result-mux'ing delays. So, it writes the read value into Q and there is already conduit in place to feed Q to the result quickly.

    XBYTE doesn't affect Q, since it mux's the LUT read value directly into D for the hidden EXECF instruction that XBYTE does.
  • evanh Posts: 16,032
    edited 2019-10-12 03:36
    cgracey wrote: »
    XBYTE doesn't affect Q, since it mux's the LUT read value directly into D for the hidden EXECF instruction that XBYTE does.
    I'd guessed wrong there. Good extra info though ... now I think about it, if Xbyte used Q then that would mess with the bytecodes it's executing.

  • cgracey Posts: 14,206
    I made some graphics tonight for the bit-twiddling instructions.

    P2_Instruction_SPLITB_MERGEB.png
    P2_instruction_SPLITW_MERGEW.png
    P2_instruction_SEUSSF_SEUSSR.png
  • Very nice to be able to see it visually.

    The Seuss one is certainly a weird looking thing. What applications would it typically be used for? Pseudo random sequences perhaps, though I guess it would become cyclic after only 32 operations. Some type of noise generator where the actual value of the long changes the delay until the next oscillator pulse..?
  • Yes, I need to get my head around this too, so I created ASM definitions for these functions and put them in a walking ones loop and printed the results for each operation.
    TAQOZ# DEMO --- 
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._...._...._...1
    SPLITB %...._...._...._...._...._...._...._...1
    MERGEB %...._...._...._...._...._...._...._...1
    SPLITW %...._...._...._...._...._...._...._...1
    MERGEW %...._...._...._...._...._...._...._...1
    SEUSSF %..11_.1.1_.1.._11.1_1.1._.11._.1.1_...1
    SEUSSR %111._1.1._.1.1_.1.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._...._...._..1.
    SPLITB %...._...._...._...._...._...1_...._....
    MERGEB %...._...._...._...._...._...._...1_....
    SPLITW %...._...._...._...1_...._...._...._....
    MERGEW %...._...._...._...._...._...._...._.1..
    SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.111_...1
    SEUSSR %111._1.11_.111_.1.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._...._...._.1..
    SPLITB %...._...._...._...1_...._...._...._....
    MERGEB %...._...._...._...._...._...1_...._....
    SPLITW %...._...._...._...._...._...._...._..1.
    MERGEW %...._...._...._...._...._...._...1_....
    SEUSSF %..11_.1.1_.1.._1..1_1.1._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_11.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._...._...._1...
    SPLITB %...._...1_...._...._...._...._...._....
    MERGEB %...._...._...._...._...1_...._...._....
    SPLITW %...._...._...._..1._...._...._...._....
    MERGEW %...._...._...._...._...._...._.1.._....
    SEUSSF %..11_.1.._.1.._11.1_1.1._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_...1_..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._...._...1_....
    SPLITB %...._...._...._...._...._...._...._..1.
    MERGEB %...._...._...._...1_...._...._...._....
    SPLITW %...._...._...._...._...._...._...._.1..
    MERGEW %...._...._...._...._...._...1_...._....
    SEUSSF %..11_11.1_.1.._11.1_1.1._111._.1.1_...1
    SEUSSR %11.._1.11_.1.1_.1.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._...._..1._....
    SPLITB %...._...._...._...._...._..1._...._....
    MERGEB %...._...._...1_...._...._...._...._....
    SPLITW %...._...._...._.1.._...._...._...._....
    MERGEW %...._...._...._...._...._.1.._...._....
    SEUSSF %..11_.1.1_.1.._.1.1_1.1._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_...._..11_..1._1111
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._...._.1.._....
    SPLITB %...._...._...._..1._...._...._...._....
    MERGEB %...._...1_...._...._...._...._...._....
    SPLITW %...._...._...._...._...._...._...._1...
    MERGEW %...._...._...._...._...1_...._...._....
    SEUSSF %..11_.1.1_.1.1_11.1_1.1._111._.1.1_...1
    SEUSSR %1.1._1.11_.1.1_.1.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._...._1..._....
    SPLITB %...._..1._...._...._...._...._...._....
    MERGEB %...1_...._...._...._...._...._...._....
    SPLITW %...._...._...._1..._...._...._...._....
    MERGEW %...._...._...._...._.1.._...._...._....
    SEUSSF %.111_.1.1_.1.._11.1_1.1._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_.1.._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._...1_...._....
    SPLITB %...._...._...._...._...._...._...._.1..
    MERGEB %...._...._...._...._...._...._...._..1.
    SPLITW %...._...._...._...._...._...._...1_....
    MERGEW %...._...._...._...1_...._...._...._....
    SEUSSF %..1._.1.1_.1.._11.1_1.1._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_..1._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._..1._...._....
    SPLITB %...._...._...._...._...._.1.._...._....
    MERGEB %...._...._...._...._...._...._..1._....
    SPLITW %...._...._...1_...._...._...._...._....
    MERGEW %...._...._...._.1.._...._...._...._....
    SEUSSF %..11_...1_.1.._11.1_1.1._111._.1.1_...1
    SEUSSR %111._1.11_...1_.1.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._.1.._...._....
    SPLITB %...._...._...._.1.._...._...._...._....
    MERGEB %...._...._...._...._...._..1._...._....
    SPLITW %...._...._...._...._...._...._..1._....
    MERGEW %...._...._...1_...._...._...._...._....
    SEUSSF %..11_.1.1_.11._11.1_1.1._111._.1.1_...1
    SEUSSR %111._..11_.1.1_.1.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...._1..._...._....
    SPLITB %...._.1.._...._...._...._...._...._....
    MERGEB %...._...._...._...._..1._...._...._....
    SPLITW %...._...._..1._...._...._...._...._....
    MERGEW %...._...._.1.._...._...._...._...._....
    SEUSSF %..11_.111_.1.._11.1_1.1._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_...._..11_..1._11..
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._...1_...._...._....
    SPLITB %...._...._...._...._...._...._...._1...
    MERGEB %...._...._...._..1._...._...._...._....
    SPLITW %...._...._...._...._...._...._.1.._....
    MERGEW %...._...1_...._...._...._...._...._....
    SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.1.1_1..1
    SEUSSR %111._1.11_.1.1_.111_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._..1._...._...._....
    SPLITB %...._...._...._...._...._1..._...._....
    MERGEB %...._...._..1._...._...._...._...._....
    SPLITW %...._...._.1.._...._...._...._...._....
    MERGEW %...._.1.._...._...._...._...._...._....
    SEUSSF %..11_.1.1_.1.._11.1_1.1._1111_.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.._...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._.1.._...._...._....
    SPLITB %...._...._...._1..._...._...._...._....
    MERGEB %...._..1._...._...._...._...._...._....
    SPLITW %...._...._...._...._...._...._1..._....
    MERGEW %...1_...._...._...._...._...._...._....
    SEUSSF %..11_.1.1_.1.._11.1_1.1._111._11.1_...1
    SEUSSR %1111_1.11_.1.1_.1.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...._1..._...._...._....
    SPLITB %...._1..._...._...._...._...._...._....
    MERGEB %..1._...._...._...._...._...._...._....
    SPLITW %...._...._1..._...._...._...._...._....
    MERGEW %.1.._...._...._...._...._...._...._....
    SEUSSF %..11_.1.1_11.._11.1_1.1._111._.1.1_...1
    SEUSSR %111._1.11_.1.._.1.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._...1_...._...._...._....
    SPLITB %...._...._...._...._...._...._...1_....
    MERGEB %...._...._...._...._...._...._...._.1..
    SPLITW %...._...._...._...._...._...1_...._....
    MERGEW %...._...._...._...._...._...._...._..1.
    SEUSSF %..11_.1.1_.1.._11.1_1..._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_...1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._..1._...._...._...._....
    SPLITB %...._...._...._...._...1_...._...._....
    MERGEB %...._...._...._...._...._...._.1.._....
    SPLITW %...._...1_...._...._...._...._...._....
    MERGEW %...._...._...._...._...._...._...._1...
    SEUSSF %..11_.1.1_.1.._11.1_1.11_111._.1.1_...1
    SEUSSR %111._1111_.1.1_.1.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._.1.._...._...._...._....
    SPLITB %...._...._...1_...._...._...._...._....
    MERGEB %...._...._...._...._...._.1.._...._....
    SPLITW %...._...._...._...._...._..1._...._....
    MERGEW %...._...._...._...._...._...._..1._....
    SEUSSF %..11_.1.1_.1.._11.._1.1._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_...._..11_..1._1..1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...._1..._...._...._...._....
    SPLITB %...1_...._...._...._...._...._...._....
    MERGEB %...._...._...._...._.1.._...._...._....
    SPLITW %...._..1._...._...._...._...._...._....
    MERGEW %...._...._...._...._...._...._1..._....
    SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.1.1_.1.1
    SEUSSR %111._1.11_.1.1_.1.1_...._..11_...._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._...1_...._...._...._...._....
    SPLITB %...._...._...._...._...._...._..1._....
    MERGEB %...._...._...._.1.._...._...._...._....
    SPLITW %...._...._...._...._...._.1.._...._....
    MERGEW %...._...._...._...._...._..1._...._....
    SEUSSF %..11_.1.1_.1.._11.1_..1._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_...._..11_.11._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._..1._...._...._...._...._....
    SPLITB %...._...._...._...._..1._...._...._....
    MERGEB %...._...._.1.._...._...._...._...._....
    SPLITW %...._.1.._...._...._...._...._...._....
    MERGEW %...._...._...._...._...._1..._...._....
    SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.1.1_..11
    SEUSSR %111._1.11_.1.1_.1.1_...._.111_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._.1.._...._...._...._...._....
    SPLITB %...._...._..1._...._...._...._...._....
    MERGEB %...._.1.._...._...._...._...._...._....
    SPLITW %...._...._...._...._...._1..._...._....
    MERGEW %...._...._...._...._..1._...._...._....
    SEUSSF %..11_.1.1_.1.._11.1_1.1._11.._.1.1_...1
    SEUSSR %.11._1.11_.1.1_.1.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...._1..._...._...._...._...._....
    SPLITB %..1._...._...._...._...._...._...._....
    MERGEB %.1.._...._...._...._...._...._...._....
    SPLITW %...._1..._...._...._...._...._...._....
    MERGEW %...._...._...._...._1..._...._...._....
    SEUSSF %1.11_.1.1_.1.._11.1_1.1._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_1..._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._...1_...._...._...._...._...._....
    SPLITB %...._...._...._...._...._...._.1.._....
    MERGEB %...._...._...._...._...._...._...._1...
    SPLITW %...._...._...._...._...1_...._...._....
    MERGEW %...._...._...._..1._...._...._...._....
    SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.1.1_....
    SEUSSR %111._1.11_.1.1_.1.1_...._..11_..1._.1.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._..1._...._...._...._...._...._....
    SPLITB %...._...._...._...._.1.._...._...._....
    MERGEB %...._...._...._...._...._...._1..._....
    SPLITW %...1_...._...._...._...._...._...._....
    MERGEW %...._...._...._1..._...._...._...._....
    SEUSSF %...1_.1.1_.1.._11.1_1.1._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_...._1.11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._.1.._...._...._...._...._...._....
    SPLITB %...._...._.1.._...._...._...._...._....
    MERGEB %...._...._...._...._...._1..._...._....
    SPLITW %...._...._...._...._..1._...._...._....
    MERGEW %...._...._..1._...._...._...._...._....
    SEUSSF %..11_.1.1_.1.._1111_1.1._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_...._...1_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...._1..._...._...._...._...._...._....
    SPLITB %.1.._...._...._...._...._...._...._....
    MERGEB %...._...._...._...._1..._...._...._....
    SPLITW %..1._...._...._...._...._...._...._....
    MERGEW %...._...._1..._...._...._...._...._....
    SEUSSF %..11_.1.1_.1.._11.1_1.1._1.1._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_...._..11_..11_11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %...1_...._...._...._...._...._...._....
    SPLITB %...._...._...._...._...._...._1..._....
    MERGEB %...._...._...._1..._...._...._...._....
    SPLITW %...._...._...._...._.1.._...._...._....
    MERGEW %...._..1._...._...._...._...._...._....
    SEUSSF %..11_.1.1_.1.._11.1_111._111._.1.1_...1
    SEUSSR %111._1.11_.1.1_.1.1_...._..1._..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %..1._...._...._...._...._...._...._....
    SPLITB %...._...._...._...._1..._...._...._....
    MERGEB %...._...._1..._...._...._...._...._....
    SPLITW %.1.._...._...._...._...._...._...._....
    MERGEW %...._1..._...._...._...._...._...._....
    SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.1.._...1
    SEUSSR %111._1..1_.1.1_.1.1_...._..11_..1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %.1.._...._...._...._...._...._...._....
    SPLITB %...._...._1..._...._...._...._...._....
    MERGEB %...._1..._...._...._...._...._...._....
    SPLITW %...._...._...._...._1..._...._...._....
    MERGEW %..1._...._...._...._...._...._...._....
    SEUSSF %..11_.1.1_.1.._11.1_1.1._111._...1_...1
    SEUSSR %111._1.11_.1.1_.1.1_...._..11_1.1._11.1
    
            1098_7654_3210_9876_5432_1098_7654_3210
    D      %1..._...._...._...._...._...._...._....
    SPLITB %1..._...._...._...._...._...._...._....
    MERGEB %1..._...._...._...._...._...._...._....
    SPLITW %1..._...._...._...._...._...._...._....
    MERGEW %1..._...._...._...._...._...._...._....
    SEUSSF %..11_.1.1_...._11.1_1.1._111._.1.1_...1
    SEUSSR %111._1.11_11.1_.1.1_...._..11_..1._11.1
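The four bit-shuffle rows of the walking-ones output can be reproduced with a small Python model. This is a sketch: the bit mappings were inferred from the listing above (for example, D bit 1 lands on bit 8 for SPLITB, bit 4 for MERGEB, bit 16 for SPLITW and bit 2 for MERGEW), the function names are illustrative, and SEUSSF/SEUSSR are deliberately left out because their scrambling cannot be inferred from the printout.

```python
def mergew(s):
    """Interleave the two words: low word to even bits, high word to odd bits."""
    d = 0
    for i in range(16):
        d |= ((s >> i) & 1) << (2 * i)
        d |= ((s >> (16 + i)) & 1) << (2 * i + 1)
    return d

def splitw(s):
    """Inverse of mergew: even bits to the low word, odd bits to the high word."""
    d = 0
    for i in range(16):
        d |= ((s >> (2 * i)) & 1) << i
        d |= ((s >> (2 * i + 1)) & 1) << (16 + i)
    return d

def mergeb(s):
    """Interleave the four bytes: bit k of byte b moves to bit 4*k + b."""
    d = 0
    for b in range(4):
        for k in range(8):
            d |= ((s >> (8 * b + k)) & 1) << (4 * k + b)
    return d

def splitb(s):
    """Inverse of mergeb: bit 4*k + b moves to bit k of byte b."""
    d = 0
    for b in range(4):
        for k in range(8):
            d |= ((s >> (4 * k + b)) & 1) << (8 * b + k)
    return d

def dots(v):
    """Format a long like the listing: '.' for 0, groups of four bits."""
    bits = f"{v:032b}".replace("0", ".")
    return "%" + "_".join(bits[i:i + 4] for i in range(0, 32, 4))

d = 1 << 1
for name, fn in (("SPLITB", splitb), ("MERGEB", mergeb), ("SPLITW", splitw), ("MERGEW", mergew)):
    print(f"{name} {dots(fn(d))}")
```

Changing `d` walks the single 1 bit through the same positions as the listing, which makes it easy to try combinations of these shuffles on a desktop before committing them to PASM2.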
    
  • I was able to get the skipf based pixel doubling working now. Only saved 16 longs but it's something. Found you can't do a SKIPF outside the REP block (as I expected) but it worked when inside the REP loop. Right now the doubling of longs is included in it too but that's blown the budget out to more than half a scanline. Fine for pure DVI right now but ultimately I'd like to optimize further if possible. The 320x240 truecolor pixel doubling is a bit of a processor pig, and the 16bpp is pretty tight now too. The SKIPF adds a two clock penalty in each inner loop iteration, plus one more dummy instruction is needed to avoid the skip of over 7 instructions in 24bpp mode, making the inner loop total alone in that mode 11 clocks per iteration x 320 iterations = 3520 cycles (352 pixels), before outer loop overhead and 960 long hub transfers are included. I also need to get the outer loop overhead down by increasing the burst size in these two more cycle-intensive modes. This is doable if I use more of the LUT and trash the palette space, to get burst sizes larger than just the 40 longs at a time I used today, and the current outer loop penalty is in the order of 50 clocks x 8 transfers which is already 400 clocks of my ideal 3200 budget. Using up more LUT space for this transfer will work okay for these two modes because neither colour mode uses a palette anyway.
    doublepixels
                rep     instcount, readburst
                skipf   pattern
                rdlut   a, ptra++
                setq    nibblemask
                getword b, a, #1
                getword a, a, #0
                mergew  a
                mul     a, #3
                mov     b, a
                movbyts a, bytemask1
                mov     pb, a
                mergew  a
                shl     pb, #4
                muxq    a, pb
                wrlut   a, ptrb++
                movbyts b, bytemask2
                mergew  b
                mul     b, #3
                mergew  b
                mov     pb, b
                shl     pb, #4
                muxq    b, pb
                wrlut   b, ptrb++
                ret
    
    nibblemask  long    $0ff00ff0
    instcount   long    0
    pattern     long    0
    bytemask1   long    %%1100
    bytemask2   long    %%3322
    '                   ____skip_pattern______inst
    doublebits  long    %11111000110100111111_1001
    doublenits  long    %11100010110111000010_1100
    doublenibs  long    %00011100001000111100_1110
    doublebytes long    %11111100111100111110_0111
    doublewords long    %11111100111100111111_0111
    doublelongs long    %11111110111110111110_0101
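The cycle figures quoted in this post can be laid out as a small Python calculation. Several inputs are assumptions rather than facts from the post: 10 clocks per pixel is inferred from "3520 cycles (352 pixels)", one source long per 24bpp pixel is inferred from 8 bursts of 40 longs, and the roughly 50 clock per-burst overhead is assumed to stay the same when the burst size grows. Hub transfer time is left out because the post gives no figure for it.

```python
import math

CLOCKS_PER_PIXEL = 10   # assumption, inferred from "3520 cycles (352 pixels)"
PIXELS = 320            # source pixels per line
INNER_CLOCKS = 11       # per inner-loop iteration in 24bpp mode (from the post)
OUTER_CLOCKS = 50       # rough overhead per burst (from the post)
SOURCE_LONGS = 320      # assumption: one long per 24bpp source pixel

def inner_total():
    """Clocks spent in the REP loop for one scanline."""
    return INNER_CLOCKS * PIXELS

def outer_total(burst_longs):
    """Outer-loop overhead for one scanline at a given burst size."""
    return math.ceil(SOURCE_LONGS / burst_longs) * OUTER_CLOCKS

budget = PIXELS * CLOCKS_PER_PIXEL
print("inner loop :", inner_total(), "clocks =", inner_total() // CLOCKS_PER_PIXEL, "pixels")
print("outer @40  :", outer_total(40), "clocks")
print("outer @80  :", outer_total(80), "clocks")
print("budget     :", budget, "clocks")
```

With these inputs the inner loop alone (3520 clocks) is already over the 3200 clock target, and doubling the burst size to 80 longs would only recover about 200 clocks of outer-loop overhead.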
    
  • cgracey Posts: 14,206
    edited 2019-10-12 17:09
    So, the 320x240 RGB24 mode is the most tedious? The problem is just getting pixels to repeat, right?

    Have you thought of just doing two XCONT's per pixel, using the same data?
  • cgracey Posts: 14,206
    If I had known this was going to be an issue, I could have easily made the TMDS serializer repeat data until new data came.
  • TonyB_ Posts: 2,196
    edited 2019-10-12 20:45
    rogloh wrote: »
    I was able to get the skipf based pixel doubling working now. Only saved 16 longs but it's something. Found you can't do a SKIPF outside the REP block (as I expected) but it worked when inside the REP loop.<snip>
    doublepixels
                rep     instcount, readburst
                skipf   pattern
                rdlut   a, ptra++
                setq    nibblemask
                getword b, a, #1
                getword a, a, #0
                mergew  a
                mul     a, #3
                mov     b, a
                movbyts a, bytemask1
                mov     pb, a
                mergew  a
                shl     pb, #4
                muxq    a, pb
                wrlut   a, ptrb++
                movbyts b, bytemask2
                mergew  b
                mul     b, #3
                mergew  b
                mov     pb, b
                shl     pb, #4
                muxq    b, pb
                wrlut   b, ptrb++
                ret
    
    nibblemask  long    $0ff00ff0
    instcount   long    0
    pattern     long    0
    bytemask1   long    %%1100
    bytemask2   long    %%3322
    '                   ____skip_pattern______inst
    doublebits  long    %11111000110100111111_1001
    doublenits  long    %11100010110111000010_1100
    doublenibs  long    %00011100001000111100_1110
    doublebytes long    %11111100111100111110_0111
    doublewords long    %11111100111100111111_0111
    doublelongs long    %11111110111110111110_0101
    

    Does the 2-bit doubler work? It seems too easy in your code. Also LSB should be 0 for all skip patterns?
  • Programming with SKIP(F) is so ugly...
    The assembly language really needs to be extended to allow defining skip patterns alongside the actual code.
  • Thanks, Chip. Could be worth
    Wuerfel_21 wrote: »
    Programming with SKIP(F) is so ugly...
    The assembly language really needs to be extended to allow defining skip patterns alongside the actual code.

    It might look ugly but it's really beautiful!
  • Yeah, it's beautiful in how small and elegant it can solve an issue, but writing (or worse, maintaining) skip code seems to be pretty nightmarish with the current tools.
  • jmgjmg Posts: 15,175
    Wuerfel_21 wrote: »
    Programming with SKIP(F) is so ugly...
    The assembly language really needs to be extended to allow defining skip patterns alongside the actual code.

    That would be 'nice to have', but how to approach it ?
    I did think of a spreadsheet column scheme, but that needs strict column align and is not clear to read, and is making P2 even more obtuse to new users....


    More work in the Assembler SW, but easier for users, would be to allow a CASE style syntax where the code is written multiple times, and then packed using SKIP.
    The ASM list file could report the final SKIP packing in a column-tagged form, or compressed to a single code block.

    That would be easy to work on, and easy to check if packed as users hoped.
    Errors would generate when SKIP was exceeded, or SKIP illegal constructs were used.
  • Yes, I thought of a column scheme, too. Not ideal, but very significantly easier to write and read than magic binary literals somewhere else.

    But the CASE idea is pretty neat, too.
    It wouldn't even be super difficult to implement (based on my experience writing my own P1 XMM Ruby DSL COUGH I mean macro assembler :smile: )
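To make the "patterns alongside the code" idea concrete, here is a minimal Python sketch of a column-tag generator. It is not an existing tool: the `ROUTINE` layout, the one-letter mode tags and `pack_entry` are made up for illustration. The instruction list and the values it is checked against come from rogloh's doublepixels routine quoted earlier in the thread, restricted to the three modes whose table entries have a clear pattern LSB. It assumes pattern bit 0 applies to the first instruction after SKIPF and that the low nibble counts the instructions that remain, SKIPF included, which is what the posted table values suggest.

```python
# Hypothetical column-tag table: (instruction, modes that execute it).
# Tags (made up here): N = 4-bit pixels, B = 8-bit pixels, L = 32-bit pixels.
# An empty tag marks an instruction used only by modes not modelled here.
ROUTINE = [
    ("rdlut   a, ptra++",    "NBL"),
    ("setq    nibblemask",   "N"),
    ("getword b, a, #1",     ""),
    ("getword a, a, #0",     ""),
    ("mergew  a",            ""),
    ("mul     a, #3",        ""),
    ("mov     b, a",         "NBL"),
    ("movbyts a, bytemask1", "NB"),
    ("mov     pb, a",        "N"),
    ("mergew  a",            ""),
    ("shl     pb, #4",       "N"),
    ("muxq    a, pb",        "N"),
    ("wrlut   a, ptrb++",    "NBL"),
    ("movbyts b, bytemask2", "NB"),
    ("mergew  b",            ""),
    ("mul     b, #3",        ""),
    ("mergew  b",            ""),
    ("mov     pb, b",        "N"),
    ("shl     pb, #4",       "N"),
    ("muxq    b, pb",        "N"),
    ("wrlut   b, ptrb++",    "NBL"),
]

def pack_entry(mode):
    """Build the table long for one mode: skip pattern above a 4-bit count."""
    pattern = 0
    executed = 0
    for bit, (_, tags) in enumerate(ROUTINE):
        if mode in tags:
            executed += 1            # instruction kept: pattern bit stays 0
        else:
            pattern |= 1 << bit      # instruction skipped: pattern bit set to 1
    count = executed + 1             # assumed: the count includes SKIPF itself
    if count > 15:
        raise ValueError("count does not fit in the low nibble")
    return (pattern << 4) | count

for mode in "NBL":
    print(mode, format(pack_entry(mode), "024b"))
```

With the tags kept next to the instructions, adding or removing an instruction changes every pattern automatically instead of requiring each binary literal to be re-edited by hand.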
  • cgracey wrote: »
    So, the 320x240 RGB24 mode is the most tedious? The problem is just getting pixels to repeat, right?

    Have you thought of just doing two XCONT's per pixel, using the same data?

    That might work, but then I think I would need to stick around and do 320 XCONT instructions during the time I wish to process the next scanline, and it steals cycles from me doing that. Maybe a regular interrupt routine could do it in the background, but I am also using REP blocks. I prefer to just send a single XCONT to the streamer to buy a nice big block of time (~640 pixels worth) to be able to go off and process a bunch of things for a while in relative peace.
    TonyB_ wrote: »
    Does the 2-bit doubler work? It seems too easy in your code. Also LSB should be 0 for all skip patterns?

    You've got a good eye. Yes, I make use of the skip pattern's LSB for an additional purpose; this setup code was not shown above. I test and clear it, and this indicates whether the two byte mask values used need to be xor'd or not. These masks also vary per colour depth, as do the repeat block length and skip patterns. Here's that part. "a" below is a pointer to the double table entry I've computed earlier per mode.
                sets    lookup, a
                mov     bytemask1, #%%1100      'setup default masks
                mov     bytemask2, #%%3322
    lookup      mov     pattern, 0-0            'get the long data for this bpp
                getnib  instcount, pattern, #0  'get the number of rep instructions
                shr     pattern, #4             'move down to next data
                bitl    pattern, #0 wcz         'test and clear LSB of pattern
        if_c    xor     bytemask1, #%00010100   'use it to flip the masks
        if_c    xor     bytemask2, #%00010100
    

    The two bit doubler appears to double the pixel size on the screen but looking again just now I am noticing that the colour palette is doing something strange. It might not be correct and may be repeating the same bit instead of an actual pair, I will check into it...
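As a cross-check of the setup code above, the following Python sketch decodes each table entry the same way: low nibble to the REP count, the remaining bits to the skip pattern, and the pattern's LSB as the mask-flip flag. Function and variable names are illustrative, and the figure of 21 instructions after SKIPF is counted from the doublepixels listing quoted earlier in the thread. The check that the REP count equals the unskipped instructions plus SKIPF itself is an observation about the posted values, not a statement of how the hardware counts.

```python
TABLE = {
    "doublebits":  0b11111000110100111111_1001,
    "doublenits":  0b11100010110111000010_1100,
    "doublenibs":  0b00011100001000111100_1110,
    "doublebytes": 0b11111100111100111110_0111,
    "doublewords": 0b11111100111100111111_0111,
    "doublelongs": 0b11111110111110111110_0101,
}
BODY_LEN = 21  # instructions between SKIPF and RET in the quoted listing

def quad(mask):
    """Format an 8-bit MOVBYTS selector as four base-4 digits (%% notation)."""
    return "".join(str((mask >> s) & 3) for s in (6, 4, 2, 0))

def decode(entry):
    instcount = entry & 0xF          # getnib instcount, pattern, #0
    pattern = entry >> 4             # shr    pattern, #4
    flip = pattern & 1               # bitl   pattern, #0 wcz (C = old bit)
    pattern &= ~1
    mask1 = 0b01_01_00_00            # %%1100
    mask2 = 0b11_11_10_10            # %%3322
    if flip:                         # if_c xor bytemaskN, #%00010100
        mask1 ^= 0b00010100
        mask2 ^= 0b00010100
    executed = BODY_LEN - bin(pattern).count("1")
    return instcount, pattern, quad(mask1), quad(mask2), executed

for name, entry in TABLE.items():
    instcount, pattern, m1, m2, executed = decode(entry)
    print(f"{name:12} rep={instcount:2} executed={executed:2} "
          f"masks=%%{m1} %%{m2} pattern={pattern:020b}")
```

The flag turns the byte-doubling selectors into word-replicating ones, which is why doublebytes and doublewords can share the same skip pattern and differ only in the LSB.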
  • Maybe the streamer could help with doubling RGB32. Create a table of bytes in hub ram [0,0,1,1,2,2,...,254,254,255,255] and rdfast that address. Put the pixel data in the LUT. Start the streamer in 8 bit LUT mode.

    A minor problem is we can't get all the way across the screen on 1 xcont. After 256 input pixels we'd need to do another xcont with a different LUT offset to get the rest. Or we could split the screen width in half. I don't think it's even that bad to have to use 2 xconts to get the full screen width, as we can load one part while the streamer displays the other part.
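A quick host-side model of this idea, in Python rather than PASM2. It only mimics the data flow (index bytes fetched in order, each one selecting a LUT entry relative to a base offset) and does not reproduce any actual streamer command or mode encoding. `build_index_table` and `stream_doubled` are illustrative names, and the split into 256-entry parts follows the description above.

```python
def build_index_table():
    """Hub table [0,0,1,1,...,255,255]: 512 bytes, each index appearing twice."""
    return bytes(i // 2 for i in range(512))

def stream_doubled(lut_pixels):
    """Model one scanline: one pass per 256-entry LUT window (one xcont each)."""
    table = build_index_table()
    out = []
    for base in range(0, len(lut_pixels), 256):      # second pass uses offset $100
        count = min(256, len(lut_pixels) - base)     # entries in this window
        for index in table[:2 * count]:              # two index bytes per pixel
            out.append(lut_pixels[base + index])
    return out

# 320 source pixels (stand-in values) become 640 output pixels.
line = [0x10000000 + (n << 8) for n in range(320)]
doubled = stream_doubled(line)
print(len(doubled), hex(doubled[0]), hex(doubled[1]), hex(doubled[2]))
```

The same 512-byte index table serves both passes; only the LUT base changes between them.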
  • It's a very interesting idea Saucy! I think we could certainly load up all 320 into the LUT and switch palette base address at the 256 pixel mark from 0 to $100 with a second xcont instruction. I was trying to keep the first 64 entries in the upper LUTRAM half available anyway for general transfer use for other reasons so this might have merit. Certainly if the entire screen is in the same mode I could see it may work.

    In the future, if each scanline can run in a different mode and uses LUTRAM in different ways, there could be problems in sharing it and setting it up for the next line while it is currently in use.

    I guess the mouse sprite rendering would also have to be done differently in that case, as it would need to render into LUT instead of HUB, as I was planning.

    One thing I've found with the LUT is that you need to keep the lower byte 0, otherwise the TMDS output can become corrupted. I think this is because some bit(s) of the lower byte indicate to issue a direct 10b symbol instead of translating RGB video symbols to 10b.
  • cgraceycgracey Posts: 14,206
    If bit1 is high in the HDMI data, it goes into literal mode, using the 30 bits above bit 1 to make three 10-bit fields. Bit 0 is normally used in VGA mode for HSYNC. I have not tried it, but it should be possible to output both HDMI and VGA simultaneously from the same streamer.
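A small Python sketch of the bit layout as described here and in the previous post: bit 1 selects literal mode, bits 31..2 split into three 10-bit fields, and otherwise the upper three bytes presumably carry the pixel. The function name, the returned dictionary and the order in which fields and bytes are listed are assumptions for illustration. Which field or byte drives which TMDS channel is not stated in this thread and is not modelled.

```python
def decode_streamer_long(value):
    """Classify a 32-bit HDMI streamer long per the description in the thread."""
    value &= 0xFFFFFFFF
    if value & 0b10:
        # Literal mode: bits 31..2 form three raw 10-bit fields (lowest first).
        fields = [(value >> shift) & 0x3FF for shift in (2, 12, 22)]
        return {"mode": "literal", "fields": fields}
    # Normal mode: upper three bytes listed lowest first; bit 0 kept separately
    # because it is described as the HSYNC bit in VGA mode.
    data = [(value >> shift) & 0xFF for shift in (8, 16, 24)]
    return {"mode": "pixel", "bytes": data, "bit0": value & 1}

print(decode_streamer_long(0x11223300))   # low byte clear: treated as a pixel
print(decode_streamer_long(0x11223302))   # bit 1 set: switches to literal mode
```

This is why a stray non-zero low byte in a LUT entry can corrupt the output: setting bit 1 changes how the whole long is interpreted.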
  • roglohrogloh Posts: 5,837
    edited 2019-10-13 04:29
    TonyB_ wrote: »
    Does the 2-bit doubler work? It seems too easy in your code. Also LSB should be 0 for all skip patterns?
    LOL, I don't know what I was dreaming coming up with that doubling sequence, it's broken for sure. Even the multiply by 3 doesn't work if it's using a 16x16 multiplier.

    I have looked further at this with Chip's diagrams and I am thinking I finally might have come across a simple sequence for the 2-bit doubling.
    32 bit initial input is:     XXXXXXXX_XXXXXXXX_abcdefgh_ijklmnop 
    
    SPLITW  a          ' result  XXXXXXXX_acegikmo_XXXXXXXX_bdfhjlnp  
    MOVBYTS a, #%%2020 ' result  acegikmo_bdfhjlnp_acegikmo_bdfhjlnp
    MERGEB  a          ' result  ababcdcd_efefghgh_ijijklkl_mnmnopop
    
    
    Simply repeat for the upper 16 bits, just change the MOVBYTS value accordingly.

    I think this is right, or have I messed up again...nothing surprises me with this shuffling thing, it's been doing my head in. I'll test it out.

    There could be something similar for replicating nibbles too. I'm almost convinced of it. :smile:
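The sequence can be checked on a host before trying it on hardware. The Python below models SPLITW, MOVBYTS and MERGEB so that they match the intermediate results shown in the listing above, then compares the outcome against a plain per-pixel doubling loop. The `%%3131` selector for the upper word is my reading of "change the MOVBYTS value accordingly" and is an assumption, not something posted in the thread.

```python
def splitw(d):
    """Odd bits of d go to the high word, even bits to the low word."""
    hi = lo = 0
    for i in range(16):
        lo |= ((d >> (2 * i)) & 1) << i
        hi |= ((d >> (2 * i + 1)) & 1) << i
    return (hi << 16) | lo

def movbyts(d, selector):
    """Destination byte n takes the source byte chosen by selector field n."""
    out = 0
    for n in range(4):
        src = (selector >> (2 * n)) & 3
        out |= ((d >> (8 * src)) & 0xFF) << (8 * n)
    return out

def mergeb(d):
    """Interleave the four bytes bit by bit: result bit 4*i+j = d bit 8*j+i."""
    out = 0
    for i in range(8):
        for j in range(4):
            out |= ((d >> (8 * j + i)) & 1) << (4 * i + j)
    return out

def double_low(d):
    return mergeb(movbyts(splitw(d), 0b10_00_10_00))    # %%2020, as posted

def double_high(d):
    return mergeb(movbyts(splitw(d), 0b11_01_11_01))    # %%3131, assumed

def reference(word16):
    """Straightforward doubling of eight 2-bit pixels into 32 bits."""
    out = 0
    for k in range(8):
        pixel = (word16 >> (2 * k)) & 3
        out |= (pixel | (pixel << 2)) << (4 * k)
    return out

sample = 0xDEAD1B2D
print(f"{double_low(sample):032b}")
print(f"{double_high(sample):032b}")
```

If the model agrees with the reference loop for every input tried, the three-instruction sequence is at least self-consistent; it still needs confirming on a real P2.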