All PASM2 gurus - help optimizing a text driver over DVI?

cgracey · 2019-10-11 13:47

evanh wrote: »

cgracey wrote: »

Beware that interrupt routines may disrupt Q's value. So, it's safest to always do a fresh SETQ.

That's not much of an argument for an object that is running the whole cog - without interrupts. And debuggers can preserve Q using MUXQ and SETQ. And, the optimised, but buggy, example above actually is protected from interrupts anyway. The SETQ is followed immediately by a REP.

Good idea about using MUXQ to read Q.

evanh · 2019-10-11 13:55

Huh, I guess that can be used inside any ISR as well. Putting them back in limits for this.

TonyB_ · 2019-10-11 14:50

rogloh wrote: »

Here is the working code, pixels doubled properly, 320 on screen.

doublenibbles
            rep     #14, readburst
            rdlut   b, ptra++
            getword a, b, #1                
            movbyts b, #%%1100
            mov     pb, b
            shl     pb, #4
            setq    nibblemask
            muxq    b, pb
            wrlut   b, ptrb++
            movbyts a, #%%1100
            mov     pb, a
            shl     pb, #4
            setq    nibblemask
            muxq    a, pb
            wrlut   a, ptrb++
            ret

Is the second setq nibblemask needed?

evanh · 2019-10-11 15:11

Good point, a singular SETQ can be placed anywhere between the RDLUT and first MUXQ.
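
A minimal Python sketch (a model, not PASM2) of the nibble-doubling step, to show why one SETQ is enough. It assumes MOVBYTS picks one source byte per 2-bit selector field, that MUXQ copies S bits into D wherever Q is 1, and that nothing between the two MUXQs changes Q. The function names are illustrative only.

```python
MASK32 = 0xFFFFFFFF
NIBBLEMASK = 0x0FF00FF0  # same value as nibblemask in the thread

def movbyts(d, sel):
    """Model of MOVBYTS: 2-bit field i of sel picks the source byte for result byte i."""
    out = 0
    for i in range(4):
        src = (sel >> (2 * i)) & 3
        out |= ((d >> (8 * src)) & 0xFF) << (8 * i)
    return out

def muxq(d, s, q):
    """Model of MUXQ: for each 1 bit in q, copy that bit from s into d."""
    return (d & ~q & MASK32) | (s & q)

def double_nibbles(src):
    """Return the two output longs (low half first) for one source long."""
    q = NIBBLEMASK  # one SETQ, issued after the RDLUT
    out = []
    for sel in (0b01010000, 0b11111010):  # %%1100, then %%3322
        x = movbyts(src, sel)             # duplicate the bytes
        pb = (x << 4) & MASK32            # copy shifted up one nibble
        out.append(muxq(x, pb, q))        # q is reused, no second SETQ
    return out

print([hex(v) for v in double_nibbles(0x1234ABCD)])
```

Selecting the upper bytes with %%3322 stands in for the GETWORD followed by %%1100 in the quoted code; both produce the same long.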

cgracey · 2019-10-11 18:24

I updated the v33 documentation to cover Q issues:

SETQ CONSIDERATIONS

The SETQ and SETQ2 instructions write to the Q register and are intended to precede a companion instruction. The value written to the Q register by SETQ/SETQ2 will persist until any of these events occur:

* XORO32 executes - Q is set to the XORO32 result.
* RDLUT executes - Q is set to the data read from the lookup RAM.
* GETXACC executes - Q is set to the Goertzel sine accumulator value.
* CRCNIB executes - Q gets shifted left by four bits.
* COGINIT/QDIV/QFRAC/QROTATE executes without a preceding SETQ instruction - Q is set to zero.

It is possible to retrieve the current Q value by the following sequence:

MOV	qval,#0			'reset qval
MUXQ	qval,##$FFFFFFFF	'for each '1' bit in Q, set the same bit in qval

SETQ/SETQ2 shields the next instruction from interruption to prevent an interrupt service routine from inadvertently altering Q before the intended instruction can utilize its value.
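
As a rough illustration of the Q read-back sequence, here is a small Python model (not PASM2). It assumes only what the documentation excerpt states: MUXQ copies bits from S into D wherever Q has a 1. The helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def muxq(d, s, q):
    """Model of MUXQ: for each 1 bit in q, copy that bit from s into d."""
    return (d & ~q & MASK32) | (s & q)

def read_q(q):
    """Mirror of the two-instruction sequence that recovers Q."""
    qval = 0                          # MOV  qval,#0
    return muxq(qval, MASK32, q)      # MUXQ qval,##$FFFFFFFF

print(hex(read_q(0x0FF00FF0)))

# Without the MOV, stale bits outside Q's 1 bits survive in qval:
print(hex(muxq(0xFFFF0000, MASK32, 0x000000FF)))
```

The second call shows why the destination is cleared first: MUXQ leaves D untouched wherever Q is 0.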

rogloh · 2019-10-11 21:51

TonyB_ wrote: »

Is the second setq nibblemask needed?

evanh wrote: »

Good point, a singular SETQ can be placed anywhere between the RDLUT and first MUXQ.

Had the same thought last night and will change the code. We're all on the same page it seems.

I'm glad this prompted figuring out what changes Q too. That could be a gotcha for people otherwise.

evanh · 2019-10-11 23:09

Thanks for the update Chip. That is a number of instructions alright. COGINIT is the only one not likely to cause confusion. The RDLUT case, is that because of Xbyte feature?

evanh · 2019-10-11 23:33

Hehe, this has only come to light because MUXQ uses the Q register as an ordinary operand, without any SETQ prefixing. Every other prior documented use of Q required SETQ as a prefixing instruction to trigger a modal change.

Peter Jakacki · 2019-10-12 00:13

@rogloh - thanks for all your hard work, I'm looking forward to being able to incorporate your video driver when I get my new P2D2 motherboards done with HDMI. I really hate the size of the HDMI connector though and I will look at alternatives such as mini HDMI or displayport since you only need a passive HDMI cable.

I haven't played with the new P2 instruction set enough, but I now know about movbyts and muxq.

I even created definitions for these instructions so I could interact with them like this:

TAQOZ# $12345678 := Q ---  ok
TAQOZ# Q .L --- $1234_5678 ok
TAQOZ# Q 0 MOVBYTS .L --- $7878_7878 ok
TAQOZ# Q %00011011 MOVBYTS .L --- $7856_3412 ok
TAQOZ# Q %01001110 MOVBYTS .L --- $5678_1234 ok

(I think it would be good to create a %% twit prefix for my numbers)
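
The TAQOZ session above can be cross-checked with a short Python model of MOVBYTS. This is a sketch under the assumption that each 2-bit field of the selector, counted from the LSB, names the source byte for the corresponding result byte; that reading is consistent with all three results shown.

```python
def movbyts(d, sel):
    """Model of MOVBYTS: 2-bit field i of sel picks the source byte for result byte i."""
    out = 0
    for i in range(4):
        src = (sel >> (2 * i)) & 3
        out |= ((d >> (8 * src)) & 0xFF) << (8 * i)
    return out

q = 0x12345678
for sel in (0, 0b00011011, 0b01001110):
    print(f"{sel:08b} -> {movbyts(q, sel):08X}")
```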

When I was getting my P2 video player to work I was working with 320x240 bitmap source frames and I wanted to double-up the pixels and lines for full screen. This is the kludgy and ugly hubexec/lut routine I did that is called for each line of source.

' 3ms per frame FULL-SCREEN upscaler -
' ( src scr --  src+640 scr )
DWIDTH			setq2	#32-1		' copy into LUT
			rdlong	lutfree,##DWLUT
			jmp	#lutfree+$200	' and run

DWLUT			setq2	#80-1
			rdlong	srcbuf,tos1	' read all of source
'
			mov	r3,#srcbuf	' set lut src
			mov	r1,#lmbuf	' set lut dst

			'rep	@.l0,#80	' (note - bug here as rep dest is 17 instead of 18)
			rep	#18,#80
			rdlut	acc,r3		' read 4 bytes of source
			add	r3,#1

			getbyte	X,acc,#0	' get byte
			setbyte	X,acc,#1	' double it up
			shr	acc,#8		' next byte
			setbyte	X,acc,#2	' set it
			setbyte	X,acc,#3	' double it up
			wrlut	X,r1		' save in lut
			shr	acc,#8
			add	r1,#1		' update dst ptr

			getbyte	X,acc,#0	' get byte
			setbyte	X,acc,#1	' double it up
			shr	acc,#8		' next byte
			setbyte	X,acc,#2	' set it
			setbyte	X,acc,#3	' double it up
			wrlut	X,r1		' save in lut
			shr	acc,#8
			add	r1,#1		' update dst ptr

.l0			setq2	#160-1		' write out result to screen dst
			wrlong	lmbuf,tos
			add	tos,##640	' double lines as well
			setq2	#160-1		' write out result to screen dst
			wrlong	lmbuf,tos
			add	tos,##640
			ret
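
For comparison, a Python sketch (not PASM2) of what the GETBYTE/SETBYTE/SHR sequence in the loop computes for one source long. It assumes SETBYTE writes the low byte of S into byte N of D. Under that assumption the two output longs match what MOVBYTS with %%1100 and %%3322 would give.

```python
MASK32 = 0xFFFFFFFF

def setbyte(d, s, n):
    """Model of SETBYTE: write the low byte of s into byte n of d."""
    shift = 8 * n
    return (d & ~(0xFF << shift) & MASK32) | ((s & 0xFF) << shift)

def double_bytes(acc):
    """Follow the loop body: two output longs per source long."""
    out = []
    for _ in range(2):
        x = acc & 0xFF            # getbyte X,acc,#0
        x = setbyte(x, acc, 1)    # double it up
        acc >>= 8                 # next byte
        x = setbyte(x, acc, 2)    # set it
        x = setbyte(x, acc, 3)    # double it up
        out.append(x)             # wrlut X,r1
        acc >>= 8
    return out

print([hex(v) for v in double_bytes(0x44332211)])
```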

rogloh · 2019-10-12 01:38

No worries Peter, I think it will be quite useful for people even if it limits you to situations where you operate the P2 at minimum rates of around 250MHz or higher. This first version is pure DVI initially over HDMI which will get us going faster, I'll have to see about incorporating any HDMI work later...

The bonus with the approach taken here is that this driver or future derivatives from it should be able to accept scanline data from other COGs and display them, as well as having its own text rendering and graphics mode capabilities built in. So you could simply spawn external sprite or other video streaming cog(s) each rendering into their scanline buffers and this driver would be able to display it and also pixel double it. I think the scanline data could potentially indicate its own colour mode too. I have ideas of adding a region list to the inbuilt line sequencing too so some groups of scanlines could be text, others graphics, etc, all in the same frame which allows split screens and other fun things. If the "mode" can be changed dynamically per scanline that allows lots of possibilities and if some screen data could come from hub and some from HyperRAM that could be rather good too. That would obviously require further work and probably dynamically loadable video COG code instead of running fully self contained as it does right now, but it should be achievable.

With respect to your comments on MOVBYTS and in your Forth, it is nice you have discovered this beauty too and it's becoming one of my favorites for graphics (I really like that incmod too). I could see that having instructions like this presented in a nice interactive Forth environment would have been a good way for me to experiment around and check the MERGEB/SPLITB/MERGEW/SPLITW behaviour to see what combinations could shuffle bits in interesting ways. Are those in there too? Doing it on paper does your head in and I gave up after some time, yet I was tempted to write some code environment to play around with these (is there any interactive P2 simulator anywhere yet?). I still reckon there is probably some clever way to replicate nibbles using those new P2 bit shuffle instructions combined with other instructions, however whether it can be done in less than 4 instructions (which is the benchmark now) I don't know.

rogloh · 2019-10-12 03:06

I've looked at what AJL said earlier and copied the optimized current code sequences for doubling pixels to a spreadsheet. The doubling of longs was not included here (though it could be too) because I will probably treat it as a special case anyway for performance reasons.

Good thing is the biggest skip jump seems to be 6 instructions so it should be pretty fast and unwanted instructions can be skipped instead of cancelled I think. I think these sequences will allow the skipf compression, though ideally it's better if the skipf works from outside a rep loop. I need to check that. Otherwise it would need to also be in the loop which will add overhead of 2 clocks per iteration which I don't like.

One bonus I see here is that needing only 22 bits of skip information is handy because it leaves over 9 bits in a per mode skip table to hold a jump pointer.

A spreadsheet is the best tool I found for doing this. Each instruction can be dragged around in its cell, and rows can be added or moved such that each row contains the same type of instruction. Now the skip patterns can easily be figured out...

Make sure your browser screen is wide enough, or reduce its font, or copy it out to a file to be able to see this properly column aligned.

    doublebits              doublenits              doublenibbles           doublebytes             doublewords
    ---------------------   ---------------------   ---------------------   ---------------------   ---------------------
1   rep     #8, readburst   rep     #11, readburst  rep     #13, readburst  rep     #6, readburst   rep     #6, readburst
2   rdlut   a, ptra++       rdlut   a, ptra++       rdlut   a, ptra++       rdlut   a, ptra++       rdlut   a, ptra++
3                                                   setq    nibblemask                              
4                           getword b, a, #1                                                        
5                           getword a, a, #0                                                        
6                           mergew  a                                                               
7                           mul     a, #3                                                           
8   mov     b, a                                    mov     b, a            mov     b, a            mov     b, a
9   movbyts a, #%%1010                              movbyts a, #%%1100      movbyts a, #%%1100      movbyts a, #%%1010
10                                                  mov     pb, a                                   
11  mergew  a               mergew  a                                                               
12                                                  shl     pb, #4                                  
13                                                  muxq    a, pb                                   
14  wrlut   a, ptrb++       wrlut   a, ptrb++       wrlut   a, ptrb++       wrlut   a, ptrb++       wrlut   a, ptrb++
15  movbyts b, #%%3232                              movbyts b, #%%3322      movbyts b, #%%3322      movbyts b, #%%3232
16  mergew  b               mergew  b                                                               
17                          mul     b, #3                                                           
18                          mergew  b                                                               
19                                                  mov     pb, b                                   
20                                                  shl     pb, #4                                  
21                                                  muxq    b, pb                                   
22  wrlut   b, ptrb++       wrlut   b, ptrb++       wrlut   b, ptrb++       wrlut   b, ptrb++       wrlut   b, ptrb++
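
The skip patterns can also be derived mechanically from the table. The Python sketch below is one way to do it, assuming skip bit 0 corresponds to row 2 (the first instruction after the REP) and that a 1 bit means "skip this row". The row sets are read straight off the columns above; the function names are hypothetical.

```python
FIRST_ROW, LAST_ROW = 2, 22   # row 1 (REP) is assumed to sit outside the skipped run

# Rows that hold an instruction in each column of the table.
USED_ROWS = {
    "doublebits":    {2, 8, 9, 11, 14, 15, 16, 22},
    "doublenits":    {2, 4, 5, 6, 7, 11, 14, 16, 17, 18, 22},
    "doublenibbles": {2, 3, 8, 9, 10, 12, 13, 14, 15, 19, 20, 21, 22},
    "doublebytes":   {2, 8, 9, 14, 15, 22},
    "doublewords":   {2, 8, 9, 14, 15, 22},
}

def skip_mask(used):
    """Set bit (row - FIRST_ROW) for every row the mode leaves empty."""
    mask = 0
    for row in range(FIRST_ROW, LAST_ROW + 1):
        if row not in used:
            mask |= 1 << (row - FIRST_ROW)
    return mask

def longest_run(mask):
    """Length of the longest run of consecutive skipped instructions."""
    best = run = 0
    while mask:
        run = run + 1 if mask & 1 else 0
        best = max(best, run)
        mask >>= 1
    return best

for name, used in USED_ROWS.items():
    m = skip_mask(used)
    print(f"{name:14s} rep #{len(used):2d}  skip %{m:020b}  longest skip {longest_run(m)}")
```

The instruction count per column is just the number of occupied rows, which matches the REP lengths in row 1.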

cgracey · 2019-10-12 03:16

evanh wrote: »

Thanks for the update Chip. That is a number of instructions alright. COGINIT is the only one not likely to cause confusion. The RDLUT case, is that because of Xbyte feature?

RDLUT takes three clocks. I couldn't do it in two because of compounded LUT-access and result-mux'ing delays. So, it writes the read value into Q and there is already conduit in place to feed Q to the result quickly.

XBYTE doesn't affect Q, since it mux's the LUT read value directly into D for the hidden EXECF instruction that XBYTE does.

evanh · 2019-10-12 03:34

cgracey wrote: »

XBYTE doesn't affect Q, since it mux's the LUT read value directly into D for the hidden EXECF instruction that XBYTE does.

I'd guessed wrong there. Good extra info though ... now I think about it, if Xbyte used Q then that would mess with the bytecodes it's executing.

cgracey · 2019-10-12 09:44

I made some graphics tonight for the bit-twiddling instructions.

rogloh · 2019-10-12 11:09

Very nice to be able to see it visually.

The Seuss one is certainly a weird looking thing. What applications would it typically be used for? Pseudo random sequences perhaps, though I guess it would become cyclic after only 32 operations. Some type of noise generator where the actual value of the long changes the delay until the next oscillator pulse..?

Peter Jakacki · 2019-10-12 12:06

Yes, I need to get my head around this too, so I created ASM definitions for these functions and put them in a walking ones loop and printed the results for each operation.

TAQOZ# DEMO --- 
        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._...._...._...1
SPLITB %...._...._...._...._...._...._...._...1
MERGEB %...._...._...._...._...._...._...._...1
SPLITW %...._...._...._...._...._...._...._...1
MERGEW %...._...._...._...._...._...._...._...1
SEUSSF %..11_.1.1_.1.._11.1_1.1._.11._.1.1_...1
SEUSSR %111._1.1._.1.1_.1.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._...._...._..1.
SPLITB %...._...._...._...._...._...1_...._....
MERGEB %...._...._...._...._...._...._...1_....
SPLITW %...._...._...._...1_...._...._...._....
MERGEW %...._...._...._...._...._...._...._.1..
SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.111_...1
SEUSSR %111._1.11_.111_.1.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._...._...._.1..
SPLITB %...._...._...._...1_...._...._...._....
MERGEB %...._...._...._...._...._...1_...._....
SPLITW %...._...._...._...._...._...._...._..1.
MERGEW %...._...._...._...._...._...._...1_....
SEUSSF %..11_.1.1_.1.._1..1_1.1._111._.1.1_...1
SEUSSR %111._1.11_.1.1_11.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._...._...._1...
SPLITB %...._...1_...._...._...._...._...._....
MERGEB %...._...._...._...._...1_...._...._....
SPLITW %...._...._...._..1._...._...._...._....
MERGEW %...._...._...._...._...._...._.1.._....
SEUSSF %..11_.1.._.1.._11.1_1.1._111._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_...1_..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._...._...1_....
SPLITB %...._...._...._...._...._...._...._..1.
MERGEB %...._...._...._...1_...._...._...._....
SPLITW %...._...._...._...._...._...._...._.1..
MERGEW %...._...._...._...._...._...1_...._....
SEUSSF %..11_11.1_.1.._11.1_1.1._111._.1.1_...1
SEUSSR %11.._1.11_.1.1_.1.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._...._..1._....
SPLITB %...._...._...._...._...._..1._...._....
MERGEB %...._...._...1_...._...._...._...._....
SPLITW %...._...._...._.1.._...._...._...._....
MERGEW %...._...._...._...._...._.1.._...._....
SEUSSF %..11_.1.1_.1.._.1.1_1.1._111._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_...._..11_..1._1111

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._...._.1.._....
SPLITB %...._...._...._..1._...._...._...._....
MERGEB %...._...1_...._...._...._...._...._....
SPLITW %...._...._...._...._...._...._...._1...
MERGEW %...._...._...._...._...1_...._...._....
SEUSSF %..11_.1.1_.1.1_11.1_1.1._111._.1.1_...1
SEUSSR %1.1._1.11_.1.1_.1.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._...._1..._....
SPLITB %...._..1._...._...._...._...._...._....
MERGEB %...1_...._...._...._...._...._...._....
SPLITW %...._...._...._1..._...._...._...._....
MERGEW %...._...._...._...._.1.._...._...._....
SEUSSF %.111_.1.1_.1.._11.1_1.1._111._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_.1.._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._...1_...._....
SPLITB %...._...._...._...._...._...._...._.1..
MERGEB %...._...._...._...._...._...._...._..1.
SPLITW %...._...._...._...._...._...._...1_....
MERGEW %...._...._...._...1_...._...._...._....
SEUSSF %..1._.1.1_.1.._11.1_1.1._111._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_..1._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._..1._...._....
SPLITB %...._...._...._...._...._.1.._...._....
MERGEB %...._...._...._...._...._...._..1._....
SPLITW %...._...._...1_...._...._...._...._....
MERGEW %...._...._...._.1.._...._...._...._....
SEUSSF %..11_...1_.1.._11.1_1.1._111._.1.1_...1
SEUSSR %111._1.11_...1_.1.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._.1.._...._....
SPLITB %...._...._...._.1.._...._...._...._....
MERGEB %...._...._...._...._...._..1._...._....
SPLITW %...._...._...._...._...._...._..1._....
MERGEW %...._...._...1_...._...._...._...._....
SEUSSF %..11_.1.1_.11._11.1_1.1._111._.1.1_...1
SEUSSR %111._..11_.1.1_.1.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...._1..._...._....
SPLITB %...._.1.._...._...._...._...._...._....
MERGEB %...._...._...._...._..1._...._...._....
SPLITW %...._...._..1._...._...._...._...._....
MERGEW %...._...._.1.._...._...._...._...._....
SEUSSF %..11_.111_.1.._11.1_1.1._111._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_...._..11_..1._11..

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._...1_...._...._....
SPLITB %...._...._...._...._...._...._...._1...
MERGEB %...._...._...._..1._...._...._...._....
SPLITW %...._...._...._...._...._...._.1.._....
MERGEW %...._...1_...._...._...._...._...._....
SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.1.1_1..1
SEUSSR %111._1.11_.1.1_.111_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._..1._...._...._....
SPLITB %...._...._...._...._...._1..._...._....
MERGEB %...._...._..1._...._...._...._...._....
SPLITW %...._...._.1.._...._...._...._...._....
MERGEW %...._.1.._...._...._...._...._...._....
SEUSSF %..11_.1.1_.1.._11.1_1.1._1111_.1.1_...1
SEUSSR %111._1.11_.1.1_.1.._...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._.1.._...._...._....
SPLITB %...._...._...._1..._...._...._...._....
MERGEB %...._..1._...._...._...._...._...._....
SPLITW %...._...._...._...._...._...._1..._....
MERGEW %...1_...._...._...._...._...._...._....
SEUSSF %..11_.1.1_.1.._11.1_1.1._111._11.1_...1
SEUSSR %1111_1.11_.1.1_.1.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...._1..._...._...._....
SPLITB %...._1..._...._...._...._...._...._....
MERGEB %..1._...._...._...._...._...._...._....
SPLITW %...._...._1..._...._...._...._...._....
MERGEW %.1.._...._...._...._...._...._...._....
SEUSSF %..11_.1.1_11.._11.1_1.1._111._.1.1_...1
SEUSSR %111._1.11_.1.._.1.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._...1_...._...._...._....
SPLITB %...._...._...._...._...._...._...1_....
MERGEB %...._...._...._...._...._...._...._.1..
SPLITW %...._...._...._...._...._...1_...._....
MERGEW %...._...._...._...._...._...._...._..1.
SEUSSF %..11_.1.1_.1.._11.1_1..._111._.1.1_...1
SEUSSR %111._1.11_.1.1_...1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._..1._...._...._...._....
SPLITB %...._...._...._...._...1_...._...._....
MERGEB %...._...._...._...._...._...._.1.._....
SPLITW %...._...1_...._...._...._...._...._....
MERGEW %...._...._...._...._...._...._...._1...
SEUSSF %..11_.1.1_.1.._11.1_1.11_111._.1.1_...1
SEUSSR %111._1111_.1.1_.1.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._.1.._...._...._...._....
SPLITB %...._...._...1_...._...._...._...._....
MERGEB %...._...._...._...._...._.1.._...._....
SPLITW %...._...._...._...._...._..1._...._....
MERGEW %...._...._...._...._...._...._..1._....
SEUSSF %..11_.1.1_.1.._11.._1.1._111._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_...._..11_..1._1..1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...._1..._...._...._...._....
SPLITB %...1_...._...._...._...._...._...._....
MERGEB %...._...._...._...._.1.._...._...._....
SPLITW %...._..1._...._...._...._...._...._....
MERGEW %...._...._...._...._...._...._1..._....
SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.1.1_.1.1
SEUSSR %111._1.11_.1.1_.1.1_...._..11_...._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._...1_...._...._...._...._....
SPLITB %...._...._...._...._...._...._..1._....
MERGEB %...._...._...._.1.._...._...._...._....
SPLITW %...._...._...._...._...._.1.._...._....
MERGEW %...._...._...._...._...._..1._...._....
SEUSSF %..11_.1.1_.1.._11.1_..1._111._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_...._..11_.11._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._..1._...._...._...._...._....
SPLITB %...._...._...._...._..1._...._...._....
MERGEB %...._...._.1.._...._...._...._...._....
SPLITW %...._.1.._...._...._...._...._...._....
MERGEW %...._...._...._...._...._1..._...._....
SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.1.1_..11
SEUSSR %111._1.11_.1.1_.1.1_...._.111_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._.1.._...._...._...._...._....
SPLITB %...._...._..1._...._...._...._...._....
MERGEB %...._.1.._...._...._...._...._...._....
SPLITW %...._...._...._...._...._1..._...._....
MERGEW %...._...._...._...._..1._...._...._....
SEUSSF %..11_.1.1_.1.._11.1_1.1._11.._.1.1_...1
SEUSSR %.11._1.11_.1.1_.1.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...._1..._...._...._...._...._....
SPLITB %..1._...._...._...._...._...._...._....
MERGEB %.1.._...._...._...._...._...._...._....
SPLITW %...._1..._...._...._...._...._...._....
MERGEW %...._...._...._...._1..._...._...._....
SEUSSF %1.11_.1.1_.1.._11.1_1.1._111._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_1..._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._...1_...._...._...._...._...._....
SPLITB %...._...._...._...._...._...._.1.._....
MERGEB %...._...._...._...._...._...._...._1...
SPLITW %...._...._...._...._...1_...._...._....
MERGEW %...._...._...._..1._...._...._...._....
SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.1.1_....
SEUSSR %111._1.11_.1.1_.1.1_...._..11_..1._.1.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._..1._...._...._...._...._...._....
SPLITB %...._...._...._...._.1.._...._...._....
MERGEB %...._...._...._...._...._...._1..._....
SPLITW %...1_...._...._...._...._...._...._....
MERGEW %...._...._...._1..._...._...._...._....
SEUSSF %...1_.1.1_.1.._11.1_1.1._111._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_...._1.11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._.1.._...._...._...._...._...._....
SPLITB %...._...._.1.._...._...._...._...._....
MERGEB %...._...._...._...._...._1..._...._....
SPLITW %...._...._...._...._..1._...._...._....
MERGEW %...._...._..1._...._...._...._...._....
SEUSSF %..11_.1.1_.1.._1111_1.1._111._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_...._...1_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...._1..._...._...._...._...._...._....
SPLITB %.1.._...._...._...._...._...._...._....
MERGEB %...._...._...._...._1..._...._...._....
SPLITW %..1._...._...._...._...._...._...._....
MERGEW %...._...._1..._...._...._...._...._....
SEUSSF %..11_.1.1_.1.._11.1_1.1._1.1._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_...._..11_..11_11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %...1_...._...._...._...._...._...._....
SPLITB %...._...._...._...._...._...._1..._....
MERGEB %...._...._...._1..._...._...._...._....
SPLITW %...._...._...._...._.1.._...._...._....
MERGEW %...._..1._...._...._...._...._...._....
SEUSSF %..11_.1.1_.1.._11.1_111._111._.1.1_...1
SEUSSR %111._1.11_.1.1_.1.1_...._..1._..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %..1._...._...._...._...._...._...._....
SPLITB %...._...._...._...._1..._...._...._....
MERGEB %...._...._1..._...._...._...._...._....
SPLITW %.1.._...._...._...._...._...._...._....
MERGEW %...._1..._...._...._...._...._...._....
SEUSSF %..11_.1.1_.1.._11.1_1.1._111._.1.._...1
SEUSSR %111._1..1_.1.1_.1.1_...._..11_..1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %.1.._...._...._...._...._...._...._....
SPLITB %...._...._1..._...._...._...._...._....
MERGEB %...._1..._...._...._...._...._...._....
SPLITW %...._...._...._...._1..._...._...._....
MERGEW %..1._...._...._...._...._...._...._....
SEUSSF %..11_.1.1_.1.._11.1_1.1._111._...1_...1
SEUSSR %111._1.11_.1.1_.1.1_...._..11_1.1._11.1

        1098_7654_3210_9876_5432_1098_7654_3210
D      %1..._...._...._...._...._...._...._....
SPLITB %1..._...._...._...._...._...._...._....
MERGEB %1..._...._...._...._...._...._...._....
SPLITW %1..._...._...._...._...._...._...._....
MERGEW %1..._...._...._...._...._...._...._....
SEUSSF %..11_.1.1_...._11.1_1.1._111._.1.1_...1
SEUSSR %111._1.11_11.1_.1.1_...._..11_..1._11.1
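
The four split/merge rows in this walking-ones dump fit simple index formulas. The Python sketch below models them as bit permutations. The formulas are inferred from the dump itself rather than taken from the instruction documentation, and SEUSSF/SEUSSR are not modelled.

```python
def permute(d, dest):
    """Move each set bit i of a 32-bit value to position dest(i)."""
    out = 0
    for i in range(32):
        if (d >> i) & 1:
            out |= 1 << dest(i)
    return out

def splitb(d): return permute(d, lambda i: (i % 4) * 8 + i // 4)
def mergeb(d): return permute(d, lambda i: (i % 8) * 4 + i // 8)
def splitw(d): return permute(d, lambda i: (i % 2) * 16 + i // 2)
def mergew(d): return permute(d, lambda i: (i % 16) * 2 + i // 16)

def fmt(x):
    """Format like the TAQOZ dump: dots for zeros, groups of four bits."""
    bits = format(x, "032b").replace("0", ".")
    return "%" + "_".join(bits[i:i + 4] for i in range(0, 32, 4))

d = 1 << 1
for name, fn in (("SPLITB", splitb), ("MERGEB", mergeb),
                 ("SPLITW", splitw), ("MERGEW", mergew)):
    print(f"{name} {fmt(fn(d))}")
```

In this model MERGEB undoes SPLITB and MERGEW undoes SPLITW, which agrees with the paired rows in the dump.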

rogloh · 2019-10-12 12:28

I was able to get the skipf based pixel doubling working now. Only saved 16 longs but it's something. Found you can't do a SKIPF outside the REP block (as I expected) but it worked when inside the REP loop. Right now the doubling of longs is included in it too but that's blown the budget out to more than half a scanline. Fine for pure DVI right now but ultimately I'd like to optimize further if possible. The 320x240 truecolor pixel doubling is a bit of a processor pig, and the 16bpp is pretty tight now too. The SKIPF adds a two clock penalty in each inner loop iteration, plus one more dummy instruction is needed to avoid the skip of over 7 instructions in 24bpp mode, making the inner loop total alone in that mode 11 clocks per iteration x 320 iterations = 3520 cycles (352 pixels), before outer loop overhead and 960 long hub transfers are included. I also need to get the outer loop overhead down by increasing the burst size in these two more cycle intense modes. This is doable if I use more of the LUT and trash the palette space, to get burst sizes larger than just the 40 longs at a time I used today, and the current outer loop penalty is in the order of 50 clocks x 8 transfers which is already 400 clocks of my ideal 3200 budget. Using up more LUT space for this transfer will work okay for these two modes because neither colour mode uses a palette anyway.
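
The budget arithmetic in this post, written out as a small Python check. The figure of 10 clocks per pixel is an assumption inferred from "3520 cycles (352 pixels)", and hub transfer time is left out because the post gives no number for it.

```python
CLOCKS_PER_PIXEL = 10        # assumption: inferred from 3520 cycles == 352 pixels

inner_clocks = 11 * 320      # 11 clocks per iteration x 320 iterations
outer_clocks = 50 * 8        # ~50 clocks of overhead x 8 bursts of 40 longs
budget = 3200                # the "ideal" budget quoted in the post

total = inner_clocks + outer_clocks
print("inner loop :", inner_clocks, "clocks =", inner_clocks // CLOCKS_PER_PIXEL, "pixels")
print("outer loop :", outer_clocks, "clocks")
print("over budget:", total - budget, "clocks (hub transfers not counted)")
```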

doublepixels
            rep     instcount, readburst
            skipf   pattern
            rdlut   a, ptra++
            setq    nibblemask
            getword b, a, #1
            getword a, a, #0
            mergew  a
            mul     a, #3
            mov     b, a
            movbyts a, bytemask1
            mov     pb, a
            mergew  a
            shl     pb, #4
            muxq    a, pb
            wrlut   a, ptrb++
            movbyts b, bytemask2
            mergew  b
            mul     b, #3
            mergew  b
            mov     pb, b
            shl     pb, #4
            muxq    b, pb
            wrlut   b, ptrb++
            ret

nibblemask  long    $0ff00ff0
instcount   long    0
pattern     long    0
bytemask1   long    %%1100
bytemask2   long    %%3322
'                   ____skip_pattern______inst
doublebits  long    %11111000110100111111_1001
doublenits  long    %11100010110111000010_1100
doublenibs  long    %00011100001000111100_1110
doublebytes long    %11111100111100111110_0111
doublewords long    %11111100111100111111_0111
doublelongs long    %11111110111110111110_0101

cgracey · 2019-10-12 17:03

So, the 320x240 RGB24 mode is the most tedious? The problem is just getting pixels to repeat, right?

Have you thought of just doing two XCONT's per pixel, using the same data?

cgracey · 2019-10-12 19:54

If I had known this was going to be an issue, I could have easily made the TMDS serializer repeat data until new data came.

TonyB_ · 2019-10-12 20:37

rogloh wrote: »

I was able to get the skipf based pixel doubling working now. Only saved 16 longs but it's something. Found you can't do a SKIPF outside the REP block (as I expected) but it worked when inside the REP loop.<snip>

Does the 2-bit doubler work? It seems too easy in your code. Also LSB should be 0 for all skip patterns?

Wuerfel_21 · 2019-10-12 20:57

Programming with SKIP(F) is so ugly...
The assembly language really needs to be extended to allow defining skip patterns alongside the actual code.

TonyB_ · 2019-10-12 21:03

Thanks, Chip. Could be worth

Wuerfel_21 wrote: »

Programming with SKIP(F) is so ugly...
The assembly language really needs to be extended to allow defining skip patterns alongside the actual code.

It might look ugly but it's really beautiful!

Wuerfel_21 · 2019-10-12 21:12

Yeah, it's beautiful in how small and elegant it can solve an issue, but writing (or worse, maintaining) skip code seems to be pretty nightmarish with the current tools.

jmg · 2019-10-12 21:14

Wuerfel_21 wrote: »

Programming with SKIP(F) is so ugly...
The assembly language really needs to be extended to allow defining skip patterns alongside the actual code.

That would be 'nice to have', but how to approach it ?
I did think of a spreadsheet column scheme, but that needs strict column align and is not clear to read, and is making P2 even more obtuse to new users....

More work in the Assembler SW, but easier for users, would be to allow a CASE style syntax where the code is written multiple times, and then packed using SKIP,
The ASM list file could report the final SKIP packing in a column-tagged form, of compressed to a single code block.

That would be easy to work on, and easy to check if packed as users hoped.
Errors would generate when SKIP was exceeded, or SKIP illegal constructs were used.

Wuerfel_21 · 2019-10-12 21:26

Yes, I thought of a column scheme, too. Not ideal, but very significantly easier to write and read than magic binary literals somewhere else.

But the CASE idea is pretty neat, too.
It wouldn't even be super difficult to implement (based on my experience writing my own P1 XMM Ruby DSL COUGH I mean macro assembler)

rogloh · 2019-10-13 01:49

cgracey wrote: »

So, the 320x240 RGB24 mode is the most tedious? The problem is just getting pixels to repeat, right?

Have you thought of just doing two XCONT's per pixel, using the same data?

That might work, but then I think I would need to stick around and do 320 XCONT instructions during the time I wish to process the next scanline, and that steals cycles from me doing that. Maybe a regular interrupt routine could do it in the background, but I am also using REP blocks. I prefer to just send a single XCONT to the streamer to buy a nice big block of time (~640 pixels' worth) to be able to go off and process a bunch of things for a while in relative peace.

TonyB_ wrote: »

Does the 2-bit doubler work? It seems too easy in your code. Also LSB should be 0 for all skip patterns?

You've got a good eye. Yes, I make use of the skip pattern's LSB for an additional purpose; this setup code was not shown above. I test and clear it, and this indicates whether the two byte mask values used need to be xor'd or not. These masks also vary per colour depth, as do the repeat block length and skip patterns. Here's that part. "a" below is a pointer to the double table entry I've computed earlier per mode.

            sets    lookup, a
            mov     bytemask1, #%%1100      'setup default masks
            mov     bytemask2, #%%3322
lookup      mov     pattern, 0-0            'get the long data for this bpp
            getnib  instcount, pattern, #0  'get the number of rep instructions
            shr     pattern, #4             'move down to next data
            bitl    pattern, #0 wcz         'test and clear LSB of pattern
    if_c    xor     bytemask1, #%00010100   'use it to flip the masks
    if_c    xor     bytemask2, #%00010100
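
As a cross-check of the unpacking above, here is a small Python model of it. This is only a sketch, not PASM2 and not part of the driver: the function name `unpack_entry` and the constant names are made up for illustration, and it assumes the table entries are laid out exactly as the setup code reads them (low nibble = REP instruction count, remaining bits = skip pattern, with the pattern's LSB doubling as the "flip the masks" flag). The `%%` values are base-4 digits, so `%%1100` is `%01010000`.

```python
# Table entries copied from the doubling table quoted in this thread.
DOUBLENIBS  = 0b00011100001000111100_1110
DOUBLEWORDS = 0b11111100111100111111_0111

MASK1_DEFAULT = 0b01_01_00_00   # %%1100: byte1, byte1, byte0, byte0
MASK2_DEFAULT = 0b11_11_10_10   # %%3322: byte3, byte3, byte2, byte2
FLIP          = 0b00010100      # the value xor'd into both masks


def unpack_entry(entry):
    """Model of the setup code: returns (instcount, pattern, mask1, mask2)."""
    instcount = entry & 0xF          # getnib instcount, pattern, #0
    pattern = entry >> 4             # shr pattern, #4
    flag = pattern & 1               # bitl pattern, #0 wcz (C = old bit 0)
    pattern &= ~1                    # ...and the bit is cleared
    mask1, mask2 = MASK1_DEFAULT, MASK2_DEFAULT
    if flag:                         # if_c xor bytemaskN, #%00010100
        mask1 ^= FLIP
        mask2 ^= FLIP
    return instcount, pattern, mask1, mask2


if __name__ == "__main__":
    for name, entry in (("doublenibs", DOUBLENIBS), ("doublewords", DOUBLEWORDS)):
        count, pattern, m1, m2 = unpack_entry(entry)
        print(f"{name}: rep count {count}, pattern {pattern:020b}, "
              f"masks {m1:08b} {m2:08b}")
```

In this model only the doublewords entry has the flag set, and the xor turns `%%1100`/`%%3322` into `%%1010`/`%%3232`, which is the byte order that repeats a whole word rather than a single byte.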

The two bit doubler appears to double the pixel size on the screen but looking again just now I am noticing that the colour palette is doing something strange. It might not be correct and may be repeating the same bit instead of an actual pair, I will check into it...

SaucySoliton · 2019-10-13 02:52

Maybe the streamer could help with doubling RGB32. Create a table of bytes in hub ram [0,0,1,1,2,2,...,254,254,255,255] and rdfast that address. Put the pixel data in the LUT. Start the streamer in 8 bit LUT mode.

A minor problem is we can't get all the way across the screen on 1 xcont. After 256 input pixels we'd need to do another xcont with a different LUT offset to get the rest. Or we could split the screen width in half. I don't think it's even that bad to have to use 2 xconts to get the full screen width, as we can load one part while the streamer displays the other part.
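
To make the index-table idea concrete, here is a rough Python model of it. It is a sketch under assumptions, not streamer code: `index_table` and `stream_lut` are hypothetical helpers, the "streamer" is reduced to a plain table lookup with a base offset, and no timing, RDFAST setup or XCONT command encoding is modelled. It only shows that a [0,0,1,1,...] byte table plus two passes (LUT offset 0, then $100) yields each of 320 pixels twice.

```python
def index_table(pixels):
    """Byte table [0,0,1,1,...] that makes each LUT entry come out twice."""
    return bytes(i // 2 for i in range(2 * pixels))


def stream_lut(lut, base, indices):
    """Stand-in for an 8-bit LUT-mode transfer: one LUT read per index byte."""
    return [lut[base + i] for i in indices]


def doubled_line(line):
    """Double a scanline of up to 512 pixels using at most two passes."""
    lut = list(line) + [0] * (512 - len(line))   # pixel data placed in the LUT
    first = min(len(line), 256)
    out = stream_lut(lut, 0x000, index_table(first))
    if len(line) > 256:                          # second pass, new LUT offset
        out += stream_lut(lut, 0x100, index_table(len(line) - 256))
    return out


if __name__ == "__main__":
    line = [0x100 * n for n in range(320)]       # dummy pixels, low byte zero
    out = doubled_line(line)
    print(len(out), out[:4])
```

The same index table can serve both passes, since the second pass only needs its first 128 bytes.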

rogloh · 2019-10-13 03:15

It's a very interesting idea Saucy! I think we could certainly load up all 320 into the LUT and switch palette base address at the 256 pixel mark from 0 to $100 with a second xcont instruction. I was trying to keep the first 64 entries in the upper LUTRAM half available anyway for general transfer use for other reasons so this might have merit. Certainly if the entire screen is in the same mode I could see it may work.

In the future, if each scanline can run in a different mode and uses LUTRAM in different ways, there could be problems in sharing it and setting it up for the next line while it is currently in use.

I guess the mouse sprite rendering would also have to be done differently in that case, as it would need to render into LUT instead of HUB, as I was planning.

One thing I've found with the LUT is that you need to keep the lower byte 0, otherwise the TMDS output can become corrupted. I think this is because some bit(s) of the lower byte indicate to issue a direct 10b symbol instead of translating RGB video symbols to 10b.

cgracey · 2019-10-13 03:52

If bit1 is high in the HDMI data, it goes into literal mode, using the 30 bits above bit 1 to make three 10-bit fields. Bit 0 is normally used in VGA mode for HSYNC. I have not tried it, but it should be possible to output both HDMI and VGA simultaneously from the same streamer.

rogloh · 2019-10-13 04:28

TonyB_ wrote: »

Does the 2-bit doubler work? It seems too easy in your code. Also LSB should be 0 for all skip patterns?

LOL, I don't know what I was dreaming coming up with that doubling sequence, it's broken for sure. Even the multiply by 3 doesn't work if it's using a 16x16 multiplier.

I have looked further at this with Chip's diagrams and I am thinking I finally might have come across a simple sequence for the 2-bit doubling.

32 bit initial input is:     XXXXXXXX_XXXXXXXX_abcdefgh_ijklmnop 

SPLITW  a          ' result  XXXXXXXX_acegikmo_XXXXXXXX_bdfhjlnp  
MOVBYTS a, #%%2020 ' result  acegikmo_bdfhjlnp_acegikmo_bdfhjlnp
MERGEB  a          ' result  ababcdcd_efefghgh_ijijklkl_mnmnopop

Simply repeat for doing the upper 16 bits, just change the MOVBYTS value accordingly.
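
The sequence above can be checked on paper with a small Python model. This is a sketch, not a statement of the hardware: `splitw`, `movbyts` and `mergeb` below follow my reading of the P2 instruction descriptions (SPLITW puts odd bits in the upper word and even bits in the lower word, MOVBYTS picks bytes using 2-bit selectors, MERGEB interleaves the bits of the four bytes), and `double_2bit` is a made-up reference function. The `%%3131` value for the upper 16 bits is what this model suggests, not something confirmed on real silicon.

```python
def splitw(d):
    """SPLITW model: {odd bits 31..1, even bits 30..0}."""
    odd = even = 0
    for i in range(16):
        even |= ((d >> (2 * i)) & 1) << i
        odd |= ((d >> (2 * i + 1)) & 1) << i
    return (odd << 16) | even


def movbyts(d, s):
    """MOVBYTS model: result byte k = source byte selected by s[2k+1:2k]."""
    out = 0
    for k in range(4):
        sel = (s >> (2 * k)) & 3
        out |= ((d >> (8 * sel)) & 0xFF) << (8 * k)
    return out


def mergeb(d):
    """MERGEB model: result bit 4*i+j = source bit i of byte j."""
    out = 0
    for i in range(8):
        for j in range(4):
            out |= ((d >> (8 * j + i)) & 1) << (4 * i + j)
    return out


def double_2bit(word16):
    """Reference: repeat each of the eight 2-bit pixels in a 16-bit word."""
    out = 0
    for p in range(8):
        pix = (word16 >> (2 * p)) & 3
        out |= (pix | (pix << 2)) << (4 * p)
    return out


LOW_SEL = 0b10_00_10_00    # %%2020, as in the sequence above
HIGH_SEL = 0b11_01_11_01   # %%3131, assumed value for the upper 16 bits


def double_low(x):
    return mergeb(movbyts(splitw(x), LOW_SEL))


def double_high(x):
    return mergeb(movbyts(splitw(x), HIGH_SEL))


if __name__ == "__main__":
    x = 0x1B2D_E487
    print(f"{double_low(x):032b}")
    print(f"{double_high(x):032b}")
```

Comparing `double_low(x)` against `double_2bit(x & $FFFF)` for a handful of inputs is a quick way to catch a slipped selector before trying it on hardware.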

I think this is right, or have I messed up again...nothing surprises me with this shuffling thing, it's been doing my head in. I'll test it out.

There could be something similar for replicating nibbles too. I'm almost convinced of it.
