Beware that interrupt routines may disrupt Q's value. So, it's safest to always do a fresh SETQ.
That's not much of an argument for an object that is running the whole cog - without interrupts. And debuggers can preserve Q using MUXQ and SETQ. And, the optimised, but buggy, example above actually is protected from interrupts anyway. The SETQ is followed immediately by a REP.
I updated the v33 documentation to cover Q issues:
SETQ CONSIDERATIONS
The SETQ and SETQ2 instructions write to the Q register and are intended to precede a companion instruction. The value written to the Q register by SETQ/SETQ2 will persist until any of these events occur:
* XORO32 executes - Q is set to the XORO32 result.
* RDLUT executes - Q is set to the data read from the lookup RAM.
* GETXACC executes - Q is set to the Goertzel sine accumulator value.
* CRCNIB executes - Q gets shifted left by four bits.
* COGINIT/QDIV/QFRAC/QROTATE executes without a preceding SETQ instruction - Q is set to zero.
It is possible to retrieve the current Q value by the following sequence:
MOV qval,#0 'reset qval
MUXQ qval,##$FFFFFFFF 'for each '1' bit in Q, set the same bit in qval
SETQ/SETQ2 shields the next instruction from interruption to prevent an interrupt service routine from inadvertently altering Q before the intended instruction can utilize its value.
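For anyone wanting to sanity-check the Q read-back trick on a PC, here is a minimal Python sketch. It assumes MUXQ behaves as described above (for each '1' bit in Q, the matching bit of S is copied into D); the function names `muxq` and `read_q` are made up for illustration and this is only a model, not the hardware.

```python
MASK32 = 0xFFFFFFFF

def muxq(d, s, q):
    """Model of MUXQ D,S: for each '1' bit in Q, copy that bit of S into D."""
    return ((d & ~q) | (s & q)) & MASK32

def read_q(q):
    """Model of:  MOV qval,#0  /  MUXQ qval,##$FFFFFFFF"""
    qval = 0                       # MOV qval,#0
    return muxq(qval, MASK32, q)   # every '1' bit of Q sets the same bit in qval

if __name__ == "__main__":
    print(hex(read_q(0x0FF00FF0)))
```

Because S is all ones and D starts at zero, the result can only have bits set where Q has them, which is why the two-instruction sequence recovers Q without a preceding SETQ.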
Thanks for the update Chip. That is a number of instructions alright. COGINIT is the only one not likely to cause confusion. The RDLUT case, is that because of Xbyte feature?
Hehe, this has only come to light because MUXQ uses the Q register as an ordinary operand, without any SETQ prefixing. Every other prior documented use of Q required SETQ as a prefixing instruction to trigger a modal change.
@rogloh - thanks for all your hard work, I'm looking forward to being able to incorporate your video driver when I get my new P2D2 motherboards done with HDMI. I really hate the size of the HDMI connector though and I will look at alternatives such as mini HDMI or displayport since you only need a passive HDMI cable.
I haven't played with the new P2 instruction set much, but now that I know about movbyts and muxq I even created definitions for these instructions so I could interact with them like this:
TAQOZ# $12345678 := Q --- ok
TAQOZ# Q .L --- $1234_5678 ok
TAQOZ# Q 0 MOVBYTS .L --- $7878_7878 ok
TAQOZ# Q %00011011 MOVBYTS .L --- $7856_3412 ok
TAQOZ# Q %01001110 MOVBYTS .L --- $5678_1234 ok
(I think it would be good to create a %% twit prefix for my numbers)
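The same experiment can be roughly reproduced off-chip. The sketch below is a Python model of MOVBYTS, assuming each 2-bit field of S selects the source byte for one result byte (field 0 picks result byte 0, field 3 picks result byte 3). The helper name `movbyts` is just for illustration.

```python
def movbyts(d, s):
    """Model of MOVBYTS D,S: result byte i is D's byte selected by S[2i+1:2i]."""
    result = 0
    for i in range(4):
        sel = (s >> (2 * i)) & 3          # which source byte to use
        byte = (d >> (8 * sel)) & 0xFF    # fetch that byte from D
        result |= byte << (8 * i)         # place it in result byte i
    return result

if __name__ == "__main__":
    q = 0x12345678
    for s in (0, 0b00011011, 0b01001110):
        print(f"{s:08b} -> ${movbyts(q, s):08X}")
```

With this model the three selector values used in the TAQOZ session give the same results as the session shows.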
When I was getting my P2 video player to work I was working with 320x240 bitmap source frames and I wanted to double-up the pixels and lines for full screen. This is the kludgy and ugly hubexec/lut routine I did that is called for each line of source.
' 3ms per frame FULL-SCREEN upscaler -
' ( src scr -- src+640 scr )
DWIDTH setq2 #32-1 ' copy into LUT
rdlong lutfree,##DWLUT
jmp #lutfree+$200 ' and run
DWLUT setq2 #80-1
rdlong srcbuf,tos1 ' read all of source
'
mov r3,#srcbuf ' set lut src
mov r1,#lmbuf ' set lut dst
'rep @.l0,#80 ' (note - bug here as rep dest is 17 instead of 18)
rep #18,#80
rdlut acc,r3 ' read 4 bytes of source
add r3,#1
getbyte X,acc,#0 ' get byte
setbyte X,acc,#1 ' double it up
shr acc,#8 ' next byte
setbyte X,acc,#2 ' set it
setbyte X,acc,#3 ' double it up
wrlut X,r1 ' save in lut
shr acc,#8
add r1,#1 ' update dst ptr
getbyte X,acc,#0 ' get byte
setbyte X,acc,#1 ' double it up
shr acc,#8 ' next byte
setbyte X,acc,#2 ' set it
setbyte X,acc,#3 ' double it up
wrlut X,r1 ' save in lut
shr acc,#8
add r1,#1 ' update dst ptr
.l0 setq2 #160-1 ' write out result to screen dst
wrlong lmbuf,tos
add tos,##640 ' double lines as well
setq2 #160-1 ' write the same result out again for the doubled line
wrlong lmbuf,tos
add tos,##640
ret
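What the inner loop above does to the data can be modelled in a few lines of Python. This is only a sketch of the byte shuffling, assuming 8bpp pixels packed four to a long with pixel 0 in the low byte; `double_bytes` and `double_line` are hypothetical names, and none of the LUT or hub transfers are modelled.

```python
def double_bytes(src_long):
    """Turn one long of four 8-bit pixels into two longs with each pixel doubled."""
    b = [(src_long >> (8 * i)) & 0xFF for i in range(4)]
    lo = b[0] | (b[0] << 8) | (b[1] << 16) | (b[1] << 24)   # pixels 0,0,1,1
    hi = b[2] | (b[2] << 8) | (b[3] << 16) | (b[3] << 24)   # pixels 2,2,3,3
    return lo, hi

def double_line(src_longs):
    """80 source longs (320 pixels) become 160 output longs (640 pixels)."""
    out = []
    for value in src_longs:
        out.extend(double_bytes(value))
    return out

if __name__ == "__main__":
    lo, hi = double_bytes(0x44332211)
    print(f"${lo:08X} ${hi:08X}")
```

The routine then writes the 160-long result to the screen twice, 640 bytes apart, which is what doubles the lines.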
No worries Peter, I think it will be quite useful for people even if it limits you to situations where you operate the P2 at minimum rates of around 250MHz or higher. This first version is pure DVI initially over HDMI which will get us going faster, I'll have to see about incorporating any HDMI work later...
The bonus with the approach taken here is that this driver or future derivatives from it should be able to accept scanline data from other COGs and display them, as well as having its own text rendering and graphics mode capabilities built in. So you could simply spawn external sprite or other video streaming cog(s) each rendering into their scanline buffers and this driver would be able to display it and also pixel double it. I think the scanline data could potentially indicate its own colour mode too. I have ideas of adding a region list to the inbuilt line sequencing too so some groups of scanlines could be text, others graphics, etc, all in the same frame which allows split screens and other fun things. If the "mode" can be changed dynamically per scanline that allows lots of possibilities and if some screen data could come from hub and some from HyperRAM that could be rather good too. That would obviously require further work and probably dynamically loadable video COG code instead of running fully self contained as it does right now, but it should be achievable.
With respect to your comments on MOVBYTS and in your Forth, it is nice you have discovered this beauty too and it's becoming one of my favorites for graphics (I really like that incmod too). I could see that having instructions like this presented in a nice interactive Forth environment would have been a good way for me to experiment around and check the MERGEB/SPLITB/MERGEW/SPLITW behaviour to see what combinations could shuffle bits in interesting ways. Are those in there too? Doing it on paper does your head in and I gave up after some time, yet I was tempted to write some code environment to play around with these (is there any interactive P2 simulator anywhere yet?). I still reckon there is probably some clever way to replicate nibbles using those new P2 bit shuffle instructions combined with other instructions, however whether it can be done in fewer than 4 instructions (which is the benchmark now) I don't know.
I've looked at what AJL said earlier and copied the optimized current code sequences for doubling pixels to a spreadsheet. The doubling of longs was not included here (though it could be too) because I will probably treat that as a special case anyway for performance reasons.
Good thing is the biggest skip jump seems to be 6 instructions so it should be pretty fast and unwanted instructions can be skipped instead of cancelled I think. I think these sequences will allow the skipf compression, though ideally it's better if the skipf works from outside a rep loop. I need to check that. Otherwise it would need to also be in the loop which will add overhead of 2 clocks per iteration which I don't like.
One bonus I see here is that needing only 22 bits of skip information is handy because it leaves over 9 bits in a per mode skip table to hold a jump pointer.
A spreadsheet is the best tool I found for doing this. Each instruction can be dragged around in its cell, and rows can be added or moved such that each row contains the same type of instruction. Now the skip patterns can easily be figured out...
Make sure your browser screen is wide enough, or reduce its font, or copy it out to a file to be able to see this properly column aligned.
    doublebits            doublenits            doublenibbles         doublebytes           doublewords
    --------------------  --------------------  --------------------  --------------------  --------------------
1   rep #8, readburst     rep #11, readburst    rep #13, readburst    rep #6, readburst     rep #6, readburst
2   rdlut a, ptra++       rdlut a, ptra++       rdlut a, ptra++       rdlut a, ptra++       rdlut a, ptra++
3                                               setq nibblemask
4                         getword b, a, #1
5                         getword a, a, #0
6                         mergew a
7                         mul a, #3
8   mov b, a                                    mov b, a              mov b, a              mov b, a
9   movbyts a, #%%1010                          movbyts a, #%%1100    movbyts a, #%%1100    movbyts a, #%%1010
10                                              mov pb, a
11  mergew a              mergew a
12                                              shl pb, #4
13                                              muxq a, pb
14  wrlut a, ptrb++       wrlut a, ptrb++       wrlut a, ptrb++       wrlut a, ptrb++       wrlut a, ptrb++
15  movbyts b, #%%3232                          movbyts b, #%%3322    movbyts b, #%%3322    movbyts b, #%%3232
16  mergew b              mergew b
17                        mul b, #3
18                        mergew b
19                                              mov pb, b
20                                              shl pb, #4
21                                              muxq b, pb
22  wrlut b, ptrb++       wrlut b, ptrb++       wrlut b, ptrb++       wrlut b, ptrb++       wrlut b, ptrb++
Thanks for the update Chip. That is a number of instructions alright. COGINIT is the only one not likely to cause confusion. The RDLUT case, is that because of Xbyte feature?
RDLUT takes three clocks. I couldn't do it in two because of compounded LUT-access and result-mux'ing delays. So, it writes the read value into Q and there is already conduit in place to feed Q to the result quickly.
XBYTE doesn't affect Q, since it mux's the LUT read value directly into D for the hidden EXECF instruction that XBYTE does.
The Seuss one is certainly a weird looking thing. What applications would it typically be used for? Pseudo random sequences perhaps, though I guess it would become cyclic after only 32 operations. Some type of noise generator where the actual value of the long changes the delay until the next oscillator pulse..?
Yes, I need to get my head around this too, so I created ASM definitions for these functions and put them in a walking ones loop and printed the results for each operation.
I was able to get the skipf-based pixel doubling working now. It only saved 16 longs, but it's something. I found you can't do a SKIPF outside the REP block (as I expected), but it worked when inside the REP loop. Right now the doubling of longs is included in it too, but that has blown the budget out to more than half a scanline. That's fine for pure DVI right now, but ultimately I'd like to optimize further if possible. The 320x240 truecolor pixel doubling is a bit of a processor pig, and the 16bpp is pretty tight now too. The SKIPF adds a two-clock penalty in each inner loop iteration, plus one more dummy instruction is needed to avoid a skip of over 7 instructions in 24bpp mode, making the inner loop total alone in that mode 11 clocks per iteration x 320 iterations = 3520 cycles (352 pixels), before outer loop overhead and 960 long hub transfers are included. I also need to get the outer loop overhead down by increasing the burst size in these two more cycle-intensive modes. This is doable if I use more of the LUT and trash the palette space to get burst sizes larger than the 40 longs at a time I used today; the current outer loop penalty is in the order of 50 clocks x 8 transfers, which is already 400 clocks of my ideal 3200 budget. Using up more LUT space for this transfer will work okay for these two modes because neither colour mode uses a palette anyway.
doublepixels
rep instcount, readburst
skipf pattern
rdlut a, ptra++
setq nibblemask
getword b, a, #1
getword a, a, #0
mergew a
mul a, #3
mov b, a
movbyts a, bytemask1
mov pb, a
mergew a
shl pb, #4
muxq a, pb
wrlut a, ptrb++
movbyts b, bytemask2
mergew b
mul b, #3
mergew b
mov pb, b
shl pb, #4
muxq b, pb
wrlut b, ptrb++
ret
nibblemask long $0ff00ff0
instcount long 0
pattern long 0
bytemask1 long %%1100
bytemask2 long %%3322
' ____skip_pattern______inst
doublebits long %11111000110100111111_1001
doublenits long %11100010110111000010_1100
doublenibs long %00011100001000111100_1110
doublebytes long %11111100111100111110_0111
doublewords long %11111100111100111111_0111
doublelongs long %11111110111110111110_0101
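Since magic binary literals like these are hard to read, here is a small Python sketch that decodes the table entries back into the instructions each mode actually runs. It assumes skip bit 0 applies to the first instruction after SKIPF (the rdlut), a '1' bit means "skip", the low nibble of each entry is the REP instruction count, and the LSB of the shifted pattern is treated as a separate flag and cleared before use. The names `INSTRUCTIONS`, `TABLE` and `decode` are made up for illustration.

```python
# The 21 instructions that follow SKIPF inside the REP block, in order.
INSTRUCTIONS = [
    "rdlut a, ptra++", "setq nibblemask", "getword b, a, #1", "getword a, a, #0",
    "mergew a", "mul a, #3", "mov b, a", "movbyts a, bytemask1", "mov pb, a",
    "mergew a", "shl pb, #4", "muxq a, pb", "wrlut a, ptrb++",
    "movbyts b, bytemask2", "mergew b", "mul b, #3", "mergew b", "mov pb, b",
    "shl pb, #4", "muxq b, pb", "wrlut b, ptrb++",
]

# Table entries copied from the listing: skip pattern in the upper bits, count in the low nibble.
TABLE = {
    "doublebits":  0b11111000110100111111_1001,
    "doublenits":  0b11100010110111000010_1100,
    "doublenibs":  0b00011100001000111100_1110,
    "doublebytes": 0b11111100111100111110_0111,
    "doublewords": 0b11111100111100111111_0111,
    "doublelongs": 0b11111110111110111110_0101,
}

def decode(entry):
    """Return (rep instruction count, LSB flag, list of instructions that execute)."""
    instcount = entry & 0xF
    pattern = entry >> 4
    flag = bool(pattern & 1)   # LSB is assumed to be a flag, not a skip bit
    pattern &= ~1              # clear it so the rdlut is never skipped
    executed = [ins for i, ins in enumerate(INSTRUCTIONS)
                if not (pattern >> i) & 1]
    return instcount, flag, executed

if __name__ == "__main__":
    for name, entry in TABLE.items():
        count, flag, executed = decode(entry)
        print(f"{name}: rep count {count}, flag {flag}")
        for ins in executed:
            print("    " + ins)
```

Under these assumptions every entry's instruction count comes out as the number of executed instructions plus one, the extra one being the SKIPF itself at the top of the REP block.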
Does the 2-bit doubler work? It seems too easy in your code. Also LSB should be 0 for all skip patterns?
Yeah, it's beautiful in how small and elegant it can solve an issue, but writing (or worse, maintaining) skip code seems to be pretty nightmarish with the current tools.
Programming with SKIP(F) is so ugly...
The assembly language really needs to be extended to allow defining skip patterns alongside the actual code.
That would be 'nice to have', but how to approach it ?
I did think of a spreadsheet column scheme, but that needs strict column align and is not clear to read, and is making P2 even more obtuse to new users....
More work in the Assembler SW, but easier for users, would be to allow a CASE style syntax where the code is written multiple times, and then packed using SKIP,
The ASM list file could report the final SKIP packing in a column-tagged form, or compressed to a single code block.
That would be easy to work on, and easy to check if packed as users hoped.
Errors would generate when SKIP was exceeded, or SKIP illegal constructs were used.
Yes, I thought of a column scheme, too. Not ideal, but very significantly easier to write and read than magic binary literals somewhere else.
But the CASE idea is pretty neat, too.
It wouldn't even be super difficult to implement (based on my experience writing my own P1 XMM Ruby DSL COUGH I mean macro assembler )
So, the 320x240 RGB24 mode is the most tedious? The problem is just getting pixels to repeat, right?
Have you thought of just doing two XCONT's per pixel, using the same data?
That might work, but then I think I would need to stick around and do 320 XCONT instructions during the time I wish to process the next scanline, and it steals cycles from me doing that. Maybe a regular interrupt routine could do it in the background, but I am also using REP blocks. I prefer to just send a single XCONT to the streamer to buy a nice big block of time (~640 pixels worth) to be able to go off and process a bunch of things for a while in relative peace.
Does the 2-bit doubler work? It seems too easy in your code. Also LSB should be 0 for all skip patterns?
You've got a good eye. Yes, I make use of the skip pattern's LSB for an additional purpose; this setup code was not shown above. I test and clear it, and this indicates whether the two byte mask values used need to be xor'd or not. These masks also vary per colour depth, as do the repeat block length and skip patterns. Here's that part. "a" below is a pointer to the double table entry I've computed earlier per mode.
sets lookup, a
mov bytemask1, #%%1100 'setup default masks
mov bytemask2, #%%3322
lookup mov pattern, 0-0 'get the long data for this bpp
getnib instcount, pattern, #0 'get the number of rep instructions
shr pattern, #4 'move down to next data
bitl pattern, #0 wcz 'test and clear LSB of pattern
if_c xor bytemask1, #%00010100 'use it to flip the masks
if_c xor bytemask2, #%00010100
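To see what the XOR with %00010100 does to the two default masks, here is a short Python check. It assumes the %% prefix denotes base-4 digits and reuses a simple model of MOVBYTS (each 2-bit field of S selects a source byte); `quat` and `movbyts` are illustrative helper names.

```python
def quat(digits):
    """Parse a %%-style base-4 literal, e.g. quat("1100") for %%1100."""
    return int(digits, 4)

def movbyts(d, s):
    """Model of MOVBYTS D,S: result byte i is D's byte selected by S[2i+1:2i]."""
    result = 0
    for i in range(4):
        result |= ((d >> (8 * ((s >> (2 * i)) & 3))) & 0xFF) << (8 * i)
    return result

FLIP = 0b00010100   # the XOR constant applied when the pattern's LSB was set

if __name__ == "__main__":
    for default in ("1100", "3322"):
        flipped = quat(default) ^ FLIP
        print(f"%%{default} -> {flipped:08b}",
              f"${movbyts(0x44332211, quat(default)):08X}",
              f"${movbyts(0x44332211, flipped):08X}")
```

In this model the default masks duplicate bytes (pairs like 1,1,0,0) while the flipped masks duplicate words (1,0,1,0), which matches the two mask variants that appear in the spreadsheet columns.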
The two bit doubler appears to double the pixel size on the screen but looking again just now I am noticing that the colour palette is doing something strange. It might not be correct and may be repeating the same bit instead of an actual pair, I will check into it...
Maybe the streamer could help with doubling RGB32. Create a table of bytes in hub ram [0,0,1,1,2,2,...,254,254,255,255] and rdfast that address. Put the pixel data in the LUT. Start the streamer in 8 bit LUT mode.
A minor problem is we can't get all the way across the screen on 1 xcont. After 256 input pixels we'd need to do another xcont with a different LUT offset to get the rest. Or we could split the screen width in half. I don't think it's even that bad to have to use 2 xconts to get the full screen width, as we can load one part while the streamer displays the other part.
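As a rough illustration of the idea, the Python sketch below models the index table and the two-pass lookup. It is not a model of the streamer itself: `stream` is a hypothetical stand-in for one xcont in 8-bit LUT mode, and it assumes the second pass restarts reading the byte table from its beginning with the LUT offset moved to $100.

```python
def build_index_table():
    """The hub byte table [0,0,1,1,...,255,255]: each LUT index appears twice."""
    return bytes(i >> 1 for i in range(512))

def stream(lut, table, lut_offset, count):
    """Hypothetical stand-in for one xcont: look up `count` table bytes in the LUT."""
    return [lut[lut_offset + table[i]] for i in range(count)]

def double_line(pixels):
    """Double up to 512 pixels held in the LUT using at most two passes."""
    lut = list(pixels)
    table = build_index_table()
    first = min(len(lut), 256)              # an 8-bit index reaches 256 entries
    out = stream(lut, table, 0, 2 * first)
    rest = len(lut) - first
    if rest:
        out += stream(lut, table, 0x100, 2 * rest)   # second pass, new LUT offset
    return out

if __name__ == "__main__":
    line = double_line(range(1000, 1320))   # 320 distinct dummy pixel values
    print(len(line), line[:6])
```

For a 320 pixel line the first pass covers 256 input pixels and the second covers the remaining 64.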
It's a very interesting idea Saucy! I think we could certainly load up all 320 into the LUT and switch palette base address at the 256 pixel mark from 0 to $100 with a second xcont instruction. I was trying to keep the first 64 entries in the upper LUTRAM half available anyway for general transfer use for other reasons so this might have merit. Certainly if the entire screen is in the same mode I could see it may work.
In the future, if each scanline can run in a different mode and uses LUTRAM in different ways, there could be problems in sharing it and setting it up for the next line while it is currently in use.
I guess the mouse sprite rendering would also have to be done differently in that case, as it would need to render into LUT instead of HUB, as I was planning.
One thing I've found with the LUT is that you need to keep the lower byte 0, otherwise the TMDS output can become corrupted. I think this is because some bit(s) of the lower byte tell the hardware to issue a direct 10b symbol instead of translating RGB video symbols to 10b.
If bit1 is high in the HDMI data, it goes into literal mode, using the 30 bits above bit 1 to make three 10-bit fields. Bit 0 is normally used in VGA mode for HSYNC. I have not tried it, but it should be possible to output both HDMI and VGA simultaneously from the same streamer.
Does the 2-bit doubler work? It seems too easy in your code. Also LSB should be 0 for all skip patterns?
LOL, I don't know what I was dreaming coming up with that doubling sequence, it's broken for sure. Even the multiply by 3 doesn't work if it's using a 16x16 multiplier.
I have looked further at this with Chip's diagrams and I am thinking I finally might have come across a simple sequence for the 2-bit doubling.
32 bit initial input is: XXXXXXXX_XXXXXXXX_abcdefgh_ijklmnop
SPLITW a ' result XXXXXXXX_acegikmo_XXXXXXXX_bdfhjlnp
MOVBYTS a, #%%2020 ' result acegikmo_bdfhjlnp_acegikmo_bdfhjlnp
MERGEB a ' result ababcdcd_efefghgh_ijijklkl_mnmnopop
Simply repeat for doing the upper 16 bits, just change the MOVBYTS value accordingly.
I think this is right, or have I messed up again...nothing surprises me with this shuffling thing, it's been doing my head in. I'll test it out.
There could be something similar for replicating nibbles too. I'm almost convinced of it.
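The SPLITW / MOVBYTS / MERGEB sequence can be checked on a PC before trying it on hardware. The Python sketch below models the three instructions from their bit orderings as shown in the worked example (SPLITW puts odd bits in the upper word and even bits in the lower word; MERGEB interleaves the bits of the four bytes) and compares the result against a straightforward reference. The function names are illustrative, and the %%3131 selector for the upper 16 bits is my own assumption for "change the MOVBYTS value accordingly".

```python
def splitw(d):
    """Model of SPLITW: odd bits of D go to the upper word, even bits to the lower."""
    hi = lo = 0
    for i in range(16):
        lo |= ((d >> (2 * i)) & 1) << i
        hi |= ((d >> (2 * i + 1)) & 1) << i
    return (hi << 16) | lo

def movbyts(d, s):
    """Model of MOVBYTS: result byte i is D's byte selected by S[2i+1:2i]."""
    result = 0
    for i in range(4):
        result |= ((d >> (8 * ((s >> (2 * i)) & 3))) & 0xFF) << (8 * i)
    return result

def mergeb(d):
    """Model of MERGEB: result = {D[31],D[23],D[15],D[7], D[30],D[22],...,D[8],D[0]}."""
    result = 0
    for j in range(8):          # bit position within each byte
        for k in range(4):      # byte index
            result |= ((d >> (8 * k + j)) & 1) << (4 * j + k)
    return result

def double_2bpp(a, upper=False):
    """Double the eight 2-bit pixels in the lower (or upper) 16 bits of a."""
    selector = int("3131", 4) if upper else int("2020", 4)   # %%3131 is assumed
    return mergeb(movbyts(splitw(a), selector))

def reference(v16):
    """Plain reference: repeat each 2-bit pixel of a 16-bit value."""
    out = 0
    for i in range(8):
        px = (v16 >> (2 * i)) & 3
        out |= (px | (px << 2)) << (4 * i)
    return out

if __name__ == "__main__":
    a = 0xDEAD1234
    print(f"${double_2bpp(a):08X} ${double_2bpp(a, upper=True):08X}")
```

In this model the three-instruction sequence agrees with the reference for the values tried, and the contents of the other 16 bits do not affect the result.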
Comments
Good idea about using MUXQ to read Q.
Is the second setq nibblemask needed?
I'm glad this prompted figuring out what changes Q too. That could be a gotcha for people otherwise.
It might look ugly but it's really beautiful!