Here's the updated code for pixel transfers; I still have to do the two-bit version. I've noticed many similarities in the sequences, which could probably be leveraged using skipf to reduce this further.
pixelmove long normalpixels
normalpixels
rep #2, readburst
rdlut a, ptra++
wrlut a, ptrb++
ret
doublebits
rep #8, readburst
rdlut a, ptra++
mov b, a
movbyts a, #%%1010
movbyts b, #%%3232
mergew a
wrlut a, ptrb++
mergew b
wrlut b, ptrb++
ret
doublebytes
rep #6, readburst
rdlut a, ptra++
mov b, a
movbyts a, #%%1100
wrlut a, ptrb++
movbyts b, #%%3322
wrlut b, ptrb++
ret
doublewords
rep #6, readburst
rdlut a, ptra++
mov b, a
movbyts a, #%%1010
wrlut a, ptrb++
movbyts b, #%%3232
wrlut b, ptrb++
ret
doublelongs
rep #3, readburst
rdlut a, ptra++
wrlong a, ptrb++
wrlong a, ptrb++
ret
doublenibbles
setq nibblemask
rep #12, readburst
rdlut b, ptra++
getword a, b, #1
movbyts b, #%%1100
mov pb, b
shl pb, #4
muxq b, pb
wrlut b, ptrb++
movbyts a, #%%1100
mov pb, a
shl pb, #4
muxq a, pb
wrlut a, ptrb++
ret
nibblemask long $0ff00ff0
alts modedata, #skmasks 'set up skipmask based on mode
mov skmask, 0-0
skipf skmask
alts modedata, #replengths 'replength must be adjusted for skipf
mov replength, 0-0
alts modedata, #amaps 'first transform map if needed
mov amap, 0-0
alts modedata, #bmaps 'second transform map if needed
mov bmap, 0-0
setq nibblemask
rep replength, readburst
rdlut a, ptra++
mov b, a 'copy for later
getword b, a, #1 'save second half
movbyts a, amap 'first transform
mov pb, a 'take a copy
shl pb, #4 'nibble shift first half
muxq a, pb 'merge nibble copy
mergew a 'double bits from first transform
wrlut a, ptrb++ 'write first result
movbyts b, bmap 'second transform
mergew b 'double bits from second transform
mov pb, b 'take a copy
shl pb, #4 'nibble shift second half
muxq b, pb 'merge nibble copy
wrlut b, ptrb++ 'write second result
wrlong a, ptrb++ 'write first long
wrlong a, ptrb++ 'write second long
ret 'done
replengths
3
8
6
6
4
12
skmasks
%011111111011111110011100
%011011100001110100010000
%011011110011110100010000
%011011110011110100010000
%000111111111111110011100
%011000010010000010000000
amaps
0
%%1010
%%1100
%%1010
0
%%1100
bmaps
0
%%3232
%%3322
%%3232
0
%%1100
'modedata
' normalpixels = 0
' doublebits = 1
' doublebytes = 2
' doublewords = 3
' doublelongs = 4
' doublenibbles = 5
Apologies for the comment alignment, but I've run out of time to adjust the number of spaces.
replengths have been adjusted to account for cancelled instructions in the pipeline where there are more than 7 in a row. I think that's necessary, but can't test to be certain.
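To sanity-check the doublebits sequence without hardware, here is a minimal Python sketch of the two instructions it relies on. It assumes my reading of the P2 definitions: MOVBYTS picks each result byte from D using a two-bit field of S, and MERGEW interleaves the low and high words of D bit by bit. The helper names (movbyts, mergew, doublebits) are hypothetical and only mirror the mnemonics.

```python
def movbyts(d, s):
    """Model of MOVBYTS: D = {D.BYTE[S[7:6]], D.BYTE[S[5:4]], D.BYTE[S[3:2]], D.BYTE[S[1:0]]}."""
    def byte(n):
        return (d >> (8 * n)) & 0xFF
    return (byte((s >> 6) & 3) << 24) | (byte((s >> 4) & 3) << 16) \
        | (byte((s >> 2) & 3) << 8) | byte(s & 3)


def mergew(d):
    """Model of MERGEW: result bit 2i comes from D[i], bit 2i+1 from D[16+i]."""
    out = 0
    for i in range(16):
        out |= ((d >> i) & 1) << (2 * i)
        out |= ((d >> (16 + i)) & 1) << (2 * i + 1)
    return out


def doublebits(a):
    """Mirror of the doublebits loop body: one source long in, two doubled longs out."""
    first = mergew(movbyts(a, 0b01000100))   # movbyts a, #%%1010 then mergew a
    second = mergew(movbyts(a, 0b11101110))  # movbyts b, #%%3232 then mergew b
    return first, second


if __name__ == "__main__":
    first, second = doublebits(0x80010003)
    print(f"{first:08x} {second:08x}")
```

Copying the low word into both halves before MERGEW is what turns the interleave into a doubling: each source bit ends up next to a copy of itself.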
Untested, but by my calculations each method takes:
Looks interesting, AJL. I'll have to take a look in more detail. I also need to make sure that the bursts are big enough to help reduce the loop overhead; however, it can't all fit in COG or LUT memory at once, so I need to break it up into smaller transfers. Right now I transfer up to 40 longs in each burst for processing pixels from the source buffer, and obviously double this to 80 longs at a time when writing back out to the next scanline buffer. The number of bursts required varies with the pixel depth. I think right now the doubling of longs from 320 pixels to 640 is not going to meet the budget unless I optimize it. In general I want to keep all of these below about 3100 clocks, under half an active scanline. I think it is just about doable. I can also share the input and output buffer if I process backwards instead of forwards, using ptra-- and ptrb--, though I haven't resorted to that trick yet.
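The budget arithmetic can be sketched roughly as follows. The 40-long burst size and the 3100 clock target come from the post above; the per-instruction costs (2 clocks for most instructions, 3 for RDLUT, no REP overhead) are my assumptions rather than figures from this thread, and hub transfer time for the bursts themselves is not included. All function names are hypothetical.

```python
BUDGET_CLOCKS = 3100     # target from the post: under half an active scanline
LONGS_PER_BURST = 40     # source longs transferred per burst, from the post


def source_longs(pixels, bpp):
    """Number of 32-bit longs needed to hold one line of source pixels."""
    return pixels * bpp // 32


def bursts(longs, longs_per_burst=LONGS_PER_BURST):
    """Bursts needed to move that many longs, rounding up."""
    return -(-longs // longs_per_burst)


def loop_clocks(iterations, instructions, rdluts=1, alu_clocks=2, rdlut_clocks=3):
    """Estimated clocks for a REP loop (assumed timings, see lead-in)."""
    per_iteration = (instructions - rdluts) * alu_clocks + rdluts * rdlut_clocks
    return iterations * per_iteration


if __name__ == "__main__":
    # (label, bits per pixel, instructions in the REP block)
    for name, bpp, instructions in (("doublenibbles", 4, 12), ("doublewords", 16, 6)):
        longs = source_longs(320, bpp)
        clocks = loop_clocks(longs, instructions)
        print(name, longs, bursts(longs), clocks, clocks <= BUDGET_CLOCKS)
```

Under these assumptions the processing loops themselves fit the target; the margin left for the hub bursts is what decides whether a mode meets the budget.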
I have some of the doubling working now on my LCD monitor: the 1bpp, 8bpp, 16bpp and 24bpp modes all seem to be doubling pixels on the screen, giving a 320 pixel resolution. I still need to work on the nibble and 2bpp modes. My nibble doubling mode seems to be broken, maybe a bug. 2bpp is not coded yet. Getting pretty close now...
Fixed the nibble mode. I just found out that setq does not retain its value after a muxq operation. In the sample code above I had it outside the loop, done just once at the start, but found it needs to be inside the loop for the pixels to double, evanh.
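A small Python model shows why the nibble doubling depends on Q still holding nibblemask when MUXQ executes. It assumes MUXQ computes D = (D & !Q) | (S & Q); the helper names are hypothetical, and the "clobbered" Q value of 0 is an arbitrary stand-in, since the thread does not say what Q contains after it has been disturbed.

```python
MASK32 = 0xFFFFFFFF
NIBBLEMASK = 0x0FF00FF0  # same constant as nibblemask in the listing


def dup_bytes(word16):
    """Equivalent of movbyts x, #%%1100 on a 16-bit value: {b1, b1, b0, b0}."""
    b0 = word16 & 0xFF
    b1 = (word16 >> 8) & 0xFF
    return (b1 << 24) | (b1 << 16) | (b0 << 8) | b0


def muxq(d, s, q):
    """Model of MUXQ: take bits from S where Q is 1, keep D elsewhere."""
    return ((d & ~q) | (s & q)) & MASK32


def double_nibbles(word16, q=NIBBLEMASK):
    """Mirror of the movbyts / mov pb / shl pb, #4 / muxq steps for one word."""
    b = dup_bytes(word16)
    pb = (b << 4) & MASK32
    return muxq(b, pb, q)


if __name__ == "__main__":
    print(f"{double_nibbles(0x1234):08x}")        # Q holds nibblemask
    print(f"{double_nibbles(0x1234, q=0):08x}")   # hypothetical clobbered Q
```

With the intended mask every nibble appears twice in order; with any other Q the shifted copy is merged in the wrong places, so the output is altered but not doubled.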
I've tested MUXQ before, what I've said is correct. Either something else is modifying Q, unlikely, or the bug was coincidental, like the hardcoded REP length! I've learnt with experience not to do that.
Weird, I changed it to be inside the loop and it fixed the problem with a modified loop length of 14. I would tend to agree hard coding it is rather dangerous and I've been caught out before with incorrect sizes. I think 12 is the right number though in this case. Maybe Q changes after one of the instructions in my loop?
So in the current state of my code, all P2 colour modes now seem to be working at VGA resolution over DVI, and one of the video "modes" below is selectable per frame:
Two text modes, both 16 colour LUT based, optional line doubling,
either a flashing or high intensity background, flashing block text cursor with
data read from 16 bit character screen memory (classic VGA type of screen buffer)
- 80 column 16 colour text, 8xF size font
- 40 column 16 colour text, 8xF size font
Multiple graphics modes in one of four resolutions:
- 640xN
- 320xN
- 640xN/2 (line doubled)
- 320xN/2 (line doubled)
Colour modes are:
- LUMA8 (all 8 colours, 8 bit luminance)
- RGBI8 (3 bit colour, 5 bit luminance)
- RGB8 (3:3:2)
- RGB16 (5:6:5)
- RGB24 (8:8:8)
- LUT palette 1bpp
- LUT palette 2bpp
- LUT palette 4bpp
- LUT palette 8bpp
The active scanline count N can be set up as 350, 400, 480, etc.; it's statically configurable with
suitable frame timing blanking parameters at compile time. Perhaps this could be made dynamic at some point.
The font scanline height F is also configurable per frame.
As it stands today this driver still has ~60 COG longs free and 256 LUTRAM longs free (at times, depending on where I double the pixels). This space should increase with optimizations. It can also be massively increased if the code for the chosen mode is dynamically loaded for execution, though that risks crashing the video driver if hub memory is corrupted; right now everything is nicely self-contained and stable. I will look into including a few other niceties like a second block mouse cursor in text mode, frame sync state update, either a line or block cursor with or without flashing, and a graphics mouse sprite in graphics modes, plus buffer wraparounds for scrolling. Then this should be plenty usable even before other things get added in time, especially HyperRAM buffer support, etc.
Should be useful for others once P2 rev B's with HDMI are released in larger volumes. I'll certainly be using it for my own debug soon.
Unless the tool set includes a method for automatically generating rep length parameters for code using skipf then hard coded is going to be necessary. I use a spreadsheet to write up the code, generate the skip patterns, and calculate the rep length values.
I accept that this isn’t likely to occur much in general code, but for drivers that are trying to squeeze multimodal code into COG/LUT space while keeping as much space free for buffers it seems like it will be necessary.
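The spreadsheet step can be approximated with a short script. This is only a sketch under stated assumptions: the two-mode instruction list is a made-up example rather than the table from the listing, and it assumes the SKIPF pattern is read LSB first with a 1 bit meaning "skip". It reports how many instructions each mode actually executes, but does not model the REP length adjustment for long runs of skipped instructions discussed earlier.

```python
# Each entry: (instruction text, set of modes that execute it). Hypothetical example.
SEQUENCE = [
    ("rdlut a, ptra++", {"normal", "bits"}),
    ("mov b, a", {"bits"}),
    ("movbyts a, amap", {"bits"}),
    ("mergew a", {"bits"}),
    ("wrlut a, ptrb++", {"normal", "bits"}),
    ("movbyts b, bmap", {"bits"}),
    ("mergew b", {"bits"}),
    ("wrlut b, ptrb++", {"bits"}),
]


def skip_pattern(sequence, mode):
    """Return (skip mask, executed instruction count) for one mode.

    Bit n of the mask corresponds to instruction n of the sequence;
    a 1 bit marks an instruction the mode does not use.
    """
    mask = 0
    executed = 0
    for bit, (_, modes) in enumerate(sequence):
        if mode in modes:
            executed += 1
        else:
            mask |= 1 << bit
    return mask, executed


if __name__ == "__main__":
    for mode in ("normal", "bits"):
        mask, executed = skip_pattern(SEQUENCE, mode)
        print(f"{mode:8} %{mask:0{len(SEQUENCE)}b} executes {executed}")
```

Generating the masks from a single tagged instruction list means a reordered or inserted instruction updates every mode's pattern at once, which is the main risk with hand-maintained values.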
RDLUT destroying Q is important to know but it doesn't appear to be mentioned in either the spreadsheet or the documentation. So XBYTE also affects Q?
I assume rdlut changes Q because setq+rdlut is a legal combination. I would expect any instruction for which setq is a valid prefix probably sets Q to a default value first.
Ah, yes, that'll be the why. I've completely ignored xbyte discussions but I see the xbyte sequence uses a hidden rdlut. I'm guessing that data is needed stored close to the ALU/pipeline and Q is it.
So I also suspect xbyte is the only reason Q is used by RDLUT at all.
The Q register is just a parameter for those types of instructions. EDIT: Or more importantly, the modal changes in those instructions are not triggered unless SETQ is the prefixing instruction.
SETQ also shields interrupts to protect the next instruction.
Beware that interrupt routines may disrupt Q's value. So, it's safest to always do a fresh SETQ.
That's not much of an argument for an object that is running the whole cog - without interrupts. And debuggers can preserve Q using MUXQ and SETQ. And, the optimised, but buggy, example above actually is protected from interrupts anyway. The SETQ is followed immediately by a REP.
Am I missing something? Shouldn't it be movbyts pa, #%01010000 ?
Does the %% change the interpretation of the immediate?
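On the %% question: in PASM2 the %% prefix denotes a base-4 (quaternary) literal, so each digit stands for two bits and maps directly onto one of MOVBYTS's four byte selectors. A quick Python check of the constants used in this thread (the helper name is hypothetical):

```python
def quaternary_to_binary(digits):
    """Convert a %% (base-4) literal such as '1100' to an 8-bit binary string."""
    return format(int(digits, 4), "08b")


if __name__ == "__main__":
    for literal in ("1100", "1010", "3322", "3232"):
        print(f"%%{literal} = %{quaternary_to_binary(literal)}")
```

So #%%1100 and #%01010000 are the same immediate value, just written in different bases.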
Edit. Doh! No that won't work for nibbles = 0, Scrap that idea.
Time to change brand of coffee I think!
Here is the bad code: pixels are not doubled on the screen, though they are modified in a weird way, and the line is still the same 640 size on screen:
Here is the working code: pixels are doubled properly, 320 on screen:
Why did that come about, Chip?
Is lutRAM location $1ff real? Strangely, RDLUT reg, ##$1ff doesn't affect Q.
Ha! Ah, no, somehow lutRAM address $1ff had the same data I was using for SETQ
Xbyte will be the reason why RDLUT is different.
Q gets used for data capture in a few ways. I will list them when I am at my computer today.