Here's the updated code for pixel transfers; I still have to do the two-bit version. I've noticed many similarities between the sequences, which could probably be exploited using skipf to reduce this down further.
pixelmove long normalpixels

normalpixels
        rep #2, readburst
        rdlut a, ptra++
        wrlut a, ptrb++
        ret

doublebits
        rep #8, readburst
        rdlut a, ptra++
        mov b, a
        movbyts a, #%%1010
        movbyts b, #%%3232
        mergew a
        wrlut a, ptrb++
        mergew b
        wrlut b, ptrb++
        ret

doublebytes
        rep #6, readburst
        rdlut a, ptra++
        mov b, a
        movbyts a, #%%1100
        wrlut a, ptrb++
        movbyts b, #%%3322
        wrlut b, ptrb++
        ret

doublewords
        rep #6, readburst
        rdlut a, ptra++
        mov b, a
        movbyts a, #%%1010
        wrlut a, ptrb++
        movbyts b, #%%3232
        wrlut b, ptrb++
        ret

doublelongs
        rep #3, readburst
        rdlut a, ptra++
        wrlong a, ptrb++
        wrlong a, ptrb++
        ret

doublenibbles
        setq nibblemask
        rep #12, readburst
        rdlut b, ptra++
        getword a, b, #1
        movbyts b, #%%1100
        mov pb, b
        shl pb, #4
        muxq b, pb
        wrlut b, ptrb++
        movbyts a, #%%1100
        mov pb, a
        shl pb, #4
        muxq a, pb
        wrlut a, ptrb++
        ret

nibblemask long $0ff00ff0
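The byte shuffles above are easy to get wrong, so here is a small Python model for checking them on a PC. It is only a sketch: `movbyts`, `mergew` and `muxq` are my own re-implementations of how I understand those instructions to behave, and `reference_double` is a hypothetical helper that simply repeats every pixel of a given width. None of this has been run against real hardware.

```python
MASK32 = 0xFFFFFFFF
NIBBLEMASK = 0x0FF00FF0


def q4(digits):
    """Parse a %%-style base-4 literal, e.g. q4("1100") == 0b01010000."""
    return int(digits, 4)


def movbyts(d, sel):
    """Model of MOVBYTS: each 2-bit field of sel picks the source byte for one result byte."""
    out = 0
    for i in range(4):
        src = (sel >> (2 * i)) & 3
        out |= ((d >> (8 * src)) & 0xFF) << (8 * i)
    return out


def mergew(d):
    """Model of MERGEW: low-word bit n goes to bit 2n, high-word bit n to bit 2n+1."""
    out = 0
    for n in range(16):
        out |= ((d >> n) & 1) << (2 * n)
        out |= ((d >> (16 + n)) & 1) << (2 * n + 1)
    return out


def muxq(d, s, q):
    """Model of MUXQ: copy bits of s into d wherever q has a 1."""
    return (d & ~q & MASK32) | (s & q)


def double_bits(x):
    return mergew(movbyts(x, q4("1010"))), mergew(movbyts(x, q4("3232")))


def double_bytes(x):
    return movbyts(x, q4("1100")), movbyts(x, q4("3322"))


def double_words(x):
    return movbyts(x, q4("1010")), movbyts(x, q4("3232"))


def double_nibbles(x):
    out = []
    for src in (x, x >> 16):          # second pass mirrors GETWORD a, b, #1
        v = movbyts(src, q4("1100"))
        pb = (v << 4) & MASK32        # SHL pb, #4
        out.append(muxq(v, pb, NIBBLEMASK))
    return tuple(out)


def reference_double(x, bpp):
    """Hypothetical reference: 64-bit result with every bpp-wide pixel repeated twice."""
    out = 0
    for i in range(32 // bpp):
        px = (x >> (i * bpp)) & ((1 << bpp) - 1)
        out |= px << (2 * i * bpp)
        out |= px << ((2 * i + 1) * bpp)
    return out


if __name__ == "__main__":
    lo, hi = double_nibbles(0x12345678)
    print(hex(lo), hex(hi))  # 0x55667788 0x11223344
```

In this model each routine returns the two output longs in the order they are written, so `lo | (hi << 32)` can be compared directly with `reference_double`.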
        alts modedata, #skmasks  'set up skipmask based on mode
        mov skmask, 0-0
        skipf skmask
        alts modedata, #replengths  'replength must be adjusted for skipf
        mov replength, 0-0
        alts modedata, #amaps  'first transform map if needed
        mov amap, 0-0
        alts modedata, #bmaps  'second transform map if needed
        mov bmap, 0-0
        setq nibblemask
        rep replength, readburst
        rdlut a, ptra++
        mov b, a  'copy for later
        getword b, a, #1  'save second half
        movbyts a, amap  'first transform
        mov pb, a  'take a copy
        shl pb, #4  'nibble shift first half
        muxq a, pb  'merge nibble copy
        mergew a  'double bits from first transform
        wrlut a, ptrb++  'write first result
        movbyts b, bmap  'second transform
        mergew b  'double bits from second transform
        mov pb, b  'take a copy
        shl pb, #4  'nibble shift second half
        muxq b, pb  'merge nibble copy
        wrlut b, ptrb++  'write second result
        wrlong a, ptrb++  'write first long
        wrlong a, ptrb++  'write second long
        ret  'done

replengths
        long 3, 8, 6, 6, 4, 12
skmasks
        long %011111111011111110011100
        long %011011100001110100010000
        long %011011110011110100010000
        long %011011110011110100010000
        long %000111111111111110011100
        long %011000010010000010000000
amaps
        long 0, %%1010, %%1100, %%1010, 0, %%1100
bmaps
        long 0, %%3232, %%3322, %%3232, 0, %%1100

'modedata
' normalpixels = 0
' doublebits = 1
' doublebytes = 2
' doublewords = 3
' doublelongs = 4
' doublenibbles = 5
Apologies for the comment alignment, but I've run out of time to adjust the number of spaces.
replengths have been adjusted to account for cancelled instructions in the pipeline where there are more than 7 in a row. I think that's necessary, but can't test to be certain.
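To make the adjustment rule concrete, here is a rough Python sketch of how I would count the REP length for a given skip pattern. It assumes that skipf can leap over at most 7 instructions in a row, that the 8th consecutive skipped instruction is cancelled rather than leapt over, and that a cancelled instruction still counts towards the REP length. The function name and the hand-written patterns are hypothetical; the patterns are my own reading of which loop instructions each mode keeps.

```python
def effective_rep_length(skip_flags, max_leap=7):
    """Count the REP slots used by a loop body under skipf.

    skip_flags: one flag per instruction in the loop body, in program order
                (1 = skipped, 0 = executed).
    Assumption: after max_leap consecutive skips, the next skipped instruction
    is cancelled instead of leapt over and occupies a slot like a NOP.
    """
    length = 0
    run = 0
    for skipped in skip_flags:
        if not skipped:
            length += 1      # executed instruction
            run = 0
        else:
            run += 1
            if run > max_leap:
                length += 1  # cancelled instruction still takes a slot
                run = 0
    return length


# Loop body is 17 instructions long (rdlut ... second wrlong).
# normalpixels: rdlut, 7 skipped, first wrlut, 8 skipped.
normalpixels = [0] + [1] * 7 + [0] + [1] * 8
# doublelongs: rdlut, 14 skipped, two wrlongs.
doublelongs = [0] + [1] * 14 + [0, 0]

if __name__ == "__main__":
    print(effective_rep_length(normalpixels))  # 3
    print(effective_rep_length(doublelongs))   # 4
```

Under those assumptions the two patterns give 3 and 4, which agrees with the first and fifth replengths entries, but that is only a consistency check of the arithmetic, not a test of the hardware behaviour.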
Looks interesting, AJL. I'll have to take a look in more detail. I also need to make sure that the bursts are big enough to reduce the loop overhead; however, it can't all fit in COG or LUT memory at once, so I need to break it up into smaller transfers. Right now I transfer up to 40 longs in each burst when processing pixels from the source buffer, and obviously double this when writing back, so 80 longs go out at a time to the next scanline buffer. The number of bursts required varies with the pixel depth. I think right now the doubling of longs from 320 pixels to 640 is not going to meet the budget unless I optimize it. In general I want to keep all of these below about 3100 clocks or so, under half an active scanline. I think it is just about doable. I can also share the input and output buffer if I process backwards instead of forwards, using ptra-- and ptrb-- etc., though I haven't resorted to that trick yet.
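As a rough illustration of how the burst count varies with pixel depth, here is a small Python calculation. It assumes 320 source pixels per line, the 40-long burst size mentioned above, and one long per pixel for the 24bpp mode (which is what the doublelongs routine implies); the function names are made up for this sketch.

```python
BURST_LONGS = 40  # source longs read per burst, as described in the post


def source_longs(pixels, bits_per_pixel):
    """Longs needed to hold one line of source pixels (rounded up)."""
    return (pixels * bits_per_pixel + 31) // 32


def bursts_needed(pixels, bits_per_pixel, burst_longs=BURST_LONGS):
    """Number of bursts needed to process one source line (rounded up)."""
    longs = source_longs(pixels, bits_per_pixel)
    return (longs + burst_longs - 1) // burst_longs


if __name__ == "__main__":
    # 32 stands in for the 24bpp mode, assumed stored as one long per pixel
    for bpp in (1, 2, 4, 8, 16, 32):
        print(bpp, source_longs(320, bpp), bursts_needed(320, bpp))
```

For 320 pixels this gives a single burst for 1, 2 and 4 bits per pixel, 2 bursts at 8bpp, 4 at 16bpp and 8 for one long per pixel, which is why the long-doubling case is the one most likely to strain the clock budget.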
I have some of the doubling working now on my LCD monitor: the 1bpp, 8bpp, 16bpp and 24bpp modes all seem to be doubling pixels on the screen, giving a 320 pixel resolution. I still need to work on the nibble and 2bpp modes. My nibble doubling mode seems to be broken, maybe a bug. 2bpp is not coded yet. Getting pretty close now...
Fixed the nibble mode. I just found out that setq does not retain its value after a muxq operation. In the sample code above I had it outside the loop, done just once at the start, but found it needs to be inside the loop for pixels to double, evanh.
I've tested MUXQ before, what I've said is correct. Either something else is modifying Q, unlikely, or the bug was coincidental, like the hardcoded REP length! I've learnt with experience not to do that.
Weird, I changed it to be inside the loop and it fixed the problem with a modified loop length of 14. I would tend to agree hard coding it is rather dangerous and I've been caught out before with incorrect sizes. I think 12 is the right number though in this case. Maybe Q changes after one of the instructions in my loop?
Here is the bad code: pixels are not doubled on the screen, though they are modified in a weird way, and the image is still the same size, 640 on screen.
doublenibbles
        setq nibblemask
        rep #12, readburst
        rdlut b, ptra++
        getword a, b, #1
        movbyts b, #%%1100
        mov pb, b
        shl pb, #4
        muxq b, pb
        wrlut b, ptrb++
        movbyts a, #%%1100
        mov pb, a
        shl pb, #4
        muxq a, pb
        wrlut a, ptrb++
        ret
Here is the working code, pixels doubled properly, 320 on screen.
doublenibbles
        rep #14, readburst
        rdlut b, ptra++
        getword a, b, #1
        movbyts b, #%%1100
        mov pb, b
        shl pb, #4
        setq nibblemask
        muxq b, pb
        wrlut b, ptrb++
        movbyts a, #%%1100
        mov pb, a
        shl pb, #4
        setq nibblemask
        muxq a, pb
        wrlut a, ptrb++
        ret
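For anyone wanting to see why a stale Q breaks the doubling, here is a small Python model of one half of the nibble step. It is a sketch under a loud assumption: that RDLUT leaves the long it just read in Q. That is only one possible explanation consistent with the symptoms; the exact value Q ends up with is not confirmed anywhere here. `movbyts` and `muxq` are my own models of those instructions.

```python
MASK32 = 0xFFFFFFFF
NIBBLEMASK = 0x0FF00FF0


def movbyts(d, sel):
    """Model of MOVBYTS: each 2-bit field of sel picks the source byte for one result byte."""
    out = 0
    for i in range(4):
        src = (sel >> (2 * i)) & 3
        out |= ((d >> (8 * src)) & 0xFF) << (8 * i)
    return out


def muxq(d, s, q):
    """Model of MUXQ: copy bits of s into d wherever q has a 1."""
    return (d & ~q & MASK32) | (s & q)


def nibble_step(data, q):
    """Low-half nibble doubling of one long, using whatever value Q currently holds."""
    v = movbyts(data, 0b01010000)  # same selector as #%%1100
    pb = (v << 4) & MASK32
    return muxq(v, pb, q)


def bad_loop(data):
    # ASSUMPTION: RDLUT overwrote Q with the data it read, so the earlier SETQ is lost
    return nibble_step(data, q=data)


def good_loop(data):
    # SETQ nibblemask issued immediately before MUXQ
    return nibble_step(data, q=NIBBLEMASK)


if __name__ == "__main__":
    print(hex(good_loop(0x12345678)))  # 0x55667788
    print(hex(bad_loop(0x12345678)))   # 0x44662e00
```

With a fresh mask the low word $5678 becomes $55667788 as intended; with Q clobbered the result is data-dependent garbage of the same width, which would fit the "modified in a weird way, still 640" description.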
So in the current state of my code, all P2 colour modes now seem to be working at VGA resolution over DVI, and one of the video "modes" below is selectable per frame:
Two text modes, both 16 colour LUT based, optional line doubling,
either a flashing or high intensity background, flashing block text cursor with
data read from 16 bit character screen memory (classic VGA type of screen buffer)
- 80 column 16 colour text, 8xF size font
- 40 column 16 colour text, 8xF size font
Multiple graphics modes in one of four resolutions:
- 640xN
- 320xN
- 640xN/2 (line doubled)
- 320xN/2 (line doubled)
Colour modes are:
- LUMA8 (all 8 colours, 8 bit luminance)
- RGBI8 (3 bit colour, 5 bit luminance)
- RGB8 (3:3:2)
- RGB16 (5:6:5)
- RGB24 (8:8:8)
- LUT palette 1bpp
- LUT palette 2bpp
- LUT palette 4bpp
- LUT palette 8bpp
The active scanline count N can be set up as 350, 400, 480, etc. It's statically configurable with
suitable frame timing blanking parameters at compile time. Perhaps this could be made dynamic at some point.
The font scanline height F is also configurable per frame.
As it stands today this driver still has ~60 COG longs free and 256 LUTRAM free (at times, depending on where I double the pixels). This space should be able to increase with optimizations. It can also be massively increased if the code for the chosen mode is dynamically loaded to be executed, though that risks crashing the video driver if hub memory is corrupted; right now everything is nicely self-contained and stable. I will look into including a few other niceties like a second block mouse cursor in text mode, frame sync state update, either a line or block cursor with or without flashing, and a graphics mouse sprite in graphics modes, plus buffer wraparounds for scrolling. Then this should be plenty usable even before other things get added in time, especially HyperRAM buffer support, etc.
Should be useful for others once P2 rev B's with HDMI are released in larger volumes. I'll certainly be using it for my own debug soon.
Unless the tool set includes a method for automatically generating rep length parameters for code using skipf then hard coded is going to be necessary. I use a spreadsheet to write up the code, generate the skip patterns, and calculate the rep length values.
I accept that this isn’t likely to occur much in general code, but for drivers that are trying to squeeze multimodal code into COG/LUT space while keeping as much space free for buffers it seems like it will be necessary.
RDLUT destroying Q is important to know but it doesn't appear to be mentioned in either the spreadsheet or the documentation. So XBYTE also affects Q?
I assume rdlut changes Q because setq+rdlut is a legal combination. I would expect any instruction for which setq is a valid prefix probably sets Q to a default value first.
Ah, yes, that'll be the why. I've completely ignored xbyte discussions but I see the xbyte sequence uses a hidden rdlut. I'm guessing that data needs to be stored close to the ALU/pipeline and Q is it.
So I also suspect xbyte is the only reason Q is used by RDLUT at all.
The Q register is just a parameter for those types of instructions. EDIT: Or more importantly, the modal changes in those instructions are not triggered unless a SETQ is the prefixing instruction.
SETQ also shields interrupts to protect the next instruction.
Beware that interrupt routines may disrupt Q's value. So, it's safest to always do a fresh SETQ.
That's not much of an argument for an object that is running the whole cog - without interrupts. And debuggers can preserve Q using MUXQ and SETQ. And, the optimised, but buggy, example above actually is protected from interrupts anyway. The SETQ is followed immediately by a REP.
Comments
Am I missing something? Shouldn't it be movbyts pa, #%01010000 ?
Does the %% change the interpretation of the immediate?
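On the %% question: in Spin/PASM a %% prefix marks a quaternary (base 4) literal, two bits per digit, so %%1100 and %01010000 are the same value. A quick Python check of the selectors used in this thread (the helper name is just for illustration):

```python
def quaternary(digits):
    """Value of a %%-prefixed (base 4) literal, given its digits as a string."""
    return int(digits, 4)


if __name__ == "__main__":
    for digits in ("1100", "3322", "1010", "3232"):
        print("%%" + digits, "=", "%" + format(quaternary(digits), "08b"))
    # %%1100 = %01010000
    # %%3322 = %11111010
    # %%1010 = %01000100
    # %%3232 = %11101110
```

Each quaternary digit is one byte selector for MOVBYTS, which is why the base 4 form is the more readable way to write these constants.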
        movbyts pa, #%%1100
        mov pb, pa
        shl pb, #4
        and pb, mask
        muxnibs pa, pb
Edit: Doh! No, that won't work for nibbles = 0. Scrap that idea.
Time to change brand of coffee I think!
Untested, but by my calculations each method takes:
'                 single pass    640 output pixel burst
normalpixels      12 clocks      490 clocks
doublebits        22 clocks      658 clocks
doublebytes       18 clocks      498 clocks
doublewords       18 clocks      498 clocks
doublelongs       14 clocks      330 clocks
doublenibbles     30 clocks      980 clocks
Why did that come about, Chip?
Is lutRAM location $1ff real? Strangely, RDLUT reg, ##$1ff doesn't affect Q.
Ha! Ah, no, somehow lutRAM address $1ff had the same data I was using for SETQ
Xbyte will be the reason why RDLUT is different.
Q gets used for data capture in a few ways. I will list them when I am at my computer today.