No, it's the same time and same code space.
The setq is actually 2 instructions for 4 clocks since it's loading a full long - it's a combo AUGS and SETQ.
For graphics this would definitely be a big win. For text, I don't think it's worth sweating over...
Agree. I want this in there for the graphics from HUB, having 320 pixel text is just a bonus for those retro applications given that you are already running a VGA screen.
I plan to hook this driver into accessing HyperRAM memory as well; that's also on the cards now and makes sense for graphics modes specifically. Been thinking about that, and the 320 pixel doubling process is problematic there, as you may need to ask for the line data two scan lines prior if you want to pixel double it. My current processing model computes a line of pixels (including graphics pixel doubling on the fly), then writes it out to hub and gets it ready for output on the next scan line, when it begins the next scan line's contents. A mouse cursor can also get rendered on it later once the whole line is complete. The problem here is there may not be a line of data ready from HyperRAM in time to fully read in and process, unless you can start to work on it no faster than it arrives in memory, but before notification of the transfer being complete. Worst case is that HyperRAM sourced graphics data might not be pixel doubled, or I can look into modifying the sequence to issue the command two scan lines prior if it doesn't fit the existing pipeline. I don't like that latter solution because one day I may wish to support arbitrary screen modes per scanline with a region list, and I won't be set up for the new mode until the next line...
It's easily do-able. As you say, the whole font scanline is only ~64 cycles to read in, and it saves a random read from HUB which is potentially 16 cycles per character; the replacement lookup using altgb is only 4 cycles per character, so with an 80 character wide screen that's a total savings of 80*(16-4) - 64 = 896 cycles per scan line, which is pretty significant.
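For anyone wanting to double-check that arithmetic, it works out like this (a quick Python tally of the figures quoted above; the variable names are just mine):

```python
# Cycle-savings estimate for caching the font scanline in COG RAM.
# Figures from the post: a random HUB read is ~16 cycles per character,
# the ALTGB-based COG RAM lookup is ~4, and preloading the font scanline
# is a one-off ~64 cycles.
hub_read = 16      # cycles per character, random HUB access
altgb_lookup = 4   # cycles per character, COG RAM lookup via ALTGB
preload = 64       # one-off cost to read the font scanline into COG RAM
columns = 80       # characters per text row

savings = columns * (hub_read - altgb_lookup) - preload
print(savings)     # 896 cycles saved per scan line
```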
I thought about it late last night after adding it all up as well, and found those clock savings very tempting. Now, I'd sort of been preserving the upper LUT half for a potential TERC4 table to be put there some day, but looking at it more I found I could compress that 256-entry table down to 16 longs in COG RAM instead and free up the entire upper LUTRAM for a LOT more code! This table reduction ended up only costing about ~180 clocks, which I think is well worth it and still (just) remains within the budget. So I plan to change the layout over time, which will free up more COG RAM. This should allow the font table to fit into 64 longs of COG RAM as you suggested. The bonus is this font buffer can be shared for other uses too.
The only thing I don't like the look of so far, with respect to its overall execution time, is this 64 bit text format with direct RGB24 selection. I think that could have some future limitations, as it would probably take over half an active line to compute everything in 80 column mode; 40 columns might be okay. With pure DVI you can use the entire active scanline plus a bit more, so even that is probably still possible, maybe even with flashing and inverse/underline effects etc. If I can fit it in, it would be quite good to be compatible with your ANSI driver. Perhaps a 16bpp driver can also be done in case that helps... or I could try to use this rgbsqz instruction. Needs some more thought...
I would guess the TMDS encoder just captures the streamer output every 10 clocks. If that is the case, then just run the streamer at sysclock/20 for pixel doubling.
That's my hope too and I'd sort of guess that might be what it would do. Just don't know if it is true yet, or how easy it is to switch over its frequency on the precise clock edge needed if you do it on the fly. Might not be achievable unless it is buffered with the xcont instructions. I'm working on the software doubling approach in the meantime.
Cluso99 and TonyB_.
Yeah all the sneaky little inner loop savings add up and I should certainly use setq with a register.
I hope to work on this a bit more later tonight with some of these new ideas.
An untested revision of the original code that doesn't address pixel doubling but shaves space and a little time:
Interesting use of SKIPF. Yeah, I was considering using some type of skip code in the nibble and two-bit mode pixel doubling; since the code is so similar it should save code space, plus using SKIPF could reduce the additional execution overhead where the code differs. I'm hoping SKIP/SKIPF fully works with REP and could even skip the rep instruction itself if such a sequence is ever desired, but I have never tried using it yet.
A couple of the locals here in Oz have already had HyperRAM working with the P2, and ozpropdev has the new board working I believe. I have a board ready and am getting rather keen to try it out soon myself with my DVI video driver. Ideally a COG could be developed to function as an arbiter for this shared memory, giving a video COG's transfer requests priority and allowing it larger, streaming type memory transfers, whilst also giving other COGs access to this shared RAM at other times and not starving refresh. It's an interesting scheduling problem to solve. I think a video COG could help out by advertising when it expects its next access will likely come after this one (typically one scan line later, but could be more during vertical refresh time), and when it would like the results to be ready, and the HyperRAM COG could use this information to divide any remaining free time up amongst its other COG requests while not starving them out for too long.
If you can provide larger burst sizes to the other non-video COGs you can get reasonable bandwidth, and if you break up the video transfer request into slightly smaller bursts (which helps refresh), you can reduce the latency for small accesses as well if you interleave them in. It's quite a tradeoff to solve... and a weighted deficit round robin or some other sort of scheduler algorithm may allow fairness for non-video COGs. This type of stuff reminds me of my old packet network switching days, LOL. Also in cases where there are only ever two COGs sharing the memory, which is probably going to be quite common, it becomes simpler to solve than if it is all 8 COGs vying for it, so if the HyperRAM driver COG adapts dynamically to its request load and can adjust its strategy accordingly that could be good too.
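As an illustration only, the weighted deficit round robin idea could be sketched like this (the function, queue names and numbers here are all made up, and a real HyperRAM arbiter would of course be PASM, not Python):

```python
from collections import deque

def drr_schedule(queues, quanta, rounds):
    """Deficit round robin over per-COG request queues.
    queues: {cog: deque of pending burst sizes in bytes}
    quanta: {cog: bytes of credit added per round (the weight)}
    Returns the service order as (cog, burst) pairs."""
    deficits = {cog: 0 for cog in queues}
    order = []
    for _ in range(rounds):
        for cog, q in queues.items():
            if not q:
                deficits[cog] = 0          # idle queues don't bank credit
                continue
            deficits[cog] += quanta[cog]   # add this round's quantum
            while q and q[0] <= deficits[cog]:
                burst = q.popleft()        # serve while credit allows
                deficits[cog] -= burst
                order.append((cog, burst))
    return order

# The video COG gets a large quantum (priority), two other COGs share
# smaller quanta but still make steady progress.
queues = {"video": deque([512]), "cog2": deque([64, 64]), "cog3": deque([128])}
order = drr_schedule(queues, {"video": 512, "cog2": 64, "cog3": 64}, rounds=2)
print(order)
```

The weights control fairness: cog3's larger 128-byte burst has to wait one round to accumulate enough deficit, while cog2's small bursts go through immediately, which is exactly the latency/bandwidth trade being described.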
Allowing "reasonably" sized larger bursts to non-video COGs will make this shared memory work better. Just giving single 32 bit long sized memory accesses will be far too slow for video writes to the whole screen, for example. Better to allow transfers on the order of 64-128 bytes or higher, which is on the 1us sort of timescale. A second non-video COG could then be given a handful (like 4-8) of these opportunities per scanline, for example, and hopefully achieve transfer rates in the vicinity of 16MB/s itself while still sharing the memory with a VGA-resolution driver operating at the highest colour depths. Lower colour depths and resolutions would only increase the available bandwidth and access opportunities for other COGs further.
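A quick back-of-envelope check of that 16MB/s figure, assuming a ~31.47kHz VGA-style line rate (my assumption, not stated above):

```python
# Rough throughput for the "handful of bursts per scanline" idea.
# Assumes a 640x480@60-style line rate of ~31.47 kHz.
line_rate = 31_468          # scan lines per second (assumed)
bursts_per_line = 8         # opportunities granted to the second COG
burst_bytes = 64            # bytes per burst

bytes_per_sec = line_rate * bursts_per_line * burst_bytes
print(bytes_per_sec / 1e6)  # ~16.1 MB/s
```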
If all of the video (graphics + text) needs pixel doubling or more, e.g. due to resolution lower than HDMI minimum, simply set PixelRepetition (PR0:PR3) in the AVI InfoFrame.
As far as I can tell from my armchair, one could just use the center 320 pixels and set the InfoFrame to reflect that, which should cause the display to stretch it out for you. Although I wouldn't count on that being supported properly by every display.
I believe even then you still need to send the pixels twice, TonyB_.
Roger, I could be completely wrong on this and I apologise if so. The HDMI spec v1.4 does give the strong impression that Pixel Repetition operates more like Pixel Decimation, where the sink skips pixels repeated by the source ... but ...
page 130 of 197 (actual page 145 of 425 in the PDF) shows that pixel doubling increases the number of audio channels. 640x480p/576p or 720x480p/576p have 800 or 858 pixels/line with the same line frequency. Thus horizontal blanking is 800-640=160 or 858-720=138 pixels, hence 138 clocks at top of page. However, with pixel doubling, the blanking period is 276 clocks, leaving only 800-276=524 or 858-276=582 for video.
I think the way it works is that by transmitting twice as many pixels per line, it increases the audio channel capacity because it also gives a longer horizontal blanking period for sending extra audio packets at a given pixel frequency. The line rate is effectively halved in this case, e.g. 480i or 288p can be sent with 2x pixel repetition, keeping the total pixel rate constant at 27Mpixels/sec. So you would send out 1440 pixels on the line (duplicating an original 720 pixels and indicating this in the info frame), but the line frequency would now become 27MHz/(858*2) = 15.734kHz instead of the usual 31.468kHz, for example. The TV would then only display half of these pixels on the screen because it sees this repetition information in the info frame. This way you can maintain the minimum TMDS link rate of 250MHz. That seems to be at least one of the purposes of it, but perhaps there is more to it...
Thanks for the explanation. Small correction, I think: line frequency is unchanged, but the pixel clock must be increased to get extra audio channels, from 27 to 54 MHz for 2x repetition in this example.
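Both readings can be checked numerically (the 858 clocks/line figure is from the posts above; the rest is just arithmetic):

```python
# Line-rate arithmetic for 2x pixel repetition on an 858 clocks/line
# timing (1716 clocks after doubling).
total_clocks = 858 * 2          # pixel clocks per line with 2x repetition

# Reading 1: keep the pixel clock at 27 MHz -> the line rate halves.
print(27e6 / total_clocks)      # ~15734 Hz

# Reading 2: raise the pixel clock to 54 MHz -> line rate is unchanged.
print(54e6 / total_clocks)      # ~31468 Hz, same as 27e6 / 858
```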
Monitors won't go below the arbitrary bottom limit of about 30 kHz line rate under any circumstances. It's what defines the minimum DVI clock rate. TVs do, but only for very specific modes. It's all a tad sucky really.
I worked it into the streamer. It takes the upper 3 bytes of the output long for R/G/B and the lower byte is used for command codes. In lieu of the normal 32 bits, it outputs {24'b0, r, !r, g, !g, b, !b, clk, !clk}. You have to pace your pixels at 1/10th the chip clock using 'SETXFRQ ##$0CCCCCCC'. To enable HDMI encoding: 'SETCMOD #$80'. It's maybe only good for 640x480 @60Hz, but gets us an HDMI output that will work with modern displays.
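If the streamer's NCO treats $8000_0000 as one pixel per sysclock, which is my understanding of it, then the $0CCCCCCC above is simply the nearest value to one tenth of that:

```python
# SETXFRQ ##$0CCCCCCC pacing check, assuming $8000_0000 corresponds to
# one streamer output per system clock (my reading of the docs).
full_rate = 0x8000_0000
tenth = full_rate // 10
print(hex(tenth))   # 0xccccccc, i.e. the $0CCCCCCC used for sysclock/10
```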
On P1 I don't think there is any way to reset the chroma phase. So I would think that there is no way to reset the TMDS phase on the streamer because there is no need for it. As long as the streamer runs at sysclock/10, they will be fine. It might add some delay as the streamer may output data to the TMDS encoder for several clocks until the encoder needs it.
For software pixel doubling, I think it would be best to duplicate the 1 bpp data from the font instead of the 4bpp data from the color encoder.
pixelbyte in: %00_01_11_10
dpix out: %xxxx_xxxx_0000_0011_xxxx_xxxx_1111_1100
The goal was to duplicate the bits in pixelbyte. Looks like it worked. You did ror instead of shr, but not a problem since it only affects the don't care bits.
Yes I have had some success with combining ideas from various smart people here including ozpropdev, AJL, ersmith and Saucy, etc.
I've managed to put together the following code block for 16 colour text that shares the same code loop for both 40 and 80 column modes, and it fits the budget I wanted. Just tried it and it works nicely. 80 column mode falls through at the end into the excess instructions of the longer 40 column mode loop, but I ensure only temporary variables are affected. A conditional jump at the beginning of the 40 column part would fix that, but it adds another long and 80 more clocks to the 40 column code.
        setq    #40-1               'get next 80 chars with colours
        rdlong  scratch, ptra       'read 40 longs from HUB into COGRAM
        testb   modedata, #0 wz     'flashing / high intensity bg test
if_nz   mov     test1, instr1
if_z    mov     test1, #0
        testb   modedata, #2 wz     '40/80 column mode test
        mov     a, #0 wc
textloops
if_z    rep     #25, #40            '2000 clocks for 40 column mode
if_nz   rep     #16, #80            '2560 clocks for 80 column mode
        altgw   a, #scratch
        getword first, 0-0, #0      'get next character and colours
        getbyte fontaddr, first, #0 'determine font lookup address
        altgb   fontaddr, #font     'extract font offset
        getbyte pixels, 0-0, #0     'get font information for character
test1   bitl    first, #15 wcz      'test (and clear) flashing bit
if_c    and     pixels, flash       'make it all background if flashing
        movbyts first, #%01010101   'becomes BF_BF_BF_BF
        mov     b, first            'grab a copy
        rol     b, #4               'b becomes FB_FB_FB_FB
        setq    fgbg_mask           'mask is #$F0FF000F
        muxq    first, b            'first becomes FF_FB_BF_BB
        testb   modedata, #2 wz     'repeat test again, z gets trashed
if_nz   movbyts first, pixels       'select pixel colours
if_nz   wrlut   first, ptrb++       'instrs. skipped if pixel doubled
        add     a, #1               'repeat or fall through for 40 cols
        setword pixels, pixels, #1  'replicate words
        mergew  pixels              'double pixels
        mov     b, first            'save a copy before we lose colours
        movbyts first, pixels       'compute 4 lower colours of char
if_z    wrlut   first, ptrb++       'save it
        ror     pixels, #8          'get upper pixels
        movbyts b, pixels           'compute 4 higher colours of char
if_z    wrlut   b, ptrb++           'save it
endloop
....
scratch long    0[40]
font    long    0[64]
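To convince myself the setword/mergew pair in that loop really doubles the font bits, here's a small Python model; the MERGEW bit ordering (even result bits from the low word, odd bits from the high word) is my reading of the instruction docs:

```python
def mergew(d):
    """Model of P2 MERGEW: interleave the low word's bits into the even
    result positions and the high word's bits into the odd positions."""
    lo, hi = d & 0xFFFF, (d >> 16) & 0xFFFF
    out = 0
    for i in range(16):
        out |= ((lo >> i) & 1) << (2 * i)
        out |= ((hi >> i) & 1) << (2 * i + 1)
    return out

def double_pixels(pix16):
    """The 'setword pixels,pixels,#1' + 'mergew pixels' sequence:
    copy the low word into the high word, then interleave, so every
    font bit comes out twice in a row."""
    d = (pix16 & 0xFFFF) | ((pix16 & 0xFFFF) << 16)
    return mergew(d)

print(bin(double_pixels(0b10110001)))   # each 1bpp font bit doubled
```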
Weufel_21, I knew something like that was going to be possible with those instructions. I think I can leverage it for the nibble replication too. Been thinking about it all night... my original twit and nibble multipliers are the only ugly slow routines left until I fix them.
Update: There's another variant of the above that works with a LUTRAM buffer, in case I don't have enough COGRAM for holding the font and the characters. It only adds one clock cycle per loop iteration, because RDLUT takes 3 clocks. You save on the counter and altgw, but need to do a rotate and xor after the RDLUT, so overall it adds 1 clock, which I suspect is still okay.
Yes, I did ror instead of shr. My mistake.
I see, if you throw away bytes 3 and 1 it produces the result you are after, as the full result is actually
dpix out: %0000_0000_0000_0011_1111_0011_1111_1100
But, of course there's the mergew, mul sequence that is best :-)
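For the record, my understanding of that mergew, mul sequence (the actual snippet isn't quoted here, so treat this as a reconstruction): with a zero high word, mergew spreads the 16 pixel bits into the even bit positions, and a mul by %11 then copies each spread bit into the adjacent odd position. Modelled in Python:

```python
def mergew(d):
    """Model of P2 MERGEW: even result bits come from the low word,
    odd result bits from the high word."""
    lo, hi = d & 0xFFFF, (d >> 16) & 0xFFFF
    out = 0
    for i in range(16):
        out |= ((lo >> i) & 1) << (2 * i)
        out |= ((hi >> i) & 1) << (2 * i + 1)
    return out

def double_16(pix16):
    """mergew + 'mul #3': needs the high word zero, so mergew leaves
    zeros in the odd positions, and *3 (= *%11) fills each odd position
    with a copy of its even neighbour."""
    return (mergew(pix16 & 0xFFFF) * 3) & 0xFFFFFFFF

print(hex(double_16(0xFFFF)))
```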
We need to find a way to replicate nibbles from a starting word like 0000DCBA into the pattern DDCCBBAA, where "A"-"D" here represent the 4 bit nibbles in a 32 bit long. I still can't quite see it yet, but I'm almost 100% convinced it is doable with some clever use of mergeb/mergew/splitb/splitw/setbyts/ror/or/mul etc. My own way still takes 8 instructions to do this, which is slow.
Yes Lachlan, that's sort of what I've been trying to do. I must be missing something, but I think his method is good for pixel doubling, not for getting to 0D0C0B0A, which is required first. You can then do the multiply trick as you suggest to get the nibbles doubled.
I've looked at what patterns can be reached using split/merge etc on the original data, and what can be used in the final step with split/merge/mul etc, but I can't see a fast way (yet) that they can be connected via intermediate instructions. I've convinced myself it must be doable somehow.
I can do it in 8 instructions, but ideally if it can be done in less than this, that would be great.
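To pin down exactly what transform is being asked for, here's a Python reference version: a slow shift-and-mask spread to 0D0C0B0A (the part still lacking a fast P2 sequence), followed by the *$11 multiply trick mentioned above:

```python
def spread_nibbles(x):
    """Reference (slow) spread: 0x0000DCBA -> 0x0D0C0B0A.
    This is the step the thread wants a fast P2 sequence for; plain
    shift-and-mask is used here only to define the transform."""
    out = 0
    for i in range(4):
        out |= ((x >> (4 * i)) & 0xF) << (8 * i)
    return out

def double_nibbles(x):
    """Once spread to 0x0D0C0B0A, the multiply trick finishes the job:
    *0x11 adds the value to itself shifted left one nibble, giving
    0xDDCCBBAA with no carries between the 4-bit fields."""
    return (spread_nibbles(x) * 0x11) & 0xFFFFFFFF

print(hex(double_nibbles(0xDCBA)))   # 0xddccbbaa
```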
@cgracey posted the documentation and instruction sheet links in the first post of the New P2 silicon thread
EDIT: Found it in the list of links that Saucy put up. Undocumented CMOD bit 8 enables it.
Do LCD-based DVI/HDMI monitors still enforce this limit? (It was/is a physical constraint of VGA CRT monitors.)
That sucks indeed.
Looks like the CMOD setting changed.
That movbyts makes a clever little look-up table.
I think you need to review how movbyts works; you seem to have misunderstood its function (perhaps you've got the operands backwards).
Only the low 8 bits of the source register (or immediate value) are used as a series of 'twits' to control how the bytes of D are selected.
The format of the instruction is MOVBYTS D,{#}S.
Giving a pattern for pixelbyte to work through your first snippet as an example - the %00_01_11_10 case shown earlier.
Or am I having a brain fart? (I don't have a P2, so this is ofc just theory from looking at the opcode sheet)
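No brain fart as far as I can tell; a little Python model of MOVBYTS per that description (also just my reading of the opcode sheet) behaves as described:

```python
def movbyts(d, s):
    """Model of P2 MOVBYTS D,{#}S: the low 8 bits of S form four 2-bit
    selectors, and result byte n is byte S[2n+1:2n] of the original D."""
    out = 0
    for n in range(4):
        sel = (s >> (2 * n)) & 0b11            # 2-bit selector for byte n
        out |= ((d >> (8 * sel)) & 0xFF) << (8 * n)
    return out

# %01010101 selects byte 1 of D for every result byte, which is how the
# text loop splats the single colour byte across the whole long.
print(hex(movbyts(0xDDCCBBAA, 0b01010101)))    # 0xbbbbbbbb
```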
Winner! Looks good.
It wasn't a physical constraint of CRTs either. That was a marketing myth. It has always been arbitrary.
This works for up to 16 pixels provided the high word is zero. If not:
Copied from:
http://forums.parallax.com/discussion/comment/1409553/#Comment_1409553
Five is better than the 8 I had before. Will code this up.