1080p tile driver (P1 style)
Rayman
Posts: 14,662
Was thinking about how to do 1080p video and it occurred to me that the P1 tile driver might be good for that.
Tested the idea out today, it seems to work (see screenshot).
Started out with 2 cogs, one for the video output and one to build the video line data.
If your P2 clock is below 250 MHz, another cog would be needed to help build the video line data.
Update: New color scheme seems to work. Now able to replicate the P1 "VGA_demo" (see new screenshot below).
Driver is using the actual P1 ROM font to draw the characters and buttons.
Update2: Added "Graphics" from P1 and can run "Graphics_Demo"
Update3: Code now attached here.
Driver Details:
Using 3 cogs:
One is the VGA driver that outputs 1080p in 2bpp LUT mode from a 1920x16 pixel buffer (one row of tiles). The LUT offset (%bbbb) cycles through 0 to 7.
Second cog populates the 1920x16 pixel buffer based on the 120x68 tile pointer array and ATN signals from the VGA driver.
Third cog dynamically updates the tile colors at LUT offsets (%bbbb) 0 through 7, guided by ATN signals from the VGA driver, based on the palette, which is reloaded every frame.
There are 5 arrays needed by this driver, requiring a total of 88.5 kB (out of 512 kB available in P2):
One is the pixel output buffer (7.5 kB):
TileBuffer1 'pixel output buffer. 1920 of 2bpp tile data (16 lines)
        long    $FFFFFF00[1920*16/16]
Another is the tile pointer array (32 kB):
Tiles   '1920x1080 in 16x16 tiles
        '20 bits for tile address, upper 12 bits free
        long    $0[120*68]
Another is the tile color pointer array (32 kB):
TileColors 'One long for each tile, picks four colors from 256 color palette. Note: bytes are read little endian
        long    $01_00_01_00[120*68]
Then, there's the 256 color, 24-bit palette (1 kB):
palette '256 element array of 24-bit colors (Note: Lower 8-bits are (must be?) zero).
        long    0[256]
Finally, the P1 ROM Font (16 kB):
P1RomFont
        file    "p1_font.dat"
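For anyone checking the memory budget, here is a small host-side Python sketch of the arithmetic. It is not driver code. The array names come from the listing above; the 16 kB font size is taken from the post and assumed to be the exact size of p1_font.dat.

```python
# Host-side arithmetic only; this is not driver code.
TILE = 16                       # tile edge in pixels
WIDTH, HEIGHT = 1920, 1080
BPP = 2                         # bits per pixel

cols = WIDTH // TILE            # 120 tile columns
rows = -(-HEIGHT // TILE)       # ceiling division: 67.5 rounds up to 68 rows

sizes = {
    "TileBuffer1": WIDTH * TILE * BPP // 8,   # one row of tiles at 2 bpp
    "Tiles":       cols * rows * 4,           # one long per tile
    "TileColors":  cols * rows * 4,           # one long per tile
    "palette":     256 * 4,                   # one long per palette entry
    "P1RomFont":   16 * 1024,                 # ASSUMPTION: p1_font.dat is exactly 16 kB
}
total = sum(sizes.values())

for name, n in sizes.items():
    print(f"{name:12s} {n:6d} bytes ({n / 1024:.3f} kB)")
print(f"{'total':12s} {total:6d} bytes ({total / 1024:.3f} kB)")
```

Each 120x68 array is 32,640 bytes, a little under a full 32 kB, so the exact total lands slightly below the rounded 88.5 kB figure.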
Comments
The screen contents are described by an array of pointers to 16x16x2bpp tiles.
Each tile also has four colors assigned to it (2 bpp).
So, it can be text, if you point to the ROM_Font, for example. Or, it can also show 2 bpp graphics.
You can also combine these...
On the P1, each tile was defined by a word. The lower 10 bits defined the base address of the tile data, after being left shifted by 6. A left shift of 6 corresponds to 64 bytes, the number of bytes in a 16x16x2bpp tile.
The upper 6 bits of the tile data word are an index to a set of 4 colors for the tile.
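The P1 tile word layout described above can be sketched in Python like this. The function name is made up for illustration, and this only models the bit fields as described in this post, not the actual P1 driver source.

```python
BYTES_PER_TILE = 16 * 16 * 2 // 8        # 64 bytes in a 16x16x2bpp tile

def decode_p1_tile_word(word):
    """Split a 16-bit P1 tile word into (tile_address, color_set_index).

    Hypothetical helper: models the layout described in the post.
    """
    word &= 0xFFFF
    tile_address = (word & 0x3FF) << 6   # lower 10 bits, times 64 bytes per tile
    color_index = word >> 10             # upper 6 bits pick the color set
    return tile_address, color_index

# Example: color set 5, tile number $200
addr, color = decode_p1_tile_word((5 << 10) | 0x200)
print(hex(addr), color)
```

Because of the shift by 6, tile data always sits on a 64-byte boundary and the 10-bit field can reach any tile in a 64 kB address space.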
The P1 ROM font is 16x32, so is 1 tile wide and 2 tiles tall.
The font tile data combines two 1bpp characters in the same 2bpp tile.
You need to format the color set in a special way to select which character to display.
For example, the characters "B" and "C" are combined into the same tile data in the p1 rom font and so would have the same lower 10 bits of tile data.
The upper 6 bits of color data is used to pick which is displayed.
Each color set is one long with 1 byte for each of the 4 colors. The usual usage was 2 bits each for red, green, blue, in this order with the lower two bits being unused.
(Actually, when applied to the pins, the lower two bits are where H and V sync signals go).
So, to display character "B", you would pick a color set with the format:
Foreground, background, foreground, background
To pick "C", you use a color set with the format:
foreground, foreground, background, background
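Here is a rough Python model of why those two color sets pick one character or the other out of the same tile. Two things are assumptions on my part: that the color sets are listed from pixel value 3 down to pixel value 0, and that "B" lives in bit 0 of each 2-bit pixel with "C" in bit 1. The helper names are invented for the sketch.

```python
FG, BG = "fg", "bg"

def shown_color(color_set, pixel):
    """Look up a 2-bit pixel value in a color set.

    ASSUMPTION: color_set is listed from pixel value 3 down to 0.
    """
    return color_set[3 - pixel]

SET_B = [FG, BG, FG, BG]   # format given above for "B"
SET_C = [FG, FG, BG, BG]   # format given above for "C"

def merge(bit_b, bit_c):
    """ASSUMPTION: "B" bit plane in bit 0, "C" bit plane in bit 1."""
    return (bit_c << 1) | bit_b

# Walk all four combinations of the two 1bpp font bits.
for bit_b in (0, 1):
    for bit_c in (0, 1):
        p = merge(bit_b, bit_c)
        print(bit_b, bit_c, shown_color(SET_B, p), shown_color(SET_C, p))
```

With the first set, the output only follows the "B" bit and ignores the other one; with the second set it only follows the "C" bit. That is the whole trick: the color set makes one of the two bit planes invisible.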
Some of the tiles in the ROM font are intended to be used as 2bpp data (instead of merged 1bpp characters), such as edges of buttons.
As Chip mentions below...
The P1 could do 1600x1200 VGA in the way described above, but used up 6 cogs and about half of RAM.
This was not very practical for some applications.
Here, the plan is to do 1920x1080 using 3 cogs and ~64 kB of RAM (out of 512 kB).
This leaves a lot of RAM and cogs for other things...
The driver currently uses one long for every tile. I think I'm going to use the lower 20 bits as pointer to tile address in hub and the upper 12 bits for the color.
There are actually a lot of different ways to do the color.
One way is to use the 12-bits as an index to up to 4096 sets of four, 24-bit colors.
That'd be 64kB of colors if you actually needed all 4096, but I think much less than that is needed in practice.
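A quick Python sketch of that 20-bit/12-bit split and the size of a full color table. The function names are hypothetical, and this only models the layout being considered here, assuming each 24-bit color is stored in its own long.

```python
ADDR_BITS = 20
ADDR_MASK = (1 << ADDR_BITS) - 1      # $FFFFF covers the 20-bit hub address field

def pack_tile(address, color_set):
    """Combine a 20-bit tile address and a 12-bit color set index into one long."""
    assert 0 <= address <= ADDR_MASK
    assert 0 <= color_set < 4096
    return (color_set << ADDR_BITS) | address

def unpack_tile(value):
    """Return (address, color_set) from a packed tile long."""
    return value & ADDR_MASK, (value >> ADDR_BITS) & 0xFFF

# ASSUMPTION: each 24-bit color occupies one long (4 bytes).
COLOR_TABLE_BYTES = 4096 * 4 * 4      # 4096 sets x 4 colors x 4 bytes

print(hex(pack_tile(0x12345, 0xABC)), COLOR_TABLE_BYTES // 1024, "kB")
```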
I always found this strange. I think it makes much more sense for the top bits to contain the tile address, so one can just mask off all the other stuff. Indeed, the text modes in JET Engine work like that, both with the tile address and the color index (there's only 16 colors, but tile deinterleaving is handled by the driver, with bit 0 selecting whether you want the odd or even tile). In the end, there isn't any advantage to either method in terms of driver performance, on P1, that is. On P2, RDLONG doesn't mask off the high 16 bits (which would contain the color bits after shifting the address into place), so "my" way is probably a bit more efficient.
I vaguely remember a discussion about this...
But, don't see anything in documentation about it...
Actually, I guess this passage implies that only the lower 20-bits are used:
BTW: I agree that the P1 driver could use ANDs instead of shifts and maybe be clearer...
So, instead, I'm using a new array of one long per tile. Each byte in this long represents one of 256 24-bit color values stored in the COG RAM of the color cog.
This is a bit more limiting, but still pretty good and vastly superior to the P1 way.
Anyway, screen is now defined by two 32kB arrays, one for tile pointers and one for color info.
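To make the color lookup concrete, here is a small Python model of one TileColors long being resolved against the palette. The helper names are invented, and it assumes that pixel value n uses byte n of the long (lowest byte first), which matches the "little endian" note in the array comment but is not confirmed by the driver source.

```python
def tile_color_indices(tile_color_long):
    """Return the four palette indices, lowest byte first (little endian)."""
    return [(tile_color_long >> (8 * i)) & 0xFF for i in range(4)]

def tile_colors(tile_color_long, palette):
    """Resolve the four indices to 24-bit colors from a 256-entry palette."""
    return [palette[i] for i in tile_color_indices(tile_color_long)]

# Palette entries keep the 24-bit color in the upper bits, low byte zero.
palette = [0] * 256
palette[1] = 0xFFFFFF00              # entry 1 = white, entry 0 stays black

# $01_00_01_00 is the initial value used in the TileColors array.
print([hex(c) for c in tile_colors(0x01_00_01_00, palette)])
```

Two 120x68 arrays of longs (pointers plus colors) is where the two roughly 32 kB blocks come from.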
Now, it reloads the whole 256 color, 24-bit palette during vertical blanking.
The P1 could only load 15 out of 64, 8-bit colors...
So far, looks like this approach will work out nicely...
If you want to try it, just run the attached binary with A/V board connected to P0 header on eval board.
BTW: It's interesting that it looks steady with P2 clock set to 250 MHz and fpix = 148_500_000.0
I thought I would need P2 clock set to fpix*2 to get a steady image...
Yeah, with the 480-line modes one can get on P1 with a single cog, the font is just waaaay huge, even if you squeeze it down to 16 lines... An 8x8 font would have probably been more useful to have in ROM...
I love the P1 font. I was actually quite sad about not having it in ROM. I've been meaning to dump it to SD and experiment on the P2 but haven't got around to it. So many irons and so little fire these days.
But, at 1080P, it's just about right for something like a control panel.
There might be a case for making an 8x16 tile driver instead.
But, I'm going to see where this one takes me...
It's recognizable, but still needs work:
'Byte jump table moved into hubexec as long jump table as didn't see how to fix it...
'MINS-->FGES
'MAXS-->FLES
'MOVS-->SETS
':-->. (many places)
'SETD-->SETD2 'SETD is an instruction in P2
'cmps wc,wr --> subs wc
'command := cmd << 16 + argptr --> command := cmd << 24 + argptr
'Using QROTATE to fix missing sin table
'Used MUL to speed multiply
Maybe the main thing now is that it is using a sin table that doesn't actually exist...
Think I can change it to QROTATE though...
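The idea of replacing a sine table lookup with a rotation can be modeled in Python as below. This is a floating-point stand-in, not the CORDIC hardware: it assumes a 32-bit angle where 2^32 is a full turn, and the $FFFF amplitude is just an illustrative choice. Both function names are made up.

```python
import math

def qrotate_model(x, y, angle32):
    """Rotate (x, y) by angle32 / 2**32 of a full turn.

    Floating-point stand-in for a CORDIC rotate; results are rounded to integers.
    """
    theta = 2 * math.pi * (angle32 & 0xFFFFFFFF) / 2**32
    c, s = math.cos(theta), math.sin(theta)
    return round(x * c - y * s), round(x * s + y * c)

def sin_scaled(angle32, amplitude=0xFFFF):
    """Sine via rotation: rotating (amplitude, 0) leaves amplitude*sin(angle) in Y."""
    return qrotate_model(amplitude, 0, angle32)[1]

for turn in (0x00000000, 0x40000000, 0x80000000, 0xC0000000):
    print(hex(turn), sin_scaled(turn))
```

The X result of the same rotation is the cosine, so one rotate covers both lookups the table used to provide.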
Here it is with no added delay in the loop.
Much faster than P1:
BTW: 1080p seems to work perfectly on any recent monitor, assuming it has VGA input.
A lot of the ones I have (like in this video) are old though with 4:3 layout and 1080p doesn't really work... It cuts off the part of the screen outside the 4:3 format.
Some others work better, but skip some of the 1920 vertical lines. This one does this too...
Seems fine, especially on newer monitors.
https://www.amazon.com/gp/product/B0823DDQ91
It actually works!
And now, it's clear that there is an issue with 250 MHz clock.
Although the image is stable, I can see that some vertical lines appear to be missing. It's probably doubling some vertical lines and dropping others.
At 297 MHz, it looks perfect.
I'll test some more, but I think we'll just have to run at 297 or 297/2....
I wonder how safe it is to run at 297... I guess it's fine at room temperature.
I currently do the visible line like this, with an xzero at the start:
I think that depends on what 'steady' means.
If the P2 NCO generator resyncs every line, the timing columns in each line will line up within 1 sysclk from line to line. Across the line, though, the pixels will not be pixel-perfect: each will sit up to half a sysclk away from its ideal placement (due to NCO jitter).
That's expected, I think, as you will have sampling issues at non-standard clock multiples, so something has to give.
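The jitter can be seen in a very simplified Python model of an NCO. This is an idealized integer accumulator, not the actual P2 streamer: it adds the pixel frequency every sysclk and emits a pixel whenever the sum passes the system frequency. The function name is hypothetical.

```python
def sysclks_per_pixel(f_sys, f_pix, n_pixels=1000):
    """Idealized NCO: return the sysclk count between consecutive pixel ticks."""
    acc, clk, last_tick, gaps = 0, 0, None, []
    while len(gaps) < n_pixels:
        clk += 1
        acc += f_pix
        if acc >= f_sys:              # accumulator overflow = next pixel
            acc -= f_sys
            if last_tick is not None:
                gaps.append(clk - last_tick)
            last_tick = clk
    return gaps

F_PIX = 148_500_000                   # 1080p60 pixel clock from this thread
for f_sys in (250_000_000, 297_000_000):
    gaps = sysclks_per_pixel(f_sys, F_PIX)
    print(f_sys, sorted(set(gaps)), sum(gaps) / len(gaps))
```

In this model, 250 MHz gives a mix of 1-sysclk and 2-sysclk pixels, so pixel widths are uneven even though every line repeats the same pattern. At 297 MHz every pixel is exactly 2 sysclks wide.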
For someone using larger fonts, this may be tolerable.
Overclocking to 297MHz is likely to be optimistic for volume production, as we still do not have a handle on how much wafer lots will vary with P2.
The next chips will give a second sample, so may have some indication.
See here where it looks pretty close to original:
I added a delay to slow it down to somewhere close to the original.
This monitor is interesting... It's old and doesn't know 1080p. I think all newer monitors will recognize 1080p...
It only has 1680 horizontal pixels and yet looks good with 1920x1080 with P2 clock at 297 MHz.
BTW: I have to thank rogloh for showing me how to do 297 MHz. I didn't realize you could set xdiv to 20 and then xmul to 297...
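As a sanity check on those PLL numbers, here is the arithmetic in Python. It assumes a 20 MHz crystal and a final post-divider of 1; neither is stated in this thread, and the function name is invented for the sketch.

```python
def pll_hz(xtal_hz, xdiv, xmul, post_div=1):
    """PLL output = crystal / xdiv * xmul / post_div, done in integer math."""
    return xtal_hz * xmul // (xdiv * post_div)

XTAL = 20_000_000                     # ASSUMPTION: 20 MHz crystal
F_PIX = 148_500_000                   # 1080p60 pixel clock from this thread

f_sys = pll_hz(XTAL, xdiv=20, xmul=297)
print(f_sys, f_sys / F_PIX)           # system clock and sysclks per pixel
```

Dividing the crystal down first is what makes an odd target like 297 reachable: the multiplier then works in 1 MHz steps under the 20 MHz assumption.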