Bufferless text video mode?

Rayman · 2017-02-07 14:53

Was just thinking about a text mode that was like P1, only with arbitrary colors.

Seems simple enough, just bring in say the Parallax font as 1 bit per pixel array of words.
Use a screen array with one byte per character position on screen for character #.
Use two longs per character position for foreground and background color.
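A rough software model of the layout described above (not P2 code) may make the data flow concrete: a 1 bpp font, one byte per cell for the character number, and a foreground/background pair per cell. The 8-pixel glyph width, the MSB-first bit order, the made-up glyph, and all names here are assumptions for illustration only.

```python
# Software model only; names and the 8-pixel glyph width are illustrative.
GLYPH_W = 8

def render_scanline(screen_row, colors_row, font, glyph_row):
    """Expand one scanline of a text row into 24-bit pixels.

    screen_row: character numbers, one per column
    colors_row: (foreground, background) 24-bit pairs, one per column
    font:       font[char] is a list of 1 bpp row bitmaps (MSB = leftmost pixel)
    """
    pixels = []
    for ch, (fg, bg) in zip(screen_row, colors_row):
        bits = font[ch][glyph_row]
        for x in range(GLYPH_W):
            lit = (bits >> (GLYPH_W - 1 - x)) & 1
            pixels.append(fg if lit else bg)
    return pixels

font = {65: [0b00011000, 0b00100100]}   # two rows of a made-up glyph
line = render_scanline([65], [(0xFFFFFF, 0x000080)], font, 0)
```

The open question in the thread is how to get the streamer hardware to do the `fg if lit else bg` selection with colors that change every cell.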

But, how can you change colors between characters?
The streamer would be in 1 bpp LUT mode.
How do you arrange to change the first two LUT entries between characters?

Is there a way?

Electrodude · 2017-02-07 16:12

What if you store two full-color scanlines in hubram: the one the streamer is currently outputting, and the one you're currently generating? That way, you can easily do things besides just text, but without the memory overhead of a full video buffer. I guess you'd unfortunately have to build the scanlines manually then, but it would make fonts with widths that aren't powers of 2, and even variable-width fonts, possible and easy.

I need to get an FPGA. I've been wanting to try scanline graphics on the P2 for a long time now.

Rayman · 2017-02-07 16:43

I guess that would work... One cog can make 24-bit per pixel scanlines in HUB RAM and another cog can output them.

That way could also be adapted to do arbitrary graphics without a buffer, like we talked about in a different thread.

What we'd need to do it the easy way is some kind of XLUTOFF command that lets the streamer work from some LUT offset instead of LUT address 0... If that were there, then LUT could hold 256 pairs of foreground and background colors... Would allow up to 256 columns of text with arbitrary colors...

Electrodude · 2017-02-07 17:03

So, you want a streamer LUT offset counter? Every configurable number of pixels, it increments by 2 for 1 bpp mode, by 4 for 2 bpp mode, by 16 for 4 bpp mode, and so on. When it reaches a configurable end value, it automatically resets to a configurable start value. Maybe XCONT and XINIT should also reset it to the start value.
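A small behavioral model of the counter proposed above, as a sketch only. No such hardware exists in the thread; the class name, the parameters, and the choice to treat the end value as exclusive are assumptions.

```python
class LutOffsetCounter:
    """Model of the proposed (hypothetical) streamer LUT offset counter."""

    def __init__(self, bpp, pixels_per_step, start, end):
        self.step = 1 << bpp              # 2 for 1 bpp, 4 for 2 bpp, 16 for 4 bpp
        self.pixels_per_step = pixels_per_step
        self.start, self.end = start, end
        self.reset()

    def reset(self):
        """What XINIT/XCONT might do under this proposal."""
        self.offset = self.start
        self.count = 0

    def lut_address(self, pixel_value):
        """Return the LUT address for the current pixel, then advance."""
        addr = self.offset + pixel_value
        self.count += 1
        if self.count == self.pixels_per_step:
            self.count = 0
            self.offset += self.step
            if self.offset >= self.end:   # end treated as exclusive (assumption)
                self.offset = self.start
        return addr

c = LutOffsetCounter(bpp=1, pixels_per_step=8, start=0, end=4)
addrs = [c.lut_address(1) for _ in range(24)]
```

With two color pairs loaded (LUT entries 0..3), a foreground pixel reads entry 1 for the first cell, entry 3 for the second, and wraps back to entry 1 for the third.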

Right?

jmg · 2017-02-07 20:02

Rayman wrote: »

..

But, how can you change colors between characters?
The streamer would be in 1 bpp LUT mode.
How do you arrange to change the first two LUT entries between characters?

Is there a way?

I think the problem here is information bandwidth.
1 bpp is the lowest possible bit-rate per character, but if you want to change colours on a truly random FG/BG basis on any char boundary, you now need to move 24+24 bits every 8 pixel clocks.
ie, on average, you are now at 6 bpp for the colours alone (7 bpp counting the 1 bpp pixel data)
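The arithmetic behind that figure, written out as a quick check. The function name is illustrative; the 16-pixel case is added only for comparison with the wider cells discussed later in the thread.

```python
def average_bpp(cell_width, color_bits=24, glyph_bpp=1):
    """Average bits per pixel when FG and BG colors change on every cell."""
    return 2 * color_bits / cell_width + glyph_bpp

bpp_8_wide = average_bpp(8)     # 48 color bits over 8 pixels, plus 1 bpp glyph data
bpp_16_wide = average_bpp(16)   # same colors spread over a 16-pixel cell
```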

Of course, real systems do not need fully random colours, so you may be able to do some compression by using a table of used colours, per line group.
This becomes a form of scan-line buffering, only you do not fill-in at the pixel level, you fill-in at the palette index level.

ozpropdev · 2017-02-08 10:29

Back in the P2_Hot phase I made a text driver like you are suggesting.
http://forums.parallax.com/discussion/153913/p2-vga-text-driver

I know I had a bit of a battle building the scan lines in time for the video hardware.
I remember at the time that a pixel conversion instruction would have been nice.
Since P2_Hot had twice the performance of the current P2 it's now even trickier.
In the current P2 I have made text on the fly but only in 2 colors. (See Invaders 2.0 & the even earlier PST style text driver)

Here's the Verilog code for the pixel convert idea.


// PIXCONV D

// Convert 16 bit pixel/color data to 32 bit (8 * 4 bit color pattern).
// To be used with 4 bit RFLONG LUT streamer mode.

// d[7:0] = 8 bit pixel data
// d[15:12] = 4 bit background color index
// d[11:8] = 4 bit foreground color index

//e.g
// d = %1001_0110_01010111
// returns
// d = % 1001_0110_1001_0110_1001_0110_0110_0110


reg [15:0] d_in;
wire [31:0] d_out;		// driven by a continuous assign, so it must be a wire

wire [3:0] bc = d_in[15:12];	//background color
wire [3:0] fc = d_in[11:8];	//foreground color


wire [31:0] pcd = {d_in[7] ? fc : bc,
		d_in[6] ?  fc : bc,
		d_in[5] ?  fc : bc,
		d_in[4] ?  fc : bc,
		d_in[3] ?  fc : bc,
		d_in[2] ?  fc : bc,
		d_in[1] ?  fc : bc,
		d_in[0] ?  fc : bc};

assign d_out = pcd;		//converted pixel color data

Anyhow it was just an idea based on my experiences so far using/creating text stuff.
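For anyone without a Verilog simulator handy, the same conversion can be modeled in a few lines of Python and checked against the example in the comments above. This is a sketch of the proposed PIXCONV idea, not an existing P2 instruction.

```python
def pixconv(d):
    """Model of the proposed PIXCONV: 16-bit input -> eight 4-bit color indices."""
    bc = (d >> 12) & 0xF      # background color index, d[15:12]
    fc = (d >> 8) & 0xF       # foreground color index, d[11:8]
    out = 0
    for bit in range(7, -1, -1):          # d[7] lands in the top nibble
        out = (out << 4) | (fc if (d >> bit) & 1 else bc)
    return out

result = pixconv(0b1001_0110_01010111)
```

Here `result` equals `0b1001_0110_1001_0110_1001_0110_0110_0110`, matching the worked example in the Verilog comments.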

Ariba · 2017-02-08 12:02

I have posted such a driver here (code a few posts later)
It's NTSC with 80 columns and a color per character. For VGA timing, I got a maximum of 40 columns. But with the higher clock frequency of the real P2, 80 chars per line should be possible.

It works with the Instruction set of Nov 2015, I have not updated my P2 FPGA since then.

Andy

Rayman · 2017-02-08 16:33

Thanks Andy. If I see it right, that is 1bpp streamer mode with font in cog RAM.

What I'd like to do is reproduce the P1 XGA (1024x768) text mode.
But, with arbitrary 24-bit foreground and background colors per 16x16 pixel cell.

Maybe use two cogs to alternate between each 16 pixel tall character row.
Within that row all the colors and characters are the same, so maybe can be fast.

I think two 16-pixel-tall row buffers for XGA come in at ~100 kB, so not too bad.
Pixel clock is 65 MHz. With two cogs, that's 32.5 MHz each. We're at 40 MIPS per cog, so that's pretty tight. There is 30% blank time on horizontal line, so maybe can be done with tight rep loop? Anyway, I think 4 cogs might be able to get it done.
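A back-of-envelope check of those numbers. The 1344-pixel total line length is the standard VESA 1024x768@60 horizontal timing; the packed 3 bytes per pixel is an assumption (storing one long per pixel would need 4).

```python
H_ACTIVE, H_TOTAL = 1024, 1344      # VESA 1024x768@60 horizontal timing
PIXEL_CLOCK_HZ = 65_000_000
CELL_H = 16                         # scanlines per character row
BYTES_PER_PIXEL = 3                 # packed 24-bit (assumption)
COG_MIPS = 40_000_000               # figure quoted in the post above

row_buffer_bytes = H_ACTIVE * CELL_H * BYTES_PER_PIXEL
two_buffers = 2 * row_buffer_bytes              # both 16-line row buffers

line_time = H_TOTAL / PIXEL_CLOCK_HZ            # seconds per scanline
instr_per_line = COG_MIPS * line_time           # instructions per cog per line
instr_per_pixel = instr_per_line / H_ACTIVE     # budget per visible pixel, one cog
blank_fraction = (H_TOTAL - H_ACTIVE) / H_TOTAL
```

Under these assumptions one cog gets less than one instruction per visible pixel, which is why the post talks about splitting the work over several cogs. Blanking comes out a little under a quarter of the total line, or roughly 31% when measured against the active period.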

Still wonder how feasible it would be to change the LUT origin with the XINIT/XCONT or some new X instruction... Or, just have the LUT origin advance two places after every xcont...

jmg · 2017-02-08 20:10

Rayman wrote: »

What I'd like to do is reproduce the P1 XGA (1024x768) text mode.
But, with arbitrary 24-bit foreground and background colors per 16x16 pixel cell.

That means one FG and one BG value, per 16p ( or 8p), right ?

Rayman wrote: »

Maybe use two cogs to alternate between each 16 pixel tall character row.
Within that row all the colors and characters are the same, so maybe can be fast.
... Anyway, I think 4 cogs might be able to get it done.

Three COG is another possible solution point.

Rayman wrote: »

Still wonder how feasible it would be to change the LUT origin with the XINIT/XCONT or some new X instruction... Or, just have the LUT origin advance two places after every xcont...

I think auto-advance does not really help, as you need to refill the LUT every block of lines, with that X colour-repeat set. It may allow a slightly smaller LUT, or more pixels per LUT.

This problem is similar to the HyperRAM Streamer questions I posed.

The streamer can already manage blocks of data fast, but I've not seen details around the edges of this. ?
eg With HyperRAM, you need to be able to flip from Read to Write, on a clean burst boundary.
That means a write-mode streamer drives the pin, and immediately a read mode streamer starts, the pin dirn needs to reverse.
If those can be queued, so much the better, but HyperRAM is more tolerant of hand-over gaps than Video is.

With Multi COG video and Streamer, you need a tighter, gap-less hand-over between COGs.
ie COG1 primes with a LUT Start and Count (and can be 1,2,4,.. bpp I think ?)
Meanwhile COG 2 readies the next burst Start and Count, but the Streamer needs to queue that, and change over on the boundary, ideally without dropping clocks.
COG3 can give 3x the time to prepare the information, and so on....

In MCU land, some SPI ports can manage this change over better than others. Losing a clock is quite common, but some do manage clean packing.

More usual Streamer use, is to set for a whole line, and 'go', and the boundaries are less critical then.

Is there any info, or test results, on exactly how the Streamer behaves with small, closely packed bursts ?
(and commands coming from more than one COG, on a ping-pong basis ?)

Rayman · 2017-02-08 20:21

I was thinking the refill of LUT could be done during horizontal refresh.
Need to read in 128 longs, but this is only 1 clock each with SetQ2, right?
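A rough budget for that refill, as a sketch. The 80 MHz system clock is an assumption (it matches the 40 MIPS figure at two clocks per instruction), the blanking width is the standard VESA XGA value, and the estimate ignores hub-window alignment and instruction setup.

```python
SYSCLK_HZ = 80_000_000          # assumption, see lead-in
PIXEL_CLOCK_HZ = 65_000_000
H_BLANK_PIXELS = 1344 - 1024    # VESA XGA horizontal blanking, in pixels

columns = 1024 // 16            # 64 cells per text row
lut_longs = columns * 2         # one FG and one BG long per cell

blank_clocks = H_BLANK_PIXELS * SYSCLK_HZ / PIXEL_CLOCK_HZ
spare_clocks = blank_clocks - lut_longs   # left over at 1 clock per long
```

If the one-clock-per-long figure holds, the 128-long block read uses roughly a third of the blanking interval.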

jmg · 2017-02-08 20:35

Rayman wrote: »

...Or, just have the LUT origin advance two places after every xcont...

Thinking some more about that approach, that would need a Upper-Adr-Bits Counter, plus some Pixel-Count modulus divider, which sets how many sysclks before the UAB advances.
Common might be 4,8,12,16... pixels (in this instance, you want 16)
Next, does that UAB alignment need to change, or does this work only on 1bpp streaming (==LSB) ?
Sounds like a reasonably significant amount of logic and config registers.

msrobots · 2017-02-09 03:15

I am not sure if I am completely off topic here, but since we can share the LUT between two cogs, isn't it possible to have one cog streaming out the LUT in a nice tight loop while another cog is writing the values to be displayed into the shared LUT?

Too poor to buy an FPGA, have to wait for silicon.

Enjoy!
Mike

ozpropdev · 2017-02-09 10:10

msrobots wrote: »

I am not sure if I am completely off topic here, but since we can share the LUT between two cogs, isn't it possible to have one cog streaming out the LUT in a nice tight loop while another cog is writing the values to be displayed into the shared LUT?

I think that would cause glitches in the video output based on what the docs state.

These external writes from the other cog are implemented on the 2nd port of the lookup RAM, which port is shared by the
streamer in DDS/LUT modes. If an external write occurs on the same clock as a streamer read, the external write gets priority.
It is not intended that external writes would be enabled at the same time the streamer is in DDS/LUT mode.

cgracey · 2017-02-09 10:49

I think the way to do this is to eliminate one of the two hub RAM lookups. You need to lookup characters and then font scanlines. The characters are contiguous in memory, but the font scanlines are random, as they are dictated by the character data. So, use SETQ+RDLONG to get a row of characters into cog registers. It only takes one clock per long that way. Then, use RDLONG/RDWORD/RDBYTE instructions to lookup the font scanlines. Do it from cog-exec, not hub-exec, of course. I think you could make a fast character-based display that way.

I got all the instruction timings worked out in the Google Sheets file I've been working on:

https://docs.google.com/spreadsheets/d/1EM9LYoqcUgn0hAhzE38vLEi7-IABeD1CdLqDgICx3Hc/edit?usp=sharing

I just need to finish making descriptions for the math/logic instructions. Anything heavy will be explained in the Google Doc.

Rayman · 2017-02-09 16:26

I just looked at Andy's code more closely and see a trick I didn't know about... Just found it in the docs:

RFLONG LUT modes

A background RFLONG is executed initially, and then whenever more data is needed, in order to supply new 1/2/4/8-bit values on each NCO rollover, while shifting remaining RFLONG bits right. These 1/2/4/8-bit values are used as offset addresses in lookup RAM, with the %bbbb field of D/# furnishing bits 8..5 of the base address (%bbbb becomes %bbbb00000). The resultant 32 bits of data read from lookup RAM are output.

Looks like he was able to use some bits in D to set an offset in LUT for each character.
That's exactly what I think I need...

Well, this gives 16 possible sets of colors anyway. Guess that's enough to replicate P1 video somewhat...
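Going by the doc text quoted above, the LUT address for each pixel can be modeled as below. Treating pixel value 0 as background and 1 as foreground is an assumption, as is the function name.

```python
def lut_address(bbbb, pixel_value):
    """LUT address per the quoted doc text: %bbbb supplies address bits 8..5."""
    assert 0 <= bbbb < 16
    return (bbbb << 5) | pixel_value

# 1 bpp: each of the 16 bases only uses entries +0 and +1
# (assumed here to be background and foreground respectively).
pairs = [(lut_address(b, 0), lut_address(b, 1)) for b in range(16)]
```

The bases land 32 longs apart (0, 32, 64, ... 480), so in 1 bpp mode most of each 32-entry block goes unused, but 16 independent color pairs are available without touching the LUT mid-line.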

potatohead · 2017-02-09 16:41

If character data is organized differently, Chip's row fetch could happen early in the blanking period. Store them row sequential rather than char sequential.

In my old Potato text driver, I got 80 columns and did it with a single scanline RAM buffer. While this worked, racing ahead of the beam, I found I could not reliably composite (overlay) additional graphics, such as a mouse pointer. It also had sharp lower clock speed bounds and would not fail gracefully.

(I did a 2 to 4 color lookup conversion, planning for the mouse to use an unused color to always be unique to the two color chars. 2 bits per pixel waitvid.)

If a double buffer is used for the scanline RAM, all of this gets much easier.

Fast fetch char row, do hub lookups for char pixel data, overlay pointer, or sprites, etc... and there is the whole scan line to do it in.

Use the vertical scan line counter to toggle display vs fetch buffers.
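The buffer toggle described above can be sketched as a simple software model. The function and variable names are illustrative, and "render" stands in for the whole fetch, lookup, and overlay step.

```python
def run_frame(num_lines, render_line):
    """Model: display buffer (line & 1) while filling the other one."""
    buffers = [None, None]
    buffers[0] = render_line(0)                 # prime before the first visible line
    shown = []
    for line in range(num_lines):
        display = line & 1                      # scanline counter picks the buffer
        fetch = display ^ 1
        if line + 1 < num_lines:
            buffers[fetch] = render_line(line + 1)   # built during this scanline
        shown.append(buffers[display])
    return shown

out = run_frame(4, lambda n: f"line{n}")
```

The renderer always has a full scanline period to finish the next line, which is what makes overlays such as a pointer or sprites practical.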

Doing this is likely one cog on p2. Took two on P1 due to how expensive color lookups are.

Rayman · 2017-02-09 17:38

That was NTSC, right? That signal is much slower than XGA, I think...

I'm thinking that maybe using the actual P1 ROM font here might make sense...
If you want the 16-bit wide font. Since a long read takes just as long as a word read...
Just use 2bpp mode. That might reduce the number of color sets from 16 to 8, as you'd need a different color set to select between the merged pairs of characters, just like P1.

potatohead · 2017-02-09 18:24

The stuff on P1 was NTSC.

Yeah, Parallax font would be good 2bpp.

I've got my FPGA updated, but am currently porting a bunch of old code. Stepped away a bit too long.

Might just rewrite.

The slowest sweeps are 640x480 and that same mode, interlaced, is the slowest. Did that a while back and many displays deinterlace for free. But 640x480 can do nice text.

I'll bet there is time on P2 with a full scanline buffer, using a quick row fetch like Chip suggested.

Cog indexing is fast now and there is the LUT. It's probably best to keep scanline buffers in COG. 2bpp is nice and small.

You only need 160 bytes per line that way, and you could also just convert one color fonts to two color via script or at runtime too.

That leaves a color for the pointer, should you drop one in.

If I were to attempt it right now, that is what I would try.
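A sketch of the 1 bpp to 2 bpp font conversion mentioned above, plus the line-buffer size. The 640-pixel line width is taken from the 640x480 mode under discussion; the index values chosen for "on" and "off" pixels are arbitrary assumptions.

```python
def expand_row_1bpp_to_2bpp(row_bits, width=8, on=0b01, off=0b00):
    """Turn each 1 bpp font bit into a 2-bit color index (values are illustrative)."""
    out = 0
    for x in range(width - 1, -1, -1):        # keep the original bit order
        out = (out << 2) | (on if (row_bits >> x) & 1 else off)
    return out

bytes_per_line = 640 * 2 // 8                 # 2 bpp pixel data at 640 pixels
expanded = expand_row_1bpp_to_2bpp(0b10110000)
```

With glyph pixels confined to two of the four index values, the remaining values are free for something like a pointer overlay.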

jmg · 2017-02-09 19:35

ozpropdev wrote: »

I think that would cause glitches in the video output based on what the docs state.

These external writes from the other cog are implemented on the 2nd port of the lookup RAM, which port is shared by the
streamer in DDS/LUT modes. If an external write occurs on the same clock as a streamer read, the external write gets priority.
It is not intended that external writes would be enabled at the same time the streamer is in DDS/LUT mode.

That sounds like speeds of SysCLK/2 or less could be ok, provided the external writes interleave with Streamer reads ? (eg 40MHz pixel clock at 80 MHz SysCLK ? )

The challenge could be getting that careful interleave and keeping it ?

potatohead · 2017-02-09 20:03

Oops, 320 bytes. Forgot colors.

I wonder about 4bpp and some combination of the nib and mux instructions too.

potatohead · 2017-02-09 20:14

Last things.

Drive it 1bpp, use an interrupt to directly modify color entries in the LUT on char boundaries. I did test this early on. When chip double buffered the streamer I did not test again. May not work now.

Could be two cog, shared LUT scenario too.

jmg · 2017-02-09 20:52

potatohead wrote: »

Drive it 1bpp, use an interrupt to directly modify color entries in the LUT on char boundaries. I did test this early on. When chip double buffered the streamer I did not test again. May not work now.

That's a good idea.

A buffer change should not break it, but it may shift the load alignment.

How many bpp did you test with ?

I think the write needs to be either
Atomic (ie 32b write can update 16 FG, 16BG in one clock edge),
or maybe 2 writes can be carefully ordered, based on the current-pixel content, so you avoid swap of an about to read pixel.

Along the lines of

Pixel Stream     Write New Order
xxxxxFF          BG32 then FG32
xxxxxBB          FG32 then BG32
xxxxxBF          BG32 then FG32
xxxxxFB          FG32 then BG32

Assumes a write on the same apparent clock as Streamer read, reads the old value

That is one test, for choice of one of two write orders, inside the interrupt ?
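Read one way, the table above reduces to a single test on the pixel about to be read: write the other LUT entry first. A sketch of that reading follows; taking the rightmost symbol of each row as the pixel about to be read is an assumption, as are the names.

```python
def write_order(next_pixel):
    """Pick the LUT write order from the pixel about to be read ('F' or 'B').

    The entry the streamer is about to read is written last.
    """
    return ("BG", "FG") if next_pixel == "F" else ("FG", "BG")

# Rebuild the table from the last symbol of each pixel-stream row.
table = {s: write_order(s[-1]) for s in ("FF", "BB", "BF", "FB")}
```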

INT update of 2bpp I think gets harder, could work for an 8b palette ?

Rayman · 2017-02-09 21:07

Interrupt on boundary is interesting idea...

But, can't be two cogs, right? Each cog has streamer tied to its own 4 I/O pins.

jmg · 2017-02-09 21:22

Rayman wrote: »

But, can't be two cogs, right? Each cog has streamer tied to its own 4 I/O pins.

Good point, I think that limit is true for DAC Streamer output, but the docs seem to suggest digital outputs can map to any pins, in 32b + OUTx mask manner.
That means you could time-share 2-3-4+ COGS streamers onto pins, which I've assumed was possible.
The next question is, how seamless can that time-share be made ?

potatohead · 2017-02-09 22:24

Use one cog to stream and change colors, the other one fetches from hub, LUT is shared, either should be able to modify buffer data.

potatohead · 2017-02-09 22:26

Jmg, I tried color changes in various modes.

I'm hoping chip just buffered commands. If so, this should still work.

As for writes, the LUT read for the pixel should happen, and once it does, changing it would affect a future pixel.

potatohead · 2017-02-09 22:30

On the two cog option

The signal cog does the streaming to display, and maybe char row fetch on blank too.

Either cog can modify shared LUT values.

The graphics cog does hub fetches for char values to load line buffer and can start this in the blank after char row arrives to be ahead of the signal cog.

BTW, char values are the same, so they can be buffered too. Not sure that gets us anything though.

Streamer on graphics cog goes unused. It's fetching char and attribute values from HUB to get pixel data addresses, and looking up colors from the attribute values, used as addresses. The interrupt will need to get those, maybe small ring buffer.

Hopefully, the signal cog can have the interrupts for color changes.

Put line buffers, char row in LUT. One buffer should work. Two would work at higher sweep frequencies. Should fit at 320 bytes per line.

This is basically P1 style, but with the shared LUT and pointers. It's gonna work much faster.

Rayman · 2017-02-10 22:44

Was thinking about this today, but decided it's too hard...

Instead, think I'll try a 2 bpp screen array. At 1024x768 that's ~200 kB.
I'll use the P1 ROM font and Andy's trick to set LUT offsets.
Just need a byte array of color choices for each 16x16 tile.

This sounds easy and can allow arbitrary graphics over full screen (with just a restriction on colors).

Or, maybe a long array of 4 colors for each tile so that tile colors can be arbitrary.
Don't need the LUT offset trick in this case.
RFAST should be able to get those four longs super fast...
I get ~50kB for color array.
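The memory figures in this post can be checked with a few lines of arithmetic. Variable names are illustrative.

```python
W, H, TILE = 1024, 768, 16

bitmap_bytes = W * H * 2 // 8             # full-screen 2 bpp array
tiles = (W // TILE) * (H // TILE)         # 64 x 48 tiles of 16x16 pixels
byte_per_tile_bytes = tiles               # one color-choice byte per tile
four_longs_per_tile_bytes = tiles * 4 * 4 # four 32-bit colors per tile
```

That gives 196,608 bytes for the bitmap (the "~200 kB" above), 3,072 bytes for a byte-per-tile color array, and 49,152 bytes for four longs per tile (the "~50kB").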

Rayman · 2017-02-10 22:49

Actually, maybe I still need the LUT offset trick for arbitrary colors.
That gives me 8 slots for color sets that I can cycle through.
This avoids the buffered streamer command issue, I think...

potatohead · 2017-02-10 23:54

Cool. I'll take it as a project one day here soon. I like that kind of code anyway.

It's cool we have enough RAM now to make that kind of tradeoff.

One other thing you might consider, if you don't need a full screen bitmap would be to just partial buffer the bitmap.

Just make a vertically small region, a couple of rows, and have a text COG render into it ahead of the raster.

1024x64 or something.

Rayman · 2017-02-12 00:55

Started a 2bpp driver.
But, looking like arbitrary tile colors isn't going to be possible with one cog.
Looks like going to be limited to 16 color options with the LUT offsets.

But, maybe arbitrary tile colors can be done with second cog with shared LUT.
