
Creating a multi-cog VGA driver

escher Posts: 138
edited 2017-11-13 04:56 in Propeller 1
How's it spinning everyone...

I have finished an initial version of a 640x480 @ 60 Hz tile-based VGA driver. It's parameterized to support both 8x8 and 16x16 pixel tiles, and any screen dimensions that divide evenly by the tile size, giving theoretical max screen dimensions of 40x30 tiles with 16x16 tiles and 80x60 with 8x8 tiles. The code is located here. The graphics.spin object is the driver wrapper and vga.spin is the driver itself. The game and input objects hold the game logic (which also defines the tile maps, tile palettes, and color palettes) and the code that interfaces with 74HC165 shift registers to receive control inputs.
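
To make the parameterization concrete, the compile-time knobs boil down to something like the following (a sketch; these names are illustrative, not necessarily what graphics.spin actually uses):

    CON
      TILE_SIZE = 16                ' 8 or 16 (square tiles)
      TILES_X   = 640 / TILE_SIZE   ' 40 across with 16x16 tiles, 80 with 8x8
      TILES_Y   = 480 / TILE_SIZE   ' 30 down with 16x16 tiles, 60 with 8x8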

The problem is that I'm hitting actual dimension ceilings well below those maxima, because the driver currently pulls tile map, tile palette, and color palette data from hub RAM between each waitvid: a real-time, losing race against the pixel clock. And that's before even attempting to incorporate a sprite engine into the driver.

The solution of course is a multi-cog driver, and I've found plenty of good information on this site concerning the implementation of such a system:

- http://forums.parallax.com/discussion/131278/wading-into-video
- http://forums.parallax.com/discussion/106036/multi-cog-tile-driver-tutorial/p1
- http://forums.parallax.com/discussion/123709/commented-graphics-demo-spin/p1

However, I still have some brain-picking to do from the people who have the most experience with this...

From what I've seen in Chip's demo multi-cog video drivers, it appears that he runs a single PASM routine in multiple cogs that take turns rendering scanlines. So while one cog is building the next scanline(s) it will render, the other is rendering the scanline(s) it just built (extensible to however many cogs). His code is admittedly beyond my ability to fully understand at this point, but I believe I understand the methodology he's employing. I also understand that in this scenario, "synchronizing" the cogs is vital so that their independent video output is interleaved seamlessly.

The other paradigm I'm seeing is where you have a single "video" cog (which outputs the actual VGA signal) and several "render" cogs which build alternating scanlines and copy them to a buffer that the video cog can grab and display. In this scenario, cog synchronization isn't necessary (unless you combine this with Chip's method to have multiple "video" cogs as well as multiple "render" cogs); however, the number of required cogs increases.

I prefer Chip's method personally, due to the reduced resource footprint, but I'm concerned about its scalability: when I add a sprite rendering component to the VGA driver, will there still be enough time to generate data for the video generator? With the video+render cogs method, the render cogs don't have to wait for video generation to complete before generating the next scanline(s), as Chip's method requires; once they've copied their generated data to the scanline buffer, they can immediately start generating the next scanline(s) without waiting for a vertical or horizontal blanking period.

So out of all this, the questions here are:

1. Is this analysis of pros and cons of each accurate?
2. If not/so, which method (or other) is better for both constructing and pushing tile-mapped and sprite-augmented video at VGA speeds?
3. When I get to development of an NTSC driver, will one of these methods be preferential to the other?

Thanks for any help!

Comments

  • potatohead Posts: 10,253
    edited 2017-11-13 12:47
    Chip's method results in higher resolutions, but it leaves little time for dynamic display methods, the primary one being sprites. It does use a tile map though. That can do a lot more than people think.

    It also does not do any-color-any-pixel: it's 4 colors per tile. It works very well with software sprites done in a pseudo-sprite COG that is timed to VBLANK.

    The other method can do any pixel any color, and that works by running waitvid backwards. "WAITVID pixels, #%%3210"

    The pixel values are fixed and reversed, each assigned one of the 4 palette entries. Palette entries become pixel colors, one byte per pixel.

    This requires a lot more WAITVID instructions per line.
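
    For anyone following along, a minimal sketch of that inner loop (register names like addr and count are mine, and vscl is assumed to be pre-set for 4 pixels per frame):

    :loop   rdlong  pixels, addr        ' fetch 4 color bytes (one byte per pixel) from hub
            add     addr, #4            ' advance the hub pointer
            waitvid pixels, #%%3210     ' pixel data rides in the "colors" slot; %%3210 emits byte 0 first
            djnz    count, #:loop       ' next group of 4 pixels

    The roles are swapped relative to normal use: the pixel bytes go in the colors operand and the fixed selector %%3210 goes in the pixels operand, which is why every pixel can be any color byte but each waitvid only covers 4 pixels.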

    What happens is that the max pixel clock gets constrained because each waitvid only carries 4 pixels, so the horizontal sweep and Propeller clock determine the max number of pixels. At VGA sweeps, this is 256 pixels. At NTSC, one can get 320, maybe a bit more.

    In addition, there is less racing the beam as the render COGS can work up to several scan lines ahead, doing tiles or a bitmap first, then masking in sprites. The more render COGS you have, the more sprites per line are possible.

    The slower sweeps on TV = more graphics possible per line. This favors the render COG method if full color freedom is needed; otherwise both work well, and Chip's method does a lot more with fewer COGs required.

    Optimal sprites are 4 x whatever number of pixels you want in the vertical direction. Run them in pairs or groups in the sprite position list to make larger objects.

    This is all the classic speed / color depth tradeoff in the propeller software video system.

    If you don't need super high resolution, nor all the colors in any pixel, Chip's method is superior. What a lot of people miss about his method is that each tile can be pointed at a region of HUB RAM. This can be used to partially buffer a screen and make effective use of higher resolutions where not enough RAM exists to represent the whole screen. Software sprites can be drawn to a small buffer, then displayed; repeat in vertical sections.

    Chip's method works at one or two bits per pixel, and one of 64 possible palettes per tile.
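
    Concretely, each entry in his tile map is a single word that packs both pieces. A sketch of building one in Spin (my variable names, and the bit layout is from memory, so verify against the actual driver source):

    ' tiles are 64-byte aligned, so 10 bits hold (tile address >> 6),
    ' and the top 6 bits pick one of the 64 palettes
    word[screen_base][ty * tiles_x + tx] := (tile_addr >> 6) | (palette << 10)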

    The other method, which several of us implemented, always works in units of 4 pixels, but any pixel is any color.

    At any kind of resolution in byte-per-pixel operation, say 160 pixels horizontally or more with one buffer row per scanline, there isn't enough HUB RAM to represent the entire screen. Dynamic display methods, or at minimum tiles, are needed to draw all the pixels.



  • potatohead Posts: 10,253
    edited 2017-11-13 13:02
    There is one more great trick. Google WHOP. WAITVID hand off point.

    First discovered by Linus Akesson by accident, the idea is that WAITVID auto-repeats. If one times a register load just right, sequential waitvids are possible without further WAITVID instructions.

    A user here worked the timing out. I can't remember their name... kurenko or similar.

  • Tor Posts: 2,010
    Kuroneko. (Black Cat, so easy to remember (for me at least :))
  • Thank you.
  • potatohead wrote: »
    Chip's method results in higher resolutions, but it leaves little time for dynamic display methods [...]

    Thanks for the great reply potato!

    The WHOP method I am aware of, and I've been talking with kuroneko about it on GitHub. Once I've got my head wrapped around it I'll investigate that path, but for now I'd like to focus on the "conventional" approaches as a starting point.

    I was aware of the bitmapped and tile based approaches but not the reversed waitvid one; that's pretty snazzy!

    But because my project is attempting to emulate retro arcade games, the tile engine I've implemented is my definite way forward. 4-color palettes are fine, and 6-bit RRGGBB color (64 possible colors) is sufficient for retro graphics.

    So long story short, since my system will offer VGA as an option alongside "CGA" (15 kHz) and possibly NTSC, I'm choosing speed over color depth: I'm printing either a 16-pixel or 8-pixel tile palette line per waitvid.

    With this in mind, I'm thinking of the following setup...

    For a 640x480 @ 60 Hz, tile and sprite based, 6-bit color video driver:
    - One video cog which pulls the currently displaying scanline from main RAM
    - 2 render cogs which generate their respective alternating scanlines and write them to main RAM
    - One sprite cog which augments the scanline in main RAM after each render cog writes it but before the video cog grabs it

    The methodology of the sprite system is still up in the air, so that's subject to change.
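
    To make the data flow concrete, I imagine the shared hub structures looking something like this (purely illustrative names and sizes; the real layout will depend on the sprite design):

    CON
      LONGS_PER_LINE = 640 * 2 / 32                  ' 640 px at 2 bpp = 40 longs per scanline
      NUM_LINES      = 4                             ' scanlines buffered ahead of the beam

    VAR
      long  line_buffer[LONGS_PER_LINE * NUM_LINES]  ' render cogs fill, video cog drains
      long  line_counter                             ' scanline now displaying, written by the video cog

    The render cogs would watch line_counter to know which buffer slot is safe to overwrite, and the sprite cog would touch a slot after its render cog finishes it but before the video cog reaches it.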

    But based on my specs, is this paradigm valid and representative of "best practices", objectively or subjectively?
  • potatohead Posts: 10,253
    edited 2017-11-13 22:08
    What does augment mean? Add sprite data?

    Tiles are quicker than sprites. You might want three buffered scan lines, 4 total.

    All render COGS do tiles, then sprites.

    Sprites are slower. Shift plus mask (read, mask, write sprite data) plus a write back to line RAM is needed for two bits per pixel. More buffered scan lines means more time.

    Sprites are much quicker byte per pixel, BTW. No masks. If you start your video display one long in on the left and right, no clipping needed either. That's true for 2 and 4bpp too. Just don't display the clipped region.
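
    To illustrate, here's the read-mask-write step for one long (16 pixels at 2 bpp) of a line buffer (register names are mine; sprpix and sprmask are assumed pre-shifted to the sprite's horizontal offset, with mask bits set where the sprite is opaque):

            mov     t1, linepix         ' read 16 background pixels from the line buffer
            andn    t1, sprmask         ' clear the pixels the sprite covers
            or      t1, sprpix          ' merge in the pre-shifted sprite pixels
            mov     linepix, t1         ' write the merged long back

    Byte-per-pixel operation drops the mask step entirely, which is the speedup described above.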
  • Yep, that was what I meant by augment. I suppose I'll have to experiment with the architecture a bit, but I'll definitely start out having both tile AND sprite rendering in the same cog(s), and then restructure if that's not fast enough. I'm not sure what I would consider an "acceptable" number of sprites per scanline yet, so that will play a large role as well.

    Thanks for the help!
  • What I do is make it work. Then make it work faster.
  • potatohead Posts: 10,253
    edited 2017-11-14 01:21
    The best thing you can do is make it work however it works. Then write a little demo program showing all the sprites and the tiles, and exercise it some.

    Once you have that working, crank up the numbers on the demo program till the driver fails. Then you know what kind of failure you've got. Then you can recode around it.
  • Just noticed this comment:
    potatohead wrote:
    If you start your video display one long in on the left and right, no clipping needed either. That's true for 2 and 4bpp too. Just don't display the clipped region.

    That's a nifty trick, thanks for that too!

  • Thank Baggers. :D
  • pik33 Posts: 2,347
    edited 2017-11-17 11:04
    Several years ago I wrote a "nostalgic VGA driver" - a 640x480 text driver (80x30 characters of 8x16 px) with a border, signalling 800x600 to the monitor. It uses 2 cogs: one decodes characters from the font definitions into pixels, putting the results into a (circular) buffer in main memory, while the second cog displays pixels from that buffer. The cogs are synchronized via vblank/hblank signals. As the display cog's task is simple, there were some bytes left over in its cog memory, so I used them for the color buffer.

    Here is the topic: http://forums.parallax.com/discussion/139960/nostalgic-80x30-vga-text-driver-with-border-now-beta/p1
  • That's pretty awesome! Very similar to the direction I'm going... I dig the Atari font too!

    My final driver is going to be an almost direct copy of the NES graphics system, with 4-color tiles and sprites of user defined size but with support for a wide variety of resolutions in addition to VGA, RGBS (basically CGA), and probably NTSC.

    Ambitious, but because I'm targeting those relatively low resolution retro graphics the modern Propeller is more than up to the task.

    So far I've gotten VGA working with tiles, and I'm now implementing the multicog method in a refactor that will allow higher tile dimensions as well as a sprite system.
  • potatohead wrote: »
    At VGA sweeps, this is 256 pixels.

    Sorry to revive an old post, but potato I was wondering whether this limit is for the entire scan line or just the visible video area (not taking into account horizontal sync area).
  • Visible.

    And it's 256ish pixels @ 80MHz. You can push it a little at 96MHz.

  • pik33 Posts: 2,347
    The P1 can display 1920x1200 @ 60Hz; the 154 MHz pixel clock is within its PLL range, which ends somewhere around 220 MHz (experimentally tested).
  • Not in full color, which is what I was referring to.

    Full color, one byte per pixel, requires 4 pixel waitvids, which does constrain things.
  • rogloh Posts: 5,122
    edited 2017-11-24 10:14
    escher wrote: »
    Yep, that was what I meant by augment. I suppose I'll have to experiment with the architecture a bit, but I'll definitely start out having both tile AND sprite rendering in the same cog(s), and then restructure if that's not fast enough. I'm not sure what I would consider an "acceptable" number of sprites per scanline yet, so that will play a large role as well.

    Thanks for the help!

    If you think about the types of games you want to implement that should help you work out how many sprites you want to support.

    For example, from memory, Space Invaders has 11 sprites across and 5 rows deep of invaders, and they were locked in separate regions of the screen so the number could never increase. You'd want to add the missiles and the player sprite as well. Portrait- or landscape-oriented displays would affect the final sprites-per-scanline count needed too.

    Something like PacMan could be done with far fewer sprites, since there are only 5 moving things most of the time; you could possibly even use dynamically generated tiles alone, though transparent sprites can make the game logic much easier to do.

    A fighting game with two large characters side by side may need lots of sprites per scanline, as would a Gauntlet-type game where almost every cell position on a scanline can be occupied by a moving creature.

    As a starting point, I think about 16 sprites per scanline is a nice round number for 16x16 pixel sprites if you have scanlines of, say, 288 pixels wide, as that almost fills the scanline with sprites over the tiles. That would allow many retro games. You also need to think about the total number of sprites per screen as well.

    I've been able to get over sixteen 16x16 pixel, 16-color sprites (from a 64-color palette) per scanline in my own customized VGA implementation, with an 80MHz P1 dedicated *entirely* to video render and display at 288x224 resolution with line multiplication, based on some of Marco Maccaferri's earlier work. My spreadsheet tells me that I could sustain up to 21 16x16 sprites, or 36 8x8 transparent sprites, per scanline drawn on top of the 8x8 tiles if I tweak the VGA timing and run at 50Hz, though I've not tested it that far. With some other optimizations I still hope I might be able to free up another COG for rendering and then boost it up to 29 (16x16) or 50 (8x8) one day, just to see if it can be done.

    It's also possible to support hi-color palettes (15bpp palette entries), though that consumes an additional COG for display and some more scanline and palette memory. But with that you can get some very nice color gradients for doing sky backgrounds etc. I built up a 15bpp resistor DAC for that once and tried it out. Looks cool.

    Finally, if you only use one single P1 for everything you won't have as many COGs available for rendering and would have to compromise on memory use as well. I think a dual Prop implementation is really neat for retro games, with one propeller just for video and the other one for graphics/code storage, running the game logic, reading user input and generating the sound fx/music using mixed outputs from all the different emulated vintage audio chips. It's a pretty powerful combination.

    Cheers,
    Roger. :)
  • rogloh wrote: »
    Finally, if you only use one single P1 for everything you won't have as many COGs available for rendering and would have to compromise on memory use as well. [...]

    Like this for example?
    http://forums.parallax.com/discussion/164043/the-p8x-game-system

  • @macca
    Precisely! Your board is a great example of how two Props can be put to good use for doing retro games Marco. I was particularly impressed when I saw it a while back as well as all the code you'd integrated together.

    In one of my own board designs some time ago, after being inspired by the Uzebox project, I used a Propeller and an ATMega2651 AVR acting like your second Prop for controlling game logic etc., but using two Propellers alone is far nicer for all the extra audio support you can get, and nowadays the extra GCC support for the Propeller makes game development better there too. There's also some potential scope to use SQI flash for holding much larger programs, despite needing a slower caching driver for memory access compared to running native or LMM-based code.

    Cool stuff.
    Roger
  • macca wrote: »
    Like this for example?
    http://forums.parallax.com/discussion/164043/the-p8x-game-system

    Hey I'm definitely going to be picking your brain soon on some video implementation details!

    @potatohead, @ 80 MHz, 7 cycles per waitvid, 4 pixels per waitvid, and 640 pixels per visible line (640 x 480 @ 60Hz): that's 160 waitvids per line at 87.5 ns each, or 14 microseconds to display a visible line. This is well under the ~25 microseconds the standard allows for the visible portion of a line. Even adding a 4-cycle instruction per waitvid, such as a shift, only brings it up to 22 microseconds.

    Where is this 256 pixel visible resolution limit at "full color" coming from? I must be missing something obvious :P
  • potatohead Posts: 10,253
    edited 2017-11-27 23:21
    Those waitvids need data, and it's gotta come from somewhere, and it takes instructions to do that.

    Add in the HUB transfers, index updates, etc... and it all limits things quick.

    Baggers and I managed 320 pixels at 96MHz. If the timing works out, using a WHOP (WAITVID hand-off point) could yield more. That technique was known, but not exploited when I did this. Basically, you get rid of the WAITVID instructions entirely during active scan! Launch one, and it will just repeat, grabbing what it sees on the D and S register busses at a specific time. You make sure an instruction holds the right values at that time, and no WAITVID is needed.

    BTW, the slowest sweep still supported by most modern displays is 640x480 interlaced. Many displays will just deinterlace it today, but that is your max scanline time in VGA. We used that sweep frequency to get max pixels, again at 96MHz. We decided to stick with 256 for those efforts, as 80MHz setups are standard.

    To get longer amounts of time, and more pixels per line, TV / CGA is needed. The old 320x200ish VGA modes are no longer directly supported by most monitors, though many TV sets with VGA input will display them.

    TV does 320 plus easy.

    By all means, prove me wrong. :D
  • ericball Posts: 774
    edited 2017-11-28 14:18
    Yes, the biggest timing restriction is retrieving data from HUB RAM. Best case, you can fetch a long from HUB RAM every 16 cycles, or every 200ns @ 80MHz and 167ns @ 96MHz. At 1 byte per pixel that's a maximum pixel clock of 50ns (20MHz) or 42ns (24MHz), both of which are slower than a normal 640x480 VGA display needs (25.175MHz).

    For my NTSC sprite driver I had each cog draw a single line to local cog RAM (although for color resolution rather than horizontal resolution). However, I'm not certain that even this would work as you'd need to have a 3 instruction WAITVID loop.

  • escher Posts: 138
    edited 2017-11-28 15:36
    OK how about this: with 4 cogs each rendering every 4th line (2 cogs would be too few, and 3 wouldn't divide evenly into all of the vertical resolutions I want to support), the 3 lines before display give 31.7*3 = 95.1 microseconds of time to execute 160 rdlongs, which would take 32 microseconds. Sure, it takes 4 cogs plus 2 for actually generating every other line, but this isn't a deal breaker if there were a dedicated video Propeller.

  • Since each cog is rendering a single line, each cog needs to produce pixels at the desired rate (e.g. 25.175MHz for 640x480 @ 60Hz), which is then limited by the RDLONG / WAITVID loop timing.

  • Take a hard look at WAITVID hand off point. WHOP.

    If you get the timing to work out, you can basically leave the WAITVID out of the pixel loop, just doing HUB fetches inline.

    What you need is a faster pixel rate, not more time. The pixel rate, or clock, is the limiting factor here.

  • You guys are assuming that the rdlongs are going to be done between each waitvid. What I'm proposing is interleaving lines, so that a cog can COPY a line from main RAM to cog RAM and then display it when its turn comes up. So: 4 cogs, each displaying every 4th line, copying their respective lines from main RAM over the course of the 3 lines in between, in preparation for sequential waitvids that actually output video on the 4th.
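
    A sketch of that copy phase, using the usual self-modifying-destination idiom (all names mine):

            movd    :store, #linebuf    ' point the rdlong destination at the cog buffer
            mov     count, #160         ' 160 longs = 640 pixels at one byte per pixel
    :store  rdlong  0-0, hubaddr        ' destination field is rewritten each pass
            add     hubaddr, #4         ' next long in hub RAM
            add     :store, d_inc       ' bump the destination field by one register
            djnz    count, #:store

    d_inc   long    1 << 9              ' +1 in an instruction's destination field
    linebuf res     160                 ' cog-RAM scanline buffer

    One caveat: a straight loop like this misses every other hub window, settling at 32 clocks per long (64 microseconds for 160 longs at 80 MHz) rather than the ideal 16, so the 32 microsecond figure above assumes some unrolling; either way it fits within the ~95 microseconds of the 3 lines.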
  • potatohead Posts: 10,253
    edited 2017-11-28 19:48
    Will be interesting to see it play out. Maybe inline all the WAITVIDs to work from fixed COG addresses.

    Might get 320 pixels at 80mhz that way.
  • rogloh Posts: 5,122
    edited 2017-11-29 00:40
    If you give up the 640 pixel width resolution, you can use WHOP and stream directly from hub memory at 20M pixels per second (4 per hub cycle) and achieve 512ish pixels per scanline using just a single COG at 80MHz. Of course this is a custom resolution, and may look quite a bit better on multi sync analog VGA monitors compared to LCDs depending on scaling (some are better than others and YMMV).

    Critical inner display loop code then just looks like this:
    rdlong  fourcolors, addr     ' read the 4 pixels from hub memory
    cmp     fourcolors, #%%3210  ' this instruction cycle is when the WHOP happens
    add     addr, #4             ' advance the hub pointer
    ' these 3 lines are repeated 128 times for 512 pixels (4 pixels per hub window),
    ' consuming 384 longs of the 496-long COG instruction memory

    Timing is a bit tricky: you have to carefully align the WHOP with the hub cycle at the right point, once at initialization time, to make such code work. But it's certainly doable, and once correct it stays locked. In fact I find it more repeatable than relying on the PLL locking in Chip's drivers, which sometimes doesn't sync up correctly after a reset and then gives you fuzzy text on a hi-res VGA monitor (an LCD might re-clock things and hide the effect).

    The nice thing is this leaves 7 COGs free for everything else, including rendering the entire scanline from the text/sprite buffers in hub RAM. Another benefit is that you don't have to deal with syncing up multiple display COGs, which can also be rather tricky.

  • Because the video clock is derived from a PLL, it is asynchronous to the system clock. Does the WHOP method always work, or does it rely upon a constrained set of video clock frequencies that play nice with the system clock and make computing the rendezvous points more reliable?

    -Phil