Creating a multi-cog VGA driver
escher
Posts: 138
How's it spinning everyone...
I have finished an initial version of a 640x480 @ 60 Hz tile-based VGA driver, parameterized to support both 8x8 and 16x16 pixel tiles and any screen size whose horizontal and vertical pixel dimensions are evenly divisible by the tile width and height, with theoretical maximums of 40x30 tiles using 16x16 tiles and 80x60 tiles using 8x8 tiles. The code is located here. The graphics.spin object is the driver wrapper and vga.spin is the driver itself. The game and input objects are the game logic (which also defines the tile maps, tile palettes, and color palettes) and the code that interfaces with 74HC165 shift registers to receive control inputs.
The problem is that I'm hitting actual dimension ceilings well below those maximums, because the driver currently pulls tile map, tile palette, and color palette data from hub RAM between each waitvid, literally in real time, in a losing race against the pixel clock. And that's before even attempting to incorporate a sprite engine into the driver.
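To illustrate the kind of loop that loses that race, here is a simplified, hypothetical sketch (not the actual code from vga.spin): each waitvid's worth of pixels needs several hub reads, and each hub read can cost 8 to 23 clocks.

            ' hypothetical single-cog tile loop: three hub reads compete with every waitvid
:tile       rdword  entry, mapptr           ' tile map entry from hub (8..23 clocks)
            add     mapptr, #2
            rdlong  colours, palptr         ' 4-colour palette for this tile (8..23 clocks)
            rdlong  pixels, tileptr         ' one row of tile pixel data (8..23 clocks)
            waitvid colours, pixels         ' 16 pixels out; the fetches above must beat this deadline
            djnz    tiles, #:tile           ' (address arithmetic from the map entry omitted)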
The solution of course is a multi-cog driver, and I've found plenty of good information on this site concerning the implementation of such a system:
- http://forums.parallax.com/discussion/131278/wading-into-video
- http://forums.parallax.com/discussion/106036/multi-cog-tile-driver-tutorial/p1
- http://forums.parallax.com/discussion/123709/commented-graphics-demo-spin/p1
However, I still have some brain-picking to do with the people who have the most experience with this...
From what I've seen in Chip's demo multi-cog video drivers, it appears that he uses a single PASM routine, loaded into multiple cogs, to render scanlines alternately. While one cog is building the next scanline(s) it will display, the other is displaying the scanline(s) it just built (extensible to however many cogs). His code is admittedly beyond my ability to fully understand at this point, but I believe I understand the methodology he's employing. I also understand that in this scenario, "synchronizing" the cogs is vital so that their independent video output is interleaved seamlessly.
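One common way to get that lockstep is to hand every display cog the same future cnt value at launch. This only sketches the cnt-based part, with hypothetical labels; the identical video PLL setup each cog also needs is omitted:

            rdlong  synctime, par           ' par points at a long holding cnt plus some margin
            waitcnt synctime, #0            ' every display cog resumes on the same clock
            ' from here each cog follows the same fixed schedule, taking every Nth
            ' scanline (cog 0: lines 0, 2, 4...; cog 1: lines 1, 3, 5... with two cogs)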
The other paradigm I'm seeing is where you have a single "video" cog (which outputs the actual VGA signal) and several "render" cogs which build alternating scanlines and copy them to a buffer that the video cog can grab and display. In this scenario, cog synchronization isn't necessary (unless you combine it with Chip's method to have multiple "video" cogs as well as multiple "render" cogs); however, the number of required cogs increases.
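The handshake between the two kinds of cogs can be as simple as a hub "mailbox" holding the number of the scanline currently being displayed. A rough sketch of a render cog's loop, with hypothetical register names (one of several ways to arrange it):

:wait       rdlong  shown, mailbox          ' scanline the video cog is displaying right now
            cmp     shown, safeline wc      ' C is set while the display hasn't passed our buffer
    if_c    jmp     #:wait
            ' ... build scanline "myline" into its hub line buffer here ...
            add     myline, #2              ' with two render cogs, each takes every other line
            add     safeline, #2            ' (frame wrap-around handling omitted)
            jmp     #:wait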
I personally prefer Chip's method due to its smaller resource footprint, but I'm concerned about its scalability, i.e. when I add a sprite rendering component to the VGA driver, will there still be enough time to generate data for the video generator? With the video+render-cog method, the render cogs don't have to wait for video generation to complete before generating the next scanline(s), as Chip's method requires; once they've copied their generated data to the scanline buffer, they can immediately start generating the next scanline(s) without having to wait for a vertical or horizontal blanking period.
So out of all this, the questions here are:
1. Is this analysis of the pros and cons of each method accurate?
2. Either way, which method (or another entirely) is better for both constructing and pushing tile-mapped, sprite-augmented video at VGA speeds?
3. When I get to developing an NTSC driver, will one of these methods be preferable to the other?
Thanks for any help!
Comments
Chip's method also can't put any color at any pixel: you get 4 colors per tile. It works very well with software sprites done in a pseudo-sprite COG that is timed to VBLANK.
The other method can put any color at any pixel, and it works by running waitvid backwards: "WAITVID pixels, #%%3210"
The pixel field is fixed and reversed, each 2-bit pixel assigned one of the 4 palette entries. The palette entries then become the pixel colors, one byte per pixel.
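Put another way (a minimal sketch, assuming vscl has already been set up for 4-pixel frames; register names are hypothetical), the operand that normally carries the palette instead carries four independent pixel colors:

            ' "reversed" waitvid: the pixel operand is a constant selecting colour
            ' bytes 0,1,2,3 in turn, so each byte of "pixels" is its own pixel colour
            rdlong  pixels, scrnptr         ' four 8-bit pixel colours packed into one long
            add     scrnptr, #4
            waitvid pixels, #%%3210         ' pixel 0 uses byte 0, pixel 1 byte 1, and so on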
This requires a lot more WAITVID instructions per line.
What happens is that the maximum pixel clock gets constrained by having only 4 pixels per waitvid, so the horizontal sweep rate and the Propeller clock determine the maximum number of pixels per line. At VGA sweep rates, this is about 256 pixels. At NTSC, one can get 320, maybe a bit more.
In addition, there is less racing the beam as the render COGS can work up to several scan lines ahead, doing tiles or a bitmap first, then masking in sprites. The more render COGS you have, the more sprites per line are possible.
The slower sweeps on TV mean more graphics are possible per line. This favors the render-COG method if full color freedom is needed; otherwise both work well, and Chip's method does a lot more with fewer COGs required.
Optimal sprites are 4 x whatever number of pixels you want in the vertical direction. Run them in pairs or groups in the sprite position list to make larger objects.
This is all the classic speed / color depth tradeoff in the Propeller software video system.
If you don't need super high resolution, nor all the colors at any pixel, Chip's method is superior. What a lot of people miss about his method is that each tile can be pointed at a region of HUB RAM. This can be used to partially buffer a screen and make effective use of higher resolutions where not enough RAM exists to represent the whole screen. Software sprites can be drawn to a small buffer, then displayed; repeat in vertical sections.
Chip's method works at one or two bits per pixel, with one of 64 possible palettes per tile.
The other method, which several of us implemented, always works in units of 4 pixels, but any pixel is any color.
At any kind of resolution, say 160 pixels horizontally or more, with one scanline per pixel, there isn't enough HUB RAM to represent the entire screen in byte-per-pixel operation. Dynamic display methods, or tiles at a minimum, are needed to draw all the pixels.
First discovered by Linus Akesson by accident, the idea is that WAITVID auto-repeats. If one times a register load just right, sequential waitvid frames are possible without any WAITVID instructions.
A user here worked the timing out. I can't remember their name... kurenko or similar.
Thanks for the great reply potato!
The WHOP method I am aware of, and I've been talking with kuroneko about it on GitHub. Once I've got my head wrapped around it I'll investigate that path, but for now I'd like to focus on the "conventional" approaches as a starting point.
I was aware of the bitmapped and tile based approaches but not the reversed waitvid one; that's pretty snazzy!
But because my project is attempting to emulate retro arcade games, the tile engine I've implemented is definitely my way forward. 4-color palettes are fine, and 6-bit RRGGBB color, for 64 possible colors, is sufficient for retro graphics.
So long story short, since my system will be capable of displaying VGA as an option alongside "CGA" (15 kHz) and possibly NTSC, I'm choosing speed over color depth. I'm outputting either a 16-pixel or an 8-pixel tile palette line per waitvid.
With this in mind, I'm thinking of the following setup...
For a 640x480 @ 60 Hz, tile and sprite based, 6-bit color video driver:
- One video cog which pulls the currently displaying scanline from main RAM
- 2 render cogs which generate their respective alternating scanlines and write them to main RAM
- One sprite cog which augments the scanline in main RAM after each render cog writes it but before the video cog grabs it
The methodology of the sprite system is still up in the air, so that's subject to change.
But based on my specs, is this paradigm valid and representative of "best practices", whether judged objectively or subjectively?
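For concreteness, the shared hub RAM for that arrangement could be laid out something like this (a sketch only; the names, the buffer count, and the 2 bpp / 640-pixel assumption are placeholders):

    CON
      RES_X      = 640
      LINE_LONGS = RES_X / 16                   ' 16 2-bpp pixels per long -> 40 longs per scanline
      BUFFERS    = 2                            ' one buffered scanline per render cog

    VAR
      long  line_buffer[LINE_LONGS * BUFFERS]   ' built by the render cogs, augmented by the sprite cog
      long  display_line                        ' mailbox: scanline the video cog is currently showing
      long  ready_line[BUFFERS]                 ' set by the sprite cog when a buffer is safe to display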
Tiles are quicker than sprites. You might want three buffered scan lines, 4 total.
All render COGS do tiles, then sprites.
Sprites are slower. A shift plus mask (read, mask, write sprite data) plus a write back to line RAM is needed for two bits per pixel. More buffered scan lines means more time.
Sprites are much quicker byte per pixel, BTW. No masks. If you start your video display one long in on the left and right, no clipping needed either. That's true for 2 and 4bpp too. Just don't display the clipped region.
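Per long of line buffer, the difference looks roughly like this (a sketch with hypothetical register names; the shift needed to align a sprite to an arbitrary pixel offset is omitted):

            ' 2 bpp: transparent sprites need a read-modify-write with a mask
            rdlong  bg, lineptr             ' 16 background pixels already rendered into this long
            andn    bg, sprmask             ' knock out the pixels the sprite covers
            or      bg, sprdata             ' merge in the sprite's pixels
            wrlong  bg, lineptr

            ' byte per pixel: opaque sprite data is simply written over the background
            wrlong  sprdata, lineptr        ' 4 pixels at a time, no mask and no read-back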
Thanks for the help!
Once you have that working, crank up the numbers on the demo program till the driver fails. Then you know what kind of failure you've got. Then you can recode around it.
That's a nifty trick, thanks for that too!
Here is the topic: http://forums.parallax.com/discussion/139960/nostalgic-80x30-vga-text-driver-with-border-now-beta/p1
My final driver is going to be an almost direct copy of the NES graphics system, with 4-color tiles and sprites of user-defined size, but with support for a wide variety of resolutions in addition to VGA, RGBS (basically CGA), and probably NTSC.
Ambitious, but because I'm targeting those relatively low resolution retro graphics the modern Propeller is more than up to the task.
So far I've gotten VGA working with tiles, and I'm now implementing the multicog method in a refactor that will allow higher tile dimensions as well as a sprite system.
Sorry to revive an old post, but, potato, I was wondering whether this limit is for the entire scan line or just the visible video area (not taking the horizontal sync region into account).
And it's 256-ish pixels @ 80 MHz. You can push it a little at 96 MHz.
Full color, one byte per pixel, requires 4 pixel waitvids, which does constrain things.
If you think about the types of games you want to implement that should help you work out how many sprites you want to support.
For example, from memory, something like Space Invaders has 11 sprites across and 5 rows deep of invaders, and they were locked into separate regions of the screen so the number could never increase. You may want to add missiles as well, plus the player sprite. Whether the display is portrait- or landscape-oriented would also affect the final sprites-per-scanline count needed.
Something like Pac-Man could be done with far fewer sprites, since there are only 5 moving things most of the time; you could possibly even use dynamically generated tiles only, though transparent sprites can make the game logic much easier.
A fighting game with two large characters side by side may need lots of sprites per scan line, as would a Gauntlet-type game where almost every cell position on a scanline can be occupied by a moving creature.
As a starting point, I think about 16 sprites per scanline is a nice round number for 16x16 sprites if your scanlines are, say, 288 pixels wide, as it almost fills the scanline with sprites over the tiles. That would allow many retro games. You also need to think about the total number of sprites per screen as well.
I've been able to get over 16 16x16-pixel, 16-color sprites (from a 64-color palette) per scanline in my own customized VGA implementation, with an 80 MHz P1 dedicated *entirely* to video render and display at 288x224 resolution with line multiplication, based off some of Marco Maccaferri's earlier work. My spreadsheet tells me that I could sustain up to 21 16x16 sprites or 36 8x8 transparent sprites per scanline, drawn on top of the 8x8 tiles, if I tweak the VGA timing and run at 50 Hz, though I've not tested it that far. With some other optimizations I still hope I might be able to free up another COG for rendering and then boost it up to 29 (16x16) to 50 (8x8) one day, just to see if it can be done.
It's also possible to support hi-color palettes (15 bpp palette entries), though that consumes an additional COG for display and some more scanline and palette memory. But with that you can get some very nice color gradients for sky backgrounds etc. I built up a 15 bpp resistor DAC for that once and tried it out. Looks cool.
Finally, if you only use a single P1 for everything, you won't have as many COGs available for rendering and will have to compromise on memory use as well. I think a dual-Prop implementation is really neat for retro games, with one Propeller just for video and the other for graphics/code storage, running the game logic, reading user input, and generating the sound fx/music using mixed outputs from all the different emulated vintage audio chips. It's a pretty powerful combination.
Cheers,
Roger.
Like this for example?
http://forums.parallax.com/discussion/164043/the-p8x-game-system
Precisely! Your board is a great example of how two Props can be put to good use for doing retro games, Marco. I was particularly impressed when I saw it a while back, as well as by all the code you'd integrated together.
In one of my own board designs some time ago, after being inspired by the Uzebox project, I used a Propeller and an ATMega2651 AVR acting like your second Prop for controlling game logic etc., but using two Propellers alone is far nicer for all the extra audio support you can get, and nowadays the extra GCC support for the Propeller makes things better there too for game development. There's also some potential scope to use SQI flash for holding much larger programs, despite needing a slower caching driver for memory access compared to running native or LMM-based code.
Cool stuff.
Roger
Hey I'm definitely going to be picking your brain soon on some video implementation details!
@potatohead, at 80 MHz, with 7 cycles per waitvid, 4 pixels per waitvid, and 640 pixels per visible line (640 x 480 @ 60 Hz): that's 160 waitvids per line at 87.5 ns per waitvid, or 14 microseconds to display a visible line. That's well under the ~25 microseconds the standard allots for the visible portion of a line. Even adding a 4-cycle instruction alongside each waitvid, such as a shift, only brings it up to 22 microseconds.
Where is this 256 pixel visible resolution limit at "full color" coming from? I must be missing something obvious :P
Add in the HUB transfers, index updates, etc., and it all limits things quickly.
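A rough budget, using the 7-cycle waitvid figure from above plus the 8-to-23-clock cost of a hub read, shows where the time goes (an illustrative inner loop with hypothetical register names):

            ' 80 MHz, 640x480 sweep: the visible line is ~25.4 us, about 2030 clocks.
            ' 320 pixels at 4 per waitvid = 80 groups -> ~25 clocks per group;
            ' 256 pixels = 64 groups -> ~32 clocks per group.
:loop       rdlong  pixels, hubptr          ' 8..23 clocks depending on hub window alignment
            add     hubptr, #4              ' 4 clocks
            waitvid pixels, #%%3210         ' ~7 clocks when it doesn't stall
            djnz    groups, #:loop          ' 4 clocks -> 23..38 clocks per group

So 256 pixels fits comfortably at 80 MHz, while 320 generally needs a faster clock or a loop with no waitvid in it at all.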
Baggers and I managed 320 pixels at 96 MHz. If the timing works out, using a WHOP (WAITVID hand-off point) could yield more. That technique was known, but not exploited, when I did this. Basically, you get rid of the WAITVID instructions entirely during active scan! Launch one, and it will just repeat, grabbing what it sees on the D and S register busses at a specific time. You make sure an instruction holds the right values at that time; no WAITVID needed.
BTW, the slowest sweep still supported by most modern displays is 640x480 interlaced. Many displays will just deinterlace it today, but that is your maximum scanline time in VGA. We used that sweep frequency to get maximum pixels, again at 96 MHz. We decided to stick with 256 for those efforts, as 80 MHz setups are standard.
To get longer amounts of time, and more pixels per line, TV / CGA is needed. The old 320x200-ish VGA modes are no longer directly supported by most monitors, though many TV sets with VGA input will display them.
TV does 320-plus easily.
By all means, prove me wrong.
For my NTSC sprite driver I had each cog draw a single line to local cog RAM (although for color resolution rather than horizontal resolution). However, I'm not certain that even this would work, as you'd need a 3-instruction WAITVID loop.
If you get the timing to work out, you can basically leave the WAITVID out of the pixel loop, just doing HUB fetches inline.
What you need is a faster pixel rate, not more time. The pixel rate, or clock, is the limiting factor here.
Might get 320 pixels at 80 MHz that way.
Critical inner display loop code then just looks like this:
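(A rough sketch of the shape such a loop takes, with hypothetical register names rather than the exact listing from that driver, assuming a byte-per-pixel scanline buffer in hub RAM:)

            ' One waitvid is issued before the active region; after that the video
            ' generator re-arms itself each frame and, at the hand-off point, latches
            ' whatever is on the cog's D and S busses.  The loop is padded to exactly
            ' one frame period so the marked instruction always lands on that clock.
:loop       rdlong  colours, hubptr         ' next 4 pixel colours from the hub scanline buffer
            add     hubptr, #4
            cmp     colours, pattern        ' <- WHOP: D bus = colours, S bus = %%3210 pattern
            djnz    count, #:loop           ' pad with nops as needed to hit the frame period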
Timing is a bit tricky, and you have to carefully align the WHOP with the hub cycle at the right point once at initialization time to make such code work. But it's certainly doable, and once correct it stays locked. In fact I find it more repeatable than relying on the PLL locking in Chip's drivers, which sometimes doesn't sync up correctly after every reset and then gives you fuzzy text on a hi-res VGA monitor (an LCD might re-clock things and hide the effect). The nice thing is this leaves 7 COGs free for everything else, including rendering the entire scanline from the text/sprite buffers in hub RAM, and another benefit is that you don't have to deal with syncing up multiple display COGs, which can also be a rather tricky thing.
-Phil