Waitvid instruction - short delay between it?

John A. Zoidberg · 2014-01-26 05:12

Hello there,

With my postgraduate studies nearing completion, I'm revisiting the Propeller in my free time. I'm trying out a bit of assembler, and I'm curious about the waitvid instruction, as I'm working on a GUI terminal at my part-time job.

The manual says that it takes 4 cycles and can stream a group of 4 pixels with a palette of four 4-bit colors, or two 8-bit colors, if I'm reading the manual properly. (There is no mention of a palette anywhere; I assume the first operand of waitvid is a long containing four or two colors for the pixels in the other operand, which looks like a palette to me.)

I checked some examples in the VGA test source code and found that waitvid has to be issued repeatedly to put a number of pixels on the scanline. For example, if I want to stream 256 pixels on a horizontal scanline, I'd do:

(pseudocode in active H-Sync area - 25.6uS)

for(count = 0; count < 64; count++)
    waitvid color_palette, pixels

In this loop, won't there be a 4-T (4-cycle) delay each time waitvid is invoked? Wouldn't the screen then be drawn like this:

(one box means one pixel )

                               waitvid color_palette, pixels ..............                            waitvid color_palette, pixels .............. 
                               ^                                                                       ^ 
+--+--+--+--+--+--+--+--+--+--+-----------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+-----------+--+--+--+--+--+--+--+--+--+--+
|  |  |  |  |  |  |  |  |  |  |           |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |           |  |  |  |  |  |  |  |  |  |  |
+--+--+--+--+--+--+--+--+--+--+-----------+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+-----------+--+--+--+--+--+--+--+--+--+--+

I may assume that the line is uneven between the waitvids, but I could be wrong as I haven't tested it on the real Prop yet.

Would this happen, or would the instruction/processor try to keep the pixel lengths even? A 4-clock waitvid takes 0.05uS (at 80MHz), but the pixel clock period might be shorter.

I hope you can understand what I'm asking - it's quite an awkward question that I should have asked years ago.

Mark_T · 2014-01-26 05:33

All the waitxxx instructions take 7 or more cycles (occasionally 6 or more) until the cog is ready to run the next instruction. In other words, consecutive
waitvids take 7 cycles minimum in practice.

I'd just assume it takes 7+ cycles when calculating latencies. It's particularly tricky to handle the case of using a hub instruction
between waitvids, since the hub synchronisation and the waitvid synchronisation are unrelated and may consume 30 cycles
between them (7.5 ordinary instructions' worth).

kuroneko · 2014-01-26 06:21

Execution time for waitvid doesn't affect pixel timing (no gap). OTOH, if it's too short or not called often enough you affect what is actually displayed (invalid pixel data).

Ariba · 2014-01-26 10:07

WAITVID works differently than you think. WAITVID does not wait until all the pixels are shifted out.
There is a hardware video shifter in every cog which shifts the pixels out independently of the CPU (and with a separate clock that comes from counter A). WAITVID just fills the shifter with new values at exactly the point when the previous pixel data has been shifted out.
You need to invoke WAITVID a bit before that happens, and WAITVID waits only until the old data has finished and it can write the new pixel and color longs to the shifter. Then you have time to fetch the new pixel and color data from hub or elsewhere while the video shifter does its work. The time the CPU needs for this fetching must be at least 7 cycles shorter than the time the video shifter needs to shift out the previously loaded data.
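
As a rough illustration of that budget (a sketch, not a measurement): the function names below are made up, the 7-clock hand-off is the estimate quoted in this thread, and the 10 MHz pixel clock simply models the 256 pixels in 25.6uS from the original question.

```python
# Rough cycle budget between two WAITVIDs, following the rule of thumb above.
# Assumptions: the video clock (PLLA) and system clock rates are known, and
# 7 system clocks is used as the WAITVID hand-off cost quoted in this thread.

WAITVID_OVERHEAD = 7  # system clocks (an estimate, not a datasheet value)

def frame_system_clocks(pixels_per_waitvid, pll_clocks_per_pixel, pll_hz, sys_hz):
    """System clocks the video shifter needs to shift out one WAITVID's pixels."""
    return pixels_per_waitvid * pll_clocks_per_pixel * sys_hz / pll_hz

def cpu_budget(pixels_per_waitvid, pll_clocks_per_pixel, pll_hz, sys_hz):
    """System clocks left for fetching the next pixel and color longs."""
    frame = frame_system_clocks(pixels_per_waitvid, pll_clocks_per_pixel,
                                pll_hz, sys_hz)
    return frame - WAITVID_OVERHEAD

# The original question: 4 pixels per WAITVID at 100 ns per pixel,
# modeled here as 1 PLLA clock per pixel at 10 MHz, system clock 80 MHz.
question = cpu_budget(4, 1, 10_000_000, 80_000_000)

# The same pixel rate with 16 pixels per WAITVID.
wider = cpu_budget(16, 1, 10_000_000, 80_000_000)

print(question, wider)
```

Under these assumptions, 4 pixels per waitvid leaves about 25 clocks per loop (roughly six 4-clock instructions), while 16 pixels per waitvid leaves about 121.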

One waitvid writes one color long and one pixel long; how many bits/pixels are really used by the video shifter is defined by the VSCL and VCFG registers. This can be 1 to 32 pixels in two-color mode and 1 to 16 pixels in four-color mode; more do not fit in a long.
And yes, the color long contains a palette or CLUT with up to 4 colors, 8 bits each. The resistors in the VGA circuit define how these 8 bits are mapped to the R-G-B components. Most drivers use only 6 bits: 2 R, 2 G and 2 B. So you have 64 different colors.
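
To make the palette layout concrete, here is a small sketch. It assumes the common VGA wiring where each color byte is %RRGGBBHV (two bits each of red, green and blue, then the H and V sync bits); other boards may map the bits differently, and the helper names are invented for illustration.

```python
# Sketch: build a WAITVID color long from up to four 8-bit palette entries.
# Assumption: each color byte is laid out as %RRGGBBHV, with the two sync
# bits held at 1 during active video. Check your own board's wiring.

def vga_color(r, g, b, hv=0b11):
    """Pack 2-bit r, g, b values and the two sync bits into one color byte."""
    for value in (r, g, b, hv):
        if not 0 <= value <= 3:
            raise ValueError("each field is 2 bits (0..3)")
    return (r << 6) | (g << 4) | (b << 2) | hv

def color_long(c0, c1=0, c2=0, c3=0):
    """Color for pixel value 0 goes in the lowest byte, value 3 in the highest."""
    return c0 | (c1 << 8) | (c2 << 16) | (c3 << 24)

black = vga_color(0, 0, 0)
white = vga_color(3, 3, 3)
red = vga_color(3, 0, 0)
blue = vga_color(0, 0, 3)

palette = color_long(black, white, red, blue)

# Count the distinct colors reachable with 2 bits per component.
distinct = len({vga_color(r, g, b)
                for r in range(4) for g in range(4) for b in range(4)})

print(hex(palette), distinct)
```

In two-color mode only the two lowest bytes of such a long would be used.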

Andy

John A. Zoidberg · 2014-01-26 17:33

Thanks for the info. I'll try to experiment around this. Meanwhile, the Propeller reminds me of the NES PPU, but on the Prop there is no direct control of sprites/tiles and these have to be rendered per line. Maybe it is like Atari's Television Interface Adapter too.

potatohead · 2014-01-26 18:16

It's similar.

The TIA had a few basic objects. Like the TIA, a Prop races the beam to generate the display. Unlike the TIA, you get to do whatever you want and have time for during the display draw cycle.

A very simple Propeller display is a bitmap. Really, it's just a small loop! Move the pixel data from the HUB into a COG memory location, move the color info into another one, issue the waitvid and while it's drawing, increment your counters and pointers and go do another one.

And that's really the key. Once you turn waitvid on, it's always drawing something to the screen, including sync signals, etc... and your program is always fetching the next thing waitvid needs. The trick is to get done before waitvid does.
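
A toy model of "get done before waitvid does" (not cycle-accurate). It assumes the shifter reloads at a fixed interval, that the loop body costs a fixed number of clocks, and that the hand-off costs the 7 clocks estimated earlier in the thread; the function name is hypothetical.

```python
# Toy model of the loop-versus-shifter race. The shifter reloads every
# `frame_clocks` system clocks whether or not the CPU is ready. If the CPU
# arrives late, that reload shows invalid data and the WAITVID has to wait
# for the following reload instead.

def stale_frames(frame_clocks, work_clocks, iterations, handoff=7):
    """Count reloads that happened before the CPU had new data ready."""
    now = 0                     # CPU time in system clocks
    stale = 0
    next_reload = frame_clocks  # time of the next shifter reload
    for _ in range(iterations):
        ready = now + work_clocks + handoff
        while next_reload < ready:
            stale += 1                   # shifter reloaded without fresh data
            next_reload += frame_clocks
        now = next_reload                # WAITVID releases at the reload point
        next_reload += frame_clocks
    return stale

print(stale_frames(32, 20, 10))  # loop fits inside the frame
print(stale_frames(32, 28, 10))  # loop is too slow for the frame
```

With a 32-clock frame, a 20-clock loop body never misses a reload in this model, while a 28-clock body misses one on every pass.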

Sprites and tiles only exist if you write the code for them. As far as the Prop waitvid is concerned, neither exist. All a Propeller does is stream color and pixel data to the display. Sync signals and such are simply "colors" that you don't use in the active screen area. But to the Prop, it's all the same. The stream never stops, unless you shut it all down with VCFG; otherwise, it's always running.

You should go and look at the TV.spin program that comes with the Propeller Tool. In that one, Chip set up tiles, and each tile can have a 4 color palette, which allows a person to get all the Propeller colors onto the display screen, 4 per tile. That driver is very flexible, and it's tile based. Once it's running, your higher level program need only supply and manipulate tile addresses, color palettes, etc... Otherwise, the TV COG looks like a graphics chip to any other COG, just doing its thing.

For more advanced capability, most of us use scan line drivers. That means one or more scanlines are built in HUB memory prior to the display drawing them. The more scanlines you buffer, the longer you have to assemble all the graphics per scan line. Sprites and such get read from their lists, pixel data fetched, masked off and written to the scan line. The more COGS you have doing this, the more sprites you can display. Again, to a higher level program, all of it looks like a nice graphics chip with multi-sprite capability.
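
The per-line work described above might be sketched like this, at one byte per pixel. It is only an illustration: color 0 standing for "transparent", the sprite tuple layout and the name render_scanline are all assumptions, not taken from any particular driver.

```python
# Sketch of building one scanline in a buffer: background first, then each
# sprite that touches the line, copying only its opaque pixels.

TRANSPARENT = 0  # assumed "see-through" color value

def render_scanline(y, width, background, sprites):
    """Return the pixel bytes for line y."""
    line = bytearray(background[y][:width])
    for sx, sy, rows in sprites:
        row_index = y - sy
        if not 0 <= row_index < len(rows):
            continue                      # sprite does not touch this line
        for i, pixel in enumerate(rows[row_index]):
            x = sx + i
            if pixel != TRANSPARENT and 0 <= x < width:
                line[x] = pixel           # mask: only opaque pixels overwrite
    return bytes(line)

background = [bytes([1] * 8) for _ in range(4)]
sprite = (2, 1, [[0, 5, 5, 0], [5, 5, 5, 5]])   # x=2, y=1, two rows of 4 pixels

line0 = render_scanline(0, 8, background, [sprite])
line1 = render_scanline(1, 8, background, [sprite])
line2 = render_scanline(2, 8, background, [sprite])
print(list(line0), list(line1), list(line2))
```

On real hardware the inner loop would work on whole longs with shifts and masks rather than one pixel at a time, which is why byte, word or long aligned sprites are cheaper.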

In the end, it's all software. Whatever you put into your video loop is whatever the Propeller video does. And that means a Propeller can emulate most any retro-computer display quite nicely!

Typically, it's best to take some of the existing drivers or templates and get one of those running on your setup. Usually these display color bars, or some basic bitmap, characters, whatever. Then you figure out what your display needs to be and modify the template to fit, making one change at a time so you can see the result. Writing the code for the signal part of the display, sync, etc... is the hardest. Use a template for that and then you can focus on your inner video display loops which deal with the active pixels and colors.

John A. Zoidberg · 2014-01-26 19:19

@Potatohead - Thanks for the info too.

Writing the tiles is a straightforward task, I assume, but I can't say the same for sprites. Looking back at the NES PPU (Picture Processing Unit), the circuit is really complicated, as it involves directly driving 32 or 64 sprites, each with its own X and Y coordinate. I was hoping the Prop2 has some sort of "block copy" instruction (I remember seeing it when I last read the blog a few months ago) which stamps an array without much processor intervention.

Right now, all I can think of for generating sprites is to squeeze some instructions into the horizontal blanking, but that interval is really short. I'm not even sure the Prop could calculate the location of each sprite in that interval.

As I have not built the Prop hardware yet, I'll have to do that and experiment one step at a time.

I chose the Prop because I prefer better integration without too many wires and extra components snaking around. I could do it with the NES clone PPU (UA6528), but it is not RGB and it needs two supporting chips (one latch and an SRAM) and a lot of wires, so it's not very convenient.

potatohead · 2014-01-26 19:37

You can't really fit them in during HBLANK on a P1.

The best way to do it is to set up a few scan line buffers, one per sprite COG. Assign each sprite COG its own buffer. Then, have the sprite COG quickly fetch the tiles into the scan line, then have it grab the sprite data, mask it, and drop it into the scan line as well. It's best to make sprites an even byte, word or long, so you aren't doing anything you don't have to.

The main video COG then manages all of those things. You launch it, and it launches as many sprite COGS as you want. It assigns buffer addresses, sets up the video signal, draws the display, etc...

Your higher level program modifies values, updates its sprite lists, etc...

If you go looking, Baggers and I put together a fine sprite + tiles driver... I started and he added sprites and we both got other issues sorted. http://forums.parallax.com/showthread.php/127123-Full-Color-Tile-Driver-Thread/page7

That driver can very easily drop a couple hundred 4 pixel x 8 pixel sprites onto a screen, 40-60 / scanline possible with 4-5 sprite COGS. The number on screen isn't the limiting factor. Number that occupy the same scan line is. Fewer sprite COGS = fewer sprites per line.

This thread will tell you a lot. Have fun!

potatohead · 2014-01-26 19:46

Oh, you mentioned the P2. The overall basic approach can be the same. P2 has a pixel engine that can make many sprite operations easier, and it's got a HUB write mode that can mean avoiding a masking operation. Just FYI, @60MHz, a P2 can fetch an entire scanline worth of data in just a fraction of the scan line! At speed, you could easily do it in the over scan, leaving the rest of the scan line to do all sorts of things.

If you were to do sprites P1 style, a single sprite COG will be able to render quite a lot of sprites. Again, that thread will tell you a lot about how it's built. P1 or P2.

Nice thing about P2 is you can generally put an entire active video display into one waitvid. P1 has to work in little long sized chunks, meaning the whole display is sprinkled with waitvids and precise timing requirements. A P2 can work with a fraction of those, and it can operate from the in COG CLUT or AUX memory, which can be filled in advance by the video COG itself. Simple backgrounds, tiles and other kinds of things work in a single COG, even at high color depths. At the real chip speed, basic sprites will fit into one COG as well.

John A. Zoidberg · 2014-01-27 00:36

potatohead wrote: »

Oh, you mentioned the P2. The overall basic approach can be the same. P2 has a pixel engine that can make many sprite operations easier, and it's got a HUB write mode that can mean avoiding a masking operation. Just FYI, @60MHz, a P2 can fetch an entire scanline worth of data in just a fraction of the scan line! At speed, you could easily do it in the over scan, leaving the rest of the scan line to do all sorts of things.

If you were to do sprites P1 style, a single sprite COG will be able to render quite a lot of sprites. Again, that thread will tell you a lot about how it's built. P1 or P2.

Nice thing about P2 is you can generally put an entire active video display into one waitvid. P1 has to work in little long sized chunks, meaning the whole display is sprinkled with waitvids and precise timing requirements. A P2 can work with a fraction of those, and it can operate from the in COG CLUT or AUX memory, which can be filled in advance by the video COG itself. Simple backgrounds, tiles and other kinds of things work in a single COG, even at high color depths. At the real chip speed, basic sprites will fit into one COG as well.

I'm waiting patiently for the P2. However, I do wish they had a smaller version of it, like a DIP40 one. The only current graphics chip that comes close to the Propeller is the FTDI Eve, but it is only available in a QFP48 package, there is no breakout board, and it only supports small TFT screens for now. Or you could use the NES clone PPU (widely available on eBay or from Chinese suppliers), but then you need a latch, SRAM and wires.

I'm also reading Baggers' drivers and their treatment on sprite rendering. I learned a lot on that too.

potatohead · 2014-01-27 09:10

You haven't mentioned what graphics you want to do. P1 isn't going anywhere and it will continue to be hard to beat as a drop in graphics capable device.

ericball · 2014-01-27 11:44

Just some additional notes.

1. My notes say WAITVID is 6 cycles, and I seem to recall testing this successfully with NTSC. (5 cycles would glitch occasionally.)

2. Although 32 pixels @ 1bpp and 16 pixels @ 2bpp per WAITVID are typical, it's possible to display more or fewer pixels per WAITVID. More pixels simply duplicates the last pixel. The pixels can also be any number of FRQA clock cycles. (It's also possible to switch between 1bpp and 2bpp modes on the fly, but care is required.)

3. Cycle count your HSYNC start & ends carefully and remember the PASM cycles before the WAITVID have to complete in the previous WAITVID interval. i.e. The time to fetch the first LONGs of active pixels has to fit in the "back porch" interval; similarly, any logic after the last LONGs of pixels has to get to a WAITVID before the pixels have been displayed.
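
As an illustration of note 2, here is a sketch of how the pixel and frame lengths might be packed into VSCL. The bit layout is an assumption from memory (PixelClocks in bits 19..12, FrameClocks in bits 11..0, both counted in PLLA clocks); verify it against the manual before relying on it. The helper names are invented.

```python
# Sketch: pack a VSCL value and see how many pixel periods a frame holds.
# Assumed layout: bits 19..12 = PixelClocks, bits 11..0 = FrameClocks.

def vscl(pixel_clocks, frame_clocks):
    """Combine clocks-per-pixel and clocks-per-frame into one register value."""
    if not 1 <= pixel_clocks <= 255:
        raise ValueError("pixel_clocks must fit in 8 bits")
    if not 1 <= frame_clocks <= 4095:
        raise ValueError("frame_clocks must fit in 12 bits")
    return (pixel_clocks << 12) | frame_clocks

def pixel_periods(pixel_clocks, frame_clocks):
    """Pixel periods per WAITVID; past the long's capacity the last pixel repeats."""
    return frame_clocks // pixel_clocks

typical_2bpp = vscl(1, 16)   # 16 pixels per WAITVID, 1 PLLA clock each
stretched = vscl(2, 40)      # 2 clocks per pixel, 40-clock frame

print(hex(typical_2bpp), hex(stretched), pixel_periods(2, 40))
```

In the second case a 2bpp frame spans 20 pixel periods, so per note 2 the 16th pixel would be held for the remaining four.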

Also see http://propeller.wikispaces.com/Video+Generator

John A. Zoidberg · 2014-01-27 19:10

potatohead wrote: »

You haven't mentioned what graphics you want to do. P1 isn't going anywhere and it will continue to be hard to beat as a drop in graphics capable device.

Actually, it's more like a GUI with a pointer/cursor in an experimental automobile system. There are those speedometer and battery life and other things being monitored in the car, so it must be printed out on the monitor/screen.

potatohead · 2014-01-27 22:04

That seems like a great P1 project. Will the monitor be VGA? Do you really only need a mouse pointer and perhaps some dial indicators?

How many colors?

Here's the thing about sprites. Color lookups on Propeller are expensive. We've got to fetch the data, compute a lookup, go and fetch the color, then shift, then mask, then write back to the display scan line buffer. This takes a long time and we've generally avoided it and that's generally been true because most people doing sprites were wanting to do games / emulation type projects.

This means either choosing a tile / color scheme, such as 4 colors per tile and sharing one of those colors for the mouse pointer, or it means writing your basic video engine in full color so that sprites can be any color, as can the tiles.

The advantage of the first method is less RAM required overall and generally higher resolutions, but you get "color clash" where the pointer crosses tiles that have different color definitions. See the HYDRA HEL driver discussion for how this all works. And BTW, if you don't need too many colors at once per display element, that HEL driver rocks. HEL is a modification of the Parallax TV.spin driver done by Andre' and team for the HYDRA project years ago. It never saw too much use, but like the Parallax driver Chip did, it can do way more than most people think.

The full color method has the advantage of being able to do any pixel any color. But it costs you both resolution and COGS to drive sprites. And it costs RAM too. We don't have an efficient 16 color display method, because the waitvid only does 1, 2 or 8 bits per pixel easily. The other advantage is sprites are a no brainer. You just have to fetch them, shift, mask and write to the buffer. So it's a full byte per pixel. Kind of expensive, unless your resolutions are modest, or you plan on a lean tile scheme, or lightly populated display, etc...

Anyway, if you just want a single mouse pointer sprite, you could opt to do it in software too. For this approach you develop or modify a tile only driver. That takes one COG. If you can pay the RAM cost, do it full color.

Treat the tile display like a tile display. Build up all your display elements, etc... Tiles are cool and you can re-use them and stack them in RAM saving ways.

For your sprite, do it during the VBLANK. There is lots of time there. What you do is keep a few spare tiles for a buffer. Four is probably all you need to cover the edge case where the mouse impacts 4 tiles at once.

During VBLANK it goes like this:

0. Restore screen tiles from the hidden buffer to erase the sprite. At this moment your screen is perfect, sans sprite.
1. Copy the target tiles to the hidden buffer, so you have them for next frame.
2. Calculate the sprite position relative to the tiles
3. Shift sprite into position.
4. Write it into the on-screen tiles! Now your screen has a sprite in it.

Repeat.
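
The restore/save/draw order above can be sketched as follows. To keep it short this uses a flat byte-per-pixel screen rather than tiles, assumes color 0 is transparent, and assumes the pointer stays fully on screen; the class and method names are hypothetical.

```python
# Sketch of the per-frame software sprite cycle, run once per VBLANK.

class SoftSprite:
    def __init__(self, screen, width, image):
        self.screen = screen    # bytearray of width * height pixels
        self.width = width
        self.image = image      # list of rows; 0 means transparent
        self.saved = None       # hidden buffer: (x, y, saved rows)

    def _row_offsets(self, x, y):
        return [(y + r) * self.width + x for r in range(len(self.image))]

    def vblank(self, x, y):
        w = len(self.image[0])
        # Step 0: restore what the sprite covered last frame.
        if self.saved is not None:
            old_x, old_y, rows = self.saved
            for off, row in zip(self._row_offsets(old_x, old_y), rows):
                self.screen[off:off + w] = row
        # Step 1: save the area about to be covered.
        offsets = self._row_offsets(x, y)
        self.saved = (x, y, [bytes(self.screen[o:o + w]) for o in offsets])
        # Steps 2-4: place the sprite and write only its opaque pixels.
        for off, row in zip(offsets, self.image):
            for i, pixel in enumerate(row):
                if pixel:
                    self.screen[off + i] = pixel

screen = bytearray([1] * 32)                 # 8 x 4 screen filled with color 1
pointer = SoftSprite(screen, 8, [[0, 9], [9, 9]])
pointer.vblank(1, 1)
pointer.vblank(4, 0)
print([i for i, p in enumerate(screen) if p != 1])
```

After the second call only the new pointer position differs from the background, which is the point of restoring before saving.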

Software sprites are pretty nice this way. It's likely the video COG could do this too. Lots of time in the VBLANK.... If you are careful, you might get away with one COG.

ericball · 2014-01-28 05:37

John A. Zoidberg wrote: »

Actually, it's more like a GUI with a pointer/cursor in an experimental automobile system. There are those speedometer and battery life and other things being monitored in the car, so it must be printed out on the monitor/screen.

Sounds like a job for a standard 4 color tile driver. 1 color for background (which could change for alerts etc), 1 color for the pointer (again, with the ability to change color) and 2 colors per tile for general graphics. Keep the graphical elements in separate tiles and you can use any two colors for the graphical elements. As Doug (potatohead) says, the cursor is handled as a set of 4 tiles which are mapped to replace the "normal" tiles.
