Deconstructing a video driver

idbruce · 2011-08-18 07:46

Oh boy has this been complicated.

Dr_Acula

I am sure it has been

Especially since you are tackling both hardware and software issues. Just the software associated with displaying graphics can be very difficult, not to mention hardware issues.

I wish I had more time to learn these kinds of things.

Bruce

ericball · 2011-08-18 08:26

Just for kicks, let's cycle count the inner loop of the TV driver:

:xloop2                 rdlong  pixels,ptr                  ' 22
                        add     ptr,#4                      ' 4
                        xor     pixels,phaseflip            ' 4
                        waitvid pixels,#%%3210              ' 6
                        djnz    x,#:xloop2                  ' 4
                        jmp     #:done                      ' 40 cycles per 4 pixels

(Note: the jmp isn't required since everything between it and :done is commented out.)
40 cycles at 80MHz = 0.5usec, which is about 28 PLLA cycles (NTSC, 57.272MHz). So as long as your tv_hx is more than 7, you're safe. The trickier bit is figuring out whether you are exceeding any other VSCL timing windows. If VSCL.FrameClocks / 57.272MHz < CPU clocks between the next two WAITVIDs / 80MHz, then bad things will happen. For example:

                        waitvid pixels,#%%3210              ' 6
                        djnz    x,#:xloop2                  ' 4+4
                        jmp     #:done                      ' 4
:done
                        add     current_line,#1             ' 4  
                        wrlong  current_line,_nextline      ' 22  
                        
                        mov     vscl,hf                     ' 4
                        mov     tile,phaseflip              ' 4
                        xor     tile,bordercolour           ' 4
                        waitvid tile,#0                     ' 56 cycles between WAITVIDs

So, again, you're safe if tv_hx > 10 ( 56 / 80 * 57.272 / 4 = 10.02 ). Hmm... you have hx = 10 for NTSC, which is just under that, so there may be cases where you won't meet the timing requirements, which will extend the line length. CRTs probably won't mind, but LCDs probably will.
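The two checks above can be reproduced with a few lines of arithmetic. This is only a sketch: it takes the worst-case cycle counts and clock rates quoted in the post at face value, and the function names are made up for illustration.

```python
# Worst-case WAITVID timing check, using the figures quoted in the post.
CLKFREQ_HZ = 80_000_000   # system clock (80MHz)
PLLA_HZ = 57_272_000      # NTSC PLLA rate (57.272MHz)

def min_clocks_per_pixel(cpu_cycles, pixels_per_waitvid=4):
    """PLLA clocks per pixel at which one frame lasts exactly as long as the code."""
    frame_clocks = cpu_cycles / CLKFREQ_HZ * PLLA_HZ
    return frame_clocks / pixels_per_waitvid

def meets_timing(hx, cpu_cycles, pixels_per_waitvid=4):
    """True if a frame of hx * pixels PLLA clocks outlasts the code between WAITVIDs."""
    frame_time = hx * pixels_per_waitvid / PLLA_HZ
    cpu_time = cpu_cycles / CLKFREQ_HZ
    return frame_time >= cpu_time

inner = min_clocks_per_pixel(40)  # inner pixel loop
tail = min_clocks_per_pixel(56)   # last pixel group up to the border WAITVID
print(f"inner loop: hx must exceed {inner:.2f}")
print(f"line tail:  hx must exceed {tail:.2f}")
print("hx = 10 ok for inner loop:", meets_timing(10, 40))
print("hx = 10 ok for line tail: ", meets_timing(10, 56))
```

With these inputs the line tail is the tighter constraint, so trimming a few cycles between the last pixel WAITVID and the border WAITVID (or raising hx) would be the first thing to try.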

dr hydra · 2011-08-18 10:16

@ericball

I posted a question yesterday (with no replies) about a problem with the dy hydra demo included with the hydra board...

After reading your post "CRTs probably won't mind, but LCDs probably will"...it looks like part of the problem is related to CRTs and LCDs...the program works great on the CRT...but bounces up/down on my LCD...can you look at my post and tell me why the vertical lines are changing or bouncing?

Thanks

Dr_Acula · 2011-08-24 07:38

So close - yet so far!

See the screenshot - this is pulling data off the ram in real time. The nominal display is 256 wide and 224 high. Rainbow colors and then a diagonal white and black line (just something random to test some pixels).

The max width I can go to is 156 before the timing goes haywire.

Options? Optimise the code more (the readblock is in 4 parts because I wanted the flexibility to go to more than 256, but if it was a fixed 256 then it could read in one block instead of 4). Maybe a higher xtal speed?

idbruce · 2011-08-24 07:44

Dr_Acula

Yep, it looks close! I am rootin' for ya.

Bruce

Dr_Acula · 2011-08-24 23:22

Thanks for the words of encouragement, idbruce. Your support has encouraged me to think about this further.

Ok, if it fails to work when reading the ram and when there are more than about 170 pixels per line, then it is never going to be fast enough to read a full screen out of ram.

So I am working on a hybrid solution where half the screen is stored in hub and half is stored in external ram.

At the start of a frame, the video driver starts reading the first line from hub and displaying it. At the same time, the ram cog starts reading a line into a single line buffer in hub. The video driver will get to the end of its line first, and will then start reading this single line buffer. The video cog will be catching up to the ram cog, but it won't quite catch up before the ram cog has written that line. I think I can say that because 170 is more than half of 256, which means the ram cog is more than half the speed of the video cog.
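The "more than half speed" argument can be checked with a small model. This is a sketch under simplifying assumptions that are mine, not the post's: the ram cog writes at a steady rate, it starts at the same moment the video cog starts the preceding hub line, and a pixel counts as ready only once it is fully written. The function name is hypothetical.

```python
LINE_PIXELS = 256          # display width from the post
RAM_PIXELS_PER_LINE = 170  # pixels the ram cog manages in one line time (from the post)

def reader_overtakes_writer(line_pixels, ram_pixels_per_line):
    """Return the first pixel the video cog would read too early, or None if it never does."""
    # Time unit: one pixel period of the video cog.
    # Writer (ram cog) starts at t = 0; pixel i is ready at (i + 1) * line_pixels / rate.
    # Reader (video cog) starts one line later and needs pixel i at t = line_pixels + i.
    # Both sides are multiplied by the rate so the comparison stays in integers.
    for i in range(line_pixels):
        ready_scaled = (i + 1) * line_pixels
        needed_scaled = (line_pixels + i) * ram_pixels_per_line
        if ready_scaled > needed_scaled:
            return i
    return None

print(reader_overtakes_writer(LINE_PIXELS, RAM_PIXELS_PER_LINE))  # ram cog stays ahead
print(reader_overtakes_writer(LINE_PIXELS, 100))                  # too slow: reader catches up
```

In this model the writer only has to finish a line within two line times, which is why anything comfortably above half speed is enough.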

Things like changing the ram block need to be fitted in, but I have a working solution that reads two ram blocks and is printing 64 rows with the full 256 wide pixels. So I am hoping that by changing the order of a few routines I can get more lines working.

potatohead · 2011-08-25 00:17

Hi Dr_A

If I were you, I would restructure your code just a little bit. Right now, if the fetch from RAM gets behind, the video signal breaks down. Dynamically drawing and fetching things in real time is the leanest code, but not always the most robust. Failure modes are severe.

Try it this way.

There will be a signal cog and a fetch cog (which I'll also call the render cog). The signal cog will render to the TV; the fetch cog will get the data from the RAM. Right now, the signal cog only needs a few changes to read from a buffer, with that buffer address specified in the HUB, and it needs to write a zero to another location in the HUB, so that the render cog can stay in step with it.

Get your video signal cog rendering out of a buffer in the HUB. Put some test data in the buffer, such as a color rainbow, black and white lines, etc... Whatever works. This code should stand alone and just draw vertical bars to the screen.

Define two of these buffers.

The key from here is to have your fetch cog operating nearly all the time.

What I like to do is initialize the render cog first. It fetches a scan line's worth of data, from the RAM in your case. It then writes that buffer address to the hub, so the signal cog knows where to get that data. Finally, it updates all its pointers, and writes a one to a different hub address, looping until it sees that one change to a zero. When that happens, it goes to fill the second buffer, writing that address, updating pointers and such, repeating the process, buffer 0, buffer 1, buffer 0, buffer 1, etc... I find it's easiest to keep track of the frame, current scan line and such in the render cog, so it does not have to be communicated, and that also allows things like an interlaced signal, or doubling up of scan lines for coarser vertical resolution just by how those values are incremented. The signal cog is fixed, and all it does is draw scans, detailed below.

Then fire off the signal cog.

It does all the stuff it needs to, but most importantly, during the horizontal blank, it reads the buffer address given to it, then writes the 0 to what I'll call the latch address, signaling that it's got the buffer and is working on it. It repeats this every active scan line, always reading its buffer address, always writing the 0.

This can also be done by simply using one address, with the signal cog reading the buffer address, writing a 0 there, and the fetch or render cog, reading that change to know it's time to go setup another buffer. Other schemes make sense here too, with the important thing being the states of the two cogs being latched, so they work in discrete steps, scan line by scan line.

The two together then will draw the screen. The render (or fetch) cog gets nearly the entire scan line this way, and its activity with the RAM isn't directly coupled to the pixels on screen.

The double buffer method buys you more time, and has the potential for doing other things to that scan line, like sprites. It also makes your hybrid approach possible, because you could have a render from hub cog, and a render from RAM cog, both doing half the line. Just have one of the render cogs be in charge of the state and the buffer address, leaving the signal cog latched, and other render cogs just tracking with the primary render cog.
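The handshake described above can be modelled in a few lines. This is a toy simulation, not driver code: the two cogs are Python generators stepped in lockstep, the `Hub` class, `external_ram_line` and the 8-byte line length are all invented for the demo, and the fetch cog is assumed to need one tick per byte.

```python
LINE_BYTES = 8  # short lines keep the demo readable

def external_ram_line(n):
    """Stand-in for reading scan line n from external RAM (hypothetical data)."""
    return [(n * 16 + x) & 0xFF for x in range(LINE_BYTES)]

class Hub:
    """State both cogs can see."""
    def __init__(self):
        self.buffers = [[0] * LINE_BYTES, [0] * LINE_BYTES]
        self.buffer_index = 0  # which buffer the signal cog should draw next
        self.latch = 0         # 1 = buffer published, 0 = signal cog has taken it

def fetch_cog(hub, lines):
    """Fill buffers 0, 1, 0, 1...; after publishing one, wait for the latch to clear."""
    which = 0
    for n in range(lines):
        data = external_ram_line(n)
        for x in range(LINE_BYTES):   # the slow part: one byte per tick
            hub.buffers[which][x] = data[x]
            yield
        hub.buffer_index = which      # publish the buffer
        hub.latch = 1
        while hub.latch == 1:         # spin until the signal cog takes it
            yield
        which ^= 1

def signal_cog(hub, lines, screen):
    """Fixed timing: latch the buffer during blanking, then draw it pixel by pixel."""
    for _ in range(lines):
        index = hub.buffer_index      # horizontal blank: take the published buffer
        hub.latch = 0                 # release the fetch cog
        yield
        row = []
        for x in range(LINE_BYTES):
            row.append(hub.buffers[index][x])
            yield
        screen.append(row)

def run(lines=6):
    hub, screen = Hub(), []
    fetch = fetch_cog(hub, lines)
    while hub.latch == 0:             # start the fetch cog first; let it fill buffer 0
        next(fetch)
    fetch_done = False
    for _ in signal_cog(hub, lines, screen):  # one loop pass = one tick
        if not fetch_done:
            try:
                next(fetch)
            except StopIteration:
                fetch_done = True
    return screen

for row in run():
    print(row)
```

Note that the signal cog never waits on the fetch cog here, which mirrors the point that its timing is fixed; the fetch cog is the only side that spins on the latch.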

My 80x50 driver in my blog uses this technique, as does the full color driver Baggers and I did, though it's got more than two buffers, the idea is the same. In that one, there is basically one buffer per allocated render cog, which extends the time you've got to fill one.

Those routines are fairly well commented in the 80x50, and somewhat commented in the full-color driver, FYI.

The other thing that occurs to me, given the RAM is half speed or so, is to simply start out with a lower vertical resolution. With a double buffer scheme, as I describe here, it's pretty easy to fill a buffer before the signal cog gets there, then just have the signal cog draw two scan lines, while the fetch or render COG fills the other buffer. For a very easy case, make each pixel 4 scan lines high, so that you get a stable bitmap with the data you've got, then you can test from a working case, trying out different buffer / ram fetch schemes, ideally getting it down to a single scan line!
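As a rough illustration of the line-repeat idea: if the RAM fetch manages about 170 pixels per line time (the figure from earlier in the thread) and the display has 224 active scans, the repeat factor and the scan-line-to-row mapping could be worked out like this. The function names are hypothetical and the steady fetch rate is an assumption.

```python
import math

def min_repeat(width, ram_pixels_per_line_time):
    """Smallest number of scan lines each row must be shown so the fetch keeps up."""
    return math.ceil(width / ram_pixels_per_line_time)

def source_row(scanline, repeat):
    """Bitmap row drawn on a given active scan line."""
    return scanline // repeat

repeat = min_repeat(256, 170)
rows = 224 // repeat
print("repeat each row", repeat, "times, giving", rows, "rows")
print([source_row(s, 4) for s in range(8)])  # the easy 4-scans-per-pixel case
```

Keeping this mapping in the render or fetch cog means the signal cog does not need to know the vertical resolution at all.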

When I am building out video drivers, I tend to operate at lower horizontal resolutions first. Start with, say, 80 pixels and perfect the routine so that it all works. Then increase to 160, repeat, and refine, etc... You can get your timing metrics while looking at a nice signal that way, and it helps to have a known baseline case to compare to. Maybe start out with an 80x50 driver, assuming 200 scans per frame. 80x48, if it's 192, etc...

In this case, the lower vertical resolution capability would be helpful, and doing it in your render or fetch cog is easier than tweaking the signal cog code, which will be more complex, and all it really needs to do is draw scan lines anyway. Again, look at the render cog in the 80x50 code, maybe one of the earlier versions posted to see how simple that cog really is.

Looking damn cool so far! I'll bet what you've done will work at 256x96, or 256x100 pixels, given a double buffer to operate in. That's a very nice start, IMHO.

**My earlier comments on high color graphics reaching 512 pixels aren't viable. That thread was about abusing the waitvid instruction in ways that really aren't viable. Best case on TV is about 320, and that's pushing it, but possible at 80MHz. The jump to 512 isn't quite there at 100, IMHO. Just following up on an earlier comment. Been quite busy as of late.

Edit: I did not have time to really examine your "get from ram" code. The thought I have is that the data in the ram does not need to be sequential. Consider addressing schemes that minimize fetch time. It doesn't matter too much how complicated the data structure in the external ram is for bitmaps. It only matters if the bitmap can be displayed. Graphics cogs, that deal with the RAM during blanking periods can use lookup tables to put the pixels in the right places quickly. Just a thought. Maybe it makes sense to store the data in some other way, maybe even a wasteful way, if it can be fetched quickly.

idbruce · 2011-08-25 05:43

Dr_Acula

Thanks for the words of encouragement, idbruce. Your support has encouraged me to think about this further.

That's good, and I hope they continue to do so, because I would not even attempt to begin a project like this one, and you and I both know that someone needs to do the experimentation.

I am more the mechanical problem solver type, but I like facing mechanical challenges and finding solutions. Any man that goes where no man has gone before, has my deepest respect.

As I am sure you have read in one of my other threads, I completely underestimated the strength of spring wire, as it applies to my cnc wire bender. Well the wire that I use is only 0.032 inches in diameter. I must have spent a little over a month designing and experimenting with three different cutting assemblies before I constructed one that would actually cut the wire and one that I was satisfied with. I guess what I am trying to say is that when you start exploring and experimenting in unknown territory, you never know what kind of problems you are going to encounter. However my experiences have proven to me that strong will and determination will usually overcome any problem that is capable of being solved. I am one of those stubborn men and some people say it is a bad thing. I say it all depends on the subject. If you are determined and if your goal is truly attainable, I am sure you will conquer it.

Keep experimenting!

It appears that potatohead had some input for you, so I will stop writing so you can think about what he wrote.

Bruce

Dr_Acula · 2011-08-25 16:12

Thanks for all the sage advice. The render and signal cogs are working fairly close to as potatohead describes.

I tried a few dead ends but the code that seems to be most promising is the one where half the buffer is in hub ram and half in external ram. More experiments to try when I get home from work!

Looking damn cool so far! I'll bet what you've done will work at 256x96, or 256x100 pixels, given a double buffer to operate in. That's a very nice start, IMHO.

**My earlier comments on high color graphics reaching 512 pixels aren't viable. That thread was about abusing the waitvid instruction in ways that really aren't viable. Best case on TV is about 320, and that's pushing it, but possible at 80MHz. The jump to 512 isn't quite there at 100, IMHO. Just following up on an earlier comment. Been quite busy as of late.

Yes I think around 330 is the max columns as color artifacts start appearing. I'm shooting for 224 rows.

I am one of those stubborn men...

Me too! And I like a good puzzle.

idbruce · 2011-08-25 16:16

Me too! And I like a good puzzle.

LOL

I never thought of my problems as puzzles, but I guess that is exactly what they are. I have a really difficult puzzle at this very moment.

Dr_Acula · 2011-08-26 19:37

Thanks again to the forum, I went back and completely rewrote the code again.

It works!!

Attached is the code and a screenshot. It looks better in real life as the camera is picking up some wavy artifacts that are not there.

This is a 256 wide, 224 high picture. In the middle are two white pixels - one of these is coming from the hub buffer and one is coming from the external ram buffer.

On these little LCD screens I think this is getting to the limit - if you zoom right into that picture you can see the LCD uses about 3 physical pixels across for each pixel we are drawing, and the height is about the same.

As has been pointed out, and experiments have shown, adding more pixels in the x dimension starts to add color artifacts.

In the y dimension, we are out of hub ram.

In fact, there are quite a few things right on the limit. If you look at the picture on the left side, every 32 lines there is a notch. I think this is a timing issue with changing the latch every 32 lines.

Next challenge will be to get data into the frame buffer.

There isn't any room for an sd card driver, so I think this will have to be done with a serial driver. Run that in another cog, and buffer lines in the cog. Putting each alternate line into hub will be easy, but putting the other lines into ram will need to be done during the blank lines at the top and the bottom of the screen.

There are some other limits too.

For movies, a 160x120 display worked well. For 256x96 the frame rate had to go down to 7 per second as the sd card was not fast enough. At 256x224, the frame rate will be around 3 per second.
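The 3 frames per second estimate follows directly from the 256x96 figure if the SD card throughput stays the same. A quick check, assuming one byte per pixel (the function names are just for illustration):

```python
def bytes_per_second(width, height, fps):
    """Data rate needed for a given frame size and rate, one byte per pixel."""
    return width * height * fps

def max_fps(width, height, throughput):
    """Whole frames per second that fit in a given throughput."""
    return throughput // (width * height)

sd_budget = bytes_per_second(256, 96, 7)  # what the SD card sustained at 256x96
print("SD throughput:", sd_budget, "bytes/s")
print("256x224 frame rate:", max_fps(256, 224, sd_budget))
```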

And for games, well maybe and maybe not. I'll need to see how quickly I can get a 16x16 tile in via a serial link. How fast can inter-prop comms go?

[at a standard 115k serial link, that is about 11.5 kilobytes per second (10 bits on the wire per byte), and with 256x224 bytes that will be about 5 seconds to refresh a screen]
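A back-of-envelope version of the serial numbers, assuming 115200 baud with 8N1 framing (start bit, 8 data bits, stop bit, so 10 bits per byte); both the framing and the exact baud rate are assumptions on my part:

```python
BAUD = 115_200
BITS_PER_BYTE = 10  # 8N1 framing: an assumption, not stated in the post

serial_bytes_per_s = BAUD / BITS_PER_BYTE
screen_bytes = 256 * 224   # one byte per pixel
tile_bytes = 16 * 16       # the 16x16 tile mentioned for games

screen_seconds = screen_bytes / serial_bytes_per_s
tile_ms = tile_bytes / serial_bytes_per_s * 1000

print(f"{serial_bytes_per_s:.0f} bytes/s")
print(f"full screen: {screen_seconds:.1f} s")
print(f"16x16 tile:  {tile_ms:.1f} ms")
```

At that rate a single tile update is quick even though a full-screen refresh is not, which fits the static-display and GUI use cases better than games or movies.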

I see two applications here, both using this as a standalone circuit and interfacing with another prop.

The first is for static displays, or displays that do not change very quickly.

And the second is for a GUI display (my project in C before I got distracted when someone on the forum said this can't be done *grin*)

If anyone wants to try to replicate this or improve it, all you need are a propeller, a 512k sram and a 74HC374 latch.

idbruce · 2011-08-26 19:42

Excellent!

I am glad you conquered your puzzle. Now you should have some free time to help me with mine.

And the second is for a GUI display (my project in C before I got distracted when someone on the forum said this can't be done *grin*)

Don't you just love it when people underestimate the capabilities of a determined problem solver?

Bruce

potatohead · 2011-08-26 22:47

Excellent!

Are there dual port RAMs available for a reasonable price and form factor?

Fill rate limits will be in play here, though for basic things like a GUI, they won't be too bad. You do have the blanking period.

Dr_Acula · 2011-08-28 07:05

I think we have something working here!

Pictures are being downloaded via RS232 so not quite fast enough for movies but pretty good for static displays and maybe good enough for your holiday snaps.

'tis nearly midnight, so I'll post code and a proper writeup in a few days.

idbruce · 2011-08-28 07:22

Very nice. I am looking forward to reading the writeup.

Bruce
