Does a 1-COG/40-Column/Full ROM font/4-Color/Low Memory/VGA Tile Map Driver Exist?

JRetSapDoog · 2013-09-29 13:13

Greetings! By way of background, I've got an 800x480 (WVGA) display which I've driven with an SSD1963 to give 50 characters per line using the Propeller's built-in font, and it looks quite good that way (as 50 chars x 16 pixels wide/char = 800 pixels). But the display also looks quite good being driven by a common VGA-to-RGB chip at 640x480 with 40 characters per line, even though the characters are automatically stretched to fill the screen (800-640=160 "stretch" pixels).

So, I'd like to drive this display with a character-based tile map VGA driver (640x480 resolution, 40 characters per line x 15 or so lines) with at least 4 total on-screen colors for text. I'd like to take advantage of the Prop's video serializer in doing this. However, I want to do so using only a single-cog driver, wherein the object with the driver has a relatively minimal hub footprint (in the range of 1 to 2 words per character). The driver should also use the Prop's nice built-in ROM character font at full resolution. Of course, I realize that we can't always have what we want, but....

I did a search of the new OBEX and also the old OBEX, but I haven't come up with anything that meets these criteria. Chip's VGA driver is nearly ideal in that it's a one-cog, small footprint driver that only takes one word per character and offers 8 pairs of foreground and background colors for the text (no line limitations or otherwise). However, it provides 32 characters per line as opposed to 40, unlike his equally nifty TV object. Of course, 32 characters is still a lot, but they get stretched a bit too much on the display I'm using (which is a very common display), and 8 more characters per line would be nice.

Kye's single cog tile map driver (VGA64_BMP) is very slick. It provides 40 characters per line with lots of colors. And it also allows placing characters at half-line/character spacings, while also providing for various text boxes and a mouse cursor. So it's packed with functionality! However, using it would put me way over my memory budget for my current application as it reserves 1,200 longs and 1,200 words for its chroma and luma buffers, respectively. I did gut most of the mouse code (since I didn't need it for my purposes) and freed up around 700 longs, but I'm still out of memory (and can't cut memory demands elsewhere). Perhaps his driver can be further modified or restricted to free up more memory, but I haven't got into it that deeply yet (other than excising the mouse code).

So, I'm taking a breather to look around at what else is available. I seem to recall that someone (maybe Kye or Kuroneko) had a two-colors-per-line tile map driver, but I didn't manage to locate it. I don't recall for sure if it was actually placed in the OBEX, but, if it was, I either missed it while searching or it was subsequently taken down. I'd be okay with 2 colors per line (i.e., 1 background and 1 foreground color), but would prefer 3 or 4 or more colors per line, and at least 4 colors present at one time on the whole screen. Anyway, I had thought that there were several drivers that provided 40 characters per line, but perhaps I'm mistaken. Does anyone know of others?

Ultimately, all of the following features are desired: [1] uses only 1 cog, [2] provides 640x480 VGA with 40 columns x 15 rows, [3] consumes little memory (1-2 words/char), [4] provides multiple colors (at least 4 colors/screen and at least 2 or 3 colors/line, not just 2 over the entire screen), and (maybe I should write "AND" in that I'm so greedy) [5] displays characters using the full 16x32 built-in ROM font.
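
As a rough check on requirement [3], the hub memory involved can be tallied. This is only a back-of-the-envelope sketch: the 1,200-long and 1,200-word figures are the buffer sizes quoted above for VGA64_BMP, and every name in the code is made up for illustration rather than taken from any driver.

```python
# Hub-RAM budget sketch for a 40x15 text screen (all names are illustrative).
BYTES_PER_WORD = 2
BYTES_PER_LONG = 4

COLS, ROWS = 40, 15                 # 640/16 columns, 480/32 rows with the 16x32 ROM font
chars = COLS * ROWS                 # characters on one screen

one_word_per_char = chars * 1 * BYTES_PER_WORD    # low end of the stated budget
two_words_per_char = chars * 2 * BYTES_PER_WORD   # high end of the stated budget

# Buffer sizes quoted in the post: 1,200 longs (chroma) + 1,200 words (luma).
quoted_buffers = 1200 * BYTES_PER_LONG + 1200 * BYTES_PER_WORD

print(chars, one_word_per_char, two_words_per_char, quoted_buffers)
```

On these numbers the quoted buffers come to three times the upper end of the 1-2 words/char target, which is the gap the rest of the thread tries to close.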

And now the second point of this post: if such a driver doesn't exist, is such a driver feasible??? I've not written any VGA drivers, but I know that there's only so many cycles available to do things before the video serializer needs to be stuffed by WAITVID. However, I haven't calculated whether such a driver is possible. Well, Kye's cool driver comes quite close but I'm looking for something that can do it at a memory cost of 1-2 words per character, and I haven't yet fully scrutinized the details of how his particular driver uses memory (though I'd guess the half-line/character spacings increases memory demands). Now, if it turns out that such a driver as I'm seeking is not possible running at the standard 80MHz, could anyone venture a guess (not provide a guarantee) as to whether overclocking to around 100MHz (give or take) would get the job done?

I'm considering trying to write such a driver (which will be a stretch given my limited PASM experience), but would like to have an idea of how much this would be pushing the Prop's limits before rolling up my sleeves (though one could argue that it's better not to know when something is supposedly impossible). I think such a driver would be quite useful for users (if such doesn't already exist) and welcome in industry. Having such a driver could be a plus for Parallax's sales efforts, too. I hope that such a driver is possible but worry that it would be pushing things too much. For example, perhaps Chip's great driver provides "just" 32 columns (again, that's a lot; see paragraph 3 above) because either those were early coding days for the Propeller or that driver offers more flexibility (that we might not always need) that ends up imposing the 32-column limit. Similarly, I'm guessing Kye's driver ups the memory requirements due to the incredible functionality it provides.

Apologies in advance as this request no doubt overlaps with threads in the past, but I think that's okay as many of us could stand a reminder from time to time. Also, I haven't seen many new threads over the last several months dealing with video (which is one of the many exciting Prop topics), presumably since so much has already been said and accomplished in the past. Another reason for the tail-off in video posts is folks anticipating the Prop 2 (myself included). However, the Prop 2 is still a ways out, and even if it were available now, the original Prop will continue to be useful and have its advantages, and drivers are a part of its utility. And like Ken says: design for the Prop 1 for now wherever it's suitable. So, any and all comments or opinions or guesses regarding the above are naturally most welcome. Thanks in advance, everyone! And great thanks to those who have provided drivers (and to Chip/Parallax for seeding such creativity). --Jim

P.S.: Okay, if such a driver is possible, perhaps it could have been written during the time it took to make this post.

localroger · 2013-09-29 18:28

I can save you some trouble. The feature mix you want isn't possible. The closest you will find to it is my own 3-cog 40 x 18 driver which is in the OBEX. The problem is that VGA timing is much faster (and also much less fault tolerant) than NTSC or PAL, and the Prop just doesn't have the CPU cycles to do all the line timing as well as extracting the tiles, finding the right line bitmaps in the corresponding ROM character, and shoveling them out to WAITVID with one cog. My driver uses two cogs to convert alternate lines from tiles to raw bitmaps as a third does the actual signal timing and shovels the bitmaps out to WAITVID. And it only barely works. It does manage to do 2 per-character colors per line, and per-character reverse video, so that's something.

JRetSapDoog · 2013-09-29 18:57

Thanks, localroger! That's exactly the type of relevant information that I'm seeking, even if I wish the conclusion were different (don't worry, we don't usually kill the messenger these days). Sounds like a "so close and yet so far away" situation.

Actually, I did take a look at your driver a couple days before posting, and managed to change the initialization parameters to get it to display at 640x480 (as the driver board I'm using would not sync up for the 800x600 version (just a black screen)). Here are the settings I used (which are probably not the only workable combination):

  hp = 640      'horizontal pixels                      cp: 640
  vp = 480      'vertical pixels                        cp: 576
  hf = 24       'horizontal front porch pixels          cp: 32
  hs = 95       'horizontal sync pixels                 cp: 102
  hb = 48       'horizontal back porch pixels           cp: 71
  vf = 11       'vertical front porch lines             cp: 5
  vs = 2        'vertical sync lines                    cp: 4
  vb = 32       'vertical back porch lines              cp: 27
  hn = 0        'horizontal normal sync state (0|1)     cp: 0
  vn = 0        'vertical normal sync state (0|1)       cp: 0
  pf = 402_800_000 'pc = 25,175,000 x 16 '???           cp: $1999_0000

Although I can't afford three cogs, I enjoyed playing with your driver as a learning exercise. Thanks for making it, as it could come in handy in other applications. Thanks for your response/insight!

Are there any other thoughts on this matter, even if they basically echo/reinforce what has been said?

kuroneko · 2013-09-30 02:03

It's Monday - so take this with a grain of salt - but I wouldn't give up hope just yet. At 25.175MHz 16px take about 50 clocks (@80MHz). The loop below uses waitvid, a hubop and 4 basic insns (7+8..23+16, worst case 46, < 50).

setup           mov     phsb, addr              ' character pixmap address
                rdlong  pix, phsb               ' read from address + 2frqb (row adjustment)
video           waitvid col, pix                ' 

                add     setup, #1               ' next address |
                add     video, dst1             ' next colour  | cog array

                djnz    ccnt, #setup            ' next 16px

HSYNC clocks in with 160px, IOW 508 clocks (@80MHz). That's 31 hub windows per row of which we have 32, i.e. plenty of time to fetch ASCII data, colour and do cog array setup (dual address/colour arrays, total 40*4 longs).
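
The clock figures above can be reproduced with a little arithmetic. The sketch below is not part of any driver; it just recomputes the per-character budget, the blanking budget and the hub-window count from the numbers quoted in this post, and repeats them for a hypothetical 100MHz system clock (the overclocking case raised earlier in the thread). Function and constant names are illustrative assumptions.

```python
# Clock-budget sketch for a 640x480 tile driver (names are illustrative).
PIXEL_CLOCK_HZ = 25_175_000    # pixel clock quoted above
H_BLANK_PIXELS = 160           # blanking per line quoted above
HUB_WINDOW_CLOCKS = 16         # a cog gets one hub slot every 16 system clocks

def clocks_for_pixels(pixels, sysclk_hz):
    """System clocks that elapse while `pixels` pixels are shifted out."""
    return pixels * sysclk_hz / PIXEL_CLOCK_HZ

def budget(sysclk_hz):
    """Return (clocks per 16px character, clocks per blanking, hub windows)."""
    per_char = clocks_for_pixels(16, sysclk_hz)
    blank = clocks_for_pixels(H_BLANK_PIXELS, sysclk_hz)
    return per_char, blank, int(blank // HUB_WINDOW_CLOCKS)

# Worst case for the loop as costed in the post:
# waitvid (7) + hub op (8..23) + four 4-clock instructions (16).
loop_worst = 7 + 23 + 4 * 4

for mhz in (80, 100):
    per_char, blank, windows = budget(mhz * 1_000_000)
    print(f"{mhz}MHz: {per_char:.1f} clocks/char, {blank:.1f} clocks blanking, "
          f"{windows} hub windows, loop worst case {loop_worst}")
```

At 80MHz this gives roughly 50.8 clocks per character against a worst-case loop of 46, so the margin is real but thin; the 100MHz row only shows how the same arithmetic scales and says nothing about whether a given board tolerates that clock.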

FWIW, there is also a 50x18 version of localroger's driver (forum post 1063168).

Mark_T · 2013-09-30 04:46

This is where the bizarre bit-interlaced font storage scheme bites - it necessitates more code to unpick it:

            rdbyte code, screen
            test   code, #1  wz
            and    code, #$FE
            shl    code, #6
            add    code, offset
            add    screen, #1
            nop                              ' dead time due to hub-aligned reads
            rdlong pattern, code
    if_nz   shr    pattern, #1

            WAITVID colours, pattern

A more rational storage scheme allows something like

            rdbyte code, screen
            shl    code, #5
            add    code, offset
            rdword pattern, code
            add    screen, #1
            WAITVID colours, pattern
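
To make the interleaving concrete, here is a small model of the address arithmetic in the first snippet above. It assumes the documented Propeller layout (character ROM starting at $8000, each pair of characters sharing 32 longs, the even character in the even bits and the odd character in the odd bits); the function names are invented for this sketch.

```python
# Model of the interleaved ROM font lookup (function names are illustrative).
ROM_FONT_BASE = 0x8000     # start of the character ROM
ROWS_PER_CHAR = 32         # one long per pixel row, shared by a character pair
BYTES_PER_LONG = 4

def font_row_address(code, row):
    """Hub address of the long holding pixel row `row` for character `code`."""
    pair = code >> 1                       # two characters share one 128-byte block
    return ROM_FONT_BASE + pair * ROWS_PER_CHAR * BYTES_PER_LONG + row * BYTES_PER_LONG

def extract_row(pattern, code):
    """Pull the 16 one-bit pixels of `code` out of an interleaved long."""
    if code & 1:                           # odd character lives in the odd bits
        pattern >>= 1
    bits = 0
    for i in range(16):
        bits |= ((pattern >> (2 * i)) & 1) << i
    return bits

# Same address as the PASM above: (code & $FE) << 6, plus an offset of base + row*4.
code, row = ord("A"), 5
assert font_row_address(code, row) == ((code & 0xFE) << 6) + ROM_FONT_BASE + row * 4
print(hex(font_row_address(code, 0)))
```

The explicit `extract_row` step is only needed when the pixels must end up as a plain 1-bit-per-pixel word; it is shown here to illustrate what "unpicking" the font means, not as something the PASM above performs.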

JRetSapDoog · 2013-09-30 19:35

Wow! Replies with code? You're kidding me! I didn't (and shouldn't) expect that! Thanks, folks!!

Well, I'm not sure if we have a consensus yet, but this is interesting and I'm learning from it.

@kuroneko: Thanks for doing that math (even if the Monday morning initial thinking turns out to have overlooked something). I see how you derived your calculations, but I am curious about the number of cycles for the waitvid instruction. The manual I have says "5+" but you've used 7. So I'm wondering if, somewhere along the line, it was determined that 7 was "safer" or more typical. At any rate, thanks very much for the corresponding code example (or proof of concept)! I was wondering, given the interleaved complexity that Mark_T mentioned, if you still felt things might be doable (using your h-blanking scheme).

Now, if I understand you correctly, you're suggesting stuffing a pixel data buffer with font data during the horizontal blanking interval, i.e., pre-fetched font data, is that right? That is, the tight video loop that you've provided would just access that pre-fetched data...because there's no more cycle time left before the serializer needs data from waitvid in the loop you've provided, is that right? And if so, I wonder how going to a 32-pixel frame would change things (but if the data can be supplied during the horizontal period, it might not make much difference about the frame size (see colors below)).

Assuming that's what you're saying, providing for the blanking signal while doing hub ops and not getting stalled too early at a waitvid (if I'm saying that right) and maybe doing things predictably/deterministically to avoid video hiccups will take some thinking on my part, and I'm not opposed to receiving any clues (far from it). Also, there's 3 different phases to the non-video portion of horizontal blanking (sync and 2 porches), though maybe the sync portion would be enough. But the first thing to know, of course, is whether there's indeed time to pre-buffer pixel data for the 40 characters across a line.

I see where you get the 508 clocks ((1/25.175M)*160/(1/80M)). But are you saying there's 31 hub windows per horizontal line of pixels because it's assumed the code will be in synchronization with the hub (i.e., 508/16 = ~31)? I guess so, but what do you mean by "of which we have 32" part? Also, could you expand on the meaning of "dual address/colour arrays, total 40*4 longs," including the calculation? Obviously, we're talking about pixel data for 40 characters across a line (though just for 1 of the 32 lines composing each character at a time, of course). But what about the 4 longs part? So, by "dual," do you mean 2 arrays for each (address and colour), for a total of 4? And by "address" in "dual address/colour arrays," do you mean the buffer space for the pre-fetched pixel data (as opposed to strict addresses)? I'm pretty fuzzy after the 31 hub windows part.

@Mark_T: Thanks for the code comparison! Yes, I do see what you mean: separating that interleaved font data does complicate things and takes two-thirds *more* time, time that may not be available. Do you have an opinion on whether time is sufficient or not? Or maybe you were saying that the interleaved font data creates an insurmountable barrier (I'm not sure how you meant it). If one wanted to fetch pixel data in the main waitvid loop, that'd be a deal breaker (at least with a 16-pixel frame), but what about with a pre-fetching scheme during horizontal blanking? Anyway, if my proverbial ship comes in, I'll put in a request to Parallax for a new IC mask set ($$$) with a ROM font that is not interleaved and that also has basic graphics block patterns in the upper half. I'm kidding for the most part, but if they ever did do a refresh, an alternative version would be nice (hmm...maybe they could charge more money for a revised version).

@Anyone: Okay, I'm taking this video stuff real slow, having no direct experience with it. For now, I'm using localroger's 40x18 driver as a base (modified with timing for a 640x480 display). I like how his loop and subroutine structure seems straightforward and there's not a lot of extra stuff in there to confuse a video neophyte like me (though the simplicity would go down a lot if prefetching data during blanking). At this point, I've stripped out the hub ops and helper cog stuff and am just experimenting with putting solid colors and various color patterns on the screen. I realize that the crux of the problem is being able to deliver the font data fast enough (the code that I've removed), but we have to learn to crawl before we can walk, let alone run. I need to be comfortable with the basics first, and I don't really know if I'll make it much further, but we'll see.

Regarding colors, I'd be fine with a single set of 4 fixed colors for the whole screen, so maybe that's one thing less to worry about (though Kuroneko's comments give hope that getting color data wouldn't be a problem). And a set of 4 fixed colors would allow more than 2 colors per line of text (i.e., any amount from 1-4 colors), which is preferable to a limit of 2. But I'd still be interested in just 2 colors per line of text if for some reason 4 (or more) doesn't work out.

However, about the colors thing, I was wondering if it would be better or worse to go with a 32-pixel frame (I think it's called) as opposed to 16? I mean, no more than 2 colors are needed for a single text character in my case (there won't be any overlapping stuff or boxes around the characters). So, would using a 32-pixel frame double the available working time between adjacent characters to about 100 clocks? And if so, would there be the possibility of adjusting the foreground/background color pair on the fly, such that characters in various colors could be shown on the same line (more than 2 and even more than 4 if desired)? But maybe there still won't be time to modify on the fly, I don't know yet (which is partly why I mentioned going to a 32-pixel frame). Anyway, if font character pixel data (and optionally color data) can be pre-fetched during the horizontal blanking period, then I suppose it's not necessary to go to a 32-pixel frame (though doing so might provide even more flexibility...maybe).

Anyway, again, I think I'll be a "happy camper" with just 4 colors for the whole screen (especially if I can get more than 2 per line), so if that makes coding simpler (and it would seem to mean never having to load anything but pixel data to index into the color set), then that's totally fine. I just haven't got a handle on which way is better to go: a 16-pixel frame or a 32-pixel one. I'm definitely open to suggestions. It's possible that there's something that I'm missing which dictates the 16-pixel (4-color capable) frame, or vice-versa. Apologies for the long post. Thanks again for that great feedback, everyone!

kuroneko · 2013-09-30 19:52

JRetSapDoog wrote: »

Thanks for doing that math (even if the Monday morning initial thinking turns out to have overlooked something). I see how you derived your calculations, but I am curious about the number of cycles for the waitvid instruction. The manual I have says "5+" but you've used 7. So I'm wondering if, somewhere along the line, it was determined that 7 was "safer" or more typical. At any rate, thanks very much for the corresponding code example (or proof of concept)! I was wondering, given the interleaved complexity that Mark_T mentioned, if you still felt things might be doable (using your h-blanking scheme).

You better get an updated datasheet. Minimum cycle time is actually 4 but anything below 7 (3 setup, hand-off, 3 exit) doesn't work. As for interleaving, that's all included.

JRetSapDoog wrote: »

Now, if I understand you correctly, you're suggesting stuffing a pixel data buffer with font data during the horizontal blanking interval, i.e., pre-fetched font data, is that right? That is, the tight video loop that you've provided would just access that pre-fetched data...because there's no more cycle time left before the serializer needs data from waitvid in the loop you've provided, is that right?

What I (intend to) do during HSYNC is fetch ASCII and colours. The former is then transformed and stored into a cog address array holding the ROM addresses, the latter goes with some even/odd transform directly into the colour array (2 colours/per character, i.e. fg/bg). The emitter loop cycles through 40 ROM addresses and fetches the actual bit pattern which is then sent to the video h/w. The palette is setup so that even/odd characters are displayed correctly (BBAA, BABA).
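
The even/odd palette idea (BBAA, BABA) can be modelled in a few lines. This is a sketch under stated assumptions, not the driver's code: it assumes 4-colour waitvid, where each 2-bit pixel value selects one byte of the colour long (value 0 selects the lowest byte), and it uses the placeholder bytes $AA for background and $BB for foreground so the result reads like the pattern names above. They are not real VGA colour values, and the function names are invented.

```python
# Model of the even/odd palette trick for the interleaved font (illustrative names).
def palette(fg, bg, code):
    """Colour long that makes 4-colour waitvid show only `code`'s bits."""
    if code & 1:
        entries = [bg, bg, fg, fg]     # odd char: bit 1 of each pixel pair decides
    else:
        entries = [bg, fg, bg, fg]     # even char: bit 0 of each pixel pair decides
    return sum(colour << (8 * index) for index, colour in enumerate(entries))

def render(pattern, colours):
    """Colour bytes for the 16 pixels of one interleaved font long."""
    return [(colours >> (8 * ((pattern >> (2 * i)) & 3))) & 0xFF for i in range(16)]

A, B = 0xAA, 0xBB                      # placeholder background / foreground bytes
print(hex(palette(B, A, 64)), hex(palette(B, A, 65)))
```

Because the palette hides the neighbouring character's bits, the emitter loop can send the ROM long to waitvid unmodified, which is presumably why no shift or mask appears in the loop posted earlier.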

JRetSapDoog wrote: »

I see where you get the 508 clocks ((1/25.175M)*160/(1/80M)). But are you saying there's 31 hub windows per horizontal line of pixels because it's assumed the code will be in synchronization with the hub (i.e., 508/16 = ~31)? I guess so, but what do you mean by "of which we have 32" part? Also, could you expand on the meaning of "dual address/colour arrays, total 40*4 longs," including the calculation? Obviously, we're talking about pixel data for 40 characters across a line (though just for 1 of the 32 lines composing each character at a time, of course). But what about the 4 longs part? So, by "dual," do you mean 2 arrays for each (address and colour), for a total of 4? And by "address" in "dual address/colour arrays," do you mean the buffer space for the pre-fetched pixel data (as opposed to strict addresses)? I'm pretty fuzzy after the 31 hub windows part.

You asked for 16x32 pixel characters. Which gives us the 32 scanlines per row. As for the 31 hub windows, imagine HSYNC can be issued as one waitvid, then the whole time can be used for processing, 31 is just the upper limit. The ASCII code stays constant as do the colours and ROM start addresses (the row offset is provided by ctrb). So while row N is displayed (using arrays addr0/colour0) row N+1 is fetched (arrays addr1/colour1). Then we swap over. But in order to achieve that we need 4 arrays of 40 longs each. Since the line emitter is a loop we can afford the addr/colour arrays.

JRetSapDoog · 2013-09-30 22:44

Thanks for those clarifications, kuroneko. [1] Yes, I was referring to version 1.1 of the Propeller manual, but now I see there's new and improved info for the waitvid instruction in version 1.2 of the manual. [2] Regarding your "emitter loop" to look up pixel data, that would be located in the hsync code, right, not the primary waitvid loop for visible data, even though the latter could be thought of as emitting the actual pixel data to the serializer. I guess that's obvious to most people (about being in the hsync code section) since there's no--or not much--time for it in the primary waitvid loop, but I'm just making sure. Also, thanks for the comments about the even/odd transformations. [3A] Yes, definitely looking for 16x32 characters, the full font info for each character, as the Parallax font, though not proportionally spaced, looks quite nice. [3B] The row offset is provided by ctrb, huh? Hmm, I did NOT see that coming! [3C] Regarding dual arrays (for addresses and colours), if there were just single arrays, I guess it would be too risky, timing-wise, to try to repopulate the same array one was reading from at the same time.

JRetSapDoog · 2013-09-30 22:51

Oh, by the way, kuroneko, I received your mail message and sent a response, but let me know if you didn't receive it (as I don't see it in my sent messages). In a nutshell, it's great that you have some interest in such a driver, as it might be useful for lots of folks (it would be for me). And, as you can tell, I'm just scratching the surface in getting a handle on what's involved.

kuroneko · 2013-09-30 23:15

JRetSapDoog wrote: »

Regarding dual arrays (for addresses and colours), if there were just single arrays, I guess it would be too risky, timing-wise, to try to repopulate the same array one was reading from at the same time.

Yes, I've been there before and it doesn't look nice. Anyway, I just verified that the emitter loop works. Now I just have to include the fetch code and we are done. I also received your PM (there may be an option re: copying to Sent).

kuroneko · 2013-10-01 07:21

POC is now available (forum thread 150593).

User Name · 2013-10-01 09:16

You know the old conundrum... "What happens when an immovable object meets an irresistible force?" Well, we now have a partial answer: If the irresistible force is kuroneko, the force wins.

JRetSapDoog · 2013-10-01 16:48

Wow! Absolutely fantastic! You did it! Great job! And so quick, too! Thanks so much for working on this!

I'll make further comments on the actual thread for the driver, not that I've wrapped my head around it by any means.

Maybe any further comments on this thread can be about timing considerations and any related matters.

Comments regarding this amazing new driver from kuroneko should probably be posted on the new thread.