TFT Driver - Optimizing for speed
Vega256
Posts: 197
in Propeller 1
Hey guys,
I'm writing a TFT driver for a 320x240xRGB QVGA display module; my code is attached. I'm trying to get the Prop to send RGB pixel data fast enough to get a solid 60fps, but to no avail. Could someone possibly give me some hints, tips, and/or insight on my methods? I would post the code right here, but the editor won't take my code in its entirety.
My basic theory of operation goes as follows. There is one 'line buffer' that holds the pixel data for one horizontal line. This buffer is 80 longs in size (16bpp x (320 pixels / 2) / 32) and this resides in main RAM. There is a single cog that reads the data from this buffer and sends it to the display via 16 GPIO pins.
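For anyone checking the buffer size, here is a minimal Python sketch of the arithmetic in the formula above. The constant names are made up for illustration, and reading the "/ 2" as pixel doubling is an assumption based on the later remark that every two display pixels are treated as one.

```python
# Line-buffer sizing, following the formula quoted in the post:
# 16 bpp x (320 pixels / 2) / 32 bits per long.
BITS_PER_PIXEL = 16
PANEL_WIDTH = 320
PIXEL_DOUBLING = 2   # assumption: each buffered pixel covers two panel pixels
BITS_PER_LONG = 32

buffered_pixels = PANEL_WIDTH // PIXEL_DOUBLING
line_buffer_longs = BITS_PER_PIXEL * buffered_pixels // BITS_PER_LONG
pixels_per_long = BITS_PER_LONG // BITS_PER_PIXEL

print(f"{buffered_pixels} buffered pixels -> {line_buffer_longs} longs "
      f"({pixels_per_long} pixels per long)")
```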
Comments
in user_graphics_lines
you can save one instruction: is the same as let the compiler do it.
All of those one-line subs like 'cs_active':
don't do that. Think about the overhead: you execute 3 instructions (the call, the one useful instruction, and the ret) instead of one.
do not
do
And you will gain speed and also some cog memory.
Basically the same goes for those 3-liners. But there you just gain two instruction cycles and lose two longs of memory for each call.
Start with a list of the calls that occur least often in the source, and compare it to a mental list of the calls that run most often at run time.
inline the calls until you run out of cog memory...
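A rough way to put numbers on that advice, as a Python sketch rather than anything Prop-specific. The cost model is an assumption: a called sub costs its body plus one ret plus one call per call site, and every executed call wastes 8 cycles (call + ret). Only 'cs_active' comes from the posted driver; the other sub names, the call counts, and the free-memory figure are invented for illustration.

```python
CALL_RET_CYCLES = 8  # 4-cycle call + 4-cycle ret saved per executed call

def inline_cost_longs(body_longs, call_sites):
    """Extra cog longs needed if a sub is inlined at every call site."""
    called = body_longs + 1 + call_sites   # body + ret + one call per site
    inlined = body_longs * call_sites      # body repeated at each site
    return inlined - called                # negative means memory is saved

def pick_inlines(subs, free_longs):
    """Greedy pass: inline the most frequently executed subs first."""
    chosen = []
    for name, body, sites, runs in sorted(subs, key=lambda s: s[3], reverse=True):
        cost = inline_cost_longs(body, sites)
        if cost <= free_longs:
            free_longs -= cost
            chosen.append(name)
    return chosen, free_longs

# (name, body length in longs, call sites in source, executions per frame)
subs = [
    ("cs_active", 1, 4, 38400),             # one-liner from the driver; counts invented
    ("hypothetical_send_cmd", 3, 10, 500),
    ("hypothetical_set_window", 3, 2, 240),
]
chosen, left = pick_inlines(subs, free_longs=10)
saved = sum(r * CALL_RET_CYCLES for n, _, _, r in subs if n in chosen)
print(chosen, left, saved)
```

With this model a one-liner always comes out ahead on both speed and memory, while a 3-liner costs two extra longs per call site, which matches the rule of thumb above.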
Enjoy!
Mike
forums.parallax.com/discussion/115518/new-4-3-touchscreen-lcd-for-propeller-used-screens-almost-free-w-purchase/p1
How will you fit 320x240 pixels with 16bits per pixel into the RAM of the Propeller?
Andy
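To make the size problem concrete, here is a quick Python sketch comparing a full 16bpp frame against the Propeller 1's 32 KB of hub RAM. The variable names are just for illustration.

```python
# Full-frame buffer size versus Propeller 1 hub RAM.
WIDTH, HEIGHT = 320, 240
BYTES_PER_PIXEL = 2            # 16 bits per pixel
HUB_RAM_BYTES = 32 * 1024      # Propeller 1 hub RAM

framebuffer_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
ratio = framebuffer_bytes / HUB_RAM_BYTES

print(f"{framebuffer_bytes} bytes needed, {ratio:.2f}x the hub RAM")
```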
60 frames a second requires one frame every 16.67 ms
240 lines per frame is one line every 69.44 µs
320 pixels per line is one pixel every 217.01 ns
This is without subtracting the time taken for vertical and horizontal sync which will make the time per pixel even less.
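The same budget as a small Python sketch, in case anyone wants to try other frame rates or resolutions. Like the figures above, it ignores sync and blanking time; the function name is made up.

```python
def timing_budget(fps, lines, pixels_per_line):
    """Return (frame time in ms, line time in us, pixel time in ns)."""
    frame_s = 1.0 / fps
    line_s = frame_s / lines
    pixel_s = line_s / pixels_per_line
    return frame_s * 1e3, line_s * 1e6, pixel_s * 1e9

frame_ms, line_us, pixel_ns = timing_budget(60, 240, 320)
print(f"{frame_ms:.2f} ms/frame, {line_us:.2f} us/line, {pixel_ns:.2f} ns/pixel")
```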
A read from hub RAM takes 8 to 23 clock cycles, and most other instructions take 4 cycles. At 100 MHz each cycle takes 10 ns. A cog can access the hub once every 16 system clock cycles, so in the best case there is only time for one hub access (8 x 10 ns = 80 ns) and two 4-cycle instructions (2 x 4 x 10 ns = 80 ns). Any more instructions would mean missing the next hub access window.
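Here is a simplified Python model of that hub-window effect. It assumes the hub read always takes its best-case 8 cycles and that each loop pass rounds up to a whole number of 16-cycle hub windows; real code can do worse. The function and constant names are illustrative.

```python
import math

CLOCK_HZ = 100_000_000
HUB_WINDOW_CYCLES = 16     # a cog gets one hub slot every 16 clocks
HUB_READ_MIN_CYCLES = 8    # best-case hub read
INSTR_CYCLES = 4           # most other instructions

cycle_ns = 1e9 / CLOCK_HZ

def loop_cycles(extra_instructions):
    """Cycles per loop pass: one hub read plus N ordinary instructions,
    rounded up to the next hub window (simplified model)."""
    busy = HUB_READ_MIN_CYCLES + INSTR_CYCLES * extra_instructions
    return HUB_WINDOW_CYCLES * math.ceil(busy / HUB_WINDOW_CYCLES)

for n in (2, 3):
    cycles = loop_cycles(n)
    print(f"{n} extra instructions: {cycles} cycles = {cycles * cycle_ns:.0f} ns")
```

In this model a third ordinary instruction pushes the loop from 16 to 32 cycles, because the next hub read has to wait for the following window.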
If I read the source correctly, the display has its own controller and you are not driving it directly, right?
If so, take a look at the attached source code. I wrote a driver for a display using the SSD1289 controller; it is capable of 60fps (well, nearly: 59.9fps) using 4 scanline buffers and some hardware tricks. The R/S line is inverted so you can load 16 bits of data directly from hub to OUTA (otherwise you have to set the R/S bit each time). This also has the side effect of driving the WR line low, so you just have to drive it back HIGH to complete the write cycle. It doesn't use the CS line; the display is always selected. A bit of loop unrolling is also necessary because of the hub access window.
Hope this helps.
Good catch on the subroutines. Every call is basically 8 cycles since every call is also followed by a ret. Every nanosecond counts.
Thanks for the list of resources. Although the display does indeed have a resolution of 320x240, I'm not doing bitmapped graphics here (maybe I should have stated that before...)
Also, every two pixels on the display are being treated as one, making the effective resolution 160x120. I only need roughly 2.5KB to do 300 8x8 tiles.
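For reference, the tile count works out as in the Python sketch below. Treating "2.5KB" as 2560 bytes is an assumption, and the per-tile figure is only the budget that total implies, not a statement about how the tiles are actually stored.

```python
# Tile grid for the effective 160x120 resolution with 8x8 tiles.
EFFECTIVE_W, EFFECTIVE_H = 160, 120
TILE_SIZE = 8
BUDGET_BYTES = 2560   # assumption: "roughly 2.5KB" taken as 2.5 * 1024

tiles_across = EFFECTIVE_W // TILE_SIZE
tiles_down = EFFECTIVE_H // TILE_SIZE
tile_count = tiles_across * tiles_down
bytes_per_tile = BUDGET_BYTES / tile_count

print(f"{tiles_across} x {tiles_down} = {tile_count} tiles, "
      f"about {bytes_per_tile:.1f} bytes each")
```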
Thanks for mentioning the on-chip video generator; you bring up a good point. I have a follow-up question: I assume the video generator and the cogs are driven by the same clock (external osc. + PLL). How is it that the generator can do this job any faster than the cog? Is it because the generator is dedicated whereas cog is doing much more than outputting the video stream?
Thanks a bunch for the code; I hope I can adapt it to my display driver. Loading 16 bits from hub to outa? That's pretty impressive.
Thanks for the resources. Just as a clarification, if I go the non-video generator route, is it mainly waiting for hub access that slows down pixel transfer? If this is the case, maybe the Prop isn't the best tool for this; I really need the 16-bit color depth. On average, how does the bandwidth of PICs compare to the Prop (obviously, each PIC model is different)?
One is SSD1928 and another is FT800.
I can't believe what I'm seeing...I'm not questioning it either.
By unrolling the pixel transfer loop and replacing all of my calls with inlines, I managed to shave off enough cog time to cleanly do 40fps. There's gotta be something else I can do...
I'm already overclocked at 96MHz. I wonder if that extra 4MHz is enough to get it over the hump; I don't have a 6.25MHz crystal on hand, though.
I'm so close to 60, but if all else fails, I'll settle for 50 since that is the PAL refresh rate.
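A back-of-the-envelope check on the crystal idea, as a Python sketch. It assumes the frame rate scales linearly with the system clock, which is only true if the driver is purely cog-bound, and it uses the 16x PLL setting. The function names are made up.

```python
PLL_MULTIPLIER = 16          # Propeller 1 PLL16X
CURRENT_CLOCK_MHZ = 96.0     # 6 MHz crystal x 16
CURRENT_FPS = 40.0           # measured after unrolling and inlining

def fps_at(clock_mhz):
    """Frame rate at another clock, assuming linear scaling (an assumption)."""
    return CURRENT_FPS * clock_mhz / CURRENT_CLOCK_MHZ

def clock_needed_mhz(target_fps):
    """Clock required for a target frame rate under the same assumption."""
    return CURRENT_CLOCK_MHZ * target_fps / CURRENT_FPS

print(f"6.25 MHz crystal -> {6.25 * PLL_MULTIPLIER:.0f} MHz -> {fps_at(100.0):.1f} fps")
print(f"50 fps needs about {clock_needed_mhz(50):.0f} MHz")
print(f"60 fps needs about {clock_needed_mhz(60):.0f} MHz")
```

Under that assumption the extra 4MHz buys less than 2fps, so the remaining gain would have to come from the transfer loop itself rather than from the crystal.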
I suppose the other question is: can you organize your screen data differently to help improve the transfer?