Propeller C - Driving 8 bit parallel screen slowly?

Wallaby · 2018-07-17 12:46

Hello,

I managed to get an ILI9341 320x240 screen working. I used an example here to drive it over SPI and found it too slow. There wasn't an 8-bit parallel example, so I wrote my own in C. Running at 80 MHz, I had absolutely no doubt it could drive that easily at 60 Hz. But it turns out it's nowhere near 60 Hz. I expected to update every pixel 60 times a second with little issue.

So, while I'm proud of my achievement updating pixels on this display (first hardware project) - I'm a little disappointed PASM SPI was faster.

I'm not accessing memory or anything, simply setting the pixel color with a constant. It can't be any faster, as far as I know. Was I barking up the wrong tree with C for this?

Thanks!

twm47099 · 2018-07-17 13:30

Are you using the Simpletools library and CMM (compact memory model)? Each can be slow. If you have enough memory available, try compiling with LMM. It is faster. If that is still not fast enough or you run out of memory, try using the functions and aliases in the propeller.h library. They are a little more complex to use unless you are used to Spin, but much faster than the functions in the Simpletools library.

If that is still not fast enough, I wrote a C library of SPI functions that uses the PASM SPI driver from the PropTool examples. The thread below details how that was developed and has the library.

Tom

https://forums.parallax.com/discussion/157441/can-spi-in-simple-libraries-be-speeded-up

DavidZemon · 2018-07-17 14:29

Unfortunately, there aren't many SPI drivers in C/C++. libpropeller has at least one because I know it uses it for the SD card, but it appears to be built into the SD card driver so I'm not sure it's very suitable for use elsewhere. I wrote one in PropWare, and it's very fast, but it will take some work to PropWare-ify your code if you want to use it: https://github.com/parallaxinc/PropWare/blob/develop/PropWare/serial/spi/spi.h

Wallaby · 2018-07-17 23:52

Thank you for the replies. I am using the SimpleTools library, but I manually set the output state with OUTA from the simpletools.h file. The screen is updating faster, but it's still orders of magnitude too slow.

Here is the code I'm using to write to the display:

void output_byte(unsigned int pattern)
{  
  pattern = pattern << 8; //IO pins start at 8.
 
  pattern |= rdMask;
  pattern |= resetMask;   
  pattern |= cdMask; //Data = 1, Command = 0  
   
  OUTA = pattern;
  
  OUTA |= wrMask; //Write data
  
  OUTA |= csMask; //CS inactive
}

It should be able to write 320x240 easily, shouldn't it? If each command took ~4 clock cycles, that's 4x7x2x320x240 = 8.6Mhz. Not even 10% of the cog's power?
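One way to trim the per-byte work in a routine like the one above is to build the constant control bits once and keep CS low for a whole burst. The following is only a sketch: the pin assignments are hypothetical (the post only says the data pins start at 8), `fill_pixels` is an invented name, and the `OUTA` stand-in exists purely so the code compiles off-target. It also assumes the display latches data on the rising edge of WR and tolerates CS being held low across several writes.

```c
#include <stdint.h>

#ifdef __PROPELLER__
#include <propeller.h>          /* real OUTA register on the Propeller */
#else
static volatile uint32_t OUTA;  /* host stand-in so the sketch compiles off-target */
#endif

/* Hypothetical pin assignment; the thread only states that data starts at pin 8. */
#define DATA_SHIFT 8
#define WR_MASK    (1u << 16)
#define CD_MASK    (1u << 17)
#define RD_MASK    (1u << 18)
#define RESET_MASK (1u << 19)
#define CS_MASK    (1u << 20)

/* Control bits that stay constant while streaming pixel data:
 * RD and RESET high (inactive), CD high (data), WR and CS low. */
#define DATA_IDLE  (RD_MASK | RESET_MASK | CD_MASK)

/* Fill `count` pixels with one 16-bit colour over the 8-bit bus.
 * CS stays low for the whole burst, so each byte costs two OUTA writes:
 * one to present the byte with WR low, one to raise WR (the latch edge). */
void fill_pixels(uint16_t colour, uint32_t count)
{
    const uint32_t hi = DATA_IDLE | ((uint32_t)(colour >> 8)   << DATA_SHIFT);
    const uint32_t lo = DATA_IDLE | ((uint32_t)(colour & 0xFF) << DATA_SHIFT);

    while (count--) {
        OUTA = hi;  OUTA = hi | WR_MASK;   /* high byte of the pixel */
        OUTA = lo;  OUTA = lo | WR_MASK;   /* low byte of the pixel  */
    }
    OUTA = lo | WR_MASK | CS_MASK;         /* deselect the display after the burst */
}
```

Calling `fill_pixels(0xF800, 320 * 240)` would fill the whole screen with one colour. This removes the shift and the three mask ORs from the inner loop, but the compiled code still runs through the LMM or CMM kernel, so it does not replace an assembly inner loop.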

twm47099 · 2018-07-18 00:26

Are you using CMM or LMM? I tested how long it took for a pin to transition from high to low using different languages and methods.

Using outa was the fastest in straight C. In CMM it took 145 clocks and in LMM it took 17 clocks. For comparison it took 6 clocks in PASM.

Tom

jmg · 2018-07-18 01:23

Wallaby wrote: »

..
I managed to get an ILI9341 320x240 screen working. I used an example here to drive it over SPI and found it too slow. There wasn't an 8-bit parallel example, so I wrote my own in C. Running at 80 MHz, I had absolutely no doubt it could drive that easily at 60 Hz. But it turns out it's nowhere near 60 Hz. I expected to update every pixel 60 times a second with little issue.

Depends what 'update' means here ?

Wallaby wrote: »

It should be able to write 320x240 easily, shouldn't it?
If each command took ~4 clock cycles, that's 4x7x2x320x240 = 8.6Mhz. Not even 10% of the cog's power?

There seems to be an extra x2 in there, but this does show your problem.
80M/(4*7*320*240) = 37.20238 Hz, (assumes each source line is 1 PASM line of 4 sysclks). That rate is a screen-fill speed, not write of useful information.

However, your code does not loop, so you need to add call overhead as well.
If your call overhead is also (say) 7 lines of PASM equivalent, you have an update rate of 18.6 Hz; (say) 14 lines of PASM equivalent gets you down to a 12.4 Hz refresh speed.
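The estimates above can be reproduced with a few lines of C. This is a minimal sketch of the same model, assuming (as the post does) that every source line costs one PASM instruction of 4 system clocks; the function name `refresh_hz` and its parameters are made up for illustration.

```c
/* Estimated full-screen refresh rate in Hz.
 * Model (an assumption taken from the post above): every source line
 * costs one PASM instruction of `clocks_per_instr` system clocks. */
double refresh_hz(double sysclk_hz, int clocks_per_instr,
                  int instr_per_write, int writes_per_pixel,
                  int width, int height)
{
    double clocks_per_frame = (double)clocks_per_instr * instr_per_write
                            * writes_per_pixel * width * height;
    return sysclk_hz / clocks_per_frame;
}
```

With one write per pixel, `refresh_hz(80e6, 4, 7, 1, 320, 240)` gives about 37.2 Hz, 14 instructions per write gives about 18.6 Hz, and 21 gives about 12.4 Hz. Setting `writes_per_pixel` to 2 for a 16-bit pixel on the 8-bit bus halves each figure.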

DavidZemon · 2018-07-18 02:10

Wallaby wrote: »
Thank you for the replies. I am using the SimpleTools library, but manually set the output state with OUTA from the simpletools.h file. The screen is updating faster, but it's still magnitudes too slow.

Here is the code I'm using to write to the display:
void output_byte(unsigned int pattern)
{  
  pattern = pattern << 8; //IO pins start at 8.
 
  pattern |= rdMask;
  pattern |= resetMask;   
  pattern |= cdMask; //Data = 1, Command = 0  
   
  OUTA = pattern;
  
  OUTA |= wrMask; //Write data
  
  OUTA |= csMask; //CS inactive
}
It should be able to write 320x240 easily, shouldn't it? If each command took ~4 clock cycles, that's 4x7x2x320x240 = 8.6Mhz. Not even 10% of the cog's power?

I don't understand your math at all, and I think you're missing some key parts. 80 MHz is the clock rate of the Propeller. It takes 4 clocks to execute a single instruction, which means it can only execute 20M instructions per second, or 20 MIPS. Each line of C in your above function compiles to multiple op codes, and when compiled with LMM it actually takes four PASM instructions for each op code because it has to fetch the op code from HUB RAM. That's why you're not seeing anything remotely close to 8.6 MHz with your above function.

To get the best speed, you're going to need to use FCache. Take a look at how I implemented an SPI block write function with FCache and inline assembly here: https://github.com/parallaxinc/PropWare/blob/develop/PropWare/serial/spi/spi.h#L270. Note that there are three preprocessor definitions used by that function which are defined here: https://github.com/parallaxinc/PropWare/blob/develop/PropWare/PropWare.h#L69

Wallaby · 2018-07-18 03:55

Using outa was the fastest in straight C. In CMM it took 145 clocks and in LMM it took 17 clocks. For comparison it took 6 clocks in PASM.

Ah, I see. I thought that C was just compiled down to PASM.

I found the memory management in SimpleIDE, but the window won't resize properly, so it was hard for me to navigate. I was able to change it from CMM to LMM and it was a noticeable improvement - still too slow - but getting there! COG RAM won't compile with simpletools.h, so maybe I will write my own routines. It's only setting pins right now, so it should fit into COG RAM without simpletools.

There seems to be an extra x2 in there, but this does show your problem.
80M/(4*7*320*240) = 37.20238 Hz, (assumes each source line is 1 PASM line of 4 sysclks). That rate is a screen-fill speed, not write of useful information.

However, your code does not loop, so you need to add call overhead as well.
If your call overhead is also (say) 7 lines of PASM equivalent, you have an update rate of 18.6 Hz; (say) 14 lines of PASM equivalent gets you down to a 12.4 Hz refresh speed.

Yes, my math was probably wrong. I was just trying to get an estimate. I feel that the Prop should be more than fast enough to drive this display but I'm very new to hardware. Maybe it can't.

The extra x2 was because it's a 16-bit display and requires two writes to fill one pixel. I'd use 8-bit colors if I could, but the display doesn't support it.

Is it possible to divide up the work between multiple cogs? Let's say in a perfect world one COG could write 12.4Hz, could I interlace the screen x6? It seems to me that the output pins would get in the way of each other?

To get the best speed, you're going to need to use FCache.

I'll try it. Thanks!

jmg · 2018-07-18 06:11

Wallaby wrote: »

The extra x2 was because it's a 16-bit display and requires two writes to fill one pixel. I'd use 8-bit colors if I could, but the display doesn't support it.

Is it possible to divide up the work between multiple cogs? Let's say in a perfect world one COG could write 12.4Hz, could I interlace the screen x6? It seems to me that the output pins would get in the way of each other?

This thread may help - some measurements there
https://forums.parallax.com/discussion/154703/read-bmp-image-from-sd-to-display-ili9341-done-in-spin-but-very-slow

Your inner most loops will need to be assembler, but you can run different languages in different COGS

There is also a cog-mode in PropC, which is for small-but-fast stuff.
See this thread & generated PASM code
https://forums.parallax.com/discussion/comment/1325462/#Comment_1325462
and same thread compares PropC and PropBASIC code
https://forums.parallax.com/discussion/comment/1325549/#Comment_1325549

Wallaby · 2018-07-18 07:31

I put the draw code into a loop and toggled the write as fast as possible. There is a full screen update in a little less than half a second, but I don't think any further optimization is going to put it where I was hoping.

Is there a way to use the built-in VGA generation on the COG to drive this display? I felt the VGA implementation was too restrictive, but maybe if it could generate a 320x240 screen instead of a 640x480 screen it might be worth testing. Still, I think with an 8-bit data bus, it's never going to be fast enough.

Even toggling the write as fast as possible isn't enough.

The display has a RGB mode where it can use VSYNC and HSYNC but it's not broken out on the board. Even with that, I'd be limited to 2 or 4 color modes.

I'll see what else I can do with the Prop.

Peter Jakacki · 2018-07-18 08:29

I really don't see how you can ever get a fast "update" since you need to write 153.6kB. However "update" is the wrong term, isn't it? I mean if you had 512kB of RAM like the P2 you could render to a frame buffer and then "update" the display's internal memory.

I have interfaced to a 320x240 24-bit TFT that has iirc an SSD1963 SSD2119 internal display memory via SPI and I don't have a problem with writing to it. Of course I'm using Tachyon Forth where I have very fast SPI instructions so both that and the execution speed certainly help. BTW, I would draw variable sized fonts and lines and rectangles etc. The slowest operation will always be clearing or filling that display memory but drawing should not be such a problem.

Rayman · 2018-07-18 13:48

I think you need an assembly driver and 8 or 16 bit interface to update the screen at anything like video speed.
If you search for PSM, you might find my old 320x240 driver for 8-bit interface with some kind of ILI chip.
It is Spin and PASM, but you can probably modify it for your display...

jmg · 2018-07-19 02:59

Wallaby wrote: »

... I felt the VGA implementation was too restrictive, but maybe if it could generate a 320x240 screen instead of a 640x480 screen it might be worth testing. Still, I think with an 8-bit data bus, it's never going to be fast enough...

You could also look at this thread :
https://forums.parallax.com/discussion/168553/newhaven-matrix-orbital-display-with-on-board-ftdi-ft81x-embedded-video-engine/p1

The problem is not just the bus, it's also the shuffling of pixels needed, as that 320x240 is going to push the Prop RAM - e.g. even a 4bpp image plane needs 38.4 kBytes.
You can improve the bandwidth with external help.

Working up the price curve, looking for x16 memory, (to use as palette LUT) finds
( You need to pre-program that LUT, either externally, or using something like PCA6416A 16b i2c io, 62c/1k or the N76E003AT20 can do init loading.)

• Flash: SST39LF402C-55-4C-EKE, 88c/25+ - quite cheap at 4Mb; these now seem to be the lowest-price part of the curve.
  - You only use a small portion of that, and with a palette pair you can then send just pixels from the P1.
• SDRAM: IS42S16100H-7TLI-TR, $1.64/1, or W9816G6JH-6, $1.09/1 - I think those can do a 1024 x 16 x 2 LUT, using column address only. As above, the P1 selects the LUT and then sends pixels to select FG/BG.
• SRAM: IS62WV12816EBLL-45TLI, $1.32/100+, 2Mb - this part can either be a simple LUT or, at 128k pixels, it is large enough to swallow the whole 76800-pixel display.
• SRAM: IS61WV25616EDBLL-10TLI, $2.35/100, 4Mb, 10ns
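The memory saving behind the palette idea can be illustrated in software. The sketch below is not the external-memory circuit described above; it only shows, under assumed names (`frame_bytes`, `expand_4bpp`) and an assumed nibble order, how 4-bit indices map to 16-bit colours through a 16-entry table and how much RAM a full frame takes at each depth.

```c
#include <stdint.h>
#include <stddef.h>

#define WIDTH  320
#define HEIGHT 240

/* Bytes needed for one full frame at a given colour depth. */
size_t frame_bytes(unsigned bits_per_pixel)
{
    return (size_t)WIDTH * HEIGHT * bits_per_pixel / 8;
}

/* Expand packed 4bpp palette indices (high nibble first, an assumption)
 * into 16-bit colours via a 16-entry lookup table. */
void expand_4bpp(const uint8_t *packed, size_t n_pixels,
                 const uint16_t palette[16], uint16_t *out)
{
    for (size_t i = 0; i < n_pixels; i++) {
        uint8_t byte  = packed[i / 2];
        uint8_t index = (i % 2 == 0) ? (uint8_t)(byte >> 4)
                                     : (uint8_t)(byte & 0x0F);
        out[i] = palette[index];
    }
}
```

`frame_bytes(4)` is 38400 and `frame_bytes(16)` is 153600, which matches the 38.4 kByte and 153.6 kB figures quoted in the thread. Doing the lookup in an external x16 memory, as suggested above, moves the expansion off the Propeller so it only has to send indices.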

Wallaby · 2018-07-19 07:48

Wow! Very helpful forum!

Working up the price curve, looking for x16 memory, (to use as palette LUT) finds
( You need to pre-program that LUT, either externally, or using something like PCA6416A 16b i2c io, 62c/1k or the N76E003AT20 can do init loading.)

A palette lookup table is a good idea. I understand that would greatly lower the memory cost of storing a frame in memory. And I could use that external RAM to store a frame buffer. How do I get the pixel data to the display fast enough, though? I want to target a 60 Hz refresh. I could try a 16-bit parallel bus or a lower resolution screen, I guess. Something like 128x128, possibly. That's 1989 GameBoy resolution, though, and I'm not sure if it's enough.

I figured I could use the Prop for a low cost portable game console. I'm a game developer by trade, but love the idea of hardware and only just learning. My initial idea was to see how many pixels I could drive with the Prop and design a portable around the screen. Because I'm new to hardware, I thought I'd have tons of horsepower to drive the screen and could design the rest of the system with the extra cores.

Peter Jakacki · 2018-07-19 08:07

60Hz is only possible if you are actually just "updating" part of memory, as in the case of some sprites. The P1 does not have enough RAM to hold the frame buffer so the only time you would ever write to the full 154kB of display memory is when you are clearing or filling the screen.
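A partial update of this kind usually means setting an address window on the controller and then streaming only the pixels inside it. The sketch below builds the byte sequence for an ILI9341 window using the Column Address Set (0x2A), Page Address Set (0x2B) and Memory Write (0x2C) commands; the function names and buffer layout are hypothetical, and the buffer does not encode which bytes are commands and which are parameters.

```c
#include <stdint.h>
#include <stddef.h>

/* ILI9341 command codes. */
#define CMD_COLUMN_ADDR 0x2A
#define CMD_PAGE_ADDR   0x2B
#define CMD_MEM_WRITE   0x2C

/* Build the 11-byte sequence that selects a w x h window at (x0, y0).
 * Bytes 0, 5 and 10 are commands (D/C low); the rest are parameters. */
size_t build_window(uint8_t buf[11], uint16_t x0, uint16_t y0,
                    uint16_t w, uint16_t h)
{
    uint16_t x1 = (uint16_t)(x0 + w - 1);   /* inclusive end column */
    uint16_t y1 = (uint16_t)(y0 + h - 1);   /* inclusive end row    */
    size_t n = 0;

    buf[n++] = CMD_COLUMN_ADDR;
    buf[n++] = (uint8_t)(x0 >> 8);  buf[n++] = (uint8_t)(x0 & 0xFF);
    buf[n++] = (uint8_t)(x1 >> 8);  buf[n++] = (uint8_t)(x1 & 0xFF);
    buf[n++] = CMD_PAGE_ADDR;
    buf[n++] = (uint8_t)(y0 >> 8);  buf[n++] = (uint8_t)(y0 & 0xFF);
    buf[n++] = (uint8_t)(y1 >> 8);  buf[n++] = (uint8_t)(y1 & 0xFF);
    buf[n++] = CMD_MEM_WRITE;
    return n;
}

/* Pixel payload in bytes for a window at 16 bits per pixel. */
size_t window_bytes(uint16_t w, uint16_t h)
{
    return (size_t)w * h * 2;
}
```

A 16x16 sprite needs `window_bytes(16, 16)` = 512 bytes plus the 11 setup bytes, compared with 153600 bytes for the full screen, which is why updating a few sprites per frame is far more realistic than refilling everything.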

However we are expecting the P2 to sample later this year which has 512kB of RAM (just for starters) and is a whole lot faster and can stream data efficiently to the display. Perhaps you would like to hold off and be one of the first to use this powerful new Propeller chip for a portable game console?

See my sig for links.

macca · 2018-07-19 09:17

Wallaby wrote: »

I figured I could use the Prop for a low cost portable game console. I'm a game developer by trade, but love the idea of hardware and only just learning. My initial idea was to see how many pixels I could drive with the Prop and design a portable around the screen. Because I'm new to hardware, I thought I'd have tons of horsepower to drive the screen and could design the rest of the system with the extra cores.

Take a look at this:
https://dev.maccasoft.com/propgame/

There is a portable version prototype that uses one of those 320x240 LCD displays with an SD card and a touch panel, with SSD1289 or ILI9341 drivers. The PASM code can update the screen at 60 Hz using a 16-bit data bus and a small hardware trick (inverting the data/command line logic) to minimize pin toggles.

Source code is here:
https://dev.maccasoft.com/propgame/browser/trunk/libraries/lcd/scanline_driver.s

Wallaby · 2018-07-19 09:30

I like the sound of the P2.

Updating only part of the screen is a fair compromise, but I'm worried the logic behind it would take a lot of work. For example, with something like a 3D object.

I'll try simulating a 2D sprite and see how the refresh is.

Wallaby · 2018-07-19 09:44

There is a portable version prototype that uses one of those 320x240 LCD displays with an SD card and a touch panel, with SSD1289 or ILI9341 drivers. The PASM code can update the screen at 60 Hz using a 16-bit data bus and a small hardware trick (inverting the data/command line logic) to minimize pin toggles.

Interesting. So it can actually get there with a 16 bit data bus and PASM.

How do you invert the command/data bit? After the signal leaves the Prop? I can see that inverting that bit makes a lot of sense, because you're usually sending many more data writes than commands, and not having to set it to 1 on every write would save some time.

Any idea what the maximum resolution is for 60 Hz?

macca · 2018-07-19 09:59

Wallaby wrote: »

How do you invert the command/data bit? After the signal leaves the Prop? I can see that inverting that bit makes a lot of sense, because you're usually sending many more data writes than commands, and not having to set it to 1 on every write would save some time.

There is a 7404 inverter between the Propeller and the LCD. The full schematic is here:
https://dev.maccasoft.com/propgame/wiki/Doc/PortableConsoleSchematic

Any idea what the maximum resolution is for 60hz?

320x240 is the maximum resolution. There isn't much room to improve things, and as you can see I had to unroll most of the loop to keep the hub reads synchronized and achieve the maximum speed. The code has some delays, used to make it behave like a CRT screen, that can be removed; the refresh may then be increased to 65 Hz or so (I don't remember exactly what the limits are), but that doesn't leave much for additional resolution. Lowering the refresh to 50 Hz may bring some gains, depending on what you want to do.

Wallaby · 2018-07-19 11:15

Yes, I saw how you unrolled the loop. The code is very clean! I'm surprised you have enough time to read from the hub ram and output it.

The cartridge ROM / RAM is generous too.

Is it feasible to use a bluetooth controller instead of building your own?

macca · 2018-07-19 13:00

Wallaby wrote: »

Yes, I saw how you unrolled the loop. The code is very clean! I'm surprised you have enough time to read from the hub ram and output it.

Hub reads are synchronized to minimize the wasted clock cycles. rdword reads directly into OUTA, saving some clocks (that's why I need to invert the data/command bit: rdword clears bits 16-31, so those lines must be 0 in data mode), and the other two instructions use the hub wait time so that the next rdword is synchronized.
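Both points can be checked with a little arithmetic. This sketch assumes that a cog's hub access window comes around once every 16 system clocks on the Propeller 1 and that one rdword plus two ordinary instructions fit in a single window, so one 16-bit pixel is emitted per window. The function names, and the idea that the data/command line sits on pin 16, are hypothetical.

```c
#include <stdint.h>

/* What a 32-bit register holds after a 16-bit hub read:
 * the word is zero-extended, so bits 16-31 are always 0. */
uint32_t rdword_result(uint16_t hub_word)
{
    return (uint32_t)hub_word;
}

/* Refresh rate when exactly one pixel is written per hub window. */
double hub_synced_refresh_hz(double sysclk_hz, int clocks_per_hub_window,
                             int width, int height)
{
    return sysclk_hz / ((double)clocks_per_hub_window * width * height);
}
```

`hub_synced_refresh_hz(80e6, 16, 320, 240)` comes out at roughly 65.1 Hz, which lines up with the "65 Hz or so" ceiling mentioned earlier. `rdword_result(0xFFFF) & (1u << 16)` is always 0, so a control line on pin 16 or above is forced low on every pixel write; with the inverter in place, that low level is what the display sees as "data".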

The cartridge ROM / RAM is generous too.

Is it feasible to use a bluetooth controller instead of building your own?

I think you are now referring to the "classic" console. Bluetooth usually needs a USB receiver, there is a USB host stack for the Propeller 1 with bluetooth support here:
https://github.com/SaucySoliton/propeller-usb-host

I have never used it for that, but maybe it works.
