Radio Shack LCD: C vs Spin vs +PASM grudge match
localroger
In the discussion of the Radio Shack touchscreen LCD module, Jazzed very helpfully ported the Arduino code to C for the Propeller ASC, as seen here:
http://forums.parallax.com/showthread.php/157141-2.8-quot-TFT-Touch-Screen?p=1290310&viewfull=1#post1290310
This is actually an interesting demo to use to compare development languages. It's small enough not to be hindered by the Prop's memory size, and doesn't especially benefit from any of the Prop's special features; it's basically shoveling bits out of I/O pins as fast as possible.
Working mostly line for line I translated the application to pure Spin:
SS_TFT.spin
That pretty much worked the first time. I then added pretty much the simplest possible PASM helpers. You wouldn't think PASM would buy you much in this application because the TFT display has a parallel interface; there are no SPI or IIC loops serializing the data, and it takes 12 I/O lines. Yet you would be wrong.
SS_TFT_PASM.spin
The C version is compiled to LMM because, as Jazzed warns, its performance is terrible in bytecode. In LMM it takes 2 seconds to clear the screen and about 1.5 seconds to draw three progressively larger lines of "Happy!"
It also takes 8772 bytes of Hub RAM. And the distribution necessary to make sure I get all the files needed to build it is a 2.6 megabyte folder of 299 files.
The pure Spin version is very slow; it takes 20 seconds to clear the screen, but draws the text in a very competitive 2 seconds. But it's also a single self-contained file and only takes 2,500 bytes of Hub RAM.
But look at the PASM-helped version! The amount of PASM is really modest and only increases the Hub RAM usage to 2640 bytes while making minimal changes to the Spin logic. It clears the screen in 2.5 seconds and draws the text in about a quarter of a second.
Edit: Adding just a few more longs of PASM blows out the jams. The whole demo runs in under 0.2 sec: SS_TFT_PASMX.spin
So suppose we resort to C bytecode? The Arduino code compiled to CMM takes 9 seconds to clear the screen and 6 seconds to draw the three iterations of "Happy!" struggling very noticeably on the third and largest line. And it still takes 5644 bytes of Hub RAM.
So what the hell is going on here?
A lot of C code is informed by the idea that calls are expensive, which is why you get things like this hidden in the .h file:
#define WR_HIGH {PORT_WR|=WR_BIT;}
#define WR_LOW {PORT_WR&=~WR_BIT;}
Those are macros. It's bothersome enough that you have to go snorkeling in the .h file to find out what the hell WR_HIGH does when you see it in the main source file, but what's even more bothersome is that every time you use it it generates a little string of byte codes -- or worse LMM instructions. This technique very effectively hides from you just how much RAM you're burning with a sequence like
CS_LOW; RS_HIGH; RD_HIGH; WR_LOW;
The macro actually isn't a bad idea in a 32-bit or 64-bit system where the cost of a CALL can be 5 or 9 bytes, but in embedded byte code it kills you.
There's also this massive eater of RAM...
sendCommand(0x0001);
sendData(0x0100);
sendCommand(0x0002);
sendData(0x0700);
//...repeat and repeat and repeat
In byte code each of those command-data pairs eats at least 10 bytes: Two call tokens with 16-bit arguments, and two push tokens with an 8 and a 16 bit argument. And if you're wondering why they didn't build a bloody DAT table like I did in the conversion for the startup magic number spray, just look at the syntastic gymnastics necessary to encode the font that way.
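For comparison, here is a minimal sketch of what a table-driven startup sequence could look like on the C side. It is not the code from the sketch or from Jazzed's port: InitPair, init_seq, sendInitSequence and the logging stand-ins for sendCommand()/sendData() are made up for illustration, and only the two register/value pairs quoted above are filled in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the driver's sendCommand()/sendData(). They only record what
   would have gone to the display, so the sketch builds on any C compiler. */
#define LOG_MAX 16
uint16_t sent_log[LOG_MAX];
size_t sent_count = 0;

void sendCommand(uint16_t index)
{
    assert(sent_count < LOG_MAX);
    sent_log[sent_count++] = index;
}

void sendData(uint16_t data)
{
    assert(sent_count < LOG_MAX);
    sent_log[sent_count++] = data;
}

/* One register/value pair of the startup sequence (hypothetical type). */
typedef struct {
    uint16_t cmd;
    uint16_t data;
} InitPair;

/* Only the first two pairs are quoted in the post; the real list is longer. */
const InitPair init_seq[] = {
    { 0x0001, 0x0100 },
    { 0x0002, 0x0700 },
};

#define INIT_COUNT (sizeof init_seq / sizeof init_seq[0])

/* One loop with one pair of call sites replaces the long run of calls. */
void sendInitSequence(void)
{
    for (size_t i = 0; i < INIT_COUNT; i++) {
        sendCommand(init_seq[i].cmd);
        sendData(init_seq[i].data);
    }
}
```

With two 16-bit fields, each pair costs 4 bytes of table data, and the two calls are emitted once in the loop rather than once per pair. Whether that wins overall in CMM or LMM would have to be measured.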
There is also the matter of the impressively execrable (I can't think of a more descriptive word that's OK in PG-13 land here) performance of the routine that draws text characters. I can't even figure out exactly why it's so bad but it's really, really bad, even in LMM. I suspect there is a lot of hidden stack frame overhead or something in those for statements. In any case Spin does it a whole lot better, with nearly the same syntax.
Anyway, that's C vs. Spin vs. Spin with a teeny bit of helper PASM on the Propeller. My urge to get up to speed on the C side of things has wilted considerably.
Comments
In C you don't generally launch PASM helper cogs; you have inline PASM which is actually very cool but still LMM. If it's possible at all to launch helper cogs it doesn't seem to be the simple and baked-in thing it is with the PropTool. If someone who knows the C dev system wants to try it I'm all ears. (If you don't have a RS TFT board I'll try it on mine.) It seems that would just be adding another PASM image on top of the LMM or CMM interpreter and not correcting the fundamental resource usage problems that make the C code so inefficient at this scale.
In summary you have: I guess there are no surprises there.
Spin is a marvel of design. As well as being a simple and elegant language design, it gets compiled down to really small byte code programs, maximizing the amount of functionality that can be squeezed into such a small space as the Propeller. The byte code design itself is a gem; the required interpreter fits into the 496 instructions of a COG. Amazing! Try that with a Java run time or even the old Pascal p-code system. The cost for all this capability is speed. Spin is slow.
But wait. Spin makes it really easy to add assembler code to your program where needed for speed. And the Propeller architecture and PASM syntax are the simplest ways of working in assembler I have ever seen. It's not much harder to work in assembler on the Prop than it is to work in a high level language. It's even easier to write in PASM than Forth:)
So with the integration of all these elements, the Prop architecture, the Spin language, the interpreter, PASM and its seamless integration into Spin, we get the best of both worlds. Small code for the bulk of an application and fast code where needed. Brilliant!
Then there is C.
Years ago, before there was a C compiler for the Propeller, we used to discuss the possibility of having one.
Some of us argued it was a totally pointless exercise because:
1) C is normally compiled to native machine instructions. That makes no sense on the Prop because we can only run 496 instructions of native code. What use is that?
2) LMM gets you bigger code but at the cost of the huge size of those 32 bit instructions and the terrible slow down of fetching them into COG.
Could it be that you have just discovered that the naysayers were right?
On the other hand:
My Fast Fourier Transform exists in C, Spin and PASM. Amazingly it turns out that the C version is not much slower than the PASM version. That FCACHE mechanism really works well there. Sorry I don't have the performance figures to hand.
I have also written a C version of Full Duplex Serial that fits in the COG and manages 115200 baud.
So what's up with that LCD driver code?
Well why not?
Certainly you can launch C code into COGs. See my C version of FDS. Certainly that helper code can be written in assembler instead of C.
I did not really get your point about macros like WR_HIGH. Whenever you use WR_HIGH it will insert some code. After all, it has to read the port, read WR_BIT, OR them together and write the PORT. You can't expect the work to be done without some code being generated. How well that gets optimized is another story.
The intent here is that whatever code the macro inserts, in line, into your program is smaller and faster than making a call to do the same thing. What with the overheads of passing parameters, calling and returning etc.
If that is not the case using a macro is perhaps a poor choice.
In Spin we have things like INA, OUTA and DIRA which are not actually variables. They are features baked into the language which no doubt makes the resulting code very tight. This is not really going to happen in C. As a cross platform high level language C cannot have hardware dependent things like that built in.
I'm also pretty sure the Spin version is so slow to clear the screen because of all the subroutine calling overhead for those routines that were macros in the sketch.
But it's not really obvious what is happening when you use a macro, and even this demo warns in the C comments that one of the common Arduinos has the pins arranged awkwardly so that it runs slow. You also lose the ability to combine several of these statements into a single AND or OR if ordering is unimportant. I'm seriously tempted to do a search-and-replace manual macro substitution on the Spin version to see what that does to the performance and file size.
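To make the point about combining statements concrete, here is a small sketch. It assumes all four control lines live on the same port and that the order in which they change does not matter; the bit positions and the PORT variable are invented for the example and are not taken from TFT.h.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated output register. On real hardware this would be the port itself. */
uint32_t PORT = 0;

/* Hypothetical bit positions, chosen only for this illustration. */
#define CS_BIT (1u << 8)
#define RS_BIT (1u << 9)
#define WR_BIT (1u << 10)
#define RD_BIT (1u << 11)

/* Same shape as the macros in the .h file: one read-modify-write each. */
#define CS_LOW  { PORT &= ~CS_BIT; }
#define RS_HIGH { PORT |=  RS_BIT; }
#define RD_HIGH { PORT |=  RD_BIT; }
#define WR_LOW  { PORT &= ~WR_BIT; }

/* Four macro uses expand to four separate port updates. */
void bus_setup_macros(void)
{
    CS_LOW; RS_HIGH; RD_HIGH; WR_LOW;
}

/* The same end state with a single read-modify-write of the port. */
void bus_setup_combined(void)
{
    /* The set and clear masks must not overlap for the merge to be valid. */
    assert(((CS_BIT | WR_BIT) & (RS_BIT | RD_BIT)) == 0);
    PORT = (PORT & ~(CS_BIT | WR_BIT)) | (RS_BIT | RD_BIT);
}
```

Both functions leave the port in the same state. The difference is that the combined form changes all four lines at once, so it is only safe where the display does not care about the order of the edges.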
Up and running in about an hour. Very little pain for a C programmer.
Get it working first, then optimize as resources permit.
Actually doing such comparisons is very helpful. ;-)
Maybe the original author thought it was easier or clearer - both of which are often requirements for demos.
Fonts are often created with separate software packages.
How does the actual Arduino performance compare to what has been presented here?
That's a pretty good question. Maybe someone who has an actual Arduino can tell us.
Edit: Also, if someone hasn't already tested it by then I can time it on my Arduino
There is something wonky about the PASM clear screen code; it should be faster than LMM.
The PASM version is still using Spin to iterate across all the pixels. When I tried coding that loop in PASM I couldn't get it to work. All of the methods in the example are amenable to full PASM conversion, but that involves a lot more work, and it's frustrating to debug because the source is totally undocumented and doesn't work at all if any tiny thing is wrong.
Edit: And I mean "hack" in the good sense! :-)
I found some possible opportunities for optimization that I will try once I get mine.
If anyone else wants to give this a try in the meantime, I'm anxious to hear how it compares. I set the sendCommand and SendData methods as fcache so that might help a lot? maybe?
and jazzed said
I agree. The key is to break it up into manageable bits and gradually transfer things from C or Spin into pasm.
You should be able to improve on 20 seconds for a screen refresh. Using two external RAM chips and a tight PASM loop, I think we got it down to 30 milliseconds on this thread: http://forums.parallax.com/showthread.php/137266-Propeller-GUI-touchscreen-and-full-color-display/page9?highlight=touchscreen post #168
The catch there is it uses too many Propeller pins and too many external TTL chips. I'm working on an FPGA solution - there are some great solutions with a hybrid Propeller/FPGA.
SS_TFT_PASMR.spin
But that's just for the speed.
As to the grudge match?
How simple it was to add Simple_Numbers.spin, FullDuplexSerial.spin, a few lines of code,
and my little project is up and running.
edit:
At least it looks simple.
Having trouble with Simple_Numbers.
But I can probably deal with that.
The 299 files in the C demo - no way.
edit again:
Interesting...
Writing text that extends past the edge of the display seems to crash the display.
That was my Simple_Numbers problem.
I suppose the C is there for people that want to use it as a starting point; but Spin and PASM will continue to be around to speed things up or to reduce bloat when things don't work out..
And of course, Forth could do this nicely.. but will likely never have floating point.
Oh, and another comparison you might make is to compare the performance of C using -mcog mode with Spin. There you will find that C performs almost as well as PASM without having to resort to writing in assembly language. Spin can't touch that performance.
FillCircle also seems to crash it. Weird. It should be signal for signal the same as the slow Spin version, which works. And uncommenting the wait loops to make sure SendCommand and SendData are complete doesn't help.
That's true for LMM; my test shows that it's definitely not true for C byte code, which performs very noticeably more poorly than Spin in the character generator. And even giving it credit for the PASM image of the bytecode interpreter (can C reclaim this RAM after the interpreter starts?), the C byte code is still nearly twice the size of the Spin byte code.
And of course using PASM discards the single biggest advantage of C which is its platform agnosticism -- the fact that you can load this Arduino sketch on a Prop ASC and run it. Once you PASM it up (or even use inline assembly) you can't take the resulting project and run it on a regular Arduino any more. It's then just as Propcentric as a Spin project, and probably bigger, slower, and requiring a lot more files to archive.
Edit: Probably worth mentioning, one point where C excels and there isn't really any other similarly good solution is for large business logic that needs to be in XMM because it won't fit in Hub RAM, but doesn't need to be fast. I believe gcc is currently the best solution for this sort of thing and the miserable performance of the byte code interpreter doesn't matter when you're accepting miserable performance to load from a serialized XMM solution anyway. But you still need the option of speed for those functions that need it within a project.
This is not entirely correct. The idea of the C preprocessor makes lots of sense in this case. As you mentioned, the demo code is written to run on arduino or propeller. The only reason that is possible is because of the __PROPELLER__ defined symbol. For example, in TFT.h we have this.
There is nothing stopping us from using the same convention for in-line ASM in sendCommand() and sendData() ... except for desire, time, and ability.
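As a rough sketch of that convention (not the actual contents of TFT.h): the block below picks the port per target with __PROPELLER__. The WR_BIT position, fake_port and pulse_wr are invented for the example, and the Propeller branch assumes PropGCC's propeller.h provides OUTA. A Propeller-only fast path such as inline assembly would sit in that same branch.

```c
#include <assert.h>
#include <stdint.h>

#define WR_BIT (1u << 10)   /* hypothetical pin assignment */

#ifdef __PROPELLER__
/* Propeller build: assumed to come from PropGCC's propeller.h. */
#include <propeller.h>
#define PORT_WR OUTA
#else
/* Any other target: a plain variable stands in for the port register,
   which lets the same source build and run on a desktop compiler. */
uint32_t fake_port = 0;
#define PORT_WR fake_port
#endif

/* The macros themselves stay identical on every target. */
#define WR_HIGH { PORT_WR |=  WR_BIT; }
#define WR_LOW  { PORT_WR &= ~WR_BIT; }

/* Strobe WR low then high, as a write cycle would. */
void pulse_wr(void)
{
    WR_LOW;
    WR_HIGH;
    assert(PORT_WR & WR_BIT);   /* WR should idle high after the strobe */
}
```

The calling code never mentions the target. Only the few lines inside the #ifdef change, which is what keeps the rest of the sketch portable.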
I tried a few things to recover for the display crash, but nothing I did helped.
It was late. I'll play with it some more later.
But it's easy enough for now to avoid writing off the edge.
It's usable at this point..
Even if no one takes it any further, it's usable now.
There are things I'd like to see happen - SendRegister, for instance.
And if bounds checking could be done to prevent the display from crashing, that would be nice.
But if used carefully, at this point, it's working well enough to release.
And get a driver working for the ADC chip on the ASC board.
But it works right now as a color graphics display - in SPIN.
Probably need to set it up to run in another cog at some point.
An init call, stack, whatever, some way to return a Finished Flag to sync with?.
My test last night was a simple 1 second down-count timer display.
The paintscreenblack routine is so fast I used it as a CLS function between prints.
With the following added it looks like about 806 longs of code,
22 longs for variables.
Most Excellent Work!
As for the language wars?
This project has validated the cross-platform capability of C.
With a bit of work from an experienced coder the Arduino sketch could be made to run on a Propeller.
But it has also shown the down side as well.
The monster complexity of the Sketch, and lack of speed made it quite unattractive to anyone NOT an expert C coder.
Unusable might be said if the display speed is critical (and when isn't it?)
So therein lies the rub.
But not the end of the language wars, I'm sure...
Using no assembly, I got about the same results as Steve with his code (no surprise) - that is ~3 seconds to run the init function (I couldn't tell if you guys were timing the entire init function or just paintScreenBlack) and ~1.5 seconds to print all three lines of text.
However... when I copied over the PASM routines...
Now, I only bothered implementing sendCommand and sendData - not the multiSend or sendCmdSeq - but it dropped time to .811 and .325 seconds!
And personally, what I think is the best part - here's all the (C++) code that it took:
I will also admit, I had to remove "const" from the end of each method declaration in PropWare::SeeedTFT which actually increased the runtime of that class from 3/1.56 to 3/1.68. This could be remedied by using a mailbox that was in the global scope instead of a member variable though.
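For anyone who has not met the mailbox idea, here is a bare-bones sketch in plain C of a global mailbox with a finished flag. It is not the PropWare code and it does not launch a cog: Mailbox, mailbox_post, mailbox_busy and helper_poll_once are hypothetical names, the "bus" is just a log array, and the helper's loop body is called by hand so the example runs anywhere.

```c
#include <assert.h>
#include <stdint.h>

enum { MB_IDLE = 0, MB_COMMAND = 1, MB_DATA = 2 };

/* Global mailbox shared by the caller and the helper. The fields are
   volatile because on real hardware another cog would change them. */
typedef struct {
    volatile uint32_t request;   /* MB_IDLE means the helper has finished */
    volatile uint32_t value;     /* 16-bit payload for the display */
} Mailbox;

Mailbox g_mailbox = { MB_IDLE, 0 };

/* Log standing in for the display bus. */
#define BUS_LOG_MAX 8
uint32_t bus_log[BUS_LOG_MAX];
unsigned bus_count = 0;

/* Caller side: write the payload first, then the request that releases it. */
void mailbox_post(uint32_t request, uint16_t value)
{
    g_mailbox.value = value;
    g_mailbox.request = request;
}

/* Caller side: nonzero while the helper still owns the mailbox. */
int mailbox_busy(void)
{
    return g_mailbox.request != MB_IDLE;
}

/* One pass of the helper's polling loop. On a Propeller this body would
   sit inside an endless loop in its own cog. */
void helper_poll_once(void)
{
    uint32_t req = g_mailbox.request;
    if (req == MB_IDLE)
        return;                  /* nothing to do */
    assert(bus_count < BUS_LOG_MAX);
    bus_log[bus_count++] = (req << 16) | (g_mailbox.value & 0xFFFFu);
    g_mailbox.request = MB_IDLE; /* this is the finished flag */
}
```

The caller posts a request and then either waits on mailbox_busy() or goes off to do something else. Keeping the mailbox global rather than a member variable is the arrangement suggested above, since the methods then no longer need to modify the object.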