Speed of PropForth vs Spin
Don Pomplun
Just getting back into Forth and had a purely hardware project to play with -- driving an RGB LED strip that uses LPD8806 chips. I got a 5M (260 LEDs) strip and draped it across the ceiling. For starters, I just light it R, G & B, interspersed with Y, C & M, with a half second pause before changing colors. If I do it in PropForth it takes about 1.4 seconds for a color to propagate down the string. I'd like it to go much faster. Today I wrote the same thing in Spin and find it goes about a second faster.
Changing the Forth from 1 lshift to 2* makes a slightly visible difference, but what can I do to make it Really fast? Here's the PropForth:
fl

0 constant data
1 constant clock
264 constant numleds

: io
  data pinout
  led pinout
  clock pinout
;

: sendMSBfirst \ put byte on stack first
  8 0 do
    dup h80 and if 1 data px then
    1 clock px 0 clock px \ toggle clock
    0 data px
    \ 1 lshift
    2*
  loop
  drop \ the depleted old shifted word
;

: prime \ sends a zero byte
  0 sendMSBfirst
  50 delms
;

: magenta
  numleds 0 do
    hFF sendMSBfirst \ blue
    hFF sendMSBfirst \ red
    h80 sendMSBfirst \ green
  loop
;

: red
  numleds 0 do
    h80 sendMSBfirst \ blue
    hFF sendMSBfirst \ red
    h80 sendMSBfirst \ green
  loop
;

: yellow
  numleds 0 do
    h80 sendMSBfirst \ blue
    hFF sendMSBfirst \ red
    hFF sendMSBfirst \ green
  loop
;

: green
  numleds 0 do
    h80 sendMSBfirst \ blue
    h80 sendMSBfirst \ red
    hFF sendMSBfirst \ green
  loop
;

: cyan
  numleds 0 do
    hFF sendMSBfirst \ blue
    h80 sendMSBfirst \ red
    hFF sendMSBfirst \ green
  loop
;

: blue
  numleds 0 do
    hFF sendMSBfirst \ blue
    h80 sendMSBfirst \ red
    h80 sendMSBfirst \ green
  loop
;

: cycle
  begin
    prime magenta 500 delms
    prime red 500 delms
    prime yellow 500 delms
    prime green 500 delms
    prime cyan 500 delms
    prime blue 500 delms
  0 until
;
I do: io cycle to get it going
Here's the Spin to do the same thing:
{ ******** Propeller I/O assignments ***********
  P0  RGB data
  P1  RGB clock
}

CON
  _clkmode = xtal1 + pll16x      'Standard clock mode * crystal frequency = 80 MHz
  _xinfreq = 5_000_000

  numLEDs  = 264
  LEDdata  = 0
  LEDclock = 1

PUB startup
  dira[LEDdata]~~                ' LED strip data line
  outa[LEDdata]~
  dira[LEDclock]~~               ' LED strip clock line
  outa[LEDclock]~
  colorTest

PRI colorTest | i
  streamByte(0)
  repeat i from 1 to 3*numLEDs   ' turn off all LEDs
    streamByte( $80 )
  'waitcnt( clkfreq/60 + cnt )

  repeat                         ' forever
    streamByte(0)                ' primes LED strip to receive new data
    repeat i from 1 to numLEDs   ' magenta
      streamByte( $FF )          ' Blue . . . LED addressing order is Blue Red Green (not RGB)
      streamByte( $FF )          ' Red
      streamByte( $80 )          ' Green
    waitcnt( clkfreq/2 + cnt )   ' 1/2 second between color changes

    streamByte(0)
    repeat i from 1 to numLEDs   ' red
      streamByte( $80 )
      streamByte( $FF )
      streamByte( $80 )
    waitcnt( clkfreq/2 + cnt )

    streamByte(0)
    repeat i from 1 to numLEDs   ' yellow
      streamByte( $80 )
      streamByte( $FF )
      streamByte( $FF )
    waitcnt( clkfreq/2 + cnt )

    streamByte(0)
    repeat i from 1 to numLEDs   ' green
      streamByte( $80 )
      streamByte( $80 )
      streamByte( $FF )
    waitcnt( clkfreq/2 + cnt )

    streamByte(0)
    repeat i from 1 to numLEDs   ' cyan
      streamByte( $FF )
      streamByte( $80 )
      streamByte( $FF )
    waitcnt( clkfreq/2 + cnt )

    streamByte(0)
    repeat i from 1 to numLEDs   ' blue
      streamByte( $FF )
      streamByte( $80 )
      streamByte( $80 )          ' keep green on
    waitcnt( clkfreq/2 + cnt )

PRI streamByte(d)
  d ><= 8                        ' reverse bit order to shift out MSB first
  repeat 8
    if (d & 1) <> 0
      outa[LEDdata]~~
    outa[LEDclock]~~             ' high
    outa[LEDclock]~              ' low
    outa[LEDdata]~
    d >>= 1
Is Propeller assembler the only answer?
TIA
Don
Comments
You might want to try Tachyon Forth, which is supposed to be faster than PropForth, but I don't know how it compares to Spin. If speed is critical, I don't think there's a substitute for PASM, which can interface with Forth as easily as with Spin. You could just rewrite streamByte in PASM, as it looks the most time critical.
That's why I wrote Tachyon Forth. This same code, changed only slightly for TF, would run about 10 times faster than Spin, but if you use the SPIWR instructions in TF the whole loop for 264 LEDs should take around 3 ms (0.003 seconds): 32 bits take 10 us, times 264, plus overhead.
I used Spin to manipulate the pixels on the WS2801 strips on my hexacopter but the low level driver was written in PASM by Jon McPhalen (aka JonnyMac). There are links to the WS2801 driver in the video description.
While spin is interpreted at execution, forth is interpreted at compile time. Compile time is when you hit enter while typing in a colon definition. At execution time, forth is executing assembler. Because of the dictionary structure, it's mostly jumps down the dictionary until it reaches an "atom" word like + or C@, etc.
The rule of thumb is "Get the function written quickly using FORTH, then get the function executing quickly using assembler". After we get the function to run, if we need to optimize it, we can use the asm: and asm; words from the propforth.htm manual to factor out forth dictionary overhead, and whittle it down to just the opcodes we want. I have not had need of assembler optimization. You might ask Caskaz, he is the master of propforth assembler.
I have not used those parts. If Peter says 260 LEDs should take 0.003 seconds, there is probably a way to factor the high level forth so the whole string appears to change at the same instant.
Spin is compiled into bytecode which is "interpreted" at runtime, not much different from how Tachyon does it, rather than PropForth which interprets compiled Forth addresses.
However, when I said that TF would take 3 ms, I was of course not taking into account the 50 ms overhead for "priming". So here is the code; without the priming, all 264 LEDs take 2.66 ms to update!
Here's an optimized version that uses a table to lookup the next color.
Or you could write directly in C and avoid the spin2cpp conversion. This would provide access to multi-dimensional arrays, data structures, function pointers and other features that C provides.
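As a rough idea of what a table-driven version might look like in C, here is a minimal sketch that runs on a host PC rather than on the Propeller. The palette values and the blue/red/green byte order are taken from the original post; `Color`, `PALETTE`, `SendByte`, `Recorder`, `record_byte` and `fill_strip` are names made up for this illustration, and the byte sink only records what it is given instead of toggling pins.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_LEDS 264   /* strip length used in the original post */

/* One color in the order the strip expects: blue, red, green (not RGB). */
typedef struct { uint8_t blue, red, green; } Color;

/* In the posted code $80 is a channel that is off and $FF is fully on. */
static const Color PALETTE[] = {
    {0xFF, 0xFF, 0x80},   /* magenta */
    {0x80, 0xFF, 0x80},   /* red     */
    {0x80, 0xFF, 0xFF},   /* yellow  */
    {0x80, 0x80, 0xFF},   /* green   */
    {0xFF, 0x80, 0xFF},   /* cyan    */
    {0xFF, 0x80, 0x80},   /* blue    */
};
#define PALETTE_SIZE (sizeof PALETTE / sizeof PALETTE[0])

/* Hypothetical byte sink: on real hardware this would shift the byte out. */
typedef void (*SendByte)(uint8_t byte, void *ctx);

/* Host-side stand-in for the hardware: counts bytes and keeps the last one. */
typedef struct { size_t count; uint8_t last; } Recorder;

static void record_byte(uint8_t byte, void *ctx)
{
    Recorder *r = (Recorder *)ctx;
    r->count++;
    r->last = byte;
}

/* Send the priming zero byte, then one palette entry to every LED.
   Returns the number of bytes handed to the sink. */
static size_t fill_strip(size_t color_index, SendByte send, void *ctx)
{
    const Color c = PALETTE[color_index % PALETTE_SIZE];
    size_t sent = 0;

    send(0x00, ctx);      /* priming byte, as in the original code */
    sent++;
    for (int i = 0; i < NUM_LEDS; i++) {
        send(c.blue, ctx);
        send(c.red, ctx);
        send(c.green, ctx);
        sent += 3;
    }
    return sent;
}
```

Calling `fill_strip(n, record_byte, &recorder)` for n = 0..5 walks through the same six colors as the posted loops, so the six near-identical blocks collapse into one loop over the table. The delays between colors are left out of the sketch.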
But as you may have read above, there are a lot of people looking at different ways to outperform SPIN, and you don't have to just go directly to PASM.
I needed short timing delays and I found that I could easily get a nice accurate Forth word in pfth down to about 25 microseconds. Faster than that, I guess it would be PASM. I don't think Tachyon could do significantly better. I found PropForth a bit too complex for my own use.
Are we about to start another Forth on Propeller debate... all three do have specific interesting features that make each unique and useful in different ways.
In both cases when it comes to speed they give up and use PASM. Correct me if I am wrong.
In the modern world the kids are doing similar things with JavaScript or Python on their micro-controllers. Again when the going gets tough it's back to dipping down into C or assembler.
The Propeller is a bit of a special case. It has multiple independent processors so you can run your time critical PASM in one of them whilst loafing around in your user friendly Spin or Forth or whatever in another. This is not so easy, or even possible, on single processor machines.
I just typed a long reply and got an expired token, don't you hate that.
Tachyon Forth is much faster than most of you think, it seems, as 25 us is enough time for TF to transmit 80 bits serially or perform 55 FOR NEXT loops etc. In fact PASM itself couldn't do too much better in this task of updating LEDs and I should know, I couldn't do much better unless I handcrafted special PASM code using the counters.
Shall I throw down the gauntlet in this regard and challenge another HLL to produce code that can run faster than what I can in Tachyon Forth? I'm very confident that that would not be an easy thing at all. Let those who think they can then present their case on the field of combat with their HLL implementation, and then let all HeLL break loose. Remember, this is not some mere benchmark, the TF implementation takes 2.66 ms to update 264 RGB LEDs serially, not counting the 50 ms priming that the LEDs need.
Challenge accepted.
Now, what are the rules? I presume all entrants must be written purely in the HLL of choice. No assembler sneaked in there.
Anyway, in the propgcc sources you will find my version of FullDuplexSerial entirely written in C that can transmit and receive at the same time at 115200 baud.
If that does not take the prize I also enter my 1024 point fixed point FFT, again entirely written in C. I forget how it performed now, was it 20 or 30 transforms per second. Faster if compiled with the OMP options and allowed to spread itself over two or four COGS.
I do know C and in fact I started coding in Forth around the same time as I did C. But C is all about the libraries and I'd rather have my own "libraries" and interactive program development and O/S on the Prop.
@Heater: I'm talking about this LED demo, a typical task that we would put the Propeller to use on. How well and how efficiently can code be written in an HLL to write to these LEDs? PropForth came in at 1.4 seconds, Spin at 0.4 seconds, Tachyon at 2.66 ms + the enforced 50 ms priming.
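To put those figures on a common footing, the arithmetic can be sketched in C as a time per bit. This assumes 264 LEDs at three bytes (24 bits) each, as in the posted code, and the 80 MHz clock from the Spin listing; whether the Tachyon measurement was taken at the same clock is an assumption. The function names are made up for this illustration.

```c
#define NUM_LEDS     264
#define BITS_PER_LED 24    /* three bytes per LED: blue, red, green */

/* Microseconds spent per bit, after subtracting any fixed priming delay. */
static double us_per_bit(double total_seconds, double priming_seconds)
{
    double bits = (double)NUM_LEDS * BITS_PER_LED;   /* 6336 bits per update */
    return (total_seconds - priming_seconds) * 1e6 / bits;
}

/* System clock cycles available per bit at a given clock frequency. */
static double cycles_per_bit(double microseconds_per_bit, double clock_hz)
{
    return microseconds_per_bit * 1e-6 * clock_hz;
}

/* How many times faster the second measurement is than the first. */
static double speedup(double slow_seconds, double fast_seconds)
{
    return slow_seconds / fast_seconds;
}
```

With the 50 ms priming removed, `us_per_bit(1.4, 0.05)` gives roughly 213 us per bit for PropForth and `us_per_bit(0.4, 0.05)` roughly 55 us for Spin, while `us_per_bit(0.00266, 0.0)` gives about 0.42 us for Tachyon, or around 34 clock cycles per bit at 80 MHz. `speedup(0.4, 0.00266)` comes out near 150, and `speedup(0.35, 0.00266)` near 130 once the priming delay is taken off the Spin figure.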
OK, sorry I did not read the small print, seems you have an actual driver in mind that drives 264 RGB LEDs in some unspecified way.
I guess you win the challenge by default then. I don't have 264 RGB LEDS to drive nor the will to write such code.
Your figures indicate that TF is 150 times faster than Spin. Sort of what we could expect for a compiled language over a bytecode interpreter.
What is this "enforced 50ms priming" thing?
This is the key point. The purpose of forth is to code interactively. IF you like coding interactively, you probably find you can CODE FASTER in forth. Execution speed is a separate issue, addressed AFTER the code is working.
The cost of this rapid development environment is that forth has to pop down the linked list through the forth dictionary, until it gets to the assembly code "atom" words.
As previously stated, if the code is working, and you want it faster, we have several methods of optimizing.
Words like pinhi and pinlo are inherently slow; they do a whole bunch of stuff every time they are called. Doing several slow words has little significance until you do them over and over in a loop. Then start nesting loops, and this compounds the impact.
There are other words where we just set the mask, and twiddle the bits. The words that do "just the bit banging" are fast. Refactor using those words to get improvement.
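The thread does not name the faster PropForth words, so here is the general idea as a C sketch that runs on a host PC: compute the pin masks once, then do nothing inside the loop except OR and AND them into the output register. `fake_outa` stands in for the Propeller's OUTA register, and `received` models a device that samples the data line on each rising clock edge; both, along with the function names, are made up for this illustration.

```c
#include <stdint.h>

#define DATA_PIN  0
#define CLOCK_PIN 1

/* Masks are computed once, not on every pin access. */
static const uint32_t DATA_MASK  = 1u << DATA_PIN;
static const uint32_t CLOCK_MASK = 1u << CLOCK_PIN;

static uint32_t fake_outa;      /* host stand-in for the OUTA register */
static uint8_t  received;       /* what a listening device would shift in */
static unsigned clock_pulses;   /* number of rising clock edges generated */

static void clock_high(void)
{
    fake_outa |= CLOCK_MASK;
    /* Model the receiver: sample the data line on the rising edge. */
    received = (uint8_t)((received << 1) | ((fake_outa & DATA_MASK) ? 1u : 0u));
    clock_pulses++;
}

static void clock_low(void)
{
    fake_outa &= ~CLOCK_MASK;
}

/* Shift one byte out, most significant bit first. */
static void send_msb_first(uint8_t byte)
{
    for (int bit = 0; bit < 8; bit++) {
        if (byte & 0x80)
            fake_outa |= DATA_MASK;     /* data high for a 1 bit */
        else
            fake_outa &= ~DATA_MASK;    /* data low for a 0 bit */
        clock_high();
        clock_low();
        byte = (uint8_t)(byte << 1);    /* bring the next bit into position */
    }
    fake_outa &= ~DATA_MASK;            /* leave the data line low */
}
```

After `send_msb_first(0xA5)` the modelled receiver holds 0xA5, eight clock pulses have been generated and both lines are back low. The inner loop touches only the precomputed masks, which is the property the faster words are described as having.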
Sal says that usually when he starts optimizing the high level forth, it often ends up TOO fast, and the stream can overrun a given device. At which point he backs off or adds flow control or protocol etc., to balance it out. So it could be possible to sufficiently optimize the LED routine with just high level forth.
Regarding a challenge for fastest code: The speed of the prop is the limiting factor, and any language will approach that limit. At what point are we sufficiently close? When it works. Sal would not produce code that's any faster than Peter's. Sal would write the code fast enough to be sufficient for the application, then move on.
Dave suggests I take a look at C to appreciate that it can be fast. Done. I already appreciate that C is fast, runs as a program on a workstation, and has been optimized by experts over years of development. Nowadays C seems bloated and not much fun, and there are experts who know their way around the bloat. I don't care to be one of those. That's just me.
I don't actually have LED hardware and I don't need to, I just took the original code that Don did in PropForth and wrote the equivalent in TF and it's all there in the thread. I just went and looked up the datasheet for the
Now TF is a bytecode interpreter, just like Spin, but it's the implementation that's different plus my bytecodes directly index into cog code. So a bytecode of $41 points directly to the code at address $41 for the + operation. The inner "interpreter" loop is no more than 3 PASM instructions so it's very fast. So taking away the 50 ms that is added for "priming" the chip then we have 2.66ms for 264 LEDs for TF vs 350ms (400-50) for Spin, so that makes TF 130 times faster than Spin in this particular operation.
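The dispatch idea described here, where the bytecode value itself selects the code to run, can be modelled in C as follows. This is only a sketch of the general technique and not Tachyon's implementation: portable C cannot jump to a numeric address, so a table of function pointers stands in for cog memory, and the four opcodes, the `VM` struct and the function names are invented for the example. There are no bounds checks on the stack or the opcode.

```c
#include <stdint.h>

#define STACK_DEPTH 16

typedef struct {
    int32_t stack[STACK_DEPTH];
    int sp;                /* number of items currently on the stack */
    const uint8_t *ip;     /* next bytecode to fetch */
    int running;
} VM;

typedef void (*Handler)(VM *);

/* Invented opcodes: the numeric value is the index of its handler. */
enum { OP_HALT = 0, OP_LIT8 = 1, OP_ADD = 2, OP_DUP = 3, OP_COUNT };

static void op_halt(VM *vm) { vm->running = 0; }
static void op_lit8(VM *vm) { vm->stack[vm->sp++] = *vm->ip++; }   /* push next byte */
static void op_add(VM *vm)  { vm->sp--; vm->stack[vm->sp - 1] += vm->stack[vm->sp]; }
static void op_dup(VM *vm)  { vm->stack[vm->sp] = vm->stack[vm->sp - 1]; vm->sp++; }

static const Handler HANDLERS[OP_COUNT] = { op_halt, op_lit8, op_add, op_dup };

/* Run a bytecode program and return the top of stack (0 if empty). */
static int32_t run(const uint8_t *program)
{
    VM vm = { .sp = 0, .ip = program, .running = 1 };

    while (vm.running) {
        uint8_t op = *vm.ip++;    /* fetch the next bytecode */
        HANDLERS[op](&vm);         /* no decoding: the value picks the code */
    }
    return vm.sp > 0 ? vm.stack[vm.sp - 1] : 0;
}
```

The loop body is just a fetch and an indexed call, which is the C analogue of the short inner loop described above. For example, the program `{OP_LIT8, 20, OP_DUP, OP_ADD, OP_LIT8, 2, OP_ADD, OP_HALT}` leaves 42 on top of the stack.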
@Prof: I don't quite buy the "too fast" bit at all.
I tend to agree with Dave, it helps to become more versatile and have several options. Just the ability to read C is very useful in studying programming examples that are widely available. And C is often used in collaborative programming, where you work as part of a team to be more productive and to share knowledge. Forth can be a bit difficult to collaborate on with others, as the dictionary can very quickly take on a personalized style.
I don't understand how or why GCC is supposed to be faster than SPIN, but Jazzed feels confident that it can be. This may only be true in modes that use byte code, but I have a lot more to learn about GCC.
===
I do have to admit my own dislike for languages that have huge documentation. It doesn't matter whether it is Java, C++, or whatever --- when the books get near to 1000 pages or more, I just feel that the text is not entry level for new users. Somebody really needs to realize that new users would like something less than 300 pages that is informative and provides overview, rather than the whole language in one tome. And printing the complete reference in 3 or 4 volumes, rather than one huge one is likely to be more warmly accepted.
Yes, since you did say "TOO fast" then what is the external device(s) that can't keep up?
There has been a lot of software that misbehaves when things happen with a timing it does not expect. That's generally regarded as bad design.
I am waiting for one of you to refer to me as the 'external device'.
The computer, like the automobile, the airplane, and the rocket ship, is yet another love affair with speed. It just has the added twist of trying to get smaller and smaller.
Do we all agree that there are lots of choices faster than SPIN?
Err...well...pretty much anything is faster than Spin.
However, Spin is still the master of cramming the most functionality into the Propeller in the simplest way.
I ran the following code under pfth, and it took 101,952 cycles.
I'm offended that you think I might be using IE on this Win 7 machine:)
That was Chrome showing the demented greek.
This Firefox displays something a bit better. But it's Forth so I might as well have stayed with the demented greek thing for all the sense that it makes:)
It has something to do with zooming. I use Comodo Dragon (Chrome Clone). With zoom 150% and zoom 175% it is bad. Below and above it is good.
Wow, yes you are right. Small, OK. Big, OK. Normal readable size, gibberish. God I hate Windows, or Chrome, or both.
Thanks.