Speed of PropForth vs Spin

Don Pomplun · 2014-07-05 15:44

Just getting back into Forth and had a purely hardware project to play with -- driving an RGB LED strip that uses LPD8806 chips. I got a 5M (260 LEDs) strip and draped it across the ceiling. For starters, I just light it R, G & B, interspersed with Y, C & M, with a half second pause before changing colors. If I do it in PropForth it takes about 1.4 seconds for a color to propagate down the string. I'd like it to go much faster. Today I wrote the same thing in Spin and find it goes about a second faster.
Changing the Forth from 1 lshift to 2* makes a slightly visible difference, but what can I do to make it Really fast? Here's the PropForth:

fl
0 constant data
1 constant clock
264 constant numleds

: io   
data pinout
led pinout
clock pinout ;

: sendMSBfirst \ put byte on stack first
8
0
do
 dup
 h80
 and
 if 1 data px then
 1 clock px 0 clock px \ toggle clock
 0 data px
\ 1 lshift
 2*
loop
drop \ the depleted old shifted word
;

: prime \ sends a zero byte
0 sendMSBfirst
50 delms
;


: magenta
numleds
0
do
 hFF sendMSBfirst \ blue
 hFF sendMSBfirst \ red
 h80 sendMSBfirst \ green
loop
;

: red
numleds
0
do
 h80 sendMSBfirst \ blue
 hFF sendMSBfirst \ red
 h80 sendMSBfirst \ green
loop
;

: yellow
numleds
0
do
 h80 sendMSBfirst \ blue
 hFF sendMSBfirst \ red
 hFF sendMSBfirst \ green
loop
;

: green
numleds
0
do
 h80 sendMSBfirst \ blue
 h80 sendMSBfirst \ red
 hFF sendMSBfirst \ green
loop
;

: cyan
numleds
0
do
 hFF sendMSBfirst \ blue
 h80 sendMSBfirst \ red
 hFF sendMSBfirst \ green
loop
;

: blue
numleds
0
do
 hFF sendMSBfirst \ blue
 h80 sendMSBfirst \ red
 h80 sendMSBfirst \ green
loop
;

: cycle
begin
 prime
 magenta
 500 delms
 prime
 red
 500 delms
 prime
 yellow
 500 delms
 prime
 green
 500 delms
 prime
 cyan
 500 delms
 prime
 blue
 500 delms
 0
until
;

I do: io cycle to get it going

Here's the Spin to do the same thing:

{   

********  Propeller I/O assignments   ***********

P0   RGB data
P1   RGB clock


}

CON
  _clkmode = xtal1  + pll16x    'Standard clock mode * crystal frequency = 80 MHz
  _xinfreq = 5_000_000

  numLEDs  =  264
  LEDdata = 0
  LEDclock = 1

PUB startup 

dira[LEDdata]~~  '  LED strip data line
outa[LEDdata]~  
dira[LEDclock]~~  '  LED strip clock line
outa[LEDclock]~
colorTest



PRI colorTest | i

streamByte(0)
repeat i from 1 to 3*numLEDs   ' turn off all LEDs  
  streamByte( $80 )
  'waitcnt( clkfreq/60 + cnt )  

  
repeat ' forever

    streamByte(0) '  primes LED strip to receive new data
  
    repeat i from 1 to numLEDs   ' magenta
      streamByte( $FF ) ' Blue . . . LED addressing order is Blue Red Green (not RGB)
      streamByte( $FF ) ' Red
      streamByte( $80 ) ' Green
      
    waitcnt( clkfreq/2 + cnt ) '  1/2 second between color changes 
    streamByte(0)
    
    repeat i from 1 to numLEDs   '  red
      streamByte( $80 ) 
      streamByte( $FF )  
      streamByte( $80 )
      
    waitcnt( clkfreq/2 + cnt )  
    streamByte(0)

    repeat i from 1 to numLEDs   '  yellow
      streamByte( $80 ) 
      streamByte( $FF )
      streamByte( $FF )
      
    waitcnt( clkfreq/2 + cnt )  
    streamByte(0)
   
    repeat i from 1 to numLEDs  '  green
      streamByte( $80 ) 
      streamByte( $80 )  
      streamByte( $FF )
      
    waitcnt( clkfreq/2 + cnt )  
    streamByte(0)

    repeat i from 1 to numLEDs  '  cyan
      streamByte( $FF ) 
      streamByte( $80) 
      streamByte( $FF )
      
    waitcnt( clkfreq/2 + cnt )  
    streamByte(0)

    repeat i from 1 to numLEDs  '  blue
      streamByte( $FF ) 
      streamByte( $80 )  
      streamByte( $80 ) ' keep green on
      
    waitcnt( clkfreq/2 + cnt )  

    
  

PRI streamByte(d)

d ><= 8  '  reverse bit order to shift out MSB first

repeat 8
  if (d & 1) <> 0 
    outa[LEDdata]~~
   
  outa[LEDclock]~~   ' high
  outa[LEDclock]~    ' low
  
  outa[LEDdata]~

  d >>= 1

Is Propeller assembler the only answer?
TIA
Don

Martin_H · 2014-07-05 16:04

I've also noticed that Spin is faster than Propforth. I think it's architectural and not your code. The problem is that all languages other than PASM, cog model C, and LMM C are running on a VM that is itself running in PASM. Spin is compiled VM bytecode, while Propforth is a VM running an interpreter. Honestly, an interpreter running at 1/3 the speed of compiled code is actually pretty fast, as most are 1/10 as fast. That said, best-in-class Forths used to run at around 80 percent of the speed of compiled code back in the '80s.

You might want to try Tachyon Forth which is supposed to be faster than Propforth, but I don't know how it compares to Spin. If speed is critical, I don't think there's a substitute for PASM which can interface with Forth as easily as Spin. You could just rewrite streamByte in PASM as it looks the most time critical.

Peter Jakacki · 2014-07-05 16:07

Don Pomplun wrote: »

Just getting back into Forth and had a purely hardware project to play with -- driving an RGB LED strip that uses LPD8806 chips. I got a 5M (260 LEDs) strip and draped it across the ceiling. For starters, I just light it R, G & B, interspersed with Y, C & M, with a half second pause before changing colors. If I do it in PropForth it takes about 1.4 seconds for a color to propagate down the string. I'd like it to go much faster. Today I wrote the same thing in Spin and find it goes about a second faster.
Changing the Forth from 1 lshift to 2* makes a slightly visible difference, but what can I do to make it Really fast? Here's the PropForth:

Is Propeller assembler the only answer?
TIA
Don

That's why I wrote Tachyon Forth. This same code could be changed slightly for TF and run about 10 times faster than Spin, but you will find that by using the SPIWR instruction in TF the whole loop for 264 LEDs should take around 3ms (0.003 seconds). 32 bits take 10us, so that's 10us * 264 + overhead.

Duane Degn · 2014-07-05 16:18

You might want to check out the WS2801 objects and see if they can be modified to drive the LPD8806 (I understand the protocols have a lot of similarities).

I used Spin to manipulate the pixels on the WS2801 strips on my hexacopter but the low level driver was written in PASM by Jon McPhalen (aka JonnyMac). There are links to the WS2801 driver in the video description.

prof_braino · 2014-07-05 18:26

Some stuff will be faster in spin, some stuff will be faster in forth. Less than optimal code will be slower using either.

While spin is interpreted at execution, forth is interpreted at compile time. Compile time is when you hit enter while typing in a colon definition. At execution time, forth is executing assembler. Because of the dictionary structure, it's mostly jumps down the dictionary until it reaches an "atom" word like + or C@, etc.

The rule of thumb is "Get the function written quickly using FORTH, then get the function executing quickly using assembler". After we get the function to run, if we need to optimize it, we can use the asm: and asm; words from the propforth.htm manual to factor out forth dictionary overhead, and whittle it down to just the opcodes we want. I have not had need of assembler optimization. You might ask Caskaz, he is the master of propforth assembler.

I have not used those parts. If Peter says 260 LEDs should take 0.003 seconds, there is probably a way to factor the high level forth so the whole string appears to change at the same instant.

Peter Jakacki · 2014-07-05 19:01

prof_braino wrote: »

Some stuff will be faster in spin, some stuff will be faster in forth. Less than optimal code will be slower using either.

While spin is interpreted at execution, forth is interpreted at compile time. Compile time is when you hit enter while typing in a colon definition. At execution time, forth is executing assembler. Because of the dictionary structure, it's mostly jumps down the dictionary until it reaches an "atom" word like + or C@, etc.

The rule of thumb is "Get the function written quickly using FORTH, then get the function executing quickly using assembler". After we get the function to run, if we need to optimize it, we can use the asm: and asm; words from the propforth.htm manual to factor out forth dictionary overhead, and whittle it down to just the opcodes we want. I have not had need of assembler optimization. You might ask Caskaz, he is the master of propforth assembler.

I have not used those parts. If Peter says 260 LEDs should take 0.003 seconds, there is probably a way to factor the high level forth so the whole string appears to change at the same instant.

Spin is compiled into bytecode which is "interpreted" at runtime, not much different from how Tachyon does it, rather than PropForth which interprets compiled Forth addresses.

However, when I said that TF would take 3ms, I was of course not taking into account the 50 ms overhead for "priming". Here is the code; without the priming, all 264 LEDs take 2.66ms to update!

DECIMAL
264    == numleds

: LEDS ( rgb -- )    
     SPIWR 50 ms                \ Prime it: send the MSB (00) first, then wait 50 ms
     numleds FOR                 \ do all LEDs
       SPIWR SPIWR SPIWR 8 ROL   \ shift out 3 lots of 8 bits while preserving the rgb info
     NEXT 
     DROP                        \ discard the rgb info
     ;

: MAGENTA      $FFFF80 LEDS ;
: RED          $80FF80 LEDS ;
: YELLOW       $80FFFF LEDS ;
: GREEN        $8080FF LEDS ;
: CYAN         $FF80FF LEDS ;
: BLUE         $FF8080 LEDS ;

: DEMO
   00:01 SPIPINS    \ setup SPIWR instruction pins miso:mosi:clk  - also sets pins low
   BEGIN
     MAGENTA 500 ms
     RED 500 ms
     YELLOW 500 ms
     GREEN 500 ms
     CYAN 500 ms
     BLUE 500 ms
   AGAIN
   ;


{ Tests - which includes the 50ms priming
LAP MAGENTA LAP .LAP  52.66ms ok  
}

Here's an optimized version that uses a table to lookup the next color.

DECIMAL
: LEDS ( rgb -- )        
         SPIWR 50 ms                           \ Prime it: send the MSB (00) first, then wait 50 ms
         264 FOR SPIWR SPIWR SPIWR 8 ROL NEXT  \ shift out 3 lots of 8 bits while preserving the rgb info
         DROP                               \ discard the rgb info
         ;
TABLE colors   $FFFF80 , $80FF80 , $80FFFF , $8080FF , $FF80FF , $FF8080 , 0 ,
: DEMO   00:01 SPIPINS colors BEGIN DUP @ ?DUP IF LEDS 500 ms 4 + ELSE DROP colors THEN ESC? UNTIL ;

Dave Hein · 2014-07-06 06:04

In the past Spin was only compiled into bytecodes. However, with the spin2cpp utility it can now be compiled into PASM code that runs directly from cog memory, or into LMM PASM that runs from hub memory, or even XMM PASM that resides in external memory. Spin that is compiled into PASM will probably be as fast as, or even faster than, any other code written for the Prop, except for hand-coded assembly.

Or you could write directly in C and avoid the spin2cpp conversion. This would provide access to multi-dimensional arrays, data structures, function pointers and other features that C provides.

LoopyByteloose · 2014-07-06 06:11

I found both Tachyon and pfth seem to run faster than SPIN.

But as you may have read above, there are a lot of people looking at different ways to outperform SPIN, and you don't have to just go directly to PASM.

I needed short timing delays and I found that I could easily get a nice accurate Forth word in pfth down to about 25 microseconds. Faster than that, I guess it would be PASM. I don't think Tachyon could do significantly better. I found PropForth a bit too complex for my own use.

Are we about to start another Forth on Propeller debate... all three do have specific interesting features that make each unique and useful in different ways.

Heater. · 2014-07-06 06:30

As far as I can tell the speed of Forth or Spin is somewhat irrelevant. They are used because one likes a simple programming language and environment. Performance is not everything.

In both cases when it comes to speed they give up and use PASM. Correct me if I am wrong.

In the modern world the kids are doing similar things with JavaScript or Python on their micro-controllers. Again when the going gets tough it's back to dipping down into C or assembler.

The Propeller is a bit of a special case. It has multiple independent processors so you can run your time critical PASM in one of them whilst loafing around in your user friendly Spin or Forth or whatever in another. This is not so easy, or even possible, on single processor machines.

Peter Jakacki · 2014-07-06 06:36

Loopy Byteloose wrote: »

I found both Tachyon and pfth seem to run faster than SPIN.

But as you may have read above, there are a lot of people looking at different ways to outperform SPIN, and you don't have to just go directly to PASM.

I needed short timing delays and I found that I could easily get a nice accurate Forth word in pfth down to about 25 microseconds. Faster than that, I guess it would be PASM. I don't think Tachyon could do significantly better. I found PropForth a bit too complex for my own use.

Are we about to start another Forth on Propeller debate... all three do have specific interesting features that make each unique and useful in different ways.

I just typed a long reply and got an expired token, don't you hate that.

Tachyon Forth is much faster than most of you seem to think: 25us is enough time for TF to transmit 80 bits serially or perform 55 FOR NEXT loops, etc. In fact, PASM itself couldn't do much better in this task of updating LEDs, and I should know; I couldn't do much better unless I handcrafted special PASM code using the counters.

Shall I throw down the gauntlet in this regard and challenge another HLL language to produce code that can run faster than what I can in Tachyon Forth? I'm very confident that that would not be an easy thing at all. Let those who think they can then present their case on the field of combat with their HLL implementation, and then let all HeLL break loose. Remember, this is not some mere benchmark: the TF implementation takes 2.66 ms to update 264 RGB LEDs serially, not counting the 50ms priming that the LEDs need.

Heater. · 2014-07-06 06:44

Peter,

Shall I throw down the gauntlet in this regard and challenge another HLL language to produce code that can run faster than what I can in Tachyon Forth?

Challenge accepted.

Now, what are the rules? I presume all entrants must be written purely in the HLL of choice. No assembler sneaked in there.

Anyway, in the propgcc sources you will find my version of FullDuplexSerial entirely written in C that can transmit and receive at the same time at 115200 baud.

If that does not take the prize I also enter my 1024 point fixed point FFT, again entirely written in C. I forget how it performed now, was it 20 or 30 transforms per second. Faster if compiled with the OMP options and allowed to spread itself over two or four COGS.

Dave Hein · 2014-07-06 06:49

Peter, I suggest that you try C, and you'll realize that it is possible to write code that is just as fast as, or faster than, any Forth implementation on the Prop. Personally, I spent several months coding in Forth, and writing a few Forth interpreters so I could understand the pros and cons of Forth. I've suggested to Doug that he do the same with C. Maybe you can give C a try so you have a better understanding of the tradeoffs in programming in C on the Prop.

Peter Jakacki · 2014-07-06 06:56

Dave Hein wrote: »

Peter, I suggest that you try C, and you'll realize that it is possible to write code that is just as fast as, or faster than, any Forth implementation on the Prop. Personally, I spent several months coding in Forth, and writing a few Forth interpreters so I could understand the pros and cons of Forth. I've suggested to Doug that he do the same with C. Maybe you can give C a try so you have a better understanding of the tradeoffs in programming in C on the Prop.

I do know C and in fact I started coding in Forth around the same time as I did C. But C is all about the libraries and I'd rather have my own "libraries" and interactive program development and O/S on the Prop.

@Heater: I'm talking about this LED demo, a typical task that we would put the Propeller to use on. How well and how efficiently can code be written in HLL to write to these LEDs. PropForth came in at 1.4 seconds, Spin at 0.4 seconds, Tachyon at 2.66ms + the enforced 50ms priming.

Heater. · 2014-07-06 08:19

Peter,

OK, sorry I did not read the small print, seems you have an actual driver in mind that drives 264 RGB LEDs in some unspecified way.

I guess you win the challenge by default then. I don't have 264 RGB LEDS to drive nor the will to write such code.

Your figures indicate that TF is 150 times faster than Spin. Sort of what we could expect for a compiled language over a bytecode interpreter.

What is this "enforced 50ms priming" thing?

prof_braino · 2014-07-06 08:23

Heater. wrote: »

As far as I can tell the speed of Forth or Spin is somewhat irrelevant.

This is the key point. The purpose of forth is to code interactively. IF you like coding interactively, you probably find you can CODE FASTER in forth. Execution speed is a separate issue, addressed AFTER the code is working.

The cost of this rapid development environment is that forth has to pop down the linked list through the forth dictionary, until it gets to the assembly code "atom" words.

As previously stated, if the code is working, and you want it faster, we have several methods of optimizing.

Words like pinhi and pinlo are inherently slow; they do a whole bunch of stuff every time they are called. If you do several slow words, it has little significance until you do these over and over in a loop. Then start nesting loops, and this compounds the impact.

There are other words where we just set the mask, and twiddle the bits. The words that do "just the bit banging" are fast. Refactor using those words to get improvement.

Sal says that usually when he starts optimizing the high level forth, it often ends up TOO fast, and the stream can overrun a given device. At which point he backs off or adds flow control or protocol etc, to balance it out. So it could be possible to sufficiently optimize the LED routine with just high level forth.

Regarding a challenge for fastest code: The speed of the prop is the limiting factor, and any language will approach that limit. At what point are we sufficiently close? When it works. Sal would not produce code that's any faster than Peter's. Sal would write the code fast enough to be sufficient for the application, then move on.

Dave suggests I take a look at C to appreciate that it can be fast. Done. I already appreciate that C is fast, runs as a program on a workstation, and has been optimized by experts over years of development. Nowadays C seems bloated and not much fun, and there are experts that know their way around the bloat. I don't care to be one of those. That's just me.

Peter Jakacki · 2014-07-06 08:37

Heater. wrote: »

Peter,

OK, sorry I did not read the small print, seems you have an actual driver in mind that drives 264 RGB LEDs in some unspecified way.

I guess you win the challenge by default then. I don't have 264 RGB LEDS to drive nor the will to write such code.

Your figures indicate that TF is 150 times faster than Spin. Sort of what we could expect for a compiled language over a bytecode interpreter.

What is this "enforced 50ms priming" thing?

I don't actually have LED hardware and I don't need to; I just took the original code that Don did in PropForth and wrote the equivalent in TF, and it's all there in the thread. I just went and looked up the datasheet for the LPD8806 that Don is using and I'm not sure where the 50ms figure comes from; I just blast out the bits correctly. Here is the link for the Arduino C code, but that uses hardware SPI by the look of it.

Now TF is a bytecode interpreter, just like Spin, but it's the implementation that's different plus my bytecodes directly index into cog code. So a bytecode of $41 points directly to the code at address $41 for the + operation. The inner "interpreter" loop is no more than 3 PASM instructions so it's very fast. So taking away the 50 ms that is added for "priming" the chip then we have 2.66ms for 264 LEDs for TF vs 350ms (400-50) for Spin, so that makes TF 130 times faster than Spin in this particular operation.

@Prof: I don't quite buy the "too fast" bit at all.

Heater. · 2014-07-06 08:48

Peter,

...bytecodes directly index into cog code. So a bytecode of $41 points directly to the code at address $41 for the + operation.

Interesting. Where do the byte codes live, I presume in HUB RAM?

prof_braino · 2014-07-06 09:04

"Too fast" means "faster than the external device can handle". Is there a question?

Dave Hein · 2014-07-06 20:02

prof_braino wrote: »

Dave suggests I take a look at C to appreciate that it can be fast. Done. I already appreciate that C is fast, runs as a program on a workstation, and has been optimized by experts over years of development. Nowadays C seems bloated and not much fun, and there are experts that know their way around the bloat. I don't care to be one of those. That's just me.

Doug, if you really took a serious look at C you would realize that the code doesn't need to be bloated. Try to open up your horizons and program a few of your favorite Forth programs in C. Ask the forum for assistance. It's hard to take anybody seriously when they always suggest that Forth is the best approach to a problem. And I'm not just singling you or Peter out -- I feel the same about someone who would say that C is the only way to go. Clearly there are some things that Forth does best, and other things that C does best.

LoopyByteloose · 2014-07-06 21:42

While I tend to like Forth and actually prefer it for most things, learning C has offered a lot of new insights to me. .. and more than a few pleasant surprises.

I tend to agree with Dave, it helps to become more versatile and have several options. Just the ability to read C is very useful in studying programming examples that are widely available. And C is often used in collaborative programming, where you work as part of a team to be more productive and to share knowledge. Forth can be a bit difficult to collaborate on with others, as the dictionary can very quickly take on a personalized style.

I don't understand how or why GCC is supposed to be faster than SPIN, but Jazzed feels confident that it can be. This may only be true in modes that use byte code, but I have a lot more to learn about GCC.

===
I do have to admit my own dislike for languages that have huge documentation. It doesn't matter whether it is Java, C++, or whatever --- when the books get near to 1000 pages or more, I just feel that the text is not entry level for new users. Somebody really needs to realize that new users would like something less than 300 pages that is informative and provides overview, rather than the whole language in one tome. And printing the complete reference in 3 or 4 volumes, rather than one huge one is likely to be more warmly accepted.

Peter Jakacki · 2014-07-07 06:49

prof_braino wrote: »

"Too fast" means "faster than the external device can handle". Is there a question?

Yes, since you did say "TOO fast" then what is the external device(s) that can't keep up?

Heater. · 2014-07-07 07:02

Never in the history of computing has there been a computer that was "too fast". I think that would violate some fundamental physical law!

There has been a lot of software that misbehaves when things happen with a timing it does not expect. That's generally regarded as bad design.

LoopyByteloose · 2014-07-07 10:46

Peter Jakacki wrote: »

Yes, since you did say "TOO fast" then what is the external device(s) that can't keep up?

I am waiting for one of you to refer to me as the 'external device'.

The computer, like the automobile, the airplane, and the rocket ship, is yet another love affair with speed. It just has the added twist of trying to get smaller and smaller.

Do we all agree that there are lots of choices faster than SPIN?

Heater. · 2014-07-07 11:05

Loopy,

Do we all agree that there are lots of choices faster than SPIN?

Err...well...pretty much anything is faster than Spin.

However, Spin is still the master of cramming the most functionality into the Propeller in the simplest way.

Dave Hein · 2014-07-07 11:09

The following program takes 141,008 cycles in Spin, and 2,032 cycles in C. It would be interesting to see how long it takes in some of the other languages on the Prop. I'll try it in pfth when I have a chance.

#include <stdio.h>
#include <propeller.h>

int sum(int n)
{
    int i;
    int j = 0;
    for (i = 0; i <= n; i++) j += i;
    return j;
}

int main()
{
    int i, cycles;

    cycles = CNT;
    i = sum(100);
    cycles = CNT - cycles;

    printf("sum(100) = %d, %d cycles\n", i, cycles);
    return 0;
}

Heater. · 2014-07-07 11:15

Why do code snippets look like demented greek here on this Windows 7 laptop?

Dave Hein · 2014-07-07 11:31

Code snippets look fine on my Windows 7 system. It must be something in your configuration. Try a different browser. Whenever I get mangled stuff in IE I use FireFox instead, and it usually works OK.

I ran the following code under pfth, and it took 101,952 cycles.

: sum ( n -- result ) 1+ 0 swap over do i + loop ;
: main cnt@ 100 sum cnt@ rot - swap ." sum(100) = " . ." , " . ." cycles" ;

Heater. · 2014-07-07 11:40

Dave,

I'm offended that you think I might be using IE on this Win 7 machine:)

That was Chrome showing the demented greek.

This Firefox displays something a bit better. But it's Forth so I might as well have stayed with the demented greek thing for all the sense that it makes:)

dnalor · 2014-07-07 11:58

Heater. wrote: »

Why do code snippets look like demented greek here on this Windows 7 laptop?

It has something to do with zooming. I use Comodo Dragon (Chrome Clone). With zoom 150% and zoom 175% it is bad. Below and above it is good.

Dave Hein · 2014-07-07 12:00

I ran the same Forth code using the Fast interpreter, and I got 26,928 cycles. This interpreter uses an inner loop that is similar to Tachyon's, so I would expect similar results from Tachyon. I'm looking forward to seeing the numbers from Peter and Doug.

Heater. · 2014-07-07 12:04

dnalor,

Wow, yes you are right. Small, OK. Big, OK. Normal readable size, gibberish. God I hate Windows, or Chrome, or both.

Thanks.
