PropellerForth

Cliff L. Biffle · 2006-11-09 17:20

Okay, cooked up a more interesting example of multi-Cog Forth.

It's not protein folding, but it's a little more flexible than what I posted before. This example will spawn Forth tasks on all eight Cogs (counting the console task on Cog 0) and show it by blinking LEDs from each task. It allocates 7 USER task areas starting from $6000, for a total memory footprint of slightly over a kilobyte.

First, some toolkit words for setting up and starting tasks:

\ Given the base address of a task's USER area, does initial setup.
\ This doesn't set the BASE or other fields you need to do console
\ I/O, and sets the stack sizes very small.
: init-user ( user-addr -- )
  [noparse][[/noparse] hex ]
  dup 90 +  over 10 +  !
  70 +  over 14 +  ! ;

\ Handy holder for the size of a task's USER area, for this example.
: task-user-size  0D0 ;

\ Given a USER address and an XT, sets the first-word field.
: set-first-word ( user-addr xt -- )
  swap 18 + ! ;

\ Given a prepared USER address, spawns a new Cog running Forth. 
: start-forth-task ( user-addr -- )
  [noparse][[/noparse] decimal ]
  16 lshift   'kernel 2 lshift or   8 or
  2 hubop ;

\ Kills every Cog but the console Cog (0).
: killall
  8 1 do  i cogstop  loop ;

And some demo program code:

\ Blinks an LED, 0-7, depending on which Cog we're on.
: cog-blink
  1 cogid lshift  ledemit   \ Put a 1 in our cog ID's position
  500 milliseconds wait
  0 ledemit
  500 milliseconds wait ;

\ Blinks forever.  This is separate from the above so that
\ cog-blink can be tested.
: cog-blink-forever
  begin
    cog-blink
  again ;

\ Given the addr of a task's USER area, sets the task up
\ to run cog-blink-forever when it starts.
: make-blink-task ( user-addr -- )
  dup init-user
  [noparse][[/noparse]'] cog-blink-forever set-first-word ;

\ Spawns tasks on Cogs 1-7, offset by a quarter second.
\ Net effect: chasing LEDs.
: many-tasks
  [noparse][[/noparse] hex ]
  7 0 do
    \ Compute task address
    task-user-size i *  6000 + 
    \ Set up the task
    dup init-user
    dup make-blink-task
    \ Spawn
    start-forth-task
    100 milliseconds wait
  loop ;

To use:

many-tasks

To stop:

killall

Peter Jakacki · 2006-11-09 22:32

Well my initial test code isn't pretty but here it is. I have to rush off and start the day but I will get back later. This is fun.

*Peter*

hex

: .hex ( char -- )
 30 + dup
 39 > if 7 + then
 emit ;

: .byte ( byte -- )
 FF and 10 /mod .hex .hex ;

: .word
 FFFF and 100 /mod .byte .byte ;

: .chars ( adrh adrl -- )
 do i c@
   dup 7F > if 7F and then
   dup 20 < if drop 2E then
   emit
 loop ;

\ dump hex bytes + ascii from main memory
: dump ( adr cnt -- )
 over + swap
 do i 0F and 0=
   if cr i .word ." : " then
   i c@ .byte space
   i 1+ 0F and 0=
   if space space i 1+ i 0F - .chars then
 loop ;

\ dump words from cog memory
: ldump ( adr cnt -- )
 over + swap
 do i 1F and 0= if cr i .word ." : " then
 i l@ dup 10 rshift .word .word space
 4 +loop ;

: ftype ( addr cnt -- )
 over + swap do i c@ dup 0= if 20 + then emit loop ;

\ list dictionary words
: WORDS
 0 latest @
 begin
 8 -
 \ 4 words/line
 over 3 > if swap drop 0 swap cr then
 dup .word ." : "
 dup 1+ 7 ftype
 space space swap 1+ swap
 nfa>lfa h@ dup
 while
 repeat
 2drop ;

Cliff L. Biffle · 2006-11-09 23:53

Peter,

Dead-on and reasonably well-factored for "quick-and-dirty" code. My only tip would be to use cfa>nfa instead of 8 -; the dictionary layout is going to change soon.

So, with a cfa>nfa nfa>lfa pair, you can get from the XT to the link field.

The next release will include .R, which will simplify your life some (by eliminating the need for .byte). I'll also go ahead and include more of my programming words (DUMP, FORGET, and possibly SEE if I can get it working right).

Peter Jakacki · 2006-11-10 02:33

Thanks Cliff, ok now I'm back in front of the propeller I can cleanup that bit of code and correct some mistakes too. The WORDS formats each word so that it displays the CFA,code field,attributes,count, and of course the name.

19AE=[noparse][[/noparse]001D] ...04 COLD       1956=[noparse][[/noparse]001D] ...04 FIND
1934=[noparse][[/noparse]001D] ...02 .S         1908=[noparse][[/noparse]001D] ...05 THROW
18DA=[noparse][[/noparse]001D] ...05 CATCH      18C4=[noparse][[/noparse]001D] I..01 \
18AE=[noparse][[/noparse]001D] I..01 (          188A=[noparse][[/noparse]001D] I..02 S"
185E=[noparse][[/noparse]001D] ...04 (S")       1842=[noparse][[/noparse]001D] I..02 ."

Next bit of code I will try to do something useful....

*Peter*

\ PBJ'S q&d CLB propeller forth extensions
hex
\ convert start and cnt to form suitable for "DO"
: bounds ( start cnt -- end start )
 over + swap ;

: spaces ( cnt -- )
 0 do space loop ;

: .hex ( nibble -- )
 30 + dup
 39 > if 7 + then
 emit ;

\ Print as 2 hex digits regardless of current base
: .byte ( byte -- )
 FF and 10 /mod .hex .hex ;

: .word ( 16bits -- )
 FFFF and 100 /mod .byte .byte ;

: .long
 dup 10 rshift .word .word ;

: .chars ( adrh adrl -- )
 do i c@ 7F and
   dup 20 < if drop 2E then emit
 loop ;

\ dump hex bytes + ascii from main memory
: dump ( adr cnt -- )
 bounds
 do i 10 bounds do
   i 0F and 0= if cr i .word ." : " then
   i c@ .byte space
   loop
   2 spaces i 10 + i .chars
 10 +loop ;

\ dump words from cog memory
: ldump ( wadr wcnt -- )
 bounds
 do i 7 and 0= if cr i .word ." : " then
 i l@ .long space
 loop ;

: ftype ( addr cnt -- )
 bounds do i c@ dup 0= if 20 + then emit loop ;

: .head ( cfa -- )
 .word ." =[noparse][[/noparse]" \ print cfa
 dup h@ .word ." ] "
 cfa>nfa
 dup c@ 80 and if ." I" else ." ." then
 dup c@ 40 and if ." C" else ." ." then
 dup c@ 20 and if ." S" else ." ." then
 dup c@ 1F and .byte \ print count+atr
 space dup 1+ 7 ftype  \ print name
 4 spaces ;


\ list dictionary words
: WORDS
 0 latest @
 begin
 \ 2 words/line
 over 1 and 0= if cr then
 dup .head
 swap 1+ swap
 nfa>lfa h@ dup
 while
 repeat
 2drop ;

cgracey · 2006-11-10 03:17

Cliff,

It looks like your Forth is really working well. I don't know much about Forth, but it looks like the kind of thing that I'd really like -- terse and RPN (cuts to the chase, doesn't it?). Are compiler optimizations even possible in such a direct language? It looks very intriguing. I think Chuck Moore had the right idea here.

I was reading earlier today about a new chip called SEAForth24. It has 24 processors which execute·1 billion native Forth instructions each, per second. I think I was reading that it comes in a 240-pin BGA and costs under $20 at 1K units. It seems aimed at very high-volume apps. If I remember from a while back, Chuck Moore designed this chip using his own IC layout tools written in...· Color Forth! It was interesting how he used·his hierarchical language to make a hierarchical chip. He made GDSII (Gerber for chips) tiles which were like macros that he could arrange hierarchically. Outputting the database was no thought for his tools. It seems Chuck Moore is all about identifying redundancies and eliminating them.

Anyway, I'm anxious to see how your Forth goes.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Lawson · 2006-11-10 05:03

huh, that SEAForth24 is an interesting chip.· One aspect I find especially interesting is that it uses asynchronis digital logic. (or more accuratly local/self clocking logic)· The posted power consumption is pretty dang low for a chip with·the equivelent of a ~1GHz clock!·· I wonder how well this type of clocking would adapt to propeller style shared memory interconnect?· I wonder how they get the chip to work in applications that need·precise pacing?· (didn't see anything like the Prop's CNT register)·

Anywho...

Marty

Cliff L. Biffle · 2006-11-10 07:07

People are rediscovering Forth chips all over the place. I read three papers last night alone on the subject, released in the past year.

10 years of silence and they come back...what changed? FPGAs. One of the papers included the complete Verilog for a Forth core inline in the paper, in true Literate Programming style. It was very impressive and makes me want to dust off my Verilog. (FPGAs are so much more capable these days; cores will be the next source to open.)

Chuck's design tools are neat, but the next wave of chips will come from free tools, I promise you that.

Chip, as to Forth:

Forth tends to be terse, but in the hands of a capable programmer it's also one of the most readable languages out there -- because you can conform the entire language to your problem domain. Want infix, object-orientation, auto-vectorization? No sweat.

Forth optimizations are a well-explored field, and are quite straightforward. They tend to fall into a few discrete areas:
- Finding commonly repeated word sequences (swap + over, for example) and synthesizing them into a native-code primitive automatically (so-called superinstructions).
- Inlining short words to save on procedure-call overhead.
- Compiling "hot" words to native code, often on the fly.
- Stack analysis to convert the code to a register-to-register form, on architectures where that makes sense.

If you combine these three, you can compile the entire Forth system to native code, with performance rivaling a good C compiler. All three of these have been available in production systems since the 1980s, and I've used all three of these in other (non-Forth) VMs in the past. (Smalltalk, specifically, which is also a language that allows the programmer to change any aspect on-the-fly, making traditional optimizations difficult.)

On the Propeller, superinstructions are doable -- I've got nearly 200 longs of Cog RAM free in the current kernel. Native compilation is harder, since we can't execute directly from shared or external RAM. I am already working on inlining.

(I've worked with code to page short native-code sections in and out of Cog RAM, but Forth-style threaded interpreters are already very fast. The copy operation is invariably slower than executing an interpreted form -- and threaded code takes up vastly less storage space. I've discussed some potential optimizations for PropellerForth on my tech blog, and there'll be more to come.)

Cliff L. Biffle · 2006-11-10 07:22

So, I didn't read my own docs -- the definition of UNTIL I provided is wrong, and will make your system do annoying things.

Here's one that works.

\ Skips back to a matching BEGIN unless 'flag' is true
: UNTIL  ( flag -- )
  [noparse][[/noparse]'] 0branch compile,
  <resolve ; immediate

This will be included in the next PropellerForth release, but for now, that should do 'ya.

Bill Henning · 2006-11-10 08:28

Cliff, you wrote:

"On the Propeller, superinstructions are doable -- I've got nearly 200 longs of Cog RAM free in the current kernel. Native compilation is harder, since we can't execute directly from shared or external RAM. I am already working on inlining."

Actually... time for me to fess up.

Check the new thread I'm posting in a minute or two "Large memory model for Propeller Assembly language programs"

Cliff L. Biffle · 2006-11-12 02:01

Folks, PropellerForth now supports multitasking "both ways" -- running Forth words on multiple Cogs, and cooperative multitasking on a single Cog.

Each task requires 40 bytes of shared RAM to hold some state, plus stacks (plus another 80 for tasks that talk to the terminal). Only 4 of these bytes are specific to the task switcher.

I demonstrated running Forth code on multiple Cogs earlier, and will post the cooperative task code shortly. In summary form, the system is quite similar to most RTOSes:
- Tasks are linked together into task cycles (mostly transparently to the user), which are scheduled round-robin.
- Control is passed from task to task using the traditional Forth word PAUSE. (If the standard library is made task-aware it'll also be passed at any blocking operation, like some I/O primitives.)
- The code is entirely in Forth, so task switches take several microseconds (see numbers below). This could be optimized or moved to native code, but it's well within the parameters for my application (a full-duplex serial driver).

Using a minimal test environment (total overhead of 60 bytes per task), I've successfully run 50 tasks on a single Cog. All tasks were sharing code and some global data, but had thread-local copies of necessary variables.

The test routine:

: blink-forever
  begin
    u0 ledemit \ writes the address of this task to the debug LEDs
    pause
  again ;

For a ring of 42 tasks (chosen because it's an amusing number and also fits in a buffer I'd allocated), 1,000 context switches from this routine took 87,424,528 cycles. That's 1.09s, or 1.09ms per iteration -- so 26.01us per task activation. With one task (but with task switching), it took 26ms (the same 26us per activation).

With task switching disabled, it took 16.4ms for the same 1,000 iterations, suggesting a task-switch overhead of 10us.

This code will be included in the next release of PropellerForth (which, it should be noted, will include sources!).

Cliff L. Biffle · 2006-11-12 07:11

I've set up a web page for PropellerForth.

www.cliff.biffle.org/software/propeller/forth/

There's a newer release there than what I've posted in this thread, though it's a build from earlier today and lacks the in-Cog tasker.

I've also set up code hosting and issue tracking on Google Code Hosting, so folks can feel free to file bugs against me there. (The project page is linked from the homepage above.)

Future updates will be posted on the website; I find forum threads to be really bad repositories for information, so I'm going to quit updating this one unless anyone has specific questions or issues they'd like to discuss.

Bill Henning · 2006-11-12 07:18

Way to go!!!

Very kewl.

Cliff L. Biffle said...
Folks, PropellerForth now supports multitasking "both ways" -- running Forth words on multiple Cogs, and cooperative multitasking on a single Cog.

Each task requires 40 bytes of shared RAM to hold some state, plus stacks (plus another 80 for tasks that talk to the terminal). Only 4 of these bytes are specific to the task switcher.

I demonstrated running Forth code on multiple Cogs earlier, and will post the cooperative task code shortly. In summary form, the system is quite similar to most RTOSes:
- Tasks are linked together into task cycles (mostly transparently to the user), which are scheduled round-robin.
- Control is passed from task to task using the traditional Forth word PAUSE. (If the standard library is made task-aware it'll also be passed at any blocking operation, like some I/O primitives.)
- The code is entirely in Forth, so task switches take several microseconds (see numbers below). This could be optimized or moved to native code, but it's well within the parameters for my application (a full-duplex serial driver).

Using a minimal test environment (total overhead of 60 bytes per task), I've successfully run 50 tasks on a single Cog. All tasks were sharing code and some global data, but had thread-local copies of necessary variables.

The test routine:
: blink-forever
  begin
    u0 ledemit \ writes the address of this task to the debug LEDs
    pause
  again ;
For a ring of 42 tasks (chosen because it's an amusing number and also fits in a buffer I'd allocated), 1,000 context switches from this routine took 87,424,528 cycles. That's 1.09s, or 1.09ms per iteration -- so 26.01us per task activation. With one task (but with task switching), it took 26ms (the same 26us per activation).

With task switching disabled, it took 16.4ms for the same 1,000 iterations, suggesting a task-switch overhead of 10us.

This code will be included in the next release of PropellerForth (which, it should be noted, will include sources!).

Cliff L. Biffle · 2006-11-12 08:28

Okay, I lied. Here's an update.

I've managed to run 256 threads on a single core. All were running a really simple cycle-burner so I could measure their performance.

For the folks who'd rather not read through a page of Forth code, here are the results. Sure wish mCode supported tables.

Benchmark Times (all in cycles)

Single Task:
Tasker disabled, one task: 1,536,528
Tasker enabled, one task: 2,304,528

Having the tasker enabled imposes roughly a 768 cycle penalty for each call to PAUSE, even with only one task. That's what I get for writing the engine in an afternoon; I'll optimize it later. (For those playing along at home, that's 9.6us.)

Multiple Tasks:
2: 4,560,528
4: 9,072,528
8: 18,096,528
16: 36,144,528
32: 72,240,528
64: 144,432,528
128: 288,816,528
256: 577,584,528

Performance scales almost perfectly linearly with the number of tasks -- that is, doubling the number of tasks halves your speed. The scheduler is deterministic O(n) for the number of tasks, by design -- though this may change when I get the thread prioritization code working.

The system has no hard limit to the number of tasks; it's bounded only by available RAM, and at 256 tasks, I had -4 bytes free. (I allowed one task to reuse some of my interpreter's thread-locals. It's cheating, but it worked.) I am deliberately overallocating, giving each task 96 (decimal) bytes.

Astute readers have noticed that every time ends in 528. I think it's weird too. You can try to replicate it with the sources below.

The Code

This will be built into the next PropellerForth release, but interested parties can try it now. Paste the code into your terminal emulator, or put it in a text file and send it as ASCII.

Note that this is not exactly the code I ran for my benchmarks above. It's slightly better factored, and runs a bit slower. Readers can match (or beat) my numbers by inlining the task-id display code.

First, the multitasking engine:

hex   \ the number base of champions

\ Given a pointer to the base of a task's USER area,
\ sets up a pretty generic, spacious task.
\ Note: you cannot PAUSE directly to this task, you
\ must first activate it with start-task
: create-task ( user-addr -- user-addr )
    10 over !     \ Set the number base for I/O...
    dup 7C +  over 10 + !   \ ...the stack pointer
    dup 17C +  over 14 + !   \ ...the return pointer
    next-task @  over 24 + !  \ ...the next task in the cycle
    dup next-task ! ;  \ ...and set it to our next task

\ Given a pointer to the base of a task's USER area,
\ sets up a dinky task suitable only for trivial things.
\ Note: you cannot PAUSE directly to this task, you
\ must first activate it with start-task
: create-small-task ( user-addr -- user-addr )
    10 over !
    dup 28 +  over 10 + !
    dup 48 +  over 14 + !
    next-task @  over 24 + !
    dup next-task ! ;

\ Suspends the current task, passing control to the
\ next (which may be the same task)
: pause
  rsp@   \ stash our return-stack pointer on the param stack.
  sp@  saved-sp !  \ store our stack pointer in a USER thread-local
  next-task @  usp!  \ activate the next task
  saved-sp @ sp!  \ restore our stack pointer (we're in the next task)
  rsp! ;  \ restore our return stack pointer and proceed

\ Given a pointer to a USER area (prepared with
\ create-task or create-small-task) and the XT of
\ a word to run, starts a new task, immediately
\ giving it control.
: start-task  ( user-addr xt -- )
  rsp@ 
  rot rot  2>r
  sp@ saved-sp !
  r> r>
  dup next-task !
  usp!
  sp0 @ !
  sp0 @ sp!
  rsp0 @ rsp!
  execute    \ Start the task's initial word
  bye ;   \ If we ever return, halt -- something is wrong

\ Creates a small task using memory from the
\ dictionary.
: allot-small-task  ( -- user-addr )
  align   \ make sure we're at a word boundary
  here   60 allot   \ make room
  create-small-task ;

Now, some code to multitask:

\ Shows a semi-unique value for the
\ current task, to show that we've switched
: show-task-id
  u0  \ address of user area
  5 rshift  \ these bits are not interesting, drop them
  ledemit ;

\ Simply keeps showing our task ID.
\ This is used for the non-interactive tasks.
: show-task-id-forever
  begin
    show-task-id
    pause
  again ;

\ Show the task ID and pause a specified
\ number of times.  Used for the interactive
\ task (so we get control back).
: show-task-id-repeatedly  ( count -- )
  0 do
    show-task-id
    pause
  loop ;

: spawn-task
  allot-small-task
  [noparse][[/noparse]'] show-task-id-forever start-task ;

: tasks ( count -- )
  0 do  spawn-task  loop ;

To use, enter something like the following:

32 tasks
1000 show-task-id-repeatedly

If you'd like to see how long it takes, enter this word:

: benchmark
  '
  cnt l@ >r
  execute
  cnt l@ r> - 
  . ." cycles" cr ;

It's used like this:

1000  benchmark show-task-id-repeatedly

Happy hacking!

Post Edited (Cliff L. Biffle) : 11/12/2006 8:36:46 AM GMT

Phil Pilgrim (PhiPi) · 2006-11-12 09:15

Cliff,

mCode does support tables, apparently. Chip posts them occasionally (example here). I can't figure out how he does it, though, and I've been meaning to ask him.

-Phil

cgracey · 2006-11-12 10:48

I didn't know one way or another, I just copied from Word and pasted it in.

Phil Pilgrim (PhiPi) said...
Cliff,

mCode does support tables, apparently. Chip posts them occasionally (example here). I can't figure out how he does it, though, and I've been meaning to ask him.

-Phil

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

Chip Gracey
Parallax, Inc.

Cliff L. Biffle · 2006-11-12 17:01

I reckon that Chip, as a forum moderator, can use HTML.

simonl · 2007-04-02 12:27

Hey Cliff, how PropellerFORTH comin' on?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,

Simon

BTW: I type as I'm thinking, so please don't take any offense at my writing style

www.norfolkhelicopterclub.co.uk
You'll always have as many take-offs as landings, the trick is to be sure you can take-off again ;-)

PropellerForth

Comments