Shop OBEX P1 Docs P2 Docs Learn Events
PropellerForth - Page 2 — Parallax Forums

PropellerForth

2»

Comments

  • Cliff L. BiffleCliff L. Biffle Posts: 206
    edited 2006-11-09 17:20
    Okay, cooked up a more interesting example of multi-Cog Forth.

    It's not protein folding, but it's a little more flexible than what I posted before. This example will spawn Forth tasks on all eight Cogs (counting the console task on Cog 0) and show it by blinking LEDs from each task. It allocates 7 USER task areas starting from $6000, for a total memory footprint of slightly over a kilobyte.

    First, some toolkit words for setting up and starting tasks:
    \ Given the base address of a task's USER area, does initial setup.
    \ This doesn't set the BASE or other fields you need to do console
    \ I/O, and sets the stack sizes very small.
    : init-user ( user-addr -- )
      [noparse][[/noparse] hex ]
      dup 90 +  over 10 +  !
      70 +  over 14 +  ! ;
    
    \ Handy holder for the size of a task's USER area, for this example.
    : task-user-size  0D0 ;
    
    \ Given a USER address and an XT, sets the first-word field.
    : set-first-word ( user-addr xt -- )
      swap 18 + ! ;
    
    \ Given a prepared USER address, spawns a new Cog running Forth. 
    : start-forth-task ( user-addr -- )
      [noparse][[/noparse] decimal ]
      16 lshift   'kernel 2 lshift or   8 or
      2 hubop ;
    
    \ Kills every Cog but the console Cog (0).
    : killall
      8 1 do  i cogstop  loop ;
    
    
    



    And some demo program code:
    \ Blinks an LED, 0-7, depending on which Cog we're on.
    : cog-blink
      1 cogid lshift  ledemit   \ Put a 1 in our cog ID's position
      500 milliseconds wait
      0 ledemit
      500 milliseconds wait ;
    
    \ Blinks forever.  This is separate from the above so that
    \ cog-blink can be tested.
    : cog-blink-forever
      begin
        cog-blink
      again ;
    
    \ Given the addr of a task's USER area, sets the task up
    \ to run cog-blink-forever when it starts.
    : make-blink-task ( user-addr -- )
      dup init-user
      [noparse][[/noparse]'] cog-blink-forever set-first-word ;
    
    \ Spawns tasks on Cogs 1-7, offset by a quarter second.
    \ Net effect: chasing LEDs.
    : many-tasks
      [noparse][[/noparse] hex ]
      7 0 do
        \ Compute task address
        task-user-size i *  6000 + 
        \ Set up the task
        dup init-user
        dup make-blink-task
        \ Spawn
        start-forth-task
        100 milliseconds wait
      loop ;
    
    



    To use:
    many-tasks
    
    



    To stop:
    killall
    
    
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2006-11-09 22:32
    Well my initial test code isn't pretty but here it is. I have to rush off and start the day but I will get back later. This is fun.

    *Peter*

    hex
    
    : .hex ( char -- )
     30 + dup
     39 > if 7 + then
     emit ;
    
    : .byte ( byte -- )
     FF and 10 /mod .hex .hex ;
    
    : .word
     FFFF and 100 /mod .byte .byte ;
    
    : .chars ( adrh adrl -- )
     do i c@
       dup 7F > if 7F and then
       dup 20 < if drop 2E then
       emit
     loop ;
    
    \ dump hex bytes + ascii from main memory
    : dump ( adr cnt -- )
     over + swap
     do i 0F and 0=
       if cr i .word ." : " then
       i c@ .byte space
       i 1+ 0F and 0=
       if space space i 1+ i 0F - .chars then
     loop ;
    
    \ dump words from cog memory
    : ldump ( adr cnt -- )
     over + swap
     do i 1F and 0= if cr i .word ." : " then
     i l@ dup 10 rshift .word .word space
     4 +loop ;
    
    : ftype ( addr cnt -- )
     over + swap do i c@ dup 0= if 20 + then emit loop ;
    
    \ list dictionary words
    : WORDS
     0 latest @
     begin
     8 -
     \ 4 words/line
     over 3 > if swap drop 0 swap cr then
     dup .word ." : "
     dup 1+ 7 ftype
     space space swap 1+ swap
     nfa>lfa h@ dup
     while
     repeat
     2drop ;
    
    
    
  • Cliff L. BiffleCliff L. Biffle Posts: 206
    edited 2006-11-09 23:53
    Peter,

    Dead-on and reasonably well-factored for "quick-and-dirty" code. My only tip would be to use cfa>nfa instead of 8 -; the dictionary layout is going to change soon.

    So, with a cfa>nfa nfa>lfa pair, you can get from the XT to the link field.

    The next release will include .R, which will simplify your life some (by eliminating the need for .byte). I'll also go ahead and include more of my programming words (DUMP, FORGET, and possibly SEE if I can get it working right).
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2006-11-10 02:33
    Thanks Cliff, ok now I'm back in front of the propeller I can cleanup that bit of code and correct some mistakes too. The WORDS formats each word so that it displays the CFA,code field,attributes,count, and of course the name.
    19AE=[noparse][[/noparse]001D] ...04 COLD       1956=[noparse][[/noparse]001D] ...04 FIND
    1934=[noparse][[/noparse]001D] ...02 .S         1908=[noparse][[/noparse]001D] ...05 THROW
    18DA=[noparse][[/noparse]001D] ...05 CATCH      18C4=[noparse][[/noparse]001D] I..01 \
    18AE=[noparse][[/noparse]001D] I..01 (          188A=[noparse][[/noparse]001D] I..02 S"
    185E=[noparse][[/noparse]001D] ...04 (S")       1842=[noparse][[/noparse]001D] I..02 ."
    
    



    Next bit of code I will try to do something useful....

    *Peter*

    \ PBJ'S q&d CLB propeller forth extensions
    hex
    \ convert start and cnt to form suitable for "DO"
    : bounds ( start cnt -- end start )
     over + swap ;
    
    : spaces ( cnt -- )
     0 do space loop ;
    
    : .hex ( nibble -- )
     30 + dup
     39 > if 7 + then
     emit ;
    
    \ Print as 2 hex digits regardless of current base
    : .byte ( byte -- )
     FF and 10 /mod .hex .hex ;
    
    : .word ( 16bits -- )
     FFFF and 100 /mod .byte .byte ;
    
    : .long
     dup 10 rshift .word .word ;
    
    : .chars ( adrh adrl -- )
     do i c@ 7F and
       dup 20 < if drop 2E then emit
     loop ;
    
    \ dump hex bytes + ascii from main memory
    : dump ( adr cnt -- )
     bounds
     do i 10 bounds do
       i 0F and 0= if cr i .word ." : " then
       i c@ .byte space
       loop
       2 spaces i 10 + i .chars
     10 +loop ;
    
    \ dump words from cog memory
    : ldump ( wadr wcnt -- )
     bounds
     do i 7 and 0= if cr i .word ." : " then
     i l@ .long space
     loop ;
    
    : ftype ( addr cnt -- )
     bounds do i c@ dup 0= if 20 + then emit loop ;
    
    : .head ( cfa -- )
     .word ." =[noparse][[/noparse]" \ print cfa
     dup h@ .word ." ] "
     cfa>nfa
     dup c@ 80 and if ." I" else ." ." then
     dup c@ 40 and if ." C" else ." ." then
     dup c@ 20 and if ." S" else ." ." then
     dup c@ 1F and .byte \ print count+atr
     space dup 1+ 7 ftype  \ print name
     4 spaces ;
    
    
    \ list dictionary words
    : WORDS
     0 latest @
     begin
     \ 2 words/line
     over 1 and 0= if cr then
     dup .head
     swap 1+ swap
     nfa>lfa h@ dup
     while
     repeat
     2drop ;
    
    
    
  • cgraceycgracey Posts: 14,133
    edited 2006-11-10 03:17
    Cliff,

    It looks like your Forth is really working well. I don't know much about Forth, but it looks like the kind of thing that I'd really like -- terse and RPN (cuts to the chase, doesn't it?). Are compiler optimizations even possible in such a direct language? It looks very intriguing. I think Chuck Moore had the right idea here.

    I was reading earlier today about a new chip called SEAForth24. It has 24 processors which execute·1 billion native Forth instructions each, per second. I think I was reading that it comes in a 240-pin BGA and costs under $20 at 1K units. It seems aimed at very high-volume apps. If I remember from a while back, Chuck Moore designed this chip using his own IC layout tools written in...· Color Forth! It was interesting how he used·his hierarchical language to make a hierarchical chip. He made GDSII (Gerber for chips) tiles which were like macros that he could arrange hierarchically. Outputting the database was no thought for his tools. It seems Chuck Moore is all about identifying redundancies and eliminating them.

    Anyway, I'm anxious to see how your Forth goes.



    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • LawsonLawson Posts: 870
    edited 2006-11-10 05:03
    huh, that SEAForth24 is an interesting chip.· One aspect I find especially interesting is that it uses asynchronis digital logic. (or more accuratly local/self clocking logic)· The posted power consumption is pretty dang low for a chip with·the equivelent of a ~1GHz clock!·· I wonder how well this type of clocking would adapt to propeller style shared memory interconnect?· I wonder how they get the chip to work in applications that need·precise pacing?· (didn't see anything like the Prop's CNT register)·

    Anywho...

    Marty
  • Cliff L. BiffleCliff L. Biffle Posts: 206
    edited 2006-11-10 07:07
    People are rediscovering Forth chips all over the place. I read three papers last night alone on the subject, released in the past year.

    10 years of silence and they come back...what changed? FPGAs. One of the papers included the complete Verilog for a Forth core inline in the paper, in true Literate Programming style. It was very impressive and makes me want to dust off my Verilog. (FPGAs are so much more capable these days; cores will be the next source to open.)

    Chuck's design tools are neat, but the next wave of chips will come from free tools, I promise you that.


    Chip, as to Forth:

    Forth tends to be terse, but in the hands of a capable programmer it's also one of the most readable languages out there -- because you can conform the entire language to your problem domain. Want infix, object-orientation, auto-vectorization? No sweat.

    Forth optimizations are a well-explored field, and are quite straightforward. They tend to fall into a few discrete areas:
    - Finding commonly repeated word sequences (swap + over, for example) and synthesizing them into a native-code primitive automatically (so-called superinstructions).
    - Inlining short words to save on procedure-call overhead.
    - Compiling "hot" words to native code, often on the fly.
    - Stack analysis to convert the code to a register-to-register form, on architectures where that makes sense.

    If you combine these three, you can compile the entire Forth system to native code, with performance rivaling a good C compiler. All three of these have been available in production systems since the 1980s, and I've used all three of these in other (non-Forth) VMs in the past. (Smalltalk, specifically, which is also a language that allows the programmer to change any aspect on-the-fly, making traditional optimizations difficult.)

    On the Propeller, superinstructions are doable -- I've got nearly 200 longs of Cog RAM free in the current kernel. Native compilation is harder, since we can't execute directly from shared or external RAM. I am already working on inlining.

    (I've worked with code to page short native-code sections in and out of Cog RAM, but Forth-style threaded interpreters are already very fast. The copy operation is invariably slower than executing an interpreted form -- and threaded code takes up vastly less storage space. I've discussed some potential optimizations for PropellerForth on my tech blog, and there'll be more to come.)
  • Cliff L. BiffleCliff L. Biffle Posts: 206
    edited 2006-11-10 07:22
    So, I didn't read my own docs -- the definition of UNTIL I provided is wrong, and will make your system do annoying things.

    Here's one that works.

    \ Skips back to a matching BEGIN unless 'flag' is true
    : UNTIL  ( flag -- )
      [noparse][[/noparse]'] 0branch compile,
      <resolve ; immediate
    
    



    This will be included in the next PropellerForth release, but for now, that should do 'ya.
  • Bill HenningBill Henning Posts: 6,445
    edited 2006-11-10 08:28
    Cliff, you wrote:

    "On the Propeller, superinstructions are doable -- I've got nearly 200 longs of Cog RAM free in the current kernel. Native compilation is harder, since we can't execute directly from shared or external RAM. I am already working on inlining."

    Actually... time for me to fess up.

    Check the new thread I'm posting in a minute or two "Large memory model for Propeller Assembly language programs"
  • Cliff L. BiffleCliff L. Biffle Posts: 206
    edited 2006-11-12 02:01
    Folks, PropellerForth now supports multitasking "both ways" -- running Forth words on multiple Cogs, and cooperative multitasking on a single Cog.

    Each task requires 40 bytes of shared RAM to hold some state, plus stacks (plus another 80 for tasks that talk to the terminal). Only 4 of these bytes are specific to the task switcher.

    I demonstrated running Forth code on multiple Cogs earlier, and will post the cooperative task code shortly. In summary form, the system is quite similar to most RTOSes:
    - Tasks are linked together into task cycles (mostly transparently to the user), which are scheduled round-robin.
    - Control is passed from task to task using the traditional Forth word PAUSE. (If the standard library is made task-aware it'll also be passed at any blocking operation, like some I/O primitives.)
    - The code is entirely in Forth, so task switches take several microseconds (see numbers below). This could be optimized or moved to native code, but it's well within the parameters for my application (a full-duplex serial driver).

    Using a minimal test environment (total overhead of 60 bytes per task), I've successfully run 50 tasks on a single Cog. All tasks were sharing code and some global data, but had thread-local copies of necessary variables.

    The test routine:
    : blink-forever
      begin
        u0 ledemit \ writes the address of this task to the debug LEDs
        pause
      again ;
    
    



    For a ring of 42 tasks (chosen because it's an amusing number and also fits in a buffer I'd allocated), 1,000 context switches from this routine took 87,424,528 cycles. That's 1.09s, or 1.09ms per iteration -- so 26.01us per task activation. With one task (but with task switching), it took 26ms (the same 26us per activation).

    With task switching disabled, it took 16.4ms for the same 1,000 iterations, suggesting a task-switch overhead of 10us.

    This code will be included in the next release of PropellerForth (which, it should be noted, will include sources!).
  • Cliff L. BiffleCliff L. Biffle Posts: 206
    edited 2006-11-12 07:11
    I've set up a web page for PropellerForth.

    www.cliff.biffle.org/software/propeller/forth/

    There's a newer release there than what I've posted in this thread, though it's a build from earlier today and lacks the in-Cog tasker.

    I've also set up code hosting and issue tracking on Google Code Hosting, so folks can feel free to file bugs against me there. (The project page is linked from the homepage above.)

    Future updates will be posted on the website; I find forum threads to be really bad repositories for information, so I'm going to quit updating this one unless anyone has specific questions or issues they'd like to discuss.
  • Bill HenningBill Henning Posts: 6,445
    edited 2006-11-12 07:18
    Way to go!!!

    Very kewl.
    Cliff L. Biffle said...
    Folks, PropellerForth now supports multitasking "both ways" -- running Forth words on multiple Cogs, and cooperative multitasking on a single Cog.

    Each task requires 40 bytes of shared RAM to hold some state, plus stacks (plus another 80 for tasks that talk to the terminal). Only 4 of these bytes are specific to the task switcher.

    I demonstrated running Forth code on multiple Cogs earlier, and will post the cooperative task code shortly. In summary form, the system is quite similar to most RTOSes:
    - Tasks are linked together into task cycles (mostly transparently to the user), which are scheduled round-robin.
    - Control is passed from task to task using the traditional Forth word PAUSE. (If the standard library is made task-aware it'll also be passed at any blocking operation, like some I/O primitives.)
    - The code is entirely in Forth, so task switches take several microseconds (see numbers below). This could be optimized or moved to native code, but it's well within the parameters for my application (a full-duplex serial driver).

    Using a minimal test environment (total overhead of 60 bytes per task), I've successfully run 50 tasks on a single Cog. All tasks were sharing code and some global data, but had thread-local copies of necessary variables.

    The test routine:
    : blink-forever
      begin
        u0 ledemit \ writes the address of this task to the debug LEDs
        pause
      again ;
    
    



    For a ring of 42 tasks (chosen because it's an amusing number and also fits in a buffer I'd allocated), 1,000 context switches from this routine took 87,424,528 cycles. That's 1.09s, or 1.09ms per iteration -- so 26.01us per task activation. With one task (but with task switching), it took 26ms (the same 26us per activation).

    With task switching disabled, it took 16.4ms for the same 1,000 iterations, suggesting a task-switch overhead of 10us.

    This code will be included in the next release of PropellerForth (which, it should be noted, will include sources!).
  • Cliff L. BiffleCliff L. Biffle Posts: 206
    edited 2006-11-12 08:28
    Okay, I lied. Here's an update.

    I've managed to run 256 threads on a single core. All were running a really simple cycle-burner so I could measure their performance.

    For the folks who'd rather not read through a page of Forth code, here are the results. Sure wish mCode supported tables.

    Benchmark Times (all in cycles)

    Single Task:
    Tasker disabled, one task: 1,536,528
    Tasker enabled, one task: 2,304,528

    Having the tasker enabled imposes roughly a 768 cycle penalty for each call to PAUSE, even with only one task. That's what I get for writing the engine in an afternoon; I'll optimize it later. (For those playing along at home, that's 9.6us.)

    Multiple Tasks:
    2: 4,560,528
    4: 9,072,528
    8: 18,096,528
    16: 36,144,528
    32: 72,240,528
    64: 144,432,528
    128: 288,816,528
    256: 577,584,528

    Performance scales almost perfectly linearly with the number of tasks -- that is, doubling the number of tasks halves your speed. The scheduler is deterministic O(n) for the number of tasks, by design -- though this may change when I get the thread prioritization code working.

    The system has no hard limit to the number of tasks; it's bounded only by available RAM, and at 256 tasks, I had -4 bytes free. (I allowed one task to reuse some of my interpreter's thread-locals. It's cheating, but it worked.) I am deliberately overallocating, giving each task 96 (decimal) bytes.

    Astute readers have noticed that every time ends in 528. I think it's weird too. You can try to replicate it with the sources below.

    The Code

    This will be built into the next PropellerForth release, but interested parties can try it now. Paste the code into your terminal emulator, or put it in a text file and send it as ASCII.

    Note that this is not exactly the code I ran for my benchmarks above. It's slightly better factored, and runs a bit slower. Readers can match (or beat) my numbers by inlining the task-id display code.

    First, the multitasking engine:

    hex   \ the number base of champions
    
    \ Given a pointer to the base of a task's USER area,
    \ sets up a pretty generic, spacious task.
    \ Note: you cannot PAUSE directly to this task, you
    \ must first activate it with start-task
    : create-task ( user-addr -- user-addr )
        10 over !     \ Set the number base for I/O...
        dup 7C +  over 10 + !   \ ...the stack pointer
        dup 17C +  over 14 + !   \ ...the return pointer
        next-task @  over 24 + !  \ ...the next task in the cycle
        dup next-task ! ;  \ ...and set it to our next task
    
    \ Given a pointer to the base of a task's USER area,
    \ sets up a dinky task suitable only for trivial things.
    \ Note: you cannot PAUSE directly to this task, you
    \ must first activate it with start-task
    : create-small-task ( user-addr -- user-addr )
        10 over !
        dup 28 +  over 10 + !
        dup 48 +  over 14 + !
        next-task @  over 24 + !
        dup next-task ! ;
    
    \ Suspends the current task, passing control to the
    \ next (which may be the same task)
    : pause
      rsp@   \ stash our return-stack pointer on the param stack.
      sp@  saved-sp !  \ store our stack pointer in a USER thread-local
      next-task @  usp!  \ activate the next task
      saved-sp @ sp!  \ restore our stack pointer (we're in the next task)
      rsp! ;  \ restore our return stack pointer and proceed
    
    \ Given a pointer to a USER area (prepared with
    \ create-task or create-small-task) and the XT of
    \ a word to run, starts a new task, immediately
    \ giving it control.
    : start-task  ( user-addr xt -- )
      rsp@ 
      rot rot  2>r
      sp@ saved-sp !
      r> r>
      dup next-task !
      usp!
      sp0 @ !
      sp0 @ sp!
      rsp0 @ rsp!
      execute    \ Start the task's initial word
      bye ;   \ If we ever return, halt -- something is wrong
    
    \ Creates a small task using memory from the
    \ dictionary.
    : allot-small-task  ( -- user-addr )
      align   \ make sure we're at a word boundary
      here   60 allot   \ make room
      create-small-task ;
    
    



    Now, some code to multitask:
    \ Shows a semi-unique value for the
    \ current task, to show that we've switched
    : show-task-id
      u0  \ address of user area
      5 rshift  \ these bits are not interesting, drop them
      ledemit ;
    
    \ Simply keeps showing our task ID.
    \ This is used for the non-interactive tasks.
    : show-task-id-forever
      begin
        show-task-id
        pause
      again ;
    
    \ Show the task ID and pause a specified
    \ number of times.  Used for the interactive
    \ task (so we get control back).
    : show-task-id-repeatedly  ( count -- )
      0 do
        show-task-id
        pause
      loop ;
    
    : spawn-task
      allot-small-task
      [noparse][[/noparse]'] show-task-id-forever start-task ;
    
    : tasks ( count -- )
      0 do  spawn-task  loop ;
    
    




    To use, enter something like the following:
    32 tasks
    1000 show-task-id-repeatedly
    
    



    If you'd like to see how long it takes, enter this word:
    : benchmark
      '
      cnt l@ >r
      execute
      cnt l@ r> - 
      . ." cycles" cr ;
    
    



    It's used like this:
    1000  benchmark show-task-id-repeatedly
    
    



    Happy hacking!

    Post Edited (Cliff L. Biffle) : 11/12/2006 8:36:46 AM GMT
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2006-11-12 09:15
    Cliff,

    mCode does support tables, apparently. Chip posts them occasionally (example here). I can't figure out how he does it, though, and I've been meaning to ask him.

    -Phil
  • cgraceycgracey Posts: 14,133
    edited 2006-11-12 10:48
    I didn't know one way or another, I just copied from Word and pasted it in.
    Phil Pilgrim (PhiPi) said...
    Cliff,

    mCode does support tables, apparently. Chip posts them occasionally (example here). I can't figure out how he does it, though, and I've been meaning to ask him.

    -Phil
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Cliff L. BiffleCliff L. Biffle Posts: 206
    edited 2006-11-12 17:01
    I reckon that Chip, as a forum moderator, can use HTML.
  • simonlsimonl Posts: 866
    edited 2007-04-02 12:27
    Hey Cliff, how PropellerFORTH comin' on?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Cheers,

    Simon

    BTW: I type as I'm thinking, so please don't take any offense at my writing style smile.gif

    www.norfolkhelicopterclub.co.uk
    You'll always have as many take-offs as landings, the trick is to be sure you can take-off again ;-)
Sign In or Register to comment.