TACHYON O/S V3.0 JUNO - Furiously Fast Forth, FAT32+LAN+VGA+RS485+OBEX ROMS+FP+LMM+++ — Parallax Forums

Peter Jakacki Posts: 10,193
edited 2016-07-22 00:14 in Propeller 1
Updated links 150826
Propeller Hardware Explorer with VGA
Tachyon Dropbox files and binaries (latest)
Introduction to TACHYON Forth
Tachyon Forth Resource Links
Tachyon Web Server
FTP: ftp://tachyonforth.com
Telnet: tachyonforth.com 10001

Watch Easynet in operation




Note: these early posts are mostly historical only, please read the latest posts or click the links in my sig.
Enhanced bitmap graphics demo + serial


*ORIGINAL POST*

I've been hooked onto Forth again after a long break away. Thanks to Sal's PropForth and, recently, the Bluetooth modules, I have rediscovered the advantages and fun of programming and testing in a Forth environment. I mentioned I have been away from Forth for a while, and that has to do with the Propeller chip: I like using it, but Forth does not lend itself to this architecture very easily. Several years ago (time flies) I had a look at writing a Forth called CogForth for the Prop, but I felt it was too much hard work, which it was, not just because of the architecture but also because of the limitations of the tools (Spin tool etc). Even so, the Forth would have been slow for what I need and there were memory limitations.

However, spurred on by the efforts of Sal Sanci and Prop Braino I have taken another look at my old CogForth since I needed amongst other things more runtime speed without having to resort to assembler. So over the last couple of days CogForth has been completely revamped and I think I'm on a winner with this implementation. It is both fast and very small thanks to the byte codes for each Forth VM operation. Like a tachyon, it is fast and very small (as a hypothetical particle anyway) with emphasis on fast I/O operations and maximizing the Propeller's memory. What would some byte code look like? Have a look at this function which prints a hex character:
0534(001C)             | PRTHEX  ' ( n -- ) print n (0..$0F) as a hex character
0534(001C) 2D          |         byte  CLIT/2,$30,PLUS/2
0535(001C) 30          | 
0536(001C) 0C          | 
0537(001C) 05          |         byte  DUP/2,CLIT/2,$39,GT/2,_IF/2,3
0538(001D) 2D          | 
0539(001D) 39          | 
053A(001D) 20          | 
053B(001D) 3E          | 
053C(001E) 03          | 
053D(001E) 2D          |         byte  CLIT/2,12,PLUS/2                      'Adjust for A..F
053E(001E) 0C          | 
053F(001E) 0C          | 
0540(001F) 49          | PRTCH   byte  EMIT/2,EXIT/2
0541(001F) 00          | 
EDIT: Byte codes must be shifted one bit right to compress 9 bits into 8; the lsb is always zero as all byte code functions are on double-long boundaries.

So you see this function takes 14 bytes. Compare this to the Spin version, which also uses byte codes and takes 20 bytes:
88                        char+=$30
Addr : 05B0:          38 30  : Constant 1 Bytes - 30 - $00000030 48
Addr : 05B2:          66 4C  : Variable Operation Local Offset - 1 Assign WordMathop +
89                        if char > $39
Addr : 05B4:             64  : Variable Operation Local Offset - 1 Read
Addr : 05B5:          38 39  : Constant 1 Bytes - 39 - $00000039 57
Addr : 05B7:             FA  : Math Op >     
Addr : 05B8: JZ Label0002
Addr : 05B8:          0A 04  : jz Address = 05BE 4
90                          char+=12
Addr : 05BA:          38 0C  : Constant 1 Bytes - 0C - $0000000C 12
Addr : 05BC:          66 4C  : Variable Operation Local Offset - 1 Assign WordMathop +
Addr : 05BE: Label0002
Addr : 05BE: Label0003
91                        coms.tx(char)
Addr : 05BE:             01  : Drop Anchor   
Addr : 05BF:             64  : Variable Operation Local Offset - 1 Read
Addr : 05C0:       06 03 0B  : Call Obj.Sub 3 11
Addr : 05C3:             32  : Return

<removed proposed dictionary description>

The runtime speed comes mainly from two things: many of the primitives are written in assembly, and the stacks are implemented in a way better suited to the Prop's architecture, permitting direct addressing just like a register. So many of the primitives get the job done with very few instructions and even the runtime interpreter is lean and mean. A byte code is read from hub RAM and shifted up to 9 bits to form an address in COG memory which the Prop jumps to, so it's very direct. The runtime interpreter looks like this:
doNEXT                  rdbyte  token,IP                'read byte code instruction
                        add     IP,#1                   'advance IP to next byte token
                        shl     token,#1                'expand to 9-bits - all byte codes point to code on double-long boundary
                        cmp     token,#$180 wc          'tokens $C0..$FF are calls to kernel byte code via kbctbl
               if_c     jmp     token                   'directly execute PASM byte codes without further ado

EDIT: Fixed a bug when testing for PASM for extended byte code functions.

There's a test for byte codes from $C0..$FF which doesn't really impact the speedy operation of the assembly primitives, which are indexed by codes $00..$BF. The reason I reserve some codes is that there is no way you could use all 256 codes for assembly primitives, so I used some to form a very compact way of accessing up to 64 more words (functions) which are interpreted byte codes instead of assembly code. All byte code functions other than these special 64 are referenced with 2 or 3 bytes, one of which is the byte code while the other 1 or 2 bytes are a relative address pointing back to the word function. The one byte CALL gets straight into 1 of 64 higher-level functions which are themselves comprised of byte codes which eventually execute assembly code via the first 192 byte codes $00..$BF.
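
The split between the two token ranges can be modelled in a few lines of Python. This is only a sketch of the decision described above, not the actual kernel: `kbc_table_base` is a made-up hub address, and it assumes the shifted token is what gets added to `kbcptr` to index the table of 16-bit entries.

```python
# Illustrative model of the token test in doNEXT (not the author's code).
PASM_LIMIT = 0x180  # shifted tokens below this are cog addresses of PASM primitives


def dispatch(token, kbc_table_base=0x1000):
    """Classify one byte code the way the cmp / if_c pair does.

    kbc_table_base is a hypothetical hub address for the kernel byte code table.
    """
    if not 0 <= token <= 0xFF:
        raise ValueError("byte codes are 8 bits wide")
    shifted = token << 1                  # shl token,#1: expand to a 9-bit value
    if shifted < PASM_LIMIT:              # cmp token,#$180 wc sets C when below $180
        return ("pasm", shifted)          # jump straight to this cog address
    # kbcptr is the table base less $180, so the shifted token lands on a 16-bit entry
    kbcptr = kbc_table_base - PASM_LIMIT
    return ("kernel", kbcptr + shifted)   # hub address of the word holding the new IP


print(dispatch(0xBF))  # last PASM primitive
print(dispatch(0xC0))  # first kernel word, entry 0 of the table
```

Because each table entry is two bytes wide, the same left shift that rebuilds the cog address also scales the index into the table, which would explain why the pointer is stored "less $180".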

Anyway, I'm developing and testing and the beta will be ready very soon but I thought I would present some details of the workings of this Forth implementation as I am also looking for feedback. Perhaps also someone could suggest an easier way around the Spin/BST compiler limitations especially with DAT sections and references which the compiler insists must be on long boundaries. Anyway, I want the references to be absolute in hub RAM rather than as if it were PASM running in a COG. Also, I am making it far easier to interface to various chips by having low-level code for serial operations and making all the byte code operations fast, especially serial operations. I'm even thinking of making it as easy to use as the Basic Stamp. For instance, there's something in being able to send and receive serial data on any pin at any time (without starting up a cog). So too all those pin high and pin low and clocking operations etc. I want to be able to hook-up an I2C or SPI'ish chip and bit-bash to it at least in the 100kHz range if not more (without resorting to PASM in a COG).

This is my header file and some code snippets for the moment.


TACHYON

A very fast and very small Forth byte code interpreter for the Propeller chip.
2012 Peter Jakacki

Features:
- Low level words are written in PASM and accessed by the
Forth run-time interpreter as single byte codes.
Byte codes are read from hub RAM and executed in PASM
Byte codes $00..$BF are PASM primitives expanded to 9-bits to directly address COG code
Byte codes $C0..$FF are calls to kernel byte code defs via table in hub RAM

- Support for LMM operations
- Interpreted byte code definitions are referenced either as:
- 1 byte - codes $C0..$FF index their definitions via a table - used as part of compiled kernel
- 2 bytes - RCALL opcode + relative byte (always referenced backwards) (extra 4 bits in opcode = -4096 range)
There are 16 entries in the COG for the RCALL byte code + extra address bits
- 3 bytes - WCALL byte code + 16-bit relative address
- All literals and strings are byte aligned
- Fast I/O bit-bashing support
- Flexible SPI PASM code support words in kernel
Construct fast serial drivers with minimal code


- Holds Forth headers in EEPROM or SD storage
Searches the dictionary using rapid index key searching by first character
No hub RAM is used by headers
Even 32K EEPROMs can be used if the area in RAM is normally rewritten (i.e. video memory)
Option to hold additional information per definition such as stack usage and description

- Kernel compiled in standard manner via Spin tools so other Spin objects can be combined

- Three stacks in COG RAM: Data, Return, and Loop
Access loop indices outside of definitions
Avoids manipulation and corruption of return stack
Static stack arrays for direct addressing of stack items
Intrinsically safe stack overflow and underflow

Some early unoptimized observations:
- Empty loops can execute in 500ns to 825ns (absolute worst case)
- Two to one stack operations ( + * AND etc) inc opcode fetch take 900ns to 1.087us (absolute worst case)
' Fetch the next byte code instruction pointed to by the instruction pointer IP in hub RAM
'
doNEXT                  rdbyte  token,IP                'read byte code instruction
                        add     IP,#1                   'advance IP to next byte token
                        shl     token,#1                'expand to 9-bits - all byte codes point to code on double-long boundary
                        cmp     token,#$180 wc          'tokens $C0..$FF are calls to kernel byte code via kbctbl
               if_c     jmp     token                   'directly execute PASM byte codes without further ado
                                                        ' byte codes $C0..$FF point to further byte code definitions
                            ' which are larger fragments of byte code in hub RAM
                        call    #SAVEIP                 'save current IP in prep for a call
                        add     X,kbcptr                'kbcptr points to the kernel byte code table (less $180)
                        rdword  IP,X                    'read 16-bit address from hub kbc table into IP
                        jmp     #doNEXT                 'Execute the code
  
  
' Example of PASM code entries for Byte Code indexing on double-long boundaries
'   
DROP2                   call    #POPX
                        jmp     #DROP
DUP                     mov     X,tos                   ' Read directly from the top of the data stack
                        jmp     #PUSHX                  ' Push X onto the data stack and doNEXT              '
OVER                    mov     X,tos+1                 'read second data item and push
                        jmp     #PUSHX
NIP                     mov     tos+1,tos               'replace second item with top and drop
                        jmp     #DROP

LIT0                    mov     X,#0
                        jmp     #PUSHX
LIT1                    mov     X,#1
                        jmp     #PUSHX

'****************** BOOLEAN ******************
_AND                    movi    _POPEX,#%011000_001     ' AND ( n1 n2 -- n3 )
                        jmp     #POPEX                  'discard top of stack and execute modified PASM
_OR                     movi    _POPEX,#%011010_001
                        jmp     #POPEX
_XOR                    movi    _POPEX,#%011011_001
                        jmp     #POPEX

'***************** MEMORY *******************
CFETCH                  rdbyte  tos,tos        ' read byte pointed to by tos into tos
                        jmp     #doNEXT    
CPLUSST                 rdbyte  X,tos           ' read in byte from address
                        add     tos+1,X         ' add second item to contents of address 
CSTORE                  wrbyte  tos+1,tos       ' write the second item using address on the tos
                        jmp     #DROP2


' Example of interpreted byte codes in hub RAM
' References to other byte code definitions are relative, which is also necessary because of the Spin compiler's limitations with DAT sections
'
0530(001B) 06          | _BOUNDS byte  OVER/2,PLUS/2,SWAP/2,EXIT/2
0531(001B) 0C          | 
0532(001B) 08          | 
0533(001B) 00          | 
0534(001C)             | PRTHEX  ' ( n -- ) print n (0..$0F) as a hex character
0534(001C) 2D          |         byte  CLIT/2,$30,PLUS/2
0535(001C) 30          | 
0536(001C) 0C          | 
0537(001C) 05          |         byte  DUP/2,CLIT/2,$39,GT/2,_IF/2,3
0538(001D) 2D          | 
0539(001D) 39          | 
053A(001D) 20          | 
053B(001D) 3E          | 
053C(001E) 03          | 
053D(001E) 2D          |         byte  CLIT/2,12,PLUS/2                      'Adjust for A..F
053E(001E) 0C          | 
053F(001E) 0C          | 
0540(001F) 49          | PRTCH   byte  EMIT/2,EXIT/2
0541(001F) 00          | 
0542(001F)             | PRTBYTE
0542(001F) 05          |         byte  DUP/2,CLIT/2,4,_SHR/2
0543(001F) 2D          | 
0544(0020) 04          | 
0545(0020) 1A          | 
0546(0020) 3B          |         byte  RCALL/2,20 '-->PRTHEX                  'Due to limitations of Spin tool & BST this needs to be calculated by hand
0547(0020) 14          | 
0548(0021) 3B          |         byte  RCALL/2,22
0549(0021) 16          | 
054A(0021) 00          |         byte  EXIT/2
EDIT: Fixed byte code references which are encoded as 8-bits using cogaddress/2
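
The hand-calculated RCALL operands in PRTBYTE can be checked with a small Python sketch. It assumes, based only on the listing above, that the offset is subtracted from the IP after the operand byte has been read, and that the 4 extra opcode bits mentioned in the feature list form the high part of the distance. Both `rcall_target` and `extra_bits` are illustrative names, not part of the kernel.

```python
def rcall_target(ip_after_operand, offset, extra_bits=0):
    """Return the hub address an RCALL would land on.

    offset is the operand byte; extra_bits models the 4 additional address
    bits carried in the opcode (an assumed encoding, giving a 4096-byte reach).
    """
    if not (0 <= offset <= 0xFF and 0 <= extra_bits <= 0xF):
        raise ValueError("offset is 8 bits, extra_bits is 4 bits")
    distance = (extra_bits << 8) | offset   # 0..4095 bytes
    return ip_after_operand - distance      # calls always point backwards


# The two RCALLs in PRTBYTE have operands at $0547 and $0549,
# so IP is $0548 and $054A when the offset is applied.
print(hex(rcall_target(0x0548, 20)))
print(hex(rcall_target(0x054A, 22)))
```

Both calls resolve to $0534, the address of PRTHEX in the listing, which is consistent with the operands 20 and 22.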

Comments

  • ericball Posts: 774
    edited 2012-07-03 11:25
    Hi Peter,

    Sounds cool. Did you read localroger's Windmill blogs? It's a Forth-like bytecode interpreter designed to run out of SPI EEPROM. I'd recommend checking out http://forums.parallax.com/entry.php?39-Windmill-Byte-Code-Interpreter where he talks about mapping bytecodes to PASM instructions.
  • mindrobots Posts: 6,506
    edited 2012-07-03 11:34
    Peter, Excellent!!

    Tachyon is a perfect name for it since Forth is often considered a hypothetical language that may or may not exist.

    I look forward to taking this out for a cruise!
  • richaj45 Posts: 179
    edited 2012-07-03 17:21
    Hello:

    I am curious as to what the advantage of developing another Forth when PropForth looks very complete?

    That is besides the need to learn by doing, which i understand completely.

    cheers,
    rich
  • kuroneko Posts: 3,623
    edited 2012-07-03 17:33
    doNEXT                  rdbyte  token,IP                'read byte code instruction
                            add     IP,#1                   'advance IP to next byte token
                            shl     token,#1                'expand to 9-bits - all byte codes point to code on double-long boundary
                            test    token,#$180 wz          'tokens $C0..$FF are calls to kernel byte code via kbctbl
                   if_nz    jmp     token                   'directly execute PASM byte codes without further ado
    
    If you want to separate $00-$BF from $C0-$FF, shouldn't the test/jmp pair be a cmp/jmp (less than, or greater-or-equal depending on when you do the jump)? Currently anything above $3F produces !Z. Or are your tokens inverted somewhere? Do I need more coffee?
  • Peter Jakacki Posts: 10,193
    edited 2012-07-03 17:58
    kuroneko wrote: »
    If you want to separate between $00-BF and $C0-$FF shouldn't the test/jmp pair be a cmp/jmp (less than)? Currently anything above $3F produces !Z. Or are your tokens inverted somewhere? Do I need more coffee?

    We all need more coffee and we don't need an excuse. The code was all mish-mashed by some major upheavals when I posted this at about 3:30AM (or at night from my point of view) and I spotted it this morning and changed it to if_z instead, but it was still wrong ;) Oh look! Time for a coffee! The original just tested the msb, hence the "test" instead of "cmp". Anyway I thought I would leave the post without corrections as it serves its purpose and I will find out if anyone is analyzing it (which you did 30 minutes later).

    I had changed the point at which byte tokens are either used as a direct 9-bit address into the cog's PASM code or as an index into a jump table to byte code. Somehow it looked right when I did it. Sometimes I aim for "close enough is good enough" as I know I will come back and if it's still there in that I haven't changed it all again then I will make it right (or just crumple it up and toss it into the trash).
  • prof_braino Posts: 4,313
    edited 2012-07-04 18:47
    WOW. You rock! A new forth on the prop. Next thing you know Cliff Biffle will release his source. Are there now more versions of forth than C for the prop? We may catch up to basic yet!

    I look forward to (attempting) assimilating all your best ideas! Especially those that yield more run time speed without assembler. You got some neat stuff going on. Can you borrow from localroger's work?

    When your design starts to get stable, consider using the test automation so the workstation runs a regression test suite after every development change is "done". Sal says there is no reason the automation would not work with any language, I would like to find evidence one way or the other. I'd like to help set it up.

    @richaj45: The biggest advantage is getting the perspective of a different approach. Sal's way is deemed the best by Sal to do what Sal wants to do, and does not necessarily lend itself to what Peter wants to do or the way Peter wants to do it. "Right tool for the right job". It's almost guaranteed that Peter will find a better way to do what Peter wants to do, since he does not have the same constraints. Often, we will find a new perspective in the way one does something that improves the other. This effect might not be limited to forth kernel development. :) I heard of some guy Darwin that made some notes about exploiting niches, but he has not posted anything in the OBEX. :)
  • Cluso99 Posts: 18,069
    edited 2012-07-04 19:42
    Nice work Peter. I don't have time to analyse what you've done and I have never looked at Forth.

    As Prof says, another version is great, and ideas can be shared, making each version better than before. Good ideas will find their way into other code too.

    Prof & Sal: I am sure the test automation suite will be great once things settle. Hopefully it should even find its way into normal code too.
  • Peter Jakacki Posts: 10,193
    edited 2012-07-04 21:44
    While I really like PropForth and the work that's been put into it the problem is that it runs too slow at the Forth level since very little is actually coded in directly executed PASM. I don't know if my Forth will measure up to the work that Sal and Prof Braino have put into theirs but it definitely will suit my needs. Plus it should be a lot easier and faster to recompile a new kernel with other objects. I'm a simple guy and I like to keep things simple. If I can't knock this TACHYON Forth over in a week or two then it's probably too complicated for me :)

    One of my aims is to also have enough resources left over that I can hook up a monitor, keyboard and SD card and run the whole system stand-alone if I have to. But mainly I find that I interface to a great variety of chips, I need more speed, and I would like to stay within the normal Forth environment when doing that. Forth was after all designed to get at the bare metal in a transparent and interactive manner, having first been employed on radio telescopes in the 70's. At the very most, if necessary, I would patch in a Spin file, recompile the kernel and have it up and running just as quickly. Well, at least, that's my aim. Having efficient byte code means I can pack a large application program in and still have memory left over for video etc.

    I hadn't looked at localroger's Windmill before but that's the basic idea I had before for running larger programs in that I would use those small 4M byte serial Flash chips I have on some boards or else SD but it's still a lot more cumbersome than running from hub RAM. Running interpreted code from serial memory is an old idea, I remember the TSS400 for one. It's interesting to see that he is thinking of a scheme to encode PASM. But what has happened to Windmill since? Has he charged at it a bit too quixotically? :)
  • prof_braino Posts: 4,313
    edited 2012-07-05 11:29
    .. PropForth ... runs too slow at the Forth level ... PASM. ... lot easier and faster to recompile a new kernel with other objects.
    ... hook-up a monitor, keyboard and SD card and run the whole system stand-alone .
    ... patch in a Spin file, recompile the kernel and have it up and running just as quick.
    ... running larger programs ... 4M byte serial Flash chips or SD

    I'm looking forward to your experimenting along these lines, your perspective and results will be interesting. We do have a way to "lock" the forth prompt so only the user level application words are available, and to eliminate the development extensions from the final application to save space, but we haven't worked on trimming down the kernel further, there is a lot of unexplored custom kernel development.

    In the meantime, maybe look at the JupiterACE code from v3.6; this runs on the Prop Demo Board and is a stand alone forth with VGA and keyboard, 80 column text in hires, 40 column text in low res. This might be towards what you are looking for, it will be brought up to 5.0 kernel when the test automation is complete. You might be able to build on this, but Sal's version is still a couple weeks out. The JupiterACE was actually my goal for joining the project. V3.6 was the teaser, v5.3 may be the final result. Running VGA takes most of the resources of a prop chip, which led us to the idea of just adding more props, which led us to MCS and Go-channels. So the propforth development has been getting "bigger", if you find ways to get it "smaller" again, that will be really helpful.

    Sal plans to simplify the process for optimizing in assembler, he will add new words that start and end the assembly process, and the assembler code can be compiled right into the dictionary. This could help in creating tachyon, but it may not be ready until 5.4.

    We looked at linking in arbitrary SPIN files, but that seemed to require the SPIN be written to support a "standard" interface, and we couldn't find anything to use as the standard; (every spin program seems to be too different) so we stopped that investigation. Maybe you can provide some insight or example of a specific spin file you want linked in, we can use that as a starting point for a "standard".

    Sal's model supports adding more hub and cog memory in the form of more units of prop + SD, rather than adding more dedicated memory parts etc. For Sal's purposes, it's easier and cheaper to just grab a couple more props out of the bin. In the case of the JupiterACE, it allows the full resources of a prop to run the VGA, and permits a full prop or more to be available to an application. But we have kept the 32K memory mindset; it will be interesting to see what we gain with large external memory configurations.

    I have a pile of HIVE boards (hive-project.de) that accept 1 meg x 8 bit SRAM. I was thinking of circuit bending these toward a propforth rig, but that is way down the road. In fact, you might want to check out the "m" language Ingo is working on over there. It appears to be a version of forth for the Hive hardware, and works with the hive OS running on the other chips (which might get you the "link in spin programs" function you seek). I don't know the details, but a bunch of it seems similar to your goals. Google translate does a fair job with the German, and Ingo and the Borg drones can do English.
  • Peter Jakacki Posts: 10,193
    edited 2012-07-08 06:58
    Alright, after a few delays away from the project here and there due to the dear wife and her holiday projects I have reached a milestone with TACHYON. I am now able to run precompiled byte code successfully and have run some tests on various byte code routines. Afterwards I wrote an SPI output routine that manipulated the SPI bits individually and this ran at a speed of 82kHz with a clock high time of 1us. I compared this to the equivalent Spin routine which ran at 26.88kHz with a clock high time of 7.8us. Great, that's 3 times faster already for this kind of code. Next step was to code a simple primitive (6 PASM instructions) that clocked a single bit at a time and place that within a DO..LOOP to keep the code flexible and ran the tests. This time it ran at 625kHz with a clock high time of 50ns so that the CLKBIT primitive and the looping amounted to 1.6us per bit and 11 bytes for the SPI routine, that's pretty good I think. So the average execution speed of most byte codes is around 1us which is mainly limited by the RDBYTE access times.
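
The per-bit times follow directly from the measured clock rates. As a quick check in Python, using only the figures quoted above (`period_us` is just an illustrative helper):

```python
def period_us(freq_hz):
    """Time for one clocked bit, in microseconds, at the given bit rate."""
    return 1e6 / freq_hz


print(round(period_us(625_000), 2))   # CLKBIT inside DO..LOOP
print(round(period_us(82_000), 1))    # bit-by-bit byte code routine
print(round(82_000 / 26_880, 2))      # speed-up over the Spin routine
```

That gives 1.6us per bit at 625kHz, about 12.2us per bit at 82kHz, and a ratio of roughly 3.05 against Spin.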

    I have a little tidying up to do (if I am not seconded to the garden project in the meantime) and I will build in the dictionary and high-level words that form the text interpreter (vs the byte code interpreter) so that I can work with this interactively in a terminal. At this stage I may release the source for the alpha, which should be in the next week or so.

    Another change I made was to the serial I/O and include the serial transmit code into the Forth cog and leave the serial receive to a dedicated cog. The transmit code is very small and doesn't have to worry about multi-tasking with any receive code etc. This way the receive timing can be very precise and run at 1,382,400 baud for the maximum speed of my Bluetooth modules. Also since the transmit speed is very high there is no need to buffer or waste time writing to hub RAM as each character completes transmission in 7.23us.
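
The 7.23us figure matches a 10-bit frame at that baud rate. A small Python check, assuming the usual 8N1 framing (1 start, 8 data, 1 stop bit), which the post does not state explicitly:

```python
def char_time_us(baud, bits_per_char=10):
    """Time to shift out one character; 10 bits assumes 8N1 framing."""
    return bits_per_char * 1e6 / baud


print(round(char_time_us(1_382_400), 2))
```

With 10 bits per character this comes to about 7.23us, in line with the number above.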
  • Cluso99 Posts: 18,069
    edited 2012-07-08 18:39
    Nice job Peter. May it rain a little more so you don't need to garden until finished ;)
  • Peter Jakacki Posts: 10,193
    edited 2012-07-09 07:28
    Just a progress report now as I streamline the code. I have decided to revert to using all 256 byte-code values for direct long access to the first 256 longs in the cog, as this also simplifies the precompiled byte-code in that I don't need to specify a "/2" after each word. Not all byte-code values are used, though, as I try to cram the code for each into 2 longs if possible rather than just wasting code by jumping to the actual code. Some code primitives are 6 longs long but are still in the first 256 longs of the cog because it's a balance between wasting byte-code values and wasting cog memory (extra jmps). Having some byte-codes that use external tables for vectoring to more byte-code is too awkward to maintain and at this point seems redundant, so it's been removed. The byte-code interpreter is reduced to:
    doNEXT
                            rdbyte  token,IP                'read byte code instruction
                            add     IP,#1                   'advance IP to next byte token
                            jmp     token
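
The reduced loop can be pictured as a plain fetch, increment, jump cycle. The Python below is a rough model only: the opcode values are the ones that appear in the BST listing later in this post, while the handler bodies and the choice to let EXIT end the run (no return stack is modelled) are simplifying assumptions.

```python
# Sketch of the reduced doNEXT loop: every byte code selects its handler directly.
PUSH1, DUP, EXIT = 0x77, 0x09, 0x03   # values as shown in the listing


def op_push1(hub, ip, stack):
    stack.append(hub[ip])      # one inline literal byte follows the opcode
    return ip + 1


def op_dup(hub, ip, stack):
    stack.append(stack[-1])    # copy the top of the data stack
    return ip


def op_exit(hub, ip, stack):
    return None                # assumption: EXIT simply ends this toy run


HANDLERS = {PUSH1: op_push1, DUP: op_dup, EXIT: op_exit}


def run(hub, ip=0):
    """Execute byte codes from hub memory until EXIT and return the data stack."""
    stack = []
    while ip is not None:
        token = hub[ip]                        # rdbyte token,IP
        ip += 1                                # add    IP,#1
        ip = HANDLERS[token](hub, ip, stack)   # jmp    token
    return stack


print(run(bytes([PUSH1, 32, DUP, EXIT])))
```

Running the four-byte program leaves two copies of 32 on the stack.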
    

    The SPI primitives now include a flexible transmit routine which clocks data out at 2.85MHz (without doing anything fancy that is) so this handles a lot of SPI and I2C style protocols very efficiently. To slow it down just requires accessing the CLKBIT primitive at your leisure.

    I have also found that it is far more efficient to store my inline literals and constants in big endian form to facilitate shifting and accumulating. So there is only one routine that reads in bytes to form these numbers and depending upon the entry point is what decides how many bytes are read. Constants are also coded as a standard definition with an exit (return from call) as there is not much advantage in having a special operation just for this. So the structure of say a 24-bit constant is [PUSH3] [$A5] [$00] [$C1] [EXIT] where PUSH3 is the byte-code for reading in 3 bytes and pushing the result onto the data stack after which the definition EXITs. Too easy.

    This is the simple PASM code that effects pushing 1 to 4 inline bytes onto the datastack.
    PUSH4                   call    #ACCBYTE                ' read the next byte @IP++ and shift accumulate
    PUSH3                   call    #ACCBYTE
    PUSH2                   call    #ACCBYTE
    PUSH1                   call    #ACCBYTE
                            call    #_PUSHACC               ' Push the accumulator onto the stack then zero it
                            jmp     #doNEXT
    
    ' This code here is located in the second half of the cog RAM to leave the first half free for byte-code access.
    
    ACCBYTE                 call    #GETBYTE
                            shl     ACC,#8
                            or      ACC,X
    ACCBYTE_ret             ret
    
    GETBYTE                 rdbyte  X,IP
                            add     IP,#1
    GETBYTE_ret             ret
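
A Python model of the shift-and-accumulate scheme may make the entry-point trick easier to see. The opcode values $74, $76 and $77 come from the listing in this post; $75 for PUSH3 is inferred from the four consecutive `call #ACCBYTE` longs, and `push_literal` is an illustrative name rather than anything in the kernel.

```python
PUSH4, PUSH1 = 0x74, 0x77   # PUSH3 = $75 and PUSH2 = $76 sit in between


def push_literal(hub, ip):
    """Decode a PUSHn opcode at hub[ip]; return (value, ip after the literal).

    Entering at PUSH4 falls through four ACCBYTE calls, PUSH1 through one,
    so the byte count is simply the distance from the end of the block.
    """
    opcode = hub[ip]
    if not PUSH4 <= opcode <= PUSH1:
        raise ValueError("not a PUSHn opcode")
    count = PUSH1 + 1 - opcode
    ip += 1
    acc = 0
    for _ in range(count):
        acc = ((acc << 8) | hub[ip]) & 0xFFFFFFFF   # shl ACC,#8 / or ACC,X
        ip += 1                                     # GETBYTE advances IP
    return acc, ip


mycon = bytes([0x74, 0xA5, 0x00, 0xFC, 0x01, 0x03])  # PUSH4,$A5,$00,$FC,$01,EXIT
value, next_ip = push_literal(mycon, 0)
print(hex(value), next_ip)
```

For the MYCON bytes this yields $A500FC01 with IP left pointing at the EXIT, and storing the bytes big endian is what lets one loop serve all four sizes.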
    

    And this is how a separate constant is coded (similar to inline literals without the EXIT):
    06DD(0023) 74          | MYCON   byte  PUSH4,$A5,$00,$FC,$01,EXIT
    06DE(0023) A5          | 
    06DF(0023) 00          | 
    06E0(0024) FC          | 
    06E1(0024) 01          | 
    06E2(0024) 03          |
    
    Because all values are non-aligned there is no wasted space aligning them to word or long boundaries. Also relative addressing gets around any offsets and allows for relocating code easily.

    So my test routine which prints a start-up message, sends out 32-bits via SPI, and does a hex dump of hub RAM looks like this in byte-code listing form (courtesy BST):
    06F8(002A) 77          | TXSPI   byte  PUSH1,32,CLKBITS,EXIT                  ' Send the 32-bits
    06F9(002A) 20          | 
    06FA(002A) 8D          | 
    06FB(002A) 03          | 
    06FC(002B) 74          | MYCON   byte  PUSH4,$A5,$00,$FC,$01,EXIT
    06FD(002B) A5          | 
    06FE(002B) 00          | 
    06FF(002B) FC          | 
    0700(002C) 01          | 
    0701(002C) 03          | 
    0702(002C)             | DEMO
    0702(002C) B9          |         byte  PRTSTR,$0D,$0A,"TACHYON Forth V1.0 ",0
    0703(002C) 0D          | 
    0704(002D) 0A          | 
    0705(002D) 54          | 
    0706(002D) 41          | 
    0707(002D) 43          | 
    0708(002E) 48          | 
    0709(002E) 59          | 
    070A(002E) 4F          | 
    070B(002E) 4E          | 
    070C(002F) 20          | 
    070D(002F) 46          | 
    070E(002F) 6F          | 
    070F(002F) 72          | 
    0710(0030) 74          | 
    0711(0030) 68          | 
    0712(0030) 20          | 
    0713(0030) 56          | 
    0714(0031) 31          | 
    0715(0031) 2E          | 
    0716(0031) 30          | 
    0717(0031) 20          | 
    0718(0032) 00          | 
    0719(0032) 77          |         byte  PUSH1,15,MASK,_8,MASK
    071A(0032) 0F          | 
    071B(0032) 4F          | 
    071C(0033) 35          | 
    071D(0033) 4F          | 
    071E(0033) 96          |         byte  RCALL,@DO7-@MYCON
    071F(0033) 24          | 
    0720(0034) 09          | DO7     byte  DUP,RCALL,@DO8-@PRTLONG
    0721(0034) 96          | 
    0722(0034) 53          | 
    0723(0034) 77          | DO8     byte  PUSH1,32,_REV
    0724(0035) 20          | 
    0725(0035) 4D          | 
    0726(0035) 96          |         byte  RCALL,@DO5-@TXSPI
    0727(0035) 30          | 
    0728(0036) 2B          | DO5     byte  _0,PUSH2,$02,00,RCALL,@DO9-@DUMP
    0729(0036) 76          | 
    072A(0036) 02          | 
    072B(0036) 00          | 
    072C(0037) 96          | 
    072D(0037) 55          | 
    072E(0037) A0          | DO9     byte  _AGAIN,@DO6-@DEMO
    072F(0037) 2E          | 
    0730(0038)             | DO6
    

    Please note too that Forth is only coded this way in the Spin compiler to form the kernel after which the Forth itself would handle normal text input for compiling which would look like this:
    : TXSPI       32 CLKBITS ;
    : MYCON       $A500FC01 ;       \ Could also be coded as: $A500FC01 CONSTANT MYCON
    : DEMO
            begin
              cr ." TACHYON Forth V1.0 "            \
              15 MASK 8 MASK MYCON                  \ Assemble clockmask outmask and data for SPI
              DUP PRTLONG                           \ Let's see what is going to be sent
              32 REV                                \ Make it MSB first
              TXSPI                                 \ Off it goes
              0 $200 DUMP                            \ Hex dump of first 512 bytes of RAM
            again ;                                 \ Continue forever
    

    I'll try not to bore you any further with details; suffice to say that there are a lot of very neat things going on and planned. With the dictionary (names of functions, pointers, etc.) in external memory such as EEPROM or SD, a lot of program code can be squeezed into just a few KB of RAM. Count the bytes that are used in the demo! Some code will be available very soon now. (Hey Cluso, I hear it's going to rain all week :) )
  • mindrobotsmindrobots Posts: 6,506
    edited 2012-07-09 07:58
    Excellent work Peter! This is exciting!

    I'm doing a rain dance on this side of the world because we need the rain.....I'll add some "remote location" rain dancing for purely selfish reasons!
  • Dave HeinDave Hein Posts: 6,347
    edited 2012-07-09 08:02
    Peter,

    Do you think your interpreter could be applied to Spin to improve the speed? One approach would be to compile Spin to your bytecodes. Another approach may be to use your instruction-decoding technique to decode the existing Spin bytecodes.

    Also, maybe it would be possible to compile C to your bytecodes. Can you describe your VM in more detail? It appears to be stack-based, but also includes an accumulator. Are there any other registers in the VM?

    Dave
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2012-07-09 09:04
    Dave Hein wrote: »
    Peter,

    Do you think your interpreter could be applied to Spin to improve the speed? One approach would be to compile Spin to your bytecodes. Another approach may be to use your instruction-decoding technique to decode the existing Spin bytecodes.

    Also, maybe it would be possible to compile C to your bytecodes. Can you describe your VM in more detail? It appears to be stack-based, but also includes an accumulator. Are there any other registers in the VM?

    Dave

    I just had a look at the Spin bytecodes and it ain't fun, there's no way you could code all that and still cram it into just the cog. Tachyon Forth bytecode operations are fairly simple, just like PASM, but they are very flexible and implement a simple virtual stack based processor. The reference to an accumulator is really nominal as this is just a location to shift and accumulate literal values. The so named "accumulator" is cleared for next use after every push onto the datastack. There are other temporary registers also named for convenience such as R,X,Y,Z,R0..R3 as well as the IP which is equivalent to the PC in a real processor. The "X" register is used a lot for passing a value without upsetting the tos (top of stack) value as you can see in the GETBYTE and ACCBYTE routines. Creating virtual registers is not a problem though.

    Stack manipulation can be very easy and transparent on some processors but the Prop isn't one of them, there are no auto increment/decrement indexed instructions. I push and pop my stacks by physically moving values which sounds kind of brute force'ish but I worked out that this is still far more efficient as the PASM routines can access all stack items (not just tos) directly without any extra overhead so rotating and swapping etc is very fast and compact. The push and pop operations only take a tiny bit longer than a conventional stack implemented with a pointer anyway. Upon detecting non-zero values "falling" off the bottom of the stack I jump to an error processing routine but who cares if zero values "fall off" as I also pump zero values back into the bottom of the stack when it's popped.
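
    A rough Python model of the "move the values" stack described above may make the trade-off easier to see. This is only a sketch: the class name, the depth of 4 and the choice of exception are illustrative assumptions, and the real thing is PASM working on cog registers.

```python
class ShiftStack:
    """Fixed-depth data stack that shifts its cells instead of moving a pointer."""

    def __init__(self, depth=4):            # depth is arbitrary for this sketch
        self.cells = [0] * depth            # cells[0] is tos, cells[-1] the bottom

    def push(self, value):
        # Whatever sits in the bottom cell is about to fall off.
        if self.cells[-1] != 0:
            raise OverflowError("non-zero value fell off the bottom of the stack")
        self.cells = [value] + self.cells[:-1]

    def pop(self):
        tos = self.cells[0]
        # Shift everything up and pump a zero back into the bottom cell.
        self.cells = self.cells[1:] + [0]
        return tos

    def swap(self):                         # ( a b -- b a )
        c = self.cells
        c[0], c[1] = c[1], c[0]

    def rot(self):                          # ( a b c -- b c a )
        c = self.cells
        c[0], c[1], c[2] = c[2], c[0], c[1]


stack = ShiftStack(depth=3)
for n in (1, 2, 3):
    stack.push(n)
stack.swap()
print(stack.cells)
```

    Because every item lives at a fixed position, swap and rot are plain register moves with no address arithmetic, which is the point made above. The model also shows the limitation: a zero that was pushed deliberately falls off the bottom without being noticed.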

    As for compiling from C to these bytecodes there shouldn't be any problem at all as it would only be the PASM code that fits in the VM cog that would be required to run them. It's a bit like compiling from C to Java bytecodes and using the JVM but a lot simpler of course and without all the overhead that would normally be involved. At present the Tachyon bytecodes are not fixed in value as the VM is in a state of flux but even so I don't think that there would be any requirement for portable bytecode normally. The symbol address of the bytecode function is the same as the bytecode which is why I can just reference them directly with the "byte" directive in a DAT section.

    Hope that sort of answers your questions and when I release some source soon you will be able to have a good look yourself to see if Tachyon bytecode is suitable for your task.
  • Dave HeinDave Hein Posts: 6,347
    edited 2012-07-09 09:23
    Can you publish a complete list of the Tachyon bytecodes? It seems that Spin could be compiled to Tachyon bytecodes with about the same code size, but faster speed. Most of the Spin bytecodes are used to support various addressing modes, but this takes up a lot of space in the Spin interpreter. It may be more efficient to use a simpler VM, and do some operations with multiple bytecodes. This way the interpreter could be streamlined to a smaller set of instructions, and could run faster. It's sort of a RISC approach. Maybe a common streamlined bytecode VM could be developed that could be used by Spin, C, Forth and any other high level languages.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2012-07-09 09:50
    Peter,

    I like your stack philosophy. There's no reason to have a huge stack, except for deep recursion. One annoyance of Forth, of course, is stack maintenance. It would be handy to add two more kernel instructions, pushmark and poptomark. These allow one to punctuate the stack in such a way that post-operative cleanup requires only a poptomark without having to know how much garbage remains. The "mark" doesn't really have to exist in the stack itself, but can be tracked either via a rotating bitmask that's synchronous with the stack, or a separate mark stack whose top element keeps track of the relative position of the next mark (i.e. increments on a push, decrements on a pop, and gets popped off when it goes negative).
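
    The separate-mark-stack variant can be sketched in Python roughly as follows. This is one interpretation of the description above, not a kernel design: the class and method names are hypothetical, and the rule that a pop below a mark also counts against the next mark down is an assumption made so that nested marks stay consistent.

```python
class MarkedStack:
    """Data stack plus a mark stack of 'items pushed since this mark' counters."""

    def __init__(self):
        self.data = []
        self.marks = []

    def push(self, value):
        self.data.append(value)
        if self.marks:
            self.marks[-1] += 1             # one more item above the newest mark

    def pop(self):
        value = self.data.pop()
        # Popping below a mark discards that mark ("goes negative"); the pop
        # then counts against the next mark down, if there is one.
        while self.marks and self.marks[-1] == 0:
            self.marks.pop()
        if self.marks:
            self.marks[-1] -= 1
        return value

    def pushmark(self):
        self.marks.append(0)

    def poptomark(self):
        if not self.marks:
            raise IndexError("poptomark without a matching pushmark")
        count = self.marks.pop()
        del self.data[len(self.data) - count:]   # drop everything above the mark


s = MarkedStack()
s.push(10)
s.pushmark()
s.push(20)
s.push(30)
s.poptomark()
print(s.data)
```

    The caller of poptomark never needs to know how many items were left above the mark, which is the cleanup property being asked for.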

    -Phil
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2012-07-10 21:30
    This bytecode is proving to be very compact as I have tediously coded in the high-level words that are needed to handle the text input, parsing, interpreting/compiling and number conversion etc. How compact? The bytecode for all this with the bells and whistles is under 1k.

    Some changes:
    1) Added a jump table for kernel bytecodes as I originally planned. This simplifies calling them from the Spin compiler and allows these functions to be anywhere.
    The table allows for up to 256 vectors, so that a call looks like:..... BCALL,xPARSE ..... where PARSE is referred to by the byte label xPARSE. Now I don't have to do a relative call with all that awkward setup for the Spin tool as in:...... RCALL,@L1-@PARSE ..... where I also have to create a label such as L1 that follows every time. The DAT section has its advantages as I just set the ORG to 0 before the table so that each entry has consecutive values from 0 up to the maximum reference of 255.

    2) Added a local register bank which is great for storing temporary values and settings etc. This makes the kernel bytecode a lot easier too as it doesn't have to create special variables for number bases and interpreter flags etc. The registers used by the kernel are referred to in the CON section where they are created with a simple:
    ' REGISTER byte OFFSETS - 64 bytes
            #32,NTIB,IN,NUMFLG,FLAGS,BASE
    
    While the reference in the Spin tool to the register is in the form:
    COMMENT byte  REG,NTIB,CFETCH,REG,IN,CSTORE,EXIT
    
    and compiled:
    091E(002A) DF          | COMMENT byte  REG,NTIB,CFETCH,REG,IN,CSTORE,EXIT   ' ignore the rest of the text line
    091F(002A) 20          | 
    0920(002B) 62          | 
    0921(002B) DF          | 
    0922(002B) 21          | 
    0923(002B) 6A          | 
    0924(002C) 05          |
    

    3) Enhanced number processing so that sensible prefixes and suffixes can be used:
    Prefixes:
        -    negative number
        $    hex number ($0F00)
        %    binary number
        '    ASCII character ('H') used in the form..... '*' EMIT
        ^    ASCII control character (^G) used in the form..... KEY? ^C = IF .....

    Suffixes:
        h    hex number (0F00h)
        d    decimal number (1000d)
        b    binary number (11000101b)

    Others:
        .    decimal point - decimal place stored in DPL
        _    standard Spin digit separator (ignored)
    


    4) 5) 6) lots more stuff
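
    The prefix and suffix rules from point 3 can be tried out with a small Python model. It is a sketch under stated assumptions: the function name is hypothetical, the default base is taken to be decimal, a prefix takes priority over a suffix, and DPL is modelled as the number of digits after the decimal point (None when there is no point), none of which is spelled out in the post.

```python
DIGITS = "0123456789ABCDEF"


def parse_number(token, base=10):
    """Return (value, dpl) for a numeric token, or raise ValueError."""
    # Character forms: 'H' gives the ASCII code, ^G gives the control code.
    if len(token) == 3 and token[0] == "'" and token[2] == "'":
        return ord(token[1]), None
    if len(token) == 2 and token[0] == "^":
        return ord(token[1].upper()) & 0x1F, None

    negative = token.startswith("-")
    if negative:
        token = token[1:]

    if token.startswith("$"):
        base, token = 16, token[1:]
    elif token.startswith("%"):
        base, token = 2, token[1:]
    elif token[-1:] in ("h", "d", "b"):     # lower-case suffix selects the base
        base = {"h": 16, "d": 10, "b": 2}[token[-1]]
        token = token[:-1]

    token = token.replace("_", "")          # Spin-style digit separator is ignored

    dpl = None
    if "." in token:
        whole, _, fraction = token.partition(".")
        dpl = len(fraction)                 # assumed meaning of DPL
        token = whole + fraction

    if not token or any(ch not in DIGITS[:base] for ch in token.upper()):
        raise ValueError("not a number")
    value = int(token, base)
    return (-value if negative else value), dpl


for text in ("$0F00", "0F00h", "11000101b", "^C", "'*'", "1_000_000", "12.34"):
    print(text, parse_number(text))
```

    One ambiguity the model makes visible: when the working base is hex, a token such as 1b could be read either as the hex number 1B or as binary 1 with a suffix. The sketch always treats a trailing lower-case b or d as a suffix, which is just one possible choice.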

    I will probably allow direct addressing of stack items so that .... 3 STACK ... will return the address of the third stack item which can be manipulated just like any other variable.

    @Phil:
    I have tinkered before with shadow stacks that hold information about the stack items, a bit like a type identifier. But although I understand what you are saying and ways to implement it I'm not sure how useful this would be, at least to a Forth programmer. You see, although the stack can be a nuisance especially when it is misused and layers deep, part of coding any Forth is the art of factoring things into small, clear, and manageable chunks of code. My Tachyon Forth is being driven by need, by actual embedded control use so I tweak it to optimize what I really need. But could you please give me an example of where your suggestions would really excel? Thanks.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2012-07-10 22:01
    Peter,

    My suggestion arises from comparative experience programming in both Forth and Postscript. The latter has mark and cleartomark words that make stack maintenance a breeze compared to Forth. One simple example of their utility can be seen when a subroutine needs to abort, returning the stack to a known state upon exit. Depending upon how much stuff got pushed on before the fault condition was encountered, this could be a real chore without a way simply to wipe out the garbage down to a known point in the stack.

    -Phil
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2012-07-10 22:14
    Phil,
    Although Postscript may be based on Forth it is also very non-Forth'ish because of its huge complexity. I understand the Postscript requirements but how useful would that really be in the simpler embedded Forth model with lots of "small" words? But I will see if this can be implemented without too much fuss, especially since I am looking at being able to create local variables straight from the stack description which is normally only a comment such as: ( pin channel -- flg ) ..... and referring directly to pin and channel etc. However EXIT or its cousin will then have to clean up the stack and place whatever results there are onto the stack.
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2012-07-12 22:04
    I'm not really ready yet to release the source for this but in a way I am doing so. I've been experimenting with formatting the source code in Google Docs (<-- click here) and just copying and pasting the formatted text into BST to make sure that it runs. Well, guess what? It does! So this is more an exercise in posting code in fancy format as a shared document just to see how it will work. This is also a good opportunity for those who want to get their hands on the source for Tachyon just to play with until I manage to get the text interpreter all sorted out. BTW, I copied the higher-level source from an old Forth I wrote for the ARM chip several years ago and although it worked there it is ugly and I am in the process of re-doing it all. So just a little while longer but in the meantime have a good look.

    The Google Docs format makes it easy to view and even download in various formats. Of course if you want to try out the code then just select all and copy&paste into your Spin IDE. The settings are so that anyone with the link can view and comment and if anyone wants to be able to edit this document as part of a collaborative effort then please just email me. There's also the chat box you can use when you have the document opened.
  • msrobotsmsrobots Posts: 3,709
    edited 2012-07-12 23:10
    @Peter.

    Now I need to start thinking backwards. This is nice. Even I might understand FORTH now - will study your PASM ...

    Thank you

    Mike
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2012-07-12 23:15
    Yes, the fancy formatting makes it a lot easier to read, even for me. It's certainly a lot clearer to see where the flow goes and I can insert images etc without upsetting the text itself. So you can just copy and paste or download as text to compile.

    Another great feature is that this "source" is live, you may even see things change before your eyes as I am formatting it or making code changes !!! Maybe this way should be the way we format source for the gold standard?

    Fancy formatted source code in Google Docs
  • HumanoidoHumanoido Posts: 5,770
    edited 2012-07-14 04:07
    Peter, congrats on this spectacular project! You mentioned CogForth and how you reworked it to create Tachyon Forth. I'm interested in a copy of the original CogForth, in any condition, for its great historical value.
  • prof_brainoprof_braino Posts: 4,313
    edited 2012-07-14 13:54
    That looks pretty cool. How do you do the formatting? Is there a link to how-to instructions?
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2012-07-14 17:25
    Hi prof, this is a Google cloud document so all the normal editing tools are there online when you are signed in and you have permission to edit. For some reason some may not be able to access this document although I had no problems on many different systems as guest and only one person has complained so far that their browser didn't render it properly..grumble grumble. I've tried accessing this on several completely clean systems with different browsers etc and it all seems to work fine. What I like about this type of formatted document is that it is a very good way to document as you go and it makes important parts of the source "pop out" while subduing others. Images are a bonus too. I've probably gone a little wild with some of the colors on this document but I am experimenting at present to see what does work. It's actually faster to edit in this document normally and when it comes time to compile it takes less than 3 seconds to "select all", copy & paste into BST and hit F10.

    The other advantage is that this is a live document, what you see is what I see and what I am changing.
  • prof_brainoprof_braino Posts: 4,313
    edited 2012-07-14 17:36
    I think I want to try this for propforth. So how do you get the context highlighting for forth? In the PSPad text editor, we can set up the set of forth control words, and it automagically does the context highlights. Do we have to set up something to define the set of forth words, or do you have to do the color for every word as you type it in?
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2012-07-14 20:34
    Even with automagic converters you still have to have it formatted correctly plus there are some things that you don't want it to do automagically anyway. Well that's my argument against it :) No but the Google document is manually formatted just like any word processor and if you are formatting as you are entering text (vs pasting an existing text file) then you make it stand out just as you would like etc. My suggestion is to give it a try, it can't hurt and you can play with it to see how well it works plus the source code is always intact.
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2012-07-18 08:59
    Well I'm starting to get somewhere with this, beyond the VM part of it that is. I am starting off simple again after scrubbing out some ugly code I imported and tried to use from another Forth project. Now I have a simple text interpreter running that executes immediately on each word for the moment. The idea is that even in interpretive mode text will be compiled into a temporary location word by word (as it is entered) and then that compiled code executed on a CR. The dictionary is running from RAM at present but I will be copying this over the 2K image of the Tachyon's VM kernel that lies unused after boot rather than worrying about accessing it in EEPROM. Once the dictionary is copied over the VM image there is still about 29K of RAM left at present. The structure of the dictionary is settling down now as I have this running nicely with non-counted string names terminated by an attribute byte and 2 bytecodes, so this structure is very compact and easy to scan plus there are no link fields.
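
    To illustrate why a layout like this is easy to scan without link fields, here is a Python sketch of a lookup over such a dictionary. The details are assumptions, not the actual Tachyon format: the attribute byte is taken to have bit 7 set so that it cannot be mistaken for a name character, a zero byte is taken to end the dictionary, and the attribute and bytecode values used are made up for the example.

```python
def build_entry(name, attr, bytecodes):
    """Name characters, then one attribute byte, then exactly two bytecodes."""
    assert attr & 0x80 and len(bytecodes) == 2
    return name.encode("ascii") + bytes([attr]) + bytes(bytecodes)


def find(dictionary, name):
    """Return (attr, bytecodes) for `name`, or None if it is not present."""
    wanted = name.encode("ascii")
    pos = 0
    while pos < len(dictionary) and dictionary[pos] != 0:
        end = pos
        # The name runs up to the first byte with bit 7 set (the attribute).
        while end < len(dictionary) and dictionary[end] < 0x80:
            end += 1
        if end >= len(dictionary):
            return None                      # malformed entry: no attribute byte
        if dictionary[pos:end] == wanted:
            return dictionary[end], bytes(dictionary[end + 1:end + 3])
        pos = end + 3                        # skip attribute and the 2 bytecodes
    return None


# Hypothetical entries; the attribute and bytecode values are illustrative only.
image = (build_entry("DUP", 0x80, (0x09, 0x05)) +
         build_entry("DUMP", 0x80, (0x96, 0x55)) +
         b"\x00")

print(find(image, "DUMP"))
print(find(image, "SWAP"))
```

    Each entry costs only its name length plus three bytes, and the scanner reaches the next entry by stepping past the attribute byte and the two bytecodes, so no length or link field is needed.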

    I have also tested the serial transmit and receive to at least 2M baud at present and I will do some further tests at higher speeds later on. I'm also overlaying the serial receive cog's image as well with the receive buffer so there is no wasted memory.

    So far...so good.
  • richaj45richaj45 Posts: 179
    edited 2012-07-18 09:36
    Hello:

    So how is your coding doing 2M baud?
    Why is it so much faster than the standard Full Duplex 4 port object?
    Is the serial driver full duplex?
    When do you think the whole of the Tachyon code will be posted?

    Thanks in advance for your patience with all these questions.

    cheers,
    rich