Simple Assembler for P1 Tachyon Forth 5.7 - Or: Can you beat Tachyon at fibonacci?
Perhaps this is useful for somebody.
This is a simple assembler for Tachyon Forth 5.7. It is based on the "work in progress interactive assembler" which Peter had begun. It is not very pretty, but it is at least able to compute the 46th Fibonacci number.
Update: With version B, there is a way to use "real" COG PASM too, see below.
What is LMM?
The Propeller can only run assembler code from cog memory, and this memory is already mostly used. This assembler therefore uses a feature that is built in as an alternative to the standard inner interpreter "doNEXT" of Forth. LMM stands for Large Memory Model and enables the P1 to execute assembler code from hub memory.
A tiny loop fetches an instruction from hub, patches it into the loop at "instr", increments the instruction pointer, executes the instruction (in place of the NOP) and repeats. So IP is used instead of the PC. This is important to know, because jump instructions have to be done differently.
79D8(007C) 23 FC BC 08 | LMM     rdlong instr,IP
79DC(007D) 04 46 FC 80 |         add IP,#4
79E0(007E) 00 00 00 00 | instr   nop
79E4(007F) 7C 00 7C 5C |         jmp #LMM
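To make the fetch, patch and execute cycle more concrete, here is a toy Python model of the loop. It is only a sketch: the instruction tuples, the register dictionary and the name `run_lmm` are invented for illustration and correspond neither to real PASM encodings nor to Tachyon internals.

```python
# Toy model of the four-instruction LMM loop (not real PASM encodings).
# Hub memory is modelled as a dict from byte address to an instruction tuple.

def run_lmm(hub, start, regs, max_steps=10_000):
    """Fetch one instruction per pass from hub[IP], advance IP, execute it."""
    regs["IP"] = start
    for _ in range(max_steps):
        instr = hub[regs["IP"]]        # rdlong instr,IP
        regs["IP"] += 4                # add IP,#4
        op, dest, src = instr          # the slot "instr" now holds the fetched code
        if op == "exit":               # stands in for the real JMP out of the loop
            return regs
        if op == "mov":
            regs[dest] = src
        elif op == "add":
            regs[dest] += src
        elif op == "sub_wz":           # subtract and write the zero flag
            regs[dest] -= src
            regs["Z"] = regs[dest] == 0
        elif op == "sub_if_nz":        # executed only if the zero flag is clear
            if not regs["Z"]:
                regs[dest] -= src
        # jmp #LMM: go round the loop again
    raise RuntimeError("step limit reached")

# Count R1 down from 3. The backward jump is a subtraction from IP.
program = {
    0x100: ("mov", "R1", 3),
    0x104: ("mov", "R0", 0),
    0x108: ("add", "R0", 1),           # R0 counts the loop passes
    0x10C: ("sub_wz", "R1", 1),
    0x110: ("sub_if_nz", "IP", 12),    # IP is already 0x114 here: 0x114 - 12 = 0x108
    0x114: ("exit", None, None),
}
regs = run_lmm(program, 0x100, {"R0": 0, "R1": 0, "Z": False})
print(regs["R0"], hex(regs["IP"]))
```

The point of the model is the comment on the `sub_if_nz` line: IP has already been advanced when an instruction executes, so a jump is just arithmetic on IP.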
So to start the assembler routine, we have to start this LMM-loop. This is done with startLMM.
To end it, we can do a real JMP to the EXIT code, which ends this word and switches back to word code execution. This is done with endLMM.
P1 has restricted memory, so it would be nice if you could forget the ASSEMBLER when it has done its job. It is not needed during execution.
This can be achieved with the following method:
1. Use a normal colon definition to create the new executable word. Fill it with dummy contents; I use literals. Tachyon can compile 15-bit literals into word codes. Each literal reserves 2 bytes.
2. Load the Assembler
3. Get the Code field address of the new word. Patch the assembler code into the new word. Each instruction needs 4 bytes. Starting and ending each need 4 bytes too including alignment.
4. Forget the Assembler
5. Use the new word
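The space to reserve in step 1 can be estimated from the sizes given in the steps. The following Python sketch only does that arithmetic; the function name is made up, and the 4 bytes each for starting and ending are taken from step 3.

```python
# Rough size check for the dummy word, using the sizes stated in the steps:
# 2 bytes per literal, 4 bytes per instruction, 4 bytes each for start and end.
LITERAL_BYTES = 2
INSTRUCTION_BYTES = 4
START_BYTES = 4   # start code, including alignment
END_BYTES = 4     # end code

def literals_needed(n_instructions):
    """Number of dummy literals that reserve enough room for the patched code."""
    total = START_BYTES + n_instructions * INSTRUCTION_BYTES + END_BYTES
    return -(-total // LITERAL_BYTES)   # ceiling division

print(literals_needed(9))   # a routine with 9 instructions
```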
Syntax and example
This is a Forth Assembler, so it uses Forth syntax.
R1 1 wz imm sub
R1 - destination register
1 - source, in this case an immediate number
wz - write the zero flag
imm - source is immediate
sub - this word puts together the instruction with the other information of the line and patches the resulting 32 bit code into the word.
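For readers who want to see what "puts together the instruction" means at the bit level, here is a Python sketch of the P1 instruction layout (6-bit opcode, the Z, C, R and I bits, a 4-bit condition, 9-bit destination and 9-bit source). The function name and the dictionaries are illustrative only and are not the assembler's actual words. The cog addresses used in the check ($7E for instr, $23 for IP, $7C for LMM) are read off the LMM loop listing.

```python
# Sketch of the Propeller 1 instruction layout:
# opcode(6) Z C R I cond(4) dest(9) src(9)
OPCODES = {"rdlong": 0b000010, "jmp": 0b010111, "add": 0b100000,
           "sub": 0b100001, "mov": 0b101000}
CONDITIONS = {"always": 0b1111, "if_nz": 0b0101}

def encode(op, dest, src, imm=False, wz=False, wc=False, nr=False, cond="always"):
    """Pack the fields into one 32-bit instruction word."""
    word = OPCODES[op] << 26
    word |= int(wz) << 25
    word |= int(wc) << 24
    word |= int(not nr) << 23          # R bit: write the result unless nr is given
    word |= int(imm) << 22
    word |= CONDITIONS[cond] << 18
    word |= (dest & 0x1FF) << 9
    word |= src & 0x1FF
    return word

# Compare with the LMM loop listing (the listing shows the bytes little-endian).
print(f"{encode('rdlong', 0x7E, 0x23):08X}")
print(f"{encode('add', 0x23, 4, imm=True):08X}")
print(f"{encode('jmp', 0, 0x7C, imm=True, nr=True):08X}")
```

The argument order (destination first, then source, then the flags) mirrors the Forth syntax, where the instruction word comes last and collects everything before it.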
IP 20 imm if_nz sub \ Jump relative -5 * 4
As said before, the IP register serves as program counter. To jump back 5 instructions, we subtract 5 * 4 = 20 from IP. There are no labels, so the distance has to be calculated manually, though you could probably use the stack and patchadr for this somehow.
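The distance can be worked out mechanically. The Python sketch below uses a hypothetical helper name and relies on what the LMM loop shows: when an instruction executes, IP already points at the following instruction. The upper limit reflects the 9-bit immediate field of a P1 instruction.

```python
def lmm_back_offset(jump_index, target_index):
    """Bytes to subtract from IP so that execution continues at target_index.

    Indices count instructions from the start of the routine (0-based).
    When the jump executes, IP already points at jump_index + 1.
    """
    distance = (jump_index + 1) - target_index
    if distance <= 0:
        raise ValueError("target must not lie after the jump")
    offset = distance * 4
    if offset > 511:
        raise ValueError("offset does not fit into a 9-bit immediate")
    return offset

# A jump at index 8 back to a loop start at index 4 is 5 instructions back.
print(lmm_back_offset(8, 4))
```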
You can at least use the registers R0, R1, R2 and you can access tos, top of stack, and its next stack contents par2, par3, par4.
You can have a look at the contents of the word's code field before and after patching using HELP.
\ Fibonacci Benchmark
\ https://rosettacode.org/wiki/Fibonacci_sequence see 8080 code

\ Create the dummy executable word and reserve space in its code field
: fiboasm ( n -- f ) \ LMM fibonacci
  0 \ startLMM needs 1 or 2 words, each literal is 2 bytes
  1 2
  2 2
  3 2
  4 2
  5 2
  6 2
  7 2
  8 2
  9 2
  10 2
  11 2
  12 2
  13 2
  99 2 \ endLMM needs 1 long
;

\ version using registers R0 R1 R2
\ Patch the PASM assembly code into the dummy word
' fiboasm patchadr !    \ get the destination for patching
startLMM                \ Start Patching
R1 tos mov              \ FIBNCI: MOV C, A ; C will store the counter
R1 1 imm sub            \         DCR C    ; decrement, because we know f(1) already
tos 1 imm mov           \         MVI A, 1
R0 0 imm mov            \         MVI B, 0
R2 tos mov              \ LOOP:   MOV D, A
tos R0 add              \         ADD B    ; A := A + B
R0 R2 mov               \         MOV B, D
R1 1 wz imm sub         \         DCR C
IP 20 imm if_nz sub     \ Jump relative -5*4   JNZ LOOP ; jump if not zero
                        \         RET      ; return from subroutine
endLMM
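As a cross-check of the algorithm (not of the assembler itself), the register moves of the listing can be transcribed into Python. The function name is invented, and the 32-bit masking is only there to mimic the width of the cog registers.

```python
MASK32 = 0xFFFFFFFF

def fibo_registers(n):
    """Same register shuffle as the patched code.

    Intended for n >= 2: for n = 1 the counter starts at zero and wraps
    around in this model, so the loop would run about 2**32 times.
    """
    tos = n
    r1 = tos                          # R1 tos mov
    r1 = (r1 - 1) & MASK32            # R1 1 imm sub
    tos = 1                           # tos 1 imm mov
    r0 = 0                            # R0 0 imm mov
    while True:
        r2 = tos                      # R2 tos mov       (LOOP)
        tos = (tos + r0) & MASK32     # tos R0 add
        r0 = r2                       # R0 R2 mov
        r1 = (r1 - 1) & MASK32        # R1 1 wz imm sub
        if r1 == 0:                   # IP 20 imm if_nz sub falls through when Z is set
            return tos

print(fibo_registers(46))
```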
COG PASM Using LOADMOD
Like LMM, the LOADMOD feature is already built into Tachyon.
Without much alteration of the assembler it is possible to create short PASM routines (max. 38 instructions) which can be loaded as a block into cog RAM. In Tachyon 5.7 the start address in cog RAM is always $1D9. Loading is done with LOADMOD ( hub-adr cog-adr length -- ).
After loading, the code can be started with the word RUNMOD, which acts as a placeholder for whatever the loaded routine does. The stack effect of RUNMOD depends on the code that is now executed.
As we are now in real assembler, we must now use real jmp instructions.
0 $1D9 4 + nr imm if_nz jmp
0 - no destination field used
$1D9 4 + - we want to jump to the 5th instruction (still no labels)
nr - no write of result
imm - immediate flag set
if_nz - only if the zero flag is not set
jmp - jump
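Since there are still no labels, the absolute target has to be computed by hand as well. A minimal Python sketch, with an invented helper name and the base address $1D9 taken from the text:

```python
MODULE_BASE = 0x1D9   # cog start address of a loaded module in Tachyon 5.7 (from the text)

def cog_jump_target(instruction_index, base=MODULE_BASE):
    """Absolute cog address of the instruction at the given 0-based index.

    Cog addresses count longs, so one instruction is one address step.
    This differs from the LMM case, where IP is a hub byte address
    and moves in steps of 4.
    """
    return base + instruction_index

# The 5th instruction has index 4.
print(hex(cog_jump_target(4)))
```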
The code has to finish with a jump to the inner interpreter wordcode loop:
0 $5D nr imm jmp \ jmp to doNEXT
(If appropriate, it could jump to DROP)
[fibomod] is a word that will load the code into cog RAM.
46 [fibomod] RUNMOD will load the module and then execute the code, which takes 46 from the stack and gives back the Fibonacci number.
\ version for RUNMOD ========================================================
4 15 * CARRAY _fibomod          \ Create an array in code space to hold the code in hub memory
0 _fibomod longalign patchadr ! \ Start of code is at the first aligned byte of the array
startMOD                        \ start patching
R1 tos mov                      \ FIBNCI: MOV C, A ; C will store the counter
R1 1 imm sub                    \         DCR C    ; decrement, because we know f(1) already
tos 1 imm mov                   \         MVI A, 1
R0 0 imm mov                    \         MVI B, 0
R2 tos mov                      \ LOOP:   MOV D, A
tos R0 add                      \         ADD B    ; A := A + B
R0 R2 mov                       \         MOV B, D
R1 1 wz imm sub                 \         DCR C
0 $1D9 4 + nr imm if_nz jmp     \ Jump to LOOP   JNZ LOOP ; jump if not zero
0 $5D nr imm jmp                \ jmp to doNEXT
\ endMOD no action

: [fibomod] \ load the new code
  0 _fibomod longalign          \ get start in hub ram
  $01D9                         \ destination adr in cog ram always the same place
  OVER patchadr @ SWAP - 4 /    \ length longs
  LOADMOD
;
Fast with BOUNDS: 1836311903 5,008 cycles = 62.600us
Standard Forth words: 1836311903 13,072 cycles = 163.400us
LMM PASM: 1836311903 7,488 cycles = 93.600us
cog PASM incl LOADMOD: 1836311903 5,568 cycles = 69.600us
cog PASM excl LOADMOD: 1836311903 944 cycles = 11.800us ok
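The microsecond figures follow from the cycle counts if the system clock is 80 MHz. That clock rate is not stated in the output; it is an assumption inferred here from 5,008 cycles = 62.600us. A small Python sketch to redo the conversion and compare the variants:

```python
CLOCK_HZ = 80_000_000   # assumption, inferred from the listed cycle/time pairs

results_cycles = {
    "Fast with BOUNDS": 5008,
    "Standard Forth words": 13072,
    "LMM PASM": 7488,
    "cog PASM incl LOADMOD": 5568,
    "cog PASM excl LOADMOD": 944,
}

def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert clock cycles to microseconds."""
    return cycles * 1_000_000 / clock_hz

baseline = results_cycles["Standard Forth words"]
for name, cycles in results_cycles.items():
    speedup = baseline / cycles
    print(f"{name:24s} {cycles_to_us(cycles):8.3f} us  {speedup:5.2f}x vs standard Forth")
```

The difference between the two cog PASM rows, 5,568 - 944 = 4,624 cycles, suggests that loading the module accounts for most of the time of a single call.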
The Fibonacci benchmark code in extend.fth uses a special word BOUNDS, which is coded in cog assembler. This seems to be a little bit like cheating. We can't beat this with LMM.
pub fibo ( n -- f ) 0 1 ROT FOR BOUNDS NEXT DROP ; \ from extend.fth
Using normal Forth words we get 163 µs.
And yes, we can beat THIS with LMM assembler: 93 µs.
The pure cog PASM does overtake, if there are many iterations or if the module can be loaded and used several times. :-)
I am still looking for a way to include LMM code in between Forth code, so that you can use structure words like IF....THEN.
Ideas, comments welcome!