Simple Assembler for P1 Tachyon Forth 5.7 - Or: Can you beat Tachyon at fibonacci?
Hi,
perhaps this is useful for somebody.
This is a simple assembler for Tachyon Forth 5.7. It is based on the "work in progress interactive assembler" which Peter had begun. It is not very pretty, but it is at least able to compute the 46th Fibonacci number.
Update: With version B, there is a way to use "real" COG PASM too, see below.
What is LMM?
The Propeller can only run assembler code from cog memory, and that memory is already mostly used. This assembler therefore uses a feature that is built into Tachyon as an alternative to "doNEXT", the standard inner interpreter of Forth. LMM stands for LARGE MEMORY MODEL and enables the P1 to execute assembler code from hub memory.
A tiny loop fetches an instruction from hub, patches it into the loop at "instr", increments the instruction pointer, executes the instruction (in place of the NOP) and repeats. So IP is used instead of the PC, which is important to know, because jump instructions have to be done differently.
79D8(007C) 23 FC BC 08 | LMM    rdlong instr,IP
79DC(007D) 04 46 FC 80 |        add IP,#4
79E0(007E) 00 00 00 00 | instr  nop
79E4(007F) 7C 00 7C 5C |        jmp #LMM
So to start the assembler routine, we have to start this LMM-loop. This is done with startLMM.
To end it, we can do a real JMP to the EXIT code, which ends this word and switches back to word code execution. This is done with endLMM.
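To make the fetch/advance/execute cycle concrete, here is a minimal Python model of the LMM loop. It is only a sketch: the instruction tuples, the `ins` helper, the `exit` pseudo-instruction (standing in for endLMM) and the hub base address `0x4000` are assumptions made for illustration, not anything from the Tachyon kernel. The program it runs is the Fibonacci routine from this post.

```python
MASK = 0xFFFFFFFF  # P1 registers are 32 bits wide


def ins(op, dest=None, src=None, imm=False, cond="always", wz=False):
    """Build one symbolic instruction (hypothetical stand-in for a 32-bit opcode)."""
    return (op, dest, src, imm, cond, wz)


def run_lmm(program, tos, base=0x4000):
    """Model the LMM loop: fetch from hub at IP, advance IP by 4, execute."""
    hub = {base + 4 * i: instr for i, instr in enumerate(program)}
    reg = {"IP": base, "tos": tos, "R0": 0, "R1": 0, "R2": 0}
    zero = False
    while True:
        op, dest, src, imm, cond, wz = hub[reg["IP"]]  # rdlong instr,IP
        reg["IP"] += 4                                  # add IP,#4
        if op == "exit":                                # stand-in for endLMM
            return reg["tos"]
        if cond == "if_nz" and zero:
            continue                                    # condition false: skip
        value = src if imm else reg[src]
        if op == "mov":
            result = value
        elif op == "add":
            result = (reg[dest] + value) & MASK
        elif op == "sub":
            result = (reg[dest] - value) & MASK
        else:
            raise ValueError(f"unknown op {op}")
        reg[dest] = result
        if wz:
            zero = result == 0                          # jmp #LMM follows


FIBO = [
    ins("mov", "R1", "tos"),
    ins("sub", "R1", 1, imm=True),
    ins("mov", "tos", 1, imm=True),
    ins("mov", "R0", 0, imm=True),
    ins("mov", "R2", "tos"),                        # LOOP
    ins("add", "tos", "R0"),
    ins("mov", "R0", "R2"),
    ins("sub", "R1", 1, imm=True, wz=True),
    ins("sub", "IP", 20, imm=True, cond="if_nz"),   # relative jump back to LOOP
    ins("exit"),
]

print(run_lmm(FIBO, 46))  # 1836311903
```

Note that the jump is an ordinary `sub` on IP: because IP has already been advanced past the jump instruction when it executes, subtracting 20 lands on the fifth instruction of the routine.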
Patching
P1 has restricted memory, so it would be nice if you could forget the ASSEMBLER when it has done its job. It is not needed during execution.
This can be achieved with the following method:
1. Use a normal colon-definition to create the new executable word. Fill it with dummy contents; I use literals. Tachyon can compile 15-bit literals into word codes. Each literal reserves 2 bytes.
2. Load the Assembler
3. Get the Code field address of the new word. Patch the assembler code into the new word. Each instruction needs 4 bytes. startLMM and endLMM each need 4 bytes too, including alignment.
4. Forget the Assembler
4. Forget the Assembler
5. Use the new word
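The space that step 1 has to reserve follows directly from the byte counts given in the steps above. The small Python sketch below just does that arithmetic; the function name is made up for illustration, and it assumes the worst case of 4 bytes for startLMM.

```python
LITERAL_BYTES = 2      # each dummy literal reserves 2 bytes
INSTRUCTION_BYTES = 4  # each PASM instruction is one long
START_BYTES = 4        # startLMM, worst case including alignment
END_BYTES = 4          # endLMM


def dummy_literals_needed(n_instructions):
    """Number of 2-byte dummy literals to reserve for a patched LMM word."""
    total = START_BYTES + n_instructions * INSTRUCTION_BYTES + END_BYTES
    return -(-total // LITERAL_BYTES)  # ceiling division


# The Fibonacci routine in this post has 9 instructions.
print(dummy_literals_needed(9))  # 22
```

Reserving more than this is harmless; it only wastes a few bytes in the word.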
Syntax and example
This is a Forth Assembler, so it uses Forth syntax.
R1 1 wz imm sub
R1 - destination register
1 - source, in this case an immediate number
wz - write the zero flag
imm - source is immediate
sub - this word puts together the instruction with the other information of the line and patches the resulting 32 bit code into the word.
IP 20 imm if_nz sub \ Jump relative -5 * 4
As said before, the IP-register serves as program counter. To jump back 5 instructions, we subtract 5 * 4 = 20 from IP. No labels, the distance has to be calculated manually. Though you could use the stack and patchadr somehow....
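For readers who want to see what "puts together the instruction" means at the bit level, here is a Python sketch of the standard P1 instruction layout: 6-bit opcode, the Z/C/R/I flags, a 4-bit condition, then 9-bit destination and source fields. The register addresses `IP = $23`, `instr = $7E` and `LMM = $7C` are read off the LMM loop listing earlier in this post. How exactly the Forth words `imm`, `wz`, `nr` and `if_nz` map onto these bits inside the assembler is my assumption, and the `encode` function is only illustrative.

```python
OPCODES = {
    "rdlong": 0b000010,
    "jmp":    0b010111,  # JMPRET with the result write disabled
    "add":    0b100000,
    "sub":    0b100001,
    "mov":    0b101000,
}
CONDITIONS = {"always": 0b1111, "if_nz": 0b0101, "if_z": 0b1010}

IP, INSTR, LMM = 0x23, 0x7E, 0x7C  # cog addresses taken from the listing


def encode(op, dest, src, imm=False, cond="always",
           wz=False, wc=False, write_result=True):
    """Assemble one 32-bit P1 instruction from its fields."""
    zcri = (wz << 3) | (wc << 2) | (write_result << 1) | imm
    return (OPCODES[op] << 26 | zcri << 22 | CONDITIONS[cond] << 18
            | (dest & 0x1FF) << 9 | (src & 0x1FF))


def as_hub_bytes(word):
    """Little-endian byte string, as shown in the hub memory dump."""
    return word.to_bytes(4, "little")


# Cross-check against the three non-NOP instructions of the LMM loop.
assert as_hub_bytes(encode("rdlong", INSTR, IP)) == bytes.fromhex("23 FC BC 08")
assert as_hub_bytes(encode("add", IP, 4, imm=True)) == bytes.fromhex("04 46 FC 80")
assert as_hub_bytes(encode("jmp", 0, LMM, imm=True, write_result=False)) \
    == bytes.fromhex("7C 00 7C 5C")

# The relative LMM jump "IP 20 imm if_nz sub":
print(hex(encode("sub", IP, 20, imm=True, cond="if_nz")))  # 0x84d44614
```

The three asserts reproduce the bytes in the dump, which is a reasonable sanity check that the field layout is right.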
You can at least use the registers R0, R1, R2 and you can access tos, top of stack, and its next stack contents par2, par3, par4.
You can have a look at the contents of the word's code field before and after patching using HELP.
\ Fibonacci Benchmark
\ https://rosettacode.org/wiki/Fibonacci_sequence see 8080 code

\ Create the dummy executable word and reserve space in its code field
: fiboasm ( n -- f ) \ LMM fibonacci
0 \ startLMM needs 1 or 2 words, each literal is 2 bytes
1 2 2 2 3 2 4 2 5 2
6 2 7 2 8 2 9 2 10 2
11 2 12 2 13 2
99 2 \ endLMM needs 1 long
;

\ version using registers R0 R1 R2
\ Patch the PASM assembly code into the dummy word
' fiboasm patchadr ! \ get the destination for patching
startLMM \ Start Patching
R1 tos mov \ FIBNCI: MOV C, A ; C will store the counter
R1 1 imm sub \ DCR C ; decrement, because we know f(1) already
tos 1 imm mov \ MVI A, 1
R0 0 imm mov \ MVI B, 0
R2 tos mov \ LOOP: MOV D, A
tos R0 add \ ADD B ; A := A + B
R0 R2 mov \ MOV B, D
R1 1 wz imm sub \ DCR C
IP 20 imm if_nz sub \ Jump relative -5*4 JNZ LOOP ; jump if not zero
\ RET ; return from subroutine
endLMM
COG PASM Using LOADMOD
Like LMM, the LOADMOD feature is already built into Tachyon.
Without much alteration of the assembler, it is possible to create short PASM routines (max. 38 instructions) which can be loaded as a block into cog ram. In Tachyon 5.7 the start address in cog ram is always $1D9. Loading is done with LOADMOD ( hub-adr cog-adr length -- ) .
After loading, the code can be started with the word RUNMOD, which acts as a placeholder for whatever the loaded routine will do. The stack effect of RUNMOD will depend on the code that is now executed.
As we are now in real assembler, we must now use real jmp instructions.
0 $1D9 4 + nr imm if_nz jmp
0 - no destination field used
$1D9 4 + - we want to jump to the 5th instruction (still no labels)
nr - no write of result
imm - immediate flag set
if_nz - only if the zero flag is not set
jmp - jump
The code has to finish with a jump to the inner interpreter wordcode loop:
0 $5D nr imm jmp \ jmp to doNEXT
(If appropriate, it could jump to DROP)
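Since there are no labels in either mode, the jump operand has to be worked out by hand, and the two modes count differently: LMM jumps are relative and in bytes, cog jumps are absolute and in longs. The Python sketch below captures both calculations. The function names are made up for illustration, and instruction indices are counted from 0 at the first instruction of the routine.

```python
MODULE_BASE = 0x1D9  # cog start address used by LOADMOD in Tachyon 5.7


def lmm_back_offset(jump_index, target_index):
    """Bytes to subtract from IP for a backward LMM jump.

    IP already points at the instruction after the jump when the
    jump executes, hence the +1.
    """
    return (jump_index + 1 - target_index) * 4


def cog_jump_target(target_index, base=MODULE_BASE):
    """Absolute cog address of an instruction inside the loaded module."""
    return base + target_index


# In the Fibonacci routine the jump is instruction 8 and LOOP is instruction 4.
print(lmm_back_offset(8, 4))        # 20
print(hex(cog_jump_target(4)))      # 0x1dd
```

Both results match the operands used in the listings: `IP 20 imm if_nz sub` for LMM and `$1D9 4 +` for the cog version.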
[fibomod] is a word that will load the code into the cog ram.
46 [fibomod] RUNMOD will load the module and then execute the code, which will use 46 from the stack and give back the fibonacci number.
\ version for RUNMOD ========================================================
4 15 * CARRAY _fibomod \ Create an array in code space to hold the code in hub memory
0 _fibomod longalign patchadr ! \ Start of code is at the first aligned byte of the array
startMOD \ start patching
R1 tos mov \ FIBNCI: MOV C, A ; C will store the counter
R1 1 imm sub \ DCR C ; decrement, because we know f(1) already
tos 1 imm mov \ MVI A, 1
R0 0 imm mov \ MVI B, 0
R2 tos mov \ LOOP: MOV D, A
tos R0 add \ ADD B ; A := A + B
R0 R2 mov \ MOV B, D
R1 1 wz imm sub \ DCR C
0 $1D9 4 + nr imm if_nz jmp \ Jump to LOOP JNZ LOOP ; jump if not zero
0 $5D nr imm jmp \ jmp to doNEXT
\ endMOD no action

: [fibomod] \ load the new code
0 _fibomod longalign \ get start in hub ram
$01D9 \ destination adr in cog ram always the same place
OVER patchadr @ SWAP - 4 / \ length longs
LOADMOD
;
Speed Results
@ 80MHz:
Fast with BOUNDS: 1836311903 5,008 cycles = 62.600us
Standard Forth words: 1836311903 13,072 cycles = 163.400us
LMM PASM: 1836311903 7,488 cycles = 93.600us
cog PASM incl LOADMOD: 1836311903 5,568 cycles = 69.600us
cog PASM excl LOADMOD: 1836311903 944 cycles = 11.800us ok
The fibonacci benchmark code in extend.fth uses a special word BOUNDS which is coded in cog assembler. This seems to be a little bit like cheating. We can't beat this.
pub fibo ( n -- f ) 0 1 ROT FOR BOUNDS NEXT DROP ; \ from extend.fth
Using normal forth words we get 163 µs.
And yes, we can beat THIS with LMM assembler: 93 µs.
The pure cog PASM does overtake if there are many iterations or if the module can be loaded and then used several times. :-)
I am still looking for a way to include LMM code in between Forth code, so that you can use structure words like IF....THEN.
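The numbers above can be checked with a few lines of Python. This is only arithmetic on the measured cycle counts. The LOADMOD overhead is taken to be the difference between the "incl" and "excl" measurements, which is an assumption on my part, and the helper names are made up.

```python
CLOCK_HZ = 80_000_000  # 80 MHz

CYCLES = {
    "BOUNDS": 5008,
    "Forth": 13072,
    "LMM": 7488,
    "cog incl LOADMOD": 5568,
    "cog excl LOADMOD": 944,
}


def cycles_to_us(cycles):
    """Convert clock cycles to microseconds at CLOCK_HZ."""
    return cycles * 1_000_000 / CLOCK_HZ


# Assumed: load cost = (load + run) - (run only)
LOADMOD_OVERHEAD = CYCLES["cog incl LOADMOD"] - CYCLES["cog excl LOADMOD"]


def runs_to_beat(other_cycles, cog_cycles=CYCLES["cog excl LOADMOD"]):
    """Smallest number of runs after ONE load where cog PASM is faster in total."""
    return LOADMOD_OVERHEAD // (other_cycles - cog_cycles) + 1


for name, cycles in CYCLES.items():
    print(f"{name:18s} {cycles:6d} cycles = {cycles_to_us(cycles):.3f} us")

print(LOADMOD_OVERHEAD)                # 4624
print(runs_to_beat(CYCLES["LMM"]))     # 1
print(runs_to_beat(CYCLES["BOUNDS"]))  # 2
```

Under that assumption, a module that is loaded once already pays off against LMM on the first call and against the BOUNDS version from the second call on.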
Christof
Have fun!
Ideas, comments welcome!
Comments
Very neat trick. And very educational, too.
Thanks for posting.
However, regarding the statement that using BOUNDS "seems to be a little bit like cheating", I have two questions to better understand its intent:
1. Is the word BOUNDS the only Tachyon word that is coded in cog assembler ?
2. If it is not, then why do you get the impression that using that word "..seems to be a little bit like cheating" ?
I see nothing wrong with BOUNDS from the user perspective.
Well, this is a very special kernel (!!) word, coded in assembler. Sitting in precious cog memory and waiting there to be used. It is useful -as far as I can see- for nothing else than calculating fibonacci numbers. It is never used in extend.fth again. So at this point you have to be aware, that here the benchmark is about cog assembler and not about Forth.
There seems to be a possibility to load some short code into cog memory as a block and execute it there, perhaps I will find out how this is done....
FWIW LMM stands for LargeMemoryModel and was developed by Bill Henning about 10 years ago.
I have a version of the LMM execution loop that hides and executes from the shadow cog registers. Search for my Zero Footprint Debugger in the P1 forum.
@"Christof Eb." , I get it now. Thank you for the clarification.
I still think the little space occupied by the BOUNDS word is well worth its existence, especially since it is a public word. There are many words in Tachyon one might never use or need, but if I still have room for my application with all these words, I wouldn't get rid of any of them, just for the sake of future convenience. But that's just me. Having more doesn't hurt until ... it starts to hurt and then it's time to look for ways to overcome this limitation.
To me, benchmarking just for the sake of benchmarking makes no real sense. Well, mostly. It gives you a general idea, that is true, but tells little of the particular implementation one is currently trying to optimize.
What I really like about your pursuit is that you try a different approach than Peter's, and that might prove beneficial.
We might have different views or needs but that doesn't mean we shall not benefit from these differences. Quite the opposite.
Please, keep posting. This is valuable stuff.
Thanks, Cluso, I have corrected the wording for LMM. This assembler only uses features that are already built into the Tachyon 5.7 kernel. The LMM loop is already there. The only new thing is that you can assemble new PASM code words.
Version B shows now, how to assemble and use short real COG PASM routines, which can be loaded as a block into COG ram before execution. Once the code is there, you drive with full speed. :-)
First post updated.