TIA - An interactive inline assembler for TAQOZ

I have been putting off writing an assembler for ages, mainly because I didn't have a clue about many of the new P2 instructions. Another reason was that I wanted the assembler to work with the same syntax as PASM. However, I decided that any assembler is better than no assembler, so I started on the back-end in Forth fashion. I still have a bit to go, but the bulk of the back-end is done and, in unsurprising TAQOZ fashion, the code only takes around 1kB with another 2kB for the lexicon. So now I can start coding assembler interactively with TAQOZ from the P2. Later on I will add a front-end that processes files in PASM fashion and passes them to the back-end.
Here's a quick sample of a fast Fibonacci and how it looks in the listing when it is typed or pasted in. I then test it interactively and compare results and timings with the pure Forth version "fibo". Finally, a listing produced by p2asm shows that it generates the same code.
Fast fibo in TIA:
code ffibo ( n -- f )
x # 0 mov,
y # 1 mov,
a a FOR,
y x add,
z x mov,
x y mov,
y z mov,
a NEXT,
_ret_ a x mov,
end,
Pasting that in produces this output:
TAQOZ# code ffibo ( <condition> <dest> <#> <src> <effects> instr, ) ( n -- f )
062AC: F604_1600 x # 0 mov,
062B0: F604_1801 y # 1 mov,
a a FOR,
062B4: F100_180B y x add,
062B8: F600_1A0B z x mov,
062BC: F600_160C x y mov,
062C0: F600_180D y z mov,
062C4: FB6C_45FB a NEXT,
062C8: 0600_440B _ret_ a x mov,
end, ok
Testing this in the terminal along with the standard fibo version:
TAQOZ# 46 ffibo . --- 1836311903 ok
TAQOZ# 46 fibo . --- 1836311903 ok
TAQOZ# 46 LAP ffibo LAP .LAP --- 1,184 cycles= 4,736ns @250MHz ok
TAQOZ# 46 LAP fibo LAP .LAP --- 3,296 cycles= 13,184ns @250MHz ok
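The nanosecond figures that .LAP reports follow directly from the cycle counts and the clock rate. A minimal Python sketch of that arithmetic (the helper name `cycles_to_ns` is made up for illustration and is not part of TAQOZ):

```python
def cycles_to_ns(cycles: int, clock_mhz: int) -> int:
    """Convert a cycle count to nanoseconds at the given clock rate in MHz."""
    return cycles * 1000 // clock_mhz

# Cycle counts taken from the two LAP lines above, both at 250MHz.
ffibo_ns = cycles_to_ns(1184, 250)
fibo_ns = cycles_to_ns(3296, 250)
speedup = 3296 / 1184

print(ffibo_ns, fibo_ns, round(speedup, 2))
```

On these numbers the assembled version comes out a little under three times faster than the pure Forth one.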
and finally the listing that p2asm produces with the equivalent code:
01418 ffibo '( n -- f ) fast fibo - a = tos '
01418 f6041600 mov x,#0
0141c f6041801 mov y,#1
01420 f100180b .l0 add y,x
01424 f6001a0b mov z,x
01428 f600160c mov x,y
0142c f600180d mov y,z
01430 fb6c45fb djnz a,#.l0
01434 0600440b _ret_ mov a,x
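For anyone who wants to follow the register shuffle without a P2 at hand, here is a rough Python model of the loop. It is only a sketch: it assumes 32-bit wrap-around on the add and n >= 1, and the Python names simply mirror the x, y, z registers in the listing.

```python
MASK32 = 0xFFFFFFFF  # P2 registers are 32 bits wide

def ffibo(n: int) -> int:
    """Model of the ffibo loop; assumes n >= 1."""
    x, y = 0, 1                  # mov x,#0 / mov y,#1
    for _ in range(n):           # a FOR, ... a NEXT, (djnz a)
        y = (y + x) & MASK32     # add y,x
        z = x                    # mov z,x
        x = y                    # mov x,y
        y = z                    # mov y,z
    return x                     # _ret_ mov a,x

print(ffibo(46))
```

Each pass leaves the newest Fibonacci number in x and the previous one in y, which is why three mov instructions are spent on the swap every iteration.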
P.S. Yes, I am getting P2D2 boards together and tested. This is my break away from it all.
Comments
@ErNa - I agree that I'd rather not assemble these boards myself, but I lucked out with my friend who hasn't been available, but nonetheless I am close to finishing and testing the boards.
Like many others I like to program Forth in any language ;-)
TAQOZ# code ffibo
06DB2: F604_1600        mov x,#0
06DB6: F604_1801        mov y,#1
06DBA: F100_180B  .l0   add y,x
06DBE: F600_1A0B        mov z,x
06DC2: F600_160C        mov x,y
06DC6: F600_180D        mov y,z
06DCA: FB6C_45FB        djnz a,#l0
06DCE: F600_440B        _ret_ mov a,x
end --- ok
Testing:
TAQOZ# 46 ffibo . --- 1836311903 ok
Here is an even faster fibo that avoids the register swap each iteration and alternates between registers.
TAQOZ# code ffibo2
06DD0: F604_1600        mov x,#0
06DD4: F604_1801        mov y,#1
06DD8: F100_180B  .l0   add y,x
06DDC: FB6C_45FF        djnz a,#l2
06DE0: F600_440C        _ret_ mov a,y
06DE4: F100_160C  .l2   add x,y
06DE8: FB6C_45FB        djnz a,#l0
06DEC: F600_440B        _ret_ mov a,x
end --- ok
(EDIT: actually I just found a bug that crept in - with the previous post also - see if you can spot it)
Checking that the forward reference was resolved:
TAQOZ# $6DDC @ .l --- $FB6C_4401 ok
and of course testing it:
TAQOZ# 46 ffibo2 . --- 1836311903 ok
TAQOZ# 46 LAP ffibo2 LAP .LAP --- 1,000 cycles= 4,000ns @250MHz ok
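The alternating-register idea can be checked with a small Python model of the intended control flow. This is a sketch under the same assumptions as before (32-bit wrap-around, n >= 1); it models what the code is meant to do, not the encoding of any particular listing.

```python
MASK32 = 0xFFFFFFFF

def ffibo2(n: int) -> int:
    """Model of ffibo2: the sum alternates between y and x. Assumes n >= 1."""
    x, y = 0, 1                  # mov x,#0 / mov y,#1
    a = n
    while True:
        y = (y + x) & MASK32     # l0: add y,x
        a -= 1                   # djnz a,#l2 (jump while a is not zero)
        if a == 0:
            return y             # _ret_ mov a,y
        x = (x + y) & MASK32     # l2: add x,y
        a -= 1                   # djnz a,#l0
        if a == 0:
            return x             # _ret_ mov a,x

print(ffibo2(46))
```

Each iteration is now one add plus one djnz instead of an add, three movs and a djnz, at the cost of a second exit point.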
The dots are missing from the jump targets: #l0 instead of #.l0
Hint: Check instruction encoding. If you are used to reading P2 machine code, then it should be glaringly obvious without having to know the exact code
Correct! (The first 4-bit field is the conditional execution field, and since "never" is a fairly useless condition, it was instead leveraged to specify a return after the current instruction.)
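To make the difference visible, here is a small Python sketch that pulls the top four bits out of an instruction word. It only names the two values discussed in this thread ($F for a normal unconditional instruction, $0 for _ret_); the function names are made up for illustration.

```python
def cond_field(word: int) -> int:
    """Return the top 4 bits of a 32-bit instruction word."""
    return (word >> 28) & 0xF

def describe(word: int) -> str:
    cond = cond_field(word)
    if cond == 0xF:
        return "always"
    if cond == 0x0:
        return "_ret_"
    return f"conditional (%{cond:04b})"

print(describe(0xF600440B))  # the buggy encoding of "_ret_ mov a,x"
print(describe(0x0600440B))  # the encoding from the first listing
```

The buggy word starts with F, so it is an ordinary mov with no return attached.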
The bug was due to the new line evaluation routine, which wasn't converting case. In this "case" the _ret_ wasn't being found in the dictionary because it was a _RET_, and my current default during testing was to skip that evaluation failure. All that the new version really does is defer execution of the instructions until the end of the line, and also filter and substitute certain characters so that the line can be evaluated as a standard Forth expression.
This btw, is the correct output (I removed the need for the dot for local labels)
TAQOZ# code ffibo2
06E52: F604_1600        mov x,#0
06E56: F604_1801        mov y,#1
06E5A: F100_180B  l0    add y,x
06E5E: FB6C_45FF        djnz a,#l2
06E62: 0600_440C        _ret_ mov a,y
06E66: F100_160C  l2    add x,y
06E6A: FB6C_45FB        djnz a,#l0
06E6E: 0600_440B        _ret_ mov a,x
end --- ok
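The branch targets can be recovered from the listings by hand. The sketch below assumes that the low 9 bits of a djnz with an immediate target are a signed offset, counted in 4-byte instructions from the instruction after the branch. That rule is inferred from the addresses shown in this thread rather than quoted from the P2 documentation, and the helper names are hypothetical.

```python
def rel9(word: int) -> int:
    """Sign-extend the low 9 bits of an instruction word."""
    s = word & 0x1FF
    return s - 0x200 if s & 0x100 else s

def branch_target(addr: int, word: int) -> int:
    """Assumed rule: offset counts 4-byte instructions from the next instruction."""
    return addr + 4 + 4 * rel9(word)

# Backward branch to l0 in the listing above.
print(hex(branch_target(0x06E6A, 0xFB6C45FB)))
# Resolved forward reference read back earlier with "$6DDC @ .l".
print(hex(branch_target(0x06DDC, 0xFB6C4401)))
```

The listing shows FB6C_45FF for the forward djnz because the label was not yet known when that line was printed; the value read back from memory afterwards carries the resolved offset.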
Do we really need to use a ## or is that some limitation of PNut?
Here's a quick check:
TAQOZ# code augment
06EA2: F604_4400        mov a,#0
06EA6: F604_4464        mov a,#100
06EAA: F604_45FF        mov a,#511
06EAE: FF00_0001        mov a,#512
06EB2: F604_4400
06EB6: FF1D_CD65        mov a,#1_000_000_000
06EBA: F604_4400
06EBE: FF7F_FFFF        mov a,#-1
06EC2: F604_45FF
06EC6: FCDC_0408        rep #2,#8
06ECA: 0000_0000        nop
06ECE: 0000_0000        nop
06ED2: FF1D_CD65        rep #2,#1_000_000_000
06ED6: FCDC_0400
06EDA: 0000_0000        nop
06EDE: 0000_0000        nop
06EE2: FD64_002D        ret
end --- ok
TAQOZ# code wrhub
06EE8: FF00_0001        mov a,#1000
06EEC: F604_45E8
06EF0: FFFF_FFFF        wrlong #-1,#$80
06EF4: FC6F_FE80
06EF8: FF00_0080        wrlong #-1,#$1_0000
06EFC: FFFF_FFFF
06F00: FC6F_FE00
06F04: FD64_002D        ret
end --- ok
BTW, this code here is not meant to be a working example, simply checking the assembler.
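The way an oversized immediate is split can be reproduced from the listing: the prefix word carries the upper 23 bits and the instruction keeps the lower 9. A minimal Python sketch, where the AUGS base value $FF00_0000 and the "mov a,#0" word are read off the listing above and the function names are made up for illustration:

```python
AUGS_BASE = 0xFF000000    # prefix word with a zero payload, as seen in the listing
MOV_A_IMM = 0xF6044400    # "mov a,#0" from the listing

def needs_aug(value: int) -> bool:
    """True when the immediate does not fit in the 9-bit source field."""
    return not 0 <= value <= 511

def split_immediate(value: int) -> tuple[int, int]:
    """Return (prefix word, low 9 bits) for a 32-bit immediate."""
    value &= 0xFFFFFFFF
    return AUGS_BASE | (value >> 9), value & 0x1FF

for v in (512, 1000, 1_000_000_000, -1):
    prefix, low9 = split_immediate(v)
    print(f"{v:>13}: {prefix:08X} {MOV_A_IMM | low9:08X}")
```

Since the assembler already knows the value when it evaluates the operand, it can decide on its own whether the prefix is needed, which is the point being made about not having to type ##.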
I would think that once we have an immediate operand, then we already know what size it is, although I guess with symbols that might mean it doesn't need an augs and thus reduces the code size. But I think the fact that PNut does not produce a line by line listing is a far greater disadvantage than the forced ##. I agree that having it is good when you want to force it, but not having to use it is also good.
So I wrote a check routine for combined augs and augd in the one operation. It starts to look ugly when you want to write a 32-bit immediate value to a 20-bit hub address. It only takes 16 bytes to do it!
I know this is an extreme example, but actually it's good to know that you can do it.
TAQOZ# code wrhub
06EA2: FF00_0080        wrlong #$12345678,#$1_0000
06EA6: FF89_1A2B
06EAA: FC6C_F000
06EAE: FD64_002D        ret
end --- ok
TAQOZ# wrhub --- ok
TAQOZ# $10000 QD ---
0001_0000: 78 56 34 12 00 00 00 00 00 00 00 00 00 00 00 00 'xV4.............'
0001_0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 '................' ok
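The three instruction words can be checked by hand. The Python sketch below assumes the prefix bases $FF00_0000 (source prefix) and $FF80_0000 (destination prefix), both read off the listings in this thread, and also shows why the dump starts with 78 56 34 12: the long is stored little-endian.

```python
import struct

AUGS_BASE = 0xFF000000   # prefix for the source operand, from the listing
AUGD_BASE = 0xFF800000   # prefix for the destination operand, from the listing

def aug_prefix(base: int, value: int) -> int:
    """Prefix word carrying the upper 23 bits of a 32-bit immediate."""
    return base | ((value & 0xFFFFFFFF) >> 9)

data, hub_addr = 0x12345678, 0x10000

print(f"{aug_prefix(AUGS_BASE, hub_addr):08X}")   # first word in the listing
print(f"{aug_prefix(AUGD_BASE, data):08X}")       # second word in the listing
print(f"{(data & 0x1FF) << 9:04X}")               # low 9 bits of the data, placed at bit 9
print(struct.pack("<I", data).hex(" "))           # byte order as seen by QD
```

The low 9 bits of the data ($078) shifted up by 9 give the F000 visible in the final FC6C_F000 word, while the low 9 bits of the address are zero.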
sos check your pm
So I punch in this piece of code into TIA
code LEN ( string -- len )
        rdfast #0,a
        mov a,#0
.l0     rfbyte x wcz    ' read a byte
        {c|z} ret       ' terminate if zero or > $7F'
        ijnz a,#l0
end
and get it to assemble correctly:
TAQOZ# code LEN ( string -- len )
07006 FC78_0022        rdfast #0,a
0700A F604_4400        mov a,#0
0700E FD78_1610  .l0   rfbyte x wcz    ' read a byte
07012 ED64_002D        {c|z} ret       ' terminate if zero or > $7F'
07016 FB8C_45FD        ijnz a,#l0
end --- ok
So it saves an instruction and would never not branch in reality, plus the wcz in the rfbyte is handy for testing the msb as a terminator as well. I have the normal conditionals such as if_c_or_z but I also have shorter less verbose variants like {c|z}.
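As a reference for what LEN is meant to compute, here is a plain Python model of the termination rule: stop on a zero byte or on any byte above $7F. It is a sketch only; the function name is made up, and unlike the hardware loop it also stops at the end of the buffer.

```python
def taqoz_len(data: bytes) -> int:
    """Count bytes up to the first zero byte or byte with the msb set."""
    count = 0
    for byte in data:
        if byte == 0 or byte > 0x7F:   # the {c|z} ret case
            return count
        count += 1                     # ijnz a,#l0
    return count                       # no terminator found in this buffer

print(taqoz_len(b"HELLO WORLD!\x00"))
```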
But of course this won't work in hubexec mode since it uses rdfast, so I construct and assemble a loader that copies the code to a spare area in the TAQOZ kernel cog called COGMOD and run it. (Like the old RUNMOD in Tachyon).
TAQOZ# AT COGMOD := _COGMOD --- ok
TAQOZ# --- ok
--- ok-- Define a new routine that can load COGMOD memory from hub
TAQOZ# code LOADMOD ( src len -- )
07022 FD60_4428        setq a
07026 FB03_C223        rdlong _COGMOD,b    'read longs into cog'
0702A FD64_002D        ret
end --- ok
TAQOZ# --- ok
--- ok-- load code into free COGMOD memory
TAQOZ# AT LEN 2+ 5 LOADMOD --- ok
Then test the LEN code that has been loaded into the COGMOD area:
TAQOZ# " HELLO WORLD!" LAP COGMOD LAP . SPACE .LAP --- 12 129 cycles= 645ns @200MHz ok
Then test the speed of LOADMOD itself:
TAQOZ# AT LEN 2+ 5 LAP LOADMOD LAP .LAP --- 113 cycles= 565ns @200MHz ok
So while I am getting these P2D2s out I am having fun with testing whatifs with assembly code. I even tested a fast SD virtual memory address translator that checks if the 4GB virtual address is buffered and returns with the hub address in less than 1us if it is, so that it can be read/written etc. Of course, if it is not buffered it needs to flush and read in the relevant sector, before returning with the address.
        rdfast #0, a
        mov a,#0
loop    rfbyte x wcz
        if_nz_and_nc ijnz a, #loop
        ret
Good idea, so I tried it:
TAQOZ# code LEN ( string -- len )
0703A FC78_0022        rdfast #0, a
0703E F604_4400        mov a,#0
07042 FD78_1610  .l0   rfbyte x wcz
07046 1B8C_45FE        {nc&nz} ijnz a, #l0
0704A FD64_002D        ret
end --- ok
TAQOZ# --- ok
TAQOZ# AT LEN 2+ 5 LOADMOD --- ok
TAQOZ# --- ok
TAQOZ# " HELLO WORLD!" COGMOD . --- 12 ok
TAQOZ# " HELLO WORLD!" LAP COGMOD LAP .LAP --- 105 cycles= 525ns @200MHz ok
The optimized code that took 129 cycles before now takes 105!
TAQOZ# BUFFERS $400 'A' FILL --- ok
TAQOZ# 0 BUFFERS $200 + C! --- ok
TAQOZ# BUFFERS LEN$ . --- 512 ok
TAQOZ# BUFFERS LAP LEN$ LAP .LAP --- 3,105 cycles= 15,525ns @200MHz ok
TAQOZ# 15525 512 / . --- 30 ok
30ns or 6 clocks per character ain't bad.

        rdfast #0, a
        getct b
        rep #2, #0              ' loop 2 next instructions forever until we branch out
        rfbyte x wcz
        if_z_or_c jmp #exit     ' leave loop
exit    getct a
        sub a, b
        sub a, #10              ' also account for overhead clocks
        shr a, #2               ' a now contains the valid string length before char==0 or > 127
        ret
loop:   check P2D2 available
        ijnz ease
        jmp loop
ease:
I was trying yesterday to figure out how to make it go faster, and I got to thinking about REP and a conditional JMP out, but it never occurred to me to use the system counter. I couldn't do better than 6 clocks. Yours is a pretty surprising idea. You'd have to STALLI before and ALLOWI after if interrupts were occurring.
        rdfast #0, a
        rep #2, #0              ' loop 2 next instructions forever until we branch out
        rfbyte x wcz
        if_z_or_c jmp #exit     ' leave loop
exit    getptr b
        sub b, a
        sub b, #1               ' don't count trailing null byte
        ret
Yes, that's the best way, so far.
        rdfast #0, a
        rep #5, #0              ' loop 5 next instructions forever until we branch out
        rfbyte x wcz
        if_nc_and_nz rfbyte x wcz
        if_nc_and_nz rfbyte x wcz
        if_nc_and_nz rfbyte x wcz
        if_z_or_c jmp #exit     ' leave loop
exit    getptr b
        sub b, a
        sub b, #1               ' don't count trailing null byte
        ret
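The unrolled loop can be sanity-checked with an idealized Python model. This sketch assumes the read pointer advances by one per byte actually read and that the conditional reads are skipped once a terminator has set the flags; it deliberately ignores any pipeline or FIFO side effects on real hardware, and the function name is hypothetical.

```python
def scan_unrolled(mem: bytes, start: int = 0, unroll: int = 4) -> int:
    """Idealized model of the unrolled rfbyte scan; expects a terminator in mem."""
    ptr = start                              # rdfast #0, a
    while True:                              # rep #5, #0
        terminated = False
        for _ in range(unroll):
            byte = mem[ptr]                  # rfbyte x wcz
            ptr += 1
            terminated = byte == 0 or byte > 0x7F
            if terminated:
                break                        # remaining if_nc_and_nz reads are skipped
        if terminated:                       # if_z_or_c jmp #exit
            return ptr - start - 1           # getptr b / sub b,a / sub b,#1

print(scan_unrolled(b"HELLO WORLD!\x00"))
```

In this model the result does not depend on where in the group of four the terminator falls, which is what makes the unrolling safe for any string length.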
Update:
Using 4 bytes per loop is as fast as or faster than the non-unrolled loop for all string lengths > 1 byte. Shortening to 3 bytes per loop is as fast or faster for all non-empty strings, and isn't too much of a performance hit (2.66 clocks per byte vs 2.5 clocks per byte on average). You could unroll further to approach the best case of 2 clocks per byte scanned, but the returns diminish, short-string performance gets punished more for each byte you add to the loop, and it burns more COGRAM.
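The per-byte averages can be reproduced with a one-line formula. The sketch assumes 2 clocks per instruction and no overhead from REP itself, which is inferred from the 2.66 and 2.5 figures quoted above rather than taken from a datasheet; each pass then costs one rfbyte per byte plus one conditional jmp.

```python
def clocks_per_byte(unroll: int) -> float:
    """Average clocks per byte for a loop of `unroll` rfbytes plus one jmp."""
    instructions = unroll + 1
    return 2 * instructions / unroll

for n in (1, 2, 3, 4, 8, 16):
    print(n, clocks_per_byte(n))
```

The value falls towards 2 as the unroll factor grows, but each doubling buys less, which is the diminishing return described above.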
Therefore an extra RFBYTE has been executed and the GETPTR returns +1 more than it would otherwise, i.e. using a JMP #\exit will give a different outcome for GETPTR.
PS: And the CZ flags, as well as "x", will reflect the subsequent FIFO byte read too.