TIA - An interactive inline assembler for TAQOZ
Peter Jakacki
I have been putting off writing an assembler for ages, mainly because I didn't have a clue about many of the new P2 instructions. Another reason was that I wanted the assembler to use the same syntax as PASM. However, I decided that any assembler is better than no assembler, so I started on the back-end in Forth fashion. I still have a bit to go, but the bulk of the back-end is done and, in unsurprising TAQOZ fashion, the code only takes around 1kB with another 2kB for the lexicon. So now I can start coding assembler interactively with TAQOZ on the P2. Later on I will add a front-end that processes files in PASM fashion and passes them to the back-end.
Here's a quick sample of a fast Fibonacci and how the listing looks when it is typed or pasted in. I then test it interactively and compare results and timings with the pure Forth version "fibo". Finally, a listing produced by p2asm shows that it generates the same code.
Fast fibo in TIA:
code ffibo ( n -- f )
        x # 0 mov,
        y # 1 mov,
        a a FOR,
        y x add,
        z x mov,
        x y mov,
        y z mov,
        a NEXT,
        _ret_ a x mov,
end,
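For readers who don't read P2 assembly, here is a rough Python model of what the loop above does with its registers. It is only a sketch: the names `x`, `y`, `z` and `a` follow the listing, the 32-bit masking and the `n >= 1` restriction are my assumptions about how the hardware loop behaves, and nothing here is part of TIA itself.

```python
MASK32 = 0xFFFFFFFF  # assumption: registers are 32 bits wide and wrap on overflow


def ffibo(n: int) -> int:
    """Model of the ffibo register loop; n is the loop count taken from 'a'."""
    if n < 1:
        # A decrement-and-jump-if-not-zero loop always runs at least once,
        # so this sketch only models counts of 1 or more.
        raise ValueError("n must be >= 1")
    x, y = 0, 1                 # x # 0 mov,   y # 1 mov,
    a = n
    while True:                 # loop body starts at the FOR,
        y = (y + x) & MASK32    # y x add,
        z = x                   # z x mov,
        x = y                   # x y mov,
        y = z                   # y z mov,
        a -= 1                  # a NEXT,  (decrement and loop while non-zero)
        if a == 0:
            break
    return x                    # _ret_ a x mov,  (result goes back in 'a')


if __name__ == "__main__":
    print(ffibo(46))
```

Each pass computes the new sum in `y`, then swaps the pair through `z`, so `x` holds the n-th Fibonacci number after n passes.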
Pasting that in produces this output:
TAQOZ# code ffibo ( <condition> <dest> <#> <src> <effects> instr, )
( n -- f )
062AC: F604_1600        x # 0 mov,
062B0: F604_1801        y # 1 mov,
                        a a FOR,
062B4: F100_180B        y x add,
062B8: F600_1A0B        z x mov,
062BC: F600_160C        x y mov,
062C0: F600_180D        y z mov,
062C4: FB6C_45FB        a NEXT,
062C8: 0600_440B        _ret_ a x mov,
end,  ok
Testing this in the terminal along with the standard fibo version:
TAQOZ# 46 ffibo . --- 1836311903 ok
TAQOZ# 46 fibo . --- 1836311903 ok
TAQOZ# 46 LAP ffibo LAP .LAP --- 1,184 cycles= 4,736ns @250MHz ok
TAQOZ# 46 LAP fibo LAP .LAP --- 3,296 cycles= 13,184ns @250MHz ok
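The cycle counts and nanosecond figures in that session are related by the 250MHz clock, and the ratio of the two counts gives the speedup. A small Python sketch of that arithmetic (the function name `cycles_to_ns` is mine, not a TAQOZ word):

```python
def cycles_to_ns(cycles: int, clock_hz: int = 250_000_000) -> int:
    """Convert a cycle count to whole nanoseconds at the given clock rate."""
    return cycles * 1_000_000_000 // clock_hz


if __name__ == "__main__":
    fast, forth = 1184, 3296          # cycle counts reported by .LAP above
    print(cycles_to_ns(fast), "ns")   # assembler version
    print(cycles_to_ns(forth), "ns")  # pure Forth version
    print(f"speedup: {forth / fast:.2f}x")
```

At 250MHz each cycle is 4ns, so the assembler version comes out a little under three times faster than the Forth one for this input.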
and finally the listing that p2asm produces with the equivalent code:
01418          ffibo    '( n -- f ) fast fibo - a = tos '
01418 f6041600          mov     x,#0
0141c f6041801          mov     y,#1
01420 f100180b .l0      add     y,x
01424 f6001a0b          mov     z,x
01428 f600160c          mov     x,y
0142c f600180d          mov     y,z
01430 fb6c45fb          djnz    a,#.l0
01434 0600440b          _ret_   mov     a,x
P.S. Yes, I am getting P2D2 boards together and tested. This is my break away from it all.
Comments
@ErNa - I agree that I'd rather not assemble these boards myself, but I lucked out with my friend who hasn't been available, but nonetheless I am close to finishing and testing the boards.
Like many others I like to program Forth in any language ;-)
Testing:
Here is an even faster fibo that avoids the register swap each iteration and alternates between registers. (EDIT: actually I just found a bug that crept in - with the previous post also - see if you can spot it)
Checking that the forward reference was resolved:
and of course testing it:
The dots are missing from the jump targets: #l0 instead of #.l0
Hint: Check instruction encoding. If you are used to reading P2 machine code, then it should be glaringly obvious without having to know the exact code
Correct! (The first 4-bit field is the conditional execution field, and since "never" is a fairly useless condition, it was instead leveraged to specify a return after the current instruction.)
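To make the encoding hint concrete, here is a small Python sketch that pulls the fields out of a 32-bit P2 instruction word, assuming the usual EEEE OOOOOOO CZI DDDDDDDDD SSSSSSSSS layout. The names `P2Fields`, `decode` and `cond_name` are made up for this illustration and are not part of TIA or p2asm.

```python
from typing import NamedTuple


class P2Fields(NamedTuple):
    cond: int    # bits 31..28, conditional execution field (0 means _ret_)
    opcode: int  # bits 27..21
    c: int       # bit 20
    z: int       # bit 19
    i: int       # bit 18, set when the source is an immediate
    d: int       # bits 17..9, destination field
    s: int       # bits 8..0, source field


def decode(word: int) -> P2Fields:
    """Split a 32-bit instruction word into its fields."""
    return P2Fields(
        cond=(word >> 28) & 0xF,
        opcode=(word >> 21) & 0x7F,
        c=(word >> 20) & 1,
        z=(word >> 19) & 1,
        i=(word >> 18) & 1,
        d=(word >> 9) & 0x1FF,
        s=word & 0x1FF,
    )


def cond_name(cond: int) -> str:
    """Name the two condition values that matter for spotting a _ret_."""
    if cond == 0b0000:
        return "_ret_"
    if cond == 0b1111:
        return "always"
    return f"conditional ({cond:04b})"


if __name__ == "__main__":
    for word in (0xF600160C, 0x0600440B):  # two mov encodings from the first post
        f = decode(word)
        print(f"{word:08X}: {cond_name(f.cond)} opcode={f.opcode:07b} d={f.d:03X} s={f.s:03X}")
```

With this view, an instruction that was meant to carry _ret_ but starts with an F nibble stands out immediately, since F is the "always" condition and 0 is the return.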
The bug was due to a new line evaluation routine that wasn't converting case, and in this "case" the _ret_ wasn't being found in the dictionary because it was a _RET_, and my current default during testing was to skip that evaluation failure. All that the new version really does is defer execution of the instructions until the end of the line, and also filter and substitute certain characters so that it can be evaluated as a standard Forth expression.
This btw, is the correct output (I removed the need for the dot for local labels)
Do we really need to use a ## or is that some limitation of PNut?
Here's a quick check:
BTW, this code here is not meant to be a working example, simply checking the assembler.
I would think that once we have an immediate operand, then we already know what size it is, although I guess with symbols that might mean it doesn't need an augs and thus reduces the code size. But I think the fact that PNut does not produce a line by line listing is a far greater disadvantage than the forced ##. I agree that having it is good when you want to force it, but not having to use it is also good.
So I wrote a check routine for combined augs and augd in the one operation. It starts to look ugly when you want to write a 32-bit immediate value to a 20-bit hub address. Only takes 16 bytes to do it!
I know this is an extreme example, but actually it's good to know that you can do it.
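As a rough illustration of the point about already knowing the size of an immediate, here is a Python sketch of how an assembler could decide on its own whether a prefix is needed, and how a 32-bit value splits between the prefix and the 9-bit field in the instruction. The helper names `needs_aug` and `split_immediate` are hypothetical and are not TIA words; the split assumes the prefix carries the upper 23 bits.

```python
def needs_aug(value: int) -> bool:
    """True when the immediate does not fit in the 9-bit instruction field."""
    return (value & 0xFFFFFFFF) > 0x1FF


def split_immediate(value: int) -> tuple[int, int]:
    """Return (upper 23 bits for the prefix, lower 9 bits for the instruction)."""
    value &= 0xFFFFFFFF
    return value >> 9, value & 0x1FF


if __name__ == "__main__":
    for imm in (1, 0x1FF, 0x200, 0x12345678):
        if needs_aug(imm):
            hi, lo = split_immediate(imm)
            print(f"{imm:#010x}: prefix carries {hi:#08x}, instruction carries {lo:#05x}")
        else:
            print(f"{imm:#010x}: fits in the instruction, no prefix needed")
```

With a literal value the decision can be made at assembly time. With a symbol that is not yet resolved it cannot, which is where forcing the long form by hand is still useful.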
So I punch in this piece of code into TIA
and get it to assemble correctly:
So it saves an instruction and would never not branch in reality, plus the wcz in the rfbyte is handy for testing the msb as a terminator as well. I have the normal conditionals such as if_c_or_z but I also have shorter less verbose variants like {c|z}.
But of course this won't work in hubexec mode since it uses rdfast, so I construct and assemble a loader that copies the code to a spare area in the TAQOZ kernel cog called COGMOD and run it. (Like the old RUNMOD in Tachyon).
Then test the LEN code that has been loaded into the COGMOD area:
So while I am getting these P2D2s out I am having fun with testing whatifs with assembly code. I even tested a fast SD virtual memory address translator that checks if the 4GB virtual address is buffered and returns with the hub address in less than 1us if it is, so that it can be read/written etc. Of course, if it is not buffered it needs to flush and read in the relevant sector, before returning with the address.
Good idea, so I tried it:
The optimized code that took 129 cycles before now takes 105!
I was trying yesterday to figure out how to make it go faster, and I got to thinking about REP and a conditional JMP out, but it never occurred to me to use the system counter. I couldn't do better than 6 clocks. Yours is a pretty surprising idea. You'd have to STALLI before and ALLOWI after if interrupts were occurring.
Yes, that's the best way, so far.
Update:
Using 4 bytes per loop is as fast as or faster than not unrolling for all string lengths > 1 byte in size. Shortening to 3 bytes per loop is as fast or faster for all string lengths that are not zero, and isn't too much of a performance hit (2.66 clocks per byte vs 2.5 clocks per byte on average). You could unroll further to approach the best case of 2 clocks per byte scanned, but the returns diminish and short string performance gets punished more for each byte you add to the loop (plus it burns more COGRAM).
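The averages quoted here fit a simple cost model, sketched below in Python. The model is my inference from the quoted figures, not something stated in the thread: I assume 2 clocks per byte read plus a fixed 2 clocks of loop overhead per pass, and the function name `clocks_per_byte` is made up.

```python
def clocks_per_byte(bytes_per_loop: int,
                    clocks_per_read: int = 2,
                    loop_overhead: int = 2) -> float:
    """Average clocks per byte for a scan loop unrolled to bytes_per_loop reads.

    Assumed model: each pass costs bytes_per_loop reads plus a fixed overhead.
    """
    if bytes_per_loop < 1:
        raise ValueError("bytes_per_loop must be >= 1")
    return (bytes_per_loop * clocks_per_read + loop_overhead) / bytes_per_loop


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 8, 16):
        print(f"{n:2d} bytes per loop: {clocks_per_byte(n):.2f} clocks per byte")
```

Under that assumption the average only creeps toward 2 clocks per byte as the unroll factor grows, which matches the diminishing returns described above. It says nothing about the short-string penalty, which depends on how the loop exits.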
Therefore an extra RFBYTE has been executed and GETPTR returns one more than it would otherwise, i.e. using a JMP #\exit will give a different outcome for GETPTR.
PS: And the CZ flags, as well as "x", will reflect the subsequent FIFO byte read too.