TIA - An interactive inline assembler for TAQOZ

Peter Jakacki · 2020-02-25 05:00

I have been putting off writing an assembler for ages, mainly because I didn't have a clue about many of the new P2 instructions. Another reason was that I wanted the assembler to use the same syntax as PASM. However, I decided that any assembler is better than no assembler, so I started on the back-end in Forth fashion. I still have a bit to go, but the bulk of the back-end is done and, in unsurprising TAQOZ fashion, the code only takes around 1kB with another 2kB for the lexicon. So now I can start coding assembler interactively with TAQOZ on the P2. Later on I will add a front-end that processes files in PASM fashion and passes the result to the back-end.

Here's a quick sample of a fast Fibonacci and how it looks in the listing when it is typed/pasted in. I then test it interactively and compare results and timings with the pure Forth version "fibo". Finally, a listing produced by p2asm shows that it generates the same code.

Fast fibo in TIA:

code ffibo ( n -- f )
	x # 0 mov,
	y # 1 mov,
	a a FOR,
	  y x add,
	  z x mov,
	  x y mov,
	  y z mov,
	a NEXT,
_ret_ 	a x mov,
	end,

Pasting that in produces this output:

TAQOZ# code ffibo ( <condition> <dest> <#> <src> <effects>  instr, )                             ( n -- f )
062AC: F604_1600                 x # 0 mov,
062B0: F604_1801                 y # 1 mov,
                                 a a FOR,
062B4: F100_180B                   y x add,
062B8: F600_1A0B                   z x mov,
062BC: F600_160C                   x y mov,
062C0: F600_180D                   y z mov,
062C4: FB6C_45FB                 a NEXT,
062C8: 0600_440B            _ret_        a x mov,
                                 end, ok

Testing this in the terminal along with the standard fibo version:

TAQOZ# 46 ffibo . --- 1836311903  ok
TAQOZ# 46 fibo . --- 1836311903  ok
TAQOZ# 46 LAP ffibo LAP .LAP --- 1,184 cycles= 4,736ns @250MHz ok
TAQOZ# 46 LAP fibo LAP .LAP --- 3,296 cycles= 13,184ns @250MHz ok

and finally the listing that p2asm produces with the equivalent code:

01418              ffibo '( n -- f ) fast fibo - a = tos '
01418     f6041600 	mov	x,#0
0141c     f6041801 	mov	y,#1
01420     f100180b .l0	add	y,x
01424     f6001a0b 	mov	z,x
01428     f600160c 	mov	x,y
0142c     f600180d 	mov	y,z
01430     fb6c45fb 	djnz	a,#.l0
01434     0600440b _ret_	mov	a,x
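The register shuffle in ffibo can also be checked off-target. The Python model below is only a sketch: it mirrors the add/mov sequence from the listings above, treats registers as 32-bit values, and assumes a count of at least 1. The name ffibo_model is made up for this illustration and is not part of TIA or TAQOZ.

```python
MASK32 = 0xFFFFFFFF  # P2 registers are 32 bits wide

def ffibo_model(n):
    """Mirror the ffibo loop; n >= 1 is assumed (n = 0 is not modeled)."""
    x, y = 0, 1
    for _ in range(n):          # FOR, ... NEXT, assembles to a DJNZ loop on a
        y = (y + x) & MASK32    # add y,x
        z = x                   # mov z,x
        x = y                   # mov x,y
        y = z                   # mov y,z
    return x                    # _ret_ mov a,x

print(ffibo_model(46))  # 1836311903, the same value the terminal session reports
```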

P.S. Yes, I am getting P2D2 boards together and tested. This is my break away from it all.

cgracey · 2020-02-25 06:09

Nice, Peter!

ErNa · 2020-02-25 06:17

Peter, please give us a chance to follow you by bringing out the P2D2! I'll try my best to free you from production of additional hardware, if only we have your deep thought of how it has to be!

Peter Jakacki · 2020-02-25 07:55

One of the advantages of this assembler is that it is inherently extensible, and powerful macros can be created with a single line of Forth code. FOR, and NEXT, were very simple "macros" indeed.

@ErNa - I agree that I'd rather not assemble these boards myself, but I lucked out with my friend who hasn't been available, but nonetheless I am close to finishing and testing the boards.

ErNa · 2020-02-25 13:06

It's just that I outsourced a project the P1 can't do, but P2 will and as long as I have no throw in hardware module I can not control the design process and the architecture, as the way is not so important but the outcome. But I would very much like to have the P2 do the job due to many facts, definitely I don't rely on C++ ;-)
Like many others I like to program Forth in any language ;-)

Peter Jakacki · 2020-02-28 05:45

I have filled in practically all of the opcodes, although there are a few I still need to refine, and I have added a simple front-end to process regular assembler syntax. I'm using predefined labels at present, but I can easily treat symbols that start in column 1 as labels and use these. This is the ffibo from earlier on.

TAQOZ# code ffibo
06DB2: F604_1600                mov     x,#0
06DB6: F604_1801                mov     y,#1
06DBA: F100_180B        .l0     add     y,x
06DBE: F600_1A0B                mov     z,x
06DC2: F600_160C                mov     x,y
06DC6: F600_180D                mov     y,z
06DCA: FB6C_45FB                djnz    a,#l0
06DCE: F600_440B         _ret_  mov     a,x
                                end ---  ok

Testing:

TAQOZ# 46 ffibo . --- 1836311903  ok

Peter Jakacki · 2020-02-28 07:28

I was thinking about single pass forward references with conventional labels for interactive use and I came up with a simple scheme. When a label has not yet been defined it will simply save the current PC and use that for the instruction also. Once the label is defined, it resolves the reference depending upon the type. Of course in an interactive session in the listing it can't go back and relist the forward references but I might be able to add something to indicate that it has resolved these. Nonetheless it works well.
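The scheme described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the TIA source: the function names and the (label, target) program format are invented here, addresses are counted in whole instructions, and the DJNZ base word is simply taken from the listings in this thread with its offset field cleared. An unresolved reference is emitted pointing at its own PC (offset -1) and patched once the label appears.

```python
DJNZ_A = 0xFB6C4400  # djnz a,#rel with a zero offset field (from the listings)

def assemble(program):
    """program: list of (label, branch_target) pairs, one per instruction.
    Returns the signed branch offset per instruction (None if no branch)."""
    labels, pending, offsets = {}, {}, []
    for pc, (label, target) in enumerate(program):
        if label is not None:
            labels[label] = pc
            for ref in pending.pop(label, []):   # patch earlier forward references
                offsets[ref] = pc - (ref + 1)
        if target is None:
            offsets.append(None)
        elif target in labels:                   # backward reference: known already
            offsets.append(labels[target] - (pc + 1))
        else:                                    # forward reference: remember this PC
            pending.setdefault(target, []).append(pc)
            offsets.append(pc - (pc + 1))        # placeholder targets its own PC
    return offsets

def djnz_word(offset):
    """Pack a signed offset into the 9-bit source field."""
    return DJNZ_A | (offset & 0x1FF)

# Shape of ffibo2: l0 on the third instruction, l2 on the sixth
ffibo2 = [(None, None), (None, None), ("l0", None), (None, "l2"),
          (None, None), ("l2", None), (None, "l0"), (None, None)]
print([o for o in assemble(ffibo2) if o is not None])  # [1, -5]
```

With these assumptions the placeholder packs to FB6C_45FF, the resolved forward branch to FB6C_4401, and the backward branch to FB6C_45FB.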

Here is an even faster fibo that avoids the register swap each iteration and alternates between registers.

TAQOZ# code ffibo2
06DD0: F604_1600                mov     x,#0
06DD4: F604_1801                mov     y,#1
06DD8: F100_180B        .l0     add     y,x
06DDC: FB6C_45FF                djnz    a,#l2
06DE0: F600_440C         _ret_  mov     a,y
06DE4: F100_160C        .l2     add     x,y
06DE8: FB6C_45FB                djnz    a,#l0
06DEC: F600_440B         _ret_  mov     a,x
                                end ---  ok

(EDIT: actually I just found a bug that crept in - with the previous post also - see if you can spot it)

Checking that the forward reference was resolved:

TAQOZ# $6DDC @ .l --- $FB6C_4401 ok

and of course testing it:

TAQOZ# 46 ffibo2 . --- 1836311903  ok
TAQOZ# 46 LAP ffibo2 LAP .LAP --- 1,000 cycles= 4,000ns @250MHz ok
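The alternating-register idea can be modeled the same way. The Python below is a sketch that follows the ffibo2 listing instruction by instruction, assuming a count of at least 1; ffibo2_model is a name invented for this illustration. Each Fibonacci step costs one add and one djnz, instead of the add plus three movs of the first version.

```python
MASK32 = 0xFFFFFFFF

def ffibo2_model(n):
    """Mirror ffibo2; n >= 1 is assumed."""
    x, y = 0, 1
    while True:
        y = (y + x) & MASK32   # l0: add y,x
        n -= 1                 # djnz a,#l2
        if n == 0:
            return y           # _ret_ mov a,y
        x = (x + y) & MASK32   # l2: add x,y
        n -= 1                 # djnz a,#l0
        if n == 0:
            return x           # _ret_ mov a,x

print(ffibo2_model(46))  # 1836311903
```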

AJL · 2020-02-28 10:46

Peter Jakacki wrote: »
I was thinking about single pass forward references with conventional labels for interactive use and I came up with a simple scheme. When a label has not yet been defined it will simply save the current PC and use that for the instruction also. Once the label is defined, it resolves the reference depending upon the type. Of course in an interactive session in the listing it can't go back and relist the forward references but I might be able to add something to indicate that it has resolved these. Nonetheless it works well.

Here is an even faster fibo that avoids the register swap each iteration and alternates between registers.
TAQOZ# code ffibo2
06DD0: F604_1600                mov     x,#0
06DD4: F604_1801                mov     y,#1
06DD8: F100_180B        .l0     add     y,x
06DDC: FB6C_45FF                djnz    a,#l2
06DE0: F600_440C         _ret_  mov     a,y
06DE4: F100_160C        .l2     add     x,y
06DE8: FB6C_45FB                djnz    a,#l0
06DEC: F600_440B         _ret_  mov     a,x
                                end ---  ok
(EDIT: actually I just found a bug that crept in - with the previous post also - see if you can spot it)

Checking that the forward reference was resolved:
TAQOZ# $6DDC @ .l --- $FB6C_4401 ok
and of course testing it:
TAQOZ# 46 ffibo2 . --- 1836311903  ok
TAQOZ# 46 LAP ffibo2 LAP .LAP --- 1,000 cycles= 4,000ns @250MHz ok

The dots are missing from the jump targets: #l0 instead of #.l0

Peter Jakacki · 2020-02-28 13:37

@AJL - True, but not really a "bug", more of a temporary implementation and I will add some extras to handle labels smoothly.
Hint: Check instruction encoding. If you are used to reading P2 machine code, then it should be glaringly obvious without having to know the exact code

TonyB_ · 2020-02-28 14:21

_ret_ instructions should be 0xxx_xxxx.

Peter Jakacki · 2020-02-28 14:46

TonyB_ wrote: »

_ret_ instructions should be 0xxx_xxxx.

Correct! (The first 4-bit field is the conditional execution field, and since "never" is a fairly useless condition, it was instead leveraged to specify a return after the current instruction.)
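To make the point concrete, here is a small Python sketch that swaps only the top four bits of an instruction word. The helper names are invented for this illustration; the instruction words are copied from the listings in this thread.

```python
def fmt(word):
    """Format a 32-bit word the way the TAQOZ listing does."""
    return f"{word >> 16:04X}_{word & 0xFFFF:04X}"

def with_cond(word, cond):
    """Replace the top four bits (the condition field) of an instruction word."""
    return (cond << 28) | (word & 0x0FFFFFFF)

ALWAYS = 0b1111   # unconditional execution
RET    = 0b0000   # the "never" slot, reused for _ret_

mov_a_x = 0xF600440B                   # mov a,x as emitted by the buggy listing
print(fmt(with_cond(mov_a_x, RET)))    # 0600_440B
```

The buggy listings show F600_440B, meaning the _ret_ prefix was silently dropped and the instruction was assembled as an ordinary unconditional mov.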

The bug was due to a new line evaluation routine that wasn't converting case, and in this "case" the _ret_ wasn't being found in the dictionary because it was a _RET_, and my current default during testing was to skip that evaluation failure. All that the new version really does is defer execution of the instructions until the end of the line, and also filter and substitute certain characters so that it can be evaluated as a standard Forth expression.

This btw, is the correct output (I removed the need for the dot for local labels)

TAQOZ# code ffibo2
06E52: F604_1600                mov     x,#0
06E56: F604_1801                mov     y,#1
06E5A: F100_180B        l0      add     y,x
06E5E: FB6C_45FF                djnz    a,#l2
06E62: 0600_440C         _ret_  mov     a,y
06E66: F100_160C        l2      add     x,y
06E6A: FB6C_45FB                djnz    a,#l0
06E6E: 0600_440B         _ret_  mov     a,x
                                end ---  ok

Peter Jakacki · 2020-03-01 06:42

Taking a break on a Sunday afternoon and while testing TIA I decided to create an automatic AUGS and/or AUGD instead of forcing it with the ##. If the source and/or destination is out of range it will augment it automatically. I think TIA would make a great tool for learning P2 assembly code since you can interact with it and debug it from the TAQOZ shell very easily, as seen in the previous examples.

Do we really need to use a ## or is that some limitation of PNut?

Here's a quick check:

TAQOZ# code augment
06EA2: F604_4400                mov     a,#0
06EA6: F604_4464                mov     a,#100
06EAA: F604_45FF                mov     a,#511
06EAE: FF00_0001                mov     a,#512
06EB2: F604_4400
06EB6: FF1D_CD65                mov     a,#1_000_000_000
06EBA: F604_4400
06EBE: FF7F_FFFF                mov     a,#-1
06EC2: F604_45FF
06EC6: FCDC_0408                rep     #2,#8
06ECA: 0000_0000                nop
06ECE: 0000_0000                nop
06ED2: FF1D_CD65                rep     #2,#1_000_000_000
06ED6: FCDC_0400
06EDA: 0000_0000                nop
06EDE: 0000_0000                nop
06EE2: FD64_002D                ret
                        end ---  ok
TAQOZ# code wrhub
06EE8: FF00_0001                mov     a,#1000
06EEC: F604_45E8
06EF0: FFFF_FFFF                wrlong  #-1,#$80
06EF4: FC6F_FE80
06EF8: FF00_0080                wrlong  #-1,#$1_0000
06EFC: FFFF_FFFF
06F00: FC6F_FE00
06F04: FD64_002D                ret
                        end ---  ok

BTW, this code here is not meant to be a working example, simply checking the assembler.
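The automatic augmentation can be sketched in Python as follows. This is an illustration, not the TIA implementation: mov_a_imm is a made-up name, and the two base words are read off the listing above with their immediate fields cleared. Values that fit in 9 bits go straight into the instruction; anything larger gets an AUGS prefix carrying the upper 23 bits.

```python
MOV_A_IMM = 0xF6044400   # mov a,#0 from the listing above
AUGS      = 0xFF000000   # AUGS prefix with a zero payload

def mov_a_imm(value):
    """Return the long(s) for mov a,#value, adding AUGS only when needed."""
    value &= 0xFFFFFFFF               # negative values wrap to 32 bits
    low9 = value & 0x1FF
    if value <= 0x1FF:                # fits the 9-bit immediate field
        return [MOV_A_IMM | low9]
    return [AUGS | (value >> 9), MOV_A_IMM | low9]

for v in (511, 512, 1_000_000_000, -1):
    print(v, [f"{w:08X}" for w in mov_a_imm(v)])
```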

cgracey · 2020-03-01 07:13

We don't need it in PNut, but I like to have it so I know for sure how long my code is and how long it takes to execute.

Peter Jakacki · 2020-03-01 09:08

cgracey wrote: »

We don't need it in PNut, but I like to have it so I know for sure how long my code is and how long it takes to execute.

I would think that once we have an immediate operand, then we already know what size it is, although I guess with symbols that might mean it doesn't need an augs and thus reduces the code size. But I think the fact that PNut does not produce a line by line listing is a far greater disadvantage than the forced ##. I agree that having it is good when you want to force it, but not having to use it is also good.

Cluso99 · 2020-03-01 09:38

My preference is to leave ## and report an error otherwise. It’s just like reporting a warning for the missing # in jump instructions.

ErNa · 2020-03-01 11:16

@ Peter: ... --- ... -.-. .... . -.-. -.- -.-- --- ..- .-. .--. --

Peter Jakacki · 2020-03-01 12:10

My preference is to always have the option; this and that is always better than this and not that.

So I wrote a check routine for combined augs and augd in the one operation. It starts to look ugly when you want to write a 32-bit immediate value to a 20-bit hub address. Only takes 16 bytes to do it!

I know this is an extreme example, but actually it's good to know that you can do it.

TAQOZ# code wrhub
06EA2: FF00_0080                wrlong  #$12345678,#$1_0000
06EA6: FF89_1A2B
06EAA: FC6C_F000
06EAE: FD64_002D                ret
                        end ---  ok
TAQOZ# wrhub ---  ok
TAQOZ# $10000 QD --- 
0001_0000: 78 56 34 12  00 00 00 00  00 00 00 00  00 00 00 00     'xV4.............'
0001_0010: 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00     '................' ok
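The combined case can be reproduced with a short Python sketch. The function name is hypothetical and the base words are taken from the listings in this thread with their payload fields cleared; the prefix order (AUGS for the address, then AUGD for the data) simply follows what the listing shows.

```python
AUGS, AUGD = 0xFF000000, 0xFF800000
WRLONG_IMM = 0xFC6C0000   # wrlong #d,#s with both 9-bit fields zero

def wrlong_imm(data, addr):
    """Return the longs for wrlong #data,#addr, prefixing AUGS/AUGD as needed."""
    data &= 0xFFFFFFFF
    longs = []
    if addr > 0x1FF:
        longs.append(AUGS | (addr >> 9))    # upper bits of the hub address
    if data > 0x1FF:
        longs.append(AUGD | (data >> 9))    # upper bits of the immediate data
    longs.append(WRLONG_IMM | ((data & 0x1FF) << 9) | (addr & 0x1FF))
    return longs

print([f"{w:08X}" for w in wrlong_imm(0x12345678, 0x10000)])
print((0x12345678).to_bytes(4, "little").hex(" "))  # byte order seen in the hub dump
```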

dMajo · 2020-03-01 17:25

ErNa wrote: »

@ Peter: ... --- ... -.-. .... . -.-. -.- -.-- --- ..- .-. .--. --

For others who don't read morse code:

sos check your pm

Peter Jakacki · 2020-03-10 04:10

While I wait to pick up assembled P2D2 modules tomorrow, I knew that the IJNZ instruction could be useful, but I have only just used it now in optimizing a routine that finds the length of an ASCII string that is terminated by a null or >$7F.
So I punch in this piece of code into TIA

code LEN ( string -- len )
	rdfast  #0,a
        mov     a,#0
.l0	rfbyte  x wcz			' read a byte
 {c|z}	ret				' terminate if zero or > $7F'
 	ijnz	a,#l0
	end

and get it to assemble correctly:

TAQOZ# code LEN ( string -- len )
07006 FC78_0022         rdfast  #0,a
0700A F604_4400         mov     a,#0
0700E FD78_1610 .l0     rfbyte  x wcz                   ' read a byte
07012 ED64_002D  {c|z}  ret                             ' terminate if zero or > $7F'
07016 FB8C_45FD         ijnz    a,#l0
                        end ---  ok

So it saves an instruction and would never not branch in reality, plus the wcz in the rfbyte is handy for testing the msb as a terminator as well. I have the normal conditionals such as if_c_or_z but I also have shorter less verbose variants like {c|z}.
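As a reference for what LEN is meant to compute, here is a plain Python sketch. It is a behavioral model only, with an invented function name, and it says nothing about timing: the loop stops on a zero byte (the Z case) or on a byte with its top bit set (the C case), and otherwise increments the count.

```python
def string_len(memory, start=0):
    """Count bytes from start until a null or a byte above $7F is read."""
    length = 0
    while True:
        byte = memory[start + length]    # rfbyte x wcz
        if byte == 0 or byte > 0x7F:     # {c|z} ret
            return length
        length += 1                      # ijnz a,#l0

print(string_len(b"HELLO WORLD!\x00"))   # 12
```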

But of course this won't work in hubexec mode since it uses rdfast, so I construct and assemble a loader that copies the code to a spare area in the TAQOZ kernel cog called COGMOD and run it. (Like the old RUNMOD in Tachyon).

TAQOZ# AT COGMOD         := _COGMOD ---  ok
TAQOZ#  ---  ok
 ---  ok-- Define a new routine that can load COGMOD memory from hub
TAQOZ# code LOADMOD ( src len -- )
07022 FD60_4428         setq    a
07026 FB03_C223         rdlong  _COGMOD,b               'read longs into cog'
0702A FD64_002D         ret
                        end ---  ok
TAQOZ#  ---  ok
 ---  ok-- load code into free COGMOD memory
TAQOZ# AT LEN 2+ 5 LOADMOD ---  ok

Then test the LEN code that has been loaded into the COGMOD area:

TAQOZ# " HELLO WORLD!" LAP COGMOD LAP . SPACE .LAP --- 12  129 cycles= 645ns @200MHz ok

Then test the speed of LOADMOD itself:
TAQOZ# AT LEN 2+ 5 LAP LOADMOD LAP .LAP --- 113 cycles= 565ns @200MHz ok

So while I am getting these P2D2s out I am having fun with testing whatifs with assembly code. I even tested a fast SD virtual memory address translator that checks if the 4GB virtual address is buffered and returns with the hub address in less than 1us if it is, so that it can be read/written etc. Of course, if it is not buffered it needs to flush and read in the relevant sector, before returning with the address.

rogloh · 2020-03-10 04:22

Interesting use of the fifo when checking the string length, it should be nice and fast. I wonder if you could make the ijnz use if_nz_and_nc to tighten your loop further, like this:

                  rdfast   #0, a
                  mov      a,#0
loop              rfbyte   x wcz
   if_nz_and_nc   ijnz     a, #loop
                  ret

Peter Jakacki · 2020-03-10 04:33

rogloh wrote: »
Interesting use of the fifo when checking the string length, it should be nice and fast. I wonder if you could make the ijnz use if_nz_and_nc to tighten your loop further, like this:
                  rdfast   #0, a
                  mov      a,#0
loop              rfbyte   x wcz
   if_nz_and_nc   ijnz     a, #loop
                  ret

Good idea, so I tried it:

TAQOZ# code LEN ( string -- len )
0703A FC78_0022         rdfast   #0, a
0703E F604_4400         mov      a,#0
07042 FD78_1610 .l0     rfbyte   x wcz
07046 1B8C_45FE {nc&nz} ijnz     a, #l0
0704A FD64_002D         ret
                        end ---  ok
TAQOZ#  ---  ok
TAQOZ# AT LEN 2+ 5 LOADMOD ---  ok
TAQOZ#  ---  ok
TAQOZ# " HELLO WORLD!" COGMOD . --- 12  ok
TAQOZ# " HELLO WORLD!" LAP COGMOD LAP .LAP --- 105 cycles= 525ns @200MHz ok

The optimized code that took 129 cycles before now takes 105!

rogloh · 2020-03-10 04:34

Nice.

Peter Jakacki · 2020-03-10 05:17

So I updated the kernel LEN$ word with this routine and tried it out by filling a buffer area:

TAQOZ# BUFFERS $400 'A' FILL ---  ok
TAQOZ# 0 BUFFERS $200 + C! ---  ok
TAQOZ# BUFFERS LEN$ . --- 512  ok
TAQOZ# BUFFERS LAP LEN$ LAP .LAP --- 3,105 cycles= 15,525ns @200MHz ok
TAQOZ# 15525 512 / . --- 30  ok

30ns or 6 clocks per character ain't bad.
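The arithmetic behind that figure, as a small Python sketch (lap_ns is a made-up helper, not a TAQOZ word): cycles are converted to nanoseconds at the reported clock rate and then divided by the 512 characters scanned.

```python
def lap_ns(cycles, clock_hz):
    """Convert a cycle count to whole nanoseconds at the given clock rate."""
    return cycles * 1_000_000_000 // clock_hz

total_ns = lap_ns(3105, 200_000_000)
print(total_ns)                   # 15525
print(total_ns // 512)            # 30 ns per character
print(round(3105 / 512, 2))       # 6.06 clocks per character
```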

rogloh · 2020-03-10 06:34

Another even faster way to speed it up to just 4 clocks per character scanned is if you use the execution timing to determine the string length with something like this perhaps, if I have the offsets & scaling correct:

                  rdfast   #0, a
                  getct    b
                  rep      #2, #0 ' loop 2 next instructions forever until we branch out
                  rfbyte   x wcz
   if_z_or_c      jmp      #exit  ' leave loop
exit              getct    a
                  sub      a, b
                  sub      a, #10   ' also account for overhead clocks
                  shr      a, #2   ' a now contains the valid string length before char==0 or > 127                    
                  ret

ErNa · 2020-03-10 09:59

Peter Jakacki wrote: »

While I wait to pick up assembled P2D2 modules tomorrow, I knew that the IJNZ instruction could be useful, but I have only just used it now in optimizing a routine that finds the length of an ASCII string that is terminated by a null or >$7F.

loop:
   check P2D2 available
   ijnz ease
   jmp loop
ease:

cgracey · 2020-03-10 20:50

rogloh wrote: »

Another even faster way to speed it up to just 4 clocks per character scanned is if you use the execution timing to determine the string length with something like this perhaps, if I have the offsets & scaling correct:

                  rdfast   #0, a
                  getct    b
                  rep      #2, #0 ' loop 2 next instructions forever until we branch out
                  rfbyte   x wcz
   if_z_or_c      jmp      #exit  ' leave loop
exit              getct    a
                  sub      a, b
                  sub      a, #10   ' also account for overhead clocks
                  shr      a, #2   ' a now contains the valid string length before char==0 or > 127                    
                  ret

I was trying yesterday to figure out how to make it go faster, and I got to thinking about REP and a conditional JMP out, but it never occurred to me to use the system counter. I couldn't do better than 6 clocks. Yours is a pretty surprising idea. You'd have to STALLI before and ALLOWI after if interrupts were occurring.

Electrodude · 2020-03-12 03:56

Using GETPTR instead of GETCT should shave off two instructions, and it doesn't require STALLI and ALLOWI:

                  rdfast   #0, a
                  rep      #2, #0 ' loop 2 next instructions forever until we branch out
                  rfbyte   x wcz
   if_z_or_c      jmp      #exit  ' leave loop
exit              getptr   b
                  sub      b, a
                  sub      b, #1 ' don't count trailing null byte
                  ret
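The pointer arithmetic in this version can be modeled in Python. This is a sketch with an invented function name: it assumes the FIFO read pointer sits one byte past the last byte read when the loop exits, and it does not model any hardware side effects of branching out of a REP block.

```python
def len_via_getptr(memory, start=0):
    """Derive the string length from where the read pointer ends up."""
    ptr = start
    while True:
        byte = memory[ptr]     # rfbyte x wcz
        ptr += 1               # the read pointer advances past the byte just read
        if byte == 0 or byte > 0x7F:
            break              # if_z_or_c jmp #exit
    return ptr - start - 1     # getptr b / sub b,a / sub b,#1

print(len_via_getptr(b"HELLO WORLD!\x00"))  # 12
```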

rogloh · 2020-03-12 04:49

Wow, I like that even more. I'd forgotten about being able to read back the fifo position.

cgracey · 2020-03-12 04:51

rogloh wrote: »

Wow, I like that even more. I'd forgotten about being able to read back the fifo position.

Yes, that's the best way, so far.

rogloh · 2020-03-12 04:58

Maybe some unrolling could speed it up if you know you are mainly working with strings above a reasonable minimum size. This scans 4 bytes in 10 clocks, but also delays the output slightly for 3 of the 4 cases. Another way to think about it is that execution time is quantized to the next long boundary.

                  rdfast   #0, a
                  rep      #5, #0 ' loop 5 next instructions forever until we branch out
                  rfbyte   x wcz
   if_nc_and_nz   rfbyte   x wcz
   if_nc_and_nz   rfbyte   x wcz
   if_nc_and_nz   rfbyte   x wcz
   if_z_or_c      jmp      #exit  ' leave loop
exit              getptr   b
                  sub      b, a
                  sub      b, #1 ' don't count trailing null byte
                  ret

Update:
Using 4 bytes per loop is at least as fast as the non-unrolled version for all string lengths > 1 byte. Shortening to 3 bytes per loop is at least as fast for all non-empty strings, and isn't too much of a performance hit (2.66 clocks per byte vs 2.5 clocks per byte on average). You could unroll further to approach the best case of 2 clocks per byte scanned, but the returns diminish and short-string performance is punished more for each byte you add to the loop (plus it burns more COGRAM).
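The per-byte figures follow from simple arithmetic, sketched here in Python. The assumption, taken from the "4 bytes in 10 clocks" figure above, is that each RFBYTE and the untaken JMP cost 2 clocks apiece in the steady state; clocks_per_byte is a name invented for this illustration.

```python
from fractions import Fraction

def clocks_per_byte(unroll):
    """Steady-state cost: `unroll` RFBYTEs plus one untaken JMP, 2 clocks each."""
    return Fraction(2 * unroll + 2, unroll)

for k in (1, 3, 4, 8, 16):
    print(k, float(clocks_per_byte(k)))
```

The result approaches 2 clocks per byte as the unroll factor grows, which is why each extra RFBYTE buys less than the one before it.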

evanh · 2020-03-12 05:29

Be aware that a relative branch as the last instruction of a REP block has an undocumented branch offset applied. In this case, because the destination is the following instruction, the offset will create one extra pass through the REP block.

Therefore an extra RFBYTE has been executed and the GETPTR returns +1 more than it would otherwise. ie: Using a JMP #\exit will be a different outcome for GETPTR.

PS: And the CZ flags, as well as "x", will reflect the subsequent FIFO byte read too.
