Best cross-toolchain way to do inline assembly?

rogloh · 2021-04-01 10:16

I have a SPIN2 function that plots lines using my video and memory driver, which is still under development. What is the best way to convert it into inline/callable PASM that works with both the FlexProp and PropTool/PNut toolchains, in order to speed things up further?

PUB plotLine(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen
    dx :=  abs(x1-x0)
    sx := (x0<x1) ? 1 : -1
    dy := -abs(y1-y0)
    sy := (y0<y1) ? 1 : -1
    xlen := 0
    ylen := 0
    err := dx+dy
    if dx == 0 ' vertical line optimization
        mem.gfxFill(buf+x0<<(bpp>>4)+ (y0<#y1)*width, width, 1, 1-dy, colour, 0, bpp>>3) 
        return
    if dy == 0 ' horizontal line optimization
        mem.fill(buf+((x0<#x1)<<(bpp>>4))+ y0*width, colour, 1+dx, 0, bpp>>3) 
        return
    repeat  ' continue until final co-ordinates are reached
        if (x0 == x1 && y0 == y1) 
            if ylen
                mem.gfxFill(buf+x0<<(bpp>>4)+(y0-(sy*ylen))*width, width, 1, ylen, colour, 0, bpp>>3) 
            if xlen
                mem.fill(buf+(x0-sx*xlen)<<(bpp>>4)+y0*width, colour, xlen, 0, bpp>>3)
            return
        e2 := err<<1
        if (e2 >= dy)
            if ylen
                mem.gfxFill(buf+x0<<(bpp>>4)+(y0-(sy*ylen))*width, width, 1, ylen, colour, 0, bpp>>3) 
                ylen:=0
            err += dy
            x0 += sx
            xlen++
        if (e2 <= dx)
            if xlen
                mem.fill(buf+(x0-sx*xlen)<<(bpp>>4)+y0*width, colour, xlen, 0, bpp>>3)
                xlen:=0
            err += dx
            y0 += sy
            ylen++

If I convert this code to PASM2, I know I can change the existing method calls to both mem.fill(...) and mem.gfxFill(...) to instead write directly to my memory mailbox, so plotLine won't have to call any deeper down SPIN2 code, meaning this entire method could be completely written in PASM2 for the fastest drawing speed. I imagine it would probably fit into the $12F longs allowed by PropTool for full COG exec.
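Before porting, a host-side reference can help when checking every octant. The following is a minimal sketch in Python, not the method above and not part of any driver: it generates the pixels with the same err/e2 Bresenham scheme, then groups them into horizontal and vertical runs of the kind the fill requests would cover. The names `bresenham_points`, `batch_runs` and `expand_runs` are hypothetical, and each run is reported from its lowest-coordinate end, which is an assumption about what a fill request wants.

```python
def bresenham_points(x0, y0, x1, y1):
    """Classic integer Bresenham using the same err/e2 scheme as the Spin2 method."""
    dx, sx = abs(x1 - x0), (1 if x0 < x1 else -1)
    dy, sy = -abs(y1 - y0), (1 if y0 < y1 else -1)
    err = dx + dy
    points = []
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            return points
        e2 = err << 1
        if e2 >= dy:        # step in x
            err += dy
            x0 += sx
        if e2 <= dx:        # step in y
            err += dx
            y0 += sy


def batch_runs(points):
    """Group consecutive pixels into ('h' | 'v', x, y, length) runs.

    (x, y) is the lowest-coordinate end of the run, so a run always
    extends in the positive direction regardless of the line's octant.
    """
    runs = []
    i, n = 0, len(points)
    while i < n:
        x, y = points[i]
        j = i + 1
        while j < n and points[j][1] == y:      # same row: horizontal run
            j += 1
        if j - i > 1:
            runs.append(("h", min(p[0] for p in points[i:j]), y, j - i))
        else:
            while j < n and points[j][0] == x:  # same column: vertical run
                j += 1
            runs.append(("v", x, min(p[1] for p in points[i:j]), j - i))
        i = j
    return runs


def expand_runs(runs):
    """Turn runs back into pixels, for comparison with the point list."""
    pixels = []
    for kind, x, y, length in runs:
        for k in range(length):
            pixels.append((x + k, y) if kind == "h" else (x, y + k))
    return pixels


if __name__ == "__main__":
    for end in [(5, 2), (2, 5), (-2, 5), (-5, 2), (-5, -2), (-2, -5), (2, -5), (5, -2)]:
        pts = bresenham_points(0, 0, *end)
        assert sorted(expand_runs(batch_runs(pts))) == sorted(pts)
        print(end, batch_runs(pts))
```

Comparing the runs produced by a PASM2 port against `batch_runs` for one endpoint per octant is a cheap way to catch sign mistakes when sx or sy is negative.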

But how can/should the code be written to work in both toolchains?

It looks like PropTool SPIN2 is fairly well documented as to what is allowed, but can FlexProp also support COGRAM execution like PNut, or is it HUB-exec only? Can I simply start to write the type of code shown below, and can the symbol names be retained from the locals and parameters? Or do I need to do it another way, using PR0-PR7 names etc.? What type of inline PASM2 coding actually works with both tools as a sort of lowest common denominator? Has anyone figured this out already? @ersmith, what is the state of play with inline PASM2 in FlexProp compared with what PropTool does? Is it fully documented anywhere how it all works and which P2 resources I can use while executing inline PASM2? Or am I worrying unnecessarily, and should I just code it up fully and try it out?

PUB plotLine(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen
    org
        mov     dx, x1
        sub     dx, x0
        abs     dx
        cmp     x0, x1 wc
  if_c  mov     sx, #1
  if_nc neg     sx, #1
        mov     dy, y1
        sub     dy, y0
        abs     dy
        neg     dy
        cmp     y0, y1 wc
   if_c mov     sy, #1
  if_nc neg     sy, #1
        mov     xlen, #0
        mov     ylen, #0
        mov     err, dx
        add     err, dy
        test    dx wz
   if_z jmp     #xx ' some code to do gfxFill etc
        test    dy wz
   if_z jmp     #zz ' some code to do fill etc
 ...
    end

ersmith · 2021-04-01 11:13

Roger, the flexspin inline assembly documentation is in general.md (since it applies to all 3 languages). There's some discussion of the difference between flexspin and PNut in the spin.md file as well. ORG/END blocks are automatically copied to LUT before execution. ORG with an address is not supported by flexspin, but plain ORG should generally work fine. The code you wrote above looks like it should work.

As I understand it, in PNut the inline assembly is resident in the COG (so the total amount of inline assembly must not exceed the area left for it by the interpreter) and the variables are copied from the stack into internal memory before running the assembly, and then copied back afterwards. In flexspin it's the reverse: the variables are already in COG memory, but the code has to be copied into LUT before execution. This means that flexspin lets you have an arbitrary number of inline assembly blocks, but there's a bit more latency before starting the asm (at least when there is more code than data). It is possible to "fix" a function in LUT or COG memory in flexspin by using {++cog} or {++lut} in its declaration, but that's best used extremely sparingly.

rogloh · 2021-04-01 11:45

Sounds good Eric, thanks for the explanation.

If it executes from LUTRAM, is the size of an ORG block in FlexSpin then limited to 512 longs, or is it something less than this? I think my code is hopefully going to fit. I am up to 114 total longs and am almost done coding. I don't imagine it will reach more than about 200 longs even if I add some clipping to it.

Can I use PA, and PB, PTRA, PTRB, registers as temporary registers or are they already used by FlexSpin? If they are can I just push them? Also can I do a CALL with the HW stack (one level deep)? How much of the HW stack can I use?

ersmith · 2021-04-01 12:36

If it executes from LUTRAM, is the size of an ORG block in FlexSpin then limited to 512 longs, or is it something less than this?

256 longs, actually. This is slightly less than PNut's limit of $130 longs, but on the other hand the PNut limit is the total for all ORG blocks in the program and its objects, whereas the FlexSpin limit is per inline assembly block. (Please do keep in mind that a lot of objects use inline assembly, so it would be a bad idea to use up too much of PNut's space in any one object.) (EDIT: this may not actually be right, I think I misread the docs -- PNut, like FlexSpin, loads the assembly at run time so the limit is not global.)
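As a quick sanity check on those figures, here is a small Python sketch. The two limits are simply the numbers quoted in this thread (256 longs per block for FlexSpin, $130 longs for PNut), not values read from either tool, and `fits_both` and `headroom` are hypothetical helper names.

```python
FLEXSPIN_BLOCK_LIMIT = 256   # longs per ORG/END block, as stated in this reply
PNUT_LIMIT = 0x130           # longs, the PNut figure quoted in this reply (304 decimal)

COMMON_LIMIT = min(FLEXSPIN_BLOCK_LIMIT, PNUT_LIMIT)


def fits_both(longs):
    """True if a block of this many longs is within both quoted limits."""
    return longs <= COMMON_LIMIT


def headroom(longs):
    """Longs left under the stricter of the two quoted limits."""
    return COMMON_LIMIT - longs


if __name__ == "__main__":
    for size in (114, 200):   # current and projected sizes mentioned in the question
        print(size, fits_both(size), headroom(size))
```

With these numbers the stricter limit is the FlexSpin one, so a block sized to fit in 256 longs is the lowest common denominator.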

Can I use PA, and PB, PTRA, PTRB, registers as temporary registers or are they already used by FlexSpin? If they are can I just push them?

As mentioned in the docs, PTRA must be saved/restored. It'd probably be a good idea to save/restore PTRB too just in case someone else is using it.

Also can I do a CALL with the HW stack (one level deep)? How much of the HW stack can I use?

No, CALL will not work. RET has to be handled specially in ORG/END blocks because of the difference between PNut and FlexSpin, so don't try to use it yourself. The HW stack itself is OK to use, there will only be one level active when your code is entered.

Surac · 2021-04-01 12:40

Just out of curiosity: is there a memory map of what is stored in the other 256 longs of the LUT?

ersmith · 2021-04-01 12:51

@Surac said:
Just out of curiosity: is there a memory map of what is stored in the other 256 longs of the LUT?

There's a memory map in the general.md file in the documentation. In fact nothing is stored in the other 256 LUT longs, they're left available for the user (e.g. to use as a LUT, or to put specific functions in by declaring them as {++lut}).

rogloh · 2021-04-01 13:00

@ersmith said:

If it executes from LUTRAM, is the size of an ORG block in FlexSpin then limited to 512 longs, or is it something less than this?

256 longs, actually. This is slightly less than PNut's limit of $130 longs, but on the other hand the PNut limit is the total for all ORG blocks in the program and its objects, whereas the FlexSpin limit is per inline assembly block. (Please do keep in mind that a lot of objects use inline assembly, so it would be a bad idea to use up too much of PNut's space in any one object.)

Can I use PA, and PB, PTRA, PTRB, registers as temporary registers or are they already used by FlexSpin? If they are can I just push them?

As mentioned in the docs, PTRA must be saved/restored. It'd probably be a good idea to save/restore PTRB too just in case someone else is using it.

Also can I do a CALL with the HW stack (one level deep)? How much of the HW stack can I use?

No, CALL will not work. RET has to be handled specially in ORG/END blocks because of the difference between PNut and FlexSpin, so don't try to use it yourself. The HW stack itself is OK to use, there will only be one level active when your code is entered.

Thanks again for the further info, Eric. LOL, I just found out the hard way about CALL not working... looks like I need to unroll my two copies. That is not ideal if the code is going to get so large that it consumes the bulk of the available space. I thought PNut would load in the code like your FCACHE method; I didn't realize it was a fixed area for ALL inline PASM2. It is going to be troublesome to make it fit if there is other inline code, plus I use just over 16 locals+params, which I think PNut can't support.

I also found your inline PASM doesn't like this syntax either:

   wrlong  buf, ptrb[4]

PUB plotLine(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen, tmp1, tmp2, mailbox, list
    mailbox:=mem.getMailboxAddr(bus,cogid()) 
    list:=@listbuf
    listbuf[2]:=1 ' preallocate fixed data (eventually should do this outside the function in a top level somewhere)
    listbuf[3]:=$80000000
    listbuf[5]:=width
    listbuf[6]:=0
    listbuf[7]:=0
    org
        shr     bpp, #4
        mov     dx,x1
        sub     dx, x0
        abs     dx
        cmp     x0, x1 wc
  if_c  mov     sx,#1
  if_nc neg     sx,#1
        mov     dy, y1
        sub     dy, y0
        abs     dy
        neg     dy
        cmp     y0, y1 wc
   if_c mov     sy, #1
  if_nc neg     sy, #1
        mov     xlen, #0
        mov     ylen, #0
        mov     err, dx
        add     err, dy
testvertical
        test    dx wz
  if_nz jmp     #testhorizontal
        ' code to do gfxFill etc
        shl     x0,bpp
        add     buf, x0
        fles    y0, y1
        mul     y0, width
        add     buf, y0
        subr    dy, #1
        or      bpp, #$c
        setnib  buf, bpp, #7
        rdlong  tmp2, mailbox
        tjs     tmp2, #$-1 ' wait for mailbox to be empty
        wrlong  buf, list  ' or use streamer here?
        add     list, #4
        wrlong  colour, list
        add     list, #8
        wrlong  dy, list
        sub     list, #12
        mov     tmp2, mailbox
        add     tmp2, #4
        wrlong  tmp1, tmp2
        wrlong  ##-1, mailbox
        jmp     #done
testhorizontal
        test    dy wz
  if_nz jmp     #loop
        ' code to do fill etc
        fles    x0, x1
        shl     x0, bpp
        add     buf, x0
        mul     y0, width
        add     buf, y0
        add     dx, #1
        or      bpp, #$C
        setnib  buf, bpp, #7
        rdlong  tmp2, mailbox
        tjs     tmp2, #$-1 ' wait for mailbox to be empty
        mov     tmp2, mailbox
        add     tmp2, #8
        wrlong  dx, tmp2
        sub     tmp2, #4
        wrlong  colour, tmp2
        wrlong  buf, mailbox ' trigger request
        jmp     #done

loop
        cmp     x1, x0 wz
   if_z cmp     y1, y0 wz
  if_nz jmp     #skipend
        test    ylen wz
  if_nz call    #dovert
        'compute gfxfill params
  if_nz jmp     #done
skip2   test    xlen wz
  if_nz call    #dohoriz
  if_nz jmp     #done   
        'compute fill params
skipend
        mov     e2, err
        shl     e2, #1 
        cmps    e2, dy wc
  if_c  jmp     #skip3
        test    ylen wz
  if_nz call    #dovert
  if_nz mov     ylen, #0
        add     err, dy
        add     x0, sx
        add     xlen, #1

skip3   cmps    e2, dx wcz
 if_nc_and_nz   jmp #loop
        test    xlen wz
  if_nz call    #dohoriz
  if_nz mov     xlen, #0
        add     err, dx
        add     y0, sy
        add     ylen, #1
        jmp     #loop

dohoriz
        mov     tmp1, sx
        mul     tmp1, xlen
        subr    tmp1, x0
        shl     tmp1, bpp
        add     tmp1, buf
        mov     tmp2, y0
        mul     tmp2, width
        add     tmp1, tmp2
        mov     tmp2, bpp
        or      tmp2, #$C
        setnib  tmp1, tmp2, #7
        rdlong  tmp2, mailbox
        tjs     tmp2, #$-1 ' wait for mailbox to be empty
        mov     tmp2, mailbox
        add     tmp2, #8
        wrlong  colour, tmp2
        sub     tmp2, #4
        wrlong  xlen, tmp2
        wrlong  tmp1, mailbox
        ret

dovert
        mov     tmp1, sy
        mul     tmp1, ylen
        subr    tmp1, y0
        mul     tmp1, width
        mov     tmp2, x0
        shl     tmp2, bpp
        add     tmp1, buf
        add     tmp1, tmp2
        mov     tmp2, bpp
        or      tmp2, #$c
        setnib  tmp1, tmp2, #7
        rdlong  tmp2, mailbox
        tjs     tmp2, #$-1
        wrlong  buf, list  ' or use streamer here?
        add     list, #4
        wrlong  colour, list
        add     list, #8
        wrlong  dy, list
        sub     list, #12
        mov     tmp2, mailbox
        add     tmp2, #4
        wrlong  list, tmp2
        wrlong  ##-1, mailbox
        ret

done 
        rdlong  tmp1, mailbox ' wait for mailbox to be complete
        tjs     tmp1, #$-1

        end

rogloh · 2021-04-01 14:05

Just got it working under FlexSpin.

Result when drawing a bunch of different angled lines in both the original SPIN2 and my new PASM2 version:

spin2 ticks = 290342186
pasm2 ticks = 110185432

The inline PASM2 is about 2.6x faster, nice. I imagine under PNut the difference would be even bigger.
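For anyone who wants the ratio spelled out, here is the arithmetic on the two tick counts above as a short Python snippet. The variable names are just labels for the two measurements.

```python
spin2_ticks = 290_342_186   # original SPIN2 version, from the result above
pasm2_ticks = 110_185_432   # inline PASM2 version, from the result above

speedup = spin2_ticks / pasm2_ticks          # how many times faster
saved = 1 - pasm2_ticks / spin2_ticks        # fraction of ticks no longer spent

if __name__ == "__main__":
    print(f"speedup: {speedup:.2f}x, ticks saved: {saved:.1%}")
```

This only compares the two totals; it says nothing about how much of the remaining time is spent waiting on the mailbox rather than executing instructions.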

Here is the method now (still not perfect but somewhat functional at least):

PUB plotLine3(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen, tmp1, tmp2, mailbox, list
    mailbox:=mem.getMailboxAddr(bus,cogid()) 
    list:=@listbuf
    listbuf[2]:=1 ' preallocate
    listbuf[3]:=$80000000
    listbuf[5]:=width
    listbuf[6]:=0
    listbuf[7]:=0
    org
        shr     bpp, #4
        mov     dx, x1
        sub     dx, x0
        abs     dx
        cmp     x0, x1 wc
  if_c  mov     sx,#1
  if_nc neg     sx,#1
        mov     dy, y1
        sub     dy, y0
        abs     dy
        neg     dy
        cmp     y0, y1 wc
   if_c mov     sy, #1
  if_nc neg     sy, #1
        mov     xlen, #0
        mov     ylen, #0
        mov     err, dx
        add     err, dy
testvertical
        test    dx wz
  if_nz jmp     #testhorizontal
        ' code to do gfxFill etc
        shl     x0,bpp
        add     buf, x0
        fles    y0, y1
        mul     y0, width
        add     buf, y0
        subr    dy, #1
        or      bpp, #$c
        setnib  buf, bpp, #7
        rdlong  tmp2, mailbox
        tjs     tmp2, #$-1 ' wait for mailbox to be empty
        wrlong  buf, list
        add     list, #4
        wrlong  colour, list
        add     list, #12
        wrlong  dy, list
        sub     list, #16
        mov     tmp2, mailbox
        add     tmp2, #4
        wrlong  list, tmp2
        wrlong  ##-1, mailbox
        jmp     #done
testhorizontal
        test    dy wz
  if_nz jmp     #loop
        ' code to do fill etc
        fles    x0, x1
        shl     x0, bpp
        add     buf, x0
        mul     y0, width
        add     buf, y0
        add     dx, #1
        or      bpp, #$C
        setnib  buf, bpp, #7
        rdlong  tmp2, mailbox
        tjs     tmp2, #$-1 ' wait for mailbox to be empty
        mov     tmp2, mailbox
        add     tmp2, #8
        wrlong  dx, tmp2
        sub     tmp2, #4
        wrlong  colour, tmp2
        wrlong  buf, mailbox ' trigger request
        jmp     #done

loop
        modc    0 wc  ' c is used to determine where to "return" to in dovert/dohoriz
        cmp     x1, x0 wz
   if_z cmp     y1, y0 wz
  if_nz jmp     #skipend

        test    ylen wz
  if_nz jmp     #dovert

skip2   test    xlen wz
  if_nz jmp     #dohoriz

skipend
        mov     e2, err
        shl     e2, #1 
        cmps    e2, dy wc
  if_c  jmp     #skip3
        test    ylen wz
        modc    15 wc
  if_nz jmp     #dovert
ret1
  if_nz mov     ylen, #0
        add     err, dy
        add     x0, sx
        add     xlen, #1

skip3   cmps    e2, dx wcz
 if_nc_and_nz   jmp #loop
        test    xlen wz
        modc    15 wc
  if_nz jmp     #dohoriz
ret2
  if_nz mov     xlen, #0
        add     err, dx
        add     y0, sy
        add     ylen, #1
        jmp     #loop

dohoriz
        mov     tmp1, sx
        mul     tmp1, xlen
        subr    tmp1, x0
        shl     tmp1, bpp
        add     tmp1, buf
        mov     tmp2, y0
        mul     tmp2, width
        add     tmp1, tmp2
        mov     tmp2, bpp
        or      tmp2, #$C
        setnib  tmp1, tmp2, #7
        rdlong  tmp2, mailbox
        tjs     tmp2, #$-1 ' wait for mailbox to be empty
        mov     tmp2, mailbox
        add     tmp2, #8
        wrlong  xlen, tmp2
        sub     tmp2, #4
        wrlong  colour, tmp2
        wrlong  tmp1, mailbox
   if_c jmp     #ret2
        jmp     #done

dovert
        mov     tmp1, sy
        mul     tmp1, ylen
        subr    tmp1, y0
        mul     tmp1, width
        mov     tmp2, x0
        shl     tmp2, bpp
        add     tmp1, buf
        add     tmp1, tmp2
        mov     tmp2, bpp
        or      tmp2, #$c
        setnib  tmp1, tmp2, #7
        rdlong  tmp2, mailbox
        tjs     tmp2, #$-1
        wrlong  tmp1, list
        add     list, #4
        wrlong  colour, list
        add     list, #12
        wrlong  ylen, list
        sub     list, #16
        mov     tmp2, mailbox
        add     tmp2, #4
        wrlong  list, tmp2
        wrlong  ##-1, mailbox
   if_c jmp     #ret1

done 
        rdlong  tmp1, mailbox ' wait for mailbox to be complete
        tjs     tmp1, #$-1

        end

dnalor · 2021-04-01 14:29

@ersmith said:
As I understand it, in PNut the inline assembly is resident in the COG (so the total amount of inline assembly must not exceed the area left for it by the interpreter) and the variables are copied from the stack into internal memory before running the assembly, and then copied back afterwards. In flexspin it's the reverse: the variables are already in COG memory, but the code has to be copied into LUT before execution. This means that flexspin lets you have an arbitrary number of inline assembly blocks, but there's a bit more latency before starting the asm (at least when there is more code than data). It is possible to "fix" a function in LUT or COG memory in flexspin by using {++cog} or {++lut} in its declaration, but that's best used extremely sparingly.

Are you sure? This is from the Spin2 documentation:

Here's the internal Spin2 procedure for executing in-line PASM code:

1. Save the current streamer address for restoration after the PASM code executes.
2. Copy the method's first 16 long variables, including any parameters, return values, and local variables, from hub RAM to cog registers $1E0..$1EF.
3. Copy the in-line PASM-code longs from hub RAM into cog registers, starting at the ORG address (default is $000).
4. CALL the PASM code.
5. Restore the 16 longs in cog registers $1E0..$1EF back to hub RAM, in order to update any modified method variables.
6. Restore the streamer address and resume Spin2 bytecode execution.
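The copy-in/copy-out part of that procedure can be modelled in a few lines of Python. This is only a toy sketch of the quoted steps, not interpreter code: the streamer save/restore is left out, the PASM body is stood in for by a Python callable, and `run_inline` is a hypothetical name.

```python
VAR_BASE = 0x1E0    # first cog register holding method variables (from the quoted docs)
VAR_COUNT = 16      # only the first 16 longs are copied in and back out
COG_SIZE = 0x200    # cog register space, $000..$1FF


def run_inline(method_vars, pasm_body):
    """Model the steps: copy variables in, run the code, copy variables back."""
    cog = [0] * COG_SIZE
    shared = min(len(method_vars), VAR_COUNT)
    cog[VAR_BASE:VAR_BASE + shared] = method_vars[:shared]   # hub RAM -> cog registers
    pasm_body(cog)                                           # stands in for "CALL the PASM code"
    method_vars[:shared] = cog[VAR_BASE:VAR_BASE + shared]   # cog registers -> hub RAM
    return method_vars


if __name__ == "__main__":
    # 20 longs, like a method with 8 parameters and 12 locals.
    variables = list(range(20))

    def body(cog):
        cog[VAR_BASE + 0] = 42     # visible: within the first 16 longs
        cog[VAR_BASE + 15] = 99    # visible: the last shared long

    print(run_inline(variables, body))
```

The point of the model is the cut-off: a method such as plotLine3 above declares 8 parameters plus 12 locals, so under this procedure its last four locals would not be among the 16 longs copied to and from the cog.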

As I understand it, you could have persistent PASM code if you copy it to the end of the "inline code space", where it would not be overwritten.

ersmith · 2021-04-01 16:41

@dnalor : It appears I misunderstood what the address in PNut's ORG meant. It does seem that PNut, like FlexSpin, loads the inline assembly code into local memory at run time, so different inline assembly does not have to share. Thank you for the correction.

Cluso99 · 2021-04-01 20:39

@rogloh
In my LCD graphics driver (a Quick Bytes article) I have draw and fill routines in Spin. It would be nice, if possible, to use the same parameter order in the calls.

rogloh · 2021-04-02 00:13

@Cluso99 Right now this API is just for my testing and internal use. Any final API would be different and hopefully simpler. It also only works for 8, 16, 32(24) bpp pixel modes, and likely still has bugs as I found a problem in some octants using this algorithm. Drawing in 1/2/4 bpp needs another approach too and gets quite a bit messier with read-modify-write instead of simple fill operations.

I'm also still considering including line drawing as a possible hub-exec extension/accelerator in my memory driver, and want to compare the performance there as well. It would eliminate the need to make a new memory request for every vertical or horizontal segment that makes up a line, and could then offload the requesting COG quite a lot. It probably would not support the 1/2/4 bpp modes though, at this stage.

ersmith · 2021-04-02 00:23

@rogloh : BTW, you seriously need to update your flexspin compiler :

Propeller Spin/PASM Compiler 'FastSpin' (c) 2011-2020 Total Spectrum Software Inc.
Version 4.3.1 Compiled on: Sep 1 2020

The current version is 5.3.1 of March 2021.

I also found your inline PASM doesn't like this syntax either:
wrlong buf, ptrb[4]

That's fixed in github now, and will be in the upcoming 5.3.2 release.

rogloh · 2021-04-02 01:05

LOL, well I have updated it, main problem is there are just about 5 different versions of it installed on my Mac at this time and I need to rationalize where I keep them all. I just need to map my alias/path to invoke the "proper" one.

ersmith · 2021-04-02 13:09

A big compatibility caveat for inline assembly: FlexSpin runs the code from LUT rather than COG memory, so no variables may be declared in the inline assembly. That is, instead of writing:

pub inc1000(x) : r
  org
    add x, thousand
    ret
thousand
    long 1000
  end
  return x

you should instead write:

pub inc1000(x) : r | thousand
  thousand := 1000
  org
    add x, thousand
  end
  return x
