Best cross-toolchain way to do inline assembly? — Parallax Forums

roglohrogloh Posts: 5,786
edited 2021-04-01 10:28 in Propeller 2

I have a SPIN2 function to plot lines using my video and memory driver that is currently being developed and I was wondering what is the best way to convert it into inline/callable PASM that would work with both FlexProp and PropTool/PNut toolchains in order to speed things up more.

PUB plotLine(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen
    dx :=  abs(x1-x0)
    sx := (x0<x1) ? 1 : -1
    dy := -abs(y1-y0)
    sy := (y0<y1) ? 1 : -1
    xlen := 0
    ylen := 0
    err := dx+dy
    if dx == 0 ' vertical line optimization
        mem.gfxFill(buf+x0<<(bpp>>4)+ (y0<#y1)*width, width, 1, 1-dy, colour, 0, bpp>>3) 
        return
    if dy == 0 ' horizontal line optimization
        mem.fill(buf+((x0<#x1)<<(bpp>>4))+ y0*width, colour, 1+dx, 0, bpp>>3) 
        return
    repeat  ' continue until final co-ordinates are reached
        if (x0 == x1 && y0 == y1) 
            if ylen
                mem.gfxFill(buf+x0<<(bpp>>4)+(y0-(sy*ylen))*width, width, 1, ylen, colour, 0, bpp>>3) 
            if xlen
                mem.fill(buf+(x0-sx*xlen)<<(bpp>>4)+y0*width, colour, xlen, 0, bpp>>3)
            return
        e2 := err<<1
        if (e2 >= dy)
            if ylen
                mem.gfxFill(buf+x0<<(bpp>>4)+(y0-(sy*ylen))*width, width, 1, ylen, colour, 0, bpp>>3) 
                ylen:=0
            err += dy
            x0 += sx
            xlen++
        if (e2 <= dx)
            if xlen
                mem.fill(buf+(x0-sx*xlen)<<(bpp>>4)+y0*width, colour, xlen, 0, bpp>>3)
                xlen:=0
            err += dx
            y0 += sy
            ylen++

If I convert this code to PASM2, I know I can change the existing method calls to both mem.fill(...) and mem.gfxFill(...) to instead write directly to my memory mailbox, so plotLine won't have to call any deeper down SPIN2 code, meaning this entire method could be completely written in PASM2 for the fastest drawing speed. I imagine it would probably fit into the $12F longs allowed by PropTool for full COG exec.

But how can/should the code be written to work in both toolchains?

It looks like PropTool SPIN2 is fairly well documented as to what is allowed, but can FlexProp also support COGRAM execution like PNut, or is it HUB-exec only? Can I simply start to write the type of code shown below, and can the symbol names be retained from the locals and parameters? Or do I need to do it another way using some PR0-PR7 names etc.? What type of inline PASM2 coding actually works with both tools as a sort of lowest common denominator? Has anyone figured this out already? @ersmith, basically what is the state of play with FlexProp's inline PASM2 in comparison with what PropTool does? Is this fully documented anywhere as to how it all works and what P2 resources I can use while executing inline PASM2? Am I just worrying unnecessarily here and should I just code it up fully and try it out? :smile:

PUB plotLine(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen
    org
        mov     dx, x1
        sub     dx, x0
        abs     dx
        cmp     x0, x1 wc
  if_c  mov     sx, #1
  if_nc neg     sx, #1
        mov     dy, y1
        sub     dy, y0
        abs     dy
        neg     dy
        cmp     y0, y1 wc
   if_c mov     sy, #1
  if_nc neg     sy, #1
        mov     xlen, #0
        mov     ylen, #0
        mov     err, dx
        add     err, dy
        test    dx wz
   if_z jmp     #xx ' some code to do gfxFill etc
        test    dy wz
   if_z jmp     #zz ' some code to do fill etc
 ...
    end

Comments

  • Roger, the flexspin inline assembly documentation is in general.md (since it applies to all 3 languages). There's some discussion of the difference between flexspin and PNut in the spin.md file as well. ORG/END blocks are automatically copied to LUT before execution. ORG with an address is not supported by flexspin, but plain ORG should generally work fine. The code you wrote above looks like it should work.

    As I understand it, in PNut the inline assembly is resident in the COG (so the total amount of inline assembly must not exceed the area left for it by the interpreter) and the variables are copied from the stack into internal memory before running the assembly, and then copied back afterwards. In flexspin it's the reverse: the variables are already in COG memory, but the code has to be copied into LUT before execution. This means that flexspin lets you have an arbitrary number of inline assembly blocks, but there's a bit more latency before starting the asm (at least when there is more code than data). It is possible to "fix" a function in LUT or COG memory in flexspin by using {++cog} or {++lut} in its declaration, but that's best used extremely sparingly.

  • Sounds good Eric, thanks for the explanation.

    If it executes from LUTRAM, is the size of an ORG block in FlexSpin then limited to 512 longs, or is it something less than this? I think my code is hopefully going to fit. I am up to 114 total longs and am almost done coding. I don't imagine it will reach more than about 200 longs even if I add some clipping stuff to it.

    Can I use the PA, PB, PTRA, and PTRB registers as temporary registers, or are they already used by FlexSpin? If they are, can I just push them? Also, can I do a CALL with the HW stack (one level deep)? How much of the HW stack can I use?

  • ersmithersmith Posts: 6,052
    edited 2021-04-01 16:39

    If it executes from LUTRAM, is the size of an ORG block in FlexSpin then limited to 512 longs, or is it something less than this?

    256 longs, actually. This is slightly less than PNut's limit of $130 longs, but on the other hand the PNut limit is the total for all ORG blocks in the program and its objects, whereas the FlexSpin limit is per inline assembly block. (Please do keep in mind that a lot of objects use inline assembly, so it would be a bad idea to use up too much of PNut's space in any one object.) (EDIT: this may not actually be right, I think I misread the docs -- PNut, like FlexSpin, loads the assembly at run time so the limit is not global.)

    Can I use the PA, PB, PTRA, and PTRB registers as temporary registers, or are they already used by FlexSpin? If they are, can I just push them?

    As mentioned in the docs, PTRA must be saved/restored. It'd probably be a good idea to save/restore PTRB too just in case someone else is using it.

    Also can I do a CALL with the HW stack (one level deep)? How much of the HW stack can I use?

    No, CALL will not work. RET has to be handled specially in ORG/END blocks because of the difference between PNut and FlexSpin, so don't try to use it yourself. The HW stack itself is OK to use, there will only be one level active when your code is entered.
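
    A minimal sketch of the advice above, assuming a hypothetical method sum2 that reads two consecutive longs through PTRB. The names sum2, t, and savedptrb are made up for illustration. Saving PTRB in a local is the precaution suggested here rather than a documented requirement, and the block avoids CALL/RET entirely.

    ```spin2
    PUB sum2(addr) : r | t, savedptrb
        org
            mov     savedptrb, ptrb     ' keep the caller's PTRB in a local
            mov     ptrb, addr          ' point PTRB at the data
            rdlong  r, ptrb++           ' read first long, PTRB advances by 4
            rdlong  t, ptrb++           ' read second long
            add     r, t                ' r := first + second
            mov     ptrb, savedptrb     ' restore PTRB before leaving the block
        end
    ```

    Holding the saved value in a method local keeps all data out of the ORG/END block itself, so the same source should be usable whether the code ends up in COG or LUT memory.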

  • SuracSurac Posts: 176
    edited 2021-04-01 12:41

    Just out of curiosity: is there a memory map of what is stored in the other 256 longs of the LUT?

  • @Surac said:
    Just out of curiosity: is there a memory map of what is stored in the other 256 longs of the LUT?

    There's a memory map in the general.md file in the documentation. In fact nothing is stored in the other 256 LUT longs, they're left available for the user (e.g. to use as a LUT, or to put specific functions in by declaring them as {++lut}).

  • roglohrogloh Posts: 5,786
    edited 2021-04-01 13:04

    @ersmith said:

    If it executes from LUTRAM, is the size of an ORG block in FlexSpin then limited to 512 longs, or is it something less than this?

    256 longs, actually. This is slightly less than PNut's limit of $130 longs, but on the other hand the PNut limit is the total for all ORG blocks in the program and its objects, whereas the FlexSpin limit is per inline assembly block. (Please do keep in mind that a lot of objects use inline assembly, so it would be a bad idea to use up too much of PNut's space in any one object.)

    Can I use the PA, PB, PTRA, and PTRB registers as temporary registers, or are they already used by FlexSpin? If they are, can I just push them?

    As mentioned in the docs, PTRA must be saved/restored. It'd probably be a good idea to save/restore PTRB too just in case someone else is using it.

    Also can I do a CALL with the HW stack (one level deep)? How much of the HW stack can I use?

    No, CALL will not work. RET has to be handled specially in ORG/END blocks because of the difference between PNut and FlexSpin, so don't try to use it yourself. The HW stack itself is OK to use, there will only be one level active when your code is entered.

    Thanks again for the further info Eric. LOL, I just found out the hard way about CALL not working...looks like I need to unroll my two copies. This is not ideal if it is going to get so large as to consume the bulk of the resource. I thought PNut would load in the code like your FCACHE method; I didn't realize it was a fixed area for ALL inline PASM2. This is going to be troublesome to get it to fit if there is other inline code, plus I use just over 16 locals+params, which I think PNut can't support.

    I also found your inline PASM doesn't like this syntax either:

       wrlong  buf, ptrb[4]
    

    Propeller Spin/PASM Compiler 'FastSpin' (c) 2011-2020 Total Spectrum Software Inc.
    Version 4.3.1 Compiled on: Sep 1 2020
    hdgfx.spin2
    |-p2videodrv.spin2
    |-SmartSerial.spin2
    |-ers_fmt.spin2
    |-memory.spin2
    |-|-hyperdrv.spin2
    |-|-psramdrv.spin2
    |-|-ers_fmt.spin2
    error: call/return from fcached code not supported
    error: call/return from fcached code not supported

    PUB plotLine(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen, tmp1, tmp2, mailbox, list
        mailbox:=mem.getMailboxAddr(bus,cogid()) 
        list:=@listbuf
        listbuf[2]:=1 ' preallocate fixed data (eventually should do this outside the function in a top level somewhere)
        listbuf[3]:=$80000000
        listbuf[5]:=width
        listbuf[6]:=0
        listbuf[7]:=0
        org
            shr     bpp, #4
            mov     dx,x1
            sub     dx, x0
            abs     dx
            cmp     x0, x1 wc
      if_c  mov     sx,#1
      if_nc neg     sx,#1
            mov     dy, y1
            sub     dy, y0
            abs     dy
            neg     dy
            cmp     y0, y1 wc
       if_c mov     sy, #1
      if_nc neg     sy, #1
            mov     xlen, #0
            mov     ylen, #0
            mov     err, dx
            add     err, dy
    testvertical
            test    dx wz
      if_nz jmp     #testhorizontal
            ' code to do gfxFill etc
            shl     x0,bpp
            add     buf, x0
            fles    y0, y1
            mul     y0, width
            add     buf, y0
            subr    dy, #1
            or      bpp, #$c
            setnib  buf, bpp, #7
            rdlong  tmp2, mailbox
            tjs     tmp2, #$-1 ' wait for mailbox to be empty
            wrlong  buf, list  ' or use streamer here?
            add     list, #4
            wrlong  colour, list
            add     list, #8
            wrlong  dy, list
            sub     list, #12
            mov     tmp2, mailbox
            add     tmp2, #4
            wrlong  tmp1, tmp2
            wrlong  ##-1, mailbox
            jmp     #done
    testhorizontal
            test    dy wz
      if_nz jmp     #loop
            ' code to do fill etc
            fles    x0, x1
            shl     x0, bpp
            add     buf, x0
            mul     y0, width
            add     buf, y0
            add     dx, #1
            or      bpp, #$C
            setnib  buf, bpp, #7
            rdlong  tmp2, mailbox
            tjs     tmp2, #$-1 ' wait for mailbox to be empty
            mov     tmp2, mailbox
            add     tmp2, #8
            wrlong  dx, tmp2
            sub     tmp2, #4
            wrlong  colour, tmp2
            wrlong  buf, mailbox ' trigger request
            jmp     #done
    
    loop
            cmp     x1, x0 wz
       if_z cmp     y1, y0 wz
      if_nz jmp     #skipend
            test    ylen wz
      if_nz call    #dovert
            'compute gfxfill params
      if_nz jmp     #done
    skip2   test    xlen wz
      if_nz call    #dohoriz
      if_nz jmp     #done   
            'compute fill params
    skipend
            mov     e2, err
            shl     e2, #1 
            cmps    e2, dy wc
      if_c  jmp     #skip3
            test    ylen wz
      if_nz call    #dovert
      if_nz mov     ylen, #0
            add     err, dy
            add     x0, sx
            add     xlen, #1
    
    skip3   cmps    e2, dx wcz
     if_nc_and_nz   jmp #loop
            test    xlen wz
      if_nz call    #dohoriz
      if_nz mov     xlen, #0
            add     err, dx
            add     y0, sy
            add     ylen, #1
            jmp     #loop
    
    dohoriz
            mov     tmp1, sx
            mul     tmp1, xlen
            subr    tmp1, x0
            shl     tmp1, bpp
            add     tmp1, buf
            mov     tmp2, y0
            mul     tmp2, width
            add     tmp1, tmp2
            mov     tmp2, bpp
            or      tmp2, #$C
            setnib  tmp1, tmp2, #7
            rdlong  tmp2, mailbox
            tjs     tmp2, #$-1 ' wait for mailbox to be empty
            mov     tmp2, mailbox
            add     tmp2, #8
            wrlong  colour, tmp2
            sub     tmp2, #4
            wrlong  xlen, tmp2
            wrlong  tmp1, mailbox
            ret
    
    dovert
            mov     tmp1, sy
            mul     tmp1, ylen
            subr    tmp1, y0
            mul     tmp1, width
            mov     tmp2, x0
            shl     tmp2, bpp
            add     tmp1, buf
            add     tmp1, tmp2
            mov     tmp2, bpp
            or      tmp2, #$c
            setnib  tmp1, tmp2, #7
            rdlong  tmp2, mailbox
            tjs     tmp2, #$-1
            wrlong  buf, list  ' or use streamer here?
            add     list, #4
            wrlong  colour, list
            add     list, #8
            wrlong  dy, list
            sub     list, #12
            mov     tmp2, mailbox
            add     tmp2, #4
            wrlong  list, tmp2
            wrlong  ##-1, mailbox
            ret
    
    done 
            rdlong  tmp1, mailbox ' wait for mailbox to be complete
            tjs     tmp1, #$-1
    
            end
    
  • roglohrogloh Posts: 5,786
    edited 2021-04-01 14:15

    Just got it working under FlexSpin.

    Result when drawing a bunch of different angled lines in both the original SPIN2 and my new PASM2 version:

    spin2 ticks = 290342186
    pasm2 ticks = 110185432
    

    The inline PASM2 is almost 3x faster, nice. :smile: I imagine under PNut it would be even more of a difference.

    Here is the method now (still not perfect but somewhat functional at least):

    PUB plotLine3(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen, tmp1, tmp2, mailbox, list
        mailbox:=mem.getMailboxAddr(bus,cogid()) 
        list:=@listbuf
        listbuf[2]:=1 ' preallocate
        listbuf[3]:=$80000000
        listbuf[5]:=width
        listbuf[6]:=0
        listbuf[7]:=0
        org
            shr     bpp, #4
            mov     dx, x1
            sub     dx, x0
            abs     dx
            cmp     x0, x1 wc
      if_c  mov     sx,#1
      if_nc neg     sx,#1
            mov     dy, y1
            sub     dy, y0
            abs     dy
            neg     dy
            cmp     y0, y1 wc
       if_c mov     sy, #1
      if_nc neg     sy, #1
            mov     xlen, #0
            mov     ylen, #0
            mov     err, dx
            add     err, dy
    testvertical
            test    dx wz
      if_nz jmp     #testhorizontal
            ' code to do gfxFill etc
            shl     x0,bpp
            add     buf, x0
            fles    y0, y1
            mul     y0, width
            add     buf, y0
            subr    dy, #1
            or      bpp, #$c
            setnib  buf, bpp, #7
            rdlong  tmp2, mailbox
            tjs     tmp2, #$-1 ' wait for mailbox to be empty
            wrlong  buf, list
            add     list, #4
            wrlong  colour, list
            add     list, #12
            wrlong  dy, list
            sub     list, #16
            mov     tmp2, mailbox
            add     tmp2, #4
            wrlong  list, tmp2
            wrlong  ##-1, mailbox
            jmp     #done
    testhorizontal
            test    dy wz
      if_nz jmp     #loop
            ' code to do fill etc
            fles    x0, x1
            shl     x0, bpp
            add     buf, x0
            mul     y0, width
            add     buf, y0
            add     dx, #1
            or      bpp, #$C
            setnib  buf, bpp, #7
            rdlong  tmp2, mailbox
            tjs     tmp2, #$-1 ' wait for mailbox to be empty
            mov     tmp2, mailbox
            add     tmp2, #8
            wrlong  dx, tmp2
            sub     tmp2, #4
            wrlong  colour, tmp2
            wrlong  buf, mailbox ' trigger request
            jmp     #done
    
    loop
            modc    0 wc  ' c is used to determine where to "return" to in dovert/dohoriz
            cmp     x1, x0 wz
       if_z cmp     y1, y0 wz
      if_nz jmp     #skipend
    
            test    ylen wz
      if_nz jmp     #dovert
    
    skip2   test    xlen wz
      if_nz jmp     #dohoriz
    
    skipend
            mov     e2, err
            shl     e2, #1 
            cmps    e2, dy wc
      if_c  jmp     #skip3
            test    ylen wz
            modc    15 wc
      if_nz jmp     #dovert
    ret1
      if_nz mov     ylen, #0
            add     err, dy
            add     x0, sx
            add     xlen, #1
    
    skip3   cmps    e2, dx wcz
     if_nc_and_nz   jmp #loop
            test    xlen wz
            modc    15 wc
      if_nz jmp     #dohoriz
    ret2
      if_nz mov     xlen, #0
            add     err, dx
            add     y0, sy
            add     ylen, #1
            jmp     #loop
    
    dohoriz
            mov     tmp1, sx
            mul     tmp1, xlen
            subr    tmp1, x0
            shl     tmp1, bpp
            add     tmp1, buf
            mov     tmp2, y0
            mul     tmp2, width
            add     tmp1, tmp2
            mov     tmp2, bpp
            or      tmp2, #$C
            setnib  tmp1, tmp2, #7
            rdlong  tmp2, mailbox
            tjs     tmp2, #$-1 ' wait for mailbox to be empty
            mov     tmp2, mailbox
            add     tmp2, #8
            wrlong  xlen, tmp2
            sub     tmp2, #4
            wrlong  colour, tmp2
            wrlong  tmp1, mailbox
       if_c jmp     #ret2
            jmp     #done
    
    dovert
            mov     tmp1, sy
            mul     tmp1, ylen
            subr    tmp1, y0
            mul     tmp1, width
            mov     tmp2, x0
            shl     tmp2, bpp
            add     tmp1, buf
            add     tmp1, tmp2
            mov     tmp2, bpp
            or      tmp2, #$c
            setnib  tmp1, tmp2, #7
            rdlong  tmp2, mailbox
            tjs     tmp2, #$-1
            wrlong  tmp1, list
            add     list, #4
            wrlong  colour, list
            add     list, #12
            wrlong  ylen, list
            sub     list, #16
            mov     tmp2, mailbox
            add     tmp2, #4
            wrlong  list, tmp2
            wrlong  ##-1, mailbox
       if_c jmp     #ret1
    
    done 
            rdlong  tmp1, mailbox ' wait for mailbox to be complete
            tjs     tmp1, #$-1
    
            end
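
    The flag-based "return" used above can be shown in isolation. This is a minimal sketch, assuming a hypothetical method twice with made-up labels back1 and addx. It reuses only instruction forms that already appear in the method above (modc 0 wc, modc 15 wc, conditional jmp) and is not meant as the definitive way to replace CALL/RET.

    ```spin2
    PUB twice(x) : r
        org
            mov     r, #0
            modc    0 wc            ' C=0 marks the first "call"
            jmp     #addx
    back1
            modc    15 wc           ' C=1 marks the second "call"
    addx
            add     r, x            ' shared body; ADD without WC leaves C alone
      if_nc jmp     #back1          ' C=0: resume after the first call site
                                    ' C=1: fall off the end with r = 2*x
        end
    ```

    The shared code must not alter C before it tests the flag, which is why every instruction in dohoriz and dovert above that sets flags only writes Z.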
    
  • @ersmith said:
    As I understand it, in PNut the inline assembly is resident in the COG (so the total amount of inline assembly must not exceed the area left for it by the interpreter) and the variables are copied from the stack into internal memory before running the assembly, and then copied back afterwards. In flexspin it's the reverse: the variables are already in COG memory, but the code has to be copied into LUT before execution. This means that flexspin lets you have an arbitrary number of inline assembly blocks, but there's a bit more latency before starting the asm (at least when there is more code than data). It is possible to "fix" a function in LUT or COG memory in flexspin by using {++cog} or {++lut} in its declaration, but that's best used extremely sparingly.

    Are you sure? This is from the Spin2 documentation:

    Here's the internal Spin2 procedure for executing in-line PASM code:

    1. Save the current streamer address for restoration after the PASM code executes.
    2. Copy the method's first 16 long variables, including any parameters, return values, and local variables, from hub RAM to cog registers $1E0..$1EF.
    3. Copy the in-line PASM-code longs from hub RAM into cog registers, starting at the ORG address (default is $000).
    4. CALL the PASM code.
    5. Restore the 16 longs in cog registers $1E0..$1EF back to hub RAM, in order to update any modified method variables.
    6. Restore the streamer address and resume Spin2 bytecode execution.
    

    As I understand it, you could have persistent PASM code if you copy it to the end of the inline code space, where it would not be overwritten.
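
    A minimal sketch of what the quoted procedure means for the source code, assuming a hypothetical method doubleSum. Its two parameters, one return value, and one local make four longs, well inside the first 16 that PNut copies to cog registers and back. Per the earlier explanation in this thread, FlexSpin already holds these variables in COG memory, so the same names should work there too.

    ```spin2
    PUB doubleSum(a, b) : r | t
        org
            mov     t, a        ' a, b, r and t are used as plain registers here
            add     t, b
            mov     r, t
            shl     r, #1       ' r := (a + b) * 2
        end
    ```

    Any variable the inline code needs has to sit among those first 16 longs under PNut, so a method with more parameters and locals than that would need to order its declarations with this in mind.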

  • @dnalor : It appears I misunderstood what the address in PNut's ORG meant. It does seem that PNut, like FlexSpin, loads the inline assembly code into local memory at run time, so different inline assembly does not have to share. Thank you for the correction.

  • Cluso99Cluso99 Posts: 18,069

    @rogloh
    In my LCD graphics driver (a Quick Bytes article) I have draw and fill routines in Spin. It would be nice, if possible, to use the same calling parameter order.

  • @Cluso99 Right now this API is just for my testing and internal use. Any final API would be different and hopefully simpler. It also only works for 8, 16, 32(24) bpp pixel modes, and likely still has bugs as I found a problem in some octants using this algorithm. Drawing in 1/2/4 bpp needs another approach too and gets quite a bit messier with read-modify-write instead of simple fill operations.

    I'm also still considering including line drawing as a possible hub-exec extension/accelerator in my memory driver and want to compare the performance there as well. It would eliminate a new memory request to be made for every vertical or horizontal segment that makes up a line and could then offload the requesting COG quite a lot. It probably would not support the 1/2/4bpp modes though, at this stage.

  • @rogloh : BTW, you seriously need to update your flexspin compiler :):

    Propeller Spin/PASM Compiler 'FastSpin' (c) 2011-2020 Total Spectrum Software Inc.
    Version 4.3.1 Compiled on: Sep 1 2020

    The current version is 5.3.1 of March 2021.

    I also found your inline PASM doesn't like this syntax either:
    wrlong buf, ptrb[4]

    That's fixed in github now, and will be in the upcoming 5.3.2 release.

  • roglohrogloh Posts: 5,786
    edited 2021-04-02 01:06

    LOL, well I have updated it, main problem is there are just about 5 different versions of it installed on my Mac at this time and I need to rationalize where I keep them all. :smiley: I just need to map my alias/path to invoke the "proper" one.

  • A big compatibility flag for inline-assembly: FlexSpin runs the code from LUT rather than COG memory, so no variables may be declared in the inline assembly. That is, instead of writing:

    pub inc1000(x) : r
      org
        add x, thousand
        ret
    thousand
        long 1000
      end
      return x
    

    you should instead write:

    pub inc1000(x) : r | thousand
      thousand := 1000
      org
        add x, thousand
      end
      return x
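
    For a constant like this, another option may be to drop the variable altogether. This is a minimal sketch, assuming both assemblers accept the ## augmented-immediate form inside ORG/END; the plotLine code earlier in this thread uses wrlong ##-1 that way under FlexSpin.

    ```spin2
    pub inc1000(x) : r
      org
        add x, ##1000     ' ## lets the assembler supply the 32-bit immediate
      end
      return x
    ```

    This costs one extra long for the generated prefix instruction but leaves no data inside the block.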
    