Best cross-toolchain way to do inline assembly?

I have a SPIN2 function that plots lines using my video and memory driver, which is currently being developed. What is the best way to convert it into inline/callable PASM that works with both the FlexProp and PropTool/PNut toolchains, in order to speed things up more?
PUB plotLine(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen
    dx := abs(x1-x0)
    sx := (x0<x1) ? 1 : -1
    dy := -abs(y1-y0)
    sy := (y0<y1) ? 1 : -1
    xlen := 0
    ylen := 0
    err := dx+dy
    if dx == 0 ' vertical line optimization
        mem.gfxFill(buf+x0<<(bpp>>4)+ (y0<#y1)*width, width, 1, 1-dy, colour, 0, bpp>>3)
        return
    if dy == 0 ' horizontal line optimization
        mem.fill(buf+((x0<#x1)<<(bpp>>4))+ y0*width, colour, 1+dx, 0, bpp>>3)
        return
    repeat ' continue until final co-ordinates are reached
        if (x0 == x1 && y0 == y1)
            if ylen
                mem.gfxFill(buf+x0<<(bpp>>4)+(y0-(sy*ylen))*width, width, 1, ylen, colour, 0, bpp>>3)
            if xlen
                mem.fill(buf+(x0-sx*xlen)<<(bpp>>4)+y0*width, colour, xlen, 0, bpp>>3)
            return
        e2 := err<<1
        if (e2 >= dy)
            if ylen
                mem.gfxFill(buf+x0<<(bpp>>4)+(y0-(sy*ylen))*width, width, 1, ylen, colour, 0, bpp>>3)
                ylen:=0
            err += dy
            x0 += sx
            xlen++
        if (e2 <= dx)
            if xlen
                mem.fill(buf+(x0-sx*xlen)<<(bpp>>4)+y0*width, colour, xlen, 0, bpp>>3)
                xlen:=0
            err += dx
            y0 += sy
            ylen++
If I convert this code to PASM2, I know I can change the existing method calls to both mem.fill(...) and mem.gfxFill(...) to instead write directly to my memory mailbox, so plotLine won't have to call any deeper down SPIN2 code, meaning this entire method could be completely written in PASM2 for the fastest drawing speed. I imagine it would probably fit into the $12F longs allowed by PropTool for full COG exec.
But how can/should the code be written to work in both toolchains?
It looks like PropTool SPIN2 is fairly well documented as to what is allowed, but can FlexProp also support COGRAM execution like PNut, or is it HUB-exec only? Can I simply start writing the type of code shown below, and can the symbol names be retained from the locals and parameters? Or do I need to do it another way, using PR0-PR7 names etc.? What type of inline PASM2 coding actually works with both tools as a sort of lowest common denominator? Has anyone figured this out already? @ersmith, basically what is the state of play with inline PASM2 in FlexProp in comparison with what PropTool does? Is this fully documented anywhere as to how it all works and what P2 resources I can use while executing inline PASM2? Am I just worrying unnecessarily here, and should I just code it up fully and try it out?
PUB plotLine(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen
    org
        mov dx, x1
        sub dx, x0
        abs dx
        cmp x0, x1 wc
        if_c mov sx, #1
        if_nc neg sx, #1
        mov dy, y1
        sub dy, y0
        abs dy
        neg dy
        cmp y0, y1 wc ' wc is needed so the if_c/if_nc below test this compare
        if_c mov sy, #1
        if_nc neg sy, #1
        mov xlen, #0
        mov ylen, #0
        mov err, dx
        add err, dy
        test dx wz
        if_z jmp #xx ' some code to do gfxFill etc
        test dy wz
        if_z jmp #zz ' some code to do fill etc
        ...
    end
Comments
Roger, the flexspin inline assembly documentation is in general.md (since it applies to all 3 languages). There's some discussion of the difference between flexspin and PNut in the spin.md file as well. ORG/END blocks are automatically copied to LUT before execution. ORG with an address is not supported by flexspin, but plain ORG should generally work fine. The code you wrote above looks like it should work.
As I understand it, in PNut the inline assembly is resident in the COG (so the total amount of inline assembly must not exceed the area left for it by the interpreter) and the variables are copied from the stack into internal memory before running the assembly, and then copied back afterwards. In flexspin it's the reverse: the variables are already in COG memory, but the code has to be copied into LUT before execution. This means that flexspin lets you have an arbitrary number of inline assembly blocks, but there's a bit more latency before starting the asm (at least when there is more code than data). It is possible to "fix" a function in LUT or COG memory in flexspin by using {++cog} or {++lut} in its declaration, but that's best used extremely sparingly.
Sounds good Eric, thanks for the explanation.
If it executes from LUTRAM, is the size of an ORG block in FlexSpin then limited to 512 longs, or is it something less than this? I think my code is hopefully going to fit. I am up to 114 total longs and am almost done coding. I don't imagine it will reach more than about 200 longs even if I add some clipping stuff to it.
Can I use PA, and PB, PTRA, PTRB, registers as temporary registers or are they already used by FlexSpin? If they are can I just push them? Also can I do a CALL with the HW stack (one level deep)? How much of the HW stack can I use?
256 longs, actually. This is slightly less than PNut's limit of $130 longs, but on the other hand the PNut limit is the total for all ORG blocks in the program and its objects, whereas the FlexSpin limit is per inline assembly block. (Please do keep in mind that a lot of objects use inline assembly, so it would be a bad idea to use up too much of PNut's space in any one object.) (EDIT: this may not actually be right, I think I misread the docs -- PNut, like FlexSpin, loads the assembly at run time so the limit is not global.)
As mentioned in the docs, PTRA must be saved/restored. It'd probably be a good idea to save/restore PTRB too just in case someone else is using it.
No, CALL will not work. RET has to be handled specially in ORG/END blocks because of the difference between PNut and FlexSpin, so don't try to use it yourself. The HW stack itself is OK to use, there will only be one level active when your code is entered.
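The constraints above (save and restore PTRA and PTRB, no CALL or RET inside the block) can be sketched roughly as follows. This is only an illustration under those stated rules: the method name sumLongs and the locals savea, saveb and tmp are hypothetical, and the block simply falls through to END instead of returning explicitly.

```spin2
' Hypothetical example: sumLongs, savea, saveb and tmp are illustrative names only.
PUB sumLongs(addr, count) : total | savea, saveb, tmp
    org
        mov     savea, ptra         ' PTRA must be put back before leaving the block
        mov     saveb, ptrb         ' PTRB saved as well, in case other code relies on it
        mov     ptra, addr          ' PTRA is now free to use as a hub pointer
        mov     total, #0
        tjz     count, #finish      ' nothing to add for an empty buffer
sumloop
        rdlong  tmp, ptra++         ' read one long, then advance PTRA by 4 bytes
        add     total, tmp
        djnz    count, #sumloop
finish
        mov     ptra, savea         ' restore both pointers on the way out
        mov     ptrb, saveb
    end                             ' no RET here: execution just falls off the end
```

Only plain jumps are used for control flow, so nothing depends on how either compiler handles RET inside an ORG/END block.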
Just out of curiosity. Is there a memory map what is stored in the other 256 longs of the LUT?
There's a memory map in the general.md file in the documentation. In fact nothing is stored in the other 256 LUT longs, they're left available for the user (e.g. to use as a LUT, or to put specific functions in by declaring them as {++lut}).
Thanks again for the further info Eric. LOL, I just found out the hard way about CALL not working...looks like I need to unroll my two copies. This is not ideal if it is going to get so large as to consume the bulk of the resource. I thought PNut would load in the code like your FCACHE method, didn't realize it was a fixed area for ALL inline PASM2. This is going to be troublesome to get it to fit if there is other inline code, plus I use just over 16 locals+params which I think PNUT can't support.
I also found your inline PASM doesn't like this syntax either:
wrlong buf, ptrb[4]
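If a compiler version rejects the indexed PTRB form, one possible workaround is to compute the address in an ordinary register first. This is a sketch only: storeAtOffset, base and tmp are hypothetical names, with base standing in for the value PTRB would have held. It relies on the P2 rule that a PTRx index is scaled by the access size, so index 4 on a long access is a byte offset of 16.

```spin2
' Hypothetical example: base stands in for the PTRB value, tmp is a scratch local.
PUB storeAtOffset(buf, base) | tmp
    org
        mov     tmp, base           ' tmp := base address
        add     tmp, #16            ' index 4 on a long access = 4 * 4 = 16 bytes
        wrlong  buf, tmp            ' same store as: wrlong buf, ptrb[4]
    end
```

This costs two extra instructions per access compared with the indexed form.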
Propeller Spin/PASM Compiler 'FastSpin' (c) 2011-2020 Total Spectrum Software Inc.
Version 4.3.1 Compiled on: Sep 1 2020
hdgfx.spin2
|-p2videodrv.spin2
|-SmartSerial.spin2
|-ers_fmt.spin2
|-memory.spin2
|-|-hyperdrv.spin2
|-|-psramdrv.spin2
|-|-ers_fmt.spin2
error: call/return from fcached code not supported
error: call/return from fcached code not supported
PUB plotLine(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen, tmp1, tmp2, mailbox, list
    mailbox:=mem.getMailboxAddr(bus,cogid())
    list:=@listbuf
    listbuf[2]:=1 ' preallocate fixed data (eventually should do this outside the function in a top level somewhere)
    listbuf[3]:=$80000000
    listbuf[5]:=width
    listbuf[6]:=0
    listbuf[7]:=0
    org
        shr bpp, #4
        mov dx,x1
        sub dx, x0
        abs dx
        cmp x0, x1 wc
        if_c mov sx,#1
        if_nc neg sx,#1
        mov dy, y1
        sub dy, y0
        abs dy
        neg dy
        cmp y0, y1 wc
        if_c mov sy, #1
        if_nc neg sy, #1
        mov xlen, #0
        mov ylen, #0
        mov err, dx
        add err, dy
    testvertical
        test dx wz
        if_nz jmp #testhorizontal
        ' code to do gfxFill etc
        shl x0,bpp
        add buf, x0
        fles y0, y1
        mul y0, width
        add buf, y0
        subr dy, #1
        or bpp, #$c
        setnib buf, bpp, #7
        rdlong tmp2, mailbox
        tjs tmp2, #$-1 ' wait for mailbox to be empty
        wrlong buf, list ' or use streamer here?
        add list, #4
        wrlong colour, list
        add list, #8
        wrlong dy, list
        sub list, #12
        mov tmp2, mailbox
        add tmp2, #4
        wrlong tmp1, tmp2
        wrlong ##-1, mailbox
        jmp #done
    testhorizontal
        test dy wz
        if_nz jmp #loop
        ' code to do fill etc
        fles x0, x1
        shl x0, bpp
        add buf, x0
        mul y0, width
        add buf, y0
        add dx, #1
        or bpp, #$C
        setnib buf, bpp, #7
        rdlong tmp2, mailbox
        tjs tmp2, #$-1 ' wait for mailbox to be empty
        mov tmp2, mailbox
        add tmp2, #8
        wrlong dx, tmp2
        sub tmp2, #4
        wrlong colour, tmp2
        wrlong buf, mailbox ' trigger request
        jmp #done
    loop
        cmp x1, x0 wz
        if_z cmp y1, y0 wz
        if_nz jmp #skipend
        test ylen wz
        if_nz call #dovert 'compute gfxfill params
        if_nz jmp #done
    skip2
        test xlen wz
        if_nz call #dohoriz
        if_nz jmp #done 'compute fill params
    skipend
        mov e2, err
        shl e2, #1
        cmps e2, dy wc
        if_c jmp #skip3
        test ylen wz
        if_nz call #dovert
        if_nz mov ylen, #0
        add err, dy
        add x0, sx
        add xlen, #1
    skip3
        cmps e2, dx wcz
        if_nc_and_nz jmp #loop
        test xlen wz
        if_nz call #dohoriz
        if_nz mov xlen, #0
        add err, dx
        add y0, sy
        add ylen, #1
        jmp #loop
    dohoriz
        mov tmp1, sx
        mul tmp1, xlen
        subr tmp1, x0
        shl tmp1, bpp
        add tmp1, buf
        mov tmp2, y0
        mul tmp2, width
        add tmp1, tmp2
        mov tmp2, bpp
        or tmp2, #$C
        setnib tmp1, tmp2, #7
        rdlong tmp2, mailbox
        tjs tmp2, #$-1 ' wait for mailbox to be empty
        mov tmp2, mailbox
        add tmp2, #8
        wrlong colour, tmp2
        sub tmp2, #4
        wrlong xlen, tmp2
        wrlong tmp1, mailbox
        ret
    dovert
        mov tmp1, sy
        mul tmp1, ylen
        subr tmp1, y0
        mul tmp1, width
        mov tmp2, x0
        shl tmp2, bpp
        add tmp1, buf
        add tmp1, tmp2
        mov tmp2, bpp
        or tmp2, #$c
        setnib tmp1, tmp2, #7
        rdlong tmp2, mailbox
        tjs tmp2, #$-1
        wrlong buf, list ' or use streamer here?
        add list, #4
        wrlong colour, list
        add list, #8
        wrlong dy, list
        sub list, #12
        mov tmp2, mailbox
        add tmp2, #4
        wrlong list, tmp2
        wrlong ##-1, mailbox
        ret
    done
        rdlong tmp1, mailbox ' wait for mailbox to be complete
        tjs tmp1, #$-1
    end
Just got it working under FlexSpin.
Result when drawing a bunch of different angled lines in both the original SPIN2 and my new PASM2 version:
spin2 ticks = 290342186
pasm2 ticks = 110185432
The inline PASM2 is about 2.6x faster, nice.
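For anyone wanting to reproduce this kind of comparison, a tick count can be taken with the Spin2 getct() built-in around the code under test. The harness below is a guess at the general shape, not the code that produced the figures above; timeLines and drawTestLines are hypothetical names, and the workload is a placeholder.

```spin2
' Hypothetical harness: timeLines and drawTestLines are illustrative names only.
PUB timeLines() : ticks | t0
    t0 := getct()               ' low 32 bits of the free-running system counter
    drawTestLines()
    ticks := getct() - t0       ' elapsed ticks; valid while the interval is below 2^32 ticks

PRI drawTestLines()
    waitms(1)                   ' stand-in workload; replace with the plotLine calls under test
```

Running the same harness once against the SPIN2 method and once against the inline PASM2 method gives two tick counts that can be compared directly.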
I imagine under PNut it would be even more of a difference.
Here is the method now (still not perfect but somewhat functional at least):
PUB plotLine3(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen, tmp1, tmp2, mailbox, list
    mailbox:=mem.getMailboxAddr(bus,cogid())
    list:=@listbuf
    listbuf[2]:=1 ' preallocate
    listbuf[3]:=$80000000
    listbuf[5]:=width
    listbuf[6]:=0
    listbuf[7]:=0
    org
        shr bpp, #4
        mov dx, x1
        sub dx, x0
        abs dx
        cmp x0, x1 wc
        if_c mov sx,#1
        if_nc neg sx,#1
        mov dy, y1
        sub dy, y0
        abs dy
        neg dy
        cmp y0, y1 wc
        if_c mov sy, #1
        if_nc neg sy, #1
        mov xlen, #0
        mov ylen, #0
        mov err, dx
        add err, dy
    testvertical
        test dx wz
        if_nz jmp #testhorizontal
        ' code to do gfxFill etc
        shl x0,bpp
        add buf, x0
        fles y0, y1
        mul y0, width
        add buf, y0
        subr dy, #1
        or bpp, #$c
        setnib buf, bpp, #7
        rdlong tmp2, mailbox
        tjs tmp2, #$-1 ' wait for mailbox to be empty
        wrlong buf, list
        add list, #4
        wrlong colour, list
        add list, #12
        wrlong dy, list
        sub list, #16
        mov tmp2, mailbox
        add tmp2, #4
        wrlong list, tmp2
        wrlong ##-1, mailbox
        jmp #done
    testhorizontal
        test dy wz
        if_nz jmp #loop
        ' code to do fill etc
        fles x0, x1
        shl x0, bpp
        add buf, x0
        mul y0, width
        add buf, y0
        add dx, #1
        or bpp, #$C
        setnib buf, bpp, #7
        rdlong tmp2, mailbox
        tjs tmp2, #$-1 ' wait for mailbox to be empty
        mov tmp2, mailbox
        add tmp2, #8
        wrlong dx, tmp2
        sub tmp2, #4
        wrlong colour, tmp2
        wrlong buf, mailbox ' trigger request
        jmp #done
    loop
        modc 0 wc ' c is used to determine where to "return" to in dovert/dohoriz
        cmp x1, x0 wz
        if_z cmp y1, y0 wz
        if_nz jmp #skipend
        test ylen wz
        if_nz jmp #dovert
    skip2
        test xlen wz
        if_nz jmp #dohoriz
    skipend
        mov e2, err
        shl e2, #1
        cmps e2, dy wc
        if_c jmp #skip3
        test ylen wz
        modc 15 wc
        if_nz jmp #dovert
    ret1
        if_nz mov ylen, #0
        add err, dy
        add x0, sx
        add xlen, #1
    skip3
        cmps e2, dx wcz
        if_nc_and_nz jmp #loop
        test xlen wz
        modc 15 wc
        if_nz jmp #dohoriz
    ret2
        if_nz mov xlen, #0
        add err, dx
        add y0, sy
        add ylen, #1
        jmp #loop
    dohoriz
        mov tmp1, sx
        mul tmp1, xlen
        subr tmp1, x0
        shl tmp1, bpp
        add tmp1, buf
        mov tmp2, y0
        mul tmp2, width
        add tmp1, tmp2
        mov tmp2, bpp
        or tmp2, #$C
        setnib tmp1, tmp2, #7
        rdlong tmp2, mailbox
        tjs tmp2, #$-1 ' wait for mailbox to be empty
        mov tmp2, mailbox
        add tmp2, #8
        wrlong xlen, tmp2
        sub tmp2, #4
        wrlong colour, tmp2
        wrlong tmp1, mailbox
        if_c jmp #ret2
        jmp #done
    dovert
        mov tmp1, sy
        mul tmp1, ylen
        subr tmp1, y0
        mul tmp1, width
        mov tmp2, x0
        shl tmp2, bpp
        add tmp1, buf
        add tmp1, tmp2
        mov tmp2, bpp
        or tmp2, #$c
        setnib tmp1, tmp2, #7
        rdlong tmp2, mailbox
        tjs tmp2, #$-1
        wrlong tmp1, list
        add list, #4
        wrlong colour, list
        add list, #12
        wrlong ylen, list
        sub list, #16
        mov tmp2, mailbox
        add tmp2, #4
        wrlong list, tmp2
        wrlong ##-1, mailbox
        if_c jmp #ret1
    done
        rdlong tmp1, mailbox ' wait for mailbox to be complete
        tjs tmp1, #$-1
    end
Are you sure? That's from Spin-Documentation:
Here's the internal Spin2 procedure for executing in-line PASM code:
1. Save the current streamer address for restoration after the PASM code executes.
2. Copy the method's first 16 long variables, including any parameters, return values, and local variables, from hub RAM to cog registers $1E0..$1EF.
3. Copy the in-line PASM-code longs from hub RAM into cog registers, starting at the ORG address (default is $000).
4. CALL the PASM code.
5. Restore the 16 longs in cog registers $1E0..$1EF back to hub RAM, in order to update any modified method variables.
6. Restore the streamer address and resume Spin2 bytecode execution.
As I understand, you could have persistent pasm code if you copy it at the end of the "inlinecode space", where it would not be overwritten.
@dnalor : It appears I misunderstood what the address in PNut's ORG meant. It does seem that PNut, like FlexSpin, loads the inline assembly code into local memory at run time, so different inline assembly does not have to share. Thank you for the correction.
@rogloh In my LCD graphics driver (a Quick Bytes article) I have draw and fill routines in Spin. It would be nice, if possible, to use the same order of calling parameters.
@Cluso99 Right now this API is just for my testing and internal use. Any final API would be different and hopefully simpler. It also only works for 8, 16, 32(24) bpp pixel modes, and likely still has bugs as I found a problem in some octants using this algorithm. Drawing in 1/2/4 bpp needs another approach too and gets quite a bit messier with read-modify-write instead of simple fill operations.
I'm also still considering including line drawing as a possible hub-exec extension/accelerator in my memory driver and want to compare the performance there as well. It would eliminate a new memory request to be made for every vertical or horizontal segment that makes up a line and could then offload the requesting COG quite a lot. It probably would not support the 1/2/4bpp modes though, at this stage.
@rogloh : BTW, you seriously need to update your flexspin compiler. The current version is 5.3.1 of March 2021.
That's fixed in github now, and will be in the upcoming 5.3.2 release.
LOL, well I have updated it, main problem is there are just about 5 different versions of it installed on my Mac at this time and I need to rationalize where I keep them all.
I just need to map my alias/path to invoke the "proper" one.
A big compatibility flag for inline-assembly: FlexSpin runs the code from LUT rather than COG memory, so no variables may be declared in the inline assembly. That is, instead of writing:
pub inc1000(x) : r
    org
        add x, thousand
        ret
    thousand long 1000
    end
    return x
you should instead write:
pub inc1000(x) : r | thousand
    thousand := 1000
    org
        add x, thousand
    end
    return x