Best cross-toolchain way to do inline assembly?
I have a SPIN2 function to plot lines using my video and memory driver that is currently being developed and I was wondering what is the best way to convert it into inline/callable PASM that would work with both FlexProp and PropTool/PNut toolchains in order to speed things up more.
PUB plotLine(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen
    dx := abs(x1-x0)
    sx := (x0<x1) ? 1 : -1
    dy := -abs(y1-y0)
    sy := (y0<y1) ? 1 : -1
    xlen := 0
    ylen := 0
    err := dx+dy
    if dx == 0 ' vertical line optimization
        mem.gfxFill(buf+x0<<(bpp>>4)+ (y0<#y1)*width, width, 1, 1-dy, colour, 0, bpp>>3)
        return
    if dy == 0 ' horizontal line optimization
        mem.fill(buf+((x0<#x1)<<(bpp>>4))+ y0*width, colour, 1+dx, 0, bpp>>3)
        return
    repeat ' continue until final co-ordinates are reached
        if (x0 == x1 && y0 == y1)
            if ylen
                mem.gfxFill(buf+x0<<(bpp>>4)+(y0-(sy*ylen))*width, width, 1, ylen, colour, 0, bpp>>3)
            if xlen
                mem.fill(buf+(x0-sx*xlen)<<(bpp>>4)+y0*width, colour, xlen, 0, bpp>>3)
            return
        e2 := err<<1
        if (e2 >= dy)
            if ylen
                mem.gfxFill(buf+x0<<(bpp>>4)+(y0-(sy*ylen))*width, width, 1, ylen, colour, 0, bpp>>3)
                ylen:=0
            err += dy
            x0 += sx
            xlen++
        if (e2 <= dx)
            if xlen
                mem.fill(buf+(x0-sx*xlen)<<(bpp>>4)+y0*width, colour, xlen, 0, bpp>>3)
                xlen:=0
            err += dx
            y0 += sy
            ylen++
If I convert this code to PASM2, I know I can change the existing method calls to both mem.fill(...) and mem.gfxFill(...) to instead write directly to my memory mailbox, so plotLine won't have to call any deeper down SPIN2 code, meaning this entire method could be completely written in PASM2 for the fastest drawing speed. I imagine it would probably fit into the $12F longs allowed by PropTool for full COG exec.
But how can/should the code be written to work in both toolchains?
It looks like PropTool SPIN2 is fairly well documented as to what is allowed, but can FlexProp also support COGRAM execution like PNut, or is it HUB-exec only? Can I simply start writing the type of code shown below, and can the symbol names of the locals and parameters be retained? Or do I need to do it another way using the PR0-PR7 names etc? What type of inline PASM2 coding actually works with both tools as a sort of lowest common denominator? Has anyone figured this out already? @ersmith basically what is the state of play with inline PASM2 in comparison with what PropTool does? Is this fully documented anywhere as to how it all works and what P2 resources I can use while executing inline PASM2? Am I just worrying unnecessarily here and should I just code it up fully and try it out?
PUB plotLine(buf, width, x0, y0, x1, y1, colour, bpp) | dx, sx, dy, sy, err, e2, xlen, ylen
    org
            mov     dx, x1
            sub     dx, x0
            abs     dx
            cmp     x0, x1 wc
    if_c    mov     sx, #1
    if_nc   neg     sx, #1
            mov     dy, y1
            sub     dy, y0
            abs     dy
            neg     dy
            cmp     y0, y1 wc       ' wc needed so that C reflects this compare
    if_c    mov     sy, #1
    if_nc   neg     sy, #1
            mov     xlen, #0
            mov     ylen, #0
            mov     err, dx
            add     err, dy
            test    dx wz
    if_z    jmp     #xx
            ' some code to do gfxFill etc
            test    dy wz
    if_z    jmp     #zz
            ' some code to do fill etc
            ...
    end
Comments
Roger, the flexspin inline assembly documentation is in general.md (since it applies to all 3 languages). There's some discussion of the difference between flexspin and PNut in the spin.md file as well. ORG/END blocks are automatically copied to LUT before execution. ORG with an address is not supported by flexspin, but plain ORG should generally work fine. The code you wrote above looks like it should work.
As I understand it, in PNut the inline assembly is resident in the COG (so the total amount of inline assembly must not exceed the area left for it by the interpreter) and the variables are copied from the stack into internal memory before running the assembly, and then copied back afterwards. In flexspin it's the reverse: the variables are already in COG memory, but the code has to be copied into LUT before execution. This means that flexspin lets you have an arbitrary number of inline assembly blocks, but there's a bit more latency before starting the asm (at least when there is more code than data). It is possible to "fix" a function in LUT or COG memory in flexspin by using {++cog} or {++lut} in its declaration, but that's best used extremely sparingly.
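As a rough illustration of the {++cog} / {++lut} markers mentioned above, here is a minimal, untested sketch. The method name and body are made up for the example, and the placement of the marker directly after PUB is my reading of the flexspin documentation, so check general.md before relying on it. Since {...} is an ordinary comment in Spin2, PNut should simply ignore the marker and compile the method as normal.

```spin2
' Hypothetical example: ask flexspin to keep this small method resident in LUT.
' Keep such methods tiny and avoid calling other methods from them.
PUB {++lut} clampByte(x) : r
  r := x #> 0 <# 255            ' limit x to the range 0..255
```

The method is called like any other (for example `y := clampByte(x)`); only where flexspin stores its code changes.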
Sounds good Eric, thanks for the explanation.
If it executes from LUTRAM, is the size of an ORG block in FlexSpin then limited to 512 longs, or is it something less than this? I think my code is hopefully going to fit. I am up to 114 longs in total and have almost finished coding. I don't imagine it will reach more than about 200 longs even if I add some clipping stuff to it.
Can I use the PA, PB, PTRA and PTRB registers as temporary registers, or are they already used by FlexSpin? If they are, can I just push them? Also, can I do a CALL with the HW stack (one level deep)? How much of the HW stack can I use?
256 longs, actually. This is slightly less than PNut's limit of $130 longs, but on the other hand the PNut limit is the total for all ORG blocks in the program and its objects, whereas the FlexSpin limit is per inline assembly block. (Please do keep in mind that a lot of objects use inline assembly, so it would be a bad idea to use up too much of PNut's space in any one object.) (EDIT: this may not actually be right, I think I misread the docs -- PNut, like FlexSpin, loads the assembly at run time so the limit is not global.)
As mentioned in the docs, PTRA must be saved/restored. It'd probably be a good idea to save/restore PTRB too just in case someone else is using it.
No, CALL will not work. RET has to be handled specially in ORG/END blocks because of the difference between PNut and FlexSpin, so don't try to use it yourself. The HW stack itself is OK to use, there will only be one level active when your code is entered.
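A minimal sketch of the pattern described above: save PTRA in a method local, borrow it as a hub pointer, and restore it before the block ends, using only a relative branch (no CALL or RET). The method name, its parameters and the summing loop are hypothetical, the code is untested on either toolchain, and it assumes count is at least 1.

```spin2
' Hypothetical example: sum `count` longs starting at hub address `addr`.
PUB sumLongs(addr, count) : total | savedA, tmp
  org
            mov     savedA, ptra        ' keep a copy of PTRA in a local
            mov     ptra, addr          ' borrow PTRA as a hub pointer
            mov     total, #0
nextlong    rdlong  tmp, ptra++         ' read one long, PTRA advances by 4
            add     total, tmp
            djnz    count, #nextlong    ' relative branch, no CALL/RET involved
            mov     ptra, savedA        ' restore PTRA before leaving the block
  end
```

PTRB can be saved and restored the same way with a second local if the block needs it. The method uses five parameters, results and locals in total, which stays well inside the 16-long limit raised later in the thread for PNut.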
Just out of curiosity. Is there a memory map what is stored in the other 256 longs of the LUT?
There's a memory map in the general.md file in the documentation. In fact nothing is stored in the other 256 LUT longs, they're left available for the user (e.g. to use as a LUT, or to put specific functions in by declaring them as {++lut}).
Thanks again for the further info Eric. LOL, I just found out the hard way about CALL not working...looks like I need to unroll my two copies. This is not ideal if it is going to get so large as to consume the bulk of the resource. I thought PNut would load in the code like your FCACHE method, didn't realize it was a fixed area for ALL inline PASM2. This is going to be troublesome to get it to fit if there is other inline code, plus I use just over 16 locals+params which I think PNUT can't support.
I also found your inline PASM doesn't like this syntax either:
Propeller Spin/PASM Compiler 'FastSpin' (c) 2011-2020 Total Spectrum Software Inc.
Version 4.3.1 Compiled on: Sep 1 2020
hdgfx.spin2
|-p2videodrv.spin2
|-SmartSerial.spin2
|-ers_fmt.spin2
|-memory.spin2
|-|-hyperdrv.spin2
|-|-psramdrv.spin2
|-|-ers_fmt.spin2
error: call/return from fcached code not supported
error: call/return from fcached code not supported
Just got it working under FlexSpin.
Result when drawing a bunch of different angled lines in both the original SPIN2 and my new PASM2 version:
The inline PASM2 is almost 3x faster, nice. I imagine under PNut it would be even more of a difference.
Here is the method now (still not perfect but somewhat functional at least):
Are you sure? That's from Spin-Documentation:
As I understand, you could have persistent pasm code if you copy it at the end of the "inlinecode space", where it would not be overwritten.
@dnalor : It appears I misunderstood what the address in PNut's ORG meant. It does seem that PNut, like FlexSpin, loads the inline assembly code into local memory at run time, so different inline assembly does not have to share. Thank you for the correction.
@rogloh
In my LCD graphics driver (a Quick Bytes article) I have draw and fill routines in Spin. It would be nice, if possible, to use the same order of calling parameters.
@Cluso99 Right now this API is just for my testing and internal use. Any final API would be different and hopefully simpler. It also only works for 8, 16, 32(24) bpp pixel modes, and likely still has bugs as I found a problem in some octants using this algorithm. Drawing in 1/2/4 bpp needs another approach too and gets quite a bit messier with read-modify-write instead of simple fill operations.
I'm also still considering including line drawing as a possible hub-exec extension/accelerator in my memory driver and want to compare the performance there as well. It would eliminate the need to make a new memory request for every vertical or horizontal segment that makes up a line, and could then offload the requesting COG quite a lot. It probably would not support the 1/2/4bpp modes though, at this stage.
@rogloh : BTW, you seriously need to update your flexspin compiler :
The current version is 5.3.1 of March 2021.
That's fixed in github now, and will be in the upcoming 5.3.2 release.
LOL, well I have updated it, main problem is there are just about 5 different versions of it installed on my Mac at this time and I need to rationalize where I keep them all. I just need to map my alias/path to invoke the "proper" one.
A big compatibility flag for inline-assembly: FlexSpin runs the code from LUT rather than COG memory, so no variables may be declared in the inline assembly. That is, instead of writing:
you should instead write:
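A hypothetical illustration of the difference (the method name and the arithmetic are made up, and the code is untested): the scratch variable is declared as a method local rather than reserved inside the ORG/END block, and the form to avoid is shown only as a comment.

```spin2
' Portable form: tmp is a method local, so it is available to the inline code.
PUB timesFive(x) : r | tmp
  org
            mov     tmp, x
            shl     tmp, #2             ' tmp := x * 4
            add     tmp, x              ' tmp := x * 5
            mov     r, tmp
  end

' Form to avoid, shown as a comment only: storage reserved inside the block.
'   org
'           mov     tmp, x
'           ...
' tmp       long    0                   ' data declared inside ORG/END
'   end
```

Keeping parameters, results and locals to 16 longs or fewer should also help under PNut, going by the limit mentioned earlier in the thread.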