Enhancing Spin2 with Inline Assembly - Let's Build a Code Cookbook
JonnyMac
Updates in the P2 architecture and instruction set resulted in a significant speed improvement over the P1 -- roughly 16x with the same code at the same clock speed. Still, there are processes that will require the precision of assembly. In the P1, we are forced to start a cog to use assembly. In the P2, we're able to use inline assembly, a feature long enjoyed by those using compilers.
For those wondering how inline assembly works with the Spin interpreter, let's look at the structure of a P2 method that uses inline assembly.
pub method(param1, param2) : result | local1, local2

  ' setup code (Spin)

  org
    ' assembly instructions
  end

  ' finish code (Spin)

When a method that uses inline assembly is encountered, any parameters, result variable(s), local variable(s), and the assembly code are moved into a reserved area of the Spin interpreter cog. If the assembly segment needs variables, they are defined as local variables of the method. When the assembly code is finished, the parameters, result(s), and local(s) are moved back to the hub. This allows pre- and post-assembly work with these variables. You can think of the process as temporarily adding a command to the Spin interpreter.
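As a concrete instance of the structure above, here is a minimal sketch. The method name scale_add and its arithmetic are made up for illustration; the only point is to show a parameter, a local, and the result variable being used as registers between org and end.

```spin2
pub scale_add(a, b) : result | tmp

'' Returns (a * 2) + b
'' -- hypothetical example; name and math are illustrative only

  org
                mov       tmp, a                        ' copy parameter a into local tmp
                shl       tmp, #1                       ' tmp = a * 2
                add       tmp, b                        ' tmp = (a * 2) + b
                mov       result, tmp                   ' pass value back through result
  end
```

Called from Spin2 as scale_add(10, 5), this should return 25. The local tmp is not strictly needed here (the math could be done directly in result); it is included only to show a local being used as a scratch register.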
Please join me in this thread by sharing your inline assembly snippets. Let's build a code cookbook to help each other and those new to the Propeller 2.
Tips:
-- Keep your code short, and neatly formatted; if it's easy on the eyes, it will be easier to follow.
-- Comments are always a good thing; more is better
-- Include an archive with a demo that shows off your cool code.

Comments
con

  WSTR = US_001 * 50                                    ' reset timing
  WST0 = US_001 * 400 / 1000 - 6                        ' 0 bit timing
  WST1 = US_001 * 800 / 1000 - 6                        ' 1 bit timing
  WSTC = US_001 * 1250 / 1000                           ' cycle ticks @ 800kHz


pub ws2812b(pin, count, p_colors) | outval, rgswap, tcycle

'' Update WS2812b strip on pin
'' -- count is # of LEDs on strip
'' -- p_colors is pointer to colors (array of longs)
''    * uses $RR_GG_BB_00 color format

  org
                drvl      pin                           ' make pin output and low
                waitx     ##WSTR                        ' allow reset

led_loop        rdlong    outval, p_colors              ' get color
                add       p_colors, #4                  ' point to next
                mov       rgswap, outval                ' make copy of outval
                shr       rgswap, #16                   ' rgswap = 00_00_RR_GG
                setbyte   outval, rgswap, #3            ' outval.byte[3] <-- rgswap.byte[0] g
                shr       rgswap, #8                    ' rgswap = 00_00_00_RR
                setbyte   outval, rgswap, #2            ' outval.byte[2] <-- rgswap.byte[0] r

                getct     tcycle                        ' start cycle timer
                rep       @.bitz, #24                   ' 8 bits x 3 colors
                 rol      outval, #1            wc      ' get MSB into C
                 drvh     pin                           ' pin on
    if_nc        waitx    ##WST0                        ' hold for bit timing
    if_c         waitx    ##WST1
                 drvl     pin                           ' pin off
                 addct1   tcycle, ##WSTC                ' update cycle timer
                 waitct1                                ' let cycle finish
.bitz
                djnz      count, #led_loop              ' next pixel
  end

Note that the inline assembly is also able to use constants defined in Spin.

Here's a working Spin2 method that duplicates the functionality of the PBASIC PULSIN command (with a little better resolution):
pub pulsin(pin, state) : result | t0

  state &= 1                                            ' isolate to state.bit0

  repeat until (pinr(pin) <> state)                     ' wait for idle state

  repeat
    t0 := getct()                                       ' start timing
  while (pinr(pin) <> state)                            ' at start of pulse

  repeat
    result := getct()                                   ' stop timing
  while (pinr(pin) == state)                            ' at end of pulse

  result := ((result - t0) + (US_001 >> 1)) / US_001    ' round to nearest microsecond

While this works, we may try our hand at PASM2 by duplicating the heart of the method in assembly. Working with a small bit of assembly in a method like this is easier than launching a cog. Here's my final assembly version.
pub pulsin(pin, state) : result | t0

'' Measures pulse on pin in microseconds
'' -- WARNING: Blocks and does not have timeout
'' -- pin..... the input pin receiving the pulse
'' -- state... the target state of the pulse
''    * 0 for high-low-high
''    * 1 for low-high-low

  org
                fltl      pin                           ' make pin an input
                testb     state, #0             wc      ' state.0 --> C

waitidle        testp     pin                   wz      ' pin level --> Z
    if_z_eq_c   jmp       #waitidle                     ' hold for idle state

edge1           getct     t0                            ' time at change to state
                testp     pin                   wz      ' pin level --> Z
    if_z_ne_c   jmp       #edge1                        ' wait for pin to match state

edge2           getct     result                        ' time at change back to idle
                testp     pin                   wz      ' pin level --> Z
    if_z_eq_c   jmp       #edge2                        ' wait for pin to be idle

                sub       result, t0                    ' difference is pulse width
  end

  return (result + (US_001 >> 1)) / US_001              ' round to nearest microsecond

con { timing }

  CLK_FREQ = 200_000_000
  US_001   = CLK_FREQ / 1_000_000                       ' ticks per microsecond
  MS_001   = CLK_FREQ / 1_000                           ' ticks per millisecond


pub get_ms() : ms | lo, hi

'' Return milliseconds after reset.
'' -- system counter is fixed; cannot be changed by user

  org
                getct     hi                    wc      ' get cnt (now)
                getct     lo
                setq      hi                            ' divide cnt by ticks/ms
                qdiv      lo, ##MS_001
                getqx     ms
  end

This is the equivalent of the millis() function used in the Arduino ecosystem. I run my projects at 200MHz, hence I am able to use a constant for ticks/ms.

Inline assembly makes it easy to learn PASM2.
And, turning on LEDs is always fun...
Edit: This is a far more elegant solution, as suggested by @Wuerfel_21 (thank you!)
pub pause(ms)

'' Delay for ms milliseconds

  org
                rep       @.msloop, ms
                 waitx    ##MS_001-4
.msloop
  end

Note, again, that this code uses a constant for the number of ticks in a millisecond.

I think you could write it slightly better as
pub pause(ms) | t0

'' Delay in milliseconds
'' -- for long delays (>10s)
''    * use waitms() for delays < 10s

  org
                rep       @.msloop, ms
                 waitx    ##MS_001 -4
.msloop
  end

To Ray's point, if you really wanted to duplicate the HIGH and LOW instructions from PBASIC, you could -- as he points out, twiddling an output (especially one with an LED) is a great place to start learning.
pub high(pin)

'' Make pin output and high

  org
                drvh      pin
  end


pub low(pin)

'' Make pin output and low

  org
                drvl      pin
  end

When I attended the Skip Barber Racing School they would tell us, "Slow in is fast out." This was referring to setting up for a corner -- exit speed from the corner is far more important than entry speed. That is to say, start slowly/simply and work your way toward complex code.

Just doing my first steps in reading PASM2 documents, which led me to this thread and a question:
Isn't there a chance to get a wrong reading when doing a getct for the high long and then another getct for the low long?
For example, getct for the high long reads $0000_0000 and getct for the low long reads $0000_0000. This might mean that ct(h) was read just 2 ticks before ct(l) ran into a rollover. So, the correct result would actually be $0000_0001_0000_0000.
This means that it is dangerous to rely on the result of your division, as it might be off by (2^32 - 1) / 200 MHz (roughly 21.5 seconds) every once in a while.
That actually doesn't happen. I think the high long is incremented if the low long is about to overflow or something.
Chip designed the hardware to account for low word rollover. I’m on my phone, so can’t find where it states it in the documentation.
My recollection is that a GETCT WC latches the state of both halves and stalls interrupts for one instruction to protect a following GETCT.
There may be a fix-up as she states, but either way, the error you are concerned about is prevented by Chip’s thoughtful design. Just one more thing that makes the P2 such a great processor.
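For anyone curious what the rollover concern looks like in code, here is a minimal sketch of the generic double-read guard used on systems that lack such hardware protection. Per the replies above, the P2 should not need it; the method name get_ct64 and the local chk are hypothetical and only illustrate the idea of re-reading the upper long and retrying if it changed.

```spin2
pub get_ct64() : lo, hi | chk

'' Returns 64-bit system counter as two longs
'' -- hypothetical example of a generic double-read guard
'' -- per the replies above, the P2 should not need this

  org
.retry          getct     hi                    wc      ' read upper long
                getct     lo                            ' read lower long
                getct     chk                   wc      ' read upper long again
                cmp       chk, hi               wz      ' upper long unchanged?
    if_nz       jmp       #.retry                       ' no, lower long rolled over; read again
  end
```

The retry path would only be taken when the lower long rolls over between the first and third reads, so the extra cost is three instructions in the common case. From Spin2 the two longs could be collected with a multiple assignment such as lo, hi := get_ct64().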