Enhancing Spin2 with Inline Assembly - Let's Build a Code Cookbook

JonnyMac · 2020-08-15 14:41

Updates in the P2 architecture and instruction set resulted in a significant speed improvement over the P1 -- roughly 16x with the same code at the same clock speed. Still, there are processes that will require the precision of assembly. In the P1, we are forced to start a cog to use assembly. In the P2, we're able to use inline assembly, a feature long enjoyed by those using compilers.

For those wondering how inline assembly works with the Spin interpreter, let's look at the structure of a P2 method that uses inline assembly.

pub method(param1, param2) : result | local1, local2

  ' setup code (Spin)

  org

  ' assembly instructions

  end

  ' finish code (Spin)

When a method that uses inline assembly is encountered, any parameters, result variable(s), local variable(s), and the assembly code are copied into a reserved area of the Spin interpreter cog. If the assembly segment needs variables, they are defined as local variables of the method. When the assembly code finishes, the parameters, result(s), and local(s) are copied back to the hub. This allows pre- and post-assembly work with these variables. You can think of the process as temporarily adding a command to the Spin interpreter.
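As a minimal sketch of that round trip, here is a hypothetical method (the name add_scaled and its arithmetic are made up for illustration) in which a parameter, a local, and the result are all used as plain registers inside the assembly section:

```spin2
pub add_scaled(a, b) : result | tmp

'' Hypothetical example: returns (a * 2) + b

  org
                mov       tmp, a                                ' tmp is a local, used here as a cog register
                shl       tmp, #1                               ' tmp = a * 2
                add       tmp, b                                ' tmp = (a * 2) + b
                mov       result, tmp                           ' result is copied back to the hub at end
  end
```

Calling add_scaled(10, 3) from Spin should return 23; the caller never needs to know the work was done in assembly.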

Please join me in this thread by sharing your inline assembly snippets. Let's build a code cookbook to help each other and those new to the Propeller 2.

Tips:
-- Keep your code short and neatly formatted; if it's easy on the eyes, it will be easier to follow.
-- Comments are always a good thing; more is better.
-- Include an archive with a demo that shows off your cool code.

JonnyMac · 2020-08-15 14:52

Here's an excellent example of using inline assembly in the P2. Let's say you want to build a convention badge with the P2 (I do), and you'd like to populate the badge with a few WS2812b LED pixels. While you could use a standard driver (e.g., jm_rgbx_pixel.spin2), with so few LEDs you could easily update the LEDs from your mainline code and save a cog.

con

  WSTR = US_001 *   50                                          ' reset timing
  WST0 = US_001 *  400 / 1000 - 6                               ' 0 bit timing
  WST1 = US_001 *  800 / 1000 - 6                               ' 1 bit timing
  WSTC = US_001 * 1250 / 1000                                   ' cycle ticks @ 800kHz


pub ws2812b(pin, count, p_colors) | outval, rgswap, tcycle

'' Update WS2812b strip on pin
'' -- count is # of LEDs on strip
'' -- p_colors is pointer to colors (array of longs)
''    * uses $RR_GG_BB_00 color format

  org
                drvl      pin                                   ' make pin output and low
                waitx     ##WSTR                                ' allow reset
                                                                 
led_loop        rdlong    outval, p_colors                      ' get color
                add       p_colors, #4                          ' point to next
                                                                 
                mov       rgswap, outval                        ' make copy of outval
                shr       rgswap, #16                           ' rgswap = 00_00_RR_GG
                setbyte   outval, rgswap, #3                    ' outval.byte[3] <-- rgswap.byte[0] g
                shr       rgswap, #8                            ' rgswap = 00_00_00_RR
                setbyte   outval, rgswap, #2                    ' outval.byte[2] <-- rgswap.byte[0] r
                 
                getct     tcycle                                ' start cycle timer
                 
                rep       @.bitz, #24                           ' 8 bits x 3 colors
                 rol      outval, #1                    wc      ' get MSB into C
                 drvh     pin                                   ' pin on
    if_nc        waitx    ##WST0                                ' hold for bit timing
    if_c         waitx    ##WST1                                 
                 drvl     pin                                   ' pin off
                 addct1   tcycle, ##WSTC                        ' update cycle timer
                 waitct1                                        ' let cycle finish
.bitz                                                            
                                                                 
                djnz      count, #led_loop                      ' next pixel
  end

Note that the inline assembly is also able to use constants defined in Spin.
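For anyone who wants to try the method, a calling sketch might look like the following. It assumes ws2812b() and the WS* constants above are pasted into the same object; the pin number, strip length, and color are placeholder values, and US_001 (ticks per microsecond) is defined the same way as in the timing block later in this thread.

```spin2
con { demo settings -- placeholder values }

  CLK_FREQ = 200_000_000                                        ' assumed system clock
  _clkfreq = CLK_FREQ

  US_001   = CLK_FREQ / 1_000_000                               ' ticks per microsecond

  LED_PIN  = 16                                                 ' assumed pixel data pin
  N_LEDS   = 4                                                  ' assumed number of pixels


var

  long  colors[N_LEDS]                                          ' one $RR_GG_BB_00 long per pixel


pub main() | i

  repeat i from 0 to N_LEDS-1
    colors[i] := $20_00_00_00                                   ' dim red

  ws2812b(LED_PIN, N_LEDS, @colors)                             ' refresh the strip

  repeat                                                        ' hold here
```

At 200MHz the constants work out to WSTR = 10_000, WST0 = 74, WST1 = 154, and WSTC = 250 ticks.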

JonnyMac · 2020-08-15 15:48

Another great aspect of inline assembly with Spin2 is that it allows us to experiment with small sections of assembly in a controlled environment.

Here's a working Spin2 method that duplicates the functionality of the PBASIC PULSIN command (with a little better resolution):

pub pulsin(pin, state) : result | t0

  state &= 1                                                    ' isolate to state.bit0
  repeat until (pinr(pin) <> state)                             ' wait for idle state  

  repeat   
    t0 := getct()                                               ' start timing
  while (pinr(pin) <> state)                                    '  at start of pulse

  repeat 
    result := getct()                                           ' stop timing
  while (pinr(pin) == state)                                    '  at end of pulse
  
  result := ((result - t0) + (US_001 >> 1)) / US_001            ' round to nearest microsecond

While this works, we may try our hand at PASM2 by duplicating the heart of the method in assembly. Working with a small bit of assembly in a method like this is easier than launching a cog. Here's my final assembly version.

pub pulsin(pin, state) : result | t0

'' Measures pulse on pin in microseconds
'' -- WARNING: Blocks and does not have timeout
'' -- pin..... the input pin receiving the pulse
'' -- state... the target state of the pulse
''             * 0 for high-low-high
''             * 1 for low-high-low

  org
                fltl      pin                                   ' make pin an input
                testb     state, #0                     wc      ' state.0 --> C

waitidle        testp     pin                           wz      ' pin level --> Z
    if_z_eq_c   jmp       #waitidle                             ' hold for idle state

edge1           getct     t0                                    ' time at change to state
                testp     pin                           wz      ' pin level --> Z          
    if_z_ne_c   jmp       #edge1                                ' wait for pin to match state

edge2           getct     result                                ' time at change back to idle
                testp     pin                           wz      ' pin level --> Z            
    if_z_eq_c   jmp       #edge2                                ' wait for pin to be idle

                sub       result, t0                            ' difference is pulse width
  end 

  return (result + (US_001 >> 1)) / US_001                      ' round to nearest microsecond

JonnyMac · 2020-08-15 16:58

Here's one that I have -- for the moment -- because GETMS() dropped out of Spin2 (it's supposed to come back).

con { timing }

  CLK_FREQ = 200_000_000

  US_001   = CLK_FREQ / 1_000_000                               ' ticks per microsecond
  MS_001   = CLK_FREQ / 1_000                                   ' ticks per millisecond


pub get_ms() : ms | lo, hi

'' Return milliseconds after reset.
'' -- system counter is fixed; cannot be changed by user

  org
                getct     hi                            wc      ' get cnt (now)
                getct     lo                               
                setq      hi                                    ' divide cnt by ticks/ms
                qdiv      lo, ##MS_001                               
                getqx     ms
  end

This is the equivalent of the millis() function used in the Arduino ecosystem. I run my projects at 200MHz, hence I'm able to use a constant for ticks/ms.
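As with millis(), the typical use is measuring elapsed time without blocking on a single long delay. The sketch below is a hypothetical helper (blink_timed is not part of any library) and assumes get_ms() above is in the same object:

```spin2
pub blink_timed(pin, duration) | t0

'' Hypothetical example: toggles pin for about duration milliseconds

  t0 := get_ms()                                                ' mark the start time
  repeat while (get_ms() - t0) < duration                       ' elapsed < duration?
    pintoggle(pin)                                              ' flip the output
    waitms(100)                                                 ' short delay between toggles
  pinlow(pin)                                                   ' leave pin low when done
```

Working with the difference between two readings, rather than comparing against an absolute target, is the same pattern commonly used with millis().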

Rayman · 2020-08-15 17:24

I'd start with just turning the LEDs on and off on the Eval board.

Inline assembly makes it easy to learn PASM2.

And, turning on LEDs is always fun...

JonnyMac · 2020-08-15 17:34

The waitms() and waitus() instructions are very useful, but they have an upper time limit that can be defined by:

maximum_delay = 2 ^ 31 / clkfreq

...in seconds, which works out to ~10.737 seconds at 200MHz. That means if you do this:

  waitms(30_000)

...you will not get the delay you expect. Here's a routine called pause() that will let us do long delays if we really want to.
Edit: This is a far more elegant solution, as suggested by @Wuerfel_21 (thank you!)

pub pause(ms)

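'' Delay for ms milliseconds
'' -- ms must be > 0; REP with a count of 0 repeats forever

  org
                rep       @.msloop, ms                          ' run the loop ms times
                 waitx    ##MS_001-4                            ' 1ms, less 4 ticks of instruction overhead
.msloop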
  end

Note, again, that this code uses a constant for the number of ticks in a millisecond.
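For comparison, a long delay can also be sketched in plain Spin2 by splitting it into chunks that stay under the limit. This is a different technique from the REP loop above, not a replacement for it; safe_waitms is a hypothetical name, and the code assumes the 2^31-tick limit described earlier and a clock frequency well above 1kHz. The Spin loop overhead makes it slightly less precise than the assembly version.

```spin2
pub safe_waitms(ms) | limit

'' Hypothetical example: delay for ms milliseconds in chunks that waitms() can handle

  limit := $7FFF_FFFF / (clkfreq / 1000)                        ' longest safe single wait, in ms

  repeat while ms > limit                                       ' burn off full-size chunks
    waitms(limit)
    ms -= limit

  if ms > 0                                                     ' finish with the remainder
    waitms(ms)
```

At 200MHz the computed limit is 10_737 ms, so safe_waitms(30_000) would make two full-length waits followed by one of 8_526 ms.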

Wuerfel_21 · 2020-08-15 17:47

JonnyMac wrote: »
The waitms() and waitus() instructions are very useful, but have an upper end time limit that can be defined by:
maximum_delay = 2 ^ 31 / clkfreq
...in seconds, which works out to ~10.737 seconds. That means if you do this:
  waitms(30_000)
...you will not get the delay you expect. Here's a routine called pause() that will let us do long delays if we really want to.
pub pause(ms) | t0

'' Delay in milliseconds
'' -- for long delays (>10s)
''    * use waitms() for delays < 10s

  org
                getct     t0                                    ' snapshot counter
                rep       @.msloop, ms                          ' delay
                 addct1   t0, ##MS_001 
                 waitct1
.msloop
  end
Note, again, that this code uses a constant for the number of ticks in a millisecond.

I think you could write it slightly better as

pub pause(ms) | t0

'' Delay in milliseconds
'' -- for long delays (>10s)
''    * use waitms() for delays < 10s

  org
                rep       @.msloop, ms
                 waitx    ##MS_001-4
.msloop
  end

ersmith · 2020-08-15 19:08

I don't want to sound cynical, but if ever there was a problem that did not cry out for optimization in assembler, it's "waiting for more than 10 seconds"!

JonnyMac · 2020-08-15 19:28

Rayman wrote: »

I'd start with just turning the LEDs on and off on the Eval board.
Inline assembly makes it easy to learn PASM2.
And, turning on LEDs is always fun...

To Ray's point, if you really wanted to duplicate the HIGH and LOW instructions from PBASIC, you could. As he points out, twiddling an output (especially one with an LED) is a great place to start learning.

pub high(pin)

'' Make pin output and high

  org
                drvh      pin
  end


pub low(pin)

'' Make pin output and low

  org
                drvl      pin
  end
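Building on those two, a blinker is a natural next step. The sketch below adds a hypothetical toggle() method using the DRVNOT instruction, plus a small Spin wrapper; the method names and the 250 ms timing are arbitrary choices for illustration.

```spin2
pub toggle(pin)

'' Make pin output and invert its level (hypothetical companion to high/low)

  org
                drvnot    pin                                   ' drive pin, flip output state
  end


pub blink(pin, count)

'' Blink an LED on pin count times

  repeat count
    toggle(pin)                                                 ' first half of the blink
    waitms(250)
    toggle(pin)                                                 ' back to the starting level
    waitms(250)
```

Whether the LED starts lit or dark depends on how it is wired to the pin.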

When I attended the Skip Barber Racing School they would tell us, "Slow in is fast out." This was referring to setting up for a corner -- exit speed from the corner is far more important than entry speed. That is to say, start slowly/simply and work your way toward complex code.

JonnyMac · 2020-08-15 19:34

I think you could write it slightly better as...

Thank you, Ada, you're absolutely right! As humans we sometimes hang onto our first idea and close our eyes to a more elegant solution.

MagIO2 · 2021-03-01 20:06

@JonnyMac said:
Here's one that I have -- for the moment -- because GETMS() dropped out of Spin2 (it's supposed to come back).

con { timing }

  CLK_FREQ = 200_000_000

  US_001   = CLK_FREQ / 1_000_000                               ' ticks per microsecond
  MS_001   = CLK_FREQ / 1_000                                   ' ticks per millisecond


pub get_ms() : ms | lo, hi

'' Return milliseconds after reset.
'' -- system counter is fixed; cannot be changed by user

  org
                getct     hi                            wc      ' get cnt (now)
                getct     lo
                setq      hi                                    ' divide cnt by ticks/ms
                qdiv      lo, ##MS_001
                getqx     ms
  end

This is the equivalent of the millis() function used in the Arduino ecosystem. I run my projects at 200MHz hence am able to use a constant for ticks/ms.

I'm just taking my first steps in reading the PASM2 documents, which led me to this thread and a question:
Isn't there a chance of getting a wrong reading when doing a getct for the high long and then another getct for the low long?
For example, getct high reads $0000_0000 and getct low reads $0000_0000. This might mean that ct(h) was read just 2 ticks before ct(l) rolled over, so the correct result would actually be $0000_0001_0000_0000.

This means that it is dangerous to rely on the result of your division, as it might be off by (2^32 - 1) / 200MHz every once in a while.

Wuerfel_21 · 2021-03-01 20:25

@MagIO2 said:

I'm just taking my first steps in reading the PASM2 documents, which led me to this thread and a question:
Isn't there a chance of getting a wrong reading when doing a getct for the high long and then another getct for the low long?
For example, getct high reads $0000_0000 and getct low reads $0000_0000. This might mean that ct(h) was read just 2 ticks before ct(l) rolled over, so the correct result would actually be $0000_0001_0000_0000.

This means that it is dangerous to rely on the result of your division, as it might be off by (2^32 - 1) / 200MHz every once in a while.

That actually doesn't happen. I think the high long is incremented if the low long is about to overflow or something.

AJL · 2021-03-01 21:38

Chip designed the hardware to account for low word rollover. I’m on my phone, so can’t find where it states it in the documentation.

My recollection is that a GETCT WC latches the state of both halves and stalls interrupts for one instruction to protect a following GETCT.
There may be a fix-up as she states, but either way, the error you are concerned about is prevented by Chip’s thoughtful design. Just one more thing that makes the P2 such a great processor.