Enhancing Spin2 with Inline Assembly - Let's Build a Code Cookbook — Parallax Forums


JonnyMac Posts: 9,102
edited 2020-08-15 16:43 in Propeller 2
Updates in the P2 architecture and instruction set resulted in a significant speed improvement over the P1 -- roughly 16x with the same code at the same clock speed. Still, there are processes that will require the precision of assembly. In the P1, we are forced to start a cog to use assembly. In the P2, we're able to use inline assembly, a feature long enjoyed by those using compilers.

For those wondering how inline assembly works with the Spin interpreter, let's look at the structure of a P2 method that uses inline assembly.
pub method(param1, param2) : result | local1, local2

  ' setup code (Spin)

  org

  ' assembly instructions

  end

  ' finish code (Spin)
When a method that uses inline assembly is encountered, any parameters, result variable(s), local variable(s), and the assembly code are moved into a reserved area of the Spin interpreter cog. If the assembly segment needs variables, they are defined as local variables of the method. When the assembly code is finished, the parameters, result(s), and local(s) are moved back to the hub. This allows pre- and post-assembly work with these variables. You can think of the process as temporarily adding a command to the Spin interpreter.

Please join me in this thread by sharing your inline assembly snippets. Let's build a code cookbook to help each other and those new to the Propeller 2.

Tips:
-- Keep your code short, and neatly formatted; if it's easy on the eyes, it will be easier to follow.
-- Comments are always a good thing; more is better
-- Include an archive with a demo that shows off your cool code.

Comments

  • JonnyMac Posts: 9,102
    edited 2020-08-15 16:44
    Here's an excellent example of using inline assembly in the P2. Let's say you want to build a convention badge with the P2 (I do), and you'd like to populate the badge with a few WS2812b LED pixels. While you could use a standard driver (e.g., jm_rgbx_pixel.spin2), with so few LEDs you could easily update the LEDs from your mainline code and save a cog.
    con
    
      WSTR = US_001 *   50                                          ' reset timing
      WST0 = US_001 *  400 / 1000 - 6                               ' 0 bit timing
      WST1 = US_001 *  800 / 1000 - 6                               ' 1 bit timing
      WSTC = US_001 * 1250 / 1000                                   ' cycle ticks @ 800kHz
    
    
    pub ws2812b(pin, count, p_colors) | outval, rgswap, tcycle
    
    '' Update WS2812b strip on pin
    '' -- count is # of LEDs on strip
    '' -- p_colors is pointer to colors (array of longs)
    ''    * uses $RR_GG_BB_00 color format
    
      org
                    drvl      pin                                   ' make pin output and low
                    waitx     ##WSTR                                ' allow reset
                                                                     
    led_loop        rdlong    outval, p_colors                      ' get color
                    add       p_colors, #4                          ' point to next
                                                                     
                    mov       rgswap, outval                        ' make copy of outval
                    shr       rgswap, #16                           ' rgswap = 00_00_RR_GG
                    setbyte   outval, rgswap, #3                    ' outval.byte[3] <-- rgswap.byte[0] g
                    shr       rgswap, #8                            ' rgswap = 00_00_00_RR
                    setbyte   outval, rgswap, #2                    ' outval.byte[2] <-- rgswap.byte[0] r
                     
                    getct     tcycle                                ' start cycle timer
                     
                    rep       @.bitz, #24                           ' 8 bits x 3 colors
                     rol      outval, #1                    wc      ' get MSB into C
                     drvh     pin                                   ' pin on
        if_nc        waitx    ##WST0                                ' hold for bit timing
        if_c         waitx    ##WST1                                 
                     drvl     pin                                   ' pin off
                     addct1   tcycle, ##WSTC                        ' update cycle timer
                     waitct1                                        ' let cycle finish
    .bitz                                                            
                                                                     
                    djnz      count, #led_loop                      ' next pixel
      end
    
    Note that the inline assembly is also able to use constants defined in Spin.
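For anyone who wants to sanity-check the numbers on a PC, here is a rough Python sketch (not Spin2) of what the timing constants evaluate to and what the mov/shr/setbyte sequence does to a color. It assumes a 200 MHz clock, so US_001 = 200 as in the timing constants posted later in the thread; the helper name swap_rg is made up for illustration.

```python
CLK_FREQ = 200_000_000             # assumption: 200 MHz system clock
US_001 = CLK_FREQ // 1_000_000     # ticks per microsecond

WSTR = US_001 * 50                 # reset timing, in ticks
WST0 = US_001 * 400 // 1000 - 6    # 0 bit high time, in ticks
WST1 = US_001 * 800 // 1000 - 6    # 1 bit high time, in ticks
WSTC = US_001 * 1250 // 1000       # full bit cycle, in ticks


def swap_rg(color):
    """Model of the mov/shr/setbyte steps: $RR_GG_BB_00 -> $GG_RR_BB_00."""
    outval = color & 0xFFFF_FFFF
    rgswap = outval >> 16                                       # 00_00_RR_GG
    outval = (outval & 0x00FF_FFFF) | ((rgswap & 0xFF) << 24)   # byte[3] <- GG
    rgswap >>= 8                                                # 00_00_00_RR
    outval = (outval & 0xFF00_FFFF) | ((rgswap & 0xFF) << 16)   # byte[2] <- RR
    return outval


print(WSTR, WST0, WST1, WSTC)
print(hex(swap_rg(0x11223300)))
```

The swap is there because the method takes colors as $RR_GG_BB_00 while the loop shifts bits out MSB-first, so green has to end up in the top byte.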
  • Another great aspect of inline assembly with Spin2 is that it allows us to experiment with small sections of assembly in a controlled environment.

    Here's a working Spin2 method that duplicates the functionality of the PBASIC PULSIN command (with a little better resolution):
    pub pulsin(pin, state) : result | t0
    
      state &= 1                                                    ' isolate to state.bit0
      repeat until (pinr(pin) <> state)                             ' wait for idle state  
    
      repeat   
        t0 := getct()                                               ' start timing
      while (pinr(pin) <> state)                                    '  at start of pulse
    
      repeat 
        result := getct()                                           ' stop timing
      while (pinr(pin) == state)                                    '  at end of pulse
      
      result := ((result - t0) + (US_001 >> 1)) / US_001            ' round to nearest microsecond
    

    While this works, we may try our hand at PASM2 by duplicating the heart of the method in assembly. Working with a small bit of assembly in a method like this is easier than launching a cog. Here's my final assembly version.
    pub pulsin(pin, state) : result | t0
    
    '' Measures pulse on pin in microseconds
    '' -- WARNING: Blocks and does not have timeout
    '' -- pin..... the input pin receiving the pulse
    '' -- state... the target state of the pulse
    ''             * 0 for high-low-high
    ''             * 1 for low-high-low
    
      org
                    fltl      pin                                   ' make pin an input
                    testb     state, #0                     wc      ' state.0 --> C
    
    waitidle        testp     pin                           wz      ' pin level --> Z
        if_z_eq_c   jmp       #waitidle                             ' hold for idle state
    
    edge1           getct     t0                                    ' time at change to state
                    testp     pin                           wz      ' pin level --> Z          
        if_z_ne_c   jmp       #edge1                                ' wait for pin to match state
    
    edge2           getct     result                                ' time at change back to idle
                    testp     pin                           wz      ' pin level --> Z            
        if_z_eq_c   jmp       #edge2                                ' wait for pin to be idle
    
                    sub       result, t0                            ' difference is pulse width
      end 
    
      return (result + (US_001 >> 1)) / US_001                      ' round to nearest microsecond
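The rounding in that last line is easy to check off-chip. This is a small Python sketch of the same arithmetic, assuming a 200 MHz clock (US_001 = 200); the function names are hypothetical. The 32-bit mask stands in for the way the counter difference wraps on the P2.

```python
US_001 = 200   # assumption: 200 MHz clock -> 200 ticks per microsecond


def pulse_ticks(t0, t1):
    """Tick difference with 32-bit wraparound, like 'sub result, t0'."""
    return (t1 - t0) & 0xFFFF_FFFF


def ticks_to_us(ticks):
    """Add half a microsecond of ticks before dividing, so the result rounds."""
    return (ticks + (US_001 >> 1)) // US_001


print(ticks_to_us(299), ticks_to_us(300), ticks_to_us(2000))
print(ticks_to_us(pulse_ticks(0xFFFF_FF00, 0x0000_0100)))
```

Without the added US_001 >> 1, a 399-tick pulse would report as 1 microsecond instead of 2.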
    
  • JonnyMac Posts: 9,102
    edited 2020-08-15 17:00
    Here's one that I have -- for the moment -- because GETMS() dropped out of Spin2 (it's supposed to come back).
    con { timing }
    
      CLK_FREQ = 200_000_000
    
      US_001   = CLK_FREQ / 1_000_000                               ' ticks per microsecond
      MS_001   = CLK_FREQ / 1_000                                   ' ticks per millisecond
    
    
    pub get_ms() : ms | lo, hi
    
    '' Return milliseconds after reset.
    '' -- system counter is fixed; cannot be changed by user
    
      org
                    getct     hi                            wc      ' get cnt (now)
                    getct     lo                               
                    setq      hi                                    ' divide cnt by ticks/ms
                    qdiv      lo, ##MS_001                               
                    getqx     ms
      end
    
    This is the equivalent of the millis() function used in the Arduino ecosystem. I run my projects at 200MHz hence am able to use a constant for ticks/ms.
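As a rough model of what the setq/qdiv/getqx sequence computes, here is a Python sketch that treats hi and lo as the two halves of the 64-bit counter and divides by ticks per millisecond. It assumes the 200 MHz CLK_FREQ from the constants above. Since the hardware quotient is a single long, the model only applies while the result fits in 32 bits (about 49.7 days of uptime).

```python
CLK_FREQ = 200_000_000         # from the con block above
MS_001 = CLK_FREQ // 1_000     # ticks per millisecond


def get_ms(hi, lo):
    """64-bit counter value {hi, lo} divided by ticks per millisecond."""
    count = (hi << 32) | lo
    return count // MS_001     # only meaningful while the quotient < 2**32


print(get_ms(0, 200_000))      # one millisecond after reset
print(get_ms(1, 0))            # first rollover of the low long
```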
  • Rayman Posts: 14,632
    I'd start with just turning the LEDs on and off on the Eval board.

    Inline assembly makes it easy to learn PASM2.

    And, turning on LEDs is always fun...
  • JonnyMac Posts: 9,102
    edited 2020-08-15 19:56
    The waitms() and waitus() instructions are very useful, but they have an upper limit on the delay, given by:
    maximum_delay = 2 ^ 31 / clkfreq
    
    ...in seconds, which works out to ~10.737 seconds at 200MHz. That means if you do this:
      waitms(30_000)
    
    ...you will not get the delay you expect. Here's a routine called pause() that will let us do long delays if we really want to.
    Edit: This is a far more elegant solution, as suggested by @Wuerfel_21 (thank you!)
    pub pause(ms)
    
    '' Delay for ms milliseconds
    
      org
                    rep       @.msloop, ms
                     waitx    ##MS_001-4 
    .msloop
      end
    
    Note, again, that this code uses a constant for the number of ticks in a millisecond.
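The numbers in this post can be checked with a few lines of Python. This sketch assumes a 200 MHz clock, and it assumes the usual PASM2 timing of 2 clocks for the AUGD prefix that ## generates plus 2 + D clocks for WAITX, which would explain the -4 in the loop. Treat those cycle counts as my reading of the instruction timing rather than something stated in the thread.

```python
CLK_FREQ = 200_000_000
MS_001 = CLK_FREQ // 1_000

# Limit quoted above for waitms()/waitus()
max_delay_s = 2**31 / CLK_FREQ

# Cost of one pass through the rep loop: AUGD (2) + WAITX overhead (2) + wait count
ticks_per_loop = 2 + 2 + (MS_001 - 4)


def pause_seconds(ms):
    """Total time spent in the rep loop for a given ms count."""
    return ms * ticks_per_loop / CLK_FREQ


print(round(max_delay_s, 3))
print(pause_seconds(30_000))
```

Under those assumptions each pass costs exactly MS_001 ticks, so pause(30_000) gives the full 30 seconds that waitms(30_000) cannot.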

  • Wuerfel_21 Posts: 5,050
    edited 2020-08-15 17:49
    JonnyMac wrote: »
    The waitms() and waitus() instructions are very useful, but have an upper end time limit that can be defined by:
    maximum_delay = 2 ^ 31 / clkfreq
    
    ...in seconds, which works out to ~10.737 seconds. That means if you do this:
      waitms(30_000)
    
    ...you will not get the delay you expect. Here's a routine called pause() that will let us do long delays if we really want to.
    pub pause(ms) | t0
    
    '' Delay in milliseconds
    '' -- for long delays (>10s)
    ''    * use waitms() for delays < 10s
    
      org
                    getct     t0                                    ' snapshot counter
                    rep       @.msloop, ms                          ' delay
                     addct1   t0, ##MS_001 
                     waitct1
    .msloop
      end
    
    Note, again, that this code uses a constant for the number of ticks in a millisecond.

    I think you could write it slightly better as
    pub pause(ms) | t0
    
    '' Delay in milliseconds
    '' -- for long delays (>10s)
    ''    * use waitms() for delays < 10s
    
      org
                    rep       @.msloop, ms
                     waitx    ##MS_001-4
    .msloop
      end
    
  • I don't want to sound cynical, but if ever there was a problem that did not cry out for optimization in assembler, it's "waiting for more than 10 seconds"!
  • Rayman wrote: »
    I'd start with just turning the LEDs on and off on the Eval board.
    Inline assembly makes it easy to learn PASM2.
    And, turning on LEDs is always fun...

    To Ray's point, if you really wanted to duplicate the HIGH and LOW instructions from PBASIC, you could -- as he points out, twiddling an output (especially one with an LED) is a great place to start learning.
    pub high(pin)
    
    '' Make pin output and high
    
      org
                    drvh      pin
      end
    
    
    pub low(pin)
    
    '' Make pin output and low
    
      org
                    drvl      pin
      end
    
    When I attended the Skip Barber Racing School they would tell us, "Slow in is fast out." This was referring to setting up for a corner -- exit speed from the corner is far more important than entry speed. That is to say, start slowly/simply and work your way toward complex code.

  • JonnyMac Posts: 9,102
    edited 2020-08-15 19:42
    I think you could write it slightly better as...
    Thank you, Ada, you're absolutely right! As humans we sometimes hang onto our first idea and close our eyes to a more elegant solution.
  • @JonnyMac said:
    Here's one that I have -- for the moment -- because GETMS() dropped out of Spin2 (it's supposed to come back).
    con { timing }
    
      CLK_FREQ = 200_000_000
    
      US_001   = CLK_FREQ / 1_000_000                               ' ticks per microsecond
      MS_001   = CLK_FREQ / 1_000                                   ' ticks per millisecond
    
    
    pub get_ms() : ms | lo, hi
    
    '' Return milliseconds after reset.
    '' -- system counter is fixed; cannot be changed by user
    
      org
                    getct     hi                            wc      ' get cnt (now)
                    getct     lo
                    setq      hi                                    ' divide cnt by ticks/ms
                    qdiv      lo, ##MS_001
                    getqx     ms
      end

    This is the equivalent of the millis() function used in the Arduino ecosystem. I run my projects at 200MHz hence am able to use a constant for ticks/ms.

    Just doing my first steps in reading PASM2 documents, which led me to this thread and a question:
    Isn't there a chance of getting a wrong reading when doing a getct for the high long and then another getct for the low long?
    For example, getct hi reads $0000_0000 and getct lo then reads $0000_0000. This might mean that the high long was read just 2 ticks before the low long rolled over. So, the correct result would actually be $0000_0001_0000_0000.

    This means that it is dangerous to rely on the result of your division, as it might be off by (2^32 - 1) / 200MHz every once in a while.

  • @MagIO2 said:

    Just doing my first steps in reading PASM2 documents, which led me to this thread and a question:
    Isn't there a chance of getting a wrong reading when doing a getct for the high long and then another getct for the low long?
    For example, getct hi reads $0000_0000 and getct lo then reads $0000_0000. This might mean that the high long was read just 2 ticks before the low long rolled over. So, the correct result would actually be $0000_0001_0000_0000.

    This means that it is dangerous to rely on the result of your division, as it might be off by (2^32 - 1) / 200MHz every once in a while.

    That actually doesn't happen. I think the high long is incremented if the low long is about to overflow or something.

  • Chip designed the hardware to account for low word rollover. I’m on my phone, so can’t find where it states it in the documentation.

    My recollection is that a GETCT WC latches the state of both halves and stalls interrupts for one instruction to protect a following GETCT.
    There may be a fix-up as she states, but either way, the error you are concerned about is prevented by Chip’s thoughtful design. Just one more thing that makes the P2 such a great processor.
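To put a number on the rollover worry raised above, here is a Python sketch of the hypothetical torn read: the high long sampled just before the low long rolls over, the low long just after. The replies say the hardware prevents this, so it only models the case being asked about. It assumes the 200 MHz clock used throughout the thread.

```python
CLK_FREQ = 200_000_000
MS_001 = CLK_FREQ // 1_000


def to_ms(hi, lo):
    """Milliseconds from a 64-bit counter split into two longs."""
    return ((hi << 32) | lo) // MS_001


true_count = 0x0000_0001_0000_0000      # counter just after the low long rolls over
torn_hi = 0                             # hypothetical: hi sampled before the rollover
torn_lo = true_count & 0xFFFF_FFFF      # lo sampled after the rollover

good_ms = to_ms(true_count >> 32, true_count & 0xFFFF_FFFF)
torn_ms = to_ms(torn_hi, torn_lo)
print(good_ms - torn_ms)                # size of the error, in milliseconds
```

At 200 MHz a torn read like this would be off by roughly 21.5 seconds, which is why the protection described in the replies matters.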
