P2 Execute PASM COG-CODE in hRAM — Parallax Forums

P2 Execute PASM COG-CODE in hRAM

Hello,
I noticed that in a SPIN project COG code can be executed directly.

I would like to know what the requirements are for using this under PASM.

How does it behave with the addresses since the COG supports only 9 bits?

Comments

  • cgracey Posts: 14,151

    You can start a cog using COGINIT to run a PASM program.

    For a cog that is running the Spin2 interpreter, registers $000..$123 are also available for PASM code, via in-line assembly and the REGEXEC, REGLOAD, and CALL commands.
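
    A minimal sketch of the COGINIT route, assuming a hypothetical LED on pin 56 and an arbitrary delay (neither comes from this thread). COGINIT and COGEXEC_NEW are the Spin2 built-ins; everything else is illustrative:

    ```spin2
    CON
      LED_PIN = 56                            ' assumption: any free output pin will do

    PUB main() | id
      id := coginit(COGEXEC_NEW, @blink, 0)   ' start a free cog running the PASM at "blink"
      repeat                                  ' keep the Spin2 cog alive

    DAT
                  org
    blink         drvnot    #LED_PIN          ' toggle the pin
                  waitx     ##10_000_000      ' wait roughly 10 million clocks
                  jmp       #blink
    ```

    The PASM here gets a cog to itself, so it is not confined to registers $000..$123 the way code sharing a cog with the Spin2 interpreter is.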

  • I just found the piece of CODE I was referring to in #001.

    PRI calculateAccumulatorFrequency(mixingFrequency, amigaPeriod) : r | upper, lower
      'R = (65536 * 3546894) / amigaPeriod / mixingFrequency
      org
        qmul    ##65536, ##3546894
        getqx   lower
        getqy   upper
        setq    upper
        qdiv    lower, amigaPeriod
        getqx   lower
        qdiv    lower, mixingFrequency
        getqx   lower
      end
      return lower
    

    What are the restrictions on the PASM code between "org" and "end"?

    Which constructs should be avoided because they can change the timing of the code? For example, cases where the IDE has to split one instruction in the source code into several.
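
    As a cross-check on what the block above computes, the same formula can probably be written in plain Spin2. This is only a sketch, not the original author's code: the method name is made up, and it assumes that the built-in MULDIV64 and the unsigned divide operator `+/` behave like the unsigned QMUL/QDIV steps.

    ```spin2
    PRI calcFreqSpin(mixingFrequency, amigaPeriod) : r
      ' 64-bit product 65536 * 3546894, divided by amigaPeriod (unsigned)
      r := muldiv64(65536, 3546894, amigaPeriod)
      ' second step: plain 32-bit unsigned division
      r := r +/ mixingFrequency
    ```

    The inline version trades this readability for control over exactly which instructions run.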

  • Rayman Posts: 14,640
    edited 2021-05-25 12:40

    @pic18f2550 I think we're calling this "inline assembly". There probably are some restrictions and it's good you asked because it doesn't appear to be documented in the Spin2 docs yet. I guess that's where it should go...

    In the FlexProp C version, you have to use local variables only in the inline assembly code. You can't use things like global variables or global constants.

  • cgracey Posts: 14,151

    The QMUL generates three instructions due to the ## and ##: AUGD, AUGS, QMUL. The other lines each generate one instruction.

    The code between ORG and END must fit in $000..$123.
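
    To make the instruction count concrete, here is a small sketch (the method and constant names are hypothetical). Each `##` operand costs one extra long, AUGS for a source or AUGD for a destination. Loading the constant into a Spin2 local before the block keeps the inline instruction to a single long:

    ```spin2
    CON
      PAL_CLOCK = 3_546_894                     ' hypothetical name for the constant used above

    PRI scaleTwoWays(x) : viaAug, viaLocal | k
      k := PAL_CLOCK                            ' load the 32-bit constant in Spin2, before the inline block
      org
                    qmul      x, ##PAL_CLOCK    ' ## on the source: AUGS + QMUL, two longs
                    getqx     viaAug
                    qmul      x, k              ' register operand: QMUL alone, one long
                    getqx     viaLocal
      end
    ```

    Both results should be identical; only the number of longs inside ORG..END differs.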

  • Rayman Posts: 14,640
    edited 2021-05-25 13:23

    I seem to recall you don’t need a RET line at the end of the code because a “_ret_” is automatically added for you.

    This inline assembly feature is really nice.

  • I am currently working with the Propeller Tool.

    Between org and end, $12F instructions fit before the IDE grumbles.

    CON
      MP = 65535      ''<-- ?
    
    VAR
      long array[255] 
    
    PUB go()| a, b
    
        a := @array
        b := MP      ''<-- ?
    
      org 0
        mov    a1, a
    
        rdlong b1, a1
    
        add    a1, #1
        rdlong b2, a1
    
        rdlong b3, b      ''<-- And here the control seems to fail.
    
        jmp #exit
        'ret
    
    a1  long 0
    
    b1  long 0
    b2  long 0
    b3  long 0
    
    exit
      end
    
  • Rayman Posts: 14,640

    Try using ".exit" instead of "exit" in two places. It may be that only local labels are allowed... Could be wrong though...

    I've also not seen things like "a1 long 0" in inline assembly. Might be better to have them as local Spin2 variables.
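
    A sketch of that second suggestion, with the scratch longs moved into Spin2 locals so that no labels or `long` declarations are needed inside the block. The array contents are made-up test values, and the `add` uses #4 rather than the original #1 on the assumption that the next long was intended (hub addresses count bytes, so #1 would read a long starting one byte in):

    ```spin2
    CON
      MP = 65535

    DAT
      array         long      $33221100, $77665544, $BBAA9988   ' made-up test values

    PUB go() | a, b, b1, b2, b3
      a := @array
      b := MP

      org
                    rdlong    b1, a           ' first long of the array
                    add       a, #4           ' step to the next long
                    rdlong    b2, a
                    rdlong    b3, b           ' b is a register holding 65535, used as the hub address
      end

      debug(uhex(b1), uhex(b2), uhex(b3))
    ```

    The block simply falls through to `end`, so no jump to an exit label is required.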

  • JonnyMac Posts: 9,102
    edited 2021-05-25 19:14

    I did a little reformatting and used DEBUG statements -- other than the compiler not seeming to like the use of 'exit,' everything behaves as expected. Note, too, that I traded your RAM array for a DAT array so I could pre-load it with known values.

    I also called go() from another method since inline methods are designed to return to a caller. In your case, I'm not sure what would happen given it's the only code in the program. If you want to run pure assembly, you can do that -- just don't put the code into a Spin2 method.

    con 
    
      CLK_FREQ = 200_000_000                                        ' system freq as a constant
      MS_001   = CLK_FREQ / 1_000                                   ' ticks in 1ms
      US_001   = CLK_FREQ / 1_000_000                               ' ticks in 1us
    
      _clkfreq = CLK_FREQ                                           ' set system clock
    
    
    con
    
      MP = 65535 
    
    
    var
    
    ' long  array[255]
    
    
    dat
    
      array         long      $33221100, $77665544, $BBAA9988
    
    
    pub main() | v
    
      go()
    
      v := long[MP]
      debug(uhex(v))
    
      repeat
    
    
    pub go() | a, b
    
      a := @array
      b := MP
    
      debug(uhex(a), uhex(b))
    
      org
                    mov       a1, a
                    rdlong    b1, a1
                    debug(uhex(a1), uhex(b1)) 
    
                    add       a1, #1
                    rdlong    b2, a1
                    debug(uhex(a1), uhex(b2))  
    
                    rdlong    b3, b
                    debug(uhex(b), uhex(b3))
    
                    jmp       #done
    
    a1              long      0
    
    b1              long      0
    b2              long      0
    b3              long      0
    
    done            ret 
      end
    

    Here's the DEBUG output after running.

  • pic18f2550 Posts: 400
    edited 2021-05-26 09:11

    Hello JonnyMac,
    I noticed two things:

    1. The IDE (Propeller Tool) does not range-check values in places where it actually could.
      This concerns the value "b".
      "rdlong b3, b" should be only a 9-bit value because in the COG code no "long b 0" was defined.
      "rdlong b3, b": no value check
      "rdlong b3, MP": value check OK

    2. The code is not executed directly in the hRAM; it is loaded into the COG RAM with a "rdfast" and executed only there.
      I thought that it would get a segment address, like on the i8086, and use that as COG RAM except for the special registers.
      That would be maybe an option for the P3? :)

    CON 
      CLK_FREQ = 200_000_000                                        '' system freq as a constant
      MS_001   = CLK_FREQ / 1_000                                   '' ticks in 1ms
      US_001   = CLK_FREQ / 1_000_000                               '' ticks in 1us
      _clkfreq = CLK_FREQ                                           '' set system clock
    
    CON
      MP  = 65535
      MPX = 511
    
    VAR
    ''  long array[1024]
    
    DAT
      array         long      $33221100, $77665544, $BBAA9988
    
    PUB main() | v
      go()
    
      v := long[MP]
      debug(uhex(v))
    
      repeat
    
    PRI go()| a, b
        a := @array
        b := MP
    
        debug(uhex(a), uhex(b))
    
      org
        mov    a1, a
        rdlong b1, a1
        debug(uhex(a1), uhex(b1)) 
    
        add    a1, #1
        rdlong b2, a1
        debug(uhex(a1), uhex(b2))
    
    ''    rdlong b3, MP               '' IDE reports an error if > 511
        rdlong b3, b                '' IDE reports no error if > 511
        debug(uhex(b), uhex(b3))
    
        rdlong b3, MPX
        debug(uhex(b), uhex(b3))
    
        ret                         '' if no further Spin code follows
        jmp #ex                     '' needed if further Spin code follows
    
    a1  long 0
    
    b1  long 0
    b2  long 0
    b3  long 0
    
    ex
      end
    
    
  • msrobots Posts: 3,709
    edited 2021-05-26 14:09

    b is defined as a local long in Spin and can be used in the assembler, so it is defined and allowed to hold values above 511. Locals have no guaranteed initial value, though; only return values are initialized to 0.

    And no, there are no segment registers. For code, addresses $000-$1FF execute from COG RAM, $200-$3FF execute from LUT RAM, and >=$400 executes from HUB RAM.
    You need to jmp over the borders (no problem for real Germans): your code cannot simply run on from COG to LUT RAM or from LUT RAM to HUB RAM. You need a jmp/call.

    For data access, rd/wr long/word/byte will access $000-$3FF as HUB RAM (no code execution). There was some discussion a long time ago that code execution in HUB RAM below $400 would work with odd (not even) addresses; not sure where that went or whether it is still valid.
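
    The execution map described above can be restated as a small Spin2 helper. This is purely illustrative: the method and the REGION_ constants are hypothetical names, and the ranges are taken from this post, not from the hardware documentation.

    ```spin2
    CON
      REGION_COG = 0        ' hypothetical labels for the three execution regions
      REGION_LUT = 1
      REGION_HUB = 2

    PUB execRegion(addr) : region
      case addr
        $000..$1FF : region := REGION_COG   ' cog-exec
        $200..$3FF : region := REGION_LUT   ' LUT-exec
        other      : region := REGION_HUB   ' hub-exec
    ```

    This only applies to code addresses; as noted, a data access such as RDLONG treats the same low addresses as hub RAM.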

    Enjoy!

    Mike

  • JonnyMac Posts: 9,102
    edited 2021-05-26 15:42

    "rdlong b3, b" should be only a 9Bit value because in the COG code no "long b 0" was defined.

    Yes, it was -- by the compiler. When inline code is passed to the cog, all of the parameters, return value(s), and local variable(s) are passed, too. When the routine is finished, all of those [potentially modified] values are moved back so they can be accessed by high-level Spin code that might follow.

    This example shows how the inline PASM can modify variables defined in the high-level code of the method.

    pub add_two(x, y) : result | sum
    
      debug(sdec(x), sdec(y), sdec(sum))  
    
      org
                    mov     sum, x
                    adds    sum, y
                    mov     result, sum
      end
    
      debug(sdec(x), sdec(y), sdec(sum))
    

    The 9-bit limitation is for literal values, and even that can be modified by using ##.

    Have a look. Note that when using constants in PASM they must be prefaced by # or ## (>511).

    con
    
      CLK_FREQ = 200_000_000                                        '' system freq as a constant
    
      MS_001   = CLK_FREQ / 1_000                                   '' ticks in 1ms
      US_001   = CLK_FREQ / 1_000_000                               '' ticks in 1us
    
      _clkfreq = CLK_FREQ                                           '' set system clock
    
    
    con
    
      MP  = 65535
      MPX = 511
    
    
    var
    
    ' long array[1024]
    
    
    dat
    
      array         long      $33221100, $77665544, $BBAA9988
    
    
    pub main() | v
    
      go()
    
      v := long[MP]
      debug(uhex(v))
    
      v := long[MPX]
      debug(uhex(v))
    
      repeat
    
    
    pri go() | a, b
    
      a := @array
      b := MP
    
      debug(uhex(a), uhex(b))
    
      org
                    mov       a1, a
                    rdlong    b1, a1
    
                    debug(uhex(a1), uhex(b1))
    
                    add       a1, #1
                    rdlong    b2, a1
    
                    debug(uhex(a1), uhex(b2))
    
                    rdlong    b3, b
    
                    debug(uhex(b), uhex(b3))             
    
                    rdlong    b3, ##MP
    
                    debug(uhex(b3))             
    
                    rdlong    b3, ##MPX
    
                    debug(uhex(b3))             
    
                    jmp       #done
    
    a1              long      0
    
    b1              long      0
    b2              long      0
    b3              long      0
    
    done
      end
    

    That would be maybe an option for the P3?

    Don't hold your breath -- the P2 was 12 years in development because Chip accommodated nearly every request thrown at him (this is not a sustainable development process). The problem with those of us with experience is that we bring our biases. Give the P2 a try for what it is, not what you wish it was. With your experience I'm sure you'll be able to do really neat things that will benefit your clients and the Propeller community.

  • evanh Posts: 15,914
    edited 2021-05-27 05:32

    CogRAM cannot be compared to segmentation, or any other mapping tricks. There's no way for hubRAM to be accessed with low latency like cogRAM is. Even if caching was thrown at it you still don't get guarantees of deterministic read latencies.

    EDIT: I guess the term "inline assembly" slightly misrepresents what is actually happening, since it is only inline in the source. The byte-coded Pnut/Proptool output is more disjointed than that.

    On the other hand, I think the default for Flexspin, since it compiles to native machine code, does produce truly inlined code as compiled hubexec. It can be given directives to use lutRAM for inline assembly routines if desired.

  • Okay, so it's not that simple.
    So I'll stay with my "rdfast" method, with a loading routine to switch to other drivers.
    My goal was to save the clocks spent loading the new code by accessing the hRAM directly.

    JonnyMac, don't panic, I don't want to force more work on Chip.
    I can imagine how such a wave of wishes and ideas rolls over you. :)
    If you look at the P2 like this, you can only take your hat off and say thank you.

  • Rayman Posts: 14,640
    edited 2021-05-28 17:27

    I don't think Flexspin uses LUT RAM any more... Was just reading the attached and says now goes to COG RAM:
    COG RAM from $00 to $ff is used for FCACHE

    @pic18f2550 You might want to read the "Restrictions in inline assembly" in the attached. I'm guessing that most of this applies to Spin2 as well...

    Oops. This was old version.

  • I have attached the loader.

  • evanh Posts: 15,914

    Ah, I don't think RDFAST is the instruction you're looking for .... ;)

    There's two general fast solutions for that. One is a SETQ+RDLONG combo for data block copy from hubRAM to cogRAM, which can then be branched into from hubexec. The other being COGINIT for relaunching the same cog with its self copy feature.

  • evanh Posts: 15,914

    @Rayman said:
    I don't think Flexspin uses LUT RAM any more... Was just reading the attached and says now goes to COG RAM:
    COG RAM from $00 to $ff is used for FCACHE

    Cool, thanks. I've used Spin so little I've got behind on latest changes.

  • evanh Posts: 15,914
    edited 2021-05-28 05:29

    I see this in that "general.pdf":

    Most of COG RAM is used by the compiler, except that $0-$1f and $1e0-$1ef are left free for application use. The second half of LUT is used for FCACHE; the first half is used by any functions placed into LUT.

    So lutRAM still gets used as before. What I think is new is functions can individually be assigned to cogRAM and lutRAM and, relatedly, a lot more of cogRAM is available now ... so that "Most of COG RAM ..." assertion is also out of date.

  • evanh Posts: 15,914

    And flexspin's inline assembly blurb:

    Inline assembly
    All of the languages allow inline assembly within functions. There are 3 different forms of inline assembly:
    (1) Plain inline assembly. This is generated by asm/endasm in Spin and BASIC, and from an __asm { } block in C. These blocks run in hubexec mode (for P2) or LMM (for P1) and are optimized by the optimizer.
    (2) HUB non-optimized assembly. This is generated by asm const/end asm in BASIC and __asm const{} in C; it is not currently available in Spin. Like plain assembly this runs from HUB, but is not subject to optimization.
    (3) FCACHEd non-optimized assembly. This is generated by org/end in Spin, asm cpu/end in BASIC, and __asm volatile{} in C. This is not subject to optimization, and before execution it is loaded into the FCACHE area, so its timing is based on running from internal memory rather than HUB.

  • evanh Posts: 15,914

    @evanh said:
    Ah, I don't think RDFAST is the instruction you're looking for .... ;)

    There's two general fast solutions for that. One is a SETQ+RDLONG combo for data block copy from hubRAM to cogRAM, which can then be branched into from hubexec. The other being COGINIT for relaunching the same cog with its self copy feature.

    Correction: if only a small routine is being copied with SETQ+RDLONG, then the branch can be from cogexec too.
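
    A rough sketch of the SETQ+RDLONG route with a cogexec branch, offered as an illustration and not as a tested loader. Assumptions: the program is built as pure PASM (no Spin2 methods), so that `@routine` yields the absolute hub address; the load address $100 and the LED on pin 56 are arbitrary choices.

    ```spin2
    CON
      LED_PIN = 56                                    ' assumption: any free output pin

    DAT
                  org       0
    main          setq      #routine_end - routine - 1    ' block length in longs, minus one
                  rdlong    routine, ##@routine           ' copy the hub image into cog registers from "routine" on
                  call      #routine                      ' cogexec branch into the freshly copied code
    .hold         jmp       #.hold                        ' nothing else to do in this sketch

                  org       $100                          ' assembled to run from register $100
    routine       drvnot    #LED_PIN                      ' toggle the pin
                  ret
    routine_end
    ```

    Switching drivers would then amount to repeating the SETQ+RDLONG pair with a different hub image assembled for the same register range.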

  • Rayman Posts: 14,640
    edited 2021-05-28 14:02

    @evanh Could you have an old version?

    Here's what I have:

    Most of COG RAM is used by the compiler, except that $1e0-$1ef are left free
    for application use. COG RAM from $00 to $ff is used for FCACHE, and so
    when you are sure no FCACHE is in use you may use this for scratch.
    

    This is ver. 5.4.3 from 08May21
    Reading further looks like you can also force things into the first half of LUT RAM.

  • evanh Posts: 15,914
    edited 2021-05-28 17:26

    I'm just reading the PDF you provided. I've hardly written any Spin code.

  • Rayman Posts: 14,640
    edited 2021-05-28 17:32

    Ok, I posted an old version there... From way back in January... Sorry about that. Just deleted above.

    Here's the new version.

  • evanh Posts: 15,914
    edited 2021-05-28 17:52

    Thanks, and good. I had thought about the change in memory uses when you brought it up. That leaves lutRAM unused by default. Leaves it free for application purposes, including streamer ops.

    EDIT: Err, or not. Second half is now vaguely "used for internal purposes":

    LUT
    The first part of LUT memory (from $200 to $300) is used for any functions explicitly placed into LUT. The LUT memory from $300 to $400 (the second half of LUT) is used for internal purposes.

  • Rayman Posts: 14,640

    Looks like the different languages all have ways to force things into LUT...

  • Rayman Posts: 14,640

    Looks like there's a new "IN-LINE PASM CODE" section in the Spin2 docs:
    https://docs.google.com/document/d/16qVkmA6Co5fUNKJHF6pBfGfDupuRwDtf-wyieh_fbqw

  • pic18f2550 Posts: 400
    edited 2021-05-29 07:50

    This section would be interesting for me, but unfortunately it is not completely readable.

    Can anything be done about it?

  • Yes, just download it as a PDF. It then displays fine. Maybe there are other ways around it as well.
