P2 Execute PASM COG-CODE in hRAM

pic18f2550 · 2021-05-25 11:33

Hello,
I noticed that in a SPIN project COG code can be executed directly.

I would like to know what the requirements are for using this under PASM.

How does it behave with the addresses since the COG supports only 9 bits?

cgracey · 2021-05-25 11:46

You can start a cog using COGINIT to run a PASM program.

For a cog that is running the Spin2 interpreter, registers $000..$123 are available for PASM code as well, via in-line assembly and the REGEXEC, REGLOAD, and CALL commands.

pic18f2550 · 2021-05-25 12:33

I just found the piece of CODE I was referring to in #001.

PRI calculateAccumulatorFrequency(mixingFrequency, amigaPeriod) : r | upper, lower
  'R = (65536 * 3546894) / amigaPeriod / mixingFrequency
  org
    qmul    ##65536, ##3546894
    getqx   lower
    getqy   upper
    setq    upper
    qdiv    lower, amigaPeriod
    getqx   lower
    qdiv    lower, mixingFrequency
    getqx   lower
  end
  return lower

What are the restrictions on the PASM code between "org" and "end"?

Which constructs should be avoided because they can change the timing of the code? For example, cases where the IDE has to split one instruction in the source code into several.

Rayman · 2021-05-25 12:39

@pic18f2550 I think we're calling this "inline assembly". There probably are some restrictions and it's good you asked because it doesn't appear to be documented in the Spin2 docs yet. I guess that's where it should go...

In the FlexProp C version, you have to use local variables only in the inline assembly code. You can't use things like global variables or global constants.

cgracey · 2021-05-25 12:44

The QMUL generates three instructions due to the ## and ##: AUGD, AUGS, QMUL. The other lines each generate one instruction.

The code between ORG and END must fit in $000..$123.
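As a rough tally for the routine posted above, here is the same code with each line's size noted. This is only a sketch based on the counts given in this post; the comments on what each CORDIC instruction returns are added for illustration.

```spin2
PRI calculateAccumulatorFrequency(mixingFrequency, amigaPeriod) : r | upper, lower
  ' 65536 * 3546894 = 232_449_245_184, which does not fit in 32 bits,
  ' so the product is kept as a 64-bit pair {upper:lower}.
  org
    qmul    ##65536, ##3546894      ' 3 longs: AUGD + AUGS + QMUL (two ## operands)
    getqx   lower                   ' 1 long: low 32 bits of the product
    getqy   upper                   ' 1 long: high 32 bits of the product
    setq    upper                   ' 1 long: supplies the high half of the dividend
    qdiv    lower, amigaPeriod      ' 1 long: {upper:lower} / amigaPeriod
    getqx   lower                   ' 1 long: quotient
    qdiv    lower, mixingFrequency  ' 1 long: plain 32-bit divide (no SETQ)
    getqx   lower                   ' 1 long: quotient
    ' total: 10 longs, far below the $000..$123 limit
  end
  return lower
```

Only the line with the two `##` operands grows beyond one instruction, so that is the only place where the source line count and the instruction count differ.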

Rayman · 2021-05-25 13:21

I seem to recall you don’t need a RET line at the end of the code because a “ret” is automatically added for you.

This inline assembly feature is really nice.

There are underscores on either side of the “ret” above but they don’t show up for some reason

pic18f2550 · 2021-05-25 14:36

I am currently working with the Propeller Tool.

Between org and end, $12F instructions fit before the IDE grumbles.

CON
  MP = 65535      ''<-- ?

VAR
  long array[255] 

PUB go()| a, b

    a := @array
    b := MP      ''<-- ?

  org 0
    mov    a1, a

    rdlong b1, a1

    add    a1, #1
    rdlong b2, a1

    rdlong b3, b      ''<-- And here the control seems to fail.

    jmp #exit
    'ret

a1  long 0

b1  long 0
b2  long 0
b3  long 0

exit
  end

Rayman · 2021-05-25 15:01

Try using ".exit" instead of "exit" in two places. It may be that only local labels are allowed... Could be wrong though...

I've also not seen things like "a1 long 0" in inline assembly. Might be better to have them as local Spin2 variables.
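A minimal, untested sketch of that suggestion: the scratch registers become Spin2 locals, and no exit label is needed because the code simply runs into `end`. The `table` data and the return value `r` are made up for illustration, and the address step is `#4` rather than the `#1` used in the post above, on the assumption that the next long was intended (hub addresses count bytes).

```spin2
DAT
  table   long  $33221100, $77665544, $BBAA9988   ' hypothetical test data

PUB go() : r | a, a1, b1, b2
  a := @table
  org
        mov     a1, a               ' a1 is a Spin2 local, so no "a1 long 0" is needed
        rdlong  b1, a1              ' first long of table
        add     a1, #4              ' step to the next long (4 bytes further on)
        rdlong  b2, a1              ' second long of table
  end
  r := b2                           ' locals changed in the PASM are visible here
```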

JonnyMac · 2021-05-25 19:06

I did a little reformatting and used DEBUG statements -- other than the compiler not seeming to like the use of 'exit,' everything behaves as expected. Note, too, that I traded your RAM array for a DAT array so I could pre-load it with known values.

I also called go() from another method since inline methods are designed to return to a caller. In your case, I'm not sure what would happen given it's the only code in the program. If you want to run pure assembly, you can do that -- just don't put the code into a Spin2 method.

con 

  CLK_FREQ = 200_000_000                                        ' system freq as a constant
  MS_001   = CLK_FREQ / 1_000                                   ' ticks in 1ms
  US_001   = CLK_FREQ / 1_000_000                               ' ticks in 1us

  _clkfreq = CLK_FREQ                                           ' set system clock


con

  MP = 65535 


var

' long  array[255]


dat

  array         long      $33221100, $77665544, $BBAA9988


pub main() | v

  go()

  v := long[MP]
  debug(uhex(v))

  repeat


pub go() | a, b

  a := @array
  b := MP

  debug(uhex(a), uhex(b))

  org
                mov       a1, a
                rdlong    b1, a1
                debug(uhex(a1), uhex(b1)) 

                add       a1, #1
                rdlong    b2, a1
                debug(uhex(a1), uhex(b2))  

                rdlong    b3, b
                debug(uhex(b), uhex(b3))

                jmp       #done

a1              long      0

b1              long      0
b2              long      0
b3              long      0

done            ret 
  end

Here's the DEBUG output after running.

pic18f2550 · 2021-05-26 08:47

Hello JonnyMac,
I noticed two things:

1. The IDE (Propeller Tool) does not check values against the permissible range in places where it actually could. This concerns the value "b". "rdlong b3, b" should be only a 9-bit value because in the COG code no "long b 0" was defined.
   "rdlong b3, b": no value check
   "rdlong b3, MP": value check OK
2. The code is not executed directly in the hRAM, but is loaded into the COG RAM with a "rdfast" and executed only there. I thought that it gets a segment address like the i8086 and uses this as COG RAM, except for the special registers.
   That would be maybe an option for the P3?

CON 
  CLK_FREQ = 200_000_000                                        ' 'system freq as a constant
  MS_001   = CLK_FREQ / 1_000                                   '' ticks in 1ms
  US_001   = CLK_FREQ / 1_000_000                               '' ticks in 1us
  _clkfreq = CLK_FREQ                                           '' set system clock

CON
  MP  = 65535
  MPX = 511

VAR
''  long array[1024]

DAT
  array         long      $33221100, $77665544, $BBAA9988

PUB main() | v
  go()

  v := long[MP]
  debug(uhex(v))

  repeat

PRI go()| a, b
    a := @array
    b := MP

    debug(uhex(a), uhex(b))

  org
    mov    a1, a
    rdlong b1, a1
    debug(uhex(a1), uhex(b1)) 

    add    a1, #1
    rdlong b2, a1
    debug(uhex(a1), uhex(b2))

''    rdlong b3, MP               '' IDE reports an error if > 511
    rdlong b3, b                '' IDE reports no error if > 511
    debug(uhex(b), uhex(b3))

    rdlong b3, MPX
    debug(uhex(b), uhex(b3))

    ret                         '' if no further Spin code follows
    jmp #ex                     '' needed if further Spin code follows

a1  long 0

b1  long 0
b2  long 0
b3  long 0

ex
  end

msrobots · 2021-05-26 13:51

b is defined as a local long in Spin and can be used in assembler, so it is defined and its value is allowed to be >511. Note that locals have no guaranteed initial value; only return values are initialized to 0.

And no, there are no segment registers. For code, addresses $000-$1FF execute from COG RAM, $200-$3FF execute from LUT RAM, and >=$400 executes from HUB RAM.
You need to jmp over borders (no problem for real Germans): your code cannot simply run on from COG to LUT RAM or from LUT RAM to HUB RAM. You need to jmp/call.

For data access, rd/wr long/word/byte will access $000-$3FF as HUB RAM (no code execution). There was some discussion a long time ago that code execution in HUB RAM below $400 would work with odd (not even) addresses; not sure where that went or whether it is still valid.

Enjoy!

Mike

JonnyMac · 2021-05-26 15:22

"rdlong b3, b" should be only a 9Bit value because in the COG code no "long b 0" was defined.

Yes, it was -- by the compiler. When inline code is passed to the cog, all of the parameters, return value(s), and local variable(s) are passed, too. When the routine is finished, all of those [potentially modified] values are moved back so they can be accessed by high-level Spin code that might follow.

This example shows how the inline PASM can modify variables defined in the high-level code of the method.

pub add_two(x, y) : result | sum

  debug(sdec(x), sdec(y), sdec(sum))  

  org
                mov     sum, x
                adds    sum, y
                mov     result, sum
  end

  debug(sdec(x), sdec(y), sdec(sum))

The 9-bit limitation is for literal values, and even that can be modified by using ##.

Have a look. Note that when using constants in PASM they must be prefaced by # or ## (>511).

con

  CLK_FREQ = 200_000_000                                        ' 'system freq as a constant

  MS_001   = CLK_FREQ / 1_000                                   '' ticks in 1ms
  US_001   = CLK_FREQ / 1_000_000                               '' ticks in 1us

  _clkfreq = CLK_FREQ                                           '' set system clock


con

  MP  = 65535
  MPX = 511


var

' long array[1024]


dat

  array         long      $33221100, $77665544, $BBAA9988


pub main() | v

  go()

  v := long[MP]
  debug(uhex(v))

  v := long[MPX]
  debug(uhex(v))

  repeat


pri go() | a, b

  a := @array
  b := MP

  debug(uhex(a), uhex(b))

  org
                mov       a1, a
                rdlong    b1, a1

                debug(uhex(a1), uhex(b1))

                add       a1, #1
                rdlong    b2, a1

                debug(uhex(a1), uhex(b2))

                rdlong    b3, b

                debug(uhex(b), uhex(b3))             

                rdlong    b3, ##MP

                debug(uhex(b3))             

                rdlong    b3, ##MPX

                debug(uhex(b3))             

                jmp       #done

a1              long      0

b1              long      0
b2              long      0
b3              long      0

done
  end
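The operand forms that came up in this exchange can be put side by side. This is an untested sketch; `operand_forms` and `x` are made-up names, and the comments reflect the behaviour described in this thread.

```spin2
CON
  MP  = 65535
  MPX = 511

PRI operand_forms() | b, x
  b := MP
  org
        rdlong  x, b        ' S names register b; the hub address is the 32-bit value held in b
        rdlong  x, MPX      ' no #: 511 is taken as a register number, the hub address comes from that register
        rdlong  x, ##MP     ' ##: 65535 itself is the hub address (costs one extra instruction)
        mov     x, #MPX     ' #: 9-bit literal, 511 is the largest value that fits
        mov     x, ##MP     ' ##: 32-bit literal
  end
```

Writing `rdlong x, MP` without any prefix is the case the IDE rejects, because 65535 is not a valid register number.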

That would be maybe an option for the P3?

Don't hold your breath -- the P2 was 12 years in development because Chip accommodated nearly every request thrown at him (this is not a sustainable development process). The problem with those of us with experience is that we bring our biases. Give the P2 a try for what it is, not what you wish it was. With your experience I'm sure you'll be able to do really neat things that will benefit your clients and the Propeller community.

evanh · 2021-05-27 05:19

CogRAM cannot be compared to segmentation, or any other mapping tricks. There's no way for hubRAM to be accessed with low latency like cogRAM is. Even if caching was thrown at it you still don't get guarantees of deterministic read latencies.

EDIT: I guess the term "inline assembly" slightly misrepresents what is actually happening, since it is only inline in the source. The byte-coded Pnut/Proptool output is more disjointed than that.

On the other hand, I think the default for Flexspin, since it compiles to native machine code, does produce truly inlined code as compiled hubexec. It can be given directives to use lutRAM for inline assembly routines if desired.

pic18f2550 · 2021-05-27 10:20

Okay, so it's not that simple.
So I'll stay with my "rdfast" method, with a loading routine to switch to other drivers.
My goal was to save the clocks spent loading the new code by accessing the hRAM directly.

JonnyMac, don't panic, I don't want to force more work on Chip.
I can imagine how such a wave of wishes and ideas rolls over you.
If you look at the P2 like this, you can only take your hat off and say thank you.

Rayman · 2021-05-27 12:36

I don't think Flexspin uses LUT RAM any more... Was just reading the attached and says now goes to COG RAM:
COG RAM from $00 to $ff is used for FCACHE

@pic18f2550 You might want to read the "Restrictions in inline assembly" in the attached. I'm guessing that most of this applies to Spin2 as well...

Oops. This was old version.

pic18f2550 · 2021-05-27 13:37

I have attached the loader.

evanh · 2021-05-28 05:14

Ah, I don't think RDFAST is the instruction you're looking for ....

There's two general fast solutions for that. One is a SETQ+RDLONG combo for data block copy from hubRAM to cogRAM, which can then be branched into from hubexec. The other being COGINIT for relaunching the same cog with its self copy feature.
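The SETQ+RDLONG block transfer can be sketched on data rather than on code, which keeps the example short. This is an untested illustration with made-up names (`table`, `block_read`, `v0`..`v2`); it relies on consecutive Spin2 locals occupying consecutive cog registers.

```spin2
DAT
  table   long  $33221100, $77665544, $BBAA9988   ' hypothetical source block in hub RAM

PUB block_read() : r | p, v0, v1, v2
  p := @table
  org
        setq    #3-1                ' Q = number of longs to transfer, minus 1
        rdlong  v0, p               ' with SETQ, fills v0, v1, v2 from three consecutive hub longs
  end
  r := v2
```

Loading a routine works the same way, except that the destination is the register range the code was assembled for and a jump or call into that range follows the transfer.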

evanh · 2021-05-28 05:16

@Rayman said:
I don't think Flexspin uses LUT RAM any more... Was just reading the attached and says now goes to COG RAM:
COG RAM from $00 to $ff is used for FCACHE

Cool, thanks. I've used Spin so little I've got behind on latest changes.

evanh · 2021-05-28 05:25

I see this in that "general.pdf":

Most of COG RAM is used by the compiler, except that $0-$1f and $1e0-$1ef are left free for application use. The second half of LUT is used for FCACHE; the first half is used by any functions placed into LUT.

So lutRAM still gets used as before. What I think is new is functions can individually be assigned to cogRAM and lutRAM and, relatedly, a lot more of cogRAM is available now ... so that "Most of COG RAM ..." assertion is also out of date.

evanh · 2021-05-28 05:35

And flexspin's inline assembly blurb:

Inline assembly
All of the languages allow inline assembly within functions. There are 3 different forms of inline assembly:
(1) Plain inline assembly. This is generated by asm/endasm in Spin and BASIC, and from an __asm { } block in C. These blocks run in hubexec mode (for P2) or LMM (for P1) and are optimized by the optimizer.
(2) HUB non-optimized assembly. This is generated by asm const/end asm in BASIC and __asm const{} in C; it is not currently available in Spin. Like plain assembly this runs from HUB, but is not subject to optimization.
(3) FCACHEd non-optimized assembly. This is generated by org/end in Spin, asm cpu/end in BASIC, and __asm volatile{} in C. This is not subject to optimization, and before execution it is loaded into the FCACHE area, so its timing is based on running from internal memory rather than HUB.

evanh · 2021-05-28 09:22

@evanh said:
Ah, I don't think RDFAST is the instruction you're looking for ....

There's two general fast solutions for that. One is a SETQ+RDLONG combo for data block copy from hubRAM to cogRAM, which can then be branched into from hubexec. The other being COGINIT for relaunching the same cog with its self copy feature.

Correction: if only a small routine is being copied with SETQ+RDLONG, then the branch can be from cogexec too.

Rayman · 2021-05-28 13:19

@evanh Could you have an old version?

Here's what I have:

Most of COG RAM is used by the compiler, except that $1e0-$1ef are left free
for application use. COG RAM from $00 to $ff is used for FCACHE, and so
when you are sure no FCACHE is in use you may use this for scratch.

This is ver. 5.4.3 from 08May21
Reading further looks like you can also force things into the first half of LUT RAM.

evanh · 2021-05-28 17:24

I'm just reading the PDF you provided. I've hardly written any Spin code.

Rayman · 2021-05-28 17:30

Ok, I posted an old version there... From way back in January... Sorry about that. Just deleted above.

Here's the new version.

evanh · 2021-05-28 17:49

Thanks, and good. I had thought about the change in memory uses when you brought it up. That leaves lutRAM unused by default. Leaves it free for application purposes, including streamer ops.

EDIT: Err, or not. Second half is now vaguely "used for internal purposes":

LUT
The first part of LUT memory (from $200 to $300) is used for any functions explicitly placed into LUT. The LUT memory from $300 to $400 (the second half of LUT) is used for internal purposes.

Rayman · 2021-05-28 18:24

Looks like the different languages all have ways to force things into LUT...

Rayman · 2021-05-28 18:29

Looks like there's a new "IN-LINE PASM CODE" section in the Spin2 docs:
https://docs.google.com/document/d/16qVkmA6Co5fUNKJHF6pBfGfDupuRwDtf-wyieh_fbqw

pic18f2550 · 2021-05-28 18:45

This section would be interesting for me, but unfortunately it is not completely readable.

Can anything be done about it?

Maciek · 2021-05-28 18:55

Yes, just download it as a pdf. It displays then fine. Maybe there are other ways around as well.
