Understanding the performance costs of hub execution.
Hi all. I'm new to the P2 and could use some help understanding this from a code execution time and predictability standpoint.
For my P1 project, I needed a relatively small amount of code with very low latency (<400 ns in many areas). I followed the "typical" model of writing PASM code for cog RAM with tight loops and pin wait code, and only used Spin1 to provide some simple start/stop/configure cog semantics and hub "mailbox" message passing. I'm sure we've all seen this model for coding on both the P1 and P2.
For my P2 project, I have more head room on the latency side, and my application is much more complicated. This makes me want to write much more code in Spin2 and use PASM2 much more sparingly. I'd like to try to have cog execution/RAM only used for running inline PASM2 code that is put in as an optimization, and probably also for code that is interrupt triggered (though, I don't really understand P2 interrupts and events very well, yet). I'll use smart pins too, and would prefer to manage them from Spin2.
I think this means that my code's critical sections will incur only the performance cost of moving them to cog RAM each time they execute, or the performance cost of hub execution if the compiler doesn't move them. Is this correct? How do I choose between these two options, and is there a straightforward way to measure this cost and have it be predictable? Also, in cases where this performance cost is not tenable, is an event/interrupt handler that permanently resides in cog RAM the right solution? How do I set those up, and how do I measure the performance overhead for that code?
Really, my goal here is to make the best use I can of Spin2 so that my code is simpler and easier to maintain in all the non-critical places. How best to do that, while cleanly integrating PASM2 for critical sections and understanding my performance trade-offs and implications: that is the crux of my question. I'm also very happy to be pointed to existing, relevant parts of the P2 documentation for the answer.
Best.
Comments
Hub execution in a P2 is the same speed as cog execution, except: branches are slower because the FIFO has to reload, hub RAM reads and writes can take longer while the FIFO is busy, and fast-skipping with SKIPF can't be used.
When using Flexprop you can insert PASM code in two ways: either have the compiler move the asm code into the cog and execute it from there, or inline it into the hub RAM program.
As flexprop can also compile BASIC and C, you could consider using one of those languages as well.
The answer depends on whether you're using the official Spin2 compiler or flexspin.
The normal ORG/END block in Spin2 will always load the code into cog RAM, at the cost of 1 clock cycle per instruction every time the block runs. Official spin will also add 2(?) cycles per local variable that's in scope.
Flexspin has ASM/ENDASM which puts the code into hub RAM (though if you have a loop in there the optimizer might end up caching it to cog ram anyways). I think(tm) you can do ORGH/END for optimizer-bypassing hubexec asm, too.
There's some free memory in the cog that can be used for such things, but it's kinda clunky to do so and the interrupt latency will be very wobbly depending on what the main thread is doing (math, inline ASM loading and many other things will stall IRQs).
If you want to do something low-latency or intensive, you'll still want to run it in a separate cog, but for doing a little bit of processing you can do inline ASM now. Though with flexspin the compiled Spin code is often close to optimal to begin with.
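To make that concrete, here's a minimal sketch of "a little bit of processing" done with inline assembly, written in FlexProp's C dialect (my own illustration, not code from this thread; the function name is made up, but ONES is a real PASM2 instruction, and FlexC inline asm can refer to local variables by name):

```c
#include <stdio.h>

// Count the '1' bits in a word with a single PASM2 instruction.
// The inline asm reads the C local x and writes the C local n directly.
static int popcount32(unsigned x)
{
    int n;
    __asm {
        ones n, x        // PASM2 ONES: n := number of set bits in x
    }
    return n;
}

int main()
{
    printf("bits set in 0xF0F0F0F0: %d\n", popcount32(0xF0F0F0F0));
    return 0;
}
```

Depending on the optimization and fcache settings, the compiler may copy such a function into cog RAM or leave it executing from hub RAM, which is exactly the trade-off being discussed here.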
The system time register is fetched with the GETCT instruction. There is a Spin2 equivalent, GETCT(), that can be used for precisely measuring code execution intervals.
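A minimal sketch of that measurement pattern, shown here in FlexProp C since that's what the benchmarks later in this thread use (in Spin2 the same idea is a getct() call before and after the code under test; the function being measured and the 200 MHz clock are my own placeholder assumptions):

```c
#include <stdio.h>
#include <propeller2.h>

// Placeholder for whatever code is being measured.
static void critical_section(void)
{
    for (volatile int i = 0; i < 100; i++)
        ;
}

int main()
{
    unsigned t0 = _cnt();              // read the system counter (GETCT)
    critical_section();
    unsigned ticks = _cnt() - t0;      // elapsed sysclock ticks

    // At an assumed 200 MHz sysclock, 200 ticks = 1 microsecond.
    printf("%u ticks (%u us at 200 MHz)\n", ticks, ticks / 200);
    return 0;
}
```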
Pnut/Proptool supports embedding an ISR in cogRAM alongside compiled code.
Flex doesn't support this from any language.
Pnut/Proptool always pseudo-inlines inline pasm, as in it always copies such code into cogexec space upon each entry.
With FlexC/FlexBasic it's optional; the compiler will optimise accordingly by default, but there are directives to force true hubexec inlining, for example. The different behaviour is a product of the different type of compile: Flex produces natively compiled code, while Pnut/Proptool compiles to bytecode. That said, FlexSpin, as opposed to FlexC, doesn't have any such directives I don't think; it just mimics Pnut's cogexec handling of inline pasm. Err, I keep forgetting about ASM/ENDASM.
Thanks for that summary. Also, pardon the silly question, but is flexprop just a GUI for running the flexspin compiler that I'm a little familiar with?
Thanks. super useful pro tip!
Aye, thanks for pointing that out. Those are tool complexities I had not even considered yet, and it brings up the question of how to mix together output from multiple tools.
Optimizer caching scares me, though I'm not sure and maybe it's a great solution. Support for ORGH/END seems like it would support self-modifying code, and that is kind of exciting to me.
The easy way is to just use Flex. It'll compile it all together. The Spin2 sources integrate into C/Basic and vice versa.
There are directives for controlling what Pasm code gets left alone. And Spin2-based ORG/END is unoptimised.
ASM/ENDASM just plomps your inline ASM into the same IR processing that the high-level generated code goes through (as do the other inline ASM methods, but they flag it so as to not participate in most opt passes). Useful for when you just need some sequence of code without regard for its particular timing or order. Which is most of the time when not doing I/O.
FWIW: I just went through this cog-execution thing:
https://forums.parallax.com/discussion/174078/flexbasic-for-cog#latest
Craig
I rarely need self-modifying code with P2 thanks to the new ALTx type instructions...
Yes, you have to avoid skipf if you are using interrupts, but other than that AFAIK interrupts work ok with Hub execution.
Perhaps this is a Flexprop limitation? Can anyone clarify?
Ross.
The Flex suite is not written as interrupt safe. Mixing an ISR with compiled code on the same cog will cause data corruption in the compiled code.
Chip has carefully ensured Pnut/Proptool's VM is interrupt safe with many inserted cases of REP #1 shielding those sensitive code areas.
As Ross has observed, an ISR may happily exist in hubRAM.
That makes sense. Thanks.
Actually, my earlier post was slightly misleading - yes, SKIPF can't be used in Hub mode, but SKIP can, and it is ok to use both SKIP and SKIPF in conjunction with interrupts - you just can't use them inside an ISR. Since Catalina allows an arbitrary C function to be used as an ISR, this means I have to avoid using both SKIP and SKIPF pretty much altogether except in situations where I know there are no interrupts in use. In particular I can't use them in any library code, because an ISR can call any library function.
As far as I understand, any hub RAM data access will disturb the egg beater timing too. The egg beater is only optimal if (address mod 8) is consecutive.
Not quite sure what you mean; however, hub RAM reads and writes in hub exec mode can be 10 cycles slower than in cog/LUT exec mode. This is due to the FIFO reloading when six longs are free, i.e. every six instructions after a branch. Due to pipelining, reloading might start after five instructions (I have not tested it).
That's true for the Prop1. Prop2 is more complicated, and it can be as low as 3 sysclock ticks for a hubRAM write.
... or as high as 20 cycles in hub exec mode, compared to 10 max for cog/LUT exec.
20 is a slightly strange number and 18 seems more obvious at first glance. I'll try to reverse-engineer 20 to understand what is going on.
I wanted to know the speed of, and the penalty for, hub variables versus cog variables.
In this case both versions were compiled with FlexProp, standard optimization, using fcache, so the code is executed from cog RAM.
The result is about 6 times faster in this case (!) with local cog variables versus global hub variables.
Switching off optimization led to hub execution, which needs a time factor of 1.7 relative to fcached cog execution for the variant with local variables.
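For anyone wanting to reproduce this kind of comparison, here's a minimal FlexProp C sketch along the same lines (my own code, not the benchmark used above; the loop is arbitrary, and results will vary with optimization level and fcache settings):

```c
#include <stdio.h>
#include <propeller2.h>

#define N 1000

unsigned g_sum, g_i;                   // globals live in hub RAM

// Same loop twice: once through hub-resident globals, once through
// locals that the optimizer can keep in cog registers.
unsigned sum_globals(void)
{
    g_sum = 0;
    for (g_i = 0; g_i < N; g_i++)
        g_sum += g_i;
    return g_sum;
}

unsigned sum_locals(void)
{
    unsigned sum = 0, i;
    for (i = 0; i < N; i++)
        sum += i;
    return sum;
}

int main()
{
    unsigned t0, ticks, result;

    t0 = _cnt();
    result = sum_globals();
    ticks = _cnt() - t0;
    printf("globals: %u ticks (sum=%u)\n", ticks, result);

    t0 = _cnt();
    result = sum_locals();
    ticks = _cnt() - t0;
    printf("locals:  %u ticks (sum=%u)\n", ticks, result);
    return 0;
}
```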
I'd go one better and keep the execution time measure local to setup() in all tests, so the measuring isn't adding more overhead than needed. And while I was at it I replaced _getus() with _cnt() for more accurate time sampling, which proved to be more substantial.
With locals: 5.64 us
Without locals: 33.52 us
Huh, and amusingly, when removing the 100x looping, the measurement improves again by measuring a single run:
With locals: 5.55 us
Without locals: 33.47 us
Double huh, the tables turn with -O0 unoptimised:
With locals: 100 runs = 9.60 us, 1 run = 9.72 us
Without locals: 100 runs = 46.00 us, 1 run = 46.12 us
Hmm, -O2 optimising does a number on the locals case:
With locals: (core dump)
Without locals: 100 runs = 32.85 us, 1 run = 33.26 us
No one is testing -O2 I suspect. I better report it ...
Some time ago, I did these Fibo46 tests with Taqoz Forth and its assembler, all on a P2 at 200 MHz.
The Forth version uses normal Forth words, not the special word BOUNDS. Taqoz holds the first 4 items of the stack in cog registers, so commands like swap or rot can work directly with cog registers. The stack is in LUT RAM.
So the leftmost 3 bars are with "local" "cog variables". It is quite impressive, I think, that the interactive Taqoz Forth is in the same league here as the compiler!
Try changing your implementation to a do {...} while (--n) loop.
(I think that has the same semantics? Haven't tried running it)
The compiler isn't quite smart enough to know when a for loop can be simplified. The do {...} while (--n) construct maps directly onto the hardware DJNZ instruction (which can get further optimized into REP). ASM generated for the above:
ASM for original for loop:
Yep, that's 8 cycles per iteration vs 22 cycles
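To illustrate the shape of that rewrite (my own sketch in FlexProp C, not the actual benchmark code; the function names and loop body are made up):

```c
// Generic counted loop; the compiler emits a compare-and-branch each pass.
unsigned sum_for(unsigned n)
{
    unsigned sum = 0, i;
    for (i = 0; i < n; i++)
        sum += i;
    return sum;
}

// Same work, but the trailing --n test maps onto the hardware DJNZ
// instruction (and can be turned into REP by the optimizer).
// Note: assumes n >= 1, since the body always runs at least once.
unsigned sum_djnz(unsigned n)
{
    unsigned sum = 0, i = 0;
    do {
        sum += i++;
    } while (--n);
    return sum;
}
```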
Nice!
So the benchmark above applies for less gifted programmers. :-)
I think the loop-reduce optimization (only in -O2) should improve it a bit with the same code, but that crashes the compiler for some reason.
Eric just fixed the compiler bug ...
-O2 optimisation:
With locals: 100 runs = 5.10 us, 1 run = 5.37 us
Without locals: 100 runs = 32.85 us, 1 run = 33.26 us
-O2 optimisation:
With locals: 100 runs = 1.89 us, 1 run = 2.17 us
Without locals: 100 runs = 20.00 us, 1 run = 20.47 us
Hi. I'm trying to understand this 6x performance difference in the context of my original question, which was mainly about code execution in cog RAM vs. Hub RAM. But, of course I see now, that where the variables are stored also matters.
So, just for my own edification, this benchmark shows that executing code in cog RAM (not Hub RAM), with variables also stored in cog RAM, is 6x faster than if the variables are stored in Hub RAM, right? But what if I execute the code from Hub RAM while having my variables stored in cog RAM?
Compiler option --fcache=0.
Ah-ha! Doing that makes the single run faster than the 100 runs.
Edit: With --fcache=0 -O2 the 5.37 us increases to 7.40 us. And the 5.10 us increases to 7.52 us.