Understanding the performance costs of hub execution.
Hi all. I'm new to the P2 and could use some help understanding this from a code execution time and predictability standpoint.
For my P1 project, I needed a relatively small amount of code with very low latency (<400 ns in many areas). I followed the "typical" model of writing PASM code for cog RAM with tight loops and pin wait code, and only used Spin1 to provide some simple start/stop/configure cog semantics and hub "mailbox" message passing. I'm sure we've all seen this model for coding on both the P1 and P2.
For my P2 project, I have more head room on the latency side, and my application is much more complicated. This makes me want to write much more code in Spin2 and use PASM2 much more sparingly. I'd like to try to have cog execution/RAM only used for running inline PASM2 code that is put in as an optimization, and probably also for code that is interrupt triggered (though, I don't really understand P2 interrupts and events very well, yet). I'll use smart pins too, and would prefer to manage them from Spin2.
I think this means that my code's critical sections will incur only the performance cost of moving them to cog RAM each time they execute, or the performance cost of hub execution if the compiler doesn't move them. Is this correct? How do I choose between these two options, and is there a straightforward way to measure this cost and have it be predictable? Also, in cases where this performance cost is not tenable, is an event/interrupt handler that permanently resides in cog RAM the right solution? How do I set those up, and how do I measure the performance overhead for that code?
Really, my goal here is to make the best use I can of Spin2 so that my code is simpler and easier to maintain in all the non-critical places. How best to do that, while cleanly integrating PASM2 for critical sections and understanding my performance trade-offs and implications: that is the crux of my question. I'm also very happy to be pointed to existing, relevant parts of the P2 documentation for the answer.
Best.
Comments
Hub execution in a P2 is the same speed as cog execution, except: branches are slower because the FIFO has to reload, hub RAM reads and writes can take longer while the FIFO is busy, and fast-skipping with SKIPF can't be used.
When using Flexprop you can insert PASM code in two ways: either have the compiler move the asm code into the cog and execute it from there, or inline it into the hub RAM program.
As flexprop can also compile BASIC and C, you could consider using one of those languages as well.
The answer depends on whether you're using the official Spin2 compiler or flexspin.
The normal ORG/END block in Spin2 will always load the code into cog RAM, at the cost of 1 clock cycle per instruction every time the block runs. Official spin will also add 2(?) cycles per local variable that's in scope.
Flexspin has ASM/ENDASM which puts the code into hub RAM (though if you have a loop in there the optimizer might end up caching it to cog ram anyways). I think(tm) you can do ORGH/END for optimizer-bypassing hubexec asm, too.
There's some free memory in the cog that can be used for such things, but it's kinda clunky to do so and the interrupt latency will be very wobbly depending on what the main thread is doing (math, inline ASM loading and many other things will stall IRQs).
If you want to do something low-latency or intensive, you'll still want to run it in a separate cog, but for doing a little bit of processing you can do inline ASM now. Though with flexspin the compiled Spin code is often close to optimal to begin with.
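To make that concrete, here's a minimal sketch of "a little bit of processing" done with inline assembly, written in FlexProp's C dialect (my own illustration, not code from this thread; the function name is made up, but ONES is a real PASM2 instruction, and FlexC inline asm can refer to local variables by name):

```c
#include <stdio.h>

// Count the '1' bits in a word with a single PASM2 instruction.
// The inline asm reads the C local x and writes the C local n directly.
static int popcount32(unsigned x)
{
    int n;
    __asm {
        ones n, x        // PASM2 ONES: n := number of set bits in x
    }
    return n;
}

int main()
{
    printf("bits set in 0xF0F0F0F0: %d\n", popcount32(0xF0F0F0F0));
    return 0;
}
```

Depending on the optimization and fcache settings, the compiler may copy such a function into cog RAM or leave it executing from hub RAM, which is exactly the trade-off being discussed here.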
The system time register is fetched with the GETCT instruction. There is a Spin2 equivalent, GETCT(), that can be used for precisely measuring code execution intervals.
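A minimal sketch of that measurement pattern, shown here in FlexProp C since that's what the benchmarks later in this thread use (in Spin2 the same idea is a getct() call before and after the code under test; the function being measured and the 200 MHz clock are my own placeholder assumptions):

```c
#include <stdio.h>
#include <propeller2.h>

// Placeholder for whatever code is being measured.
static void critical_section(void)
{
    for (volatile int i = 0; i < 100; i++)
        ;
}

int main()
{
    unsigned t0 = _cnt();              // read the system counter (GETCT)
    critical_section();
    unsigned ticks = _cnt() - t0;      // elapsed sysclock ticks

    // At an assumed 200 MHz sysclock, 200 ticks = 1 microsecond.
    printf("%u ticks (%u us at 200 MHz)\n", ticks, ticks / 200);
    return 0;
}
```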
Pnut/Proptool supports embedding an ISR in cogRAM alongside compiled code.
Flex doesn't support this from any language.
Pnut/Proptool always pseudo-inlines inline pasm, as in it always copies such code into cogexec space upon each entry.
With FlexC/FlexBasic it's optional; the compiler will optimise accordingly by default, but there are directives to force true hubexec inlining, for example. The different behaviour is a product of the different type of compile: Flex produces natively compiled code, while Pnut/Proptool compiles to bytecode. That said, FlexSpin, as opposed to FlexC, doesn't have any such directives I don't think; it just mimics Pnut's cogexec handling of inline pasm. Err, I keep forgetting about ASM/ENDASM.
Thanks for that summary. Also, pardon the silly question, but is flexprop just a GUI for running the flexspin compiler that I'm a little familiar with?
Thanks. super useful pro tip!
Aye, thanks for pointing that out. Those are tool complexities I had not even considered yet, and it brings up the question of how to mix together output from multiple tools.
Optimizer caching scares me, though I'm not sure and maybe it's a great solution. Support for ORGH/END seems like it would support self-modifying code, and that is kind of exciting to me.
The easy way is to just use Flex. It'll compile it all together. The Spin2 sources integrate into C/Basic and vice versa.
There are directives for controlling what Pasm code gets left alone. And Spin2-based ORG/END is unoptimised.
ASM/ENDASM just plomps your inline ASM into the same IR processing that the high-level generated code goes through (as do the other inline ASM methods, but they flag it so as to not participate in most opt passes). Useful for when you just need some sequence of code without regard for its particular timing or order. Which is most of the time when not doing I/O.
FWIW: I just went through this cog-execution thing:
https://forums.parallax.com/discussion/174078/flexbasic-for-cog#latest
Craig
I rarely need self-modifying code with P2 thanks to the new ALTx type instructions...
Yes, you have to avoid skipf if you are using interrupts, but other than that AFAIK interrupts work ok with Hub execution.
Perhaps this is a Flexprop limitation? Can anyone clarify?
Ross.
The Flex suite is not written as interrupt safe. Mixing an ISR with compiled code on the same cog will cause data corruption in the compiled code.
Chip has carefully ensured Pnut/Proptool's VM is interrupt safe with many inserted cases of REP #1 shielding those sensitive code areas.
As Ross has observed, an ISR may happily exist in hubRAM.
That makes sense. Thanks.
Actually, my earlier post was slightly misleading - yes, SKIPF can't be used in Hub mode, but SKIP can, and it is ok to use both SKIP and SKIPF in conjunction with interrupts - you just can't use them inside an ISR. Since Catalina allows an arbitrary C function to be used as an ISR, this means I have to avoid using both SKIP and SKIPF pretty much altogether except in situations where I know there are no interrupts in use. In particular I can't use them in any library code, because an ISR can call any library function.
As far as I understand, any hub RAM data access will disturb the egg beater timing too. The egg beater is only optimal if (address mod 8) is consecutive.
Not quite sure what you mean; however, hub RAM reads and writes in hub exec mode can be 10 cycles slower than in cog/LUT exec mode. This is due to the FIFO reloading when six longs are free, i.e. every six instructions after a branch. Due to pipelining, reloading might start after five instructions (I have not tested it).
That's true for the Prop1. Prop2 is more complicated, and it can be as low as 3 sysclock ticks for a hubRAM write.
... or as high as 20 cycles in hub exec mode, compared to 10 max for cog/LUT exec.
20 is a slightly strange number and 18 seems more obvious at first glance. I'll try to reverse-engineer 20 to understand what is going on.
I wanted to know the speed of, and the penalty for, hub variables versus cog variables.
In this case both versions were compiled with FlexProp, standard optimization, using fcache, so the code is executed from cog RAM.
The result is about 6 times faster in this case (!) with local cog variables versus global hub variables.
Switching off optimization led to hub execution, which needs a time factor of 1.7 relative to fcached cog execution for the variant with local variables.
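For anyone wanting to reproduce this kind of comparison, here's a minimal FlexProp C sketch along the same lines (my own code, not the benchmark used above; the loop is arbitrary, and results will vary with optimization level and fcache settings):

```c
#include <stdio.h>
#include <propeller2.h>

#define N 1000

unsigned g_sum, g_i;                   // globals live in hub RAM

// Same loop twice: once through hub-resident globals, once through
// locals that the optimizer can keep in cog registers.
unsigned sum_globals(void)
{
    g_sum = 0;
    for (g_i = 0; g_i < N; g_i++)
        g_sum += g_i;
    return g_sum;
}

unsigned sum_locals(void)
{
    unsigned sum = 0, i;
    for (i = 0; i < N; i++)
        sum += i;
    return sum;
}

int main()
{
    unsigned t0, ticks, result;

    t0 = _cnt();
    result = sum_globals();
    ticks = _cnt() - t0;
    printf("globals: %u ticks (sum=%u)\n", ticks, result);

    t0 = _cnt();
    result = sum_locals();
    ticks = _cnt() - t0;
    printf("locals:  %u ticks (sum=%u)\n", ticks, result);
    return 0;
}
```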
I'd go one better and keep the execution time measure local to setup() in all tests, so the measuring isn't adding more overhead than needed. And while I was at it I replaced _getus() with _cnt() for more accurate time sampling, which proved to be more substantial.
With locals: 5.64 us
Without locals: 33.52 us
Huh, and amusingly, when removing the 100x looping, the measurement improves again by measuring a single run:
With locals: 5.55 us
Without locals: 33.47 us
Double huh, the tables turn with -O0 unoptimised:
With locals: 100 runs = 9.60 us, 1 run = 9.72 us
Without locals: 100 runs = 46.00 us, 1 run = 46.12 us
Hmm, -O2 optimising does a number on the locals case:
With locals: (core dump)
Without locals: 100 runs = 32.85 us, 1 run = 33.26 us
No one is testing -O2 I suspect. I better report it ...
Some time ago, I did these Fibo46 tests with Taqoz Forth and its assembler, all on a P2 at 200 MHz.
The Forth version uses normal Forth words, not the special word BOUNDS. Taqoz holds the first 4 items of the stack in cog registers, so commands like swap or rot can work directly with cog registers. The stack is in LUT RAM.
So the leftmost 3 bars are with "local" "cog variables". It is quite impressive, I think, that the interactive Taqoz Forth is in the same league here as the compiler!
Try changing your implementation to a do {...} while (--n) loop.
(I think that has the same semantics? Haven't tried running it)
The compiler isn't quite smart enough to know when a for loop can be simplified. The do {...} while (--n) construct maps directly onto the hardware DJNZ instruction (which can get further optimized into REP). ASM generated for the above:
ASM for original for loop:
Yep, that's 8 cycles per iteration vs 22 cycles
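To illustrate the shape of that rewrite (my own sketch in FlexProp C, not the actual benchmark code; the function names and loop body are made up):

```c
// Generic counted loop; the compiler emits a compare-and-branch each pass.
unsigned sum_for(unsigned n)
{
    unsigned sum = 0, i;
    for (i = 0; i < n; i++)
        sum += i;
    return sum;
}

// Same work, but the trailing --n test maps onto the hardware DJNZ
// instruction (and can be turned into REP by the optimizer).
// Note: assumes n >= 1, since the body always runs at least once.
unsigned sum_djnz(unsigned n)
{
    unsigned sum = 0, i = 0;
    do {
        sum += i++;
    } while (--n);
    return sum;
}
```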
Nice!
So the benchmark above applies for less gifted programmers. :-)
I think the loop-reduce optimization (only in -O2) should improve it a bit with the same code, but that crashes the compiler for some reason.
Eric just fixed the compiler bug ...
-O2 optimisation:
With locals: 100 runs = 5.10 us, 1 run = 5.37 us
Without locals: 100 runs = 32.85 us, 1 run = 33.26 us
-O2 optimisation:
With locals: 100 runs = 1.89 us, 1 run = 2.17 us
Without locals: 100 runs = 20.00 us, 1 run = 20.47 us
Hi. I'm trying to understand this 6x performance difference in the context of my original question, which was mainly about code execution in cog RAM vs. Hub RAM. But, of course I see now, that where the variables are stored also matters.
So, just for my own edification, this benchmark shows that executing code in cog RAM (not Hub RAM), with variables also stored in cog RAM, is 6x faster than if the variables are stored in Hub RAM, right? But what if I execute the code from Hub RAM while having my variables stored in cog RAM?
Compiler option --fcache=0.
Ah-ha! Doing that makes the single run faster than the 100 runs.
Edit: With --fcache=0 -O2 the 5.37 us increases to 7.40 us. And the 5.10 us increases to 7.52 us.