P2 hub execution performance variation
rogloh
Posts: 5,787
Here is something we all now need to consider when running the P2 in hub exec mode.
During some C code benchmarking I noticed that my results varied from one recompile to the next. At first I (incorrectly) attributed the variation entirely to my software changes, which were meant to improve performance. It turns out that the execution speed of the code also depends on its alignment to hub boundaries, and the effect is noticeable.
To test this I linked in a dummy C module whose only job is to take up space, pushing the rest of my code up by a variable number of longs:
int space[12]; // vary this number and run the benchmarks
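To sweep the padding without editing the source each time, the array length could come from a build-time macro. This is only a sketch: `PAD_LONGS` and `hub_slot` are names made up for illustration, and the slot calculation assumes that consecutive hub longs rotate through 8 slots, which is what the 8-long cycle reported in this post suggests.

```c
#include <stdint.h>

/* Hypothetical build-time knob: compile with e.g. -DPAD_LONGS=3 */
#ifndef PAD_LONGS
#define PAD_LONGS 12
#endif

/* Dummy array that only exists to shift whatever the linker places after it. */
int space[PAD_LONGS];

/* Assumed model: which of the 8 long-sized slots a hub byte address falls in. */
unsigned hub_slot(uint32_t byte_address)
{
    return (byte_address >> 2) & 7u;
}
```

Under that assumption, padding by n longs and by n + 8 longs leaves every following address in the same slot, so a sweep of 8 consecutive pad sizes should cover every case.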
I found that my benchmark results varied by something on the order of 2.5%, following a cycle that corresponds to a hub boundary of 8 longs.
Additional Micropython benchmarks

memory pad        1 adds   2 adds   3 adds   10!
int space[1]:     407296   309103   241826   15783
int space[2]:     406496   302099   243592   15747
int space[3]:     399995   302556   244486   15708
int space[4]:     400798   303474   245687   15742
int space[5]:     400792   307676   247512   15775
int space[6]:     404858   308150   249053   15780
int space[7]:     408128   307203   249053   15718
int space[8]:     409827   310543   246596   15746
int space[9]:     407306   309103   241825   15783   << pattern repeats from space[1] above
int space[10]:    406484   302099   243592   15747
int space[11]:    399979   302556   244486   15708
int space[12]:    400786   303474   245687   15742
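The size of the effect can be checked directly from the table. The sketch below copies the space[1] to space[8] rows (one full 8-long cycle) and computes each column's spread as (max - min) / min in hundredths of a percent. The function and array names are illustrative only.

```c
#define PADS  8
#define TESTS 4

/* Rows for int space[1]..space[8]; columns are 1 adds, 2 adds, 3 adds, 10! */
static const long results[PADS][TESTS] = {
    {407296, 309103, 241826, 15783},
    {406496, 302099, 243592, 15747},
    {399995, 302556, 244486, 15708},
    {400798, 303474, 245687, 15742},
    {400792, 307676, 247512, 15775},
    {404858, 308150, 249053, 15780},
    {408128, 307203, 249053, 15718},
    {409827, 310543, 246596, 15746},
};

/* Spread of one column in hundredths of a percent, truncated: (max - min) / min. */
long spread_hundredths_percent(int column)
{
    long lo = results[0][column];
    long hi = results[0][column];
    for (int i = 1; i < PADS; i++) {
        if (results[i][column] < lo) lo = results[i][column];
        if (results[i][column] > hi) hi = results[i][column];
    }
    return (hi - lo) * 10000 / lo;
}
```

A difference between two builds that is smaller than the spread of the relevant column cannot be attributed to a code change without also controlling for the padding.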
Unfortunately this is going to be difficult to control if you want the best possible performance. It appears that some functions will run faster and some slower depending on the hub address they start at, and each has its own sweet spot, so you could not optimize every single one without a lot of work. That is not really practical unless you are hand coding in assembly language. We may have to live with the fact that performance will vary by some (hopefully small) amount on each recompile, simply because the spacing between pieces of executable code changes whenever files are edited.
This effect is not something you'd typically see on other microcontrollers because their memory access time is more consistent, unlike the P2, particularly when running in hub exec mode.
Comments
I think this is better called an alignment issue than a boundary issue?
I think P2 code fetch from COG should have no artifacts, but VARs that reside in HUB, even for COG code, will need to wait for their slot.
If you need to access many variables sequentially, there can be better and worse memory spaces/orders to use.
Other micros can also have ALIGN artifacts. The EFM8UB3 Peter is using on P2D2, actually has a quite similar, but simpler artifact, not that well documented.
On that chip, the code fetch is 32 bits wide, and then opcodes are drawn from that buffer.
That means the worst possible outcome is to have a 2 or 3 byte opcode that straddles 2 32b fetches.
In that case, the CPU cannot execute until the next part of the opcode arrives, so there is a 1 SysCLK added penalty on top of the Flash pipeline effects.
You can tune code to avoid this (well, reduce it on critical paths) by careful use of the ALIGN operator in the assembler, and even by careful shuffles that place shorter opcodes on landing addresses.
That trades off a small size hit, for faster code, so it's something you do in very time critical interrupts. (eg the Serial port interrupts)
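The straddle condition described above is easy to state as a check. This is a minimal sketch assuming only what the post says (a 32-bit, i.e. 4-byte, code fetch and opcodes of 1 to 3 bytes); `straddles_fetch` is a made-up helper name, not part of any EFM8 tool.

```c
#include <stdbool.h>
#include <stdint.h>

#define FETCH_BYTES 4u  /* 32-bit code fetch, per the description above */

/* True if an opcode of `length` bytes starting at `address`
   crosses into the next 4-byte fetch. */
bool straddles_fetch(uint32_t address, uint32_t length)
{
    return (address % FETCH_BYTES) + length > FETCH_BYTES;
}
```

For example, a 3-byte opcode at offset 2 within a fetch straddles, while the same opcode at offset 0 or 1 does not, which is the kind of placement an ALIGN directive can force.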
Not so easy in a HLL.
i.e. rather like the P2, linear code on the EFM8 is quite quick, but code that 'jumps all over the place' incurs penalties.
In many cases, such smaller code path variations would be tolerated, but I can see it makes code tuning more of a pain, as you are less sure if a change drove a gain, or if it was an align fluke.
Maybe a P2 simulator that reports code time + align times, could break out the effects into 2 or 3 timing columns ?
It's curious that not all columns repeat exactly, why would that be ?
10! repeats, 3 adds almost repeats (1 is off by 1), 2 adds repeats, but 1 adds alternates faster/slower more ?
Maybe you need to shift/sweep just the test code thru memory.
1 - The time between two data accesses.
2 - The address distance between two data addresses.
The time factor is what the prop1 has. For an 8-cog prop2, every 8 clocks is another access slot for your cog. That's it, but this has implications when combined with the second factor.
The data addressing factor is dynamic in clocks, depending on longword aligned distance apart, modulo'd to 8. The effect this can produce can even be taken advantage of.
The equation for the prop1 is: interval = rotations * 16
The equation for an 8-cog prop2 is: interval = (rotations * 8) + (distance & 7)
Interval is clocks until next hubRAM access slot is available. Rotations is how many hub rotations have occurred between data accesses. Distance is address difference, in longword alignments, between last data access and this data access.
And rotations can be zero for the prop2. This makes it possible to hit the minimum instruction times.
EDIT: Added the longword alignments clarification.
EDIT2: Misaligned accesses also add one clock to the access time. If bursting, this adds one clock to the burst duration. It effectively acts as a minus one on the upcoming distance, or as a rounded-down byte-to-long address calculation.
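The two equations above can be written out as small C helpers. This is a sketch of the formulas exactly as posted, not a verified timing model: the function names are invented, addresses are byte addresses rounded down to longs as EDIT2 describes, and the extra clock for misaligned accesses is not modeled.

```c
#include <stdint.h>

/* Prop1: interval = rotations * 16 */
uint32_t p1_interval(uint32_t rotations)
{
    return rotations * 16u;
}

/* 8-cog Prop2: interval = (rotations * 8) + (distance & 7),
   where distance is the difference in long addresses between
   the previous and the next data access. */
uint32_t p2_interval(uint32_t rotations,
                     uint32_t prev_byte_addr,
                     uint32_t next_byte_addr)
{
    /* Unsigned subtraction wraps modulo 2^32, so the low 3 bits
       are still correct when the next address is lower. */
    uint32_t distance = (next_byte_addr >> 2) - (prev_byte_addr >> 2);
    return rotations * 8u + (distance & 7u);
}
```

With rotations at zero, an access one long higher than the previous one gives an interval of 1, while an access one long lower gives 7, which illustrates why the order of sequential variable accesses matters.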
I think considering this is probably one of the keys to understanding it. When padding is added to the code and everything moves to higher addresses, some of the jumps will land on different target addresses, which may take longer in some cases and less time in others. I would imagine that data accesses are similarly affected if the data being accessed is interspersed within the code, as it is in p2gcc.
I would hope that in the future a proper GNU linker that respects map files etc. would give you more control over data and code alignment, and over where each lives in the image, so you could isolate the data access address changes a little more from any code changes. This fine control is going to be tough to deal with as a C programmer, so I imagine in most cases people will just live with all these timing variations. Those who need precise real-time control will want to get right down to the HW hub timing architecture to figure it out, and will probably work directly in PASM instead of C.
Certainly for HUB exec/Data cases, yes, but someone coding in C for COG/LUT code and COG/LUT data should avoid this variance effect?
This will be where your in-house languages have an edge, because they do not have the GCC + Another processor baggage.
i.e. even PASM is not going to eliminate HUB slot timing alignment. It will give you more control over it, but I'd expect users would bother only in rare cases.
Someone may even choose to run HUB.RISC-V.GCC and use Fastspin.C /BASIC/Spin in a separate COG, where hard real time is needed.
Good idea, I might try it to see the effect when I get the chance.
Is SPIN2 going to be hard real-time I wonder? Not sure about that. That's also touching hub variables, and I heard rumours that parts of it are now going to be using hub exec too. I guess at least that the data blocks can be separated from the code blocks if it continues to lay memory out like SPIN1 did, so if you change your code somewhere, another unchanged function could hopefully continue to execute at the same speed if its data is still maintained at the same addresses, or does it not work like that...?
So I tried a test here with -falign-functions=32 and -falign-labels=32, and the p2asm tool wouldn't let me build it by default because it only supported alignment up to 16. When I patched it to allow 32-byte alignment the image crashed, so there is some other problem to deal with.
One thing I did notice after doing this is that the native Micropython image size increased by another 51kB compared to the original 217kB (excluding the heap) so it wouldn't really be that practical to waste this extra space in many cases where you have a large application with lots of functions.
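One way to limit the size cost might be to align only the few functions that matter rather than every function. GCC and Clang accept an `aligned` attribute on individual functions; whether p2gcc and p2asm honor it for 32 bytes is untested here, and the p2asm limit mentioned above suggests it may not work without changes. `hot_loop` and `is_hub_aligned` are hypothetical names used for illustration.

```c
#include <stdint.h>

/* GCC/Clang extension: request 32-byte (8-long) alignment
   for the entry point of this one function only. */
__attribute__((aligned(32)))
int hot_loop(int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += i;
    }
    return sum;
}

/* Returns 1 if the function's entry address is a multiple of 32 bytes.
   Converting a function pointer to an integer is implementation-defined
   but works on common toolchains. */
int is_hub_aligned(int (*fn)(int))
{
    return ((uintptr_t)fn % 32u) == 0;
}
```

This only pins the function entry point. Labels inside the function, and any data it touches, still move with the rest of the image.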
Yeah, it's not really going to be a general solution.
It could be useful to place some local vars into LUT, and even to place interrupt code into LUT, but that's getting outside the scope of interpreters.
The only time I've used align to control code speed is in interrupt code on the EFM8 series. It's not that important outside places where cycles matter.
P2 has the ability to use WAIT to lock back to hard real time, should someone require that.