P2 hub execution performance variation

rogloh · 2019-07-21 08:34

Here is something we all now need to consider when running the P2 in hub exec mode.

During some C code benchmarking I noticed weird behavior: results varied each time I recompiled, and at first I (incorrectly) attributed all of the variation to my software changes aimed at improving performance. It turns out that the execution speed of the code also depends on its alignment to hub boundaries, and the effect is noticeable.

To test this out I linked in a dummy C module that does nothing but take up extra space at link time, pushing the rest of my code up by a variable number of longs:

int space[12];  // vary this number and run the benchmarks

I found my different test benchmark results varied by something in the order of 2.5% and followed a cycle that corresponds to a hub boundary of 8 longs.

 Additional         Micropython benchmarks
 memory pad      1 adds  2 adds  3 adds   10!

int space[1]:    407296  309103  241826  15783
int space[2]:    406496  302099  243592  15747
int space[3]:    399995  302556  244486  15708
int space[4]:    400798  303474  245687  15742
int space[5]:    400792  307676  247512  15775
int space[6]:    404858  308150  249053  15780
int space[7]:    408128  307203  249053  15718
int space[8]:    409827  310543  246596  15746

int space[9]:    407306  309103  241825  15783 << pattern repeats from space[1] above
int space[10]:   406484  302099  243592  15747
int space[11]:   399979  302556  244486  15708
int space[12]:   400786  303474  245687  15742
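As a rough check on the "order of 2.5%" figure, the spread within one 8-long cycle can be computed directly from the table. This is only a minimal sketch: the array and function names are made up for illustration, and only the space[1] to space[8] rows are used.

```c
#include <stddef.h>

/* Benchmark results for int space[1] .. int space[8], copied from the table above. */
static const long one_add[8]     = {407296, 406496, 399995, 400798, 400792, 404858, 408128, 409827};
static const long two_adds[8]    = {309103, 302099, 302556, 303474, 307676, 308150, 307203, 310543};
static const long three_adds[8]  = {241826, 243592, 244486, 245687, 247512, 249053, 249053, 246596};
static const long factorial10[8] = {15783, 15747, 15708, 15742, 15775, 15780, 15718, 15746};

/* Spread of one column as a percentage of its smallest value:
 * 100 * (max - min) / min. */
static double spread_percent(const long *v, size_t n)
{
    long lo = v[0], hi = v[0];
    for (size_t i = 1; i < n; i++) {
        if (v[i] < lo) lo = v[i];
        if (v[i] > hi) hi = v[i];
    }
    return 100.0 * (double)(hi - lo) / (double)lo;
}
```

With these numbers, spread_percent(one_add, 8) comes out at roughly 2.5%, the 2 adds and 3 adds columns a little under 3%, and the 10! column around 0.5%.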

Unfortunately this is going to be difficult to control if you want the best performance possible. It would appear that some functions will run faster and some slower, depending on which hub address they start on. Each would have its own sweet spot, so you could not really optimize every single one without a lot of work. That is not practical unless you are hand coding in assembly language, so we may have to live with the fact that performance will vary by some (hopefully small) amount on each recompile, simply because the spacing between pieces of executable code changes each time files are edited.

This effect is not something you'd typically ever see on other microcontrollers because their memory access time is more consistent, unlike the P2, particularly when running in hub exec mode.

jmg · 2019-07-21 09:58

rogloh wrote: »

I found my different test benchmark results varied by something in the order of 2.5% and followed a cycle that corresponds to a hub boundary of 8 longs.

When you say 'hub boundary' here, is the main effect variables moving to a different egg beater slot, or code jump destinations varying, and so change the time needed to wait-for-align ?
I think this is better called an alignment issue than a boundary issue ?

rogloh wrote: »

This effect is not something you'd typically ever see on other microcontrollers because their memory access time is more consistent, unlike the P2, particularly when running in hub exec mode.

I think P2 code fetch from COG should have no artifacts, but VARs that reside in HUB, even for COG code, will need to wait for their slot.
If you need to access many variables sequentially, there can be better and worse memory spaces/orders to use.

Other micros can also have ALIGN artifacts. The EFM8UB3 Peter is using on P2D2, actually has a quite similar, but simpler artifact, not that well documented.
On that chip, the code fetch is 32 bits wide, and then opcodes are drawn from that buffer.
That means the worst possible outcome is to have a 2 or 3 byte opcode that straddles 2 32b fetches.
In that case, the CPU cannot execute until the next part of the opcode arrives, so there is a 1 SysCLK added penalty on top of the Flash pipeline effects.
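The straddle condition described here can be written as a one-line check. This is a sketch of the arithmetic only, assuming fetches are aligned to 4-byte boundaries; the function name is hypothetical and nothing here is taken from EFM8 documentation.

```c
#include <stdint.h>

/* Returns 1 if an opcode of `len` bytes starting at byte address `addr`
 * crosses a 4-byte (32-bit) fetch boundary, 0 otherwise.
 * Assumption: code fetches are aligned to 4-byte boundaries. */
static int straddles_fetch(uint32_t addr, unsigned len)
{
    return (addr & 3u) + len > 4u;
}
```

For example, a 2-byte opcode at an address ending in 3 straddles two fetches, while the same opcode at an address ending in 2 does not.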

You can tune code to avoid this (well, reduce it on critical paths), by careful use of ALIGN operator in Assembler, and even careful shuffles to place shorter opcodes on landing addresses.
That trades off a small size hit, for faster code, so it's something you do in very time critical interrupts. (eg the Serial port interrupts)

Not so easy in a HLL.

ie rather like the P2, EFM8 linear code is quite quick, but 'jumps all over the place' incur penalties.

In many cases, such smaller code path variations would be tolerated, but I can see it makes code tuning more of a pain, as you are less sure if a change drove a gain, or if it was an align fluke.

Maybe a P2 simulator that reports code time + align times, could break out the effects into 2 or 3 timing columns ?

rogloh wrote: »

 memory pad      1 adds  2 adds  3 adds   10!
int space[1]:    407296  309103  241826  15783
int space[2]:    406496  302099  243592  15747
int space[3]:    399995  302556  244486  15708
int space[4]:    400798  303474  245687  15742
int space[5]:    400792  307676  247512  15775
int space[6]:    404858  308150  249053  15780
int space[7]:    408128  307203  249053  15718
int space[8]:    409827  310543  246596  15746

int space[9]:    407306  309103  241825  15783 << pattern repeats from space[1] above
int space[10]:   406484  302099  243592  15747
int space[11]:   399979  302556  244486  15708
int space[12]:   400786  303474  245687  15742

It's curious that not all columns repeat exactly, why would that be ?
10! repeats, 3 adds almost repeats (1 is off by 1), 2 adds repeats, but 1 adds alternates faster/slower more ?
Maybe you need to shift/sweep just the test code thru memory.

Cluso99 · 2019-07-21 10:09

Once hub instructions sync, the instructions will remain in sync until a jump executes. At this point, the hub needs to wait until the timeslot for that instruction comes around, and then will remain in sync until the next jump, etc.

evanh · 2019-07-21 10:12

It's tricky. What you're seeing there is an alignment arrangement between cog number and hub data address. On the Prop1 it was just a code spacing concern to line up hub data reads and writes. On the Prop2 that spacing is still there, but there is additionally this relation between cog number and data address. Then, on top of that, if using hubexec, there are also FIFO reloads stalling the data accesses.

evanh · 2019-07-21 13:25

Err, I'll be wrong about the cog number thing. It exists but you haven't hit variances from that one yet. So, ignoring that, there are still two factors at play:
1 - The time between two data accesses.
2 - The address distance between two data addresses.

The time factor is what the prop1 has. For an 8-cog prop2, every 8 clocks is another access slot for your cog. That's it, but this has implications when combined with the second factor.

The data addressing factor is dynamic in clocks, depending on longword aligned distance apart, modulo'd to 8. The effect this can produce can even be taken advantage of.

The equation for the prop1 is: interval = rotations * 16
The equation for an 8-cog prop2 is: interval = (rotations * 8 ) + (distance & 7)

Interval is clocks until next hubRAM access slot is available. Rotations is how many hub rotations have occurred between data accesses. Distance is address difference, in longword alignments, between last data access and this data access.

And rotations can be zero for the prop2. This allows possible utilising of minimum instruction times.
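The two equations above translate directly into C. This is a sketch of the formulas as posted, not a cycle-accurate model: the function names are hypothetical, addresses are assumed to be byte addresses that get rounded down to longs, and FIFO reloads are ignored.

```c
#include <stdint.h>

/* Prop1: interval = rotations * 16 */
static unsigned p1_interval(unsigned rotations)
{
    return rotations * 16u;
}

/* 8-cog Prop2: interval = (rotations * 8) + (distance & 7)
 * `distance` is the difference between the two accesses in longs.
 * Unsigned subtraction keeps the "& 7" correct when the second
 * address is lower than the first. */
static unsigned p2_interval(unsigned rotations,
                            uint32_t prev_byte_addr,
                            uint32_t next_byte_addr)
{
    uint32_t distance = (next_byte_addr >> 2) - (prev_byte_addr >> 2);
    return rotations * 8u + (unsigned)(distance & 7u);
}
```

For instance, p2_interval(0, 0x1000, 0x1004) is 1, whereas stepping back one long with p2_interval(0, 0x1004, 0x1000) is 7, which is why the order in which hub variables are touched matters.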

EDIT: Added the longword alignments clarification.
EDIT2: Misaligned accesses also add one clock to the access time. If bursting then it adds one clock to the burst duration. This effectively is a minus one to the upcoming distance, or rounded down byte to long addressing calculation.

rogloh · 2019-07-21 14:22

Cluso99 wrote: »

Once hub instructions sync, the instructions will remain in sync until a jump executes. At this point, the hub needs to wait until the timeslot for that instruction comes around, and then will remain in sync until the next jump, etc.

I think considering this is probably one of the keys to understanding it. When padding is added to the code and everything moves to higher addresses, some of the jumps will now go to different target addresses, which may take longer in some cases and less time in others. I would imagine that the data accesses may be similarly affected if the data being accessed is interspersed within the code, as it is in p2gcc. I would hope that in the future, using a proper GNU linker that respects map files etc., you'd have more control over data and code alignment and over where each lives in the image, so you may be able to isolate the data access address changes a little more from any code changes. This fine control is going to be tough to deal with as a C programmer so I imagine in most cases people will just live with all these timing variations unless they need precise real time control in which case they will want to get right down to the HW hub timing architecture to be able to figure it out and probably work directly in PASM instead of C.

evanh · 2019-07-21 14:30

Yes, FIFO reloads follow the same rules as I've just listed. Only diff is there is a burst length, the first and last addresses are one apart after modulo. So this affects the distance factor.

Wuerfel_21 · 2019-07-21 16:55

GCC has -falign-functions=32 and -falign-labels=32 (or any other number) options. That might lead to more reliable (not necessarily better) execution time. Not sure whether propeller-gcc supports those.
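Besides the command-line options, GCC also has a per-function `aligned` attribute, which applies the alignment to selected functions rather than to the whole image. The sketch below assumes the toolchain honors that attribute; it has not been tried with p2gcc or propeller-gcc, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Request a 32-byte (8-long) start address for one hot function only.
 * Assumption: the compiler and assembler in use honor this GCC attribute. */
__attribute__((aligned(32)))
int hot_sum(const int *data, int count)
{
    int total = 0;
    for (int i = 0; i < count; i++)
        total += data[i];
    return total;
}

/* Returns 1 if hot_sum really landed on a 32-byte boundary. */
int hot_sum_is_aligned(void)
{
    return ((uintptr_t)&hot_sum & 31u) == 0;
}
```

Checking the start address at run time, as hot_sum_is_aligned() does, is a cheap way to confirm whether the request survived the assembler and linker.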

jmg · 2019-07-21 19:48

rogloh wrote: »

... This fine control is going to be tough to deal with as a C programmer so I imagine in most cases people will just live with all these timing variations unless they need precise real time control in which case they will want to get right down to the HW hub timing architecture to be able to figure it out and probably work directly in PASM instead of C.

Certainly for HUB exec/Data cases, yes, but someone coding in C for COG/LUT code and COG/LUT Data should avoid this variance effect ?
This will be where your in-house languages have an edge, because they do not have the GCC + Another processor baggage.
ie even PASM is not going to eliminate HUB slot timing alignment, it will give you more control over it, but I'd expect only in rare cases would users bother.

Someone may even choose to run HUB.RISC-V.GCC and use Fastspin.C /BASIC/Spin in a separate COG, where hard real time is needed.

rogloh · 2019-07-22 00:18

Wuerfel_21 wrote: »

GCC has -falign-functions=32 and -falign-labels=32 (or any other number) options. That might lead to more reliable (not necessarily better) execution time. Not sure whether propeller-gcc supports those.

Good idea, I might try it to see the effect when I get the chance.

jmg wrote: »

Someone may even choose to run HUB.RISC-V.GCC and use Fastspin.C /BASIC/Spin in a separate COG, where hard real time is needed.

Is SPIN2 going to be hard real-time I wonder? Not sure about that. That's also touching hub variables, and I heard rumours that parts of it are now going to be using hub exec too. I guess at least that the data blocks can be separated from the code blocks if it continues to lay memory out like SPIN1 did, so if you change your code somewhere, another unchanged function could hopefully continue to execute at the same speed if its data is still maintained at the same addresses, or does it not work like that...?

Tubular · 2019-07-22 01:00

These are good questions. I guess we'll find out quite soon

rogloh · 2019-07-22 02:04

rogloh wrote: »

Wuerfel_21 wrote: »

GCC has -falign-functions=32 and -falign-labels=32 (or any other number) options. That might lead to more reliable (not necessarily better) execution time. Not sure whether propeller-gcc supports those.

Good idea, I might try it to see the effect when I get the chance.

So I tried to do a test here with -falign-functions=32 and -falign-labels=32, and the p2asm tool wouldn't let me build it by default because it only supported alignment up to 16. However, when I patched it to include 32 byte alignment the image crashed, so there is some other problem to deal with.

One thing I did notice after doing this is that the native Micropython image size increased by another 51kB compared to the original 217kB (excluding the heap) so it wouldn't really be that practical to waste this extra space in many cases where you have a large application with lots of functions.

jmg · 2019-07-22 04:42

rogloh wrote: »

So I tried to do a test here with -falign-functions=32 and -falign-labels=32, and the p2asm tool wouldn't let me build it by default because it only supported alignment up to 16. However, when I patched it to include 32 byte alignment the image crashed, so there is some other problem to deal with.

One thing I did notice after doing this is that the native Micropython image size increased by another 51kB compared to the original 217kB (excluding the heap) so it wouldn't really be that practical to waste this extra space in many cases where you have a large application with lots of functions.

Yeah, it's not really going to be a general solution.
It could be useful to place some local vars into LUT - and even place interrupt code into LUT, but that's getting outside the scope of interpreters

The only time I've used align to control code speed is in interrupt code on the EFM8 series; it's not that important outside the places where cycles matter.
P2 has the ability to use WAIT to lock back to hard real time, should someone require that.
