P1 vs P1E32 vs P1E16 vs P2 cycle counting benchmark comparison!
Bill Henning
Posts: 6,445
Hi guys,
While I obviously prefer the P2 - and make no bones about it - unlike most people, I actually enjoy doing timing analysis.
I thought it would be nice to provide some apples-to-apples timing comparisons!
If you find a flaw in my calculations, I'd appreciate you telling me, as I am a firm believer in the scientific method and properly run experiments.
(and everyone makes mistakes at times)
I evaluated the performance of strlen() on P1, P1E32, P1E16, P2 in cog mode, LMM, LMM with FCACHE, and hubexec.
Please note that other benchmarks will return different results: more hub access will increase the P2's performance margin, while less hub access will make the P1E32 and P1E16 perform better (relatively speaking) in cog mode.
I present the source code, timing analysis, equations etc. at the end of this article.
EXECUTIVE SUMMARY
P1    @ 100MHz cog mode : 20.75us
P1    @ 100MHz LMM      : 98.98us
P1    @ 100MHz FCACHE   : 22.03us
P1E32 @ 100MHz          : 20.64us
P1E32 @ 100MHz LMM      : 98.28us
P1E32 @ 100MHz FCACHE   : 21.91us
P1E32 @ 100MHz HUBEXEC  : 82.56us (assumes no icache)
P1E16 @ 100MHz          : 10.32us
P1E16 @ 100MHz LMM      : 49.14us
P1E16 @ 100MHz FCACHE   : 10.96us
P1E16 @ 100MHz HUBEXEC  : 41.28us (assumes no icache)
P2    @ 100MHz cog mode : 4.17us
P2    @ 100MHz hubexec  : 4.18us

CORRECTION: the P1E32 and P1E16 actually run at 200MHz, with 2 clocks per instruction, thus 100 cog MIPS.
NOTE: I did not bother to update the P2 figures for the four cycle hub access, as it would only improve the numbers by 0.1us... that caching is wonderful
NOTE: all data based on published specifications, P1Exx hubexec performance estimate based on Chip's proposed specification and non-caching hubexec. Conclusions would change if the underlying architectures changed
Conclusions, based on the results:
P1E32 is the same speed as a P1 when code has frequent hub access.
P1E16 is roughly twice as fast as P1 when code has frequent hub access.
P2 is roughly five times as fast as a P1 when code has frequent hub access.
For cog code for strlen() benchmark
P2 cogs are 4.97x faster than P1 at the same clock speed
P2 cogs are 4.95x faster than P1E32 variant at the same "MIPS", 10x at the same clock speed
P2 cogs are 2.47x faster than P1E16 variant at the same "MIPS", 5x at the same clock speed
For LMM code for strlen (vs hubexec)
P2 cogs are 23.68x faster than LMM on P1 at the same clock speed
P2 cogs are 23.68x faster than LMM on P1E32 at the same "MIPS", 47.36x at the same clock speed
P2 cogs are 11.84x faster than LMM on P1E16 at the same "MIPS", 23.68x at the same clock speed
For LMM FCACHE'd strlen() (vs hubexec)
P2 cogs are 5.27x faster than LMM with FCACHE on P1 at the same clock speed
P2 cogs are 5.24x faster than LMM with FCACHE on P1E32 at the same "MIPS", 10.48x faster at the same clock speed
P2 cogs are 2.60x faster than LMM with FCACHE on P1E16 at the same "MIPS", 5.20x faster at the same clock speed
For hubexec code (per David's request)
P2 is 19.75x faster than P1E32 if non-cached hubexec is added to the P1E
P2 is 9.88x faster than P1E16 if non-cached hubexec is added to the P1E
Source Code & Detailed Analysis
Note: LMM calculations are based on an 8-way unrolled LMM loop; FCACHE loading assumes an efficient loader.
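For readers who want a plain-language reference point: the routine being benchmarked is an ordinary byte-scanning strlen(). This C sketch is mine (the name my_strlen is made up for illustration, not from the post), but it captures what each assembly version below does - one memory read per character, and it is that per-byte hub read that dominates the cycle counts.

    #include <stddef.h>

    /* Hypothetical C equivalent of the benchmarked routine: scan bytes
       until the terminating zero and return the count.  Each iteration
       performs one memory (hub) read.                                  */
    size_t my_strlen(const char *s)
    {
        const char *p = s;
        while (*p)                  /* one byte read per iteration            */
            p++;
        return (size_t)(p - s);     /* 127 for the 127-character test string  */
    }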
1) Propeller 2 as currently defined
' call with string pointer in ptra, use locptra @string, returns length in ptra
strlen          reps    #256,#3
                getptra strbase
                rdbytec ch,ptra++ wz
        if_z    subptra strbase
        if_z    ret
                ret

For a 127 character string:
128*3 cycles + 4*5 cycles for cache miss extra cycles + 6 cycles setup/return + 7 cycles max to sync to hub = 417 cycles
20.85us @ 20MHz
10.42us @ 40MHz
5.21us @ 80MHz
4.17us @ 100MHz
2.61us @ 160MHz
The timing would be practically the same with hubexec, maybe add 8 cycles for the initial fetch of code.
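For anyone checking the arithmetic: the microsecond figures above are simply cycles divided by the clock in MHz. A minimal C sketch of that conversion - mine, not part of the original analysis - with the 417-cycle P2 estimate plugged in:

    #include <stdio.h>

    /* Convert a cycle count to microseconds at a given clock in MHz. */
    static double cycles_to_us(double cycles, double mhz)
    {
        return cycles / mhz;
    }

    int main(void)
    {
        /* P2 strlen estimate from above: 128*3 + 4*5 + 6 + 7 = 417 cycles */
        const double p2_cycles = 128*3 + 4*5 + 6 + 7;
        const double clocks_mhz[] = { 20, 40, 80, 100, 160 };

        for (int i = 0; i < 5; i++)
            printf("%6.2fus @ %.0fMHz\n",
                   cycles_to_us(p2_cycles, clocks_mhz[i]), clocks_mhz[i]);
        return 0;   /* reproduces the table above, to within rounding */
    }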
2) Propeller 1
' assume string address is passed in using 'string', returns length in 'string'
strlen          mov     strbase, string
loop            rdbyte  ch, strbase wz
                add     strbase, #1
        if_nz   jmp     #loop
                subr    string, strbase
                ret

For a 127 character string:
128*16 cycles + 12 cycles setup/return + 15 cycles max to sync to hub = 2075 cycles
25.94us @ 80MHz
20.75us @ 100MHz
Using an eight-way unrolled LMM loop, each cog LMM instruction takes 18 clock cycles; counting the rdbyte's hub access, each loop iteration costs 18*4 + 16/3 cycles, so for 128 iterations:
128 * (18*4 + 16/3) = 128 * (72 + 5.33) = 128 * 77.33 = 9898.24 cycles
LMM without FCACHE
123.73us @ 80MHz ***
98.98us @ 100MHz ***
To be fair, using FCACHE it would take approx. 7*16 clock cycles to load the loop, plus another 16 to set up, so 8*16 = 128 cycles above the cog code.
LMM with FCACHE
27.54us @ 80MHz
22.03us @ 100MHz
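To make the assumptions explicit, here is a small C model - my own sketch, not Bill's - that reproduces the LMM and FCACHE estimates from the figures stated above (18 cycles per LMM instruction, an average hub wait of 16/3 cycles per iteration, 2075 cycles of cog code, and a 128-cycle FCACHE load/setup overhead):

    #include <stdio.h>

    int main(void)
    {
        const double lmm_instr  = 18.0;      /* cycles per LMM instruction (8-way unrolled loop) */
        const double hub_wait   = 16.0/3.0;  /* average extra wait for the rdbyte's hub window   */
        const double iterations = 128.0;     /* 127-character string plus the terminating zero   */

        /* Pure LMM: four 18-cycle LMM costs per iteration plus the hub wait. */
        double lmm_cycles = iterations * (4.0*lmm_instr + hub_wait);

        /* FCACHE: run the loop as cog code (2075 cycles, from the cog-mode
           analysis above) plus ~8*16 = 128 cycles to load and set it up.   */
        double fcache_cycles = 2075.0 + 128.0;

        printf("LMM    : %.2fus @ 100MHz\n", lmm_cycles    / 100.0);  /* ~99us (98.98us above, which rounds 16/3 to 5.33) */
        printf("FCACHE : %.2fus @ 100MHz\n", fcache_cycles / 100.0);  /* 22.03us */
        return 0;
    }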
3) P1E 16/32 cog / 512KB version
200MHz hub clock, 2 clocks per instruction, 32 hub clocks to get a hub window (the cog sees it as 16 instruction clocks); the P1E has the same RDxxx timing as the P1.
Exactly the same code as the P1 case above; calculation in clock cycles: 32*128 + 32 = 4128 cycles
20.64us @ 200MHz (100MIPS) with 32 cogs
10.32us @ 200MHz (100MIPS) with 16 cogs
The ratio between cog and LMM will be the same, so we can simply scale the P1 results to get the LMM numbers:
20.64 * 98.8/20.75 = 98.28us @ 200MHz/100MIPS/32 core ***
10.32 * 98.8/20.75 = 49.14us @ 200MHz/100MIPS/16 core ***
The same ratio applies to the FCACHE version:
20.64 * 22.03/20.75 = 21.91us
10.32 * 22.03/20.75 = 10.96us
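The scaling step above is easy to misread, so here it is transcribed literally as a C sketch (mine, not from the post): multiply the P1E cog-mode time by the P1 LMM-to-cog (or FCACHE-to-cog) ratio.

    #include <stdio.h>

    int main(void)
    {
        /* P1 reference times at 100MHz from section 2 (us); 98.8 is the
           rounded LMM figure the post uses in this particular calculation. */
        const double p1_cog = 20.75, p1_lmm = 98.8, p1_fcache = 22.03;

        /* P1E cog-mode times at 200MHz / 100 MIPS, from the cycle counts above (us). */
        const double p1e32_cog = 20.64, p1e16_cog = 10.32;

        /* Assumption: the cog-to-LMM and cog-to-FCACHE ratios carry over unchanged. */
        printf("P1E32 LMM    : %.2fus\n", p1e32_cog * p1_lmm    / p1_cog);  /* ~98.28us */
        printf("P1E16 LMM    : %.2fus\n", p1e16_cog * p1_lmm    / p1_cog);  /* ~49.14us */
        printf("P1E32 FCACHE : %.2fus\n", p1e32_cog * p1_fcache / p1_cog);  /* ~21.91us */
        printf("P1E16 FCACHE : %.2fus\n", p1e16_cog * p1_fcache / p1_cog);  /* ~10.96us */
        return 0;
    }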
Comments
http://forums.parallax.com/showthread.php/155091-P1-vs-P1E32-vs-P1E16-vs-P2-cycle-counting-benchmark-comparison!?p=1256413&viewfull=1#post1256413
Bill has quoted the earlier post. I've put a better, clearer one below.
I don't like making unsupported statements, so I thought I'd take the time to document a usage case that will be very common.
strcpy()/memcpy() would give an even larger performance advantage to the P2, and be the same speed on P1/P1E32
Even I was initially surprised at the lack of performance difference between the P1 and P1E32, but in retrospect, hub access dominates.
The current P2 design kicks ***!
FYI, my best guess is that a P1E32 @ 200MHz (100MIPS/core) would use significantly more power than a P2 @ 100MHz (100MIPS/core), and a P1E16 would use roughly the same amount.
The above data shows that on a per-cog basis, the P1E32 is the same speed as the original P1 when there is a reasonable amount of hub access (i.e. LMM).
There is no magic, just toggling gates using power
Where did you get your PIE hardware for your conclusions?
They are of course estimated conclusions. Still an interesting "thought experiment."
I estimate that "You may be wrong or you may be right." -Billy Joel
Much easier and less work than presenting actual supported arguments, of course.
I did not need hardware, I used Chip's information regarding how many clocks between hub cycles, how many instructions per clock, and "it behaves the same as P1".
All else follows from that.
If Chip changes how it works, I can adjust the numbers.
I may revise what I wrote above. Or I may just leave it...
Here's what I don't get yet:
Producing a many-COG P1E clocked high enough to really make a difference means producing a chip that will use power on the order of watts, or it really doesn't do anything meaningful!
This is due to the basic process physics. No getting around that!
The power profile is more modest, because that P1E is performance limited due to the design being simple, etc...
No matter what, it's going to take more significant power to run anything we produce in this process. Really pushing a P1E makes heat! Really pushing a P2 makes heat!
The current P2 design is much quicker! This means really pushing it can make more heat! More than people feel good about.
But why do we have to do that?
P2 performance at a reasonable clock, with optimizations / gating, etc. to maximize its overall efficiency, is going to deliver more features than a P1E would at a similar power profile, while delivering great performance relative to the P1E - so what is the benefit exactly?
I expressed that poorly above, and I'll leave it, but really the statement in bold is what is bugging me.
Secondly, a P1E cannot be pushed to the peak performance and features possible on the P2! How is that different from just being sane about how we limit a P2?
And finally, for those who want to run it really fast and hot, the P2 can max out the process to a much greater degree than the P1E can, putting a lot on the table for those who can entertain the power / thermal design challenges.
Really, the P1E is safer, simply because it does not perform. A P2, fed the right power voltages, optimized and clocked appropriately can deliver similar performance, while also delivering killer features.
???
Assuming the hub access is only one long wide, I can do that easily; however, that requires, at a minimum, jump instructions with 16-bit addresses. Given those, the source code would be the same as in the P1 cog mode.
Any caching would of course help.
Corrections:
Initial numbers were wrong - bad calculation, plus I forgot the actual rdbyte! Here are the corrected numbers:
P1E32 would roughly be (128*4*32 + 3*32 + 32)/200 = 82.56us (100MIPS / 200MHz)
P1E16 would roughly be (128*4*16 + 3*16 + 16)/200 = 41.28us (100MIPS / 200MHz)
And you are welcome
I've updated the message, and here are the numbers again:
P1E32 would roughly be (128*4*32 + 3*32 + 32)/200 = 82.56us (100MIPS / 200MHz)
P1E16 would roughly be (128*4*16 + 3*16 + 16)/200 = 41.28us (100MIPS / 200MHz)
P2 @ 100MHz would be 4.18us, 10x faster at the same MIPS (20x at the same MHz)
Basically, the mistake was that I counted 3 hub cycles per loop, not including the hub cycle for fetching the byte. I think I may have made the same error in post #1; I will check/fix that now.
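For anyone who wants to re-run the corrected estimate, here is the formula as a tiny C function (my own sketch; the helper name p1e_hubexec_us is made up). It assumes, per the correction above, 4 hub accesses per loop iteration - 3 instruction fetches plus the rdbyte - each costing a full hub window of N clocks on an N-cog P1E at 200MHz:

    #include <stdio.h>

    /* Estimated strlen() time (us) for non-cached hubexec on an N-cog P1E:
       128 iterations * 4 hub accesses * N clocks per hub window, plus
       setup/return, at a 200MHz clock.                                    */
    static double p1e_hubexec_us(int n_cogs)
    {
        return (128.0*4*n_cogs + 3*n_cogs + n_cogs) / 200.0;
    }

    int main(void)
    {
        printf("P1E32: %.2fus\n", p1e_hubexec_us(32));  /* 82.56us */
        printf("P1E16: %.2fus\n", p1e_hubexec_us(16));  /* 41.28us */
        return 0;
    }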
!!!
Shouldn't we really be thinking about the idea JMG had to provide for per COG clocking then? That could make a very significant impact in the power profile, keep timing resolution, and still perform very nicely indeed!
Running big programs at high speeds is probably the single biggest Propeller exception there is.
Assuming the same clock speed for all tests, and P1 as the reference, both P1E variants should be twice as fast for code executing in a cog with no hub access, since they need 2 clocks per instruction versus the P1's 4 clocks. The P2 cog would be four times faster due to its single clock cycle per instruction.
The 2x difference in speed between the P1E16 and P1E32 clearly shows that frequent hub access would be the major bottleneck as the number of cogs increases. Increasing the number of cogs without increasing hub bandwidth reaches a point of diminishing returns rapidly, and I suspect 32 cogs is past the sweet spot for most applications.
A Propeller with 16 cogs, 256K hub RAM, 64 I/O pins, and single-cycle hub access would give a ratio of 4 pins, 16K hub RAM, and 5MHz of hub bandwidth per cog with an 80MHz clock. If there was extra space on the die, 512K of hub RAM would be nice.
A better comparison when getting to this level is the operation energy, but for that we need P1E power simulations.
Note that a P1E is unlikely to operate on an all-cogs-equal basis for HUB access, but that is a design-craft decision, and out of the box the simplest user mapping could be all-user-reserved-hubs-get-equal-access.
It's good to make estimates. Later when you have actual measurements they will be good for comparisons.
I know you hate to be challenged. That's what makes it so much fun.
'-)
I fixed the first post - I'd forgotten to add the cycles used by the rdbyte in the initial calculations for P1 and P1E... it is fixed now.
P2 wins... HUGE... with hubexec, even compared to a simple hubexec on P1E!
P2@100MHz is ~10x faster at hubexec than P1E16@200MHz
P2@100MHz is ~20x faster at hubexec than P1E32@200MHz
Checkmate.
I hate to be challenged by people without any data or technical arguments, just hand-waving.
You are VERY good at that!
:-)
I totally agree, the results were predictable. Unfortunately a lot of people don't believe in performing calculations, so I thought I'd do it for them, to make the differences obvious.
The hub latency impact is made very obvious above, and the benefits of even the small caching the P2 does are rather dramatic.
Make the P2 as it is. If the thing boils over, disable a few COGs.
It could even be arranged that if COGs are disabled, the HUB round-robin cycle is shortened accordingly. Bingo, we have even more performance per COG as a result :)
Sounds like we are still winning big time.
My suspicion - and it is only an educated guess - is that running the 32 P1E32 cogs at 200MHz will actually end up toggling more transistors than running 8 P2 cogs at 100MHz.
32*200 = 6400 p1e-cogmhz
8*100 = 800 p2-cogmhz
The P2 could toggle 8x the transistors of the P1E32 and use roughly the same amount of power.
Yes. I think that should always be the first choice until we're told it's "impossible" - you know how the community and Chip deal with impossible!
JMG has repeatedly expressed good ideas here to save power and improve timing. Frankly, I read those wrong, and I apologize for some earlier comments, JMG.
Now that we are here and I have read about these things, I see how they are necessary at this process size.
Chip said gating the COGs can be done easily, and if we offer simple steps for clock speeds that do not make a mess of the HUB, mooch starts to make a TON of sense, people!
Want that fast, big program? Run it on a high-speed COG; run your simple video at 40MHz (and I'm not kidding - basic video needs work fine at that clock and even lower); run a moderate-speed COG or a few for peripherals, and another fast one for SDRAM, whatever.
I think we need a 16 entry hub slot assignment table for cogs.
We have 8 cogs.
Let's make the hub wheel have 16 spokes.
The table is initialized as follows:
spoke 0: 0
spoke 1: 1
spoke 2: 2
spoke 3: 3
spoke 4: 4
spoke 5: 5
spoke 6: 6
spoke 7: 7
spoke 8: 0
spoke 9: 1
spoke A: 2
spoke B: 3
spoke C: 4
spoke D: 5
spoke E: 6
spoke F: 7
jmg suggested that we can flag a slot as unassigned, as a power-saving measure. I am not sure how that would help save power; I'd rather drop a cog's multiplier, or cogstop it (which presumably freezes its power usage).
This of course allows totally deterministic allocation of hub slots, all the way down to 1/16 - or heck, 1/32 if the table has 32 entries.
Totally deterministic hub timing for cogs, as needed.
Add "mooch", for hubexec cogs, and we have P2 nirvana
Leakage power is not anywhere near the consideration active power is at this process size, correct?
We also have the ability to load COGS, start them, stop them, and start them in HUBEXEC, and start them at any address without loading. This means we can burst with COGS without even thinking too hard about it. In terms of power, this is a LOT of flexibility. IMHO, a more optimal solution than a lot of little P1 COGS would be.
We know mooch is going to make a COG run fast when we want it to. If this hub scheme improves power utilization options, I'm all for it. Again, that's primary right now.
Thanks.
You can ignore this if you want, because I think I might be the only here that really wants to know.
You have not computed the numbers for a 4 Cog P2 in hardware running at 160 MHz with Cogs that seem to run on nitrous… at 40MHz.
I am also interested and cannot fathom the energy scaling… it seems to me that a 4 Cog P2 in hardware could be scaled with a much tighter grain and to a far lower end point.
It seems that for some serious markets, a single design that can be broadly applied to the widest range of application has benefits.
How does an FPGA P2 running at 80 MHz compare with the fastest p1 variant running in hardware at 100MHz?
I think a lot of the guys are supporting what they view as being in Parallax's best interests. And a lot of them have been around for a very long time and care deeply about the issue. It is a very delicate issue, but one that I think can tolerate more consideration.
Rich
The numbers given were per cog.
Energy-wise, my best guess - and this is only a guess - is that the P1E32 takes more power than the P1x8, and the P1E16 roughly the same. But that is only my guess. I was calculating compute performance for strlen().
P2@80MHz ~5.21us (cog or hubexec mode)
P1@100MHz ~20.75us (cog mode; LMM with FCACHE only about 1us slower, P2 ~4x faster)
P1@100MHz ~98.98us (pure LMM, P2 about 19x faster)
I hate rejection.
Nice job.
But strmove and strcopy are hardly normal instructions so may heavily skew results. As you know, numbers can always be made to fit a desired scenario.
A P2 cog at the same clock will currently use ~15x the power of the P1 cog. Chip can and will reduce the P2 cog power usage by clock gating, but it still will be a power hog compared to a P1 cog.
There are some ideas that can help both P1E and/or P2...
* Slot allocation. I know you and I both agree (not necessarily on implementation, but it can be fine-tuned).
This would give extra time to the critical cogs, and to the large general cog.
* Cogs not being equal. (I don't recall your stance on this)
Real life means at least one cog needs to be different from the rest. It needs to be...
- the main controlling cog
- capable of large memory (hubexec really helps over LMM)
- non-deterministic
- capable of lapping up unused slots
The remaining cogs need to be...
- lean, mean for driver performance
- deterministic (default)
- lots of them
- low power, relatively fast
- configurable fixed slot allocation (by Cog #0)
- hw assist - Video/Counters for P1; Video/Counters/UART/USB/SERDES for P2
- lots of them removes the need for multi-tasking and multi-threading
Everyone wants more hub RAM. The P1E can give us at least 512KB. Maybe with the above refinements, the P2 can get 512KB too?
Therefore, I suggest Cog #0 be different..
- The large controlling cog, hubexec, culled for large programs without being lumbered with the extras such as video etc. Perhaps just raw I/O.
All other cogs be lean and mean...
- Culled for lean driver performance
Perhaps with these changes we can get to a power-efficient P2 design in 180nm ???