
P1 vs P1E32 vs P1E16 vs P2 cycle counting benchmark comparison!


Comments

  • jmg Posts: 15,144
    edited 2014-04-05 15:24

    jmg suggested that we can flag a slot as unassigned, as a power-saving measure. I am not sure how that would help save power; I'd rather drop a cog's multiplier, or COGSTOP it (which presumably freezes its power usage)

    COGSTOP of course works, but you have just lost a COG - so it is a rather blunt instrument.
    COG dividers will work, but they need a little care in their interactions with COG slot assignments, and it all gets harder to understand.

    There is only one COG slot assignment block, and it is a small RAM, so it can be made more comprehensive. My expanded idea is to flesh it out thus:

    * Expand the table to have both Power and HUB Alloc fields (8b + 3b), and bump it to 32 or 64 entries
    * Add a Wrap counter (or equivalent), to allow good control of fewer than 8 COGs


    This is a tiny RAM, just 44 or 88 bytes (or 48/96 bytes, if the design pops the WRAP control into a RAM bit column).

    There is now a single array that manages both Power Envelope control and Hub BW.
    Users choose, by simple mapping, which COGs get what share of the Power Envelope and Hub BW.

    One COG could be given 100% Power and 50% Hub BW, or 50% Power and 50% Hub BW, and the others could
    go as small as 1/32 or 1/64 of the power envelope (~3% or ~1.5% quanta on both Hub BW and Power Envelope)

    This is all 100% deterministic, with no surprises, and controlled from a single place. It is easy for software to report on, and to generate the table.

    This would work for both P2 and P1E, and it would be slightly larger on P1E.

    Addit: I think this could even widen to work at the COG-Task level, in a 4-COG design.
    It is a similar time-scan mapping to what tasks use now.
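
    For concreteness, a rough C model of the table scan proposed above (field names, widths, and the scan policy are illustrative guesses at the idea, not anything from a real design):

        /* Toy model of the proposed combined Power/Hub slot table. */
        #include <stdint.h>
        #include <stdio.h>

        #define SLOT_ENTRIES 32              /* proposal: 32 or 64 entries */

        typedef struct {
            uint8_t power;                   /* 8-bit power-envelope share */
            uint8_t hub_cog;                 /* 3-bit field: cog owning this hub slot */
        } slot_entry_t;

        typedef struct {
            slot_entry_t table[SLOT_ENTRIES];
            uint8_t wrap;                    /* wrap counter: a scan shorter than
                                                SLOT_ENTRIES gives good control of
                                                fewer than 8 COGs */
            uint8_t index;                   /* current scan position */
        } slot_map_t;

        /* Advance one hub window; return the cog that gets it. */
        static uint8_t next_hub_owner(slot_map_t *m)
        {
            uint8_t cog = m->table[m->index].hub_cog & 0x7;
            if (++m->index >= m->wrap)
                m->index = 0;                /* deterministic, period == wrap */
            return cog;
        }

        int main(void)
        {
            slot_map_t m = { .wrap = 4 };    /* 4-slot scan: 2 cogs, 50/50 */
            m.table[0].hub_cog = 0; m.table[1].hub_cog = 1;
            m.table[2].hub_cog = 0; m.table[3].hub_cog = 1;
            for (int i = 0; i < 8; i++)
                printf("slot %d -> cog %u\n", i, next_hub_owner(&m));
            return 0;
        }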
  • RossH Posts: 5,345
    edited 2014-04-05 17:12
    Hi Bill,

    Interesting. So for string-type operations the P16X32B would be twice as fast as the P1. But of course string-type operations are a minuscule percentage of most programs, and for pure cog operations the P16X32B would be five times faster, and for most LMM applications it would be somewhere between 2 and 4 times faster (depending on caching).

    So, overall, we could expect the P16X32B to be pretty close to 4 times faster than the P1? With 256KB RAM and 64 I/O pins?

    I'd buy that for a dollar! :lol:

    Ross.
  • Bill Henning Posts: 6,445
    edited 2014-04-05 17:21
    Thanks Ray.

    strlen() is typical of any sequence of code where you have two cog operations and one hub operation. Heavier hub use would tilt towards the P2.

    It is also an excellent example of LMM vs Hubexec, as every instruction has a minimum of one hub access in LMM or hubexec, and two hub accesses roughly 1/3rd of the time.

    Pure cog code would show less of an advantage for the P2, but it is much more rare.
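
    For a concrete picture of that two-cog-ops-per-hub-op mix, here is a plain-C strlen-style loop with the per-iteration operations annotated (purely illustrative, not the benchmark's actual code; the helper name is hypothetical, chosen to avoid the libc symbol):

        #include <stddef.h>

        /* Per iteration: one hub access (the byte load) plus two
         * cog-only operations (the test/branch and the increment). */
        size_t str_len(const char *s)   /* hypothetical name */
        {
            size_t n = 0;
            while (s[n] != '\0')        /* s[n]: hub read; the test: cog op */
                n++;                    /* increment: cog op */
            return n;
        }
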
    Cluso99 wrote: »
    Bill,
    Nice job.

    But strmove and strcopy are hardly normal instructions, so they may heavily skew results. As you know, numbers can always be made to fit a desired scenario.

    A P2 cog at the same clock will currently use ~15x the power of the P1 cog. Chip can and will reduce the P2 cog power usage by clock gating, but it still will be a power hog compared to a P1 cog.

    There are some ideas that can help both P1E and/or P2...
    * Slot allocation. I know you and I both agree (not necessarily on implementation, but it can be fine-tuned).
    This would give extra time to the critical cogs, and to the large general cog.
    * Cogs not being equal. (I don't recall your stance on this)

    I have no problem whatsoever with unequal cogs, as long as we only have two types :)
    Cluso99 wrote: »
    Real life means at least one cog needs to be different from the rest. It needs to be...
    - the main controlling cog
    - capable of large memory (hubexec really helps over LMM)
    - non-deterministic
    - capable of lapping up unused slots
    The remaining cogs need to be...
    - lean, mean for driver performance
    - deterministic (default)
    - lots of them
    - low power, relatively fast
    - configurable fixed slot allocation (by Cog #0)
    - hw assist - Video/Counters for P1; Video/Counters/UART/USB/SERDES for P2
    - lots of them removes the need for multi-tasking and multi-threading
    Everyone wants more hub RAM. P1E can give us at least 512KB. Maybe with these above refinements, the P2 can get 512KB too?

    Therefore, I suggest Cog #0 be different..
    - The large controlling cog, hubexec, culled for large programs without being lumbered with the extras such as video etc. Perhaps just raw I/O.
    All other cogs be lean and mean...
    - Culled for lean driver performance

    Perhaps with these changes we can get to a power efficient P2 design in 180nm ???

    I suggest we need at an absolute minimum TWO "super cogs": one for high-resolution video, one for C code. Four would be much better.

    I am perfectly happy with tasks; they can be deterministic. Map the cog memory as 128 longs per task, give a hub access slot to each task within a cog round robin, and presto: 100% deterministic baby cogs with 16-cycle hub access (assuming a total of 4 cogs).

    That maps a 4-cog P2 into two super cogs and eight baby cogs. Likely the video cog will have two spare tasks; mind you, not as deterministic.
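
    A toy enumeration of that mapping (my own illustration of the arithmetic, assuming 4 cogs x 4 tasks and, simplifying the clocking, one hub window per tick rotating across the cogs):

        #include <stdio.h>

        /* Each cog hands its hub window to its tasks round robin, so
         * every (cog, task) pair gets a window once per 16 ticks. */
        int main(void)
        {
            for (int tick = 0; tick < 16; tick++) {
                int cog  = tick % 4;        /* window rotates over the cogs */
                int task = (tick / 4) % 4;  /* each cog cycles its own tasks */
                printf("tick %2d -> cog %d, task %d\n", tick, cog, task);
            }
            return 0;
        }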

    Having said that, I'd also be OK with two P2 and say 8 P1 cogs (two cycles per instruction), or even more P1 cogs (four cycles per instruction).

    I am dead set against not having at least two P2 cogs, as it rules out high rez video, and also rules out great compiled code performance.

    I can be convinced of a mixed approach, and I actually believe the 4 P2 cogs can be configured as I stated - giving eight very capable baby cogs and two super cogs.
  • msrobots Posts: 3,704
    edited 2014-04-05 17:41
    RossH wrote: »
    So, overall, we could expect the P16X32B to be pretty close to 4 times faster than the P1? With 256KB RAM and 64 I/O pins?

    I'd buy that for a dollar! :lol:

    Ross.

    Let's give Parallax at least a slight chance of revenue and pay $10 for that ...

    Enjoy!

    Mike
  • Bill Henning Posts: 6,445
    edited 2014-04-05 17:42
    Ross,

    I did not expect a straw man argument from you.

    You know as well as I do that the example I chose, two cog ops and one hub op, is pretty typical.

    So yes, a P16E32 @ 200MHz (100MIPS) would be twice as fast as the P1 @ 100MHz (25MIPS) for any code walking the hub, computing lengths, checksums, summing a vector, etc. VERY common. And the P2 would be 5x as fast (2.5x as fast as the P16E32).

    For cog-only code, that does not touch the hub, a P16E32 @ 200MHz cog would be four times as fast as a P1 @ 100MHz cog. Not five times. Also, most programs DO use the hub somewhere between one in three and one in six instructions. A P2 cog in this case would be 4x-8x++ faster due to pointer etc. instructions, and much faster if MUL/DIV/CORDIC/MAC are needed.

    P16E32 @ 200MHz would run LMM code at exactly twice the speed of a P1 @ 100MHz, NOT somewhere between two and four times. It is hub cycle access bound, so twice is the max (fcached code that did zero hub access might be four times as fast, or as little as twice as fast if it uses a lot of hub access).

    I'd say we could expect a P16E32 @ 200MHz cog, on average, to be roughly 3 times faster (blended cog/hub access) than a 100MHz P1 cog; a quick sanity check of that blend follows the figures below.

    Best case: 4x (cog only)
    Worst case: 2x (hub bound)
    Expected average: 3x (so I agree about LMM speed, fixed)

    LMM case: 2x (hub bound)
    Hub bandwidth per cog: 25MB/sec
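
    That sanity check, using a simple time-weighted model (my own assumption about how to combine the figures, not Bill's stated method): hub-bound work gains 2x, cog-only work gains 4x, and a 1-in-3 hub-op mix lands exactly on 3x:

        #include <stdio.h>

        /* hub_fraction: share of baseline (P1) time spent hub-bound,
         * assuming roughly equal per-op cost on the P1. */
        static double blended_speedup(double hub_fraction)
        {
            /* total time = sum of the two scaled parts */
            return 1.0 / (hub_fraction / 2.0 + (1.0 - hub_fraction) / 4.0);
        }

        int main(void)
        {
            printf("all hub-bound : %.1fx\n", blended_speedup(1.0));       /* 2.0x */
            printf("1-in-3 hub ops: %.1fx\n", blended_speedup(1.0 / 3.0)); /* 3.0x */
            printf("cog-only      : %.1fx\n", blended_speedup(0.0));       /* 4.0x */
            return 0;
        }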

    Now let's compare that P1 cog @ 100MHz to a P2 cog @ 100MHz (4-cog version)

    Cog only exec: 2x - 4x (cog only) using P2 instructions (indexing etc) not available on P1
    Mixed cog/hub: 5x (strcpy)

    And now, P2 cog destroys P1 (and P16E32)
    LMM vs hubexec: 23.68x faster
    Hub bandwidth: 800MB/sec (32x hub bandwidth)


    Can you say it does not matter that hubexec will run ~24 times as fast as LMM, and P2 cogs get 32X the hub bandwidth?

    P16E32, average 3x faster than LMM on P1
    P2x4, average 24x faster than LMM on P1

    ==> P2 hubexec is 8x faster than P16E32 LMM (and 24x P1 LMM)

    Checkmate.
    RossH wrote: »
    Hi Bill,

    Interesting. So for string-type operations the P16X32B would be twice as fast as the P1. But of course string-type operations are a minuscule percentage of most programs, and for pure cog operations the P16X32B would be five times faster, and for most LMM applications it would be somewhere between 2 and 4 times faster (depending on caching).

    So, overall, we could expect the P16X32B to be pretty close to 4 times faster than the P1? With 256KB RAM and 64 I/O pins?

    I'd buy that for a dollar! :lol:

    Ross.
  • RossH Posts: 5,345
    edited 2014-04-05 17:54
    Bill Henning wrote: »
    P16E32 @ 200MHz would run LMM code at exactly twice the speed of a P1 @ 100MHz, NOT somewhere between two and four times. It is hub cycle access bound, so twice is the max (fcached code that did zero hub access might be four times as fast, or as little as twice as fast if it uses a lot of hub access).

    You are forgetting that cog execution is five times faster. Which means that executing LMM "primitives" will be five times faster, and executing cached code will be five times faster. The overall speed-up for LMM programs, even with very limited caching, will be somewhere between two and five times. Exactly where in there is difficult to guess without benchmarking, but it will be much more than two times - likely closer to four.

    Also, you are forgetting CMM mode - one hub read fetches two instructions. Those instructions can be decoded and executed five times faster. It could even be that for CMM programs, it would be possible to decode and execute these two instructions and still meet the next hub slot (which is not possible on the P1). This would speed up CMM execution much more dramatically than LMM execution.

    Ross.
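
    The fetch-two-decode-two idea RossH describes, as a toy C sketch (Catalina's real CMM encoding is more involved; the packing, the stubs, and the function names here are invented purely for illustration):

        #include <stdint.h>
        #include <stdio.h>

        static uint32_t hub_read(uint32_t addr) { (void)addr; return 0x12340056; } /* stub */
        static void     execute(uint16_t insn)  { printf("exec %04x\n", insn); }   /* stub */

        /* One hub read feeds two decodes; the decodes run at cog speed,
         * which is where the faster P16E32 cog would pay off. */
        static void cmm_step(uint32_t pc)
        {
            uint32_t packed = hub_read(pc);        /* one hub slot...            */
            execute((uint16_t)(packed & 0xFFFF));  /* ...two 16-bit instructions */
            execute((uint16_t)(packed >> 16));
        }

        int main(void) { cmm_step(0); return 0; }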
  • RossH Posts: 5,345
    edited 2014-04-05 18:01
    Bill Henning wrote: »
    P16E32, average 3x faster than LMM on P1
    P2x4, average 24x faster than LMM on P1

    ==> P2 hubexec is 8x faster than P16E32 LMM (and 24x P1 LMM)

    Checkmate.


    What's your point? That the P2 will be faster than the P16X32B? Of course it will!

    And the P16X32B will be faster than the P1. And by considerably more than your limited calculations for "strlen" show.

    For me, your figures demonstrate that the P16X32B will provide a very nice speed increment over the P1. We already knew that would be the case, but it is nice to see it confirmed.

    Ross.
  • Bill Henning Posts: 6,445
    edited 2014-04-05 18:18
    Cog only:

    P1: 100MHz, four cycles per instruction, 25MIPS @ 100MHz
    P16E32: 200MHz, two cycles per instruction, 100MIPS @ 200MHz
    P2: 100MHz, one cycle per instruction, 100MIPS @ 100MHz

    I think I just figured out where you get 5x.

    You are comparing 80MHz P1. I was deliberately comparing at 100MHz, which is 4x.

    I deliberately used 100MHz, for more apples to apples.

    Yes, 100MHz P16E32 will be 5x faster than 80MHz P1 for cog only code, and 25% faster for LMM.

    Sorry, I was confused, my analysis in this thread was based on 100MHz P1, and your rebuttal was based on 80MHz P1.

    I agree with your figures, now that I figured out you meant 80MHz P1. Sorry, you changed MHz on me!

    CMM loses more to decoding than it gains from the wider fetch. It cannot meet the next slot; take a peek at the source code. CMM will be faster on the P16E32 by a slightly larger factor, but definitely not 5x.

    Good discussion, I wish you had not changed to 80MHz P1 from my stated 100MHz, then there would not have been any confusion.

    Thank you for reminding me about primitives! That means hubexec will be more like 16x-20x faster than P16E32 LMM (instead of 12x) when not using FCACHE.

    RossH wrote: »
    You are forgetting that cog execution is five times faster. Which means that executing LMM "primitives" will be five times faster, and executing cached code will be five times faster. The overall speed-up for LMM programs, even with very limited caching, will be somewhere between two and five times. Exactly where in there is difficult to guess without benchmarking, but it will be much more than two times - likely closer to four.

    Also, you are forgetting CMM mode - one hub read fetches two instructions. Those instructions can be decoded and executed five times faster. It could even be that for CMM programs, it would be possible to decode and execute these two instructions and still meet the next hub slot (which is not possible on the P1). This would speed up CMM execution much more dramatically than LMM execution.

    Ross.
  • Bill Henning Posts: 6,445
    edited 2014-04-05 18:23
    My point is that a P2 cog will run compiled C code 8x-12x faster than a P16E32 cog at the same MIPS rating.

    Even with heavy FCACHE use, a P2 cog will still be at least 4x faster than LMM on a P16E32, due to not needing the primitives.

    (thanks for reminding me of LMM primitives not needed by hubexec)

    People who use compiled code will care a LOT about that 4x - 12x performance difference.
    RossH wrote: »
    What's your point? That the P2 will be faster than the P16X32B? Of course it will!

    And the P16X32B will be faster than the P1. And by considerably more than your limited calculations for "strlen" show.

    For me, your figures demonstrate that the P16X32B will provide a very nice speed increment over the P1. We already knew that would be the case, but it is nice to see it confirmed.

    Ross.
  • AntoineDoinel Posts: 312
    edited 2014-04-05 18:30
    To be frank, the first thing that meets my eye looking at those numbers is that FCACHE appears to be nearly optimal.

    Is that really the case when the cache is continuously reloaded? (i.e. lots of strlen calls on small strings, intermixed with another fcached function)
  • Bill Henning Posts: 6,445
    edited 2014-04-05 18:38
    When I came up with LMM, I included FCACHE and FLIB, as with proper use (assuming the code fragment fits) FCACHE can approach cog-only speed very closely.

    Catalina does not use a large FCACHE buffer, so it does not benefit as much.

    Prop GCC has a larger FCACHE buffer, and has FLIB (renamed "kernel extensions"); however, both C compilers have problems generating optimal FCACHE usage.

    On the P2, there is a four-line, 8-long-per-line instruction cache with automatic prefetch, and a single 8-long data cache line. When it hits its stride, code execution will be 1 clock per hub instruction (no need for FCACHE or helper functions), and the single dcache line helps a LOT for data access. So yes, it can closely approach cog code performance for code, and the data cache helps hub data access quite a bit too.
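
    A rough software model of that instruction cache (direct-mapped here for simplicity; the real replacement and prefetch logic may well differ, and all names are mine):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define ICACHE_LINES   4
        #define LONGS_PER_LINE 8
        #define LINE_BYTES     (LONGS_PER_LINE * 4)

        typedef struct {
            uint32_t tag[ICACHE_LINES];   /* hub address of each line's start */
            bool     valid[ICACHE_LINES];
        } icache_t;

        /* addr is a long-aligned hub address; true means a same-clock hit */
        static bool icache_hit(icache_t *c, uint32_t addr)
        {
            uint32_t line = addr & ~(uint32_t)(LINE_BYTES - 1);
            unsigned way  = (line / LINE_BYTES) % ICACHE_LINES;
            if (c->valid[way] && c->tag[way] == line)
                return true;              /* sequential code mostly lands here */
            c->tag[way]   = line;         /* miss: (re)load the line from hub */
            c->valid[way] = true;
            return false;
        }

        int main(void)
        {
            icache_t ic = { {0}, {false} };
            /* straight-line code: the first long of each line misses,
             * the next seven hit */
            for (uint32_t a = 0; a < 64; a += 4)
                printf("addr %2u: %s\n", (unsigned)a, icache_hit(&ic, a) ? "hit" : "miss");
            return 0;
        }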

    The proposed P16E32 @ 200MHz would need to use an LMM interpreter, which would be twice as fast as the one on a P1 @ 100MHz. Due to the faster cog code execution, even without FCACHE you could reasonably expect another 10%-20% boost for the P16E32, and FCACHE would help both.

    The P2 @ 100MHz, due to hubexec, can be expected to be 8x-15x the speed of a P16E32 @ 200MHz running LMM.

    If the compiler technology was much better, the difference could drop to 4x - 8x.... which is still a huge margin in favor of P2.
    AntoineDoinel wrote: »
    To be frank, the first thing that meets my eye looking at those numbers is that FCACHE appears to be nearly optimal.

    Is that really the case when the cache is continuously reloaded? (i.e. lots of strlen calls on small strings, intermixed with another fcached function)