P1 vs P1E32 vs P1E16 vs P2 cycle counting benchmark comparison!
Bill Henning
Posts: 6,445
Hi guys,
While I obviously prefer the P2 - and make no bones about it - unlike most people, I actually enjoy doing timing analysis.
I thought it would be nice to provide some apples-to-apples timing comparisons!
If you find a flaw in my calculations, I'd appreciate you telling me, as I am a firm believer in the scientific method and properly run experiments.
(and everyone makes mistakes at times)
I evaluated the performance of strlen() on P1, P1E32, P1E16, P2 in cog mode, LMM, LMM with FCACHE, and hubexec.
Please note that other benchmarks will return different results: more hub access will increase the P2's performance margin, while less hub access will make the P1E32 and P1E16 perform better (relatively speaking) in cog mode.
I present the source code, timing analysis, equations etc. at the end of this article.
EXECUTIVE SUMMARY
P1    @ 100MHz cog mode : 20.75us
P1    @ 100MHz LMM      : 98.98us
P1    @ 100MHz FCACHE   : 22.03us
P1E32 @ 100MHz          : 20.64us
P1E32 @ 100MHz LMM      : 98.28us
P1E32 @ 100MHz FCACHE   : 21.91us
P1E32 @ 100MHz HUBEXEC  : 82.56us (assumes no icache)
P1E16 @ 100MHz          : 10.32us
P1E16 @ 100MHz LMM      : 49.14us
P1E16 @ 100MHz FCACHE   : 10.96us
P1E16 @ 100MHz HUBEXEC  : 41.28us (assumes no icache)
P2    @ 100MHz cog mode : 4.17us
P2    @ 100MHz hubexec  : 4.18us

CORRECTION: the P1E32 and P1E16 actually run at 200MHz, with 2 clocks per instruction, thus 100 cog MIPS.
NOTE: I did not bother to update the P2 figures for the four cycle hub access, as it would only improve the numbers by 0.1us... that caching is wonderful
NOTE: all data based on published specifications, P1Exx hubexec performance estimate based on Chip's proposed specification and non-caching hubexec. Conclusions would change if the underlying architectures changed
Conclusions, based on the results:
P1E32 is the same speed as a P1 when code has frequent hub access.
P1E16 is roughly twice as fast as P1 when code has frequent hub access.
P2 is roughly five times as fast as a P1 when code has frequent hub access.
For cog code for strlen() benchmark
P2 cogs are 4.97x faster than P1 at the same clock speed
P2 cogs are 4.95x faster than P1E32 variant at the same "MIPS", 10x at the same clock speed
P2 cogs are 2.47x faster than P1E16 variant at the same "MIPS", 5x at the same clock speed
For LMM code for strlen (vs hubexec)
P2 cogs are 23.68x faster than LMM on P1 at the same clock speed
P2 cogs are 23.68x faster than LMM on P1E32 at the same "MIPS", 47.36x at the same clock speed
P2 cogs are 11.84x faster than LMM on P1E16 at the same "MIPS", 23.68x at the same clock speed
For LMM FCACHE'd strlen() (vs hubexec)
P2 cogs are 5.27x faster than LMM with FCACHE on P1 at the same clock speed
P2 cogs are 5.24x faster than LMM with FCACHE on P1E32 at the same "MIPS", 10.48x faster at the same clock speed
P2 cogs are 2.60x faster than LMM with FCACHE on P1E16 at the same "MIPS", 5.20x faster at the same clock speed
For hubexec code (per David's request)
P2 is 19.75x faster than P1E32 if non-cached hubexec is added to the P1E
P2 is 9.88x faster than P1E16 if non-cached hubexec is added to the P1E
Source Code & Detailed Analysis
Note: LMM calculations are based on an 8-way unrolled LMM loop; FCACHE loading assumes an efficient loader.
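For readers who want a plain-language reference point: the routine being benchmarked is an ordinary byte-scanning strlen(). This C sketch is mine (the name my_strlen is made up for illustration, not from the post), but it captures what each assembly version below does - one memory read per character, and it is that per-byte hub read that dominates the cycle counts.

    #include <stddef.h>

    /* Hypothetical C equivalent of the benchmarked routine: scan bytes
       until the terminating zero and return the count.  Each iteration
       performs one memory (hub) read.                                  */
    size_t my_strlen(const char *s)
    {
        const char *p = s;
        while (*p)                  /* one byte read per iteration            */
            p++;
        return (size_t)(p - s);     /* 127 for the 127-character test string  */
    }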
1) Propeller 2 as currently defined
' call with string pointer in ptra, use locptra @string, returns length in ptra
strlen          reps    #256,#3
                getptra strbase
                rdbytec ch,ptra++ wz
        if_z    subptra strbase
        if_z    ret
                ret

For a 127 character string:
128*3 cycles + 4*5 cycles for cache miss extra cycles + 6 cycles setup/return + 7 cycles max to sync to hub = 417 cycles
20.85us @ 20MHz
10.42us @ 40MHz
5.21us @ 80MHz
4.17us @ 100MHz
2.61us @ 160MHz
The timing would be practically the same with hubexec, maybe add 8 cycles for the initial fetch of code.
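For anyone checking the arithmetic: the microsecond figures above are simply cycles divided by the clock in MHz. A minimal C sketch of that conversion - mine, not part of the original analysis - with the 417-cycle P2 estimate plugged in:

    #include <stdio.h>

    /* Convert a cycle count to microseconds at a given clock in MHz. */
    static double cycles_to_us(double cycles, double mhz)
    {
        return cycles / mhz;
    }

    int main(void)
    {
        /* P2 strlen estimate from above: 128*3 + 4*5 + 6 + 7 = 417 cycles */
        const double p2_cycles = 128*3 + 4*5 + 6 + 7;
        const double clocks_mhz[] = { 20, 40, 80, 100, 160 };

        for (int i = 0; i < 5; i++)
            printf("%6.2fus @ %.0fMHz\n",
                   cycles_to_us(p2_cycles, clocks_mhz[i]), clocks_mhz[i]);
        return 0;   /* reproduces the table above, to within rounding */
    }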
2) Propeller 1
' assume string address is passed in using 'string', returns length in 'string'
strlen          mov     strbase, string
loop            rdbyte  ch, strbase wz
                add     strbase, #1
        if_nz   jmp     #loop
                subr    string, strbase
                ret

For a 127 character string:
128*16 cycles + 12 cycles setup/return + 15 cycles max to sync to hub = 2075 cycles
25.94us @ 80MHz
20.75us @ 100MHz
Using an eight-way unrolled LMM loop, each cog LMM instruction takes 18 clock cycles; counting the rdbyte's hub access, each loop iteration costs 18*4 + 16/3 cycles, so for 128 iterations:
128 * (18*4 + 16/3) = 128 * (72 + 5.33) = 128 * 77.33 = 9898.24 cycles
LMM without FCACHE
123.73us @ 80MHz ***
98.98us @ 100MHz ***
To be fair, using FCACHE it would take approx. 7*16 clock cycles to load the loop, plus another 16 to set up, so 8*16 = 128 cycles above the cog code.
LMM with FCACHE
27.54us @ 80MHz
22.03us @ 100MHz
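To make the assumptions explicit, here is a small C model - my own sketch, not Bill's - that reproduces the LMM and FCACHE estimates from the figures stated above (18 cycles per LMM instruction, an average hub wait of 16/3 cycles per iteration, 2075 cycles of cog code, and a 128-cycle FCACHE load/setup overhead):

    #include <stdio.h>

    int main(void)
    {
        const double lmm_instr  = 18.0;      /* cycles per LMM instruction (8-way unrolled loop) */
        const double hub_wait   = 16.0/3.0;  /* average extra wait for the rdbyte's hub window   */
        const double iterations = 128.0;     /* 127-character string plus the terminating zero   */

        /* Pure LMM: four 18-cycle LMM costs per iteration plus the hub wait. */
        double lmm_cycles = iterations * (4.0*lmm_instr + hub_wait);

        /* FCACHE: run the loop as cog code (2075 cycles, from the cog-mode
           analysis above) plus ~8*16 = 128 cycles to load and set it up.   */
        double fcache_cycles = 2075.0 + 128.0;

        printf("LMM    : %.2fus @ 100MHz\n", lmm_cycles    / 100.0);  /* ~99us (98.98us above, which rounds 16/3 to 5.33) */
        printf("FCACHE : %.2fus @ 100MHz\n", fcache_cycles / 100.0);  /* 22.03us */
        return 0;
    }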
3) P1E 16/32 cog / 512KB version
200MHz hub clock, 2 clocks per instruction, 32 hub clocks to get a hub window (the cog sees it as 16 instruction clocks); the P1E has the same RDxxx timing as the P1.
Exactly the same code as the P1 case above; calculation in clock cycles: 32*128 + 32 = 4128 cycles
20.64us @ 200MHz (100MIPS) with 32 cogs
10.32us @ 200MHz (100MIPS) with 16 cogs
The ratio between cog and LMM will be the same, so we can simply scale the P1 results to get the LMM numbers:
20.64 * 98.8/20.75 = 98.28us @ 200MHz/100MIPS/32 core ***
10.32 * 98.8/20.75 = 49.14us @ 200MHz/100MIPS/16 core ***
The same ratio applies to the FCACHE version:
20.64 * 22.03/20.75 = 21.91us
10.32 * 22.03/20.75 = 10.96us
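The scaling step above is easy to misread, so here it is transcribed literally as a C sketch (mine, not from the post): multiply the P1E cog-mode time by the P1 LMM-to-cog (or FCACHE-to-cog) ratio.

    #include <stdio.h>

    int main(void)
    {
        /* P1 reference times at 100MHz from section 2 (us); 98.8 is the
           rounded LMM figure the post uses in this particular calculation. */
        const double p1_cog = 20.75, p1_lmm = 98.8, p1_fcache = 22.03;

        /* P1E cog-mode times at 200MHz / 100 MIPS, from the cycle counts above (us). */
        const double p1e32_cog = 20.64, p1e16_cog = 10.32;

        /* Assumption: the cog-to-LMM and cog-to-FCACHE ratios carry over unchanged. */
        printf("P1E32 LMM    : %.2fus\n", p1e32_cog * p1_lmm    / p1_cog);  /* ~98.28us */
        printf("P1E16 LMM    : %.2fus\n", p1e16_cog * p1_lmm    / p1_cog);  /* ~49.14us */
        printf("P1E32 FCACHE : %.2fus\n", p1e32_cog * p1_fcache / p1_cog);  /* ~21.91us */
        printf("P1E16 FCACHE : %.2fus\n", p1e16_cog * p1_fcache / p1_cog);  /* ~10.96us */
        return 0;
    }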
Comments
http://forums.parallax.com/showthread.php/155091-P1-vs-P1E32-vs-P1E16-vs-P2-cycle-counting-benchmark-comparison!?p=1256413&viewfull=1#post1256413
Bill has quoted the earlier post. I've put a better, clearer one below.
I don't like making unsupported statements, so I thought I'd take the time to document a usage case that will be very common.
strcpy()/memcpy() would give an even larger performance advantage to the P2, and be the same speed on P1/P1E32
Even I was initially surprised at the lack of performance difference between the P1 and P1E32, but in retrospect, hub access dominates.
The current P2 design kicks ***!
FYI, my best guess is that a P1E32 @ 200MHz (100MIPS/core) would use significantly more power than a P2 @ 100MHz (100MIPS/core), and a P1E16 would use roughly the same amount.
The above data shows that on a per-cog basis, the P1E32 is the same speed as the original P1 when there is a reasonable amount of hub access (i.e. LMM).
There is no magic, just toggling gates using power
Where did you get your PIE hardware for your conclusions?
They are of course estimated conclusions. Still an interesting "thought experiment."
I estimate that "You may be wrong or you may be right." -Billy Joel
Much easier and less work than presenting actual supported arguments, of course.
I did not need hardware, I used Chip's information regarding how many clocks between hub cycles, how many instructions per clock, and "it behaves the same as P1".
All else follows from that.
If Chip changes how it works, I can adjust the numbers.
I may revise what I wrote above. Or I may just leave it...
Here's what I don't get yet:
Producing a many-COG P1E clocked high enough to really make a difference means producing a chip that will use power on the order of watts, or it really doesn't do anything meaningful!
This is due to the basic process physics. No getting around that!
The power profile is more modest, because that P1E is performance limited due to the design being simple, etc...
No matter what, it's going to take more significant power to run anything we produce in this process. Really pushing a P1E makes heat! Really pushing a P2 makes heat!
The current P2 design is much quicker! This means really pushing it can make more heat! More than people feel good about.
But why do we have to do that?
P2 performance at a reasonable clock, with optimizations / gating, etc. to maximize its overall efficiency, is going to deliver more features than a P1E would at a similar power profile, while delivering great performance relative to the P1E - so what is the benefit exactly?
I expressed that poorly above, and I'll leave it, but really the statement in bold is what is bugging me.
Secondly, a P1E cannot be pushed to the peak performance and features possible on the P2! How is that different from just being sane about how we limit a P2?
And finally, for those who want to run it really fast and hot, the P2 can max out the process to a much greater degree than the P1E can, putting a lot on the table for those who can entertain the power / thermal design challenges.
Really, the P1E is safer, simply because it does not perform. A P2, fed the right power voltages, optimized and clocked appropriately can deliver similar performance, while also delivering killer features.
???
Assuming the hub access is only one long wide, I can do that easily; however, that requires, at a minimum, jump instructions with 16-bit addresses. Given those, the source code would be the same as in the P1 cog mode.
Any caching would of course help.
Corrections:
Initial numbers were wrong - bad calculation, plus I forgot the actual rdbyte! Here are the corrected numbers:
P1E32 would roughly be (128*4*32 + 3*32 + 32)/200 = 82.56us (100MIPS / 200MHz)
P1E16 would roughly be (128*4*16 + 3*16 + 16)/200 = 41.28us (100MIPS / 200MHz)
And you are welcome
I've updated the message, and here are the numbers again:
P1E32 would roughly be (128*4*32 + 3*32 + 32)/200 = 82.56us (100MIPS / 200MHz)
P1E16 would roughly be (128*4*16 + 3*16 + 16)/200 = 41.28us (100MIPS / 200MHz)
P2 @ 100MHz would be 4.18us, 10x faster at the same MIPS (20x at the same MHz)
Basically, the mistake was that I counted 3 hub cycles per loop, not including the hub cycle for fetching the byte. I think I may have made the same error in post #1; I will check/fix that now.
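For anyone who wants to re-run the corrected estimate, here is the formula as a tiny C function (my own sketch; the helper name p1e_hubexec_us is made up). It assumes, per the correction above, 4 hub accesses per loop iteration - 3 instruction fetches plus the rdbyte - each costing a full hub window of N clocks on an N-cog P1E at 200MHz:

    #include <stdio.h>

    /* Estimated strlen() time (us) for non-cached hubexec on an N-cog P1E:
       128 iterations * 4 hub accesses * N clocks per hub window, plus
       setup/return, at a 200MHz clock.                                    */
    static double p1e_hubexec_us(int n_cogs)
    {
        return (128.0*4*n_cogs + 3*n_cogs + n_cogs) / 200.0;
    }

    int main(void)
    {
        printf("P1E32: %.2fus\n", p1e_hubexec_us(32));  /* 82.56us */
        printf("P1E16: %.2fus\n", p1e_hubexec_us(16));  /* 41.28us */
        return 0;
    }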
!!!
Shouldn't we really be thinking about the idea JMG had to provide for per COG clocking then? That could make a very significant impact in the power profile, keep timing resolution, and still perform very nicely indeed!
Running big programs at high speeds is probably the single biggest Propeller exception there is.
Assuming the same clock speed for all tests, and P1 as the reference, both P1E variants should be twice as fast for code executing in a cog with no hub access, since they need 2 clocks per instruction versus the P1's 4 clocks. The P2 cog would be four times faster due to its single clock cycle per instruction.
The 2x difference in speed between the P1E16 and P1E32 clearly shows that frequent hub access would be the major bottleneck as the number of cogs increases. Increasing the number of cogs without increasing hub bandwidth reaches a point of diminishing returns rapidly, and I suspect 32 cogs is past the sweet spot for most applications.
A Propeller with 16 cogs, 256K hub RAM, 64 I/O pins, and single-cycle hub access would give a ratio of 4 pins, 16K hub RAM, and 5MHz of hub bandwidth per cog with an 80MHz clock. If there was extra space on the die, 512K of hub RAM would be nice.
A better comparison when getting to this level is the operation energy, but for that we need P1E power simulations.
Note that a P1E is unlikely to operate on an all-cogs-equal basis for HUB access, but that is a design-craft decision, and out of the box the simplest user mapping could be all-user-reserved-hubs-get-equal-access.
It's good to make estimates. Later when you have actual measurements they will be good for comparisons.
I know you hate to be challenged. That's what makes it so much fun.
'-)
I fixed the first post - I'd forgotten to add the cycles used by the rdbyte in the initial calculations for P1 and P1E... it is fixed now.
P2 wins... HUGE... with hubexec, even compared to a simple hubexec on P1E!
P2@100MHz is ~10x faster at hubexec than P1E16@200MHz
P2@100MHz is ~20x faster at hubexec than P1E32@200MHz
Checkmate.
I hate to be challenged by people without any data or technical arguments, just hand-waving.
You are VERY good at that!
:-)
I totally agree, the results were predictable. Unfortunately a lot of people don't believe in performing calculations, so I thought I'd do it for them, to make the differences obvious.
The hub latency impact is made very obvious above, and the benefits of even the small caching the P2 does are rather dramatic.
Make the P2 as it is. If the thing boils over, disable a few COGs.
It could even be arranged that if COGs are disabled, the HUB round-robin cycle is shortened accordingly. Bingo, we have even more performance per COG as a result :)
Sounds like we are still winning big time.
My suspicion - and it is only an educated guess - is that running the 32 P1E32 cogs at 200MHz will actually end up toggling more transistors than running 8 P2 cogs at 100MHz.
32*200 = 6400 p1e-cogmhz
8*100 = 800 p2-cogmhz
The P2 could toggle 8x the transistors of the P1E32 and use roughly the same amount of power.
Yes. I think that should always be the first choice until we're told it's "impossible" - you know how the community and Chip deal with impossible!
JMG has repeatedly expressed good ideas here to save power and improve timing. Frankly, I read those wrong, and I apologize for some earlier comments, JMG.
Now that we are here and I have read about these things, I see how they are necessary at this process size.
Chip said gating the COGs can be done easily, and if we offer simple steps for clock speeds that do not make a mess of the HUB, mooch starts to make a TON of sense, people!
Want that fast, big program? Run it on a high-speed COG; run your simple video at 40MHz (and I'm not kidding - basic video needs work fine at that clock and even lower); run a moderate-speed COG or a few for peripherals, and another fast one for SDRAM, whatever.
I think we need a 16 entry hub slot assignment table for cogs.
We have 8 cogs.
Let's make the hub wheel have 16 spokes.
The table is initialized as follows:
spoke 0: 0
spoke 1: 1
spoke 2: 2
spoke 3: 3
spoke 4: 4
spoke 5: 5
spoke 6: 6
spoke 7: 7
spoke 8: 0
spoke 9: 1
spoke A: 2
spoke B: 3
spoke C: 4
spoke D: 5
spoke E: 6
spoke F: 7
jmg suggested that we can flag a slot as unassigned, as a power-saving measure. I am not sure how that would help save power; I'd rather drop a cog's multiplier, or cogstop it (which presumably freezes its power usage).
This of course allows totally deterministic allocation of hub slots, all the way down to 1/16 - or heck, 1/32 if the table has 32 entries.
Totally deterministic hub timing for cogs, as needed.
Add "mooch", for hubexec cogs, and we have P2 nirvana
Leakage power is not anywhere near the consideration active power is at this process size, correct?
We also have the ability to load COGS, start them, stop them, and start them in HUBEXEC, and start them at any address without loading. This means we can burst with COGS without even thinking too hard about it. In terms of power, this is a LOT of flexibility. IMHO, a more optimal solution than a lot of little P1 COGS would be.
We know mooch is going to make a COG run fast when we want it to. If this hub scheme improves power utilization options, I'm all for it. Again, that's primary right now.
Thanks.
You can ignore this if you want, because I think I might be the only here that really wants to know.
You have not computed the numbers for a 4 Cog P2 in hardware running at 160 MHz with Cogs that seem to run on nitrous… at 40MHz.
I am also interested and cannot fathom the energy scaling… it seems to me that a 4 Cog P2 in hardware could be scaled with a much tighter grain and to a far lower end point.
It seems that for some serious markets, a single design that can be broadly applied to the widest range of application has benefits.
How does an FPGA P2 running at 80 MHz compare with the fastest p1 variant running in hardware at 100MHz?
I think a lot of the guys are supporting what they view as being in Parallax's best interests. And a lot of them have been around for a very long time and care deeply about the issue. It is a very delicate issue, but one that I think can tolerate more consideration.
Rich
The numbers given were per cog.
Energy-wise, my best guess - and this is only a guess - is that the P1E32 takes more power than the P1x8, and the P1E16 roughly the same. But that is only my guess. I was calculating compute performance for strlen().
P2@80MHz ~5.21us (cog or hubexec mode)
P1@100MHz ~20.75us (cog mode; LMM with FCACHE only about 1us slower, P2 ~4x faster)
P1@100MHz ~98.98us (pure LMM, P2 about 19x faster)
I hate rejection.
Nice job.
But strmove and strcopy are hardly normal instructions so may heavily skew results. As you know, numbers can always be made to fit a desired scenario.
A P2 cog at the same clock will currently use ~15x the power of the P1 cog. Chip can and will reduce the P2 cog power usage by clock gating, but it still will be a power hog compared to a P1 cog.
There are some ideas that can help both P1E and/or P2...
* Slot allocation. I know you and I both agree (not necessarily on implementation, but it can be fine-tuned).
This would give extra time to the critical cogs, and to the large general cog.
* Cogs not being equal. (I don't recall your stance on this)
Real life means at least one cog needs to be different from the rest. It needs to be...
- the main controlling cog
- capable of large memory (hubexec really helps over LMM)
- non-deterministic
- capable of lapping up unused slots
The remaining cogs need to be...
- lean, mean for driver performance
- deterministic (default)
- lots of them
- low power, relatively fast
- configurable fixed slot allocation (by Cog #0)
- hw assist - Video/Counters for P1; Video/Counters/UART/USB/SERDES for P2
- lots of them removes the need for multi-tasking and multi-threading
Everyone wants more hub RAM. The P1E can give us at least 512KB. Maybe with the above refinements, the P2 can get 512KB too?
Therefore, I suggest Cog #0 be different..
- The large controlling cog, hubexec, culled for large programs without being lumbered with the extras such as video etc. Perhaps just raw I/O.
All other cogs be lean and mean...
- Culled for lean driver performance
Perhaps with these changes we can get to a power-efficient P2 design in 180nm ???