The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Heater. · 2014-05-08 03:43

Here is my suggestion for doubling the available HUB bandwidth in the simplest possible way:
<serious>
Reduce the number of COGS from 16 back to 8.
</serious>
Years ago Chip posed the question to the forum "Should we go for 16 COGs or more RAM?" As far as I remember the consensus was to take the RAM.

The main reason being that as COG count goes up HUB bandwidth goes down. There are diminishing returns on overall performance. You can't fight Amdahl's law. Somewhere around about 8 COGs is the sweet spot. That's before we even think about how those available HUB access slots are allocated to processors.
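
To put rough numbers on that, here is a minimal sketch. It assumes a plain round-robin hub where each COG gets exactly one slot per rotation, and it picks an arbitrary 90% parallel fraction for the Amdahl's law part. The function names are made up for this illustration.

```python
def hub_share(cogs: int) -> float:
    """Fraction of hub slots each COG gets under plain round-robin (assumed)."""
    return 1.0 / cogs


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup when only part of the work scales with workers."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


for n in (8, 16):
    # 0.9 is an arbitrary parallel fraction chosen only for illustration.
    print(n, hub_share(n), round(amdahl_speedup(0.9, n), 2))
```

Under those assumptions, going from 8 to 16 COGs halves each COG's hub share, while the overall speedup grows by well under 2x.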

Of course there is more to it than just maximizing some theoretical average overall performance, as you might do when building a multi-core general purpose computing machine like an Intel PC or a supercomputer.

The "more to it" here is that COGs are actually running their own code in their own little space as fast as they can and interacting with the external real world in real-time. Maximizing global aggregate performance is not the goal here.

So my feeling is that:

1) For those who want to do that an 8 COG solution is preferable, or indeed a different device altogether. STM make some nice ones I have heard.

2) For those who want a Propeller, we can stick to 16 COGs. 15 of them for those self-contained little drivers and 1 that runs big code, hubexec, and sucks up all the "left-over" HUB slots. Nice and simple. Nice and predictable.
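
One possible reading of option 2, as a sketch only: every COG keeps its own round-robin slot, and any slot its owner does not claim falls through to the one "big code" COG. The function name, the `king` parameter, and the assumption that the big-code COG always has a request pending are all hypothetical.

```python
def allocate_slots(wants_slot, cogs=16, king=0):
    """Return which COG uses each slot in one hub rotation.

    wants_slot[i] is True if COG i has a hub request pending when its own
    slot comes around. Unclaimed slots fall through to the king COG, which
    is assumed to always have work for them.
    """
    return [slot if wants_slot[slot] else king for slot in range(cogs)]


# Example: only COGs 3 and 7 have a request pending this rotation.
wants = [False] * 16
wants[3] = wants[7] = True
owners = allocate_slots(wants)
print(owners)
```

The other 15 COGs see exactly the timing they would see without the scheme, which is where the predictability comes from.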

Heater. · 2014-05-08 04:02

mindrobots,

Plus, King COG is worth implementing for the name alone!

It crossed my mind already that it's actually "King COG Zero". He would be known as "Zog" by his friends. A name which many here might recognize.

RossH · 2014-05-08 04:05

Heater. wrote: »

mindrobots,

It crossed my mind already that it's actually "King COG Zero". He would be known as "Zog" by his friends. A name which many here might recognize.

Don't you mean this Zog ...

Ross.

Heater. · 2014-05-08 04:17

Indeed I do. I could only find Zog Jr. just then.

RossH · 2014-05-08 05:21

Heater. wrote: »

Indeed I do. I could only find Zog Jr. just then.

Hope you don't mind Heater - I've started a separate thread to discuss your scheme here.

Ross.

Todd Marshall · 2014-05-08 07:18

kwinn wrote: »

Voila, your super cog 0 wish is granted courtesy of a look up table. Simple to implement, simple to program, easy to control, and easy to understand. And we can all have it our way. Yours is the bottom one.

Are you limiting your table to 32 entries? If so, why?
Are you forcing modulo 32 indexing into your table? If so, why?

koehler · 2014-05-08 12:47

Heater. wrote: »

Here is my suggestion for doubling the available HUB bandwidth in the simplest possible way:
<serious>
Reduce the number of COGS from 16 back to 8.
</serious>
Years ago Chip posed the question to the forum "Should we go for 16 COGs or more RAM?" As far as I remember the consensus was to take the RAM.

The main reason being that as COG count goes up HUB bandwidth goes down. There are diminishing returns on overall performance. You can't fight Amdahl's law. Somewhere around about 8 COGs is the sweet spot. That's before we even think about how those available HUB access slots are allocated to processors.

Oy vey!

Now in the interest of simplicity and determinism, we step back to fewer Cores and force anyone needing more Core/Peripherals to make use of a work-around for this limitation by multi-threading Cores with multiple peripherals?

Rather than something that IS more Prop-like, which is adding Cores to allow objects to be simpler and own their own Core...

The rare case where one may need higher bandwidth, which can be accomplished with a simple voluntary hubshare, is now worth considering a retrograde re-architecture which will potentially cause everybody to lose the benefit of greater #Cores.... This suggestion has orders of magnitude greater impact than anything hubsharing would have.

Going back to 8, no matter the speed, will undoubtedly end up with the same internet reception and infamy as the original P1 vis-a-vis #Cores vs no peripherals.

I'm willing to bet a tidy sum that this is never going to be acceptable to Chip/Parallax.

As for Amdahl's Law being bandied about so frequently.... Why does that impact only hub sharing, and not the doubling(?) of Core, Core vs HubExec memory, SmartPins.....

If you want to accurately use Amdahl's law, then NOT allowing for an increase in Hub is what will limit the "expected speedup of parallelized implementations".

Let's now give 200% hub for all the simpler objects that are in OBEX, or will be used as single-Core objects, even though they rarely need 0.5-1.0x.

Heater. · 2014-05-08 13:38

It's only me who "bandies about" Amdahl's law. See my other post somewhere here today as to why it cannot be ignored.

I would go so far as to suggest that the interminable "Groundhog Day" debate about how to allocate HUB cycles is exactly due to Amdahl's Law.

There are ways you can bend it to suit this purpose or that. But no way to beat it in general. We cannot agree on "purpose" so we cannot agree on how to bend it.

potatohead · 2014-05-08 13:51

Precisely so. This is why I have maintained this is not entirely a technical discussion.

There will be tradeoffs. It is the relative worth or difficulty perception, or reuse implications that drive the differences in opinion.

jmg · 2014-05-08 13:57

Todd Marshall wrote: »

Are you limiting your table to 32 entries? If so, why?

There is no cast-iron size - the appeal of 32 is it can load atomically, and it allows ALL COGs to choose 50% BW, and it can overlay to give slot-sharing to ANY other COG.

See my test case numbers here
http://forums.parallax.com/showthread.php/155561-A-32-slot-Approach-(was-An-interleaved-hub-approach)?p=1265840&viewfull=1#post1265840

a) 32x, Reload Scan, Default Config Boolean -> 240.500MHz
d) 48x, Reload Scan, Default Config Boolean -> 192.530MHz
b) 64x, Reload Scan, Default Config Boolean -> 168.407MHz
Those real-world numbers point to 32.

Todd Marshall wrote: »

Are you forcing modulo 32 indexing into your table? If so, why?

No, M32 is simply the default. Other choices are available for Slot Granularity control, for when determinism of tight loops needs to be something other than a coarse, locked, 16N.
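
A minimal sketch of how such a table might behave, assuming each entry simply names the COG that owns that slot and the scan wraps after `modulo` entries. The table contents and the function name are invented for illustration and are not taken from the linked test case.

```python
from collections import Counter


def slot_shares(table, modulo=32):
    """Share of hub slots per COG when the scan reloads after `modulo` entries."""
    active = table[:modulo]
    counts = Counter(active)
    return {cog: n / len(active) for cog, n in counts.items()}


# Hypothetical table: COG 0 takes every other slot (50% BW),
# COGs 1..15 are spread over the remaining slots.
table = []
for i in range(16):
    table += [0, 1 + i % 15]

print(slot_shares(table))             # default modulo 32
print(slot_shares(table, modulo=30))  # shorter scan, different granularity
```

Changing `modulo` changes how often each COG's slot comes around, which is the granularity control described above.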

Todd Marshall · 2014-05-08 16:33

jmg wrote: »

There is no cast-iron size - the appeal of 32 is it can load atomically, and it allows ALL COGs to choose 50% BW, and it can overlay to give slot-sharing to ANY other COG.

See my test case numbers here
http://forums.parallax.com/showthread.php/155561-A-32-slot-Approach-(was-An-interleaved-hub-approach)?p=1265840&viewfull=1#post1265840

a) 32x, Reload Scan, Default Config Boolean -> 240.500MHz
d) 48x, Reload Scan, Default Config Boolean -> 192.530MHz
b) 64x, Reload Scan, Default Config Boolean -> 168.407MHz
Those real-world numbers point to 32.

And the downside of having a number larger than 32 ... like maybe even 320?

jmg wrote: »

No, M32 is simply the default. Other choices are available for Slot Granularity control, for when determinism of tight loops needs to be something other than a coarse, locked, 16N.

Works for me.

jmg · 2014-05-08 17:02

Todd Marshall wrote: »

And the downside of having a number larger than 32 ... like maybe even 320?

See the MHz figures above; those need to be comfortably above the 200MHz target for FPGA.
32 meets that; 48 and above fail.

In ASIC, an asymmetric dual port memory could likely come in with a flatter bit-ns impact, but this needs to be FPGA test-able, and any hub-alloc needs to not pull-down any SysCLK figures.

koehler · 2014-05-08 17:54

Heater. wrote: »

It's only me who "bandies about" Amdahl's law. See my other post somewhere here today as to why it cannot be ignored.

I would go so far as to suggest that the interminable "Groundhog Day" debate about how to allocate HUB cycles is exactly due to Amdahl's Law.

There are ways you can bend it to suit this purpose or that. But no way to beat it in general. We cannot agree on "purpose" so we cannot agree on how to bend it.

LOL, not saying it can be ignored at all.
Saying that you are bending that 'Law' when it comes to the rest of the system being beefed up in speed, memory, etc., but not in hub memory access.

To date, for all the fuss over this by some, I don't think anyone has explained how my project with 9 standard objects will be a failure when I throw in a HS-Object.
I mean, even accidentally.
Unless it requires 8 or more Cores.

Or wait, I load a HS-Object that needs 3 Cores. Now I have 4 Cores left.
Now that that's working fine and dandy, I decide I want this other cool HS-Object and add it into the mix.
.....
Ok, the local fire brigade has left, and I'm in my wife's boudoir with her laptop as my area is flooded.
I do a little 24 CSI work, and create a GUI in Visual Basic to scan the subnet, and find the docs for that last object, in about 2 minutes flat.
It says it needs 5 Cores! You Maniacs! You blew it up! Ah, da** you! God da** you all to he**!

That all happened because the 'object required 5 Cores', not because I'm a blithering idiot who apparently loads objects without reading any documentation at all.

Aside from the above scenario, the only other one would be somehow assuming Chip is not going to simply update CogNew to set Sharing Cores as unavailable unless explicitly requested as non-hub.

This is for simple hub sharing.

How are standard objects impaired by me using an HS-Object concurrently?

I realize this argument is going nowhere. Perhaps we should just go back to arguing for/against Tasking, especially since Chip has been vewy vewy quiet.....

"I had totally forgotten about video. Tasking helps a cog get a lot of other stuff done.
I'll get things running without tasking, at first, and then see what it would take to add it back in. I just need to hew this down to get it started"

cgracey · 2014-05-08 20:04

I've been holed up working on the new design. It's been pretty much a ground-up effort (meaning: from the ground level, upwards, though it started out feeling like ground-up powder). At first, it was just overwhelming, so I marked off what I wouldn't implement, initially, like hub exec. This has made the whole set of problems a lot more approachable. Even the hub is not implemented, yet, as I work on the cog, itself. Everything I add gets scrutinized for its effect on timing. There is definitely beauty in simplicity.

It's looking like the FPGA will be able to go 160MHz, for 80 MIPS per cog (not 200MHz / 100 MIPS). By the time the various mux's get implemented, things slow down quite a bit from what the raw ALU is capable of. The latest bottleneck is the data-forwarding circuits, which pass the current ALU result to the D and S latches, circumventing a read-in-progress for a register which is getting overwritten.
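
For readers unfamiliar with data forwarding, here is a generic software sketch of the idea, not Chip's actual circuit: when an operand read targets a register that the previous instruction is still writing, the fresh ALU result is used instead of the stale register contents. The register count and all names are assumptions made for this example.

```python
def read_operand(registers, address, in_flight_write=None):
    """Read a register, bypassing the register file if a write is in flight.

    in_flight_write is an (address, value) pair for an ALU result that has
    not yet landed in the register file, or None when there is none.
    """
    if in_flight_write is not None and in_flight_write[0] == address:
        return in_flight_write[1]    # forward the fresh ALU result
    return registers[address]        # normal read path


registers = [0] * 512    # assumed register count, for illustration only
registers[5] = 111       # stale value still in the register file
pending = (5, 222)       # previous instruction is writing 222 to register 5

print(read_operand(registers, 5, pending))  # takes the forwarded value
print(read_operand(registers, 6, pending))  # unaffected register
```

In hardware the comparison and the extra mux sit in the operand path, which is why they show up in the timing.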

Anyway, I'm keeping things as simple and lean as I can. Hopefully, I'll have an FPGA image in the next week, or so.

Invent-O-Doc · 2014-05-08 20:10

"there is definitely beauty in simplicity"

cgracey wrote: »

I've been holed up working on the new design. It's been pretty much a ground-up effort (meaning: from the ground level, upwards, though it started out feeling like a lot of ground up powder). At first, it was just overwhelming, so I marked off what I wouldn't implement, initially, like hub exec. This has made the whole set of problems a lot more approachable. Even the hub is not implemented, yet, as I work on the cog, itself. Everything I add gets scrutinized for its effect on timing. There is definitely beauty in simplicity.

It's looking like the FPGA will be able to go 160MHz, for 80 MIPS per cog (not 200MHz / 100 MIPS). By the time the various mux's get implemented, things slow down quite a bit from what the raw ALU is capable of. The latest bottleneck is the data-forwarding circuits, which pass the current ALU result to the D and S latches, circumventing a read-in-progress for a register which is getting overwritten.

Anyway, I'm keeping things as simple and lean as I can. Hopefully, I'll have an FPGA image in the next week, or so.

User Name · 2014-05-08 20:18

cgracey wrote: »

Anyway, I'm keeping things as simple and lean as I can.

I'm sure you are. It will be great!

Thank you!!

cgracey · 2014-05-08 20:19

Invent-O-Doc wrote: »

"there is definitely beauty in simplicity"

But what level of simplicity is acceptable? PTRA/PTRB are a bit of a pain. Hub exec is more of a pain. Prop1 was square in the zone of beautiful simplicity. What I have, so far, is like Prop1, but improved.

What is going to disrupt simplicity in this design is wide data paths (4 longs) and hub exec. I think those features are important enough, though, that the disruption they cause must be accommodated.

potatohead · 2014-05-08 20:25

I agree with both of those.

Having the 4 longs between COG and HUB really makes the most of the cycle. A COG can get a lot of data or push a lot of data, and that's an important improvement over the P1. Makes things like WAITVID a bit more relaxed too as bigger chunks of data allow for more options and processing time between cycles.
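
A back-of-envelope sketch of what the wider path buys. The 160MHz clock is the FPGA figure Chip quoted; one hub access per COG every 16 clocks is an assumption for a 16-COG round-robin hub, and the function name is made up.

```python
BYTES_PER_LONG = 4


def hub_bytes_per_second(clock_hz, clocks_per_slot, longs_per_access):
    """Peak per-COG hub throughput under the stated assumptions."""
    accesses_per_second = clock_hz / clocks_per_slot
    return accesses_per_second * longs_per_access * BYTES_PER_LONG


single = hub_bytes_per_second(160e6, 16, 1)  # one long per hub access
quad = hub_bytes_per_second(160e6, 16, 4)    # four longs per hub access
print(single / 1e6, quad / 1e6)              # MB/s per COG
```

Whatever the real slot timing turns out to be, the quad path moves four times the data in the same number of hub accesses.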

Hubexec is just nice because we can assemble PASM anywhere, and not being able to do that easily limited the use of LMM, IMHO.

Thanks for the update Chip. Hope you are getting some sleep when you need it.

jazzed · 2014-05-08 21:03

cgracey wrote: »

But what level of simplicity is acceptable? PTRA/PTRB are a bit of a pain. Hub exec is more of a pain. Prop1 was square in the zone of beautiful simplicity. What I have, so far, is like Prop1, but improved.

What is going to disrupt simplicity in this design is wide data paths (4 longs) and hub exec. I think those features are important enough, though, that the disruption they cause must be accommodated.

Once you get to a productive and salient point, consider one of these COG sharing things people are batting around. It may be simpler to do than HUB exec, and bandwidth on demand would be an interesting and useful feature (it could be off by default of course).

Is having some kind of an auto-incrementing address read still useful? That's what PTRA/B are for right? (no I haven't kept up! just waiting for a bit-file.)

kwinn · 2014-05-08 21:38

I limited the table to a maximum of 32 for simplicity. A 32x4 table can be loaded with a single quad write, so it is inherently an atomic operation. The reload register allows the use of any modulo up to and including modulo 32 (the second-to-last table in the posted spreadsheet was modulo 30). I did mention in some post that it might be worth increasing the table size to 64, but then again we may not gain enough to make it worth the extra complexity. Also keep in mind that this table could be updated on the fly, which might also be useful.
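
The "32x4 fits one quad write" arithmetic can be sketched as follows: 32 entries of 4 bits each make 128 bits, which is exactly four 32-bit longs. The nibble ordering within each long and the function names are assumptions for this example.

```python
def pack_slot_table(entries):
    """Pack 32 four-bit COG numbers into four 32-bit longs (8 entries each)."""
    if len(entries) != 32 or any(not 0 <= e <= 15 for e in entries):
        raise ValueError("need exactly 32 entries in the range 0..15")
    longs = []
    for i in range(0, 32, 8):
        word = 0
        for j, cog in enumerate(entries[i:i + 8]):
            word |= cog << (4 * j)   # assumed order: lowest nibble first
        longs.append(word)
    return longs


def unpack_slot_table(longs):
    """Inverse of pack_slot_table."""
    return [(word >> (4 * j)) & 0xF for word in longs for j in range(8)]


# Plain round-robin over 16 COGs, repeated twice to fill 32 slots.
table = [i % 16 for i in range(32)]
longs = pack_slot_table(table)
print([hex(w) for w in longs])
```

Because the whole table travels in one quad, a reader of the table never sees a half-updated mix of old and new entries.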

Cluso99 · 2014-05-08 23:54

cgracey wrote: »

I've been holed up working on the new design. It's been pretty much a ground-up effort (meaning: from the ground level, upwards, though it started out feeling like ground-up powder). At first, it was just overwhelming, so I marked off what I wouldn't implement, initially, like hub exec. This has made the whole set of problems a lot more approachable. Even the hub is not implemented, yet, as I work on the cog, itself. Everything I add gets scrutinized for its effect on timing. There is definitely beauty in simplicity.

It's looking like the FPGA will be able to go 160MHz, for 80 MIPS per cog (not 200MHz / 100 MIPS). By the time the various mux's get implemented, things slow down quite a bit from what the raw ALU is capable of. The latest bottleneck is the data-forwarding circuits, which pass the current ALU result to the D and S latches, circumventing a read-in-progress for a register which is getting overwritten.

Anyway, I'm keeping things as simple and lean as I can. Hopefully, I'll have an FPGA image in the next week, or so.

cgracey wrote: »

But what level of simplicity is acceptable? PTRA/PTRB are a bit of a pain. Hub exec is more of a pain. Prop1 was square in the zone of beautiful simplicity. What I have, so far, is like Prop1, but improved.

What is going to disrupt simplicity in this design is wide data paths (4 longs) and hub exec. I think those features are important enough, though, that the disruption they cause must be accommodated.

Simplicity is the best. Once you get that done and into an fpga there will be time to look at the other things.

Please don't bother about these threads for now - we are progressing but still no consensus yet.

If you need a break and some fresh air, start a new thread to discuss whatever takes your fancy. Or, go pick some walnuts or something.

You mention the 160MHz FPGA limit. Is the OnSemi process likely to be the same limit, or might we possibly see some improvement???

Postedit:
Yes, quad hub access and hubexec seem way too important to not have. I wonder if there is another way to implement hubexec that might simplify it???

RossH · 2014-05-09 04:52

cgracey wrote: »

But what level of simplicity is acceptable? PTRA/PTRB are a bit of a pain. Hub exec is more of a pain. Prop1 was square in the zone of beautiful simplicity. What I have, so far, is like Prop1, but improved.

What is going to disrupt simplicity in this design is wide data paths (4 longs) and hub exec. I think those features are important enough, though, that the disruption they cause must be accommodated.

Stay in the zone of beautiful simplicity and you can't really ever go wrong, Chip.

Hubexec? Wide data paths? Who really needs these? Mostly seems to be those who want this new chip to be something it doesn't naturally want to be.

Would we like HubExec? Sure!

Would we like wide data paths? Sure!

But can we live without either of these? Of course we can!

An improved Prop1? Magic!

Ross.

potatohead · 2014-05-09 07:57

I strongly second this.

Heater. · 2014-05-09 08:16

I strongly second that as well.

We all want more of this and more of that in every dimension and seven virgins as well.

The past year or so has shown that it cannot all be done in any sensible way.

I'm amazed at how well you keep focus in all this turmoil Chip.

David Betz · 2014-05-09 08:20

I guess I agree as well but I wonder if a P1+ that is mostly the same as P1 with twice the COGs and more memory will have a bigger market than the P1 did. Can Parallax survive if the P1+ sells only as well as P1 sold?

Kerry S · 2014-05-09 08:50

David Betz wrote: »

I guess I agree as well but I wonder if a P1+ that is mostly the same as P1 with twice the COGs and more memory will have a bigger market than the P1 did. Can Parallax survive if the P1+ sells only as well as P1 sold?

More importantly will they survive if the P1+ ends up draining sales from the P1 instead of opening up significant new markets?

I am beginning to think we would have been better off with a 4 core 80 MHz P2 with the full I/O. At least that could have targeted different applications than a P1/P1+.

Don't get me wrong, I think the P1+ will be a great update to the P1... the problem is that it is looking more and more like just an update.

Ken Gracey · 2014-05-09 08:51

cgracey wrote: »

But what level of simplicity is acceptable? PTRA/PTRB are a bit of a pain. Hub exec is more of a pain. Prop1 was square in the zone of beautiful simplicity. What I have, so far, is like Prop1, but improved.

What is going to disrupt simplicity in this design is wide data paths (4 longs) and hub exec. I think those features are important enough, though, that the disruption they cause must be accommodated.

Internal opinions welcome, too?

Many of the customers I work with simply want forward progression from Parallax. The big steps we've been taking are so big for them that they want to wander elsewhere between our releases. They love our initial offering in P8X32A and want more of it, but can be very frustrated with our design cycle as it is too long for their product iterations which need to happen every two years. The volume user I mentioned the other day who loves Spin/ASM and planned 10,000 chips this year needs higher speed, more RAM, same number of cores (16 would be great, but speed is more important), and cordic for simplifying math routines. You are currently providing all of these features.

The cost of not having P16X finished is an opportunity cost. And it's at least 2x the cost of making multiple foundry runs with new designs. Customers want iterative steps forward to match their product design cycles, and industry wants frequent releases to be considered a player on their radar.

Therefore, as your biggest proponent I respectfully cast my vote in favor of leaving out important features in favor of simplicity, completion, and the concept of more frequent future releases which could include hubexec and wide data paths.

The benefit of an open design process is that we get early adopters, applications, and community ownership. The drawback is that things we discuss that don't come true shall be held against us by some. I favor benefits of the process more than the drawbacks and don't care to let the drawbacks dictate our agenda, even though they will cost us some applications and design-ins.

I've been with you on projects for over forty years now, and one fact I know better than any forum member is what two weeks or one month really means to a development cycle. In this case, I believe your gut instincts are consistent with my advice. And if this doesn't ring, just take the path of least resistance and lower stress.

Ken Gracey

David Betz · 2014-05-09 09:00

Well, one advantage of no hub exec is that we don't have to change the PropGCC code generator much for the new chip. There will, of course, be more opportunities to optimize its output using some of the new features but just getting PropGCC working should be easier than if we had hub exec.

Rayman · 2014-05-09 09:03

I agree with Ken... If Chip thinks hubexec and quads are a pain, maybe "we" should just leave it off this time and push for another iteration in 2 years...

Bird in the hand being worth 2 in the bush and all...

David Betz · 2014-05-09 09:12

Rayman wrote: »

I agree with Ken... If Chip thinks hubexec and quads are a pain, maybe we should just leave it off this time and push for another iteration in 2 years...

Or maybe we can forget hub execution entirely and reconsider the idea of an "application processor" with a different architecture that fits high level languages better like an ARM or MIPS core.
