The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

jmg · 2014-05-05 21:55

kwinn wrote: »

A cog using 8 hub access slots would not have time to execute any other instructions, so is pretty much useless except possibly for hubex. Even hubex would not really need more than 4 slots if it were to fetch quads.

Problem is, not everything can be nicely delivered in quads. Quads only solve some problems.
Further, the new REPS opcode can shrink loops (not sure if AutoINC is still there, I think yes?), and there is always loop unrolling, but you get the idea: burst handlers can be made quite compact, and run well above 12.5MHz.

kwinn wrote: »

It would only be necessary for some cogs to donate half of their hub slots to make this very useful, and the silicon to implement it should be tiny. Something in the order of a 32x4 lut and a 5 bit counter.

Yes, there are many ways to manage Slot ReAssign.
The best are the most flexible, and include a config subset to give the locked 1:16 assign (12.5MHz) we have now.

jmg · 2014-05-05 22:01

koehler wrote: »

Actually, the ability to donate 1/2 hub seems like a better, deterministic mode than mooching.

1 primary Core with 2 50% donors would still double hub for the primary, while not leaving the other 2 Cores hubless or 'useless'.

Agreed. See #1822 for mention of 50% (mappable) donate as an alternative to mooch. (I think timing details exclude mooch anyway.)
For a great many applications, the 6.25MHz is more than enough, yet it allows donation all the way up to 100MHz.

RossH · 2014-05-05 22:14

jmg wrote: »

Wow, a sweeping claim, that's simply not true : A 6.25MHz HUB Slot BW will be plenty for a very large number of applications.

But you don't get that Hub bandwidth if another cog is using it. And what exactly is a cog without Hub access good for? I can't think of much.

jmg wrote: »

You also seem to presume I want a 100MHz CPU, yet my test benchmark cases are for Burst Data Capture/Stimulus generate - something a Prop certainly SHOULD be able to be good at, if it is allowed to get there.

Well, in fact you are the one claiming you can achieve 100MHz, but (as I have demonstrated) this is illusory. What speed do you need? And is this kind of application especially suitable for a symmetrical multiprocessor? If not, why use a Propeller?

jmg wrote: »

Just proves how one designer's edge cases are another designer's bread and butter !

To a man whose only tool is a hammer, every job looks like it needs a nail.

Ross.

jmg · 2014-05-05 22:30

RossH wrote: »

jmg wrote:

Wow, a sweeping claim, that's simply not true : A 6.25MHz HUB Slot BW will be plenty for a very large number of applications.

But you don't get that Hub bandwidth if another cog is using it. And what exactly is a cog without Hub access good for? I can't think of much.

Hmm... Even more sweeping statements that are simply incorrect.

I clearly stated this "A 6.25MHz HUB Slot BW", and you make up imagined dependencies ?
Perhaps a little more effort reading ?

In none of my stated design examples, is Hub allocation a lottery. The 6.25MHz is fully deterministic.
I guess you are now ok, given the glaring mistake is corrected ?

RossH · 2014-05-05 22:46

jmg wrote: »

Hmm... Even more sweeping statements that are simply incorrect.

I clearly stated this "A 6.25MHz HUB Slot BW", and you make up imagined dependencies ?
Perhaps a little more effort reading ?

In none of my stated design examples, is Hub allocation a lottery. The 6.25MHz is fully deterministic.
I guess you are now ok, given the glaring mistake is corrected ?

Well, something is certainly amiss here. If one cog can use all the available hub bandwidth of another cog, then what I have said is true.

Perhaps you had better try and explain how your "TopCogShared" scheme works again. I may be misunderstanding that.

Ross.

jmg · 2014-05-05 23:17

RossH wrote: »

Well, something is certainly amiss here. If one cog can use all the available hub bandwidth of another cog, then what I have said is true.
Perhaps you had better try and explain how your "TopCogShared" scheme works again. I may be misunderstanding that.

If you start at #1822, and also read #1849 & others, you should be able to follow that the 6.25MHz does not use the TopUsedCog feature, (which still operates independently) but is a sub-scan variant of a dual-table map, first mentioned by Cluso.

The 6.25MHz figure should give a massive clue, as that is SysCLK/32
6.25MHz is a guaranteed minimum, for full 16H scans, it can increase for 8 COGS, 4 COGs etc, via the TopUsedCog detail.

The general case uses 2 x 4 bit fields, and they alternate-select each scan.
(or you can think of it as a /32 scan, with 2 choices)

That allows 100% to self, 50% to self, 50% to other, (& converse), and 100% to other slot alloc choices.
The 50% to self, 50% to other, gives 6.25MHz to self, and adds 6.25MHz to other.

Applying that in Multiple COGS, to a same Other, can go above 50MHz easily. ( 2 x 50MHz or 3 x 33MHz could be more practical than 1 x 100MHz, but that is for the user to choose). ALL other COGS can count on a fixed 6.25MHz slot rate.

( I'm checking if enough case coverage is possible with a Boolean + 4 bit field, but that's an implementation detail )
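For anyone who wants to check the arithmetic, here is a minimal sketch of the dual-table idea in Python. It is only a model, not Chip's logic: the 200MHz SysCLK is inferred from 6.25MHz being SysCLK/32, and the names `table_a`, `table_b` and `slot_rates` are made up for illustration.

```python
SYSCLK_MHZ = 200   # assumption: inferred from 6.25MHz = SysCLK/32
NUM_COGS = 16

def make_tables():
    """Default map: both 4-bit fields of every slot point at the slot's own cog."""
    return list(range(NUM_COGS)), list(range(NUM_COGS))

def slot_rates(table_a, table_b):
    """Hub slot rate in MHz for each cog over one 32-slot double scan."""
    counts = [0] * NUM_COGS
    for table in (table_a, table_b):        # the two fields alternate each scan
        for slot in range(NUM_COGS):
            counts[table[slot]] += 1
    quantum = SYSCLK_MHZ / (2 * NUM_COGS)   # 6.25MHz per granted slot
    return [count * quantum for count in counts]

table_a, table_b = make_tables()
for donor in (1, 2, 3):                     # cogs 1..3 each donate 50% to cog 0
    table_b[donor] = 0

rates = slot_rates(table_a, table_b)
print(rates[0], rates[1], rates[4])         # receiver, a donor, an untouched cog
```

With three 50% donors the receiver ends up with five of the 32 slots, each donor keeps one, and every cog that did not opt in still sees its two.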

Cluso99 · 2014-05-05 23:19

If hub slots can be dynamically changed, then a video cog can utilise the higher bandwidth by using another cog(s) slots, and then release those slots back to the other cog(s) while it does not require them. Then these cogs will be co-operating and dynamically sharing their slots. There are plenty of video apps that would benefit from this sharing.

Also, because we have 16 cogs, there will be applications that will never utilise the whole 16 cogs. In these cases, being able to donate the unused slots to other cogs will give throughput benefits.

Why can some of you not see the benefits ??? But the original 1:16 can still remain if that's what you want. But please don't limit what some of us would like to do with unused resources. You are trying to prevent Chip from even considering this - it will be his choice in the end.

Cluso99 · 2014-05-05 23:30

Real use for slot sharing: USB FS

Currently I cannot get USB FS working. It did not work in the old P2 at 80MHz, but I believed I could make it work with the addition of a helper instruction or two. At 160MHz I also may have been able to get it to work. But with the last posted P2 code and 80MHz, I could not read successive bits for a packet.

The P1+ only makes this even harder. I am hoping that just maybe, I can utilise additional slots in one or two cogs so that I can successfully read a group of bits and pass each byte over to another cog (via hub is the only way at present) for further parallel processing. The latency of 1:16 will be an issue for me. But if I can run at 2:16 or better, then maybe, just maybe, I can get this to work using multiple cogs and helper instruction(s).

While I was confident I could get USB FS running (with caveats) on the P2, I am not confident at all that I can manage it on a P1+, even using multiple cogs.
IIRC the P1 implementation skipped every second message, and still used 4 or 5 cogs. There were lots of caveats.

jmg · 2014-05-05 23:33

Cluso99 wrote: »

If hub slots can be dynamically changed, then a video cog can utilise the higher bandwidth by using another cog(s) slots, and then release those slots back to the other cog(s) while it does not require them. Then these cogs will be co-operating and dynamically sharing their slots. There are plenty of video apps that would benefit from this sharing.

Yup, and the 50% sub-scan option would mean in many cases, this is a simple set-and-forget, as 6.25MHz is plenty for a lot of apps.
Those COGs may never need a change to Hub Slot rate. More advanced cases can dynamically realloc, if they need to.

Dual-Table may be an overkill, (tho it is very flexible*) - a Boolean for 50%, or just always [Share@50%] when sharing may be simpler, and still have enough coverage ?
(* Dual table allows even the unusual donate of both slots to two different COGs)

RossH · 2014-05-06 00:10

jmg wrote: »

If you start at #1822, and also read #1849 & others, you should be able to follow that the 6.25MHz does not use the TopUsedCog feature, (which still operates independently) but is a sub-scan variant of a dual-table map, first mentioned by Cluso.

The 6.25MHz figure should give a massive clue, as that is SysCLK/32
6.25MHz is a guaranteed minimum, for full 16H scans, it can increase for 8 COGS, 4 COGs etc, via the TopUsedCog detail.

The general case uses 2 x 4 bit fields, and they alternate-select each scan.
(or you can think of it as a /32 scan, with 2 choices)

That allows 100% to self, 50% to self, 50% to other, (& converse), and 100% to other slot alloc choices.
The 50% to self, 50% to other, gives 6.25MHz to self, and adds 6.25MHz to other.

Applying that in Multiple COGS, to a same Other, can go above 50MHz easily. ( 2 x 50MHz or 3 x 33MHz could be more practical than 1 x 100MHz, but that is for the user to choose). ALL other COGS can count on a fixed 6.25MHz slot rate.

( I'm checking if enough case coverage is possible with a Boolean + 4 bit field, but that's an implementation detail )

Ok, it seems you have proposed a couple of different schemes over the course of this thread, but I think I have it all now.

Under this scheme, if you need a fast cog, you not only figure out how many hub slots your fast cogs need, but also where you are going to get them from. To make a fast cog, you must decide whether to allocate cycles sparsely across several other cogs (making it impossible to use some objects in those cogs) or allocate all cycles from specific cogs (making those cogs unusable for much else). This will depend on how often your fast cog needs hub access, and also how much jitter it can tolerate in that access (since there is now no fixed timing relationship between cog and hub), and also on how many other objects you need to be able to run. Also, if timing of your fast cog is critical, some slot allocation schemes will work ok, while others may mysteriously fail, even though they give you the same amount of hub access.

Similarly, when you are writing each library object, you have to keep in mind not just how many hub access slots this cog code needs to operate correctly (out of 3 possibilities), but also whether it needs a specific TopUsedCog value (which will affect the time between those hub access slots). A cog that has specific timing requirements (such as a video driver) may fail completely if either of these preconditions is not met, or your program may just mysteriously fail occasionally. Also, you will have to keep track of which cogs can be used to load which objects, or again your program may mysteriously fail. Alternatively, the chip itself could tell you this - but you would have to add more complexity to both the object itself, and the logic used to load them.

I can't see how all this complexity could possibly be worthwhile. If you really need the speed, I'd suggest you just use a faster chip.

Ross.

Brian Fairchild · 2014-05-06 02:20

The thing about the original P1 hub access scheme was...

1) It was simple
2) It was simple to explain in words
3) It was even simpler to draw a picture explaining how it worked

Flowing from that simplicity came things like determinism.

I have no problems with any of the schemes proposed for the P1+ so far but unless they can meet 1) - 3) above then I fear that all they will do is put off acceptance of the chip by people more used to a 'normal' architecture. I seriously doubt that Parallax will recover their costs if the only people buying the P1+ are the people in this topic (and I include myself in that).

It's long been my opinion, from the earliest mentions of the P2, that there will need to be something more than the OBEX. That there will need to be an official Parallax Peripheral Library, maybe even built right into the various IDEs, which implements the usual peripherals found on other processors.

Todd Marshall · 2014-05-06 02:36

A brief return from the peanut gallery:

I wrote a simple detailed description of a workable implementation but I proof read it so many times I lost my time slot and the work was thrown away.

Just two questions (I actually had 7 before)?
1) Why are you limiting yourself to 16 HUB slots? Try 256 or 10,000 and see how that feels.
2) Why are you requiring modulo 8 or 16 HUB rotation? Try any number less than 256 or 10,000 and see how that feels.

I had gone into an example that served:
(1) three existing legacy COGS (COG0, COG4, and COG5); unknown requirements.
(2) a new COG14 requiring 112 slot latency or more, but not more than 140 and requiring determinism.
(3) a new COG6 wanting 5 slot latency but willing to take what it can get and not requiring determinism.

The solution:
Seed COG6 first into slots 0, 5, 10, 15, ... up to slot 120 (14 legacy rotations)
Seed COG14 into slot 113 knowing COG0 will need 112
Seed COGs 0, 4, and 5 in legacy fashion: COG0 (0,8,16,...); COG4 (4,12,20,...); COG5 (5,13,21,...) up to slot 120 (14 legacy rotations)

If you make a mistake, legacy code is not broken since it is seeded last.

Cost:
Some trivial amount of HUB ram for slot table
An up timer with a reset at desired modulus to index the table (my example 120)

Assumption:
A reasonable way of selecting COGs onto the bus in 2 clocks which I'm sure is already being employed.
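A rough sketch of that seeding order in Python, for anyone who wants to poke at it. The 120-entry table, the slot numbers and the step of 8 for the legacy cogs come from the example above; the helper name `seed` and the rule that a later seed simply overwrites an earlier one are assumptions made for this illustration.

```python
MODULUS = 120   # table length; the up timer resets here

def seed(table, cog, start, step):
    """Write cog into every step-th entry from start; later seeds overwrite earlier ones."""
    for slot in range(start, len(table), step):
        table[slot] = cog

table = [None] * MODULUS             # None marks an unallocated slot
seed(table, 6, 0, 5)                 # COG6 first: every 5th slot, best effort
table[113] = 14                      # COG14: one fixed slot per 120-slot rotation
for legacy_cog in (0, 4, 5):         # legacy cogs last, so they always win
    seed(table, legacy_cog, legacy_cog, 8)

for cog in (0, 4, 5, 6, 14):
    print(cog, table.count(cog))
```

Because the legacy cogs are written last, COG6 loses every slot where its 5-step pattern collides with theirs, which is the "take what it can get" behaviour, while the legacy pattern is untouched.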

Thanks for your indulgence.

Brian Fairchild · 2014-05-06 02:59

@Todd

A few weeks ago Bill Henning (IIRC) proposed a slot allocation scheme a bit like that. Might be worth trying to find it.

As you say, there is no reason a slot allocation table can't have more entries than there are cores for instances where a core doesn't need that much memory bandwidth.

evanh · 2014-05-06 03:08

Todd Marshall wrote: »

Cost:
Some trivial amount of HUB ram for slot table

Can't be HubRAM. This slot table needs to be accessed every clock cycle and HubRAM's data bus is fully committed already.

The slot table could occupy hub address space no problem but the table would have to be constructed from its own registers or a RAM block independent of HubRAM.

jmg · 2014-05-06 03:32

Todd Marshall wrote: »

Just two questions (I actually had 7 before)?
1) Why are you limiting yourself to 16 HUB slots? Try 256 or 10,000 and see how that feels.

The proposed improvements are not limited to 16, that is merely the default.
32 is another valid number, as that allows 50% slot sharing, for 6.25MHz Bandwidth quanta

Todd Marshall wrote: »

2) Why are you requiring modulo 8 or 16 HUB rotation? Try any number less than 256 or 10,000 and see how that feels.

Again, the proposed improvements are not limited to 16, that is merely the default.
There is no reason to not allow a reload value <> 16, and that can be via a Config value, or a Priority encoder.
Both work to vary reload, note that a Priority Encoder is atomic, live, and granular

Todd Marshall wrote: »

Cost:
Some trivial amount of HUB ram for slot table
An up timer with a reset at desired modulus to index the table (my example 120)

A down scanner that reloads on Zero, is actually smaller, but the outside world does not see that implementation detail.
If you consider 10,000 slots, then that's not quite 'trivial' ram.
Certainly 16-32 nibbles is tolerable storage. 32 gives a smallest quantum of 50% of one slot - seems practical ?
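A toy model of such a down scanner, in Python rather than logic. The generator name `down_scanner`, the 32-entry table and its contents are hypothetical; the only behaviour taken from the description is "count down, reload on zero, index the table".

```python
from itertools import islice

def down_scanner(table, reload):
    """Yield the table entry selected on each clock; the counter reloads when it hits zero."""
    count = reload - 1
    while True:
        yield table[count]
        count = count - 1 if count else reload - 1

# Hypothetical 32-nibble table: entry i grants cog (i mod 16).
table = [i % 16 for i in range(32)]

first = list(islice(down_scanner(table, 16), 18))   # default reload of 16
print(first)
```

Changing `reload` to 8 or 32 changes the scan length without touching the table, which is the point of making the reload value configurable.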

The table is read as a single block by the scanner, but it does not have to be written as one.
Choices are COG level Config, via COG config load, or Global (or even both, if it is dual port).

One issue with Global is: what if multiple COGs try to change Config ?
COG level write avoids that, as each COG 'sees' only its own slot nibbles.

jmg · 2014-05-06 03:39

evanh wrote: »

Can't be HubRAM. This slot table needs to be accessed every clock cycle and HubRAM's data bus is fully committed already.

The slot table could occupy hub address space no problem but the table would have to be constructed from its own registers or a RAM block independent of HubRAM.

Agreed, it can use a 'Hub address' and even use WRQUAD etc, but physically it is written as registers, and read by scanner at SysCLK speeds.

To keep atomic control, an XOR write (via WRQUAD) would allow N nibbles to be updated in a single clock.
That makes update a Read, xor, then xor-write-quad. If WRBYTE etc are all supported, at this special address,
then COGs can send a single byte to their pair of nibbles.
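To show why an XOR write keeps things atomic, here is a small Python sketch. The layout is an assumption for illustration only: one byte per cog (two 4-bit fields), 16 bytes packed into a single 128-bit word standing in for the quad. `pack`, `cog_byte` and `xor_mask` are made-up helper names, not P1+ instructions.

```python
BYTE_BITS = 8
NUM_COGS = 16

def pack(field_a, field_b):
    """One cog's byte: field A in the low nibble, field B in the high nibble."""
    return (field_b << 4) | field_a

def cog_byte(table, cog):
    """Read one cog's byte out of the packed 128-bit table."""
    return (table >> (cog * BYTE_BITS)) & 0xFF

def xor_mask(table, cog, new_byte):
    """Read, then xor: a mask that flips only this cog's byte to new_byte."""
    return (cog_byte(table, cog) ^ new_byte) << (cog * BYTE_BITS)

# Default table: every cog owns both of its fields.
table = 0
for cog in range(NUM_COGS):
    table |= pack(cog, cog) << (cog * BYTE_BITS)

mask3 = xor_mask(table, 3, pack(3, 0))   # cog 3 keeps field A, donates field B to cog 0
mask9 = xor_mask(table, 9, pack(1, 1))   # cog 9 donates both fields to cog 1

after_3_then_9 = table ^ mask3 ^ mask9   # the xor-writes can land in either order
after_9_then_3 = table ^ mask9 ^ mask3
print(hex(cog_byte(after_3_then_9, 3)), hex(cog_byte(after_3_then_9, 9)))
```

Since each mask is zero outside the writer's own byte, two cogs updating at the same time cannot clobber each other, provided each cog only ever changes its own entry.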

Heater. · 2014-05-06 04:15

Good morning. I see it is still Groundhog Day here.

I woke up with this great idea as to how to maximize utility of the HUB access windows whilst at the same time ensuring that all COG are treated equally no matter how many are in use at any moment. It's so easy I think everyone will like it.

Let's say we have C cogs running, 0 <= C <= 16, and that at any moment N cogs are hitting a HUB operation, 0 <= N <= C. Then:

i) If N = 0 do nothing.

ii) If N >= 1 generate a random number R where 1 <= R <= N

iii) Use R to enable the HUB access of one of the cogs requesting a HUB operation.

BINGO. Job done.
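For the curious, steps i) to iii) fit in a few lines of Python. This is only a simulation sketch: the function names, the fixed seed and the "every cog requests on every slot" workload are assumptions for illustration, and a software random number generator stands in for whatever the silicon would use.

```python
import random

def arbitrate(requesting, rng):
    """Grant one hub slot: pick R in 1..N and return the R-th requesting cog."""
    n = len(requesting)
    if n == 0:
        return None                  # step i: nobody is asking, do nothing
    r = rng.randint(1, n)            # step ii: 1 <= R <= N
    return requesting[r - 1]         # step iii: that cog gets the hub

def simulate(num_cogs, slots, seed=1):
    """Count grants per cog when every cog requests the hub on every slot."""
    rng = random.Random(seed)
    grants = [0] * num_cogs
    for _ in range(slots):
        winner = arbitrate(list(range(num_cogs)), rng)
        grants[winner] += 1
    return grants

grants = simulate(num_cogs=2, slots=10_000)
print(grants)
```

No slot goes unused while anyone is asking, and the two cogs end up with roughly half each, but nothing bounds how long any single cog waits.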

Pros:

1) Maximizes overall HUB bandwidth usage.

This scheme ensures that whenever one or more cogs wants a HUB operation one of them will always get it. Or to put it another way, no HUB access slot is ever left unused. This is clearly superior to the round robin approach or any other "hub slot" mooching/sharing scheme where it cannot be guaranteed that all available hub slots are used and bandwidth is therefore wasted.

2) All COGS are equal.

In this scheme all COGs are treated equally. An overriding concern for many of us on this forum.

3) Simplicity.

From the users perspective there are no funny HUB priorities to configure. No head scratching about COG pairing. No "modes" to tweak. Nothing to think about at all except writing your program.

4) All COGs are equal.

Did I say that already?

Cons:

1) Timing determinism is thrown out the window.

You will have noticed that we have totally lost timing determinism for COGs that access HUB resources. We can no longer tell how long a RDLONG for example will take. It has been randomized. Indeed as described there is not even a guarantee that a RDLONG, for example, will ever complete!

To that I say:

a) Never mind. Who cares? The probability of "not winning the draw" becomes less and less as time goes by, eventually becoming vanishingly small. Your code will eventually get past that RDLONG.

b) The fact that this scheme maximises HUB usage implies that all programs on average run much faster. How great is that?! In the extreme you might only be running two COGs and averaged over time they each get a full 50% of HUB bandwidth. Try doing that with any other scheme that also scales painlessly to 16 COGs.

c) This lack of determinism is not much worse than all the other HUB bandwidth maximizing schemes presented. Whilst actually meeting its goal of maximizing HUB bandwidth usage, which they don't.

d) I presume (I have no idea), that a scheme could be devised that would ensure that a COG having once "won the draw" is prevented from entering the draw again until the guys remaining in the draw have had their go.

The astute reader will notice that this scheme is heavily influenced by the way Ethernet works. Ethernet has random delays for retries built into the protocol so as to resolve contention for the network cable by multiple devices. Prior to the invention of Ethernet many other schemes for sharing the network were devised, token ring for example. They were all complex, expensive, and slow. The dumb and non-deterministic Ethernet protocol won the day. I believe this scheme is the "Ethernet of the HUB" as opposed to all the other "token ring of the HUB" ideas.

My thanks go to Robert Metcalfe for the inspiration here.

Rayman · 2014-05-06 04:41

I sure hope Chip chimes in soon and ends this...

Heater. · 2014-05-06 04:45

Ray,

I was hoping my little parody would cheer everyone up

RossH · 2014-05-06 04:51

Heater. wrote: »

I was hoping my little parody would cheer everyone up

Your idea is simpler and more useful than some of the others that have been proposed!

Ross.

T Chap · 2014-05-06 04:59

Many hours of bright minds trying to find a solution to a problem over many months, and still no consensus. That means the method is flawed. Although someone could say that the hub is a limited resource, I would say the goal would not be to find a way around the round robin, but rather how to solve the limited-ness of the hub. For example, make the hub not one memory core, but 2 or 4, all being copies, each independently accessible. Then limit the number of hubs that can be accessed at one time versus the number of time slots per cog. There is of course some background management to solve for keeping all copies up to date. But it seems that less time and energy would be devoted to that scheme than has gone towards the current endless debate. I stayed at a Holiday Inn last night, so I feel like I can solve this today.

Brian Fairchild · 2014-05-06 05:05

T Chap wrote: »

...and still no consensus. That means the method is flawed.

Not really, just means that, unlike in the real world, no-one has said "Right, this is how we are going to do it. Now get on with it." That's what project managers are for.

evanh · 2014-05-06 05:09

Chip is working on other areas. In the meantime there are a few feature topics that have been going round and round. Maybe a bit too much in some cases.

However, when Chip asks for input we might have something he quickly likes.

T Chap · 2014-05-06 05:11

I hope he likes my idea of a multi-access hub.

dMajo · 2014-05-06 05:11

jmg wrote: »

I don't think so, with 200MHz HUB memory, and a 2 cycle opcode, hub write should fit into the second phase of the opcode.
(making that 10ns)

I'm not sure if Chip has given explicit opcode timings, outside the 2 cycles.

jmg, given the cog core executes at 100 MIPS, your 100MHz bandwidth to the hub comes true only if every consecutive instruction is a hub access, without any kind of logic in between, and if the hub op takes the same clocks as internal cog ones. That means that you will only have 100MHz bursts, and to allow 100MHz bursts on one cog you are going to sacrifice the bandwidth of all others. Am I wrong somewhere?

dMajo · 2014-05-06 05:22

Cluso99 wrote: »

If hub slots can be dynamically changed, then a video cog can utilise the higher bandwidth by using another cog(s) slots, and then release those slots back to the other cog(s) while it does not require them. Then these cogs will be co-operating and dynamically sharing their slots. There are plenty of video apps that would benefit from this sharing.

Why can some of you not see the benefits ???

Yes, this has a benefit, in some ways ... but ONLY if such a driver uses the two cogs and shares the global two-cog bandwidth however it wants between the two cogs. This is equivalent to current drivers using two cogs with fixed bandwidth. But it must be clearly stated that this driver uses two cogs, a two-cog driver: it will start them and, as the owner, only hack their bandwidth. It must not be allowed to expect slots/bandwidth from third/other (not owned) cogs !!!

Heater. · 2014-05-06 05:28

evanh,

However, when Chip asks for input we might have something he quickly likes.

Oh Smile, you we have to regurgitate all this again !

RossH · 2014-05-06 05:33

dMajo wrote: »

jmg, given the cog core executes at 100 MIPS, your 100MHz bandwidth to the hub comes true only if every consecutive instruction is a hub access, without any kind of logic in between, and if the hub op takes the same clocks as internal cog ones. That means that you will only have 100MHz bursts, and to allow 100MHz bursts on one cog you are going to sacrifice the bandwidth of all others. Am I wrong somewhere?

You are looking at the wrong proposal, as I did. The current proposal allocates each cog 2 hub slots, and so you can steal just one if you want. You don't have to steal them all (although you still can do so).

Ross.

evanh · 2014-05-06 05:40

Heater. wrote: »

evanh,

Oh Smile, [] we have to regurgitate all this again !

Only links. :P

ctwardell · 2014-05-06 05:40

Here is my suggestion for slot sharing.

It's a simplification of the one proposed by jmg.

No TOPCOG, hub cycle always repeats every 16 clocks.
Benefit, totally stable for objects not participating in slot sharing.

Single instruction that lets a cog donate ALL of its own slots to another cog. This may or may not be ANY other cog, that depends on details that we don't know yet, such as how close together slots can be and still be usable.
It may prove most simple to only donate to the cog at the opposite side of the normal mapping, for example 0 and 8, 1 and 9, etc.

The reason for donating ALL slots is that it is completely determinate for the recipient. If the recipient is getting the slots from an opposing cog, say 0 getting ALL the slots from 8, 0 will have a slot every 4 instruction cycles.
If 0 would only receive a portion of the slots from 8, say every other one, the pattern would be 8, 4, 4, 8, 4, 4...which may lead to the timing issues that some have mentioned as concerns.
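The two spacing patterns are easy to check with a few lines of Python. This is just arithmetic on slot positions, assuming a 16-clock hub cycle and 2-clock instructions as discussed in this thread; the helper name `gaps` is made up.

```python
HUB_CYCLE = 16          # clocks per hub rotation
CLOCKS_PER_INSTR = 2    # assumption: 2-clock instructions

def gaps(slot_clocks, period):
    """Gaps between consecutive slots, in instruction cycles, for a pattern repeating every period clocks."""
    slots = sorted(slot_clocks)
    following = slots[1:] + [slots[0] + period]
    return [(b - a) // CLOCKS_PER_INSTR for a, b in zip(slots, following)]

# Cog 0 owns the slot at clock 0; cog 8's slot is at clock 8.
all_donated = gaps([0, 8], HUB_CYCLE)             # cog 8 donates every slot
half_donated = gaps([0, 8, 16], 2 * HUB_CYCLE)    # cog 8 donates every other slot

print(all_donated)    # even spacing
print(half_donated)   # uneven spacing that repeats every two rotations
```

The full donation gives an even 4, 4 spacing; the half donation gives 4, 4, 8 repeating, which is the same 8, 4, 4 pattern seen from a different starting point.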

But what about the now slotless donor?

The command to share slots can be executed at runtime and would ideally take effect on the next round of hub cycles.

Since a cog can donate ALL of its own slots to another cog, the primary cog (the one that is the recipient from the original donor) can donate its own slots back to the secondary cog (the original donor) for a brief period.
The primary cog still has hub access from the slots it has from the secondary, and the secondary temporarily has access using the slots from the primary. This would be useful when using the secondary to perform occasional work.
The secondary would sit waiting by executing a hub read that would block until it is given hub access. When hub access is granted it grabs what it needs from the hub and writes back some value that can be checked by the primary.
When the primary recognizes that the secondary is done using the hub it can stop donating hub slots to the secondary.

This allows the secondary to still get some hub usage, but at the control of the primary.

For cases when the primary does not need fully determinate access to donated slots it could be left to the secondary to occasionally revoke donation when it needs the hub.

Key points:

A cog can only donate and revoke, it cannot steal.
Determinate
Simple

Open items that need input from Chip since he is the only one that knows the full underlying timing and structure.

Because of timing differences what hub slots are compatible with what cogs?
For example, can even numbered cogs use odd numbered slots and vice versa?
This helps determine limits on what cogs can be the recipient for a given donor.

What is the minimum spacing between slots that can be used by a cog?
For example, can a cog do a hub operation on every instruction, every other, every fourth, etc.?
This will determine if a recipient can receive donations from more than one donor.

What I would personally like to see to keep it dead simple:

Can only donate/receive within opposing pairs, 0-8, 1-9, etc.

C.W.
