The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

kwinn · 2014-05-05 16:35

JRetSapDoog wrote: »

.............

In a nutshell, the scheme involves dividing the hub into 16 banks, wherein each cog accesses one bank at a time on a rotating (hub) basis, with all of the cogs doing so simultaneously. It also involves or necessitates accessing memory in a spaced-apart manner to make access flow. That's the gist............

Please, no bank switching. Writing software for any processor that used bank switching turned out to be problematic compared to using a flat memory model. Shared ram should all be shared equally as one flat memory space, not broken into little chunks that can only be accessed at certain times or by certain cogs. Better (and probably simpler) to find some way of assigning unused hub slots to cogs that need more access.

jmg · 2014-05-05 16:38

RossH wrote: »

As far as I know, Cluso's "pairing" scheme is the only one that both preserves determinism, and also makes it explicit that you are "stealing" hub cycles from a specific cog

The table approach is also fully deterministic, but more flexible as 4 bits can change, not just one.

Otherwise, they are identical, in that another cog is given this cog's slot.

Pairing (one bit change) cannot allocate many slots to one COG, but the simple superset of table (4 bits) can allocate many scan slots to the same COG (up to all of them, tho 50% to 2 COGS is more practical, given the 100MOP COG speed)
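The difference between the two mappings can be sketched in a few lines of Python. This is only an illustration of the idea as described in the thread, not the actual hardware: the 16-entry table, the helper names, and the choice of the top ID bit for pairing (donor XOR 8) are all assumptions made for the example.

```python
from collections import Counter

NUM_SLOTS = 16  # one hub slot per cog in the default round-robin scan


def default_table():
    # Slot n belongs to cog n: plain round-robin.
    return list(range(NUM_SLOTS))


def pair_donate(table, donor):
    # Pairing: the donor's slot goes to the one cog whose ID differs in a
    # single bit (ASSUMED here to be the top bit, i.e. donor XOR 8).
    new = list(table)
    new[donor] = donor ^ 8
    return new


def table_assign(table, slot, cog):
    # Table: any 4-bit cog ID may be written into any slot.
    new = list(table)
    new[slot] = cog & 0xF
    return new


def slots_per_cog(table):
    # How many of the 16 scan slots each cog ends up owning.
    return Counter(table)
```

With pairing, `pair_donate(default_table(), 8)` leaves cog 0 with two slots and cog 8 with none, and that is the most any cog can gain. With the table, calling `table_assign` on every even slot gives cog 0 eight of the sixteen slots, the 50% case mentioned above.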

jmg · 2014-05-05 16:41

RossH wrote: »

Ok. I think I understand ... but now that I do, I see that your scheme also seems to have the property that the performance of any cog is unpredictable unless you know (in advance) what will be running in every other cog.

Nope, study the priority encoder a little more carefully.

Once you have chosen a desired system design COG count, the design is stable.
That count can be 16, or 8, or 6, or 5 or 3 etc and can include unused COGs during development & test.

RossH · 2014-05-05 16:44

jmg wrote: »

Nope, study the priority encoder a little more carefully.

Once you have chosen a desired system design COG count, the design is stable.
That count can be 16, or 8, or 6, or 5 or 3 etc and can include unused COGs during development & test.

But what happens when you need to add a cog?

RossH · 2014-05-05 16:53

jmg wrote: »

The table approach is also fully deterministic, but more flexible as 4 bits can change, not just one.

Otherwise, they are identical, in that another cog is given this cog's slot.

Pairing (one bit change) cannot allocate many slots to one COG, but the simple superset of table (4 bits) can allocate many scan slots to the same COG (up to all of them, tho 50% to 2 COGS is more practical, given the 100MOP COG speed)

Any scheme that can make an arbitrary number of other cogs unusable would be too difficult to manage. Cluso's pairing scheme can only reduce the number of available useful cogs to a minimum of 8 (which is as many as we have now on the P1). Your scheme reduces this number to 1 (or perhaps 2) with the rest of the cogs unusable for most purposes. What would be the point of having a Propeller with a fast main thread that could not use any other cogs? How would it communicate with other hardware in the system without cogs?

Ross.

jmg · 2014-05-05 16:58

RossH wrote: »

But what happens when you need to add a cog?

Usually that simply allocates one of the spares in your chosen set.

If your boss moves the goal posts so much that a radical re-write is needed, then of course, you will revise the total count.

Anyone who really has no idea how many COGS they might use, would use Count 16
- you only need to reduce this in special cases. (which is why it defaults off)

The combination of each COG being able to allocate-to-CogID its Slot, and the total active COGS covers all use cases.

jmg · 2014-05-05 17:08

RossH wrote: »

Any scheme that can make an arbitrary number of other cogs unusable would be too difficult to manage.

Why ? people do that now with P1 - you do not HAVE to use all COGS ! You can choose any number you like.

RossH wrote:

Cluso's pairing scheme can only reduce the number of available useful cogs to a minimum of 8 (which is as many as we have now on the P1). Your scheme reduces this number to 1 (or perhaps 2) with the rest of the cogs unusable for most purposes.

Err, no. The pairing scheme simply flips a single bit, which limits mapping choices, the table(nibble) version uses 4 bits to map any ID it likes. Otherwise they are identical.

There is no implicit reduction in the number of COGS anywhere in either of these bit-widths, that is a separate optional feature.

As I have explained above, an optional scan modulus is needed to give determinism (jitter free) slot scans where a locked 16 cannot.

jmg · 2014-05-05 17:16

JRetSapDoog wrote: »

....
But first, a look back: In another thread (I believe), I mentioned the obvious possibility of each cog having its own separate bank of memory, with banks being assignable, such that individual cogs could get maximum bandwidth since they wouldn't need to share. In other words, that would be a non-hub approach....

A key HUB memory feature is ANY cog can scan all of it. Large buffers will be common, as will large code.

Also the present design calls for 200MHz 512K SRAM - ie 5ns access times, including the Scan Muxes.
That is pushing the process, and there is not room for any extra delays.

There have been suggestions to expand Local COG memory, yes, that would be a logical step if there was surplus HUB memory, but I don't think 512K is quite at that 'surplus' threshold. More HUB would be nice.

RossH · 2014-05-05 17:26

jmg wrote: »

Why ? people do that now with P1 - you do not HAVE to use all COGS ! You can choose any number you like.

But you are losing the property (which we have now) that you don't need to worry about how many cogs you might need - unless you run out of course, and even if you do (as we all probably have done) the problem is very easy to solve - precisely because the Propeller guarantees that all cogs are equal.

For instance, in one program I wrote I ran out of cogs, so I re-implemented the RTC and the SD drivers to run in the same cog. I didn't have to touch the other cogs, or worry about the consequences for the rest of the program, because the Propeller virtually guarantees everything else will continue to work correctly.

jmg wrote: »

Anyone who really has no idea how many COGS they might use, would use Count 16

Yes, I really think this is the only option under your scheme for anything critical. Perhaps there is scope near the end of a project to mark some cogs as unused in order to speed up other cogs - but since this would potentially change the behavior of all other cogs, this would then require you to re-test everything.

Which is why I say that schemes of this complexity are really useful only in edge cases. Most people will see little benefit, and (potentially) lots of problems.

Ross.

kwinn · 2014-05-05 17:35

RossH wrote: »

Any scheme that can make an arbitrary number of other cogs unusable would be too difficult to manage. Cluso's pairing scheme can only reduce the number of available useful cogs to a minimum of 8 (which is as many as we have now on the P1). Your scheme reduces this number to 1 (or perhaps 2) with the rest of the cogs unusable for most purposes. What would be the point of having a Propeller with a fast main thread that could not use any other cogs? How would it communicate with other hardware in the system without cogs?

Ross.

I can see how a cog can be used so that it does not need to communicate with the hub once it is started since one of my projects does that. It simply reads in some state bits and outputs timing pulses (or not) based on that state.

What I am wondering is what is the maximum rate a cog program could write data to the hub if it had unlimited access. Would that still be 8 cycles? If so would it be possible for a cog to write/read to/from the hub that fast?

IOW, what's the fastest hub speed we need to give to a cog? Could we provide that by having 2 or more cogs give up half their hub bandwidth?

mark · 2014-05-05 17:44

RossH wrote: »

The problem with any scheme that gives some cogs "extra" access to the hub is that the extra access has to come from the allocation normally granted to other cogs. Which means the objects that can be loaded and run in those other cogs are now limited - and programs that don't take this into account can fail without obvious reasons, or (worse!) behave differently on the bench than they do in the field.

Of course some cogs are going to get hub access more frequently at the expense of others (but in my proposal, they will always get guaranteed access at some deterministic rate). That's just the reality for any of these methods given the maximum ram speed available. In an earlier post I suggested that cog objects specify their hub ram latency (Seairth's terminology, which is fitting) requirements which the development tools can use to warn the programmer of conflicts at compile time. This way, as long as cog objects are running in cogs with appropriate hub ram latency, they will work properly (which can't be guaranteed in the mooching scheme), and cogs won't potentially get wasted as in the slot sharing method due to needing all that cog's hub access.
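A compile-time check of that kind could look roughly like the sketch below. Everything in it is hypothetical: the function name, the data layout, and the idea of expressing latency as "clocks between hub windows" are assumptions made for illustration, since no tool format has been specified in the thread.

```python
def check_latency(requirements, cog_latency):
    """Return warnings for objects whose hub latency need is not met.

    requirements: object name -> (cog it runs in, max clocks between hub windows)
    cog_latency:  cog number  -> clocks between hub windows in the chosen mode
    Both structures are hypothetical; no real tool format is implied.
    """
    warnings = []
    for name, (cog, max_clocks) in requirements.items():
        actual = cog_latency[cog]
        if actual > max_clocks:
            warnings.append(
                f"{name} in cog {cog} needs a hub window every "
                f"{max_clocks} clocks but gets one every {actual}"
            )
    return warnings
```

For example, an object that declares it needs a window every 16 clocks would be flagged if it were placed in a cog that only gets one every 32 clocks in the selected mode, while an object that tolerates 64 clocks would pass in either cog.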

RossH wrote: »

Mooching schemes have a slightly different problem, but with the same potentially unpleasant consequences - i.e. programs will run differently depending on what is loaded into the other cogs, and also on what those other cogs are presently doing - so they also may behave differently on the bench than they do in the field.

Personally, I feel mooching is the most dangerous of all. If any of the proposed methods are to cause problems with OBEX objects or break at any time, this would be it. There's no bandwidth guarantee for at least one of the cogs, and therefore is the most antithetical to the concept of the propeller and determinism. I would rather have none of the options than have this - wayyyy too dangerous.

RossH wrote: »

As far as I know, Cluso's "pairing" scheme is the only one that both preserves determinism, and also makes it explicit that you are "stealing" hub cycles from a specific cog, and (hopefully!) also marks that cog as unsuitable for other general-purpose use so that it would never be used by a subsequent COGNEW (but it could still be explicitly used by a COGINIT).

Ross.

Again, what I've proposed preserves determinism too, just not hub access every 16 clocks - it could be higher for some and lower for others depending on the mode, but the rate will ALWAYS be the same in any given mode for each respective cog. Cluso's method is safe, and it guarantees hub access at least every 16 clocks, but at the expense of wasting a cog, unless you can give said cog something to do that doesn't require hub access. It also introduces some tricky situations, because the cog you take cycles from determines the phasing of hub ram access. In most cases, you would go for a cog whose hub window cycle is 180 degrees out of phase with your own, but this might pose further problems if cogs indeed end up only being given access to 4 specific pins for fast DACs. There's a lot of balancing going on here.

I'll mention it again, as it seems my first post in this thread was glanced over, but I think a small shared hub-based cache -- perhaps 32 to 128 bytes or so -- of 4-port memory operating at 200MHz providing an access window to all cogs every 2 instructions (4 clocks), would be immensely useful where we're more concerned with shuffling data between cogs than we are with moving large data or running code from hub ram. With such a cache, some cog objects could get away with having reduced (or 0 in Cluso's proposal) hub ram access.

RossH · 2014-05-05 17:48

kwinn wrote: »

I can see how a cog can be used so that it does not need to communicate with the hub once it is started since one of my projects does that. It simply reads in some state bits and outputs timing pulses (or not) based on that state.

Yes, but it's quite a rare case that a cog doesn't need ANY hub access. You'd probably be lucky to find there was even a single such cog in most projects.

kwinn wrote: »

What I am wondering is what is the maximum rate a cog program could write data to the hub if it had unlimited access. Would that still be 8 cycles? If so would it be possible for a cog to write/read to/from the hub that fast?

IOW, what's the fastest hub speed we need to give to a cog? Could we provide that by having 2 or more cogs give up half their hub bandwidth?

I think the time required to execute each hub instruction is why jmg is assuming you could only give a single cog half the available bandwidth at most. So you could have two "fast" cogs, but not one "superfast" one.

Ross.

RossH · 2014-05-05 17:59

[QUOTE=mark

jmg · 2014-05-05 18:07

RossH wrote: »

But you are losing the property...

WRONG, nothing is lost at all - but I think you corrected it further down

RossH wrote: »

Which is why I say that schemes of this complexity are really useful only in edge cases. Most people will see little benefit, and (potentially) lots of problems.

Which is why it defaults to OFF.

It is not complex, but it certainly is useful in edge cases, and there, those users will see huge benefit as it fixes determinism.

Anyone else can blithely ignore it, just like they ignore most Counter modes. ( It appears as a single control bit, in config )

jmg · 2014-05-05 18:12

RossH wrote: »

Yes, but it's quite a rare case that a cog doesn't need ANY hub access. You'd probably be lucky to find there was even a single such cog in most projects.

This may become more common, with new bitop opcodes, and likely buried pins, as well as possible smart pin serial links.

It is also perhaps better to state it as cogs often do not need HUB access all the time.
That's why I implemented a boolean called UsesHub, to signal to the scanner when HUB access is (not) needed.

jmg · 2014-05-05 18:19

kwinn wrote: »

What I am wondering is what is the maximum rate a cog program could write data to the hub if it had unlimited access. Would that still be 8 cycles? If so would it be possible for a cog to write/read to/from the hub that fast?

Not quite sure what you are asking, but the HUB SRAM runs at 200MHz, with 5ns access times.(impressive)
COGS run at 100MOPs, so they interleave HUB access. (A Fixed 1:16 means a 12.5MHz slot rate)
The RAM can see 200MHz data, but any single COG is 100MOP limited.

dMajo · 2014-05-05 18:31

Cluso99 wrote: »

dMajo,
I think you have not fully understood the discussion.
By utilising cog slots in a better fashion, we can get faster cogs and still retain fully deterministic behaviour in both the faster cog(s) and also in the cog(s) donating their slots (if they are even used).
None of this prevents using specific pins because the user can determine which cogs use these additional features, and of course the default behaviour is these added features are disabled. So it is the user's choice.

The mooching seems to be a much more complex (hardware) implementation, and of course there are no guarantees of additional slots, and determinism is lost for those mooching cog(s). Then there is the problem of more than one moocher, and which moocher gets priority.
Therefore I see that mooching should be limited to any 1 cog, or any 2 cogs where one gets additional even slots and the other odd slots. This would simplify the logic significantly.

Given the choice, I would opt for slot configuration / slot pairing over mooching because it can be done deterministically because we now have 16 cogs.

No, Cluso, I have fully understood the thing, but it seems you can't understand me.

I agree with your proposal, in the sense that it's correct. The thing is that I want the additional bandwidth to be available only non-deterministically, so that it can be used only at application level.
I am afraid that if you give the opportunity to have higher bandwidth deterministically, sooner or later everyone will start developing the building blocks (drivers) using it too, and this will lead to wasted cogs just to have them donate the slot to allow a certain driver to be used.
What will happen if you have eg quadspi (but using 2 cogs because of bandwidth), fat16/32 FS on perhaps 4bit SD (again using 2 cogs because everyone wants read/write speed) ... and so on.
I am not opposed just because of ... but I love the p1 obex and it's enough that we all agree that if we go down this route the new p2 obex will never become like the current one.

jmg · 2014-05-05 18:52

dMajo wrote: »

... and this will lead to wasted cogs just to have them donate the slot to allow a certain driver to be used.

Because the HUB RAM is multiplexed, slot transfer is the only way to increase bandwidth, but COGS not using their slots are not always wasted.
There will be more ways for COGs to communicate.

Cluso also has a twin table proposal (originally intended for mooch), that can be morphed easily into State Scanner to allow 50% or 100% donate.
Now, the Donor gets a 6.25MHz access slot rate, whilst bumping the receptor an additional +6.25MHz on average.

In the limit case, 15 COGS can run at 6.25MHz and one at 100MHz - & none have 'wasted' levels of Bandwidth.
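The rates quoted here follow from simple arithmetic, sketched below. The sketch assumes a 32-entry scan table (which is what makes one entry worth 200MHz / 32 = 6.25MHz) and caps each cog at the 100 MOPs figure given earlier in the thread; the names are made up for the example.

```python
HUB_CLOCK_MHZ = 200.0   # hub SRAM rate quoted in the thread
COG_LIMIT_MHZ = 100.0   # a single cog cannot issue more than 100 MOPs
TABLE_ENTRIES = 32      # ASSUMED twin-table length (2 entries per cog by default)


def rate_mhz(entries_owned):
    # Hub access rate for a cog that owns this many entries of the scan table,
    # capped at what one cog can actually issue.
    raw = HUB_CLOCK_MHZ * entries_owned / TABLE_ENTRIES
    return min(raw, COG_LIMIT_MHZ)
```

A cog with its default two entries gets `rate_mhz(2)`, i.e. 12.5MHz. A 50% donor keeps one entry (6.25MHz) and its receiver holds three (18.75MHz). If 15 cogs each donate one entry to the same cog, that cog holds 17 entries, which would be 106.25MHz of raw slots but is capped to 100MHz by the cog itself.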

JRetSapDoog · 2014-05-05 19:01

kwinn wrote: »

Please, no bank switching. Writing software for any processor that used bank switching turned out to be problematic compared to using a flat memory model. Shared ram should all be shared equally as one flat memory space, not broken into little chunks that can only be accessed at certain times or by certain cogs. Better (and probably simpler) to find some way of assigning unused hub slots to cogs that need more access.

I think I agree with your conclusion that it's better/simpler to go with another way (I guessed so from the inception). However, I'm sort of doubtful that the scheme I brought up (not advocated for) can be considered to be traditional bank switching. The memory, though broken up into banks of a sort, would appear flat to the user due to the way the hardware would automatically span the banks. And all cogs could access any part of memory, but cogs might have to wait to do so (particularly for random access as opposed to sustained access), similar to how the current round-robin access imposes waits. But if reading or writing a "stream" of data (logically contiguous but not physically so), no waiting would be needed (except for a likely initial wait to start the "ball" rolling), and all cogs could do so simultaneously. The scheme doesn't allow for multiple cogs writing the same memory locations at the same time, but that's probably a feature. Anyway, all of that is assuming that I haven't made a mistake/overlooked something, a big and likely wrong assumption.
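To make the rotating cog-bank pairing a little more concrete, here is a small Python model. It is a sketch under assumptions, not a description of any real design: it assumes the bank is selected by the low four bits of the address (so logically consecutive locations fall in consecutive banks) and that each cog steps to the next bank on every clock.

```python
NUM_COGS = 16
NUM_BANKS = 16


def bank_for(cog, clock):
    # On each clock every cog is connected to a different bank.
    return (cog + clock) % NUM_BANKS


def bank_of_address(addr):
    # ASSUMED interleave: consecutive addresses fall in consecutive banks.
    return addr % NUM_BANKS


def wait_clocks(cog, clock, addr):
    # Clocks a cog must wait until the bank holding addr rotates round to it.
    return (bank_of_address(addr) - bank_for(cog, clock)) % NUM_BANKS
```

In this model a random access costs anywhere from 0 to 15 clocks of waiting, but once a cog has synchronised to a stream, each following address is in the bank it is connected to on the next clock, so the wait stays at zero. All 16 cogs can stream at once because no two cogs are ever on the same bank in the same clock.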

But it seems to be an outlandish idea/scheme and is probably fraught with all kinds of perils or caveats that I haven't considered (or downright mistakes that make it impossible), even if the bus switching stuff to establish the 16 concurrent cog-bank pairs (done on a rolling basis) could be implemented in hardware (as, in addition to the large size, the switching mechanisms might not be able to switch fast enough to support a high throughput). I just threw the scheme out there because it seemed different. If I escape with only your kind reply post on this matter, I'll count myself quite lucky (I probably shouldn't have even responded to yours for fear of triggering additional replies). Still, thanks for your insight and for being gentle. But just to be safe, I'm going to go crawl under a rock now for safety.

kwinn · 2014-05-05 19:14

What I am trying to get a handle on is the absolute maximum number of hub access slots that a single cog could make use of.

So the cog takes 10nS to execute 1 instruction and can execute a hub read or write operation once every 80nS.

If the hub read/write timing is the same as the P1 (relatively speaking) then it would take 20nS for the hub access. Is this correct?

If that is the case a cog using 4 hub access slots would not have time to execute any other instructions, so is pretty much useless.

A cog that only used its own 1/16 hub access slot would have time for 6 instructions between each hub access, which is more useful.

If we somehow doubled the hub access rate there would only be time for 2 instructions between each hub access, and only if those access slots were equally spaced in time. This may still be useful for some things.
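The budget worked out above can be written as a small calculation. It uses only the figures from this post (10ns per instruction, one hub window every 80ns at the fixed 1:16 rate) and assumes, as the post does, that extra slots are evenly spaced; the function name is made up for the example.

```python
INSTR_NS = 10        # one cog instruction
BASE_WINDOW_NS = 80  # one hub window per cog at the fixed 1:16 rate


def free_instructions(slots_used, hub_access_ns):
    # Instructions that fit between hub accesses when a cog uses
    # `slots_used` evenly spaced slots out of every 16.
    window_ns = BASE_WINDOW_NS / slots_used
    spare_ns = window_ns - hub_access_ns
    return max(0, int(spare_ns // INSTR_NS))
```

With a 20ns hub access this gives 6 instructions for one slot, 2 for two slots and 0 for four slots, matching the numbers above. If the hub access turns out to take only 10ns, the same function gives 7, 3 and 1 for one, two and four slots, and 0 for eight.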

Any point in having a single instruction 3 hub access?

jmg wrote: »

Not quite sure what you are asking, but the HUB SRAM runs at 200MHz, with 5ns access times.(impressive)
COGS run at 100MOPs, so they interleave HUB access. (A Fixed 1:16 means a 12.5MHz slot rate)
The RAM can see 200MHz data, but any single COG is 100MOP limited.

RossH · 2014-05-05 19:18

jmg wrote: »

In the limit case, 15 COGS can run at 6.25MHz and one at 100MHz - & none have 'wasted' levels of Bandwidth.

Yes, you could indeed do that - but why would you? What use would it be? You have to remember that you can buy - right now for around $10 - a 200Mhz Cortex-M4 with two Cortex-M0 coprocessors, 282Kb SRAM, hardware floating point plus configurable hardware interfaces to SDRAM, Ethernet, USB, LCD, DAC, SD/MMC, 4xUARTs, GPIO, CAN, SPI, I2C etc etc - which would be a far cheaper and faster solution to almost any problem than such an "asymmetric" Propeller would be.

So remind me again why this scheme is worthwhile?

Ross.

kwinn · 2014-05-05 19:40

@JRetSapDoog

No need to hide, it's just that I have dealt with processors that had banked memory, and it made for some very tricky programming and hardware at times. Having 8 cogs trying to access that same memory can only make things worse, even if there is hardware to prevent multiple access to the same block at the same time. On top of that, having to wait for access to some blocks but not others would be really bad for deterministic timing. Waiting for hub access is bad enough, but at least you can synchronize the cog code to that.

jmg · 2014-05-05 20:12

RossH wrote: »

Yes, you could indeed do that - but why would you? What use would it be?

Do you really not see any benefit in moving from 12.5MHz to 100MHz ?

RossH wrote: »

So remind me again why this scheme is worthwhile?

I'll have to admit, to someone who thinks 12.5MHz is the same as 100MHz, it probably is not worthwhile.
Those lucky users can stay on 12.5MHz, and never worry.

For the others, when bandwidth matters, it matters.

jmg · 2014-05-05 20:16

kwinn wrote: »

If the hub read/write timing is the same as the P1 (relatively speaking) then it would take 20nS for the hub access. Is this correct?

I don't think so, with 200MHz HUB memory, and a 2 cycle opcode, hub write should fit into the second phase of the opcode.
(making that 10ns)

I'm not sure if Chip has given explicit opcode timings, outside the 2 cycles.

Cluso99 · 2014-05-05 20:18

[QUOTE=mark

Seairth · 2014-05-05 20:46

jmg wrote: »

I don't think so, with 200MHz HUB memory, and a 2 cycle opcode, hub write should fit into the second phase of the opcode.
(making that 10ns)

I'm not sure if Chip has given explicit opcode timings, outside the 2 cycles.

Instructions are definitely taking 4 clock cycles minimum (see post #1191). They just overlap by two cycles, giving an effective speed of two clocks per instruction.

Guessing that Chip will use the same hub access technique he developed for P2, writes will take 4 clock cycles (2 clocks effective) and reads will take 6 clock cycles (4 clocks effective, stalling the next instruction for two clock cycles). That would result in 6-7 instruction slots/cycles between each hub window.
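The 6-7 figure can be checked with a quick calculation. The clock counts are the guesses stated above (2 effective clocks for an ordinary instruction or a hub write, 4 for a hub read) and the 16-clock hub window is the fixed 1:16 scan; the names are made up for the example.

```python
HUB_WINDOW_CLOCKS = 16  # one hub window per cog per 16 clocks

# Effective clocks per instruction, as guessed in the post above.
EFFECTIVE_CLOCKS = {"normal": 2, "hub_write": 2, "hub_read": 4}


def instructions_between_windows(hub_op):
    # Ordinary instructions that fit in one hub window after the hub op itself.
    remaining = HUB_WINDOW_CLOCKS - EFFECTIVE_CLOCKS[hub_op]
    return remaining // EFFECTIVE_CLOCKS["normal"]
```

`instructions_between_windows("hub_write")` gives 7 and `instructions_between_windows("hub_read")` gives 6.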

RossH · 2014-05-05 21:07

jmg wrote: »

Do you really not see any benefit in moving from 12.5MHz to 100MHz ?

No. Because the benefits are not particularly worthwhile. They do not make the Propeller into a 100Mhz chip, because if you did run one cog at 100Mhz, you can't do much with it! Unlike its competitors, the Propeller has no hardware peripherals, so you can have one cog running at 100Mhz, and the other 15 cogs are wasted since you cannot use them to run any of the OBEX objects (or whatever the P16X32B equivalents are) that are the Propeller's equivalent to hardware peripherals.

So the P16X32B will fail if you try and market it as a 100Mhz processor, because there are cheaper and faster processors already available that can do what you are trying to do - and they do it much better!

jmg wrote: »

I'll have to admit, to someone who thinks 12.5MHz is the same as 100MHz, it probably is not worthwhile.
Those lucky users can stay on 12.5MHz, and never worry.

For the others, when bandwidth matters, it matters.

The trouble is that the higher bandwidths you think you are getting are illusory - at least for the applications the Propeller is good at.

Only your low end use cases - e.g. having 8 cogs run at 25Mhz - have any general utility at all. This one is similar to Cluso's "paired cogs" proposal.

The rest are edge cases.

Ross.

kwinn · 2014-05-05 21:29

If the hub read/write only takes 10nS for the hub access it changes the equation a bit, but in the end it is still limited by the cog speed.

A cog using 8 hub access slots would not have time to execute any other instructions, so is pretty much useless except possibly for hubex. Even hubex would not really need more than 4 slots if it were to fetch quads.

A cog that only used its own 1/16 hub access slot would have time for 7 instructions between each hub access, and at best a cog that executed one instruction per hub access could only make use of 4 hub slots. Even this would require hardware repeat of the 2 instructions.

It would only be necessary for some cogs to donate half of their hub slots to make this very useful, and the silicon to implement it should be tiny. Something in the order of a 32x4 lut and a 5 bit counter.
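A behavioural model of such a LUT and counter is short enough to write out. This is only a sketch of the idea: the default LUT contents (plain round-robin repeated twice) and the rule for which entry a donor gives away are assumptions, and the function names are made up.

```python
def make_lut():
    # 32 entries of 4 bits: default is plain round-robin, twice per scan.
    return [i % 16 for i in range(32)]


def donate_half(lut, donor, receiver):
    # ASSUMED rule: the donor keeps its first entry and gives away its second.
    new = list(lut)
    new[16 + donor] = receiver & 0xF
    return new


def scan(lut, clocks):
    # A 5-bit counter steps through the LUT; yields the cog granted each clock.
    counter = 0
    for _ in range(clocks):
        yield lut[counter]
        counter = (counter + 1) & 0x1F
```

If cogs 1 and 2 each donate half their slots to cog 0, a full 32-clock scan grants cog 0 four accesses, cogs 1 and 2 one each, and every other cog its usual two. Note that with this simple donation rule the receiver's slots are not evenly spaced, which matters for how many instructions fit between accesses.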

jmg wrote: »

I don't think so, with 200MHz HUB memory, and a 2 cycle opcode, hub write should fit into the second phase of the opcode.
(making that 10ns)

I'm not sure if Chip has given explicit opcode timings, outside the 2 cycles.

jmg · 2014-05-05 21:45

RossH wrote: »

.... and the other 15 cogs are wasted since you cannot use them run any of the OBEX objects (or whatever the P16X32B equivalents are) that are the Propeller's equivalent to hardware peripherals.

Wow, a sweeping claim, that's simply not true : A 6.25MHz HUB Slot BW will be plenty for a very large number of applications.

You also seem to presume I want a 100MHz CPU, yet my test benchmark cases are for Burst Data Capture/Stimulus generate - something a Prop certainly SHOULD be able to be good at, if it is allowed to get there.

Just proves how one designer's edge cases are another designer's bread and butter !

koehler · 2014-05-05 21:49

Actually, the ability to donate 1/2 hub seems like a better, deterministic mode than mooching.

1 primary Core with 2 50% donors would still double hub for the primary, while not leaving the other 2 Cores hubless or 'useless'.

kwinn wrote: »

If the hub read/write only takes 10nS for the hub access it changes the equation a bit, but in the end it is still limited by the cog speed.

A cog using 8 hub access slots would not have time to execute any other instructions, so is pretty much useless except possibly for hubex. Even hubex would not really need more than 4 slots if it were to fetch quads.

A cog that only used its own 1/16 hub access slot would have time for 7 instructions between each hub access, and at best a cog that executed one instruction per hub access could only make use of 4 hub slots. Even this would require hardware repeat of the 2 instructions.

It would only be necessary for some cogs to donate half of their hub slots to make this very useful, and the silicon to implement it should be tiny. Something in the order of a 32x4 lut and a 5 bit counter.
