P16X32B SuperCog

RossH · 2014-05-08 05:08

Since everyone seems to have taken to starting a new thread for each "hub access sharing" scheme, I can't let Heater's dazzlingy simple, yet brilliantly practical notion pass without also having its own thread ...

Basically, Heater's idea can be summed up as follows:

Cog 0 is a "SuperCog", which (in addition to its own hub access window) also gets to use the hub access of any other cog that is not executing a hub instruction in its hub access window.
All the other 15 cogs operate identically to the way they normally would - even if cog 0 is not running. Absolute determinism and hub-cog synchronicity is maintained for the chip as a whole, and also for cogs 1 .. 15 individually.
Cog 0 can run at full speed if there are sufficient unused cogs (or running cogs not using their hub access window), but it slows down to normal cog speed if all other cogs are utilizing their own hub access windows. No cog has to donate hub access slots - it all happens automatically.
All cog objects remain identical - no need to know how many hub access cycles each one needs or gets, and then have try and figure out if they can all operate together. Plug and play is back!

One of the most attractive features of this scheme is that cog 0 is already special - because it is always the first cog loaded on reset, and it would normally be the cog that executes the Spin interpreter, or a HubExec or LMM-based kernel used by high-level languages - precisely the kind of usage that would benefit most from this scheme - and you don't even need to do anything special to use it! No new registers, no new instructions, and no complex modes that have to be considered, calculated, set or reset. The only change proposed to the current instruction set is that (in order to prevent someone "accidentally" loading cog 0), the COGNEW instruction would be modified to only load cogs 1 .. 15. To load cog 0, you must either reset the Propeller, or use a COGINIT that explicitly specifies cog 0.

A beautifully simple scheme (in fact, possibly the simplest scheme) that gives people what they want most (i.e. a fast cog for executing special application code or a high-level language) while retaining all the determinism, simplicity and orthogonality we have learned to love in the P1.

I think Heater's idea is awsome! What do you think?

Ross.

ctwardell · 2014-05-08 05:25

Make it two SuperCogs and I'll consider it better than not getting any form of slot sharing at all. a reasonable approach.

The reason for two is that I'm pretty sure odd/even slots need to stay with odd/even cogs due to timing.

So cog 0 gets unused evens, cog 1 gets unused odds.

C.W.

Heater. · 2014-05-08 05:26

I'm not sure that the "super cog idea" is actually original to me. Still feel free to attach my name to it so I can catch all the tomatoes and **** that will be thrown at the concept.

What do I think?

I think Heater's idea is awesome!

RossH · 2014-05-08 05:31

ctwardell wrote: »

... I'm pretty sure odd/even slots need to stay with odd/even cogs due to timing ...

I think I'd prefer just one SuperCog, even if it only got access to half the slots. Does anyone know whether this odd/even timing thing is true or not?

Ross.

RossH · 2014-05-08 05:33

Heater. wrote: »

feel free to attach my name to it so I can catch all the tomatoes and **** that will be thrown at the concept.

Why else do you think I was careful to attribute it to you?

ctwardell · 2014-05-08 05:46

RossH wrote: »

I think I'd prefer just one SuperCog, even if it only got access to half the slots. Does anyone know whether this odd/even timing thing is true or not?

Ross.

We don't have enough specifics from Chip to know for sure.

It is currently an assumption based on the cogs executing instructions at half the hub rate. You couldn't give a cog more than 8 slots within a trip around the robin yet 16 slots occur during that period, so it is assumed half the cogs get an even slot and half an odd one.
We don't really know if that relationship matches the cog number.

C.W.

Todd Marshall · 2014-05-08 07:33

Seed COG0 into every hub slot and you have your super cog. What's the big deal?

Heater. · 2014-05-08 08:16

The big deal is that we don't know what you are talking about when you say: "Seed COG0 into every hub slot"
We just run our code and it works.

jazzed · 2014-05-08 08:22

Do you really believe that Chip will let go of "beautiful symmetry" to allow a super cog?

Yes, it has been proposed before.

Heater. · 2014-05-08 08:28

Might do, there is more symmetry in the one super-cog and 15 others all the same than 16 cogs all subject to different slot allocations and modes and tweaks.

ErNa · 2014-05-08 08:49

I believe, as Chip didn't pop up for several days, that decisions are made and we have to see, what the next chip will offer. Time is better invested in infrastructure, communication protocols and standardization of objects. I missed cogs and memory and there was the question: do you want more cogs or memory, long ago. Now we are close to getting both! Never dared to ask. And we discuss over and over hub round robbing and not only we discuss, we rob round and round. Maybe we need a lap counter for ever new old solutions ;-)

Heater. · 2014-05-08 09:02

ErNa,

Yes indeed. It's "Groundhog Day" here until Chip makes a reappearance.

mark · 2014-05-08 09:16

If the "business code" doesn't typically require maximum performance, then what is even the point of having such a SuperCog? Also, it's useless for anything that actually could use the additional bandwidth since it can't be guaranteed. I don't see what problem this proposal is solving.

Now, there may be something to Heater's thoughts on reducing the cog count. A 9 cog device, where one is a so-called "SuperCog", would enable cog0 to have guaranteed hub access every instruction, while the other 8 would have it every 8 instructions (just as they're currently planned to). Of course, you could instead go with 10 cogs, where two are super cogs which have access every other instruction, and the remainder still get it every 8.

But that brings up an important issue: the cog's architecture isn't the most efficient design for running code outside of its cog ram space. Because of this, giving one cog so much bandwidth might in reality be a waste, but at what point is that so? Hub access every 2 instructions? Every 4? For sake of efficiency, supercog architecture probably shouldn't be like the other cogs at all.

potatohead · 2014-05-08 09:17

Many of us don't see the problem a mixed HUB allocation scheme is solving.

jazzed · 2014-05-08 09:32

[QUOTE=mark

mark · 2014-05-08 09:45

jazzed wrote: »

It is useless if you don't have any competition or if you simply don't care if your sales revenue potential is never fully realized.

But this method isn't guaranteeing any additional performance (other than the standard window every 8 instructions). Geez, something like this would look great in the sales literature "Hub access for Cog0 may or may not be more frequent than every 8 instructions. It's complicated. Don't bother trying to figure out if you can squeeze any additional performance out of it, and DEFINITELY don't try to rely on it without being EXTRA EXTRA EXTRA careful!"

Yeah, with such a "feaure", the competition will surely be trembling in their boots.

ctwardell · 2014-05-08 10:29

[QUOTE=mark

mark · 2014-05-08 11:05

ctwardell wrote: »

Actually it can. All you need to do is not load anything in the cog that you want to always be a donor, or load it with something that doesn't use the hub. That would be that same as not allocating the donor any slots in the other schemes.

The SuperCog would be guaranteed the extra slots. The only issue is it may also get extra slots from other cogs, but in most of the cases that would not matter.

I prefer a table based method or one like I suggested in the other thread that is more determinate, but this is better than not getting any slot sharing at all.

At least it has both sides near the table.

C.W.

I hear you. I suppose it would be better than nothing, but it is sloppy. First, when we consider the issue of hub cycle phase that inherently depends on which cog(s) you're getting access from, and secondly, how sacrificing a certain cog might not even be an option because it's needed for fast DAC access on its pins. Pretty much every other proposal mentioned is better than this one, the only thing we're agreeing to here is that hub access shouldn't be fixed between all cogs.

Todd Marshall · 2014-05-08 11:10

Heater. wrote: »

The big deal is that we don't know what you are talking about when you say: "Seed COG0 into every hub slot"
We just run our code and it works.

Use an indirection table for COG sequencing. Put (seed) the COG0's address in every element of the table.

potatohead wrote: »

Many of us don't see the problem a mixed HUB allocation scheme is solving.

The short answer:
Any COG that wants access to the HUB at anything besides modulo 16.

The longer example:
It's not hard to envision a COG with a 24 cycle process wanting HUB access at an 8 cycle interval: RD(8); Update(8); WR(8); ... then idle 8 waiting for RD slot.

Now this 24 cycle process takes two full HUB rotations (64 cycles). If it could get two bites at the apple per rotation, it would take 32 cycles. That's 50% speed improvement. This could be accomplished by seeding the COG at16 cycle intervals (e.g. every 8 slots).

Further, if we weren't stuck with a 16 slot rotation but were allowed say 256 slots, we could meet more COG timing needs more easily. We could potentially have all our COGs running without waiting at all. It's a largest common denominator problem.

The COG code doesn't have to complicate itself with a "paired cog". There's no need for a TopCOG. There's no need for a King COG Zero. There's no need for sharing. There's no need for mooching.

Further, if you don't limit the table length to 16 and you don't limit the modulo of sequencing through the table to its length, all sequencing requirements can be met ... at zero cost if such an indirection table is already being used in the design.

Kwinn illustrated a similar concept in #1975 of the other thread, though he seemed to be unnecessarily limiting his facility to a 32 element table with modulo 32.

If such a table is not already used in the design, changing the design to use such an indirection table adds zero complexity and a few more bytes of ram or registers for the table.

Try on a table of 200 entries and any modulo from 1 to 200 and see how that feels on you. You can appreciate the flexibility of the technique. You know that it will serve any existing implementation right now (table length 16, modulo 16 seeded with COG enumeration 0..15). So any apprehension has to come from a need for discipline ... but that too can be imposed by the disciplinarian, leaving others alone.

As I illustrated earlier, any legacy application behavior can be preserved by seeding it's current activity first. Then seeding new COGs as close to their required access as possible and only considering determinence of new COGS as necessary.

You can then go back and look at the existing COG's requirements for determinence. If they don't have any, move them to make room for a COG that needs their slot for determinence.

You will quickly learn mindless techniques. E.g. seed COGs not needing determinence first. Then seed COGs needing determinence over them.

The issue is similar to timing instructions and data on a magnetic drum in the olden days ... something your average Java programmer is not concerned with but anyone interested in a Propeller chip would likely be interested in.

potatohead · 2014-05-08 11:43

Well, I just realized SUPERCOG would make a great band name.

@Todd

We just disagree. Consistent code execute means structuring things by instruction order, using COGS together, using smart data alignments, etc... solves those same problems and once solved, they remain solved for all time, not just a particular timing combination.

In any case, once Chip puts an FPGA out there, we have code to write!

Heater. · 2014-05-08 12:31

Todd Marshall,

The issue is similar to timing instructions and data on a magnetic drum in the olden days ... something your average Java programmer is not concerned with but anyone interested in a Propeller chip would likely be interested in.

Yes it is. And I repeatedly urge everyone here to read the "The Story of Mel" to see what that was like:
http://www.catb.org/jargon/html/story-of-mel.html

However the Propeller, with it multiple processors, made life much simpler for the programmer. No longer did he have to worry about interrupts, and priorities, and RTOSs. No, just grab some code, drop it in a project and it will work regardless of it's timing requirements vs the rest of the code in the system.

All the HUB sharing suggestions I have seen so far bring us back to the world of Mel.

Todd Marshall · 2014-05-08 12:50

Heater. wrote: »

Todd Marshall,

Yes it is. And I repeatedly urge everyone here to read the "The Story of Mel" to see what that was like:
http://www.catb.org/jargon/html/story-of-mel.html

However the Propeller, with it multiple processors, made life much simpler for the programmer. No longer did he have to worry about interrupts, and priorities, and RTOSs. No, just grab some code, drop it in a project and it will work regardless of it's timing requirements vs the rest of the code in the system.

All the HUB sharing suggestions I have seen so far bring us back to the world of Mel.

Brace yourself Heater. I agree with you. When you can deal with the 16 cycle window, don't screw with it. When your requirement becomes slightly more demanding, a tiny change in the model allows the existing Propeller philosophy to still bring something to the party and doesn't have to leave legacy apps behind. When the problem gets big, the Propeller philosophy breaks down and the interrupt strategy looks better. A hybrid solution can bring the best of both worlds. At a minimum I would look for simple SPI or I2C and probably USB, Firewire, and/or Bluetooth capability in the Propeller to go there.

Thus, I will always view the Propeller as a peripheral device. But I don't want to lock myself into the 16 slot straight jacket if I don't have to. And with the design I describe, I don't have to.

This reminds me of a dialog I had with Don Lancaster about his magic sine waves. He said the biggest problem was jitter. I suggested to him that he should try the Propeller chip as it guaranteed no jitter.

The line went dead.

http://electronics.stackexchange.com/questions/11844/don-lancasters-magic-sinewaves

potatohead · 2014-05-08 12:53

Do you have a sample problem of this kind?

What does Don's routine have to do with the HUB? That kind of thing happens in a COG and it works with the pins.

Heater. · 2014-05-08 13:00

mark

Heater. · 2014-05-08 13:19

Todd Marshall,

When the problem gets big, the Propeller philosophy breaks down and the interrupt strategy looks better.

Please give me an example because I cannot imagine that is ever true.

An interrupt is just a signal that some external happening needs dealing with "now".

One can jump out of the current processing, thus suspending it, and handle the interrupt or one can just have a CPU waiting there to handle it.

I think we know which one gives the least latency and best overall performance. One possible event, one processor.

OK, so we have 16 processors in the PII that can handle the equivalent of 16 interrupts. Very quickly and with no hassle about priorities etc.

You are hinting that "big" means more than 16 external event sources. In that case you may have a problem bigger than the PII can handle or indeed any other common MCU.

The "magic sine waves" are not relevant. They might as well be DDS or whatever. Just another technique. No doubt all suffering from the same jitter problem if they have to run on a single CPU with interruptions going on.

Mike Green · 2014-05-08 13:28

Todd,
Part of the reason for limiting any hub table to 16 or 32 entries is that it has to be implemented as an independent 16 or 32 location memory or register file since it has to be accessible directly by the hub logic. There's not enough time to access hub memory for the table contents. There are other ways the hub slot / cog mapping can be implemented, but it has to be fast and should be relatively simple and small.

Todd Marshall · 2014-05-08 15:57

When the problem gets big, the Propeller philosophy breaks down and the interrupt strategy looks better.

Heater. wrote: »

Todd Marshall,
Please give me an example because I cannot imagine that is ever true.

When the algorithm needs HUB access that is in a different "period" than the HUB rotation the problem shows up. In some cases it syncs up. In other cases it waits up to one full rotation. It "beats" like two sinewaves of slightly different frequency. That means jitter. Audio applications would most likely be susccptable to this issue. I would expect DSP would be susceptable to it also.

Heater. wrote: »

Todd Marshall,
An interrupt is just a signal that some external happening needs dealing with "now".

Correct. But "now" for complicated algorithms becomes more difficult to anticipate and is not likely an even harmonic of the HUB rotation.

Heater. wrote: »

Todd Marshall,
One can jump out of the current processing, thus suspending it, and handle the interrupt or one can just have a CPU waiting there to handle it.

Correct. But if one makes an interrupt it can know exactly how long that interrupt will take to service. If one needs the HUB every 43.12 HUB revolutions, it's going to jitter.

Heater. wrote: »

Todd Marshall,
I think we know which one gives the least latency and best overall performance. One possible event, one processor.

Well, I'm not included in that "we". My gut says otherwise for complicated algorithms making multiple decisions.

Heater. wrote: »

Todd Marshall,
OK, so we have 16 processors in the PII that can handle the equivalent of 16 interrupts. Very quickly and with no hassle about priorities etc.

Yes, if 0 to 31 cycle response is considered very quickly ... and if for your needs, a response varying between 0 and 31 cycles for any particular need does not constitute jitter. Otherwise you have the same issue as magnetic drum timing.

Heater. wrote: »

Todd Marshall,
You are hinting that "big" means more than 16 external event sources. In that case you may have a problem bigger than the PII can handle or indeed any other common MCU.

Not really. I may have only one event source and experience the problem I anticipate if my algrothim is sufficiently complicated and includes decisions and is not an even harmonic of the HUB rotation. I can write code so all decisions take exactly the same amount of time. But I may not be able to predict how many decisions have to be made in series before I need HUB access. It's a hassle associated with larger applications.

I'm inclined to apply the Propeller in layer 1 and maybe into layer 2 of the 7 layer model. Above layer 2 I'm inclined to want a stack machine with interrupts. Thus, I view the Propeller as more of a peripheral. And viewing it as such, I want to control the HUB service sequence.

Heater. wrote: »

Todd Marshall,
The "magic sine waves" are not relevant.

They are if the Propeller can't support them without jitter. My point to Don was that I thought it could. Don seemed uninterested as I never heard from him again. Has anyone you know of implemented them on the Propeller?

Heater. wrote: »

Todd Marshall,
They might as well be DDS or whatever. Just another technique. No doubt all suffering from the same jitter problem if they have to run on a single CPU with interruptions going on.

By DDS: I presume you're referring to Data Distribution Service. I don't think we need to go there to address my concerns here.

Todd Marshall · 2014-05-08 16:11

Mike Green wrote: »

Todd,
Part of the reason for limiting any hub table to 16 or 32 entries is that it has to be implemented as an independent 16 or 32 location memory or register file since it has to be accessible directly by the hub logic. There's not enough time to access hub memory for the table contents. There are other ways the hub slot / cog mapping can be implemented, but it has to be fast and should be relatively simple and small.

If the table can be in 16 registers, it can be in 256 registers. If a COG can be accessed indirectly through a register, it cares not if that register is one of 16 or one of 256. And taking up 16 times as much space for this set of registers is not likely to be a big deal. The speed and complexity don't change. You have a down counter starting at the modulus you want. It indexes into the register vector returning the COG number for that indexed position (time slot). When the counter reaches zero, it resets to the modulus, just like a fixed counter indexing COGS directly resets to 16.

rod1963 · 2014-05-08 16:23

DDS = Direct Digital Synthesis

Analog Devices makes a whole slew of DDS IC's to perform the sort of Sine wave thing you suggested to Don.

jmg · 2014-05-08 16:37

Todd Marshall wrote: »

If the table can be in 16 registers, it can be in 256 registers.

Sure, at first glance, you can pick-any-number, but...

Todd Marshall wrote: »

If a COG can be accessed indirectly through a register, it cares not if that register is one of 16 or one of 256. And taking up 16 times as much space for this set of registers is not likely to be a big deal.

Die size still matters, and you ideally need atomic loading, as this can change on the fly, at run time.
That does not prevent big tables, but it does complicate them.

Todd Marshall wrote: »

The speed and complexity don't change. You have a down counter starting at the modulus you want. It indexes into the register vector returning the COG number for that indexed position (time slot). When the counter reaches zero, it resets to the modulus, just like a fixed counter indexing COGS directly resets to 16.

Correct, except for the "The speed and complexity don't change" claim, which is incorrect.

For some real FPGA P&R values, from verilog test cases, see the more serious thread on Table handling
http://forums.parallax.com/showthread.php/155561-A-32-slot-Approach-(was-An-interleaved-hub-approach)?p=1265840&viewfull=1#post1265840

Modulus reload has no effective penalty, but table size (and double table indexing for if-based), all impact speed. (some do come in slower than the 200MHz FPGA target)

jazzed · 2014-05-08 16:40

Todd Marshall wrote: »

When the algorithm needs HUB access that is in a different "period" than the HUB rotation the problem shows up. In some cases it syncs up. In other cases it waits up to one full rotation. It "beats" like two sinewaves of slightly different frequency. That means jitter. Audio applications would most likely be susccptable to this issue. I would expect DSP would be susceptable to it also.

It is mitigable. The following applies to the shipping P8x32a. We'll figure out what the new chip can do when it's available.

If you need access to HUB from a cog every N*HUB cycle, or every 2N*HUB cycleyou can get it.

N*HUB Cycle flow (deterministically accesses HUB at 5MHz in 80MHz clock):

loop
hub operation
non-hub operation
non-hub operation
hub operation
non-hub operation
jmp #:loop

2N*HUB Cycle flow (deterministically accesses HUB at 2.5MHz in 80MHz clock):

loop
hub operation
non-hub operation
non-hub operation
non-hub operation
non-hub operation
non-hub operation
jmp #:loop

4N*HUB Cycle flow (deterministically accesses HUB at 1.25MHz in 80MHz clock):

loop
hub operation
non-hub operation
non-hub operation
non-hub operation
non-hub operation
non-hub operation
non-hub operation
non-hub operation
non-hub operation
non-hub operation
jmp #:loop

To get higher throughput for the same cycles, more than one cog would be needed.

P16X32B SuperCog

Comments