P16X32B SuperCog
RossH
Posts: 5,477
Since everyone seems to have taken to starting a new thread for each "hub access sharing" scheme, I can't let Heater's dazzlingly simple, yet brilliantly practical notion pass without also having its own thread ...
Basically, Heater's idea can be summed up as follows:
- Cog 0 is a "SuperCog", which (in addition to its own hub access window) also gets to use the hub access of any other cog that is not executing a hub instruction in its hub access window.
- All the other 15 cogs operate identically to the way they normally would - even if cog 0 is not running. Absolute determinism and hub-cog synchronicity is maintained for the chip as a whole, and also for cogs 1 .. 15 individually.
- Cog 0 can run at full speed if there are sufficient unused cogs (or running cogs not using their hub access window), but it slows down to normal cog speed if all other cogs are utilizing their own hub access windows. No cog has to donate hub access slots - it all happens automatically.
- All cog objects remain identical - no need to know how many hub access cycles each one needs or gets, and then have to try to figure out whether they can all operate together. Plug and play is back!
A beautifully simple scheme (in fact, possibly the simplest scheme) that gives people what they want most (i.e. a fast cog for executing special application code or a high-level language) while retaining all the determinism, simplicity and orthogonality we have learned to love in the P1.
I think Heater's idea is awesome! What do you think?
Ross.
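To make the arbitration rule concrete, here's a rough sketch in C of how the hub could decide, slot by slot, who gets the access window. This is purely illustrative (the real thing would be a few gates in the hub logic, and names like wants_hub and grant_slot are my own invention), but it shows why determinism is preserved: a cog that wants its own slot always gets it, and cog 0 only ever picks up slots that would otherwise go idle.

#include <stdbool.h>
#include <stdio.h>

#define NUM_COGS 16

/* wants_hub[i] models whether cog i has a hub instruction waiting
 * when its own access window comes around. */
static bool wants_hub[NUM_COGS];

/* Decide who is granted the hub slot on this tick of the rotation. */
static int grant_slot(unsigned tick)
{
    int owner = tick % NUM_COGS;    /* the cog whose window this is      */

    if (wants_hub[owner])
        return owner;               /* normal case: the owner keeps it   */
    if (wants_hub[0])
        return 0;                   /* SuperCog borrows the idle window  */
    return -1;                      /* window goes unused                */
}

int main(void)
{
    wants_hub[0] = true;            /* cog 0 always has hub work queued  */
    wants_hub[3] = true;            /* cog 3 is a busy driver            */

    for (unsigned t = 0; t < NUM_COGS; t++)
        printf("slot %2u -> cog %d\n", t, grant_slot(t));
    return 0;
}

C.W.'s odd/even variant further down would only differ in the fallback test: check cog 0 on even slots and cog 1 on odd ones.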
Comments
The reason for two is that I'm pretty sure odd/even slots need to stay with odd/even cogs due to timing.
So cog 0 gets unused evens, cog 1 gets unused odds.
C.W.
I think Heater's idea is awesome!
I think I'd prefer just one SuperCog, even if it only got access to half the slots. Does anyone know whether this odd/even timing thing is true or not?
Ross.
Why else do you think I was careful to attribute it to you?
We don't have enough specifics from Chip to know for sure.
It is currently an assumption based on the cogs executing instructions at half the hub rate. You couldn't give a cog more than 8 slots within one trip around the round-robin, yet 16 slots occur during that period, so it is assumed half the cogs get an even slot and half an odd one.
We don't really know if that relationship matches the cog number.
C.W.
We just run our code and it works.
Yes, it has been proposed before.
Yes indeed. It's "Groundhog Day" here until Chip makes a reappearance.
Now, there may be something to Heater's thoughts on reducing the cog count. A 9 cog device, where one is a so-called "SuperCog", would enable cog0 to have guaranteed hub access every instruction, while the other 8 would have it every 8 instructions (just as they're currently planned to). Of course, you could instead go with 10 cogs, where two are super cogs which have access every other instruction, and the remainder still get it every 8.
But that brings up an important issue: the cog's architecture isn't the most efficient design for running code outside of its cog ram space. Because of this, giving one cog so much bandwidth might in reality be a waste, but at what point is that so? Hub access every 2 instructions? Every 4? For the sake of efficiency, the SuperCog's architecture probably shouldn't be like the other cogs' at all.
But this method isn't guaranteeing any additional performance (other than the standard window every 8 instructions). Geez, something like this would look great in the sales literature "Hub access for Cog0 may or may not be more frequent than every 8 instructions. It's complicated. Don't bother trying to figure out if you can squeeze any additional performance out of it, and DEFINITELY don't try to rely on it without being EXTRA EXTRA EXTRA careful!"
Yeah, with such a "feature", the competition will surely be trembling in their boots.
I hear you. I suppose it would be better than nothing, but it is sloppy. First, there's the issue of hub cycle phase, which inherently depends on which cog(s) you're getting access from; and second, sacrificing a certain cog might not even be an option because it's needed for fast DAC access on its pins. Pretty much every other proposal mentioned is better than this one; the only thing we're agreeing on here is that hub access shouldn't be fixed between all cogs.
Use an indirection table for COG sequencing. Put (seed) COG0's address in every element of the table.
The short answer:
Any COG that wants access to the HUB at anything besides modulo 16.
The longer example:
It's not hard to envision a COG with a 24 cycle process wanting HUB access at an 8 cycle interval: RD(8); Update(8); WR(8); ... then idle 8 waiting for RD slot.
Now this 24 cycle process takes two full HUB rotations (64 cycles). If it could get two bites at the apple per rotation, it would take 32 cycles. That's a 50% speed improvement. This could be accomplished by seeding the COG at 16 cycle intervals (e.g. every 8 slots).
Further, if we weren't stuck with a 16 slot rotation but were allowed say 256 slots, we could meet more COG timing needs more easily. We could potentially have all our COGs running without waiting at all. It's a largest common denominator problem.
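If you want to play with those numbers, here's a small C sketch of the indirection table applied to that 24-cycle process. The slot spacing (16 slots of 2 clocks each, i.e. a 32-cycle rotation) and the choice of which slot to steal are my own reading of the figures above, so take them as assumptions rather than anything Chip has specified.

#include <stdio.h>

#define TABLE_LEN       16   /* one entry per hub slot in the rotation   */
#define CYCLES_PER_SLOT  2   /* 16 slots x 2 cycles = 32-cycle rotation  */

int slot_table[TABLE_LEN];   /* slot_table[s] = cog granted slot s       */

int main(void)
{
    /* Default seeding: the classic round-robin, cog n owns slot n. */
    for (int s = 0; s < TABLE_LEN; s++)
        slot_table[s] = s;

    /* Give cog 5 a second bite per rotation by also seeding it 8 slots
     * (16 cycles) after its own slot.  Here it takes over slot 13,
     * which we assume nobody else needs. */
    slot_table[13] = 5;

    /* With hub access every 16 cycles, the 24-cycle RD/Update/WR process
     * finishes an iteration every 32 cycles instead of every 64. */
    for (int s = 0; s < TABLE_LEN; s++)
        if (slot_table[s] == 5)
            printf("cog 5 granted at cycle %2d of the rotation\n",
                   s * CYCLES_PER_SLOT);
    return 0;
}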
The COG code doesn't have to complicate itself with a "paired cog". There's no need for a TopCOG. There's no need for a King COG Zero. There's no need for sharing. There's no need for mooching.
Further, if you don't limit the table length to 16 and you don't limit the modulo of sequencing through the table to its length, all sequencing requirements can be met ... at zero cost if such an indirection table is already being used in the design.
Kwinn illustrated a similar concept in #1975 of the other thread, though he seemed to be unnecessarily limiting his facility to a 32 element table with modulo 32.
If such a table is not already used in the design, changing the design to use such an indirection table adds zero complexity and a few more bytes of ram or registers for the table.
Try a table of 200 entries and any modulo from 1 to 200 and see how that feels. You can appreciate the flexibility of the technique. You know that it will serve any existing implementation right now (table length 16, modulo 16, seeded with COG enumeration 0..15). So any apprehension has to come from a need for discipline ... but that too can be imposed by the disciplinarian, leaving others alone.
As I illustrated earlier, any legacy application behavior can be preserved by seeding its current activity first, then seeding new COGs as close to their required access as possible, and only considering the determinism of new COGs as necessary.
You can then go back and look at the existing COGs' requirements for determinism. If they don't have any, move them to make room for a COG that needs their slot for determinism.
You will quickly learn mindless techniques, e.g. seed the COGs not needing determinism first, then seed the COGs needing determinism over them.
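Here's a hedged sketch of that seeding order in C, with entirely invented names (cog_req, seed_table) and an invented example in main(): fill the table with the cogs that don't care about phase, then seed the timing-critical cogs over the top of them.

#include <stdio.h>

#define TABLE_LEN 16
#define SLOT_FREE (-1)

/* Hypothetical descriptor: a cog id plus, for timing-critical cogs,
 * the slot it must own (SLOT_FREE means "don't care"). */
struct cog_req { int cog; int required_slot; };

static void seed_table(int table[TABLE_LEN],
                       const struct cog_req *reqs, int n)
{
    for (int s = 0; s < TABLE_LEN; s++)
        table[s] = SLOT_FREE;

    /* Pass 1: cogs with no determinism requirement take slots in order. */
    int next = 0;
    for (int i = 0; i < n; i++)
        if (reqs[i].required_slot == SLOT_FREE && next < TABLE_LEN)
            table[next++] = reqs[i].cog;

    /* Pass 2: timing-critical cogs are seeded over them, taking exactly
     * the slot phase they ask for. */
    for (int i = 0; i < n; i++)
        if (reqs[i].required_slot != SLOT_FREE)
            table[reqs[i].required_slot % TABLE_LEN] = reqs[i].cog;
}

int main(void)
{
    int table[TABLE_LEN];
    struct cog_req reqs[] = {
        { 0, SLOT_FREE },   /* e.g. a byte-code interpreter: any slot   */
        { 2, SLOT_FREE },   /* e.g. a serial driver: any slot           */
        { 1, 4 },           /* e.g. a video driver that must own slot 4 */
    };

    seed_table(table, reqs, 3);
    for (int s = 0; s < TABLE_LEN; s++)
        if (table[s] == SLOT_FREE)
            printf("slot %2d -> free\n", s);
        else
            printf("slot %2d -> cog %d\n", s, table[s]);
    return 0;
}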
The issue is similar to timing instructions and data on a magnetic drum in the olden days ... something your average Java programmer is not concerned with but anyone interested in a Propeller chip would likely be interested in.
@Todd
We just disagree. Consistent code execution (structuring things by instruction order, using COGs together, using smart data alignments, etc.) solves those same problems, and once solved, they remain solved for all time, not just for a particular timing combination.
In any case, once Chip puts an FPGA out there, we have code to write!
http://www.catb.org/jargon/html/story-of-mel.html
However the Propeller, with its multiple processors, made life much simpler for the programmer. No longer did he have to worry about interrupts, and priorities, and RTOSs. No, just grab some code, drop it in a project and it will work regardless of its timing requirements vs the rest of the code in the system.
All the HUB sharing suggestions I have seen so far bring us back to the world of Mel.
Brace yourself, Heater. I agree with you. When you can deal with the 16 cycle window, don't screw with it. When your requirement becomes slightly more demanding, a tiny change in the model allows the existing Propeller philosophy to still bring something to the party and doesn't have to leave legacy apps behind. When the problem gets big, the Propeller philosophy breaks down and the interrupt strategy looks better. A hybrid solution can bring the best of both worlds. At a minimum I would look for simple SPI or I2C and probably USB, Firewire, and/or Bluetooth capability in the Propeller to go there.
Thus, I will always view the Propeller as a peripheral device. But I don't want to lock myself into the 16 slot straitjacket if I don't have to. And with the design I describe, I don't have to.
This reminds me of a dialog I had with Don Lancaster about his magic sine waves. He said the biggest problem was jitter. I suggested to him that he should try the Propeller chip as it guaranteed no jitter.
The line went dead.
http://electronics.stackexchange.com/questions/11844/don-lancasters-magic-sinewaves
What does Don's routine have to do with the HUB? That kind of thing happens in a COG and it works with the pins.
An interrupt is just a signal that some external happening needs dealing with "now".
One can jump out of the current processing, thus suspending it, and handle the interrupt or one can just have a CPU waiting there to handle it.
I think we know which one gives the least latency and best overall performance. One possible event, one processor.
OK, so we have 16 processors in the PII that can handle the equivalent of 16 interrupts. Very quickly and with no hassle about priorities etc.
You are hinting that "big" means more than 16 external event sources. In that case you may have a problem bigger than the PII can handle or indeed any other common MCU.
The "magic sine waves" are not relevant. They might as well be DDS or whatever. Just another technique. No doubt all suffering from the same jitter problem if they have to run on a single CPU with interruptions going on.
Part of the reason for limiting any hub table to 16 or 32 entries is that it has to be implemented as an independent 16 or 32 location memory or register file since it has to be accessible directly by the hub logic. There's not enough time to access hub memory for the table contents. There are other ways the hub slot / cog mapping can be implemented, but it has to be fast and should be relatively simple and small.
When the algorithm needs HUB access in a different "period" from the HUB rotation, the problem shows up. In some cases it syncs up. In other cases it waits up to one full rotation. It "beats" like two sinewaves of slightly different frequency. That means jitter. Audio applications would most likely be susceptible to this issue. I would expect DSP to be susceptible to it also.
Correct. But "now" for complicated algorithms becomes more difficult to anticipate and is not likely an even harmonic of the HUB rotation.
Correct. But with an interrupt, one can know exactly how long that interrupt will take to service. If one needs the HUB every 43.12 HUB revolutions, it's going to jitter.
Well, I'm not included in that "we". My gut says otherwise for complicated algorithms making multiple decisions.
Yes, if 0 to 31 cycle response is considered very quickly ... and if for your needs, a response varying between 0 and 31 cycles for any particular need does not constitute jitter. Otherwise you have the same issue as magnetic drum timing.
Not really. I may have only one event source and still experience the problem I anticipate if my algorithm is sufficiently complicated, includes decisions, and is not an even harmonic of the HUB rotation. I can write code so all decisions take exactly the same amount of time. But I may not be able to predict how many decisions have to be made in series before I need HUB access. It's a hassle associated with larger applications.
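To put a number on that, here's a small C sketch of the beat effect, using the shipping P1's 16-clock hub window and a made-up 43-clock work period (simpler than the 43.12-revolution example above, but the same idea). The task would like to hit the hub every 43 clocks, but it can only get in at a window, so its timing error wanders across a full window instead of staying constant; that wandering is the jitter.

#include <stdio.h>

#define WINDOW 16   /* clocks between one P1 cog's hub windows           */
#define PERIOD 43   /* desired clocks between hub accesses; deliberately */
                    /* not a multiple of WINDOW                          */

int main(void)
{
    /* Each access should ideally happen at t = i * PERIOD, but it can
     * only happen at the next hub window (a multiple of WINDOW).  The
     * difference between the two is the jitter, and it "beats" through
     * the whole 0..15 clock range rather than staying fixed. */
    for (int i = 0; i < 12; i++) {
        long ideal  = (long)i * PERIOD;
        long actual = ((ideal + WINDOW - 1) / WINDOW) * WINDOW;
        printf("access %2d: ideal %4ld  actual %4ld  jitter %2ld clocks\n",
               i, ideal, actual, actual - ideal);
    }
    return 0;
}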
I'm inclined to apply the Propeller in layer 1 and maybe into layer 2 of the 7 layer model. Above layer 2 I'm inclined to want a stack machine with interrupts. Thus, I view the Propeller as more of a peripheral. And viewing it as such, I want to control the HUB service sequence.
They are if the Propeller can't support them without jitter. My point to Don was that I thought it could. Don seemed uninterested as I never heard from him again. Has anyone you know of implemented them on the Propeller?
By DDS: I presume you're referring to Data Distribution Service. I don't think we need to go there to address my concerns here.
Analog Devices makes a whole slew of DDS IC's to perform the sort of Sine wave thing you suggested to Don.
Die size still matters, and you ideally need atomic loading, as this can change on the fly, at run time.
That does not prevent big tables, but it does complicate them.
Correct, except for the "The speed and complexity don't change" claim, which is incorrect.
For some real FPGA P&R values, from verilog test cases, see the more serious thread on Table handling
http://forums.parallax.com/showthread.php/155561-A-32-slot-Approach-(was-An-interleaved-hub-approach)?p=1265840&viewfull=1#post1265840
Modulus reload has no effective penalty, but table size (and the double table indexing for the if-based approach) does impact speed (some variants come in slower than the 200MHz FPGA target).
It is mitigable. The following applies to the shipping P8x32a. We'll figure out what the new chip can do when it's available.
If you need access to the HUB from a cog every N*HUB cycles, or every 2N*HUB cycles, you can get it.
N*HUB Cycle flow (deterministically accesses HUB at 5MHz with an 80MHz clock):
:loop
        hub operation        ' executes in this cog's hub window (windows come every 16 clocks)
        non-hub operation    ' 4 clocks; up to two normal ops fit between hub ops without missing a window
        non-hub operation    ' 4 clocks
        hub operation        ' lands on the very next window, 16 clocks after the first
        non-hub operation    ' 4 clocks
        jmp #:loop           ' 4 clocks; the loop spans two windows (32 clocks) with two hub accesses -> 5MHz
2N*HUB Cycle flow (deterministically accesses HUB at 2.5MHz with an 80MHz clock):
:loop
        hub operation        ' executes in this cog's hub window
        non-hub operation    ' 4 clocks
        non-hub operation    ' 4 clocks
        non-hub operation    ' 4 clocks
        non-hub operation    ' 4 clocks
        non-hub operation    ' 4 clocks
        jmp #:loop           ' 4 clocks; the loop spans two windows (32 clocks) with one hub access -> 2.5MHz
4N*HUB Cycle flow (deterministically accesses HUB at 1.25MHz with an 80MHz clock):
:loop
        hub operation        ' executes in this cog's hub window
        non-hub operation    ' 13 non-hub operations at 4 clocks each (52 clocks)
        non-hub operation    ' fill out the period, so the next hub operation
        non-hub operation    ' lands exactly 64 clocks (four windows) after this one
        non-hub operation
        non-hub operation
        non-hub operation
        non-hub operation
        non-hub operation
        non-hub operation
        non-hub operation
        non-hub operation
        non-hub operation
        non-hub operation
        jmp #:loop           ' 4 clocks; 64-clock loop with one hub access -> 1.25MHz
To get higher throughput for the same cycles, more than one cog would be needed.