The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

dMajo · 2014-05-03 14:03

Phil Pilgrim (PhiPi) wrote: »

I do not believe that's the case. The cog RAM is dual-ported, instead of quad-ported as it was in the P2 design. This permits two accesses at once. An instruction requires four accesses: read instruction, read source, read destination, write result. Only read source and read destination could happen simultaneously in a non-pipelined architecture, leaving three clock cycles to complete the instruction. So one has to conclude that the new architecture is, indeed, pipelined, with each instruction requiring four clocks, and overlapping other instructions by two clocks, yielding a throughput of two clocks per instruction.

-Phil

IIRC the P1 memory has 3 read ports and 1 write port. In the P1+, because Chip uses OnSemi standard blocks, there are 3 read and 3 write ports, and at the same time the access width has doubled because it has moved from 64 to 128 bits ... but perhaps I am wrong.
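The port-budget argument in the quoted post can be checked with a few lines of Python. This is only a sketch under stated assumptions: the figures (two ports, four accesses per instruction) come from the quote, but the stage layouts and the names `port_usage` and `fits` are hypothetical and are not taken from any Propeller documentation.

```python
from collections import Counter
from math import ceil

PORTS = 2                # dual-ported cog RAM (figure from the quoted post)
ACCESSES_PER_INSTR = 4   # fetch, read S, read D, write result (from the post)

# Lower bound on sustained clocks per instruction, whatever the pipeline looks like.
MIN_CLOCKS_PER_INSTR = ceil(ACCESSES_PER_INSTR / PORTS)

# Hypothetical layout: clock offset of each RAM access from instruction start.
# One access per clock, so each instruction spans four clocks.
SPREAD_LAYOUT = {"fetch": 0, "read_s": 1, "read_d": 2, "write": 3}

# Alternative hypothetical layout with both reads in the same clock.
PAIRED_READ_LAYOUT = {"fetch": 0, "read_s": 1, "read_d": 1, "write": 3}


def port_usage(num_instr, issue_interval, layout):
    """Count RAM accesses per clock when a new instruction starts every issue_interval clocks."""
    usage = Counter()
    for i in range(num_instr):
        for offset in layout.values():
            usage[i * issue_interval + offset] += 1
    return usage


def fits(num_instr, issue_interval, layout):
    """True if no clock needs more accesses than the RAM has ports."""
    return max(port_usage(num_instr, issue_interval, layout).values()) <= PORTS
```

Under these assumptions, a four-clock instruction issued every two clocks fits in two ports only if the accesses are spread so that no clock carries more than two of them; issuing every clock never fits.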

4x5n · 2014-05-03 14:19

ctwardell wrote: »

Now Heater's FUD has become self-replicating...nice.

C.W.

????

ctwardell · 2014-05-03 14:44

4x5n wrote: »

????

Just pointing out that you repeated Heater's 'talking points'.

How specifically does any of this make the OBEX harder to use? I know Heater says it does, but does it really?

Heater says it makes the cogs 'unsymmetrical', how? You already need to consider hub and pin usage, does that make the cogs unsymmetrical if they don't all use exactly the same amount of HUB ram, or the same number of pins?

I'd just like to suggest to everyone to take a deeper look at the problem to understand both the benefits and pitfalls of allowing hub access to be more controllable by the user.

Right now we have two camps digging their heels in and very little analysis. We should act like engineers, not politicians or philosophers.

C.W.

potatohead · 2014-05-03 15:00

It's not a purely technical matter.

It is all about how things are valued: in particular, whether or not the additional hassle associated with individualizing the COG - HUB access relationships is worth having some, or one, etc... COGS faster or less / more useful than other COGS.

Engineering has so far produced an amazing number of schemes. Perhaps more of those will yield one that impacts the value difference more than the current ones have.

Put really simply, everybody knows we can make some COGS quicker than other COGS. That's not at issue other than how best to do that. What is at issue is whether or not it's worth doing, and whether it's worth it on this design, or the next / another / different one.

Characterizing that value difference as FUD really isn't productive.

Todd Marshall · 2014-05-03 15:33

dMajo wrote: »

Todd, from your questions it seems to me that you don't even know how the P1 works ... nothing wrong with it ... don't take this wrongly.

Thanks dMajo. I won't take it wrongly.

But there certainly seems to be a disconnect between what I read and what is actually going on.

I'm now left wondering how the P1 HUB knows which COG gets the baton next (even if its hardwired) and how far in advance (clocks) it must know it. I would be really surprised if the P2 has a different model for this.

I'm still left wondering when WR or RD actually run with respect to other instructions in the HUB access window.

It's also not a trivial exercise determining how the P1 behaves. Resistance to allowing any COG to run in any HUB time SLOT (and they are time SLOTS) reveals there's far more here than meets the eye. And determining real P1 behavior by poking it is more difficult than most controllers, there being no debug facility (short of building your own scaffolding).

That notwithstanding, it's going to be exciting to find out why all this resistance. And why all this talk of SLOT sharing and mooching? Give the COG the SLOT and be done with it.

All my references are to P1 because it is a known quantity. I would have put a COG in SLOT placement feature and arbitrary number of slots in the round robin long ago ... probably in the initial design. I've seen no argument so far that comes close to convincing me that would be foolish. When you have a chip where every COG can diddle every pin simultaneously, you're talking some serious cooperation issues as large or larger than which COG has the baton and who gets it next.

As I stated in my opening salvo from the peanut gallery, my concerns increase with double the number of COGs and double the number of SLOTS and a fixed circular queue ... all hardwired.

jazzed · 2014-05-03 15:46

4x5n wrote: »

I've been using the propeller I've always thought and have been told that the fact that all pins and cogs were equal and "the same"

While it may have been desirable and may be generally the case, it is not perfectly true.

All pin ports are identical except for 28,29,30,31 in a practical system - we all know the practical applications of those pins. Some cogs behave a little differently than others, and that makes some very good stuff impossible on all cogs: 1. using LOGIC modes of counters to do serial receive (found by kuroneko), and 2. IIRC some cogs have trouble producing nice audio in some circumstances (found by jonnymac?).

Granted these are not typical. The serial receive optimization cog specific failure was a severe disappointment to me.

No one likes singing Kumbaya ? So sad

ctwardell · 2014-05-03 16:31

potatohead wrote: »

It's not a purely technical matter.

It is all about how things are valued: in particular, whether or not the additional hassle associated with individualizing the COG - HUB access relationships is worth having some, or one, etc... COGS faster or less / more useful than other COGS.

Engineering has so far produced an amazing number of schemes. Perhaps more of those will yield one that impacts the value difference more than the current ones have.

Put really simply, everybody knows we can make some COGS quicker than other COGS. That's not at issue other than how best to do that. What is at issue is whether or not it's worth doing, and whether it's worth it on this design, or the next / another / different one.

Characterizing that value difference as FUD really isn't productive.

But it is FUD, and you're doing it right here, trying to be subtle about it, but doing it no less...

Look at the little thoughts you planted:

"additional hassle"
- How do we know it's a hassle? It might be another resource that needs consideration for allocation during a design, but so are memory and pin usage. Are those hassles as well?

"less / more useful than other COGS"
"make some COGS quicker than other COGS"
- This is purely about access to the HUB, how does that make a COG more or less useful?
- This is about 'right sizing' access to the HUB, how would having unused slots be used by another COG make the donor any less useful?
- In an assignment scheme like the one suggested by Bill Henning, how does assigning fewer slots to a COG that doesn't need them make it less useful?
- In a pairing scheme like the one suggested by Cluso99, how does that make the other COGs less useful?

"Characterizing that value difference as FUD really isn't productive"
- You are claiming that the value difference is what is being called FUD. That is simply not true: what is being called FUD is the presentation of that value difference, and you know it.

Just look at 4x5n's recent post to see how successful Heater has been at planting fear, uncertainty, and doubt.

4x5n wrote: »

I'm with you on this heater. While I can't exactly put my finger on what I don't like about all the talk about hub slot sharing (or whatever people call it) I don't like it. In the few years I've been using the propeller I've always thought, and have been told, that all pins and cogs were equal and "the same", and this seems to be a major shift away from that. If for no other reason it makes objects in the OBEX much harder to use. The programmer would be forced to add up the cog slots needed by the objects and hope that the authors were correct. While that's a lesser issue when the program is initially written, it makes maintaining and modifying the code much more difficult and dangerous. I think it's a mistake.

Let's see, we have:

- cogs will no longer be "the same".
- the OBEX much harder to use.
- forced to add up cog slots.
- difficult and dangerous to maintain.

C.W.

Rayman · 2014-05-03 16:36

Personally, I think striving towards symmetry is a good thing.
Without integral flash, some pins have to be dedicated to booting.
But, at least they are quasi-free after the boot phase (although maybe not really in a practical environment).

Some of the relative pin timing asymmetry may be fixed in this P2 (at least it was in the old P2).

Pin-Pin capacitive coupling is something that there is not much fix for, but I think that can be minimized with the external circuit design.

This idea of a cog stealing hub cycles seems to introduce a severe asymmetry with not much benefit.

ctwardell · 2014-05-03 16:45

Rayman wrote: »

This idea of a cog stealing hub cycles seems to introduce a severe asymmetry with not much benefit.

Who said anything about stealing?

I'm not trying to be difficult, but there seem to be a lot of misconceptions running around.

C.W.

Mike Green · 2014-05-03 16:56

Todd Marshall,
This cog - hub linkage is fundamental to the P1. Each cog gets one access to the hub logic every 16 clock cycles. If it attempts to access the hub logic, whether for a read, a write, or other hub instruction, not during its access slot, the cog stalls (repeats one of the clock cycles) until the slot comes up. That's it. Other than that, the cog runs freely executing most instructions in 4 clock cycles. The hub access instructions usually take 7 clock cycles if no stall is needed. Similarly, the WAITxxx instructions stall the cog until the instruction's condition is satisfied, whether from video timing, I/O pin state, or a match with the system clock value. A stalled cog is in low power mode until the stall is ended. You can usually fit a hub access instruction and 2 other non-hub instructions in a 16 cycle hub cycle keeping synchronization with the hub. For example, you can have a hub read (RDLONG) followed by two adds (for incrementing source and destination addresses) in one hub cycle with these 3 instructions repeated for some number of times for a block transfer into or out of the hub, followed by a DJNZ to loop. You'd lose one hub cycle with the DJNZ, but that's amortized over several transfers due to the repeated sequences.

Each of the cogs has a number (0-7) associated with it and they get a fixed hub access slot in order by number. I don't know whether this is implemented internally with a counter or shift register.
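The timing described above can be sketched as a small Python model (Python rather than PASM, purely for illustration). The 16-clock hub period and 4-clock plain instructions are from the post; the no-wait hub cost is a parameter here, defaulting to 8 clocks as an assumption (the post says 7, and the loop cadence comes out the same either way). The names `trace` and `total_clocks` are hypothetical.

```python
HUB_PERIOD = 16       # clocks between one cog's hub slots (from the post)
PLAIN_CLOCKS = 4      # clocks for most non-hub instructions (from the post)
HUB_BASE_CLOCKS = 8   # assumed cost of a hub op that arrives exactly on its slot


def trace(program, hub_base=HUB_BASE_CLOCKS, slot_phase=0):
    """Return (op, start_clock, end_clock) for each instruction.

    program is a list of "hub" / "plain" strings; slot_phase is the clock
    (mod HUB_PERIOD) at which this cog's hub slot comes up.
    """
    t = 0
    rows = []
    for op in program:
        begin = t
        if op == "hub":
            t += (slot_phase - t) % HUB_PERIOD  # stall until the slot arrives
            t += hub_base
        else:
            t += PLAIN_CLOCKS
        rows.append((op, begin, t))
    return rows


def total_clocks(program, **kwargs):
    """Clock at which the last instruction of program finishes."""
    return trace(program, **kwargs)[-1][2]


# RDLONG + two ADDs, unrolled four times, then a DJNZ to loop.
UNROLLED = ["hub", "plain", "plain"] * 4
LOOP_BODY = UNROLLED + ["plain"]
```

In this model each RDLONG + ADD + ADD group stays locked to the 16-clock hub period, while the trailing DJNZ pushes the next hub access past its slot, so a four-transfer loop body costs five hub periods in steady state.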

Todd Marshall · 2014-05-03 17:08

dMajo wrote: »

The hub window is not 16 clocks wide, but it passes every 16 clocks, and if you are synced with it you can exchange a value without additional wait states. You can read or write a byte/word/long. If you are not using video hw or other wait instructions, once synced with the first hub access it will remain in sync, so you can plan your pasm code to optimize it.

With P1+ the concept is the same, but the timings are improved: pasm instructions take 2 clocks instead of 4, and the exchanged data can in addition be quads (4 longs at a time), while the hub window for a cog is still every 16 clock cycles (but now there are 16 cogs instead of 8). I don't know if hub ops still use 7 clocks or less.

4x5n · 2014-05-03 17:25

ctwardell wrote: »

Just pointing out that you repeated Heater's 'talking points'.

How specifically does any of this make the OBEX harder to use? I know Heater says it does, but does it really?

Heater says it makes the cogs 'unsymmetrical', how? You already need to consider hub and pin usage, does that make the cogs unsymmetrical if they don't all use exactly the same amount of HUB ram, or the same number of pins?

I'd just like to suggest to everyone to take a deeper look at the problem to understand both the benefits and pitfalls of allowing hub access to be more controllable by the user.

Right now we have two camps digging their heels in and very little analysis. We should act like engineers, not politicians or philosophers.

C.W.

I agree with heater, oh the horror!!!

I really hope that you're not implying that those that agree with you and your point of view are acting like engineers and anyone that dares to disagree with your superior opinion is practicing FUD and are to be silenced!

4x5n · 2014-05-03 17:31

jazzed wrote: »

While it may have been desirable and may be generally the case, it is not perfectly true.

All pin ports are identical except for 28,29,30,31 in a practical system - we all know the practical applications of those pins. Some cogs behave a little differently than others, and that makes some very good stuff impossible on all cogs: 1. using LOGIC modes of counters to do serial receive (found by kuroneko), and 2. IIRC some cogs have trouble producing nice audio in some circumstances (found by jonnymac?).

Granted these are not typical. The serial receive optimization cog specific failure was a severe disappointment to me.

No one likes singing Kumbaya ? So sad

This is the first I'm hearing about some cogs having trouble producing "nice audio" and am interested in reading more. You of course are correct that during boot cog 0 and pins 28-31 are used. After that however all pins are general purpose although realistically they're not.

potatohead · 2014-05-03 17:37

@Todd Just know it's not a long window. It is actually a specific point in time. Look at the diagram in the datasheet.

We won't know how Chip did it on this design, until we get the FPGA image and docs. I'm hoping there are still 16 cycles, but now one for each COG, which is a nice boost.

You should pick up a P1 and run some code! It is a fun chip. Plus, doing that will get you ramped up to explore this design and test with us.

I'm actually working through some composite video in software code written for P1, getting timings, etc... ready for doing that task on this design as we have VGA support to start. Others will likely do some debugger, test framework stuff. And just play some too.

jmg · 2014-05-03 17:37

Rayman wrote: »

Personally, I think striving towards symmetry is a good thing.

Sounds good to me too

The removal of wasted bandwidth, by adding a simple TopUsedCog scanner, has perfect symmetry : all used COGS are equal.

It even has a 100% backward compatible mode: by simply using Cog#16 (or a Config Enable flag) it behaves exactly as a present fixed 80ns pitch slot does. Users can make their P1+ behave as though this feature does not exist, if they choose.

Where this added user choice shines, is in those designs with fewer COGS, the HUB SLOT bandwidth can now go as low as 10ns.
With the REP opcode ( and I'm not sure where Auto-Inc is in the mix?) software block move loops can be much faster than 80ns. (REP+AutoINC should make 10ns ?)
On a device with a 10ns cycle time, a fixed 80ns pitch is slow, in P1+ terms.

Starting a design you know will use 6 COGS ? - easy, just make TopCog=6, for no changes in timing during testing.
Clocks can be lower for the same Bandwidth, saving system power.
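A minimal sketch of the TopUsedCog arithmetic, assuming one hub slot per 5 ns system clock. That figure is inferred from the 80ns pitch quoted for 16 cogs and is not a published spec; the names `slot_pitch_ns` and `scan_order` are hypothetical, and `top_used_cog` is read here as the number of cogs the scanner visits.

```python
SLOT_NS = 5     # assumed: 80 ns pitch divided by 16 cogs
MAX_COGS = 16


def slot_pitch_ns(top_used_cog):
    """Time between successive hub slots for any one cog, in ns."""
    if not 1 <= top_used_cog <= MAX_COGS:
        raise ValueError("top_used_cog must be between 1 and 16")
    return top_used_cog * SLOT_NS


def scan_order(top_used_cog, slots):
    """Cog number served in each successive hub slot."""
    return [slot % top_used_cog for slot in range(slots)]
```

With these assumptions, scanning all 16 cogs reproduces the fixed 80ns pitch, six cogs gives 30ns, and two cogs gives the 10ns figure mentioned above. Every scanned cog still gets the same pitch as every other.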

Rayman wrote: »

This idea of a cog stealing hub cycles seems to introduce a severe asymmetry with not much benefit.

You will be pleased to learn that in the above, there is no stealing at all, and no asymmetry - so there are no concerns.

Even adding Hub Slot Pair sharing, I coded a safe-floor version, again no stealing, so no asymmetry surprise concerns.

Summary of designs tested:

TopUsedCog scan is very simple and has no speed or wider design impact, in the sense of flags.
( UsedCog can mean COG that uses HUB memory, if other usable COG-COG links exist )

Hub Slot Pair Sharing has some Logic path impact, as it makes a fetch-time decision.

The actual Logic resource is not large, I coded a version with 2 Pair choice Enable flags, Pair_4 and Pair_8 as that co-operates better with TopUsedCog scanner.
FPGA reports Total number of LUT4s: increases from 22 to 40, with Pair Choices, (and the MHz is starting to drop as the fetch-time MUX paths contribute)
Hard to say if Hub Slot Pair Sharing would impact the top system MHz, but it could.( IIRC Chip mentioned 250MHz SRAM ? )

TopUsedCog scanner, and a Sync RAM Table scanner, make no fetch-time decisions, so they can run at full design speed.
A Sync RAM Table scanner needs a user config, but can easily default to 80ns pitch.

jazzed · 2014-05-03 17:41

4x5n wrote: »

This is the first I'm hearing about some cogs having trouble producing "nice audio" and am interested in reading more.

http://forums.parallax.com/showthread.php/148280-Prop-sound-quality-question?p=1187328&viewfull=1#post1187328

Maybe it wasn't a COG specific problem ... not sure how I got the impression it was ... too many posts, not enough time ;-)

potatohead · 2014-05-03 17:49

TopUsedCog means a video driver, for example, needing lots of HUB access cycles would not function in a many COG configuration. On the other hand, one coded to use two COGS to get the same result given the simple round robin scheme would.

Limiting the number of COGS means some objects would only work with some DAC pins in this design, unless we also want to add some sort of cog number redirection to the works. Doing a COGINIT to ensure a specific pin group is at issue here. It's not at issue with the simple round robin scheme.

Re: Pins and HUB RAM. Those must be planned. Additional details, such as the cycle consumption, number of total COGS in use, etc... don't have to be planned. With the simple round robin scheme, not having to plan those things is quite simply less hassle.

4x5n · 2014-05-03 17:50

jazzed wrote: »

http://forums.parallax.com/showthread.php/148280-Prop-sound-quality-question?p=1187328&viewfull=1#post1187328

Maybe it wasn't a COG specific problem ... not sure how I got the impression it was ... too many posts, not enough time ;-)

Nod, Just keeping up with this thread has been a challenge. :-)

ctwardell · 2014-05-03 18:20

4x5n wrote: »

I agree with heater, oh the horror!!!

I really hope that you're not implying that those that agree with you and your point of view are acting like engineers and anyone that dares to disagree with your superior opinion is practicing FUD and are to be silenced!

I'm not trying to silence anyone, but I'd rather they make measured decisions based on technical review and not things they can't put their finger on.

Broad general statements that any form of hub assignment other than strict round robin will make the OBEX unusable is spreading FUD, and I will not back down from saying that.

C.W.

Todd Marshall · 2014-05-03 18:20

Mike Green wrote: »

Todd Marshall,
This cog - hub linkage is fundamental to the P1. Each cog gets one access to the hub logic every 16 clock cycles. If it attempts to access the hub logic, whether for a read, a write, or other hub instruction, not during its access slot, the cog stalls (repeats one of the clock cycles) until the slot comes up. That's it. Other than that, the cog runs freely executing most instructions in 4 clock cycles. The hub access instructions usually take 7 clock cycles if no stall is needed. Similarly, the WAITxxx instructions stall the cog until the instruction's condition is satisfied, whether from video timing, I/O pin state, or a match with the system clock value. A stalled cog is in low power mode until the stall is ended. You can usually fit a hub access instruction and 2 other non-hub instructions in a 16 cycle hub cycle keeping synchronization with the hub. For example, you can have a hub read (RDLONG) followed by two adds (for incrementing source and destination addresses) in one hub cycle with these 3 instructions repeated for some number of times for a block transfer into or out of the hub, followed by a DJNZ to loop. You'd lose one hub cycle with the DJNZ, but that's amortized over several transfers due to the repeated sequences.

Each of the cogs has a number (0-7) associated with it and they get a fixed hub access slot in order by number. I don't know whether this is implemented internally with a counter or shift register.

one access to the hub logic every 16 clock cycles: Crucial point I've totally missed in reading. Leaves me thinking the HUB is multi-tasking since if every COG does a RD in its SLOT, the HUB has 56 clocks worth of work to do in 16 clocks.

This cog - hub linkage is fundamental to the P1: Why???

not during its access slot, the cog stalls (repeats one of the clock cycles) until the slot comes up: Can a WR stall be shorter than a RD stall, since the COG doesn't have to wait for the actual write? And to confirm ... a stall isn't until the next HUB visit, it's until the HUB finishes its work (e.g. obtains the RD results)?

Each of the cogs has a number (0-7) associated with it and they get a fixed hub access slot in order by number. I don't know whether this is implemented internally with a counter or shift register: And this begs the question: If it's a counter, is it indexing an array? ... or selecting in a mux? If it's a shift register, is it selecting the COG on a bus?

When a COG gets initiated does this mean a bit in a register gets set to be anded with the shift register to actually ask the COG if it has an instruction for it?

So typical. One answer generates more questions. Regardless, if stalls are 7 clocks when in sync, makes sense to have the same COG ready for a HUB visit 8 clocks forward rather than hardwired 16 clocks forward.

Surely these details are described somewhere. Can you give me a link?

Todd Marshall · 2014-05-03 18:43

potatohead wrote: »

You should pick up a P1 and run some code! It is a fun chip. Plus, doing that will get you ramped up to explore this design and test with us.

I have 4 on Schmart cards (a bargain), one with the education kit, one with the Professional Development Board, and one DIP chip for battery bank experiments I'm starting. Plus I have a Terasic with a Cyclone V getting ready for the FPGA experience. I'm pretty serious about this. Being an EE and programming for 50 years and having created an interpretive language in C++ (WithGLEE.com) makes me thoroughly dangerous.

Brand X is getting a significant amount of my study time as well. They're cranking out some amazing hybrid stuff lately.

Life is good.

potatohead · 2014-05-03 18:54

Broad general statements that any form of hub assignment other than strict round robin will make the OBEX unusable is spreading FUD, and I will not back down from saying that.

I agree with you.

IMHO, it's better to say, "more difficult to use" in that COG code may require more from a COG than is made available due to the allocation scheme already favoring other code running on other COGS. In general, the more convoluted the allocation scheme is, the more difficulty potential there is.

How people assign value to that varies considerably, and that's why this discussion is painful. We quite literally aren't solving for the same results.

4x5n · 2014-05-03 19:15

ctwardell wrote: »

I'm not trying to silence anyone, but I'd rather they make measured decisions based on technical review and not things they can't put their finger on.

As long as they agree with you. Anyone daring to disagree is simply dismissed as spreading FUD.

Broad general statements that any form of hub assignment other than strict round robin will make the OBEX unusable is spreading FUD, and I will not back down from saying that.

C.W.

Nice straw man. I don't remember saying that your hub slot table would make the OBEX unusable, and to my knowledge you're the only one making that claim. Talk about FUD!!! Why don't you try discussing the merits of your scheme rather than simply dismissing any ideas that differ from your scheme as FUD?

dMajo · 2014-05-03 19:27

Todd Marshall wrote: »

one access to the hub logic every 16 clock cycles: Crucial point I've totally missed in reading. Leaves me thinking the HUB is multi-tasking since if every COG does a RD in its SLOT, the HUB has 56 clocks worth of work to do in 16 clocks.

This cog - hub linkage is fundamental to the P1: Why???

not during its access slot, the cog stalls (repeats one of the clock cycles) until the slot comes up: Can a WR stall be shorter than a RD stall, since the COG doesn't have to wait for the actual write? And to confirm ... a stall isn't until the next HUB visit, it's until the HUB finishes its work (e.g. obtains the RD results)?

Each of the cogs has a number (0-7) associated with it and they get a fixed hub access slot in order by number. I don't know whether this is implemented internally with a counter or shift register: And this begs the question: If it's a counter, is it indexing an array? ... or selecting in a mux? If it's a shift register, is it selecting the COG on a bus?

When a COG gets initiated does this mean a bit in a register gets set to be anded with the shift register to actually ask the COG if it has an instruction for it?

So typical. One answer generates more questions. Regardless, if stalls are 7 clocks when in sync, makes sense to have the same COG ready for a HUB visit 8 clocks forward rather than hardwired 16 clocks forward.

Surely these details are described somewhere. Can you give me a link?

The hub is basically a 1:8 mux to the common (shared) ram. In P1 it runs half the speed of the cog. It performs a read or write in 1 of its clocks. So in 8 of its clocks it serves (if needed) all the 8 cogs. Because the cog runs at double speed it translates to one cog having access to shared ram once every 16 clocks.
The stall is when the cog is out of sync with the hub. If in this case a read or write access is made to the hub the cog "pauses" until the hub window (1 hub clock or 2 cog clocks wide) passes in front of it, then it resumes the work with the waiting hub exchange operation.
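The 1:8 mux description can be sketched in Python as follows. The 2-clock window and 8 cogs come from the post; the assumption that cog 0 owns the window starting at system clock 0 is arbitrary, and `hub_owner` and `windows` are hypothetical names.

```python
NUM_COGS = 8               # P1 cogs
CLOCKS_PER_HUB_CLOCK = 2   # hub runs at half the cog clock (from the post)


def hub_owner(system_clock):
    """Cog connected to shared RAM at a given system clock (cog 0 at clock 0 is assumed)."""
    return (system_clock // CLOCKS_PER_HUB_CLOCK) % NUM_COGS


def windows(cog, horizon):
    """System clocks below horizon at which cog is connected to shared RAM."""
    return [t for t in range(horizon) if hub_owner(t) == cog]
```

For example, `windows(3, 32)` gives `[6, 7, 22, 23]`: a window two system clocks wide that comes around once every 16 system clocks.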

http://parallax.com/sites/default/files/downloads/P8X32A-Propeller-Datasheet-v1.4.0_0.pdf
http://parallax.com/sites/default/files/downloads/P8X32A-Web-PropellerManual-v1.2_0.pdf
http://parallax.com/sites/default/files/downloads/AN001-P8X32ACounters-v2.0.pdf
http://www.parallax.com/downloads?title_1=an0

Mike Green · 2014-05-03 22:13

The hub is not multi-tasking. As dMajo noted, the hub functionally acts as a 1:8 multiplexor between the shared RAM and the cogs. For one 2-clock period, one cog is effectively connected to the shared RAM. The Propeller Manual has some diagrams showing how this works in terms of timing (see pages 24-25).

The P1 design allows the cogs to run independently from the hub unless some hub resource is needed (shared memory, locks, cog control). If a hub resource is needed, the time required is deterministic and can be easily predicted given one previous synchronization of the cog to the hub.

Regarding shift register vs. counter ... it doesn't matter. A cog can request its number (0-7) and, when starting up a cog, a specific cog can be specified or the hub can pick the first idle cog identified (by an unspecified mechanism). Read the description of the COGINIT/COGNEW instruction in the Propeller Manual.

koehler · 2014-05-03 23:08

OK, as the technical implementation of hub sharing does not seem to be the primary issue,
perhaps we should just get down to the underlying matter.

With the new P16:
Mode Normal is basically the exact same as P1 except for 16 Cores, agreed? No issues with OBEX, or juggling Core/hub availability.
Though, some tweaking is expected as timing/speed is different, so regardless of what has been said, there is OBEX impact.

Mode Hub Sharing at its simplest (Paired Cores).
The Programmer will have x number of Cores running Standard single-Core objects
He/She will then traipse to the OBEX to pick out a particular Jumbo-Object that they want to use.
When they go to the D/L page, it is quite clear that the Jumbo-Object REQUIRES x number of donor Cores.
When they open the .zip file, the code header also indicates it requires x number of Cores.

Can we stipulate some basics now?

1. Prop users are generally just a bit more informed than the average Arduino user.
2. They ARE going to know that the object they are using requires x Cores, etc.
3. They are going to know what Cores are in use, how many are idle, etc.

What does this all mean?

A. The continued handwaving over the potential confusion and mayhem that will result when users -accidentally- run a Jumbo object is overblown.
B. The intellectual grind of keeping a simple running tab of #Normal Cores and #Jumbo-Object Cores is not excessive considering all of the other things that need to be assigned, managed, etc. Again, rather overblown.

So, assuming these are generally correct, what is the root cause of all this angst against implementing hub sharing?

It appears to me, that some people would prefer to give up some significant, competitive advantages that could positively affect the new Props interest and uptake, and revenue generation, simply to avoid having to do a minimal amount of resource management that is in itself already required.

While this is not a Democracy, perhaps a simple poll as we had with the P2 4 Core vs 16 Core would:
- Allow everyone to publicly vote for what they wish
- Give Parallax some idea of how popular/useful the feature -may- be

Chip and Ken could then look at those results and assign them whatever weight they deem useful in their overall calculus.

Realistically, from Chip's previous (my interpretation) it seems as if this is already more on the 'Should' implement side, especially considering there is nothing preventing anyone from -not- using the feature.

Just a thought.

Heater. · 2014-05-03 23:42

koehler,

There have been many suggestions put forward to enable a COG to get more HUB bandwidth. Most of them I am averse to because they introduce a coupling between software components. What happens in one component can modulate the speed of another.

The scheme you outline above and as described by Cluso and others is actually perfectly acceptable.

Basically in that scheme a "software component" consumes two or more COGs and it can allow for one of its COGs to use the HUB access time that would normally be allocated to another. In the extreme that other COG might not be running any useful code at all!

The main point here is that the software component cannot affect any other components in the system or be affected by them. Timing isolation between components is preserved.
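The isolation property can be illustrated with a small Python sketch. It assumes a 16-slot rotation with one slot per clock and one slot per cog, which is what posters in this thread expect but is not confirmed for the final design; `slot_table` and `access_clocks` are hypothetical names, and the donation mechanism is invented for illustration.

```python
NUM_COGS = 16   # assumed: one hub slot per cog per 16-clock rotation


def slot_table(donations):
    """Build the rotation; donations maps a donor cog to the cog that uses its slot."""
    table = list(range(NUM_COGS))
    for donor, receiver in donations.items():
        table[donor] = receiver
    return table


def access_clocks(table, cog, rotations=2):
    """Clocks at which cog gets hub access over the given number of rotations."""
    return [
        rotation * len(table) + slot
        for rotation in range(rotations)
        for slot, owner in enumerate(table)
        if owner == cog
    ]


PLAIN = slot_table({})
PAIRED = slot_table({8: 0})   # cog 8 donates its slot to cog 0
```

In this sketch cog 0 gets access every 8 clocks instead of every 16, cog 8 gets none, and every other cog sees exactly the same access clocks as in the plain round robin.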

So then we might ask:

1) Is this cheap, easy, quick to do in silicon?
2) Is this easy to use for the programmer?
3) How does this interact with the shared CORDIC hardware?
4) Is the actual benefit at the end of the day worth the hassle of 1) and 2) ?

Well, and we have to ask how it would look in practice. Just now we have instructions for claiming a single COG and getting it running, COGINIT/COGNEW. This scheme seems to call for such an instruction that can claim two or more COGs at the same time, which seems rather infeasible.

evanh · 2014-05-03 23:47

dMajo wrote: »

The hub is basically a 1:8 mux to the common (shared) ram. In P1 it runs half the speed of the cog. It performs a read or write in 1 of its clocks. So in 8 of its clocks it serves (if needed) all the 8 cogs. Because the cog runs at double speed it translates to one cog having access to shared ram once every 16 clocks.

Oh, I just realised the Prop1 is not 8 instruction intervals per hub access at all. Rather it's only 4. Not sure why I had that one wrong.

Another detail is the execution of each Prop1 instruction is spread across 5 system clocks. Which, interestingly, makes the execute stage the same two clocks long as what has been adopted on the Prop1+. There is a lot more overlap now though, with at least two full instruction flows being juggled at once.

The only distinction between the Prop1+ design and that of a full blown pipeline, that I can make out, is the execute stage is asynchronously spread across the two clocks it occupies. I doubt that is sufficient deviation to say it's not clearly a pipelined design.

Todd Marshall · 2014-05-03 23:48

Mike Green wrote: »

The hub is not multi-tasking. As dMajo noted, the hub functionally acts as a 1:8 multiplexor between the shared RAM and the cogs. For one 2-clock period, one cog is effectively connected to the shared RAM. The Propeller Manual has some diagrams showing how this works in terms of timing (see pages 24-25).

So, on a P1, if you have 8 cogs running and all request a RD (7 cycles to accomplish) in the same round robin, how do 56 cycles of read operation get done in 16 cycles (or is it 8 or is it 32 cycles) of HUB rotation?

evanh · 2014-05-04 00:31

Todd Marshall wrote: »

So, on a P1, if you have 8 cogs running and all request a RD (7 cycles to accomplish) in the same round robin, how do 56 cycles of read operation get done in 16 cycles (or is it 8 or is it 32 cycles) of HUB rotation?

8 to 23 system clocks for a Cog to read a Hub location. However, this only consumes one Hub cycle (two system clocks) per Cog access. All the rest of the Cog's additional time, above the typical 4 clocks, is consumed with decoupling/buffering stuff I presume.

And the Hub accesses are in fixed order with unused accesses being paced out. I'm not sure how the phasing relationship of Reads and Writes sit though. Probably dealt with within the slot time.
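To make the arithmetic behind this answer concrete, here is a rough Python sketch for eight cogs that all issue a hub read at the same clock. It assumes a simplified model in which a read costs the wait for the cog's slot plus a fixed 8 clocks, and in which cog n's slot starts at clock 2n; both are assumptions chosen to match the 8 to 23 range quoted above, and `read_span` is a hypothetical name.

```python
HUB_PERIOD = 16   # system clocks per full hub rotation on the P1
SLOT_WIDTH = 2    # system clocks of hub time used per access (one hub cycle)
READ_BASE = 8     # assumed cog-side cost of a read that hits its slot exactly
NUM_COGS = 8


def read_span(cog, issue_clock):
    """System clocks the cog spends on a hub read issued at issue_clock."""
    slot_start = cog * SLOT_WIDTH
    wait = (slot_start - issue_clock) % HUB_PERIOD
    return wait + READ_BASE


# All eight cogs issue a read at clock 0.
SPANS = [read_span(cog, 0) for cog in range(NUM_COGS)]
HUB_BUSY = NUM_COGS * SLOT_WIDTH
```

Under this model the cogs spend between 8 and 22 clocks each on their reads, but the hub itself is occupied for only 16 clocks in total, because the rest of each read is cog-side time that overlaps with the other cogs' accesses.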
