The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Roy Eltham · 2014-05-04 01:04

Todd,
The RDxxxx instructions take 8-23 cycles to execute, but they do not spend that whole time performing the read. Basically, the cog takes 7 cycles to do work and there are 1 to 16 cycles of waiting on the hub. Those 7 cycles of work include reading the instruction from COG memory, reading the cog register which contains the address in hub memory to read, interacting with the hub circuitry to indicate it wants to read that address, waiting for the hub to give it the data back, then writing the data to the cog register specified in the instruction. The actual hub read is only taking one (or maybe it's two) of those cycles. If 8 cogs all issued a rdlong on the same exact cycle, then they would all wait different numbers of cycles for the HUB. One of them would end up being in sync and only wait one cycle (and thus only take 8 cycles total); the rest would have to wait more. All of this waiting is happening overlapped with each other. The hub will go around to each of them one at a time over the 16 cycles and from its perspective they will all be serviced. From the cycle that started it all until the end when all rdlongs have been completed will be 23 cycles (7 cycles work, and 16 cycles waiting for the last one serviced).

It's never 7 cycles total for a rdlong; it's a minimum of 8, where 7 are work and 1 is waiting. The wait is actually inline among those 7 work cycles. I think it's one or two before the last work cycle. The last actual work cycle is writing the data to a cog register.
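The 8-to-23-cycle range described above can be sketched with a small model. This is only an illustration under stated assumptions: the hub visits cog N every 16 system clocks at an offset of 2 x N, an instruction always waits between 1 and 16 cycles, and the exact phase at which a cog meets its window is guessed. The names `hub_wait` and `rdlong_cycles` are made up for this sketch.

```python
HUB_PERIOD = 16   # system clocks per full hub rotation on the 8-cog chip
WORK_CYCLES = 7   # cycles of cog-side work, as described in the post

def hub_wait(cog, issue_cycle):
    """Cycles spent waiting for the hub window; always 1..16 in this assumed model."""
    return (2 * cog - issue_cycle - 1) % HUB_PERIOD + 1

def rdlong_cycles(cog, issue_cycle):
    """Total cycles for one RDLONG: 7 work cycles plus the hub wait."""
    return WORK_CYCLES + hub_wait(cog, issue_cycle)

# All 8 cogs issue a RDLONG on the same system clock; each one waits a
# different number of cycles, and the waits overlap in time.
for cog in range(8):
    print("cog", cog, "takes", rdlong_cycles(cog, issue_cycle=1), "cycles")
```

In this model no two cogs that issue on the same clock ever finish with the same total, and the totals are spaced 2 cycles apart, which is the "hub goes around to each of them" behaviour. Whether the last cog finishes at 22 or 23 cycles depends on how the shared issue cycle lines up with the rotation.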

jmg · 2014-05-04 03:40

koehler wrote: »

OK, as the technical implementation of hub sharing does not seem to be the primary issue,
perhaps we should just get down to the underlying matter.
...

Mode Hub Sharing at its simplest (Paired Cores).

See my summary in #1696

The simplest means of not wasting bandwidth, is not pairing, but is TopUsedCog scanning.
That has very low silicon cost and no speed cost. It also can easily default to 1:16 slots, and needs no extra flag, or maybe one config bit. TopUsedCog scanning can work also with additional options below.

Next, is Pairing (or more), note that needs fetch-time decisions, and it needs more logic (can be added on top of TopUsedCog), and it will likely have a speed impact. I have coded a TopUsedCog + Choice of 2 Pairing spans, using safe-floor decision.

Another option of Table mapping (Sync small ram) is fast, as it avoids fetch-time decisions, and has more user control than Pairing.
It can also default to 1:16, but needs a means to load the small RAM table - perhaps WRQUAD ?

Todd Marshall · 2014-05-04 05:42

Roy Eltham wrote: »

Todd,
The RDxxxx instructions take 8-23 cycles to execute, but they do not spend that whole time performing the read. Basically, the cog takes 7 cycles to do work and there are 1 to 16 cycles of waiting on the hub. Those 7 cycles of work include reading the instruction from COG memory, reading the cog register which contains the address in hub memory to read, interacting with the hub circuitry to indicate it wants to read that address, waiting for the hub to give it the data back, then writing the data to the cog register specified in the instruction. The actual hub read is only taking one (or maybe it's two) of those cycles. If 8 cogs all issued a rdlong on the same exact cycle, then they would all wait different numbers of cycles for the HUB. One of them would end up being in sync and only wait one cycle (and thus only take 8 cycles total); the rest would have to wait more. All of this waiting is happening overlapped with each other. The hub will go around to each of them one at a time over the 16 cycles and from its perspective they will all be serviced. From the cycle that started it all until the end when all rdlongs have been completed will be 23 cycles (7 cycles work, and 16 cycles waiting for the last one serviced).

It's never 7 cycles total for a rdlong; it's a minimum of 8, where 7 are work and 1 is waiting. The wait is actually inline among those 7 work cycles. I think it's one or two before the last work cycle. The last actual work cycle is writing the data to a cog register.

Thanks Roy,

From the manual: The Hub maintains this integrity by controlling access to mutually exclusive resources, giving each cog a turn to access them in a “round robin” fashion from Cog 0 through Cog 7 and back to Cog 0 again. The Hub, and the bus it controls, runs at half the System Clock rate. This means that the Hub gives a cog access to mutually exclusive resources once every 16 System Clock cycles. Hub instructions, the Propeller Assembly instructions that access mutually exclusive resources, require 8 cycles to execute but they first need to be synchronized to the start of the hub access window.

I hope you get my confusion. From this, and what you say, the work actually gets interleaved? COG0, COG2, COG4, COG6, COG1, COG3, COG5, COG7 or 0,_,2,_,4,_,6,_,_,1,_,3,_,5,_,7,_,1,_,3,_,5,_,7,_ (whoops, where did the evens go?). Let's assume by magic they stay in the rotation. Then let's say 2 and 4 drop off just as 5 starts. With less contention we continue 5,_,7,_,1,_,3,_,5,_,_,_,_,_,3,_,_,6,_,0,_,_,3,_. Now 2 and 4 come back. 5,_,7,_,1,_,3,_,_,6,_,0,_,2,_,4,_,6,_,0,_ (whoops ... where did the odds go? )

Still assuming magic, with all this it leaves me wondering:
(1) what is the magic?
(2) is the interval between executions of COG5 guaranteed to stay in sync with itself and the other COGS?

My guess to (2) is no. So even if we get synced up, we aren't guaranteed to stay synced up ... with ourselves or with other COGs? Thus jitter?

Without that guarantee, where is the determinism? If it's there it sure is subtle.

Obviously I've made some mistake.

RossH · 2014-05-04 05:51

Todd Marshall wrote: »

Obviously I've made some mistake.

Obviously. You need to read page 24 of the Propeller Manual v1.2 ("Hub"). It's explained there better than anywhere else.

Ross.

ctwardell · 2014-05-04 06:53

4x5n wrote: »

Nice straw man. I don't remember saying that your hub slot table would make the OBEX unusable and to my knowledge you're the only one making that claim. Talk about FUD!!! Why don't you try discussing the merits of your scheme rather than simply dismissing any ideas that differ from your scheme as FUD?

Both of the following are intended to instill fear in others that they will no longer be able to use the OBEX in the same way they currently use it.

Heater. wrote: »

We just want to be able to drop components, written by others, into our designs and have them work.

4x5n wrote: »

If for no other reason it makes objects in the OBEX much harder to use. The programmer would be forced to add up the cog slots needed by the objects and hope that the authors were correct.

BTW, it isn't my scheme. The idea was brought up by others, I made a few suggestions very early on during the time that Chip floated his suggested version for the P2.

My main concern is that on a technical forum an argument is being waged over calls to emotion instead of technical facts.

The entire argument regarding the OBEX is driven by Heater's concern about Bill's 'Mooching' concept of allowing objects to make use of a slot not used by other cogs. Because this is non-determinate, the concern is that an object that runs well when plenty of spare slots are available will fail when someone adds another object, say from the OBEX, that uses all of its own slots, thus reducing the pool of available spare slots.

This is a valid concern, except for the take away that this breaks the OBEX.

If 'mooching' objects are allowed in the OBEX, that would be true, but imposing the requirement that ALL OBEX objects be non-mooching solves that issue.

The concept of breaking the OBEX brought up by this scenario is now used to paint any sharing scheme, even when they are fully determinate and cannot in any way affect other objects.

Cluso99's pairing concept is no different from any other object currently in the P1 OBEX that uses multiple COGs, other than letting the paired COGs share THEIR slots in a manner decided by the programmer, yet it is also derided by those opposed to any form of slot sharing.

C.W.

dMajo · 2014-05-04 08:36

Todd Marshall wrote: »

Thanks Roy,

From the manual: The Hub maintains this integrity by controlling access to mutually exclusive resources, giving each cog a turn to access them in a round robin fashion from Cog 0 through Cog 7 and back to Cog 0 again. The Hub, and the bus it controls, runs at half the System Clock rate. This means that the Hub gives a cog access to mutually exclusive resources once every 16 System Clock cycles. Hub instructions, the Propeller Assembly instructions that access mutually exclusive resources, require 8 cycles to execute but they first need to be synchronized to the start of the hub access window.

I hope you get my confusion. From this, and what you say, the work actually gets interleaved? COG0, COG2, COG4, COG6, COG1, COG3, COG5, COG7 or 0,_,2,_,4,_,6,_,_,1,_,3,_,5,_,7,_,1,_,3,_,5,_,7,_ (whoops, where did the evens go?). Let's assume by magic they stay in the rotation. Then let's say 2 and 4 drop off just as 5 starts. With less contention we continue 5,_,7,_,1,_,3,_,5,_,_,_,_,_,3,_,_,6,_,0,_,_,3,_. Now 2 and 4 come back. 5,_,7,_,1,_,3,_,_,6,_,0,_,2,_,4,_,6,_,0,_ (whoops ... where did the odds go? )

Still assuming magic, with all this it leaves me wondering:
(1) what is the magic?
(2) is the interval between executions of COG5 guaranteed to stay in sync with itself and the other COGS?

My guess to (2) is no. So even if we get synced up, we aren't guaranteed to stay synced up ... with ourselves or with other COGs? Thus jitter?

Without that guarantee, where is the determinism? If it's there it sure is subtle.

Obviously I've made some mistake.

Please have a look at the first page of the datasheet. There is a functional block diagram of the Propeller internals there, and you can see the cog<>hub interaction.

IIRC the execution stages of the "normal 4 clocks" instructions are the following:
1) fetch instruction
2) decode instruction
3) fetch source value
4) fetch destination value
5) execute instruction
6) optionally write the result to the destination (depending on NR)
Stages 5/6 overlap with stages 1/2 of the next instruction to execute, hence the 4-clock execution rate (except for the very first instruction, I presume).

Now, since the cog and the hub clocks both originate from the system clock, with the latter being divided by 2, there can't be jitter or loss of sync in the cog-to-cog and cog-to-hub relations, because of the common clock source. There can only be whole clock periods of delay between cogs, depending on what they are synced to. E.g. if you wait on a counter or pin with all cogs, when the event happens all cogs will be synced with each other (at the same instruction execution stage) and without any jitter; of course, being all synced with each other, at least 7 will be out of sync with the hub.

When we say that the cog needs to get in sync with the hub, that means that one or more clock periods need to be skipped (without doing anything) to align the instruction execution stage with the hub window so that data exchange can be done. Once this is done (with the first access), the two will remain in sync forever unless a waitvid or other wait (counter, pin) is done.
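The "sync once, then stay in sync" behaviour can be illustrated with a rough simulation. It is a sketch under assumptions, not a description of the silicon: a hub instruction costs 7 work cycles plus a 1-to-16 cycle wait for the cog's window (window for cog N at offset 2 x N in a 16-clock rotation), and every other instruction costs 4 clocks. The function names are invented for the example.

```python
HUB_PERIOD = 16       # system clocks per hub rotation (8 cogs, 2 clocks each)
WORK_CYCLES = 7       # cog-side work in a hub instruction (assumed model)
REGULAR_CYCLES = 4    # clocks per ordinary instruction

def hub_instruction_cycles(cog, issue_cycle):
    """7 work cycles plus a 1..16 cycle wait for this cog's hub window."""
    wait = (2 * cog - issue_cycle - 1) % HUB_PERIOD + 1
    return WORK_CYCLES + wait

def loop_periods(cog, regular_instructions, iterations, start_cycle=0):
    """Cycles taken by each pass of a loop: one hub read, then N regular instructions."""
    clock = start_cycle
    periods = []
    for _ in range(iterations):
        begin = clock
        clock += hub_instruction_cycles(cog, clock)
        clock += regular_instructions * REGULAR_CYCLES
        periods.append(clock - begin)
    return periods

print(loop_periods(cog=0, regular_instructions=2, iterations=4))
print(loop_periods(cog=0, regular_instructions=3, iterations=4))
```

Only the first pass pays an alignment cost that depends on where the cog happened to start. Every later pass takes the same number of clocks, always a multiple of 16, because both the cog and the hub count edges of the same system clock.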

The video circuitry runs asynchronously and when the cog needs to exchange data with it more or less the same happens with it. Syncing to the video (skipping cog clocks periods to get in sync with it) will take the cog out of sync with the hub. It will need to wait to restore the sync at the next access.
Since the video circuitry takes whatever is on the bus when it needs new data, and waitvid does exactly that (it holds the data bus with known data to be picked up by the video circuitry), expert PASM programmers can keep cog, hub and video in sync by avoiding waitvid and instead presenting the right data on the bus at the right time.

Since English is actually my third language I hope I've been clear enough, others will eventually correct me.

To avoid going OT on this thread, I suggest you open a new one if you need clarifications on Prop internals that have nothing to do with the new P1+ development discussions.

Todd Marshall · 2014-05-04 09:18

RossH wrote: »

Obviously. You need to read page 24 of the Propeller Manual v1.2 ("Hub"). It's explained there better than anywhere else.

Ross.

I quoted from that page. Further it says: "Cog 1, for example, may start a hub instruction during System Clock cycle 2, in both of these examples, possibly overlapping its execution with that of Cog 0 without any ill effects." Does that violate the statement "To maintain system integrity, mutually exclusive resources must not be accessed by more than one cog at a time"? Well, we don't know. It says: "The Hub, and the bus it controls, runs at half the System Clock rate. This means that the Hub gives a cog access to mutually exclusive resources once every 16 System Clock cycles." Does this suggest that the bus the HUB controls is "not" a mutually exclusive resource? Seems to. We have COG0 and COG1 both using it at the same time. So what is the mutually exclusive resource the RD and WR HUB instructions are contending for? And once granted, where does it say for how long they are granted? Perhaps careless reading leads me to believe 2 cycles or less is "how long". If it's ever any longer than that, I have issues. But I've been told in this thread that memory is single ported.

From Wikipedia: Dual-ported RAM (DPRAM) is a type of Random Access Memory that allows multiple reads or writes to occur at the same time, or nearly the same time, unlike single-ported RAM which only allows one access at a time.

Thus, the "actual" access must be 2 cycles or less. Further, these 2 cycle accesses must not overlap with any other COG regardless of what mutually exclusive resource the bus is delivering to that COG. The other 6 cycles must be "getting ready for the access" and "moving results to other "non-exclusive" resources". Addresses have to be fetched and put on the bus. Data needs to be delivered over the bus presumably to or from registers in the COG. All in 2 cycles that don't overlap with another COG's 2 cycles doing the same thing (or different things) but shifted two cycles in time. All this guaranteed regardless of how many COGS seek to do HUB instructions in the round. Slick.

Mike Green · 2014-05-04 11:53

The hub and the cogs are running independently, but synchronized and controlled by the system-wide clock generator. Functionally, the hub has a 1:8 multiplexor that connects its memory bus to a different cog every 2 clock cycles. Each cog has its own circuitry that causes the cog to stall if it's at a particular phase (I don't remember which one) of a hub access instruction and the current cog that the hub is connected to is not the cog trying to access it. When that hub access instruction is no longer stalled, it has exclusive access to the hub memory bus for 2 clock cycles. The cog and hub do their things (read or write memory access or lock access or other cog instruction) and the hub logic moves on to the next cog, disconnecting the cog from the memory bus. The cog completes the instruction and goes on to the next one.

Yes, it's slick and relatively straightforward. The diagram in the Manual and Datasheet that shows 8 cogs around a central hub is functionally how it works.
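The 1:8 multiplexer view can be written down in a few lines. This is a minimal sketch of the functional description above, assuming the rotation starts with cog 0 at clock 0; `connected_cog` and `next_window` are illustrative names only.

```python
CLOCKS_PER_SLOT = 2   # the hub stays connected to one cog for 2 system clocks
NUM_COGS = 8          # Propeller 1

def connected_cog(clock):
    """Which cog the hub's memory bus is connected to on a given system clock."""
    return (clock // CLOCKS_PER_SLOT) % NUM_COGS

def next_window(cog, clock):
    """First system clock at or after `clock` where `cog`'s 2-clock window begins."""
    period = CLOCKS_PER_SLOT * NUM_COGS        # 16 clocks per full rotation
    start = cog * CLOCKS_PER_SLOT
    return clock + (start - clock) % period

print([connected_cog(t) for t in range(16)])
print(next_window(cog=3, clock=7))
```

Because `connected_cog` returns exactly one cog for any clock, two cogs can never hold the memory bus at the same time, even though their hub instructions overlap in the cycles before and after the window.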

Seairth · 2014-05-04 13:08

dMajo wrote: »

As I have understood it, the cog RAM architecture was redesigned and thus the P1+ instructions execute in 2 clocks. There is no pipeline; simply, due to the changed cog RAM, more stages now execute in parallel.

If it were like you are saying, that would mean that you can sample an IO pin at best every 4 clocks with a single cog. As I have understood it, you can now effectively sample every two clocks.

Please look at Chip's timing diagram in post #1191 of this thread. Each instruction takes 4 clock cycles to complete. They overlap by two clock cycles (what I referred to as semi-pipelined), giving an effective rate of two clocks per instruction. You should still be able to sample IO pins every 2 clock cycles.

Tubular · 2014-05-04 13:17

Hi Todd

There is a simulator called 'Gear' that might help you understand what's going on, as you can pause and advance single clocks, and view what's happening in each of the cogs, as well as the hub. You load the binary (eeprom) image for pretty much any Propeller application into it, so you can simulate existing code, or try your own. http://sourceforge.net/projects/gear-emu/

Jazzed also has recently made a simulator as described in the P1 threads. I haven't checked it out (yet) but it might well do the same.

Todd Marshall · 2014-05-04 13:35

dMajo and Mike and Tubular,
Thanks for your replies. You people are certainly getting a lot more information out of the diagram in the datasheet than I am.

That not-withstanding, neither reply addresses my points.

Tubular, I will study your reference to GEAR.

Re opening another thread, I actually tried to bring this up on Seairth's blog but the conversation quickly died there. Since it didn't die here, I continued it here.

I'm now going into serious "poke mode" with the P1 chip. If I can't make my concerns show up in the existing P1 iron, I have no issues.

I remain against any "sharing" or "mooching" and in favor of the COG wanting the slot to actually have the slot. I remain unconvinced there is any reason at all to force a 1 to 1 relationship, COG to SLOT. I remain unconvinced a circular queue of assignable length and assignable modulus (less than or equal to the queue length) has any downside at all. With both features added you can perfectly emulate existing behavior. And if for some reason atomic update without cooperation or discipline is desired, it can be had just by giving the same COG enough adjacent slots to perform a RD, manipulate, update, WR. Owning those slots precludes any other COG access.
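One way to picture the circular queue proposed above is the sketch below. It is purely hypothetical: the post does not define the mechanism in detail, so this reads "length" as the number of table entries and "modulus" as how many of them the hub actually steps through before wrapping. `SlotTable` and its methods are invented names, not a proposed hardware interface.

```python
from collections import Counter

class SlotTable:
    """Hypothetical hub slot table: entry i names the cog that owns hub slot i."""

    def __init__(self, entries, modulus=None):
        self.entries = list(entries)
        self.modulus = len(self.entries) if modulus is None else modulus
        if not 1 <= self.modulus <= len(self.entries):
            raise ValueError("modulus must be between 1 and the table length")

    def owner(self, slot_number):
        """Cog that gets the hub on the given slot; the table wraps at `modulus`."""
        return self.entries[slot_number % self.modulus]

    def shares(self):
        """How many slots per rotation each cog owns."""
        return Counter(self.entries[:self.modulus])

# Default: 16 entries, modulus 16, one slot per cog. This emulates plain round robin.
round_robin = SlotTable(range(16))

# Cog 0 owns four adjacent slots in a rotation of 8, e.g. for a read, modify, write
# sequence that no other cog can interleave with.
custom = SlotTable([0, 0, 0, 0, 1, 2, 3, 4] + [0] * 8, modulus=8)
print(custom.shares())
```

With the default table every cog sees exactly the existing 1:16 behaviour, which is the "safe superset" property the argument relies on.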

I'll now quietly return to my seat in the peanut gallery. Thanks for your indulgence.

evanh · 2014-05-04 14:36

It'd be fair to say the understanding of how the Propeller works generally comes from having contact with the Propeller over years, for many of us.

PS: There definitely isn't any dual-ported RAM in the Prop1.

evanh · 2014-05-04 14:45

I suspect it would be exceedingly greedy to assign multiple consecutive slots for a multi-instruction crafted read-modify-write without having a special instruction that performed the key operation with special hardware in the Hub itself.

Otherwise a single Cog would likely chew up a full round of 16 slots or similar.

There are, of course, the LOCK*** instructions already.

MJB · 2014-05-04 14:46

ctwardell wrote: »

If 'mooching' objects are allowed in the OBEX, that would be true, but imposing the requirement that ALL OBEX objects be non-mooching solves that issue.
C.W.

No no no

Don't ban the mooching SUPER OBJECTs from Bill and the other experts from the OBEX.
They need to be there so others (like me) can use them and don't have to write them on their own.
Of course the mooching SUPER OBJECTs have to be clearly identified and flagged as such (so I can search for them ;-) ).
And please STOP (not you C.W. - the others) to think of OBEX users as of little kids,
that need to be saved from the hot fire. A little blister will help the motivation to RTFM.

Good technical discussion helps, but not this spreading of ungraspable fear of the dangerous mooching Object.

koehler · 2014-05-04 15:12

Heater,

Not sure I follow on the 'coupling' comment.

Standard Prop- You have multiple Cores running their single-Core programs.
Currently, if a Standard Core does not have the bandwidth for a project, you can perhaps overcome that by load balancing across 2 or more
Cores. I believe this is what potatohead said he's done.

New Prop- You have Standard Cores running along with an Active/Donor pair.
Proposed solution you do not have to skull sweat it out as per above, but just use an Active/Donor pair.

Would seem to me, in both instances you have potentially the same problem to solve, which may be what you meant by 'timing isolation' ?

Both solutions effectively run faster than a Standard Core alone, and both require the same attention to detail in having them 'mesh' with the Standard Cores.
The difference is that with the Active/Donor option, you can avoid that chunk multi-processing/load sharing work, and just focus on the timing issues with Standard Cores, that you already have to do with load balancing 2 cores.

Of course, the alternative is to not use any hub sharing Objects at all, and forgo any advantages they may bring.
Sooner or later someone is going to want to output 720p or 1080i/p so that they can use cheap panels in their product.
Or maybe someone wants to add a USB2/3 Thumbdrive or SSD, or other sensor/data I/O.

This device is going to be Parallax's big gun for at least the next couple of years, unless they happen to hit a home-run or someone wins the lottery.
Unless hub sharing really would impact users' ability to effectively program the Prop, as this is a 2nd attempt it seems like Parallax should wring every last bit of compute/throughput/connectivity they can from this device, barring extended timelines/financial impacts.

1) Is this cheap, easy, quick to do in silicon?
IIRC, I could swear the couple of times Chip has talked about it, it is a minimal amount of Verilog and LUTs.

2) Is this easy to use for the programmer?
Good question. Chip will ultimately be responsible for that, however past history indicates he does work to make things simple/easy

3) How does this interact with the shared CORDIC hardware?
No idea if or how this will.

4) Is the actual benefit at the end of the day worth the hassle of 1) and 2) ?
Good question. The only other way to get the same benefits of higher bandwidth/low latency is futzing about doing some sort of multiprocessing.
If this reduces that work effort, it opens it up to everyone, perhaps regardless of skill level.

JMG, I can't argue that CogScan is easier, just currently in my mind manually setting up/controlling an Active/Donor cog seems easier/safer to me ATM.

BTW, Mooching in some automatic way seems likely to cause a lot more of the problems that Heater et al have complained about. I don't think I want anything 'automagically' doing anything. I'd want to be firmly in charge of this. Yes, it can probably be intelligently automated and fail-safed, however that ups complexity too much for my taste.

Heater. wrote: »

koehler,

There have been many suggestions put forward to enable a COG to get more HUB bandwidth. Most of them I am averse to because they introduce a coupling between software components. What happens in one component can modulate the speed of another.

The scheme you outline above and as described by Cluso and others is actually perfectly acceptable.

Basically in that scheme a "software component" consumes two or more COGs and it can allow for one of its COGs to use the HUB access time that would normally be allocated to another. In the extreme that other COG might not be running any useful code at all!

The main point here is that the software component cannot affect any other components in the system or be affected by them. Timing isolation between components is preserved.

So then we might ask:

1) Is this cheap, easy, quick to do in silicon?
2) Is this easy to use for the programmer?
3) How does this interact with the shared CORDIC hardware?
4) Is the actual benefit at the end of the day worth the hassle of 1) and 2) ?

Well, and we have to ask how it would look in practice. Just now we have instructions for claiming a single COG and getting it running, COGINIT/COGNEW. This scheme seems to call for such an instruction that can claim two or more COGs at the same time, which seems rather infeasible.

jmg · 2014-05-04 15:58

Todd Marshall wrote: »

. I remain unconvinced there is any reason at all to force a 1 to 1 relationship, COG to SLOT.

Agreed.

Todd Marshall wrote: »

I remain unconvinced a circular queue of assignable length and assignable modulus (less than or equal to the queue length) has any downside at all.
With both features added you can perfectly emulate existing behavior.

Agreed. Any design can (and should) be a safe superset.

Because Pairing has a number of fetch-time impacts and issues, it is looking less ideal.

A RAM table remains simple to implement in HW for run, but a little less easy to change via SW during run.

So I looked for a design that avoids delays, and can be run-time changed more easily, and does not need a 'spread sheet' manager.

This design variant, co-operates with the TopUsed design, and each COG owns and decides on any changes to its slot.
This avoids pairing fetch-time issues, and gives every COG atomic control, and removes the Pair-locking.

Default to 1:16 is still possible, & when enabled, each COG can signal ON, and it can also decide how to map that slot.
Default (of course) is to self.

This gives a 5 bit field from each COG, that feeds to the scanner. No critical path penalties.

Simple, but deceptively powerful, from just 2 easy to follow rules.

A COG can run, and selectively not appear in the HUB alloc, if it chooses not to.
A COG can run, and give its HUB alloc, to any CogID, if it chooses to.
( A 'unused' COG can do either or both of the above, and then do nothing else.)

It is possible to assign every slot to one COG. (200MHz : 100MOP means two COGs can actually get 100% access )

Combined, it allows system designs like (eg)
A 3 used COG design, to Slot set as 1:9, 1:9, 7:9 or 1:9, 2:9, 6:9
Largest shared BW slot skew for Fast/Slow is 15:16 and 1:16. 100% to Two COGs is possible.

To release a slot, a COG has to Start->ReAssignSlot->WaitForever, so a tiny code stub is used for each 'unused' COG doing that.

Simple, but tough, benchmark test:
This HW is also well suited to trigger/capture/burst designs, where (eg) 3 COG can implement 2 as interleaved capture/output, and the 3rd manages start/stop and during idle can enable 67MHz BW access to HUB, during burst it can give 2 COGs each 100MHz.
That leaves SW as the only constraint on burst bandwidth.
Quite a contrast with present ceiling of a fixed 12.5MHz

With the right opcode support, this could potentially burst stream pin info to/from the HUB at 200MHz/5ns (!) (hub memory limited)
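The two rules above can be sketched as a small model of the scanner. This is one reading of the scheme, with assumptions labelled: the scan covers cogs 0 through the highest cog that signals hub use, every position in that range is granted to whatever cog its owner has mapped it to, and donor cogs keep their "uses hub" flag set so that their positions stay in the scan. `Cog` and `slot_schedule` are illustrative names, not anything from the coded design.

```python
from collections import Counter

NUM_COGS = 16

class Cog:
    """Per-cog slot state: a 'uses hub' flag plus the cog that receives this slot."""
    def __init__(self, cog_id):
        self.uses_hub = False      # set once the cog signals that it appears in the scan
        self.hub_slot = cog_id     # default: the cog keeps its own slot

def slot_schedule(cogs):
    """One rotation of the scanner: slots 0..TopUsedCog, each mapped by its owner."""
    used = [i for i, cog in enumerate(cogs) if cog.uses_hub]
    if not used:
        return []
    top_used = max(used)
    return [cogs[i].hub_slot for i in range(top_used + 1)]

# Example from the post: 3 working cogs sharing a 9-slot rotation as 1:9, 1:9, 7:9.
cogs = [Cog(i) for i in range(NUM_COGS)]
for i in range(9):
    cogs[i].uses_hub = True        # cogs 0..8 take part in the scan
for i in range(3, 9):
    cogs[i].hub_slot = 2           # stub cogs 3..8 hand their slots to cog 2

print(slot_schedule(cogs))
print(Counter(slot_schedule(cogs)))
```

Pointing one of the stub cogs at cog 1 instead gives the 1:9, 2:9, 6:9 split, and marking all 16 cogs as used with default mappings reproduces the plain 1:16 rotation.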

jmg · 2014-05-04 16:06

koehler wrote: »

JMG, I can't argue that CogScan is easier, just currently in my mind manually setting up/controlling an Active/Donor cog seems easier/safer to me ATM.

A problem with any 'automatic' pairing (i.e. rules like 'release if not needed') is that it needs a fetch-time decision, as well as a pipeline signal. To me those are delay/logic impacts that move it into the 'best avoided' basket.
The coding is not complex, but the required signals and delays are not so nice.

See my post above, for a design that avoids fetch-time decisions, with simple rules and a (of course) safe default.
In this, COGs make the slot decisions.

Invent-O-Doc · 2014-05-04 16:18

Ok. It seems like we had a nice debate about cogs and the possibility of occupying more than one slot or other means to take advantage of idle resources. Is everyone's opinion known by now? If no, let's hear. If yes, let's move on to something else that's more fun. Nobody is going to convince anybody else at this point and Chip has enough info from the posts to make his own judgment.

Is there any other aspect of interest in the p1+/2?

ctwardell · 2014-05-04 16:26

jmg wrote: »

A COG can run, and selectively not appear in the HUB alloc, if it chooses not to.
A COG can run, and give its HUB alloc, to any CogID, if it chooses to.
( A 'unused' COG can do either or both of the above, and then do nothing else.)

Can you clarify what you mean by 'not appear in the HUB alloc'?

I think you are saying there would only need to be one command, that would be 'who gets my slot'.

The default at cog load is 'I get my own'; at any point I can execute the command and give it to another cog.

I get what you are saying about possible delays in pairing, same thing in mooching, where the delay in deciding that the original owner doesn't want it so give it up may have a negative impact on timing.

I like this idea, a cog that needs very little hub bandwidth, say a serial driver could take back hub control briefly to poll and then release it again.

Since an unanswered hub call blocks, something like this could work when the donor needs a slot for polling, etc:

SETHUB ME
HUBOP
SETHUB COGX

Sounds like a reasonable option, thanks for continuing to work through these ideas,

C.W.

jmg · 2014-05-04 17:11

ctwardell wrote: »

Can you clarify what you mean by 'not appear in the HUB alloc'?

I think you are saying there would only need to be one command, that would be 'who gets my slot'.

The default at cog load is 'I get my own'; at any point I can execute the command and give it to another cog.

Yes, pretty much, with a small extension of the 5th bit.
I coded it as 5 bits, 1 bit for CogUsesHub, and 4 bits for HubSlot.
HubSlot defaults to me, but can be re-assigned, anytime, as you do above, to anyone.

CogUsesHub is a separate boolean, which allows a cog to (optionally) not appear in the scan at all, if it was the TopCog.
aka SETHUB NOTUSED ' eg not used is 5'b1_0000

- this works with the TopUsedCog, to allow the benchmark example case of 3 COGS, 2 of which can get 100% slot Alloc, under the control of Cog3 CogUsesHub & TopUsedCog, action.

The slot on/off can be event / time / virtual pin / other message paced.
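The 5-bit field can be sketched as a simple pack/unpack pair. The bit polarity is an assumption taken from the `5'b1_0000` example above (top bit set meaning "not used"), and `encode_sethub` / `decode_sethub` are hypothetical helpers, not real instructions or tool functions.

```python
NOT_USED_BIT = 0b1_0000   # assumed: bit 4 set means the cog does not appear in the scan
SLOT_MASK = 0b0_1111      # low 4 bits: the cog ID that receives this cog's slot

def encode_sethub(target_cog, not_used=False):
    """Pack the 'who gets my slot' choice into the 5-bit field."""
    if not 0 <= target_cog <= 15:
        raise ValueError("target_cog must be a cog ID in 0..15")
    return (NOT_USED_BIT if not_used else 0) | target_cog

def decode_sethub(field):
    """Unpack a 5-bit field into its two parts."""
    return {"not_used": bool(field & NOT_USED_BIT), "hub_slot": field & SLOT_MASK}

print(format(encode_sethub(0, not_used=True), "05b"))   # the SETHUB NOTUSED case
print(decode_sethub(encode_sethub(5)))                  # give my slot to cog 5
```

Keeping the flag separate from the 4-bit target means a cog can drop out of the scan without losing its mapping, so it can rejoin later with the same assignment.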

Seairth · 2014-05-04 17:13

A few things I'd like to add about the power-of-two/TopUsedCog scheme:

I expect that some existing P1 apps will be ported to the P1+. With TopUsedCog/power-of-two timing, those applications would have a modulo 8 hub access (because they wouldn't be using more than 8 cogs). This turns out to be a good thing, as this would allow 2-3 instructions (depending on the actual hub timing) between each hub window. Porting P1 applications under these conditions will, I think, be easier than the 6-7 instruction gap that is currently expected with the P1+.

Having said that, the following suggestion (see below) would negate that advantage. Nevertheless, here it is:

Should it (or something like it) get added, I would propose the following: when the chip starts, load the HMAC stuff in the first few cogs as it does now. This will give that code extremely fast access to the hub. Once the loader has been verified, start it in cog 16. This will always ensure that the chip starts in a backwards-compatible "mode", where the hub access is modulo 16. All code (that we haven't even written yet) would run exactly the same as it would have if TopUsedCog/power-of-two timing was never implemented. From there, if someone wants to take advantage of a faster access (e.g. modulo 8, modulo 4, etc.), then they must explicitly set up the remaining cogs appropriately (including stopping the active cog #16). For novice users, this should be a safe way to ensure that there are no surprises (for those that feel TopUsedCog/power-of-two timing introduces surprises). As an aside, this would also mean that the Serial Monitor would load in cog 16, which makes the most sense to me anyhow.

(note: A slightly more advanced approach would encode the target "initial" cog in the binary header, which would allow two-stage loaders to load the initial stage low and take advantage of the fast hub access. The default Parallax tools would implement the simpler approach, or make the advanced loader available as a configuration option.)

Seairth · 2014-05-04 17:20

potatohead wrote: »

Limiting the number of COGS means some objects would only work with some DAC pins in this design, unless we also want to add some sort of cog number redirection to the works. Doing a COGINIT to ensure a specific pin group is at issue here. It's not at issue with the simple round robin scheme.

Hub access schemes aside, I was under the impression that the association of DACs to specific cogs was part of the P2 design, not P1+. With smart pins, I believe that this restriction was no longer needed (which is good for the "every cog is the same" principle).

jmg · 2014-05-04 17:26

Seairth wrote: »

A few things I'd like to add about the power-of-two/TopUsedCog scheme:.....
Having said that, the following suggestion (see below) would negate that advantage. Nevertheless, here it is:

Should it (or something like it) get added, I would propose the following: when the chip starts, load the HMAC stuff in the first few cogs as it does now. This will give that code extremely fast access to the hub. Once the loader has been verified, start it in cog 16. This will always ensure that the chip starts in a backwards-compatible "mode", where the hub access is modulo 16. ..
if someone wants to take advantage of a faster access (e.g. modulo 8, modulo 4, etc.), then they must explicitly set up the remaining cogs appropriately (including stopping the active cog #16).

Yes, I was allowing for a Config Boolean, (default off, probably in the chip-wide config CLK-equiv reg), but that is somewhat optional, you can use COG #16 as another more hidden/implicit means of control.

Note also there is no power of 2 constraint.
The scanner automatically adjusts to the highest COG that signals (via the 5th bit) that it is using the HUB.
Higher COGs can be waiting on some event, or off, and do not appear in the Priority Encode.
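
A minimal Python sketch of that scanning idea follows. It assumes cogs are numbered 0 to 15, that each cog exposes one "using the hub" flag, and that the hub slot counter simply wraps after the top flagged cog. The function names are made up for illustration, and this models only the behaviour, not the actual logic.

```python
def top_used_cog(hub_flags):
    """Priority encode: index of the highest cog whose hub-use flag is set, or None."""
    for cog in range(len(hub_flags) - 1, -1, -1):
        if hub_flags[cog]:
            return cog
    return None


def scan_order(hub_flags, cycles):
    """Cog serviced on each clock when the scanner wraps after the top used cog."""
    top = top_used_cog(hub_flags)
    if top is None:
        return []
    return [cycle % (top + 1) for cycle in range(cycles)]


# Cogs 0..8 use the hub; cogs 9..15 are off or waiting on an event.
flags = [True] * 9 + [False] * 7
print(top_used_cog(flags))    # 8
print(scan_order(flags, 12))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2]
```

Because only the highest flagged cog matters, any cog count works (9 in this example), not just powers of two.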

potatohead · 2014-05-04 17:37

Currently, if a Standard Core does not have the bandwidth for a project, you can perhaps overcome that by load balancing across 2 or more Cores. I believe this is what potatohead said he's done.

Yep. 2 cases. One case is video-related things. On a scan line driver, where the signal COG simply renders from a scan line buffer created by other COGS, work done in parallel, such as collecting and masking sprite data, happens on as many COGS as one needs or can spare, using a set of scan line buffers to hold the in-process data. The more objects needed, the more buffers and/or COGS needed. Optimal use would include display sorting too, and I never did that.

On a bitmap driver, or character / tile type driver, it's possible to break a screen into regions. Have the COGS work together to build the display in real time instead of a single COG rendering into a second or back buffer. On P1 buffer memory is really expensive. Regions can be vertical portions of the screen, with any drawing COGS working on their region. A variation of this is to sort objects, or simply manage their draw order so that ones near the top get drawn first, ones near the bottom get drawn later, so that the screen itself is the buffer.

The other was pins. For a burst sample, line up COGS so that they are separated a clock apart and have them all grab pin data in sequence, writing to buffers. Once the sample is done, the COGS can be shut down and can do other things, one being to weave the data back together, or process it somehow.
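
A minimal Python sketch of the "weave" step, assuming N cogs staggered one clock apart so that cog k captures samples k, k+N, k+2N, and so on. The pin data here is simulated and `weave` is a hypothetical helper name.

```python
def weave(buffers):
    """Interleave per-cog buffers back into one sample stream.

    buffers[k] holds the samples taken by cog k, which started k clocks
    after cog 0; all buffers are assumed to be the same length.
    """
    woven = []
    for group in zip(*buffers):
        woven.extend(group)
    return woven


# Simulated burst: 12 consecutive pin samples captured by 4 staggered cogs.
signal = list(range(12))
n_cogs = 4
buffers = [signal[k::n_cogs] for k in range(n_cogs)]

print(buffers)         # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
print(weave(buffers))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```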

Well, another case was for computation, but I think that one is obvious.

One downside of the P1 is the fairly slow COG start. I'm really hoping we get two things on this design that we put into the "beast", and that is being able to start a COG in hubex right away, and the other is to start a COG at an address without first filling its RAM from a HUB memory image. Both of these really help dynamically use COGS, and since we've got 16 of the things, having a pool of say 4 available to do stuff like this on demand might make a lot of sense.

Re: mooch, pairing and decisions in the HUB.

I strongly agree with jmg's comments about decisions there being a poor idea. Too bad, because mooch is the scheme I liked the most, due to it being passive and something that people wouldn't count on, and that it could sort of just work with time access windows unused otherwise. But that will take a lot of decisions, and it will slow the clock down.

An emphasis on this design clearly is to get the system clock high. Something I think everybody is on board with.

potatohead · 2014-05-04 17:41

I was under the impression that the association of DACs to specific cogs was part of the P2 design, not P1+. With smart pins, I believe that this restriction was no longer needed (which is good for the "every cog is the same" principle).

I'll have to find it, but Chip said early on that each of the 16 COGS would get 4 DAC pins to be driven by WAITVID, with the other pins being software driven as we understood it on the prior design. If we want PLL/automatic DAC output, that's going to be COG + PIN limited, because neither design so far has that every pin signal bus removed from the "beast", which is what I'll call the earlier design.

Edit: It's in the first post.

VGA is going to simply be a use case of a shifter that drives four DAC outputs to a set of fixed pins attached to each cog. The four channels from highest to lowest can be used as: R, G, B, HSYNC. In some modes, the shifter handles data in ways that is obviously for video, but, otherwise, it's a generic circuit that can simultaneously update four DACs with unique 8-bit data. You can also write the DACs directly in software, with 8 bits of dither, to realize something like 16-bit DACs.

Seairth · 2014-05-04 18:09

jmg wrote: »

Yes, I was allowing for a Config Boolean, (default off, probably in the chip-wide config CLK-equiv reg), but that is somewhat optional, you can use COG #16 as another more hidden/implicit means of control.

Note also there is no power of 2 constraint.
The scanner automatically adjusts to the highest COG that signals (via the 5th bit) that it is using the HUB.
Higher COGs can be waiting on some event, or off, and do not appear in the Priority Encode.

Huh. I thought you had simply given a name to my follow-up comments about how to select the appropriate power-of-two for the approach I proposed. Apparently not. I'll update the prior post.

In the meantime, I'd like to better understand your variation. Given that, what would be the hub frequency if you had 9 active cogs using the hub?

jmg · 2014-05-04 18:16

Seairth wrote: »

Huh. I thought you had simply given a name to my follow-up comments about how to select the appropriate power-of-two for the approach I proposed. Apparently not. I'll update the prior post.

Some posts back, I removed that 'power of 2' constraint, by using a Priority encoder

Seairth wrote: »

In the meantime, I'd like to better understand your variation. Given that, what would be the hub frequency if you had 9 active cogs using the hub?

That's just SysCLK/9, so 22.222... MHz
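
The same arithmetic as a small Python sketch. The 200 MHz SysCLK is an assumption inferred from the 22.222 MHz figure (200/9), and `hub_rate_mhz` is a made-up name.

```python
SYSCLK_MHZ = 200.0  # assumption: implied by 22.222 MHz = SysCLK/9


def hub_rate_mhz(top_used_cogs, sysclk_mhz=SYSCLK_MHZ):
    """Per-cog hub access rate when the scanner cycles through top_used_cogs slots."""
    if top_used_cogs < 1:
        raise ValueError("need at least one cog using the hub")
    return sysclk_mhz / top_used_cogs


for cogs in (16, 9, 8, 4):
    print(f"{cogs:2d} cogs: {hub_rate_mhz(cogs):.3f} MHz per cog")
```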

Seairth · 2014-05-04 18:20

potatohead wrote: »

One downside of the P1 is the fairly slow COG start. I'm really hoping we get two things on this design that we put into the "beast", and that is being able to start a COG in hubex right away, and the other is to start a COG at an address without first filling its RAM from a HUB memory image. Both of these really help dynamically use COGS, and since we've got 16 of the things, having a pool of say 4 available to do stuff like this on demand might make a lot of sense.

+2 (+1 for each feature). Being able to nearly instantly start cogs, on demand, would be a really good feature to keep from the P2. I think this would have the potential to help reduce power consumption, as it would be more feasible to leave cogs off until needed. I could see this being very useful to complement the CORDIC stuff (recall the earlier conversation), as you could very quickly fire off a cog to perform a complex mathematical equation in parallel, then shut it back down immediately afterwards (or have it shut itself down).

RossH · 2014-05-04 18:39

Todd Marshall wrote: »

I quoted from that page.

I don't think you quoted from the latest version - at least what you quoted was not what version 1.2 of the manual says. Perhaps you paraphrased, but it is also possible you are looking at an older version - I believe that section was "improved" in version 1.2.

The P1 hub access scheme is very simple, and guaranteed to be deterministic. A lot of people here are really hoping we can keep the same simplicity and determinism in the P16X32B.

Ross.

ctwardell · 2014-05-04 18:41

jmg,

I think the number of hub active cogs needs to be a multiple of two and cogs can only donate to a like odd or even numbered cog.

C.W.

Edit: Added Picture
