Locks, flags and Port D XCH discussion
ozpropdev
Posts: 2,793
Here is an example use of the Port D XCH as a lock/flag.
Cog 0 - I2C EEPROM driver
Cog 1 to 3 - 3 tasks all requiring I2C access.
Each cog has its own source, destination, length and command variables (hub based).
Each cog is also allocated 2 bits on Port D for handshaking.
Port D Bit 0 - Cog1 request signal.
Port D Bit 1 - Cog2 request signal.
Port D Bit 2 - Cog3 request signal.
Port D Bit 3 - Cog1 status signal.
Port D Bit 4 - Cog2 status signal.
Port D Bit 5 - Cog3 status signal.
Let's say Cog2 wants to use the I2C.
First it loads its own source, destination etc. with the desired data.
Cog 2 sets Bit 1 of Port D (Cog 2 request)
Cog 2 then polls Port D bit 4 until high. (Cog 2 I2C command acknowledged)
Cog 2 now clears Port D bit 1 (request clear)
Cog 2 polls Port D bit 4 until low (I2C task completed)
This way only 1 cog can use I2C at a time. The I2C routine handles the round-robin sharing of the resource and could even give priority to a single cog if needed.
Simple and expandable to 16 lock/flags
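The handshake above can be tried out in a small step-by-step simulation. This is only a sketch in Python, not Propeller code: `PortD`, `client_steps`, `driver_steps` and `run` are made-up names, the port is modelled as one shared integer, and each generator advances one "poll" per tick. Only the bit assignments and the order of the handshake steps come from the post.

```python
# Bit assignments from the post: request bits 0-2, status bits 3-5 for cogs 1-3.
REQ_BIT = {1: 0, 2: 1, 3: 2}
STATUS_BIT = {1: 3, 2: 4, 3: 5}


class PortD:
    """Hypothetical stand-in for the shared port: one integer of flag bits."""

    def __init__(self):
        self.bits = 0

    def set(self, bit):
        self.bits |= 1 << bit

    def clear(self, bit):
        self.bits &= ~(1 << bit)

    def test(self, bit):
        return bool((self.bits >> bit) & 1)


def client_steps(port, cog, log):
    """One cog's side of the handshake; yields once per unsuccessful poll."""
    port.set(REQ_BIT[cog])                 # raise request
    while not port.test(STATUS_BIT[cog]):  # poll status until high (acknowledged)
        yield
    port.clear(REQ_BIT[cog])               # drop request
    while port.test(STATUS_BIT[cog]):      # poll status until low (task completed)
        yield
    log.append(("done", cog))


def driver_steps(port, log):
    """The I2C driver cog: serves one request at a time, round-robin."""
    cogs = [1, 2, 3]
    nxt = 0                                # index of the cog to check first
    while True:
        for i in range(len(cogs)):
            cog = cogs[(nxt + i) % len(cogs)]
            if port.test(REQ_BIT[cog]):
                port.set(STATUS_BIT[cog])        # acknowledge the command
                while port.test(REQ_BIT[cog]):   # wait for the request to drop
                    yield
                log.append(("i2c", cog))         # the I2C transfer would run here
                port.clear(STATUS_BIT[cog])      # signal completion
                nxt = (cogs.index(cog) + 1) % len(cogs)
                break
        yield


def run(requesting_cogs, max_ticks=100):
    """Advance the driver and all clients one step per tick until clients finish."""
    port, log = PortD(), []
    driver = driver_steps(port, log)
    clients = [client_steps(port, c, log) for c in requesting_cogs]
    for _ in range(max_ticks):
        next(driver)
        for c in list(clients):
            try:
                next(c)
            except StopIteration:
                clients.remove(c)
        if not clients:
            break
    return port.bits, log
```

Calling `run([2, 3])` lets Cog2 and Cog3 request at the same time; the log shows the driver finishing one transfer before acknowledging the next, and all six bits end up low again.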
Comments
For this example, it would seem that using a LOCK would be preferable.
The way I read PortD working is that each cog has a PortD 32bit output. So Cog 1 could send a long to Cog 2 and Cog 2 could send a long to Cog1, both in parallel. At the same time Cogs 4 & 7 could do likewise, etc.
It may be better to define what scenarios we are after, and then define what might be a solution...
1. LOCKS: A common resource (such as a driver) being shared amongst tasks.
2. COG COMMS: Transfer of byte/word/longs between cogs (eg FullDuplexSerial fifos)
3. FLAGS: A cog signals to another cog that something is available. When taken, the cog resets the flag.
4. Any others???
If possible, it would be nicer if the support instructions were not considered to be hub based (ie does not sync with hub access) so that it could be faster.
When PortD was proposed, it was because a few of us realised that in P1 if the PortB had been implemented internally we could have made use of it for intercog comms.
Chip implemented it, as almost always, with some extra goodies.
However, given our current understanding of just what is possible, I wonder if there might be some better solutions.
Some time ago, I wondered if there might be a small block of ram and flags that could be used for simple cog-cog communication. I thought it would fit nicely in the centre of the die (although the autorouter would likely place it there).
When a location is written by a cog, it would set a flag, and when read by a cog, the flag would be reset. The operation of the flag could be controlled by the special read/write/clear instruction(s) use of its WZ & WC flags.
I would expect a single instruction that had bits for r/w/clear, and that the Z & C flags could be set depending on the "ram flag". The instruction would act as a move from/to cog register and block ram, and the other address would contain the block ram address.
Perhaps one of these 32-bit ram longs could be accessed as individual bits, and we could then use this as a resource lock, or an availability flag.
This is just a quick first pass at something to get the discussion going. Thoughts anyone?
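To make the write-sets-flag / read-resets-flag idea concrete, here is a rough Python model. Everything in it is assumed for illustration: the class name `MailboxRam`, the block size of 16 longs, and the choice to return the prior flag state the way WZ/WC might. It is a behavioural sketch, not a proposal for the actual instruction encoding.

```python
class MailboxRam:
    """Hypothetical model of a small shared RAM block with one flag per long."""

    def __init__(self, size=16):          # size is an arbitrary assumption
        self.data = [0] * size
        self.full = [False] * size        # the per-location "ram flag"

    def write(self, addr, value):
        """Store a long and set its flag.

        Returns the flag's prior state: True means an unread value
        was just overwritten.
        """
        was_full = self.full[addr]
        self.data[addr] = value & 0xFFFFFFFF
        self.full[addr] = True
        return was_full

    def read(self, addr, clear=True):
        """Return (value, flag). The flag is reset unless clear=False (a peek)."""
        value, was_full = self.data[addr], self.full[addr]
        if clear:
            self.full[addr] = False
        return value, was_full
```

A receiving cog would poll `read(addr, clear=False)` until the flag comes back True, then do a normal `read(addr)` to take the value and free the slot.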
For this style of scenario, what if there were a single shared 32-bit FLAG register?
A single instruction (1 clock as no need to sync with hub) would have variants to:
(a) read all 32 flags into a cog long, with optional WZ.
(b) set any flag (1 bit), with optional WZ/WC returning the prior state of the flag.
(c) clear any flag (1 bit), with optional WZ/WC returning the prior state of the flag.
Postedit: Do we need to ensure that set/clear cannot occur from multiple cogs simultaneously, and does clear override set?
This permits all flags to be checked at once for =0, and copy to cog long for subsequent mask/bit testing. (and we did not waste PortD)
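A quick Python model of the three variants, to show how "returning the prior state" turns a plain flag into a lock. The class and method names are invented for this sketch, and it deliberately ignores the same-clock question raised in the postedit by treating every call as atomic.

```python
class FlagRegister:
    """Hypothetical model of a single shared 32-bit flag register."""

    def __init__(self):
        self.value = 0

    def read_all(self):
        """Variant (a): return (all 32 flags, Z), where Z is True if none is set."""
        return self.value, self.value == 0

    def set_flag(self, n):
        """Variant (b): set flag n (0..31) and return its prior state."""
        prior = bool((self.value >> n) & 1)
        self.value |= 1 << n
        return prior

    def clear_flag(self, n):
        """Variant (c): clear flag n (0..31) and return its prior state."""
        prior = bool((self.value >> n) & 1)
        self.value &= ~(1 << n) & 0xFFFFFFFF
        return prior
```

Used as a lock, a cog calls `set_flag(n)` and owns the resource only if the prior state was False; it releases with `clear_flag(n)`. Used as a signal, the sender sets and the receiver clears.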
The trace data can be streamed through Port D to Aux ram (or Cog ram, quads (wides?)), thus using 16 bits of the port.
See data below
Your opening example with all those flags and handshaking stuff to share an I2C or whatever resource looks like a horrible, messy way to do things.
Surely it's better to have one COG drive the I2C and accept commands from other COGs via the usual HUB (using a mailbox and perhaps locks).
A higher level of abstraction.
We could even imagine that in the driver COG the hardware is being manipulated by a thread or two (input and output, as in a UART, say) whilst another thread (or threads) is chatting with the clients of this service.
Is there even a speed advantage to tweaking around with flags like that?
Wrong. Port D, from a cog's own perspective, is a normal IO port with its out, dir and in registers, but with ONE EXCEPTION: each cog MUST internally set up which other cogs' outputs form its inputs.
That means that in the ozpropdev example, long transfers are possible not only between cogs 1, 2 and 3. E.g. cogs 7 and 8 can exchange longs that will not be seen by, and will not affect, cogs 1, 2 and 3.
Horrible mess... A little harsh, but sure.
The point was to show that it can be done without using locks and polling hub registers.
Anyhow it kick started the discussion, so mission accomplished.
Each one of those interactions between the hardware driver and its "clients" is a case of a single producer/single consumer. As such, commands and responses can be exchanged with simple flags in COG or via cyclic FIFOs (as in FullDuplexSerial). No locks required. Is there really much of a speed gain or any other benefit to using Port D in that example?
Now, if we can use Port D to stream bytes between COGs faster than can be done through HUB I would start to see some advantages.
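For anyone wondering why the single producer/single consumer case needs no locks, here is a minimal cyclic FIFO sketch in Python. It is not the FullDuplexSerial code, just the same general idea under assumed names (`SpscFifo`, `put`, `get`): the producer is the only writer of `head`, the consumer is the only writer of `tail`, so the two sides never modify the same variable.

```python
class SpscFifo:
    """Cyclic FIFO for exactly one producer and one consumer (sketch)."""

    def __init__(self, size=16):
        self.buf = [0] * size
        self.size = size
        self.head = 0    # written only by the producer
        self.tail = 0    # written only by the consumer

    def put(self, value):
        """Producer side. Returns False if the FIFO is full."""
        nxt = (self.head + 1) % self.size
        if nxt == self.tail:       # one slot is kept empty to tell full from empty
            return False
        self.buf[self.head] = value
        self.head = nxt            # publish only after the data is in place
        return True

    def get(self):
        """Consumer side. Returns None if the FIFO is empty."""
        if self.tail == self.head:
            return None
        value = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.size
        return value
```

With `size=4` the FIFO holds three items, since one slot stays empty so that `head == tail` can only mean "empty". As soon as a second producer is added, the `head` update needs protecting, which is where locks come back in.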
%000 = QUADs_to_16_pins (Might be WIDE's now?)
%100 = 16_pins_to_QUADs
What would be nice is if COG 1 were set to mode 0 and COG 2 to mode 4, so that the back-to-back "bus" could somehow transfer n "quads" from COG 1 to COG 2.
Sweet! Even better if 32 pins(bits) were used.
The transfer mechanism is already half there.
Thoughts?
In the Prop II update - BLOG thread I replied to a similar question on multiple writes on the same clock:
I think that would be the most simple implementation. When the usage is such that collisions might occur the operation should occur within a lock.
In most cases it might actually be fine if this was implemented such that set and clear were hub operations and only the read is a non-hub operation.
My goal when I requested the 32 flags and locks was for a very simple implementation that would be useful with a minimal implementation cost.
C.W.
Another possible use could be barrier synchronization, for fine grained parallel processing.
Every processing element (or task) sets its own bit, and then waits for the whole port to be "111..1" (eventually masked with participating PEs only).
Similarly, "antibarrier" can be made by checking for the port to be false, after putting "0" to each respective bit.
This is already possible using WAITPEQ/WAITPNE instructions on port D, up to four tasks per COG.
The only thing missing is a means to extend the barrier to multiple chips. But even if done with some software assistance, it should still be way faster and lower latency than the main data exchange link, whatever it is, which is the point of having a dedicated barrier at all.
Ok, it's not going to compete with GPU computing, still has an educational value IMHO.
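The barrier and "antibarrier" tests described above boil down to masked compares on the port value. A small Python sketch, with made-up function names and the port passed around as a plain integer (real code would sit in a WAITPEQ/WAITPNE-style wait instead of calling a function):

```python
def arrive(port, pe):
    """Processing element `pe` sets its own bit; returns the new port value."""
    return port | (1 << pe)


def leave(port, pe):
    """Processing element `pe` clears its own bit; returns the new port value."""
    return port & ~(1 << pe)


def barrier_reached(port, mask):
    """True when every participating bit in `mask` is set."""
    return (port & mask) == mask


def antibarrier_reached(port, mask):
    """True when every participating bit in `mask` is clear."""
    return (port & mask) == 0
```

For example, with PEs 0, 1 and 3 participating the mask is `0b1011`, and the state of bit 2 has no effect on either test.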
The problem with that rule is that a new set (new data) loses to a clear (data done), which could lose messages.
A logic sense needs to be chosen (I've used Positive logic here) and then I think a set to the active state, should trump a clear.
General rule is upstream sets, downstream acts on hi, clears when done.
The boundary case of same clock, sets and generates another loop.
Are these Port D bits available to WAIT opcodes? That could give a significant power saving.
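The same-clock collision case can be written down as a tiny truth function. This Python sketch compares two policies: "set_wins" is the rule argued for above, and "change_wins" is my reading of the rule being criticised (when set and clear collide, whichever one changes the flag's state takes effect). The function name and the policy strings are invented for illustration.

```python
def next_flag(current, set_req, clear_req, policy="set_wins"):
    """Next state of one flag after a clock on which set and/or clear may arrive."""
    if set_req and clear_req:              # collision on the same clock
        if policy == "set_wins":
            return True                    # set to the active state trumps clear
        if policy == "change_wins":
            return not current             # whichever request changes the state wins
        raise ValueError(policy)
    if set_req:
        return True
    if clear_req:
        return False
    return current                         # no request: flag holds its state
```

The interesting case is `current=True`: the downstream cog clears ("data done") on the same clock that the upstream cog sets ("new data"). Under "change_wins" the flag drops to False and the new message goes unnoticed; under "set_wins" it stays True and the downstream cog simply goes round its loop again.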
My thinking was that a state change should win because you really shouldn't have been setting or clearing a flag that was already in the corresponding state.
After some thought I think maybe having the set and clear be hub operations might be the easiest way around the same clock issue.
If we do stick with non-hub set and clear I agree that your suggested method of which state wins is better than what I had suggested.
Having the reads be non-hub is most important so we can poll or possibly waitxxx on the flag without consuming any hub slots.
C.W.
True, a classic design would poll, but that dictates looping code, and IIRC the Port nature of PortD means it can be used with WAITxx, and that nice feature rather shifts the usage, as the power saving of this is likely to be large.
With WaitXX, a master will set anytime it has new info, and the slave waits on pin hi, and clears when done. (Auto-power save)
An arrival of new info, even on the exact clear edge, still works on WAITxx.
Applying Hub quanta could lose a means to precisely sync operations across COGs?
Looking at the above idea a little deeper.
We know these pin transfer modes work in the background.
Assuming this mode is now "WIDE" instead of "QUAD" and some sort of sync signals could be added, we have a method of transferring 8 longs from one cog to another.
These modes once started repeat until disabled (already implemented).
We already have SETWIDE to set the source or destination WIDE address
We already have the 32 bit pathway in place.
8 longs in 8 clocks in the background. No hub required.