Locks, flags and Port D XCH discussion
ozpropdev
Posts: 2,793
Here is an example use of the Port D XCH as a lock/flag.
Cog 0 - I2C EEPROM driver
Cog 1 to 3 - 3 tasks all requiring I2C access.
Each cog has its own source, destination, length and command variables (hub based).
Each cog is also allocated 2 bits on Port D for handshaking.
Port D Bit 0 - Cog1 request signal.
Port D Bit 1 - Cog2 request signal.
Port D Bit 2 - Cog3 request signal.
Port D Bit 3 - Cog1 status signal.
Port D Bit 4 - Cog2 status signal.
Port D Bit 5 - Cog3 status signal.
Let's say Cog2 wants to use the I2C.
First it loads its own source, destination etc. with the desired data.
Cog 2 sets Bit 1 of Port D (Cog 2 request)
Cog 2 then polls Port D bit 4 until high. (Cog 2 I2C command acknowledged)
Cog 2 now clears Port D bit 1 (request clear)
Cog 2 polls Port D bit 4 until low (I2C task completed)
This way only 1 cog can use I2C at a time. The I2C routine handles the round-robin sharing of the resource and could even give priority to a single cog if needed.
Simple and expandable to 16 lock/flags
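To make the handshake above concrete, here is a minimal Python sketch of it. It is only a toy model, not Propeller code: Port D is modelled as one shared integer, each cog as a generator that is stepped in turn, and the names `PortD`, `client`, `driver` and `run` are invented for the illustration. The round-robin scan order in the driver is an assumption.

```python
REQ = {1: 0, 2: 1, 3: 2}      # request bit for each client cog
STATUS = {1: 3, 2: 4, 3: 5}   # status bit for each client cog

class PortD:
    """Toy stand-in for the shared port: a single integer of flag bits."""
    def __init__(self):
        self.bits = 0
    def set(self, n):
        self.bits |= 1 << n
    def clear(self, n):
        self.bits &= ~(1 << n)
    def test(self, n):
        return bool((self.bits >> n) & 1)

def client(cog, port, log):
    """Client side: the four steps listed in the post."""
    port.set(REQ[cog])                   # raise request
    while not port.test(STATUS[cog]):    # poll status until high (acknowledged)
        yield
    port.clear(REQ[cog])                 # drop request
    while port.test(STATUS[cog]):        # poll status until low (task completed)
        yield
    log.append(("done", cog))

def driver(port, log):
    """Cog 0: scan request bits round-robin and serve one client at a time."""
    nxt = 1
    while True:
        cog, nxt = nxt, nxt % 3 + 1
        if not port.test(REQ[cog]):
            yield
            continue
        port.set(STATUS[cog])            # acknowledge the command
        while port.test(REQ[cog]):       # wait for the client to drop its request
            yield
        log.append(("i2c", cog))         # the I2C transfer would happen here
        port.clear(STATUS[cog])          # signal completion
        yield

def run(clients=(1, 2, 3), steps=100):
    port, log = PortD(), []
    tasks = [driver(port, log)] + [client(c, port, log) for c in clients]
    for _ in range(steps):
        for t in list(tasks):
            try:
                next(t)
            except StopIteration:
                tasks.remove(t)          # this client has finished
    return port.bits, log
```

Calling `run()` steps all four generators; every client ends up served once and all six handshake bits return to 0.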

Comments
For this example, it would seem that using a LOCK would be preferable.
The way I read PortD working is that each cog has a 32-bit PortD output. So Cog 1 could send a long to Cog 2 and Cog 2 could send a long to Cog 1, both in parallel. At the same time Cogs 4 & 7 could do likewise, etc.
It may be better to define what scenarios we are after, and then define what might be a solution...
1. LOCKS: A common resource (such as a driver) being shared amongst tasks.
2. COG COMMS: Transfer of byte/word/longs between cogs (eg FullDuplexSerial fifos)
3. FLAGS: A cog signals to another cog that something is available. When taken, the cog resets the flag.
4. Any others???
If possible, it would be nicer if the support instructions were not hub based (i.e. did not sync with hub access), so that they could be faster.
When PortD was proposed, it was because a few of us realised that in P1 if the PortB had been implemented internally we could have made use of it for intercog comms.
Chip implemented it, as almost always, with some extra goodies.
However, now that we understand better just what is possible, I wonder if there might be some better solutions.
Some time ago, I wondered if there might be a small block of ram and flags that could be used for simple cog-cog communication. I thought it would fit nicely in the centre of the die (although the autorouter would likely place it there).
When a location is written by a cog, it would set a flag, and when read by a cog, the flag would be reset. The operation of the flag could be controlled through the WZ & WC effects of the special read/write/clear instruction(s).
I would expect a single instruction that had bits for r/w/clear, and that the Z & C flags could be set depending on the "ram flag". The instruction would act as a move between a cog register and the block ram, with the other operand holding the block ram address.
Perhaps one of these 32-bit ram longs could be accessed as individual bits, and we could then use this as a resource lock, or an availability flag.
This is just a quick first pass at something to get the discussion going. Thoughts anyone?
For this style of scenario, what if there were a single shared 32-bit FLAG register?
A single instruction (1 clock as no need to sync with hub) would have variants to:
(a) read all 32 flags into a cog long, with optional WZ.
(b) can set any flag (1 bit), with optional WZ/WC returning the prior state of the flag.
(c) can clear any flag (1 bit), with optional WZ/WC returning the prior state of the flag.
Postedit: Do we need to ensure that set/clear cannot occur from multiple cogs simultaneously, and does clear override set?
This permits all flags to be checked at once for =0, and copy to cog long for subsequent mask/bit testing. (and we did not waste PortD)
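As a rough illustration of variants (a) to (c), here is a Python sketch of such a shared flag register. The class and method names are invented for the example, and it models only the single-cog behaviour; what happens when two cogs set and clear on the same clock is exactly the open question in the post and is not modelled here.

```python
class FlagRegister:
    """Toy model of the proposed shared 32-bit FLAG register."""
    MASK = 0xFFFFFFFF

    def __init__(self):
        self.flags = 0

    def read_all(self):
        """(a) Return all 32 flags plus a Z-style result (True when all are 0)."""
        return self.flags, self.flags == 0

    def set(self, bit):
        """(b) Set one flag and return its prior state (what WZ/WC would report)."""
        prior = (self.flags >> bit) & 1
        self.flags = (self.flags | (1 << bit)) & self.MASK
        return prior

    def clear(self, bit):
        """(c) Clear one flag and return its prior state."""
        prior = (self.flags >> bit) & 1
        self.flags &= ~(1 << bit) & self.MASK
        return prior

reg = FlagRegister()
got_lock = reg.set(5) == 0      # prior state 0 means this caller took the flag
second_try = reg.set(5) == 0    # prior state 1 means it was already taken
```

Because `set` returns the prior state, the same operation covers both use cases: as a lock (prior state 0 means acquired) and as an availability flag that the consumer clears when done.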
The trace data can be streamed through Port D to Aux ram (or Cog ram, quads (wides?)), thus using 16 bits of the port.
See the documentation excerpt below.
TRACE
-----

A cog can cause its execution state (from pipeline stage 4) to be output to pins on every clock cycle by using the SETRACE instruction:

SETRACE D/# - Set trace configuration to %E_PPP

    %E = enable
        %0 = off (initial value on cog start)
        %1 = on

    %PPP = 16-pin group to output to
        %000 = pins 15..0
        %001 = pins 31..16
        %010 = pins 47..32
        %011 = pins 63..48
        %100 = pins 79..64
        %101 = pins 95..80 (pins 95..92 don't exist)
        %110 = pins 111..96
        %111 = pins 127..112

The 16 signals that will be output are, from MSB to LSB:

    TASK[1..0] - the executing task, 0..3
    Z          - the Z flag
    C          - the C flag
    GO         - the GO signal, 0 = stall pipeline, 1 = instruction done
    COND       - the COND signal, 0 = condition false, 1 = condition true
    VALID      - the VALID signal, 0 = instruction cancelled, 1 = instruction valid
    PC[8..0]   - the program counter, 0..$1FF

For the output to appear, the DIR bits corresponding to the 16-pin port must be set.

Idea: By outputting trace data to the internal port D pins (%PPP = %11x), and having another cog trigger using WAITPEQ before logging trace data, a trace debugger could be made.

PIN TRANSFER
------------

Each cog has a pin transfer (XFR) which can automatically move data between pins and QUADs/AUX, in the background, while instructions execute normally.

XFR is configured with the SETXFR instruction:

SETXFR D/# - Set XFR configuration to %E_MMM_PPP

    %E = enable
        %0 = off (initial state after cog start)
        %1 = on

    %MMM = mode
        %000 = QUADs_to_16_pins
        %001 = QUADs_to_32_pins
        %010 = AUX_to_16_pins
        %011 = AUX_to_32_pins
        %100 = 16_pins_to_QUADs
        %101 = 32_pins_to_QUADs
        %110 = 16_pins_to_AUX
        %111 = 32_pins_to_AUX

    %PPP = pin group
        %000 = pins 15..0 for 16-pin modes, pins 31..0 for 32-pin modes
        %001 = pins 31..16 for 16-pin modes, pins 31..0 for 32-pin modes
        %010 = pins 47..32 for 16-pin modes, pins 63..32 for 32-pin modes
        %011 = pins 63..48 for 16-pin modes, pins 63..32 for 32-pin modes
        %100 = pins 79..64 for 16-pin modes, pins 95..64 for 32-pin modes
        %101 = pins 95..80 for 16-pin modes, pins 95..64 for 32-pin modes
        %110 = pins 111..96 for 16-pin modes, pins 127..96 for 32-pin modes
        %111 = pins 127..112 for 16-pin modes, pins 127..96 for 32-pin modes

For QUADs_to_16_pins mode (%000), on the cycle after SETXFR is executed, the following 8-clock pattern begins and then repeats indefinitely:

    1st clock: QUAD0 low word is output to pins
    2nd clock: QUAD0 high word is output to pins
    3rd clock: QUAD1 low word is output to pins
    4th clock: QUAD1 high word is output to pins
    5th clock: QUAD2 low word is output to pins
    6th clock: QUAD2 high word is output to pins
    7th clock: QUAD3 low word is output to pins
    8th clock: QUAD3 high word is output to pins

For QUADs_to_32_pins mode (%001), on the cycle after SETXFR is executed, the following 4-clock pattern begins and then repeats indefinitely:

    1st clock: QUAD0 is output to pins
    2nd clock: QUAD1 is output to pins
    3rd clock: QUAD2 is output to pins
    4th clock: QUAD3 is output to pins

For AUX_to_16_pins mode (%010), on the second cycle after SETXFR is executed, the following 2-clock pattern begins and then repeats indefinitely:

    1st clock: AUX[SPB] low word is output to pins
    2nd clock: AUX[SPB++] high word is output to pins

For AUX_to_32_pins mode (%011), on the second cycle after SETXFR is executed, the following 1-clock pattern begins and then repeats indefinitely:

    1st clock: AUX[SPB++] is output to pins

For 16_pins_to_QUADs mode (%100), on the cycle after SETXFR is executed, the following 8-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled into low word
    2nd clock: pins are sampled into high word, long is written to QUAD0
    3rd clock: pins are sampled into low word
    4th clock: pins are sampled into high word, long is written to QUAD1
    5th clock: pins are sampled into low word
    6th clock: pins are sampled into high word, long is written to QUAD2
    7th clock: pins are sampled into low word
    8th clock: pins are sampled into high word, long is written to QUAD3

For 32_pins_to_QUADs mode (%101), on the cycle after SETXFR is executed, the following 4-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to QUAD0
    2nd clock: pins are sampled and written to QUAD1
    3rd clock: pins are sampled and written to QUAD2
    4th clock: pins are sampled and written to QUAD3

For 16_pins_to_AUX mode (%110), on the cycle after SETXFR is executed, the following 2-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled into low word
    2nd clock: pins are sampled into high word, long is written to AUX[SPB++]

For 32_pins_to_AUX mode (%111), on the cycle after SETXFR is executed, the following 1-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to AUX[SPB++]

While an AUX_to_pins or pins_to_AUX mode is active, you should not read or write AUX or modify SPB, as such attempts will likely interfere with XFR operation and cause unexpected results. VID, however, has an asynchronous second port to AUX, so it can, for example, stream pixels out at the same time XFR streams them in.

To stop XFR, execute 'SETXFR #0' on the last cycle of desired XFR operation.
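For readers who want to check a SETXFR value by hand, here is a small Python sketch that decodes the %E_MMM_PPP layout quoted above. The function name `decode_setxfr` and the returned dictionary are made up for the illustration; only the bit layout and the mode and pin-group tables come from the excerpt.

```python
MODES = [
    "QUADs_to_16_pins", "QUADs_to_32_pins", "AUX_to_16_pins", "AUX_to_32_pins",
    "16_pins_to_QUADs", "32_pins_to_QUADs", "16_pins_to_AUX", "32_pins_to_AUX",
]

def decode_setxfr(cfg):
    """Split a 7-bit %E_MMM_PPP value into enable, mode name and pin range."""
    enable = (cfg >> 6) & 1
    mode = (cfg >> 3) & 0b111
    group = cfg & 0b111
    if mode & 1:                     # odd modes are the 32-pin modes
        base = (group >> 1) * 32     # %PPP pairs share one 32-pin group
        width = 32
    else:                            # even modes are the 16-pin modes
        base = group * 16
        width = 16
    return {
        "enabled": bool(enable),
        "mode": MODES[mode],
        "pins": (base + width - 1, base),   # (high pin, low pin)
    }

# Enabled, QUADs_to_32_pins, pin group %110 (pins 127..96)
sender = decode_setxfr(0b1_001_110)
```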
An example of XFR usage is in the following program: balls.spin
Your opening example with all those flags and handshaking stuff to share an I2C or whatever resource looks like a horrible messy way to do things.
Surely it's better to have one COG drive the I2C and accept commands from other COGs via the usual HUB (using a mailbox and perhaps locks).
A higher level of abstraction.
We could even imagine that in the driver COG the hardware is being manipulated by a thread or two (input and output, as in a UART, say), whilst another thread (or threads) is chatting with the clients of this service.
Is there even a speed advantage to tweaking around with flags like that?
Wrong. Port D, from a cog's (own) perspective, is a normal IO port with its out, dir and in registers, but with ONE EXCEPTION: each cog MUST internally set up which other cogs' outputs form its inputs.
That means that in the ozpropdev example, long transfers are not limited to cogs 1, 2 and 3. E.g. cogs 7 and 8 can exchange longs that will not be seen by, and will not affect, cogs 1, 2 and 3.
Horrible mess... A little harsh, but sure.
The point was to show that it can be done without using locks and polling hub registers.
Anyhow it kick started the discussion, so mission accomplished.
Each one of those interactions between the hardware driver and its "clients" is a case of a single producer/single consumer. As such, commands and responses can be exchanged with simple flags in COG or via cyclic FIFOs (as in FullDuplexSerial). No locks required. Is there really much of a speed gain or any other benefit to using port D in that example?
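The reason a single producer/single consumer FIFO needs no lock is that each index has exactly one writer. Here is a minimal Python sketch of that idea; it is not the FullDuplexSerial code, and the class name `RingFifo` is invented for the example.

```python
class RingFifo:
    """Cyclic FIFO for one producer and one consumer.

    Only the producer writes `head`; only the consumer writes `tail`.
    One slot is left unused so that full and empty can be told apart.
    """
    def __init__(self, size=16):
        self.buf = [0] * size
        self.size = size
        self.head = 0    # next slot to write (producer-owned)
        self.tail = 0    # next slot to read (consumer-owned)

    def put(self, value):
        nxt = (self.head + 1) % self.size
        if nxt == self.tail:          # full: consumer has not caught up
            return False
        self.buf[self.head] = value   # store the data first...
        self.head = nxt               # ...then publish it by moving head
        return True

    def get(self):
        if self.tail == self.head:    # empty
            return None
        value = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.size
        return value

fifo = RingFifo(size=4)               # holds at most 3 items
accepted = [fifo.put(v) for v in (1, 2, 3, 4)]
```

With a size of 4, the fourth `put` is refused until the consumer has taken something out.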
Now, if we can use Port D to stream bytes between COGs faster than can be done through HUB I would start to see some advantages.
%000 = QUADs_to_16_pins (Might be WIDEs now?)
%100 = 16_pins_to_QUADs
What would be nice is if COG 1 were set to mode 0 (%000) and COG 2 to mode 4 (%100).
Then, somehow, the back-to-back "bus" could transfer n "quads" from COG 1 to COG 2.
Sweet!
The transfer mechanism is already half there.
Thoughts?
In the Prop II update - BLOG thread I replied to a similar question on multiple writes on the same clock:
I think that would be the most simple implementation. When the usage is such that collisions might occur the operation should occur within a lock.
In most cases it might actually be fine if this was implemented such that set and clear were hub operations and only the read is a non-hub operation.
My goal when I requested the 32 flags and locks was for a very simple implementation that would be useful with a minimal implementation cost.
C.W.
Another possible use could be barrier synchronization, for fine grained parallel processing.
Every processing element (or task) sets its own bit, and then waits for the whole port to be "111..1" (possibly masked with the participating PEs only).
Similarly, an "antibarrier" can be made by checking for the port to be all zeros, after each PE has put "0" to its respective bit.
This is already possible using WAITPEQ/WAITPNE instructions on port D, with up to four tasks per COG.
The only thing missing is a means to extend the barrier to multiple chips. But even if done with some software assistance, it should still be way faster and lower latency than the main data exchange link, whatever it is. Which is the point of having a dedicated barrier at all.
Ok, it's not going to compete with GPU computing, still has an educational value IMHO.
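The barrier and antibarrier conditions above boil down to two masked compares. Here is a small Python sketch of just that logic; the helper names are invented, the port is a plain integer, and the actual waiting (WAITPEQ/WAITPNE on real hardware) is left out.

```python
def arrive(port, pe):
    """A processing element sets its own bit on reaching the barrier."""
    return port | (1 << pe)

def leave(port, pe):
    """A processing element clears its own bit (used for the antibarrier)."""
    return port & ~(1 << pe)

def barrier_open(port, mask):
    """True once every participating bit is 1 (a WAITPEQ-style masked compare)."""
    return (port & mask) == mask

def antibarrier_open(port, mask):
    """True once every participating bit is back to 0."""
    return (port & mask) == 0

MASK = 0b1011                 # PEs 0, 1 and 3 take part; PE 2 is ignored
port = 0
port = arrive(port, 0)
port = arrive(port, 1)
early = barrier_open(port, MASK)      # PE 3 has not arrived yet
port = arrive(port, 3)
ready = barrier_open(port, MASK)      # all participants have arrived
```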
The problem with that rule is that a new set (new data) loses to a clear (data done), which could lose messages.
A logic sense needs to be chosen (I've used Positive logic here) and then I think a set to the active state, should trump a clear.
General rule is upstream sets, downstream acts on hi, clears when done.
The boundary case of same clock, sets and generates another loop.
Are these Port D bits available to WAIT opcodes? That could give significant power saving.
My thinking was that a state change should win because you really shouldn't have been setting or clearing a flag that was already in the corresponding state.
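To see why the same-clock rule matters, here is a Python sketch comparing two arbitration policies for one flag bit. Both functions are illustrative guesses at how the rules discussed in this thread could be written down, not a description of any actual hardware.

```python
def set_wins(current, set_req, clear_req):
    """Positive-logic rule from the post: a set to the active state trumps a clear."""
    if set_req:
        return 1
    if clear_req:
        return 0
    return current

def change_wins(current, set_req, clear_req):
    """Alternative rule: whichever request changes the state wins."""
    if set_req and current == 0:
        return 1
    if clear_req and current == 1:
        return 0
    return current

# Boundary case: the flag is high, downstream clears it ("data done")
# on the same clock that upstream sets it again ("new data").
kept = set_wins(1, set_req=True, clear_req=True)      # flag stays high, downstream loops again
lost = change_wins(1, set_req=True, clear_req=True)   # flag drops, the new message is missed
```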
After some thought I think maybe having the set and clear be hub operations might be the easiest way around the same clock issue.
If we do stick with non-hub set and clear I agree that your suggested method of which state wins is better than what I had suggested.
Having the reads be non-hub is most important so we can poll or possibly waitxxx on the flag without consuming any hub slots.
C.W.
True, a classic design would poll, but that dictates looping code, and IIRC the Port nature of PortD means it can be used with WAITxx, and that nice feature rather shifts the usage, as the power saving of this is likely to be large.
With WaitXX, a master will set anytime it has new info, and the slave waits on pin hi, and clears when done. (Auto-power save)
An arrival of new info, even on the exact clear edge, still works on WAITxx.
Wouldn't applying Hub quanta lose a means to precisely sync operations across COGs?
Looking at the above idea a little deeper.
We know these pin transfer modes work in the background.
Assuming this mode is now "WIDE" instead of "QUAD" and some sort of sync signals could be added, we have a method of transferring 8 longs from one cog to another.
These modes once started repeat until disabled (already implemented).
We already have SETWIDE to set the source or destination WIDE address
We already have the 32 bit pathway in place.
8 longs in 8 clocks in the background. No hub required.
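Here is a toy Python model of that back-to-back transfer, mainly to show why the sync signals matter. It assumes one long crosses the 32-bit path per clock, that the sender drives long number `clk mod 8`, and that the receiver stores into its own running index; the function name `transfer` and the `start_offset` parameter are invented for the illustration.

```python
def transfer(src, start_offset=0, clocks=8):
    """Stream `src` from a sending cog to a receiving cog, one long per clock.

    `start_offset` models the receiver's pattern starting that many clocks
    out of step with the sender's (0 means perfectly synchronised).
    """
    n = len(src)
    dst = [None] * n
    for clk in range(clocks):
        pins = src[clk % n]                    # sender drives the shared 32 bits
        dst[(clk + start_offset) % n] = pins   # receiver samples into its own slot
    return dst

wide = list(range(100, 108))                   # 8 longs to send
in_sync = transfer(wide)                       # arrives unchanged
skewed = transfer(wide, start_offset=1)        # same data, rotated by one slot
```

With the two patterns aligned, the 8 longs arrive in place after 8 clocks. With a one-clock skew every long still arrives, but in the wrong slot, which is the case the proposed sync signals would have to rule out.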