Locks, flags and Port D XCH discussion

ozpropdev · 2014-01-20 22:12

Here is an example use of the Port D XCH as a lock/flag.

Cog 0 - I2C EEPROM driver
Cog 1 to 3 - 3 tasks all requiring I2C access.

Each cog has its own source, destination, length and command variables (hub based).
Each cog is also allocated 2 bits on Port D for handshaking.

Port D Bit 0 - Cog1 request signal.
Port D Bit 1 - Cog2 request signal.
Port D Bit 2 - Cog3 request signal.
Port D Bit 3 - Cog1 status signal.
Port D Bit 4 - Cog2 status signal.
Port D Bit 5 - Cog3 status signal.

Let's say Cog 2 wants to use the I2C.
First it loads its own source, destination etc. with the desired data.
Cog 2 sets Bit 1 of Port D (Cog 2 request)
Cog 2 then polls Port D bit 4 until high. (Cog 2 I2C command acknowledged)
Cog 2 now clears Port D bit 1 (request clear)
Cog 2 polls Port D bit 4 until low (I2C task completed)

This way only 1 cog can use I2C at a time. The I2C routine handles the round-robin sharing of the resource and could even give priority to a single cog if needed.

Simple and expandable to 16 lock/flags
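
The handshake steps above can be sketched as a small simulation. This is only a rough model, not Propeller code: Port D is treated as one shared word (a simplification), cogs are Python generators stepped in turn, and the names PortD, driver, client and run are made up for illustration. The bit assignments are the ones listed in the post.

```python
REQ_BIT = {1: 0, 2: 1, 3: 2}      # request bits, as listed in the post
STATUS_BIT = {1: 3, 2: 4, 3: 5}   # status bits, as listed in the post


class PortD:
    """Hypothetical model: one shared word whose bits any cog can set or clear."""

    def __init__(self):
        self.value = 0

    def set(self, bit):
        self.value |= 1 << bit

    def clear(self, bit):
        self.value &= ~(1 << bit)

    def test(self, bit):
        return bool((self.value >> bit) & 1)


def driver(port, log):
    """Cog 0: scan the request bits round-robin and serve one client at a time."""
    while True:
        for cog in (1, 2, 3):
            if port.test(REQ_BIT[cog]):
                port.set(STATUS_BIT[cog])        # command acknowledged
                while port.test(REQ_BIT[cog]):   # wait for the request to clear
                    yield
                log.append(cog)                  # the I2C transfer would run here
                port.clear(STATUS_BIT[cog])      # task completed
        yield


def client(cog, port):
    """Cogs 1..3: the four steps described for Cog 2 in the post."""
    port.set(REQ_BIT[cog])                       # raise request
    while not port.test(STATUS_BIT[cog]):        # poll status until high
        yield
    port.clear(REQ_BIT[cog])                     # clear request
    while port.test(STATUS_BIT[cog]):            # poll status until low
        yield


def run(requesters):
    """Step the driver and all clients until every client has finished."""
    port, log = PortD(), []
    tasks = [client(c, port) for c in requesters]
    drv = driver(port, log)
    while tasks:
        next(drv)
        for task in list(tasks):
            try:
                next(task)
            except StopIteration:
                tasks.remove(task)
    return log, port.value
```

Calling run([2, 3, 1]) returns the order in which the driver served the clients and the final port value, which should be back to zero once every handshake has completed.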

Cluso99 · 2014-01-20 23:21

While this would work, it blocks PortD from being used to transfer longs between cogs (using XCHG), which would be preferred.
For this example, it would seem that using a LOCK would be preferable.

The way I read PortD working is that each cog has a 32-bit PortD output. So Cog 1 could send a long to Cog 2 and Cog 2 could send a long to Cog 1, both in parallel. At the same time Cogs 4 & 7 could do likewise, etc.

It may be better to define what scenarios we are after, and then define what might be a solution...

1. LOCKS: A common resource (such as a driver) being shared amongst tasks.
2. COG COMMS: Transfer of byte/word/longs between cogs (eg FullDuplexSerial fifos)
3. FLAGS: A cog signals to another cog that something is available. When taken, the cog resets the flag.
4. Any others???

If possible, it would be nicer if the support instructions were not considered to be hub based (ie does not sync with hub access) so that it could be faster.

When PortD was proposed, it was because a few of us realised that in P1 if the PortB had been implemented internally we could have made use of it for intercog comms.
Chip implemented it, as almost always, with some extra goodies.
However, now that we understand better what is possible, I wonder if there might be some better solutions.

Some time ago, I wondered if there might be a small block of ram and flags that could be used for simple cog-cog communication. I thought it would fit nicely in the centre of the die (although the autorouter would likely place it there).
When a location is written by a cog, it would set a flag, and when read by a cog, the flag would be reset. The operation of the flag could be controlled by the special read/write/clear instruction(s) use of its WZ & WC flags.
I would expect a single instruction that had bits for r/w/clear, and that the Z & C flags could be set depending on the "ram flag". The instruction would act as a move from/to cog register and block ram, and the other address would contain the block ram address.
Perhaps one of these 32-bit RAM longs could be accessed as individual bits, and we could then use this as a resource lock, or an availability flag.

This is just a quick first pass at something to get the discussion going. Thoughts anyone?

Cluso99 · 2014-01-20 23:41

ozpropdev wrote: »

Here is an example use of the Port D XCH as a lock/flag.

Cog 0 - I2C EEPROM driver
Cog 1 to 3 - 3 tasks all requiring I2C access.

Each cog has its own source, destination, length and command variables (hub based).
Each cog is also allocated 2 bits on Port D for handshaking.

Port D Bit 0 - Cog1 request signal.
Port D Bit 1 - Cog2 request signal.
Port D Bit 2 - Cog3 request signal.
Port D Bit 3 - Cog1 status signal.
Port D Bit 4 - Cog2 status signal.
Port D Bit 5 - Cog3 status signal.

Let's say Cog 2 wants to use the I2C.
First it loads its own source, destination etc. with the desired data.
Cog 2 sets Bit 1 of Port D (Cog 2 request)
Cog 2 then polls Port D bit 4 until high. (Cog 2 I2C command acknowledged)
Cog 2 now clears Port D bit 1 (request clear)
Cog 2 polls Port D bit 4 until low (I2C task completed)

This way only 1 cog can use I2C at a time. The I2C routine handles the round-robin sharing of the resource and could even give priority to a single cog if needed.

Simple and expandable to 16 lock/flags

For this style of scenario, what if there were a single shared 32-bit FLAG register?

A single instruction (1 clock as no need to sync with hub) would have variants to:

(a) read all 32 flags into a cog long, with optional WZ.
(b) can set any flag (1 bit), with optional WZ/WC returning the prior state of the flag.
(c) can clear any flag (1 bit), with optional WZ/WC returning the prior state of the flag.
Postedit: Do we need to ensure that set/clear cannot occur from multiple cogs simultaneously, and does clear override set?

This permits all flags to be checked at once for =0, and copy to cog long for subsequent mask/bit testing. (and we did not waste PortD)
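
A minimal sketch of the three proposed variants, to make the "returns the prior state" idea concrete. The class and method names are hypothetical (no such instruction exists in the thread beyond the description), and the model ignores the open question of simultaneous set/clear from several cogs.

```python
class FlagRegister:
    """Hypothetical model of the proposed shared 32-bit FLAG register."""

    def __init__(self):
        self.bits = 0

    def read(self):
        """Variant (a): return all 32 flags plus a Z result (True when all are 0)."""
        return self.bits, self.bits == 0

    def set(self, n):
        """Variant (b): set flag n (0..31) and return its prior state."""
        prior = (self.bits >> n) & 1
        self.bits |= 1 << n
        return prior

    def clear(self, n):
        """Variant (c): clear flag n (0..31) and return its prior state."""
        prior = (self.bits >> n) & 1
        self.bits &= ~(1 << n) & 0xFFFFFFFF
        return prior


def try_lock(flags, n):
    """Test-and-set: the caller owns the lock only if the flag was clear before."""
    return flags.set(n) == 0
```

Because set returns the prior state, the same primitive can serve as a lock (the first caller sees 0, later callers see 1) or as a plain availability flag.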

ozpropdev · 2014-01-21 00:51

Something else to be aware of that can use Port D is the TRACE function when used with XFR.
The trace data can be streamed through Port D to AUX RAM (or cog RAM, QUADs (WIDEs?)), thus using 16 bits of the port.

See data below

TRACE
-----

A cog can cause its execution state (from pipeline stage 4) to be output to pins on
every clock cycle by using the SETRACE instruction:

    SETRACE D/#     - Set trace configuration to %E_PPP

                      %E = enable

                           %0 = off (initial value on cog start)
                           %1 = on

                      %PPP = 16-pin group to output to

                           %000 = pins 15..0
                           %001 = pins 31..16
                           %010 = pins 47..32
                           %011 = pins 63..48
                           %100 = pins 79..64
                           %101 = pins 95..80 (pins 95..92 don't exist)
                           %110 = pins 111..96
                           %111 = pins 127..112


The 16 signals that will be output are, from MSB to LSB:

    TASK[1..0]   - the executing task, 0..3
    Z            - the Z flag
    C            - the C flag
    GO           - the GO signal, 0 = stall pipeline, 1 = instruction done
    COND         - the COND signal, 0 = condition false, 1 = condition true
    VALID        - the VALID signal, 0 = instruction cancelled, 1 = instruction valid
    PC[8..0]     - the program counter, 0..$1FF

For the output to appear, the DIR bits corresponding to the 16-pin port must be set.

Idea: By outputting trace data to the internal port D pins (%PPP = %11x), and having
another cog trigger using WAITPEQ before logging trace data, a trace debugger could
be made.
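
As an illustration of the 16-signal layout above, a logging cog or a host tool could unpack each sampled word roughly like this. It is a sketch that assumes the first-listed signal (TASK) occupies the highest pins of the 16-pin group; TraceSample and decode_trace are made-up names.

```python
from typing import NamedTuple


class TraceSample(NamedTuple):
    task: int     # TASK[1..0], executing task 0..3
    z: bool       # Z flag
    c: bool       # C flag
    go: bool      # False = pipeline stalled, True = instruction done
    cond: bool    # condition result
    valid: bool   # False = instruction cancelled
    pc: int       # PC[8..0], 0..$1FF


def decode_trace(word):
    """Split one 16-bit trace word into its fields, MSB first as listed above."""
    word &= 0xFFFF
    return TraceSample(
        task=(word >> 14) & 0b11,
        z=bool((word >> 13) & 1),
        c=bool((word >> 12) & 1),
        go=bool((word >> 11) & 1),
        cond=bool((word >> 10) & 1),
        valid=bool((word >> 9) & 1),
        pc=word & 0x1FF,
    )
```

For example, decode_trace(0xAE0A) describes task 2 at PC 10 with Z, GO, COND and VALID set and C clear.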

PIN TRANSFER
------------

Each cog has a pin transfer (XFR) which can automatically move data between pins and
QUADs/AUX, in the background, while instructions execute normally.

XFR is configured with the SETXFR instruction:

    SETXFR  D/#     - Set XFR configuration to %E_MMM_PPP

          %E = enable

                %0 = off (initial state after cog start)
                %1 = on

          %MMM = mode

                %000 = QUADs_to_16_pins
                %001 = QUADs_to_32_pins
                %010 = AUX_to_16_pins
                %011 = AUX_to_32_pins
                %100 = 16_pins_to_QUADs
                %101 = 32_pins_to_QUADs
                %110 = 16_pins_to_AUX
                %111 = 32_pins_to_AUX

          %PPP = pin group

                %000 = pins 15..0    for 16-pin modes,   pins 31..0   for 32-pin modes
                %001 = pins 31..16   for 16-pin modes,   pins 31..0   for 32-pin modes
                %010 = pins 47..32   for 16-pin modes,   pins 63..32  for 32-pin modes
                %011 = pins 63..48   for 16-pin modes,   pins 63..32  for 32-pin modes
                %100 = pins 79..64   for 16-pin modes,   pins 95..64  for 32-pin modes
                %101 = pins 95..80   for 16-pin modes,   pins 95..64  for 32-pin modes
                %110 = pins 111..96  for 16-pin modes,   pins 127..96 for 32-pin modes
                %111 = pins 127..112 for 16-pin modes,   pins 127..96 for 32-pin modes
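
The configuration fields above can be packed and checked with a small helper. This is a sketch that reads %E_MMM_PPP as a 7-bit value with E in bit 6, MMM in bits 5..3 and PPP in bits 2..0 (an assumption about the notation); the function names are invented for illustration.

```python
def setxfr_config(enable, mode, pin_group):
    """Pack the fields into %E_MMM_PPP (assumed: E = bit 6, MMM = 5..3, PPP = 2..0)."""
    return (enable & 1) << 6 | (mode & 0b111) << 3 | (pin_group & 0b111)


def xfr_pins(mode, pin_group):
    """Return (highest_pin, lowest_pin) used by the given mode and pin group."""
    if mode & 1:                       # odd mode numbers are the 32-pin modes
        low = (pin_group >> 1) * 32    # the low bit of PPP is ignored
        return low + 31, low
    low = pin_group * 16               # 16-pin modes use all three PPP bits
    return low + 15, low
```

For instance, enabling 16_pins_to_QUADs (%100) on pin group %110 gives %1_100_110, and xfr_pins(0b100, 0b110) returns (111, 96), which matches the table row for %110.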


For QUADs_to_16_pins mode (%000), on the cycle after SETXFR is executed, the following
8-clock pattern begins and then repeats indefinitely:

    1st clock: QUAD0 low word is output to pins
    2nd clock: QUAD0 high word is output to pins
    3rd clock: QUAD1 low word is output to pins
    4th clock: QUAD1 high word is output to pins
    5th clock: QUAD2 low word is output to pins
    6th clock: QUAD2 high word is output to pins
    7th clock: QUAD3 low word is output to pins
    8th clock: QUAD3 high word is output to pins


For QUADs_to_32_pins mode (%001), on the cycle after SETXFR is executed, the following
4-clock pattern begins and then repeats indefinitely:

    1st clock: QUAD0 is output to pins
    2nd clock: QUAD1 is output to pins
    3rd clock: QUAD2 is output to pins
    4th clock: QUAD3 is output to pins


For AUX_to_16_pins mode (%010), on the second cycle after SETXFR is executed, the
following 2-clock pattern begins and then repeats indefinitely:

    1st clock: AUX[SPB] low word is output to pins
    2nd clock: AUX[SPB++] high word is output to pins


For AUX_to_32_pins mode (%011), on the second cycle after SETXFR is executed, the
following 1-clock pattern begins and then repeats indefinitely:

    1st clock: AUX[SPB++] is output to pins


For 16_pins_to_QUADs mode (%100), on the cycle after SETXFR is executed, the following
8-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled into low word
    2nd clock: pins are sampled into high word, long is written to QUAD0
    3rd clock: pins are sampled into low word
    4th clock: pins are sampled into high word, long is written to QUAD1
    5th clock: pins are sampled into low word
    6th clock: pins are sampled into high word, long is written to QUAD2
    7th clock: pins are sampled into low word
    8th clock: pins are sampled into high word, long is written to QUAD3


For 32_pins_to_QUADs mode (%101), on the cycle after SETXFR is executed, the following
4-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to QUAD0
    2nd clock: pins are sampled and written to QUAD1
    3rd clock: pins are sampled and written to QUAD2
    4th clock: pins are sampled and written to QUAD3


For 16_pins_to_AUX mode (%110), on the cycle after SETXFR is executed, the following
2-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled into low word
    2nd clock: pins are sampled into high word, long is written to AUX[SPB++]


For 32_pins_to_AUX mode (%111), on the cycle after SETXFR is executed, the following
1-clock pattern begins and then repeats indefinitely:

    1st clock: pins are sampled and written to AUX[SPB++]


While an AUX_to_pins or pins_to_AUX mode is active, you should not read or write AUX or
modify SPB, as such attempts will likely interfere with XFR operation and cause unexpected
results. VID, however, has an asynchronous second port to AUX, so it can, for example,
stream pixels out at the same time XFR streams them in.

To stop XFR, execute 'SETXFR #0' on the last cycle of desired XFR operation.

An example of XFR usage is in the following program:

    balls.spin

Heater. · 2014-01-21 01:32

ozpropdev,

Your opening example with all those flags and handshaking stuff to share an I2C or whatever resource looks like a horrible, messy way to do things.

Surely it's better to have one COG drive the I2C and accept commands from other COGs via the usual HUB (using a mailbox and perhaps locks).

A higher level of abstraction.

We could even imagine that in the driver COG the hardware is being manipulated by a thread or two (input and output, as in a UART, say) whilst another thread (or threads) is chatting with the clients of this service.

Is there even a speed advantage to tweaking around with flags like that?

dMajo · 2014-01-21 01:55

Cluso99 wrote: »

While this would work, it blocks PortD from being used to transfer longs between cogs (using XCHG), which would be preferred.

Wrong. Port D, from a cog's (own) perspective, is a normal IO port with its out, dir and in registers, but with ONE EXCEPTION: each cog MUST internally set up which other cogs' outputs form its inputs.

That means that in ozpropdev's example, long transfers are ruled out only between cogs 1, 2 and 3. E.g. cogs 7 and 8 can exchange longs that will not be seen by, and will not affect, cogs 1, 2 and 3.

ozpropdev · 2014-01-21 01:59

Heater. wrote: »

ozpropdev,

Your opening example with all those flags and handshaking stuff to share an I2C or whatever resource looks like a horrible, messy way to do things.

Horrible mess... A little harsh, but sure.
The point was to show that it can be done without using locks and polling hub registers.
Anyhow, it kick-started the discussion, so mission accomplished.

Heater. · 2014-01-21 02:20

Sorry for the harsh bit. No offence intended.

Each one of those interactions between the hardware driver and its "clients" is a case of a single producer/single consumer. As such, commands and responses can be exchanged with simple flags in COG or via cyclic FIFOs (as in FullDuplexSerial). No locks required. Is there really much of a speed gain or any other benefit to using Port D in that example?

Now, if we can use Port D to stream bytes between COGs faster than can be done through HUB I would start to see some advantages.

ozpropdev · 2014-01-21 02:44

The pin transfer module has the following modes

%000 = QUADs_to_16_pins (might be WIDEs now?)
%100 = 16_pins_to_QUADs

What would be nice is if COG 1 were set to mode 0 and COG 2 to mode 4, so that the back-to-back "bus" could somehow transfer n "quads" from COG 1 to COG 2.

Sweet!

Even better if 32 pins(bits) were used.

The transfer mechanism is already half there.
Thoughts?
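
To picture the back-to-back idea, the two documented 8-clock patterns can be chained in a toy model. This is a sketch only: it assumes both XFR units run in lockstep with no pipeline latency between them (exactly the sync problem still to be solved), and the function names are invented.

```python
def quads_to_16_pins(quads):
    """Sender side (mode %000): low word, then high word of QUAD0..QUAD3."""
    for long_value in quads:
        yield long_value & 0xFFFF           # odd clocks: low word on the pins
        yield (long_value >> 16) & 0xFFFF   # even clocks: high word on the pins


def pins16_to_quads(words):
    """Receiver side (mode %100): pair up low/high words back into longs."""
    words = list(words)
    return [lo | (hi << 16) for lo, hi in zip(words[0::2], words[1::2])]
```

Feeding the sender's word stream straight into the receiver should reproduce the original four longs after 8 clocks, which is the transfer being wished for here.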

ctwardell · 2014-01-21 07:02

Cluso99 wrote: »

For this style of scenario, what if there were a single shared 32-bit FLAG register?

A single instruction (1 clock as no need to sync with hub) would have variants to:

(a) read all 32 flags into a cog long, with optional WZ.
(b) can set any flag (1 bit), with optional WZ/WC returning the prior state of the flag.
(c) can clear any flag (1 bit), with optional WZ/WC returning the prior state of the flag.
Postedit: Do we need to ensure that set/clear cannot occur from multiple cogs simultaneously, and does clear override set?

This permits all flags to be checked at once for =0, and copy to cog long for subsequent mask/bit testing. (and we did not waste PortD)

In the Prop II update - BLOG thread I replied to a similar question on multiple writes on the same clock:

ctwardell wrote: »

Any COG can set or clear.

If conflicting values are set on the same clock:

If the current value is set, then clear wins.
If the current value is cleared, then set wins.

The use cases I have in mind would occur within a lock so this wouldn't be an issue, but in the general case some rule is needed.

I think that would be the most simple implementation. When the usage is such that collisions might occur the operation should occur within a lock.

In most cases it might actually be fine if this was implemented such that set and clear were hub operations and only the read is a non-hub operation.

My goal when I requested the 32 flags and locks was for a very simple implementation that would be useful with a minimal implementation cost.

ctwardell wrote: »

Non-Hub Flags...

C.W.

AntoineDoinel · 2014-01-21 12:42

Cluso99 wrote: »

4. Any others???

Another possible use could be barrier synchronization, for fine grained parallel processing.

Every processing element (or task) sets its own bit, and then waits for the whole port to be "111..1" (possibly masked to the participating PEs only).

Similarly, "antibarrier" can be made by checking for the port to be false, after putting "0" to each respective bit.

This is already possible using WAITPEQ/WAITPNE instructions on Port D, up to four tasks per COG.
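
The barrier and antibarrier conditions can be written down as mask tests. The sketch below models the WAITPEQ-style condition as (port AND mask) == state, as on the P1; the helper names are hypothetical and nothing here actually blocks.

```python
def waitpeq_ready(port, state, mask):
    """True when a WAITPEQ-style wait would fall through: (port & mask) == state."""
    return (port & mask) == state


def participants_mask(bits):
    """Build a mask from the Port D bit numbers of the participating PEs."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    return mask


def barrier_reached(port, bits):
    """Barrier: every participant has set its own bit."""
    mask = participants_mask(bits)
    return waitpeq_ready(port, mask, mask)


def antibarrier_reached(port, bits):
    """Antibarrier: every participant has cleared its own bit."""
    return waitpeq_ready(port, 0, participants_mask(bits))
```

Bits belonging to non-participating PEs are masked off, so they can be in any state without holding up the barrier.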

The only thing missing is a means to extend the barrier to multiple chips. But even if done with some software assistance, it should still be way faster and lower latency than the main data exchange link, whatever it is. Which is the point of having a dedicated barrier at all.

Ok, it's not going to compete with GPU computing, still has an educational value IMHO.

jmg · 2014-01-21 14:08

ctwardell wrote: »

Any COG can set or clear.

If conflicting values are set on the same clock:

If the current value is set, then clear wins.
If the current value is cleared, then set wins.

I think that would be the most simple implementation.

The problem with that rule is that a new set (new data) loses to a clear (data done), which could lose messages.
A logic sense needs to be chosen (I've used Positive logic here) and then I think a set to the active state, should trump a clear.
General rule is upstream sets, downstream acts on hi, clears when done.
The boundary case of same clock, sets and generates another loop.

Are these Port D bits available to WAIT opcodes? That could give a significant power saving.
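
The two same-clock rules under discussion can be compared side by side. This is a sketch with invented function names: resolve_state_change follows the quoted "state change wins" rule, resolve_set_wins follows the "set trumps clear" rule argued for in this post.

```python
def resolve_state_change(current, set_req, clear_req):
    """Quoted rule: on a same-clock conflict, whichever request changes the state wins."""
    if set_req and clear_req:
        return not current
    if set_req:
        return True
    if clear_req:
        return False
    return current


def resolve_set_wins(current, set_req, clear_req):
    """Rule argued for here: a set to the active state trumps a simultaneous clear."""
    if set_req:
        return True
    if clear_req:
        return False
    return current
```

The interesting case is a flag that is already set when the downstream cog clears it on the same clock as the upstream cog sets it for new data. Under the first rule the flag ends up clear and the new message is missed; under the second it stays set and the downstream cog simply runs another loop.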

ctwardell · 2014-01-21 14:26

jmg wrote: »

The problem with that rule is that a new set (new data) loses to a clear (data done), which could lose messages.
A logic sense needs to be chosen (I've used Positive logic here) and then I think a set to the active state, should trump a clear.
General rule is upstream sets, downstream acts on hi, clears when done.
The boundary case of same clock, sets and generates another loop.

Are these Port D bits available to WAIT opcodes? That could give a significant power saving.

My thinking was that a state change should win because you really shouldn't have been setting or clearing a flag that was already in the corresponding state.

After some thought I think maybe having the set and clear be hub operations might be the easiest way around the same clock issue.

If we do stick with non-hub set and clear I agree that your suggested method of which state wins is better than what I had suggested.

Having the reads be non-hub is most important so we can poll or possibly waitxxx on the flag without consuming any hub slots.

C.W.

jmg · 2014-01-21 15:00

ctwardell wrote: »

My thinking was that a state change should win because you really shouldn't have been setting or clearing a flag that was already in the corresponding state.

True, a classic design would poll, but that dictates looping code, and IIRC the Port nature of PortD means it can be used with WAITxx, and that nice feature rather shifts the usage, as the power saving of this is likely to be large.

With WaitXX, a master will set anytime it has new info, and the slave waits on pin hi, and clears when done. (Auto-power save)

An arrival of new info, even on the exact clear edge, still works on WAITxx.

Applying Hub quanta could lose a means to precisely sync operations across COGs?

ozpropdev · 2014-01-21 15:40

ozpropdev wrote: »

The pin transfer module has the following modes

%000 = QUADs_to_16_pins (might be WIDEs now?)
%100 = 16_pins_to_QUADs

What would be nice is if COG 1 were set to mode 0 and COG 2 to mode 4, so that the back-to-back "bus" could somehow transfer n "quads" from COG 1 to COG 2.

Sweet! Even better if 32 pins(bits) were used.

The transfer mechanism is already half there.
Thoughts?

Looking at the above idea a little deeper.
We know these pin transfer modes work in the background.
Assuming this mode is now "WIDE" instead of "QUAD" and some sort of sync signals could be added we
have a method of transferring 8 longs from one cog to another.
These modes once started repeat until disabled (already implemented).
We already have SETWIDE to set the source or destination WIDE address.
We already have the 32-bit pathway in place.

8 longs in 8 clocks in the background. No hub required.
