Another plug for cog-to-cog comms

Seairth · 2014-05-11 18:05

When Chip proposed the P16X64A, it did not include Port D. And, now that we have twice the number of cogs, that approach may not make sense anyhow. So, I'd like to propose a new approach:

SETMSG D/#n
GETMSG D, S/#n

It's this simple:

SETMSG D/#n allows each cog to write to a 32-bit HUB register that belongs to it. This is immediate. No waiting.
GETMSG D, S/#n allows each cog to read another cog's 32-bit HUB register (including its own) into a local register. This is immediate. No waiting.

These 32-bit HUB registers would have to be 2-port memory to allow a SET and GET on the same register at the same time, and it would mean that GETs would be one clock cycle behind SETs (i.e. a GET wouldn't see a SET value until the clock cycle after the SET). But that's it.
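
The one-cycle lag between a SET and the GET that sees it can be sketched in software. This is only a toy model under assumed semantics: the class name `MsgRegisters`, its methods, and the `tick()` clock edge are illustrative and not part of any real Propeller design.

```python
class MsgRegisters:
    """Toy model: one 32-bit message register per cog, readable by every cog."""

    def __init__(self, cogs=16):
        self.visible = [0] * cogs   # values a GETMSG sees during this clock
        self.pending = {}           # SETMSG writes issued during this clock

    def setmsg(self, cog, value):
        # Only the owning cog writes its register, so no arbiter is modelled.
        self.pending[cog] = value & 0xFFFFFFFF

    def getmsg(self, target):
        # Any cog may read any register, including its own.
        return self.visible[target]

    def tick(self):
        # Clock edge: writes from the previous cycle become visible to GETs.
        for cog, value in self.pending.items():
            self.visible[cog] = value
        self.pending.clear()


regs = MsgRegisters()
regs.setmsg(0, 0x1234)
same_cycle = regs.getmsg(0)   # a GET in the same clock still sees the old value
regs.tick()
next_cycle = regs.getmsg(0)   # one clock later the new value is visible
```

Keeping `pending` separate from `visible` is what stands in for the 2-port memory here: a write and a read of the same register in the same clock never collide, the read simply returns the older value.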

Because COGs cannot write to the same registers, there is no need for an arbiter (like round-robin, SuperCog, 32-slot table, etc.). And all cogs can safely read the same register at the same time (if desired). (To be fair, though, I'm not sure what sort of bus circuitry the reads would require. Writes should be trivial.)

And this concept could be extended further: give each cog 16 such registers. With this, each cog can effectively pair itself with other cogs (e.g. COG 0 can SET register 1 for COG 1 to read, while COG 0 can SET register 0 for COG 0 to read). This would also allow schemes where two cogs use multiple registers to pass large values.

Pros:

FAST cog-to-cog communication without using HUB memory.
Simple to understand
Because it's not port-based, it should work with all future Propeller versions.
The 16-register version avoids the muxed approach that Port D used to allow multi-cog communication

Cons:

Due to the explicit instructions, it's not as flexible as virtualizing it as a port
Are there other cons?

(note: this post is actually spawned from a few comments in the SuperCog thread. With something like this, particularly the 16-register version, it might actually be possible to stick with the simple 1:16 hub timing. Might. Or, at least, it would address many of the same use cases, I think.)

jmg · 2014-05-11 18:31

Seairth wrote: »

With something like this, particularly the 16-register version, it might actually be possible to stick with the simple 1:16 hub timing. Might. Or, at least, it would address many of the same use cases, I think.)

Any alternate COG-COG comms would be nice, but it does not really solve the fundamental HUB timing issues, which apply when larger burst-data tables are involved, and will also apply to tightly coded peripherals & also Hubexec flows.

I think there is a pathway for moderate bandwidth cog-any-cog links in the Smart Pins. Is that enough ?

mark · 2014-05-11 19:26

I had brought up a similar idea in a couple posts, but instead it was a multi-byte cache of 4-port in-hub ram operating at clock speed. Since it's 4-port, any given cog would have access to it every 4 clocks, or two instructions. I'd imagine it would be able to utilize the existing physical bus running between the cog and the hub. I think it would go a long way in speeding up inter-cog communications.

4x5n · 2014-05-11 19:36

I've wanted core to core (cog to cog?) comm since I wrote my first multi-core app!! It was for a project I was working on that used a cog to run a stepper motor. At first I wanted a flag to tell the cog running the stepper when to start and stop running the stepper. Soon after that I ran into a need to vary the speed of the motor and the number of steps it was to run in each direction. That was when I wished cog to cog comm of data was available.

However unless that's already in the works I wouldn't want it now at the expense of delaying the release of the P2!! In that case leave it for the P3.

Cluso99 · 2014-05-11 19:41

Seairth,
I have brought this up many times. However, the last time Chip said it is way too expensive/complex because of the buses involved.
Even a simple PortD 32bit bus where all cogs have their outputs "or"ed together requires a 32 wire ring around the chip.

That is why I then suggested parallel comms just between adjacent cogs.

The smart pins will (apparently) permit slower comms (clocked serial) between cogs. This method could be used with cogs driving slower peripherals for "blind" cogs (cogs without hub slots) if slot-sharing is implemented. But it isn't going to solve the issue where fast comms between cogs is necessary (such as USB FS if it is even possible).

jmg · 2014-05-11 19:51

Seairth wrote: »

These 32-bit HUB registers would have to be 2-port memory to allow a SET and GET on the same register at the same time,

One overhead here is that if these have no slot schemes, and are any-any, they are like wide latches, and that means every COG has a read-array of 480 lines, fanning in from all other COGs, into a 480b x 4b Address Mux -> 32b value.
Expanding that x16 would be very costly in interconnects.

Another approach could be to allow some Signaling Booleans for Wait/ready semaphores, using the new bitops, and then using the (hopefully there ?) Serial Cog-Cog links ?
32-64 Booleans would reduce the cost to 32-64 interconnects.

Would such a combination of Parallel Flags and Serial data, be good enough ?

jmg · 2014-05-11 19:55

4x5n wrote: »

...At first I wanted a flag to let the cog running the stepper when to start and stop running the stepper..

See #6 - Booleans should be low cost in routing terms. Next step would be Booleans + Serial, which may be as simple as 16-32 cells of simplified Smart Pins, buried ?
The new bitops do lend themselves to expanded boolean (virtual pins) & I think the opcode reach is already there.

4x5n · 2014-05-11 20:02

jmg wrote: »

See #6 - Booleans should be low cost in routing terms. Next step would be Booleans + Serial, which may be as simple as 16-32 cells of simplified Smart Pins, buried ?
The new bitops do lend themselves to expanded boolean (virtual pins) & I think the opcode reach is already there.

Not sure if that's in the plans but it would be nice if it becomes available with the P2 or P3. I started using locks but since it still meant waiting for hub access I simply went with a long as the "flag". A value of zero meant stop the motor and the rest was "encoded" in the long. In the long run it was the right way to go and I've since made a number of "feature adds" using that system and it's managed to hold up well. I would still like to be able to pass between the cogs in parallel.

It's not worth ANY delays though!!

potatohead · 2014-05-11 20:29

Using dummy pins, now that we won't have 96 real ones, could work very nicely.

jmg · 2014-05-11 20:52

potatohead wrote: »

Using dummy pins, now that we won't have 96 real ones, could work very nicely.

Yes, and should be very low-risk, as it is not 'new' as much as a slight 'reach-extension'.

Tubular · 2014-05-11 20:59

Cluso99 wrote: »

Even a simple PortD 32bit bus where all cogs have their outputs "or"ed together requires a 32 wire ring around the chip.

Aren't we looking at fully synthesized now? Perhaps except for memory blocks?

If they're not connected to any external pad, then they'd probably synthesize as another bundle of spaghetti in the middle

Seairth · 2014-05-12 10:11

Following up on several comments:

Bringing Port D back as a single wired-OR 32-bit bus

I'm really not crazy about this approach. First, as a matter of practicality, I think it would have to be limited to only two cogs using it. Yes, I'm sure there are schemes that could allow more, but you would either have to involve the hub or give up some of those 32 bits for handshaking. Because this effectively becomes a globally shared resource, it would be difficult to write OBEX drivers that took advantage of that port. Second, it cuts into the cog's register space. While this does have advantages (treating it like a register), you lose local registers (even when not using the feature) and it is brittle in terms of future expansion (e.g. what if you want port E, F, etc.).

Hardware considerations

Yeah, I should have thought about this a bit more when I was starting the thread. The primary design objective was to have data flow in one direction only, thereby avoiding the whole arbitration stuff in the hub. If each cog can write to a register, and each of the other cogs can only read that register, then there is never any conflict.

As jmg points out, though, this would be a lot to route. While it might be trivial to route the cog's own write/read lines, I do see how you'd end up with a 480-bit bus ring for the reads for the other cogs (15 pairs of 32-bit lines through a 4-bit mux). Not pretty.
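
The 480-line figure follows from simple counting. As a rough check, assuming 16 cogs, one 32-bit register each, and every cog able to read every other cog's register (the variable names below are illustrative only):

```python
COGS = 16    # cogs on the chip
WIDTH = 32   # bits per message register

# Each cog fans in the registers of the 15 other cogs.
read_lines_per_cog = (COGS - 1) * WIDTH

# Select bits needed to pick one of the 16 possible source cogs.
select_bits = (COGS - 1).bit_length()

# The 16-registers-per-cog variant multiplies the fan-in by 16.
read_lines_per_cog_16_regs = read_lines_per_cog * 16

print(read_lines_per_cog, select_bits, read_lines_per_cog_16_regs)
```

This only counts data lines into one cog's read mux; it says nothing about how a synthesis tool would actually route them.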

Paired Cogs

Yes, that would certainly reduce the overall circuit complexity. I too had suggested this approach a while back (post 1257770). It seems like a decent compromise, as I'd expect most two-cog driver solutions to use adjacent cogs. Yes, it somewhat breaks the global symmetry of the cogs, but from the individual cog perspective they are all still identical. And it still requires all non-paired cogs to use the hub to communicate.

Working with the original instructions I suggested, SETMSG would write to one of the paired registers and GETMSG would read from the other paired register. No addressing required. This would result in a 64-bit bus (two 32-bit read buses) between adjacent cogs, which I hope would be easily routable.

Further, if this sort of routing is simple, you could add additional registers. For instance, if each cog had 4 32-bit registers, then you'd only increase the interconnect by 4 bits (two 2-bit read address buses). And 16 registers would increase that by only another 4 bits. Interestingly, once you have this many registers, it opens up some other possibilities. But I'll post that as a separate reply...
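
The interconnect widths quoted above can be reproduced with a small helper. This is a sketch assuming two 32-bit read data buses plus one read-address bus in each direction; `paired_interconnect_bits` is a made-up name for illustration.

```python
def paired_interconnect_bits(registers, width=32):
    """Wires between two paired cogs: two data buses plus two address buses."""
    addr_bits = (registers - 1).bit_length()   # 0 bits when there is 1 register
    return 2 * width + 2 * addr_bits


for count in (1, 4, 16):
    print(count, "register(s):", paired_interconnect_bits(count), "bits")
```

Going from 1 to 4 registers adds 4 wires, and going from 4 to 16 adds 4 more, because only the address buses grow.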

Hub Memory Arbitration

No, I don't think this discussion precludes or replaces the various arbitration/slot-sharing discussions that are going on. All I'm saying is that a major use case of the hub is to transfer data between cogs. The advanced arbitration mechanisms we've been discussing (except SuperCog) have the use cases of improving throughput and reducing latency between cogs. This discussion (cog-to-cog comms) has the potential of addressing that requirement only.

Seairth · 2014-05-12 10:15

Expanding on the paired register idea, suppose the following:

Each cog contains 16 32-bit 2-port registers that it can write to at any time.
Each cog has a paired cog from which it can read the paired cog's 16 32-bit registers at any time.
The paired cogs would have a 72-bit interconnect (2 32-bit data buses and 2 4-bit address buses) for reading each other's registers.
SETMSG D, #n writes data in D to register n (0-15).
GETMSG D, #n reads data into D from register n (0-15).

The two-cog driver use case is very obvious, so I won't expand on that one. But are there other use cases?

Running large programs

Suppose you dedicated one cog (call it cog 0) to maintaining the instruction cache and one cog for executing the instructions (call it cog 1), in an XMM-type scenario. The basic process might go like:

   Cog 0                           Cog 1
   ==============================  ========================================
1  poll with "GETMSG addr, #0"     request with "SETMSG addr, #0"
   ------------------------------  ----------------------------------------
2  "SETMSG addr, #0" to indicate   poll with "GETMSG addr, #0" to see when
    done                            cog 0 is done
   ------------------------------  ----------------------------------------
3  write up to 15 instructions     perform 1 to 15 "GETMSG inst, #n" to get
    using "SETMSG inst, #n"         instructions
    (where n = 1 to 15)
   ------------------------------  ----------------------------------------
4  loop to 1                       execute
                                   
   ------------------------------  ----------------------------------------
5                                  loop to 1.

Note: Because the read of the newly-written value is delayed one clock
cycle, it is possible for phases 1-3 to each safely occur in parallel.  For
instance, Cog 1 can be reading the new instructions while Cog 0 is writing
the new instructions, because Cog 1 will enter phase 3 at least one clock
cycle after Cog 0 enters phase 3.
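
The handshake in the table can be walked through in software. The sketch below is a functional model only: it ignores clock-level timing and the one-cycle read delay, and the names `PairedRegs`, `cache_cog_service`, and the fake `memory` list are all assumptions made for illustration.

```python
class PairedRegs:
    """Two cogs (0 and 1), 16 registers each; a cog reads only its partner's."""

    def __init__(self):
        self.regs = {0: [0] * 16, 1: [0] * 16}

    def setmsg(self, cog, n, value):
        self.regs[cog][n] = value & 0xFFFFFFFF   # write own register n

    def getmsg(self, cog, n):
        return self.regs[1 - cog][n]             # read partner's register n


def cache_cog_service(link, memory):
    """Cog 0: one pass through phases 1 to 3 of the table."""
    addr = link.getmsg(0, 0)                 # phase 1: pick up Cog 1's request
    block = memory[addr:addr + 15]           # fetch up to 15 instructions
    link.setmsg(0, 0, addr)                  # phase 2: echo address as "done"
    for n, inst in enumerate(block, start=1):
        link.setmsg(0, n, inst)              # phase 3: registers 1..15
    return len(block)


memory = list(range(100, 140))               # stand-in for external memory
link = PairedRegs()

link.setmsg(1, 0, 8)                         # Cog 1 requests address 8
count = cache_cog_service(link, memory)      # Cog 0 services the request
done_seen = link.getmsg(1, 0)                # Cog 1 polls and sees the echo
fetched = [link.getmsg(1, n) for n in range(1, count + 1)]
```

Note that with all registers starting at zero, this naive echo scheme cannot distinguish a request for address 0 from "no request yet"; a real protocol would need some extra marker for that.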

More advanced implementations of Cog 0 could precache branches. You could also swap out Cog 0 implementations for different scenarios (reading from SDRAM, CF, eeprom, etc.) without having to change how Cog 1 works.

You could also have register 0 be split into two fields: address and number of instructions. This would allow faster retrieval of less-than-15-instruction snippets. And it would still give an address space larger than is possible with the hub.

Obviously, this would not be as fast as native hubexec. And it wouldn't be as fast as LMM using QUAD reads. But I do think you'd get faster performance than is currently possible on P8X32A, if only because you'd have reduced read latency (compared to hub-aligned timing). Further, this would free up Cog 1 to use the hub for data storage exclusively, which would also improve overall speed.

Using one cog as a "local" data space

Cog 0 could provide stack space for Cog 1. In this scenario, Cog 1 could "push" up to 15 registers (plus one "count" register). On average, this would be much faster than maintaining the stack in the hub. The "pop" part would take a little longer, due to the delay for requesting the "pop". Also, Cog 0 could overflow its stack into the hub, if needed, without having to change Cog 1 at all.

Similarly, the same mechanism could be used to store arbitrary values. Such an encoding might look like:

00xxxxxx_xxxxxxxx_xxxxxxxx_0000nnnn : "push" %%nnnn registers
01xxxxxx_xxxxxxxx_xxxxxxxx_0000nnnn : "pop" %%nnnn registers
10xxxxxx_xxxxxxxx_mmmmmmmm_0000nnnn : "write" register %%nnnn to address %%mmmmmmmm
11xxxxxx_xxxxxxxx_mmmmmmmm_0000nnnn : "read" address %%mmmmmmmm into register %%nnnn

Note: %%mmmmmmmm is only 256 registers, as you probably don't want Cog 1 clobbering Cog 0's code.
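
The command-word layout above packs and unpacks with a few shifts and masks. A minimal sketch, assuming the operation sits in bits 31:30, the address in bits 15:8, and the register number in bits 3:0, with the don't-care bits left at zero (the function and constant names are illustrative):

```python
PUSH, POP, WRITE, READ = 0b00, 0b01, 0b10, 0b11


def encode(op, n, addr=0):
    """Build a 32-bit command: op in bits 31:30, addr in 15:8, n in 3:0."""
    return ((op & 0b11) << 30) | ((addr & 0xFF) << 8) | (n & 0xF)


def decode(word):
    """Return (op, addr, n) from a 32-bit command word."""
    return (word >> 30) & 0b11, (word >> 8) & 0xFF, word & 0xF


cmd = encode(WRITE, 5, 0x2A)   # "write" register 5 to address 0x2A
print(hex(cmd), decode(cmd))
```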

Using one cog as a video buffer

In essence, Cog 1 treats Cog 0 like the AUX registers from P8X96A for video output. Cog 0 is only responsible for banging out whatever comes over the paired bus, possibly caching it in local registers beforehand.

Basic debug output

Cog 1 writes debug information to its registers. Cog 0 reads them, and does whatever with them. Eventually, Chip could maybe add a mode that performed an implicit "SETMSG PC, #15" on every instruction cycle; this would be almost transparent to Cog 1. Obviously, this would be less helpful for troubleshooting paired-cog code.

Seairth · 2014-05-12 10:16

Along with SETMSG and GETMSG, it might also make sense to have a "WAITMSG D, #n" that works like WAITPNE. In other words, it stalls until register #n contains a value different from the value at D.
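
The stall-until-different behaviour can be modelled against a per-clock trace of the register. This is a sketch of the assumed semantics only; `waitmsg` and the `samples` trace are hypothetical names, not an existing instruction or API.

```python
def waitmsg(samples, d):
    """Scan a per-clock trace of register #n and stop at the first value != d.

    Returns (clocks_stalled, value), or None if the value never changes
    within the trace.
    """
    for clocks, value in enumerate(samples):
        if value != d:
            return clocks, value
    return None


# Register holds 7 for three clocks, then the partner cog writes 9.
result = waitmsg([7, 7, 7, 9, 9], d=7)
print(result)
```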

brucee · 2014-05-12 11:52

This seems like a complex solution, to a problem that existing hardware solves. Especially when you consider it is an n-squared problem, meaning as the number of COGs grows the circuitry grows as the square of that number.

It's unlikely that all 16 need to talk to all 16, and anything other than that breaks the "simple/orthogonal" nature of the prop.

You can already dedicate hub space to act as the same function, yes you have to wait your turn to get to it, but most of these kind of tasks are not as time critical. If you needed faster, you could dedicate a pin or 2 for COGs to wave at each other.

Also, I'm part of the "ship it" group, that thinks this process has taken way too long, and I still question the decision of why not fix the database of the last 2 SNAFUs, while I agree the time of returning to that is probably now past.

Seairth · 2014-05-12 12:46

brucee wrote: »

This seems like a complex solution, to a problem that existing hardware solves. Especially when you consider it is an n-squared problem, meaning as the number of COGs grows the circuitry grows as the square of that number.

It's unlikely that all 16 need to talk to all 16, and anything other than that breaks the "simple/orthogonal" nature of the prop.

You can already dedicate hub space to act as the same function, yes you have to wait your turn to get to it, but most of these kind of tasks are not as time critical. If you needed faster, you could dedicate a pin or 2 for COGs to wave at each other.

Also, I'm part of the "ship it" group, that thinks this process has taken way too long, and I still question the decision of why not fix the database of the last 2 SNAFUs, while I agree the time of returning to that is probably now past.

I am also in the "ship it" camp (see other threads). The most recent post (expanding on the paired-cog approach) can actually be seen as a way to help "ship it". Consider this: right now, both hubexec and RDQUAD/WRQUAD are on the back burner because of their impact on the architecture. Chip intends to add one or both of them back in, of course. But there is this "ship it" pressure that he also has to deal with. Now, suppose that it turns out to be easier to implement the paired-cog concept than to implement hubexec and quads. This would allow the potential of doing some of the things that hubexec allows, some of the things quads allow (moving large data between two cogs), and even some of the things that advanced hub arbitration allows. And still get the chip out the door without adding hubexec, or advanced bus arbitration.

No, I'm not saying that it's better than or preferable to hubexec or quads (or advanced bus arbitration). I'm just suggesting that it might give Chip an alternative that helps with "ship it".

jmg · 2014-05-12 14:23

Seairth wrote: »

Each cog contains 16 32-bit 2-port registers that it can write to at any time.

Each cog has a paired cog from which it can read the paired cog's 16 32-bit registers at any time.

The paired cogs would have a 72-bit interconnect (2 32-bit data buses and 2 4-bit address buses) for reading each other's registers.

SETMSG D, #n writes data in D to register n (0-15).

GETMSG D, #n reads data into D from register n (0-15).

This is really just Two Port RAM, which is a standard block in FPGA and ASICs

To make COGs more equal, you would likely need 16 of these Dual Port memories, with the COGS in a Ring.
Each COG can then talk two ways : CW and CCW in the ring.

Seairth wrote: »

Bringing Port D back as a single wired-OR 32-bit bus

I'm really not crazy about this approach. First, as a matter of practicality, I think it would have to be limited to only two cogs using it. Yes, I'm sure there are schemes that could allow more, but you would either have to involve the hub or give up some of those 32 bits for handshaking.

I'd agree buried pins are not that flash for Data path flows, but very useful as Boolean flags, which is the real intent surely ?
Many of the new bitops seem to have > 64 built into the opcodes already, but I think the WAITPAE WAITPAN
WAITPBE WAITPBN opcodes may need 3rd copies. (WAITPCE WAITPCN ?)

Another variant would allow COGs to attach Serdes (simple 32b shifters) onto any of these pins.
That will give flexible links, but with 16 Opcode cycle latencies.

If they could be coded as DDR, that may get to 8 Opcode cycle latencies, combined with flags at 1 cycle latency.

Seairth · 2014-05-13 12:00

jmg wrote: »

This is really just Two Port RAM, which is a standard block in FPGA and ASICs

To make COGs more equal, you would likely need 16 of these Dual Port memories, with the COGS in a Ring.
Each COG can then talk two ways : CW and CCW in the ring.

Do you mean "16 more of these Dual Port memories"? In my description above, each pair of cogs has a total of 32 32-bit, 2-port registers (two sets of 16). If you are considering that to be "a memory", then yes, 16 memories would allow each cog to talk to its two "adjacent" cogs. Paired cogs would use only 8 memories, resulting in 256 2-port registers being added to the chip. The adjacent-cogs approach would use 16 memories, adding 512 2-port registers.
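
The register totals work out as follows, counting one "memory" as the two sets of 16 registers shared by a pair of cogs (variable names are illustrative):

```python
COGS = 16
REGS_PER_SET = 16
regs_per_memory = 2 * REGS_PER_SET       # two sets of 16, one per direction

paired_memories = COGS // 2              # each cog talks to one partner
ring_memories = COGS                     # each cog talks to both neighbours

paired_total = paired_memories * regs_per_memory
ring_total = ring_memories * regs_per_memory
print(paired_total, ring_total)
```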

jmg wrote: »

I'd agree buried pins are not that flash for Data path flows, but very useful as Boolean flags, which is the real intent surely ?
Many of the new bitops seem to have > 64 built into the opcodes already, but I think the WAITPAE WAITPAN
WAITPBE WAITPBN opcodes may need 3rd copies. (WAITPCE WAITPCN ?)

No, I do not mean for them to be used for bitops/flags, at least not exclusively. If you want to write code where one cog looks at bit patterns from the other cog as a set of flags, that's fine. But this is meant to be more general than that. This simply allows, in the case of the 16-register example, each paired cog to write up to 16 32-bit values that can be read by the other paired cog. How those values are interpreted is entirely up to the developer. Read back over my examples to get an idea of the (non-exhaustive list of) ways they could be used.

jmg wrote: »

Another variant would allow COGs to attach Serdes (simple 32b shifters) onto any of these pins.
That will give flexible links, but with 16 Opcode cycle latencies.

If they could be coded as DDR, that may get to 8 Opcode cycle latencies, combined with flags at 1 cycle latency.

I'm not sure what the ramifications would be of hooking this additional functionality in. Personally, I would leave them as simple sets of one-way buffers between every two cogs. They just provide very fast, local, cog-to-cog communication without involving the hub.

jmg · 2014-05-13 16:21

Seairth wrote: »

They just provide very fast, local, cog-to-cog communication without involving the hub.

The new 16 Memory Rotator Chip is suggesting,
http://forums.parallax.com/attachment.php?attachmentid=108689&d=1400011603
may go some way toward improving COG-COG links.
Small bursts I think can now behave a bit like a FIFO

Something like this transfers 5 longs, in a 'tight burst' - how tight, is still tbf, but I think the right values can have no inter-long delay adders

Imagine COG0 does ,from t=T
WRLONG Da,PTRx++
WRLONG Db,PTRx++
WRLONG Dc,PTRx++
WRLONG Dd,PTRx++
WRLONG De,PTRx++

and up to 8 SysClks later, (t<=T+8) COG8 starts, with the same Initial pointer value
RDLONG Da,PTRx++
RDLONG Db,PTRx++
RDLONG Dc,PTRx++
RDLONG Dd,PTRx++
RDLONG De,PTRx++

Cluso99 · 2014-05-14 23:18

I had a different idea (with the new hub method)...

Basically, what we now want (reduces power while waiting) is a simple method to indicate to another cog(s) that info is ready/waiting/required in hub.

It could be just 16 bit bus between all cogs.
Each cog puts its own bit on the bus according to its cogid (anytime, no hub waiting)
All cogs can read the whole 16 bit word (anytime, no hub waiting)

Supporting instructions could be
SET/CLR/XOR 'no operands are required
TEST #othercogid WZ/WC 'checks a bit
GET <reg>,<mask> WZ/WC 'checks bit mask
WAIT <mask/bit> 'waits for mask/bit in low power mode.

This means that each cog can signal other cog(s) to take some action (typically accessing hub to do/check) and provides a simple way to suspend cog(s) saving power. It saves them from continually polling hub waiting for something to do (as we currently do). With bigger hub, power will be higher, so no need to waste it.
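
A software model of that 16-bit bus might look like the sketch below. The class `SignalBus` and its method names are assumptions that loosely mirror the suggested SET/CLR/XOR, TEST, GET and WAIT instructions; the low-power stall itself is not modelled, only the condition it would wait on.

```python
class SignalBus:
    """Toy model: one signalling bit per cog, visible to all cogs at once."""

    def __init__(self):
        self.bits = 0

    def set(self, cogid):
        self.bits |= 1 << cogid          # a cog drives only its own bit

    def clr(self, cogid):
        self.bits &= ~(1 << cogid)

    def xor(self, cogid):
        self.bits ^= 1 << cogid

    def test(self, othercogid):
        return bool((self.bits >> othercogid) & 1)   # check one cog's bit

    def get(self, mask):
        return self.bits & mask                      # check several bits

    def would_wait(self, mask):
        return self.get(mask) == 0       # WAIT stalls while no masked bit is set


bus = SignalBus()
bus.set(3)                               # cog 3 signals "data ready in hub"
bus.set(5)
waiting_on_cog_7 = bus.would_wait(1 << 7)
snapshot = bus.get(0xFFFF)
```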

jmg · 2014-05-15 00:20

Cluso99 wrote: »

Basically, what we now want (reduces power while waiting) is a simple method to indicate to another cog(s) that info is ready/waiting/required in hub.

It could be just 16 bit bus between all cogs.
Each cog puts its own bit on the bus according to its cogid (anytime, no hub waiting)
All cogs can read the whole 16 bit word (anytime, no hub waiting)

Supporting instructions could be
SET/CLR/XOR 'no operands are required
TEST #othercogid WZ/WC 'checks a bit
GET <reg>,<mask> WZ/WC 'checks bit mask
WAIT <mask/bit> 'waits for mask/bit in low power mode.

This means that each cog can signal other cog(s) to take some action (typically accessing hub to do/check) and provides a simple way to suspend cog(s) saving power. It saves them from continually polling hub waiting for something to do (as we currently do). With bigger hub, power will be higher, so no need to waste it.

Sounds good, so good in fact, I suggested this back in #6

Booleans for Signaling.

Cluso99 · 2014-05-15 01:28

jmg wrote: »

Sounds good, so good in fact, I suggested this back in #6 Booleans for Signaling.

Oh well, cannot hurt to re-raise it again

AntoineDoinel · 2014-05-15 01:35

Hmm, no.

More like "boleans" for signaling... that another "32 booleans for signaling" are ready to be read by our cog

Or have been read by the other cog.

Nothing to do with HUB, that has greatly increased bandwidth, but far too much latency.

More like a "Parallel Master Port", with bidirectional registers and strobe lines.

Seairth · 2014-05-15 03:58

jmg wrote: »

The new 16 Memory Rotator Chip is suggesting,
http://forums.parallax.com/attachment.php?attachmentid=108689&d=1400011603
may go some way toward improving COG-COG links.
Small bursts I think can now behave a bit like a FIFO

Something like this transfers 5 longs, in a 'tight burst' - how tight, is still tbf, but I think the right values can have no inter-long delay adders

Imagine COG0 does ,from t=T
WRLONG Da,PTRx++
WRLONG Db,PTRx++
WRLONG Dc,PTRx++
WRLONG Dd,PTRx++
WRLONG De,PTRx++

and up to 8 SysClks later, (t<=T+8) COG8 starts, with the same Initial pointer value
RDLONG Da,PTRx++
RDLONG Db,PTRx++
RDLONG Dc,PTRx++
RDLONG Dd,PTRx++
RDLONG De,PTRx++

Yeah, that occurred to me too. The most immediate issue I see with this is having to place paired cogs opposite of each other to get the 8 clock cycles between each cog. It seems to me that adjacent cogs will be more useful for pairing, especially if pin groups are still going to be attached to specific cogs.

jmg · 2014-05-15 16:45

Seairth wrote: »

It seems to me that adjacent cogs will be more useful for pairing, especially if pin groups are still going to be attached to specific cogs.

Probably.
There are access-delays (time to get spoke) and pathway delays (time for COGxx to read new data from COGyy)

In adjacent COG cases pathway delays will be Asymmetric - a lower CogDiff latency -> next-in-line, and a 16-CogDiff for data going the other way.
Using the BLOCK opcodes would likely be the best way to code transfers, if you wanted COG location to really not matter at all (one COG has to wait until the first COG's WRBlock is complete before starting its RDBlock).
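
The asymmetry can be illustrated with modular arithmetic. This is a guess at the shape of the delay only, assuming a 16-position rotation in which data written by one cog becomes reachable by another after a number of clocks equal to their position difference; the real rotator timing was not specified in the thread, and `spoke_delay` is a made-up name.

```python
def spoke_delay(writer, reader, cogs=16):
    """Assumed pathway delay in clocks from writer's slot to reader's slot."""
    return (reader - writer) % cogs


adjacent_forward = spoke_delay(0, 1)    # next-in-line: short delay
adjacent_reverse = spoke_delay(1, 0)    # going the other way: 16 - CogDiff
opposite_forward = spoke_delay(0, 8)    # opposite cogs: symmetric
opposite_reverse = spoke_delay(8, 0)
print(adjacent_forward, adjacent_reverse, opposite_forward, opposite_reverse)
```

Under this assumption only cogs placed 8 apart see the same delay in both directions, which matches the earlier observation about having to place paired cogs opposite each other.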
