About cogs needing to communicate quickly. What about being able to see other cogs' fast r/w pointer? Then, you could know when a new byte was streamed into memory and you could read it. Wait! That may not work because the other cog's read cache probably already picked up the old byte and is waiting to hand it off via RFBYTE. Never mind.
I'm still working on the faster smart-pin<>cog comms. I realized there was a subtle problem with the current USB scheme: When a smart pin updates the Z register, which then gets repeatedly streamed from the smart pin's 5-bit return message bus (1 sync bit + 4 data bits), the update will be ignored if the current message has not been wholly transmitted yet. This is needed to prevent lock-up on the cog's receiver, where it would otherwise keep syncing to message starts and never get a message. A whole message must come through at least once before it can be supplanted by a fresher one. Anyway, on the older, slower smart-pin-to-cog comm, it took 8 cycles to send out a 32-bit message. Meanwhile, a 12Mbps USB comm potentially updates on every 6th or 7th clock. Say the status changes back-to-back on USB bit periods... the second update wouldn't get through! That might strand a received byte! I've now changed the USB mode so that it updates on every USB bit clock. That will get the message through at the next opportunity. Better than that, though, USB mode now only sends words, not longs, as that's all that's needed for the USB mode. In the new scheme, it takes 1 clock per message for sync and size, and a clock for each nibble of possible bytes, words, or longs. This means that it will take 5 clocks to send the USB status from the smart pin to the cog, which is under the 6 or 7 clocks per bit that 12Mbps USB runs at with an 80MHz clock.
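The clock counts above are easy to sanity-check. Here is a minimal sketch in Python, assuming the rule exactly as described (one clock for sync and size, then one clock per 4-bit nibble of payload). The function names are made up for illustration and are not part of any P2 tool.

```python
import math

def clocks_per_message(payload_bits, bus_data_bits=4):
    """New scheme as described: 1 clock for sync/size + 1 clock per nibble."""
    nibbles = math.ceil(payload_bits / bus_data_bits)
    return 1 + nibbles

def clocks_per_usb_bit(sys_clock_hz, usb_bits_per_sec):
    """System clocks that elapse during one USB bit period."""
    return sys_clock_hz / usb_bits_per_sec

# Old scheme: a 32-bit message over 4 data lines took 32 / 4 = 8 clocks.
old_long_clocks = 32 // 4

# New scheme: USB status is sent as a 16-bit word, so 1 + 4 = 5 clocks.
word_clocks = clocks_per_message(16)

# 80 MHz system clock against 12 Mbps full-speed USB.
usb_bit_clocks = clocks_per_usb_bit(80_000_000, 12_000_000)

# A word message completes inside even the shortest (6-clock) bit period.
fits = word_clocks <= math.floor(usb_bit_clocks)
```

With these numbers the old 8-clock long cannot finish inside a 6-clock bit period, while the 5-clock word can, which is the point of the change.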
Yep, it can totally work between, say, HubExec fetches. There is still potential for stalls if the write FIFO can't get the slots it needs, but that shouldn't be a problem in most cases, as even HubExec going full tilt only uses 50% of the FIFO's potential bandwidth.
Ha, I guess if such a FIFO was added then the existing FIFO's write features come into question. How much real-estate could be traded in from that if it could only do reads? The Streamer would lose its data input feature though.
It is a bit difficult to think about. I'm realising there should be some logic for keeping order. Probably the write FIFO should always get its turn on the next Hub rotation; if HubExec is due then it'll just have to wait another rotation. Same for a RDxxxx instruction: the write FIFO must be flushed before the regular read takes place. This should eliminate potential problems with dirty addresses but obviously can introduce even larger read stalls. Any two of those operations can interact in alternating harmony, I think; all three together might be messy.
Not acceptable?
EDIT: It would be nice if say a RDLONG and WRLONG could occur in the same Hub rotation but I fear this to be fraught with potential complications.
My thinking was that if it was workable without much added silicon, Chip would have jumped at it. Since he didn't do it, my assumption is that he had very good reasons for not doing it... which might take a lot effort to explain to us and would end up with lots of conversation and the same result.
With my above example, I am using one cog to wrfast and one (the analytic cog) to rdfast. Obviously, the analytic cog can't rdfast and wrfast at the same time... so I store the results in a lut buffer and ping-pong back and forth between reading and writing. The problem with this is that there is almost as much overhead in using a lut buffer as there is in simply writing with a normal wr. The beauty of Cluso's suggestion is that the data is coming in at the fastest rate possible, without using a rdfast... which frees the analytic cog to remain in wrfast mode, increasing the potential throughput by several times. Of course the same is true in the reverse.
Even when the data is coming in through the streamer, if analysis is required, you have to read it... and then write something back.
Cluso's addition greatly improves this general use case.
And since we have paired pins... why not have paired cogs?
I also like Cluso's addition.
Perhaps, up to a point, some things can be achieved with the current state of things. But Cluso's idea will open the performance to a greater user base, the ones that are not so clever and don't like puzzling with ping-pong, interleaving ... and such things.
WRCOG D/# [WC,WZ]
Writes the 9-bit D/# data to the latch (to COG+1) and sets the DataAvailable latch (depends on WC/WZ setting below).
If WC is specified, then if the DataAvailable latch is already set (i.e. the next cog has not read the data) then the data will NOT be written and the C flag will be set.
If WZ is specified, then bit8 of the latch will be masked OFF, irrespective of bit8 in D/#.
RDCOG D [WC,WZ]
Reads the 9-bit data from the latch (from COG-1) and clears the DataAvailable latch (depends on WC/WZ setting below).
If WC is specified, C will be set if the DataAvailable latch was previously set (i.e. data is valid).
If WZ is specified, then bit8 of the latch read will be masked OFF, and Z will be set to the original bit8 value.
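To make the flag behaviour concrete, here is a rough software model of the proposed one-way latch, written in Python. It is only a sketch of the description above, not anything that exists in the P2: the class and method names are invented, and it assumes a read always clears DataAvailable (the proposal says this "depends on WC/WZ" without spelling out how).

```python
class CogLatch:
    """Hypothetical model of the 9-bit latch between COG n and COG n+1."""

    def __init__(self):
        self.data = 0            # 9-bit latch contents
        self.available = False   # DataAvailable latch

    def wrcog(self, value, wc=False, wz=False):
        """Writer side. Returns the C flag: True means the write was refused."""
        value &= 0x1FF           # only 9 bits reach the latch
        if wz:
            value &= 0x0FF       # WZ: bit8 masked off regardless of D/#
        if wc and self.available:
            return True          # previous data unread: do NOT overwrite, C=1
        self.data = value
        self.available = True
        return False

    def rdcog(self, wc=False, wz=False):
        """Reader side. Returns (D, C, Z); C/Z are None when not requested."""
        was_available = self.available
        value = self.data
        self.available = False   # assumption: every read clears the latch
        c = was_available if wc else None
        z = None
        if wz:
            z = bool(value & 0x100)   # Z = original bit8
            value &= 0x0FF            # bit8 masked off in the result
        return value, c, z


latch = CogLatch()
first_refused = latch.wrcog(0x141, wc=True)    # latch empty, write accepted
second_refused = latch.wrcog(0x055, wc=True)   # unread data, write refused
received = latch.rdcog(wc=True, wz=True)       # byte 0x41 with bit8 set
```

With WC on both sides the pair behaves like a one-deep mailbox: the writer polls C until the write is accepted, and the reader polls C until valid data shows up.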
The real-estate requirements aren't much at all and as noted there could be savings. The concern is the difficulty of getting it right. It's another brain teaser to be added to the list.
What is the purpose of bit8?
Bit8 (the 9th bit) serves as a command bit. So when you want to pass things such as SYN, EOP, SE0, etc., you can do so easily. The prop is excellent for this as it has a 9-bit immediate mode.
I often use bit8 for this purpose. When using a rendezvous hub design, it is possible to use the 9th bit for special purpose, while retaining the ability to send a full standard byte too.
I could have asked for 32 bits plus the data available flip flop, and even both directions, but I didn't want to be greedy.
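As an illustration of the bit8 idea, here is a small Python sketch of a 9-bit encoding where bit8 marks a command and the low 8 bits still carry a full byte. The command codes and helper names are hypothetical, picked only for the example.

```python
CMD_FLAG = 0x100   # bit8: set means "command", clear means "data byte"

# Hypothetical command codes, chosen only for this example.
CMD_SYN = 0x00
CMD_EOP = 0x01
CMD_SE0 = 0x02

def encode_data(byte):
    """A plain data byte: bit8 clear, all 256 byte values remain usable."""
    return byte & 0xFF

def encode_cmd(code):
    """A command: bit8 set, command code in the low 8 bits."""
    return CMD_FLAG | (code & 0xFF)

def decode(word9):
    """Split a 9-bit value into (kind, payload)."""
    kind = "cmd" if word9 & CMD_FLAG else "data"
    return kind, word9 & 0xFF

# A short stream: sync marker, two data bytes, end of packet.
stream = [encode_cmd(CMD_SYN), encode_data(0x00), encode_data(0xFF),
          encode_cmd(CMD_EOP)]
decoded = [decode(w) for w in stream]
```

Every value fits in the 9-bit immediate range (0 to 511), so each one could be sent with a single immediate operand.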
I may be completely wrong here, but I really missed the ability to use portB on the P1 for internal comms between cogs. No need to have pins connected.
Not sure if it is doable on the P2 to have an internal port, even if just digital and not smart, but accessible from every COG.
Cluso's plan to have 'the next cog' accessible would need to be built in twice, one channel to cog n-1 and one for cog n+1. Then it would be possible to pipeline jobs/states over all of the cogs, just by using the right cogid when starting the next job/state handler.
This would give not just paired cogs, but daisy chained ones, one feeding the next one and back.
So one cog would have a fast channel to both neighbors but not to all 15(13) others. Just paired would be half Smile.
Maybe use the messaging system and allow to send some message to the cog-neighbours?
But basically 2 longs, one shared with each neighbor cog would already do perfect. Even 2 or 4 bits between the two cogs. Great for sync of actions or alike.
I am looking at this from a programmer point of view, not Verilog/Asic/whatever.
There was something like P2 to P2 comms over one pin each in the P2-Hot, usable to connect multiple P2 chips easily (and in a standardized way). Like what the company whose name starts with X has as Xlinks(?).
How about cog sends message to smartpin, and smartpin sends (relays) message to other cog, on same chip or second one?
Go 32 bits and use the Z flag instead of bit8! Keep it one-way only, though. There! I got greedy for you!
It makes sense to have a flag, and align to the opcode reach.
Both directions is not silly, as this has cost a Memory location, which is valuable, & BUS routing has the wires already there too.
Both directions allows co-processor like operation.
Besides the 9b and 32b suggestions, there is also 18b, which aligns with opcode reach too but is only 56% of the register cost of 32b.
But we are talking about only one register between each cog, for a total of 16 registers. Sixteen 33b (32b plus Z, as I suggested) registers shouldn't add that much. Anything less would require multiple WR/RD cycles to pass a single 32-bit register, which would somewhat defeat the point of making this feature available primarily for speed.
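The width trade-off can be put into numbers. This is a quick back-of-envelope sketch in Python, assuming register cost scales linearly with bit count and that a narrower latch simply needs several WR/RD transfers to move one 32-bit register. The helper name is illustrative.

```python
import math

def transfers_needed(payload_bits, latch_bits):
    """WR/RD pairs needed to move one payload through a latch of this width."""
    return math.ceil(payload_bits / latch_bits)

# Transfers required to pass a single 32-bit register.
per_width = {width: transfers_needed(32, width) for width in (9, 18, 32)}

# Relative register cost of an 18-bit latch versus a 32-bit one.
cost_18_vs_32 = 18 / 32            # 0.5625, i.e. about 56%

# Total storage for sixteen 33-bit registers (32 data bits plus Z).
total_bits_33 = 16 * 33            # 528 bits, data-available latches not counted
```

So the 18b option saves a little under half of the register bits but doubles the number of transfers per long, and 9b needs four.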
I am away from home trying to follow the conversation about this suggestion. Seems to be split between several threads. I love the dual ported LUT RAM. Solves a lot of issues, but the one it doesn't seem to address is cog-cog coms in a perfectly deterministic manner.
I think this is right?
I think Chip is pondering this, but fixing real issues in the meantime.
The impacts depend on how Dual-porting is done.
I thought Chip was meaning Dual-ported as in Left/Right COGs able to RMW, and that bypasses any HUB slots, so would be perfectly deterministic.
That's easy to say, but the LUT bus then has to route to both sides of a COG, which is far from compact and may have MHz issues.
Best to fix the things that need fixing first, and then see if there really is room for Dual-port ?
Thanks. That is pretty much what I understood, but even when I have time to read everything, I sometimes scan right by the information I am looking for. For the last few days my schedule left just enough time to scan.... very quickly.
Comments
So, a FIFO would gather random writes and perform them when the main FIFO wasn't using the hub slots?
The fast buffering benefits are cool though.
OK, I stop.
Mike
My proposal used two new simple instructions using the format
XXXX D/# [WC,WZ]
where are we with this?
I see two simple solutions...
1. A single 8/32 bit latch and a single bit SR Latch for one direction.
- A second pair for bi-directional comms (between adjacent cogs)
- the SR latch is set by a write to the 8/32 bit latch and cleared by a read (from the adjacent cog)
- the SR latch can report the "available data" status via the C bit
- there are 16 sets, permitting each cog to have comms between the lower and the upper adjacent cog
- - this actually permits a cog to have fast comms between two other cogs
WRCOGn D/# [WC]
n: 0 = cog-1, 1 = cog+1
WC: if used, returns the previous status of the SR latch (data available = not yet read by other cog)
If C=0 (no data available), then the data write will be performed and the SR latch will be set.
If C=1 (data not yet read), then the data write will NOT be performed.
WC: if not used, data will be written regardless of the SR latch, and the latch will be set to available
RDCOGn D [WC],[WZ]
n: 0 = cog-1, 1 = cog+1
WC: if used, returns the previous status of the SR latch (data available)
If C=0 (no data available), returns the data value in the latch, and the SR latch is cleared.
If C=1 (data available), returns the data value in the latch, and the SR latch is cleared.
WC: if not used, returns the data value in the latch, and the SR latch will be cleared.
WZ: if used, sets Z if the data read is zero
Perhaps the SR latch being set could trigger an Interrupt???
2. A small 32-bit dual port RAM
- between each pair of cogs (16 sets)
- Do we have an SR latch for each way????
Of these options... which is the simplest, most bullet-proof, most likely to take Chip less than an hour of thought, and least likely to bother anything that now works?
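For anyone trying to picture option 1, here is a rough Python model of it: one 32-bit latch plus SR latch per direction between adjacent cogs, used to pass a value down a chain of cogs. Everything here is a sketch under assumptions. The class and method names are invented, and it assumes the 16 sets wrap around so that cog 15's upper neighbour is cog 0, which the proposal does not state explicitly.

```python
NUM_COGS = 16

class CogMailboxes:
    """Hypothetical model of the per-neighbour latches (option 1)."""

    def __init__(self, num_cogs=NUM_COGS):
        self.num_cogs = num_cogs
        self.data = {}   # (source cog, destination cog) -> 32-bit latch value
        self.full = {}   # (source cog, destination cog) -> SR latch state

    def _neighbor(self, cog, n):
        # n: 0 = cog-1, 1 = cog+1 (wrap-around is an assumption)
        return (cog + (1 if n else -1)) % self.num_cogs

    def wrcog(self, cog, n, value, wc=False):
        """WRCOGn: returns C. With WC, a pending unread value blocks the write."""
        key = (cog, self._neighbor(cog, n))
        if wc and self.full.get(key, False):
            return True                      # C=1: data not yet read, no write
        self.data[key] = value & 0xFFFFFFFF
        self.full[key] = True                # SR latch set by the write
        return False

    def rdcog(self, cog, n, wc=False, wz=False):
        """RDCOGn: returns (D, C, Z); the SR latch is always cleared."""
        key = (self._neighbor(cog, n), cog)
        was_full = self.full.get(key, False)
        value = self.data.get(key, 0)
        self.full[key] = False
        return (value,
                was_full if wc else None,
                (value == 0) if wz else None)


# Daisy chain: cog 0 feeds cog 1, which feeds cog 2, and so on up to cog 4.
boxes = CogMailboxes()
boxes.wrcog(0, 1, 100)
for cog in (1, 2, 3):
    value, valid, _ = boxes.rdcog(cog, 0, wc=True)   # read from cog-1
    if valid:
        boxes.wrcog(cog, 1, value + 1)               # pass on to cog+1
result, result_valid, _ = boxes.rdcog(4, 0, wc=True)
```

Each stage only ever talks to its immediate neighbours, so no hub slots are involved in the model, which is the deterministic behaviour being asked for.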