Is there anything else that has been bothering you about the design?
While I was thinking about other ways to implement cog-to-cog communication, I came across this old thread. Did the change we discussed in that ever get implemented? It's not a big deal if it didn't, as it was more of a "nice to have" than anything else.
Does anyone else see anything? (okay, aside from LUT sharing)
@Chip - I mentioned the internal port recently and you answered that it would require some resources.
But that was before the message that there is still some space on the die.
Wouldn't this fully symmetric, COG/PIN-agnostic mechanism also help @Cluso99?
But of course only if no other object used it, same as with pins anyhow.
...
Just curious about how cog-to-cog comms would have to work in the first place. Wouldn't a cog that is to receive data from any other cog have to read the data as it comes in, without any method of flow control? Here is my crazy plan:
39 bit OR bus routed between all cogs
32 bits for data
1 bit for setting data available flag
1 bit for setting data received flag
1 bit for flagging the bus as locked
4 bits for ID
This method allows any cog to load up a Long on the bus, and set the data available flag.
The receiver cog gets the data, does anything it needs to do, and then sets the data received flag (which also resets the data available line). The sender can then load up another long and set the DA flag again. The receiver sees the data available flag toggled, sees new data, gets it, does its thing, then sets data received (which resets DA).
The problem is that only two cogs can talk at a time, unless a lock scheme were added to the bus so that it could be locked for a R/W cycle and then used by others when it is not locked. Or better yet, add an extra 4 bits for an ID. The ID is included with the data packet, and the receiver looks for a specific ID (if desired) and ignores others. Seems simple enough! For one cog to another, it would be fast. And it gets rid of the adjacent-cog requirement.
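To make the proposed bus word concrete, here is a minimal Python sketch of the field layout. Only the widths (32 + 1 + 1 + 1 + 4 = 39) come from the post; the bit positions and the names `pack_bus`, `unpack_bus` and `or_bus` are assumptions made for illustration.

```python
# Hypothetical layout for the proposed 39-bit bus. Only the field widths
# come from the post; the bit positions below are assumed.
DATA_BITS, ID_BITS = 32, 4
DA_BIT = 32      # data available flag
DR_BIT = 33      # data received flag
LOCK_BIT = 34    # bus locked flag
ID_SHIFT = 35    # 4-bit ID in bits 35..38
BUS_WIDTH = ID_SHIFT + ID_BITS  # 39


def pack_bus(data, cog_id, da=False, dr=False, lock=False):
    """Build the value one cog would drive onto the bus."""
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data must fit in 32 bits")
    if not 0 <= cog_id < (1 << ID_BITS):
        raise ValueError("id must fit in 4 bits")
    return (data | (int(da) << DA_BIT) | (int(dr) << DR_BIT)
            | (int(lock) << LOCK_BIT) | (cog_id << ID_SHIFT))


def unpack_bus(word):
    """Split a bus value back into its fields."""
    return {
        "data": word & 0xFFFF_FFFF,
        "da": bool(word >> DA_BIT & 1),
        "dr": bool(word >> DR_BIT & 1),
        "lock": bool(word >> LOCK_BIT & 1),
        "id": word >> ID_SHIFT & 0xF,
    }


def or_bus(*drivers):
    """An OR bus carries the bitwise OR of everything the cogs drive."""
    result = 0
    for value in drivers:
        result |= value
    return result
```

Because the bus is an OR of all drivers, two cogs sending at once would merge both their data and their IDs, which is why some lock or arbitration step is needed before the ID field can be trusted.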
I don't want to mention that x processor, but the smart pins got some of its features, as I understand them, and with "DMA" (the streamer), which is a welcome step ahead. What did not get implemented are the channels. T Chap mentions something similar. Channels are buffered and numbered; two cores open one end each. The buffers are FIFOs with a depth of something like 16 writes. They also work at core speed. They are super useful (and one of the few ways of doing things). I used them for video output. I just always felt that using a core to bit-bang VGA was a waste...
Such a channel is not really needed if the signaling works. It may be a bit slower because of polling and reading from HUB but it means that the data do not have to be rebuffered. I find it a nice compromise.
I'd like to get my hands on some P2s already... ah well, just a little more time. Well done!
A way for a smartpin to output an NCO clock > 40 MHz (for an 80 MHz fsys clock). At the moment the max NCO frequency that can be generated is 40 MHz ($8000_0000 added each cycle), and if you try to push it further it folds back: if you ask for 50 MHz it'll output 30 MHz, etc. The streamer can go all the way to 80 MHz, so there's a gap between 40 and 80 MHz, and we'll probably want this for LCD/HDMI/external DACs.
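The fold-back can be reproduced with a small Python model of a phase accumulator whose top bit is the output. This is a sketch that assumes a 32-bit accumulator (as the $8000_0000 figure suggests); `nco_output_hz` is a hypothetical helper name.

```python
def nco_output_hz(requested_hz, fsys_hz=80_000_000, bits=32):
    """Frequency seen on the accumulator MSB for a requested NCO rate."""
    modulus = 1 << bits
    # Value added to the accumulator on every system clock.
    increment = round(requested_hz * modulus / fsys_hz) % modulus
    # Increments above half the modulus alias back down (fold about fsys/2).
    effective = min(increment, modulus - increment)
    return effective * fsys_hz / modulus
```

With the 80 MHz default, `nco_output_hz(40_000_000)` gives 40 MHz, while `nco_output_hz(50_000_000)` gives 30 MHz, matching the figures above.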
That's a very good point - the streamer at SysCLK (80MHz), needs some way to generate a compatible/matching Clock.
I'm also unclear if the Streamer has Hardware Handshake yet ?
Those signals are needed when you want to bulk-stream to Operating systems or similar hosts that you cannot guarantee will swallow always at full speeds. 99% of the time, they can, but some mechanism is needed for the rare stalls.
Maybe someone can connect FTDI FIFO parts and test this ?
Their data is vague on any FIFO hysteresis on handshakes, and they seem to focus on FPGA connect, where handshakes are immediate and a couple of lines of verilog.
FT600 would be cool, but that has more constraints than FT232H/FT2232H
Addit:
FT2232H gives these specs
* USB to parallel FIFO (async) transfer data rate up to 10Mbyte/sec.
Looks like a Test+R/W+Strobe REP loop works here, so a > 80MHz P2 should manage close to 10M ?
* Single channel synchronous FIFO mode for transfers up to 40 Mbytes/sec.
The Sync mode outputs a 60MHz clk, that the P2 may be able to lock to ?
I'm not sure how else the P2 can deliver data carefully aligned to the 60MHz FT clock ?
Assuming it can lock, the P2 also needs to be able to generate pulses down to 1/60MHz, suggests a 120MHz SysCLK ?
FT600 generates the same 60 or 100MHz clock out, that all data needs to align to.
I've been thinking that what would be handy for cog-to-cog messaging would be for the sender cog to use setq+wrlong, then 'attention', and then the receiver cog doing a setq+rdlong. There you can move an N-long message pretty quickly, without any FIFO uncertainties. You would just have waits to get to the initial egg-beater position.
By the way, this could be optimized at run-time, once you know the relative cog numbers, by picking the starting long offsets of sender and receiver that result in the lowest latency.
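As a rough illustration of that run-time optimization, here is a Python sketch under a simplified egg-beater model: hub longs are interleaved over `slices` RAM slices, and on clock `t` cog `c` faces slice `(c + t) mod slices`. The model, the 16-slice default, and all function names are assumptions; instruction overhead and the latency of the 'attention' signal are ignored.

```python
def slot_wait(cog, long_index, t, slices=16):
    """Clocks a cog must wait at time t before its window reaches long_index."""
    return (long_index - cog - t) % slices


def message_latency(sender, receiver, start, n_longs, t=0, slices=16):
    """Clocks from time t until the receiver has read an n-long message."""
    w_send = slot_wait(sender, start, t, slices)      # sender waits for its slot
    t_written = t + w_send + n_longs                  # burst write, 1 long/clock
    w_recv = slot_wait(receiver, start, t_written, slices)
    return w_send + n_longs + w_recv + n_longs        # plus burst read


def best_start(sender, receiver, n_longs, t=0, slices=16):
    """Starting long offset (mod slices) with the lowest total latency."""
    return min(range(slices),
               key=lambda a: message_latency(sender, receiver, a,
                                             n_longs, t, slices))
```

In this model the sender's wait can be driven to zero by choosing the start offset, while the receiver's wait works out to `(sender - receiver - n_longs) mod slices`, so it depends only on the relative cog numbers and the message length.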
That's quite a few hoops you have jumped through there, but the idea is quite interesting.
...
This can also work between more than just 2 COGS, should anyone be that ambitious.
It is unlikely the optimal slot would be the same address for target COGS. Were that ever to happen, the next slot is only 1 SysCLK worse.
...
Spacing the cog allocation by more than one cog to find a more suitable window of the egg-beater will work only if the two cogs are doing internal operations. The address you read from or write to must be calculated from how far away the partner cog is, but you always need to keep the code in sync with the egg-beater window. As soon as the cog needs to interact with real-world I/O events, it will lose sync with the hub.
It is argued that such resource sharing between COGs is there to be used if wanted but can otherwise be ignored. Perhaps true, but as soon as code that does use it hits OBEX all code would become polluted with the problem.
I hope everyone understands why sequential LUT sharing is problematic.
I don't want to kill any useful feature, but anything that keeps us from being able to share objects without caveats is a bit like kryptonite to the Prop2.
Usually when you buy something (TV, oven, toys, ...) you read its operating instructions to learn how to operate it properly.
The same applies to an OBEX object. It is enough that the thing is documented.
Regarding object isolation, perhaps I don't understand the meaning. Usually objects are started at the beginning of the application program. If the object's start method does the cognew (or coginit) for the two cogs and waits for the needed cog boot time, no other cogs can start in the middle, because the objects' start methods are executed sequentially by the application program. Where is the isolation lost here? It is only a matter of how the objects are coded.
Freq doubling with R/C+XOR gate is pretty easy, but the R/C needs to be tuned so the duty ends up around 50%
So you cannot change things later in software without replacing the R/C values.
Most MCUs go the other way instead, internally halving the frequency for less jitter.
Usually when you buy something (TV, oven, toys, ...) you read its operating instructions to learn how to operate it properly.
Ah, actually no. TVs, ovens, toys etc. are just expected to work as one would expect. Are you one of those strange people that studies the instructions before assembling IKEA furniture?
OK, I never read a manual for my Lego or Meccano as a kid. Things got tougher with my Philips Electronic Engineer kit.
Regarding object isolation, perhaps I don't understand the meaning. Usually objects are started at the beginning of the application program. If the object's start method does the cognew (or coginit) for the two cogs and waits for the needed cog boot time, no other cogs can start in the middle, because the objects' start methods are executed sequentially by the application program. Where is the isolation lost here? It is only a matter of how the objects are coded.
That is perhaps true of small and simple programs. But heck, even my parallel FFT for the P1 starts COGs dynamically at run time.
We are now entering the world of half a megabyte of code and 16 cores. We had better be sure all cores are equal else we are laying a mine field.
A clock doubler wouldn't do what's required. Imagine the Streamer is updating data on 2 out of 3 system clock cycles (2/3 * 80 MHz ≈ 53 MHz). In effect you need an edge transition 6.25 ns after each Streamer update, but you don't want an edge on that 3rd cycle gap (when the streamer hasn't updated, it's skipping a cycle).
However what could work is using an external clock generator instead of the P2 PLL, to clock the prop at 80MHz or whatever we end up at, and AND that clock signal with an NCO smartpin signal in sync with the streamer NCO. As long as you got it all nicely synced that should work.
Chip, when you say it's difficult to get it all the way to the pin, I assume it clobbers timing constraints, as surely the system clock is already at the smartpins for NCO updates etc.?
Sorry, I'm in a remote noisy factory, otherwise I'd draw this out neatly.
if I use 1 character per nanosec to illustrate, 12 chars per 80 MHz cycle,
The streamer outputs this data
AAAAAAAAAAAABBBBBBBBBBBBbbbbbbbbbbbbCCCCCCCCCCCCDDDDDDDDDDDDdddddddddddd
where lowercase is a skipped NCO step (retains data from previous)
an NCO pin could be programmed to do this, if in sync
~~~~~~~~~~~~~~~~~~~~~~~~____________~~~~~~~~~~~~~~~~~~~~~~~~____________
note there are no transitions between the A and B, nor C and D data updates
but what you really want for latching that data into an external SPI chip is something like this
~~~~~~______~~~~~~__________________~~~~~~______~~~~~~__________________
which is just the NCO signal from above ANDed with the system clock (which sounds easy, but probably isn't)
you can 'double' that middle signal, but it's not what you need, due to the lack of transitions beyond 40 MHz
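The two lower traces can be generated mechanically, which may help when trying other update ratios. Below is a minimal Python sketch using the same 12-characters-per-clock convention; `streamer_waveforms` is a hypothetical helper, and the gated clock is simply modelled as the enable ANDed with a 50% duty system clock that is high in the first half of each cycle.

```python
def streamer_waveforms(pattern, chars_per_clock=12):
    """Return (enable, gated_clock) traces as strings of '~' and '_'.

    pattern holds one bool per system clock, True where the streamer updates.
    """
    half = chars_per_clock // 2
    # One system clock period: high for the first half, low for the second.
    sysclk = "~" * half + "_" * (chars_per_clock - half)
    enable, gated = [], []
    for update in pattern:
        enable.append(("~" if update else "_") * chars_per_clock)
        # enable AND sysclk: clock passes through only on update cycles.
        gated.append(sysclk if update else "_" * chars_per_clock)
    return "".join(enable), "".join(gated)
```

Calling `streamer_waveforms([True, True, False])` reproduces one period of the 2-out-of-3 traces drawn above.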
However what could work is using an external clock generator instead of the P2 PLL, to clock the prop at 80MHz or whatever we end up at, and AND that clock signal with an NCO smartpin signal in sync with the streamer NCO. As long as you got it all nicely synced that should work.
I think in this solution, the NCO also needs a special AND/NAND mode, as you do not want a long idle in one state, generating many clocks, when the other idle state does not.
"all nicely synced " is also tricky, as there will be significant pin buffer delays, so it is much easier to get signals arriving at 2 pins in appx phase balance, than trying to get phase across a complete In-ClkTree-Out.
That total path could easily exceed half a period.
Chip, when you say it's difficult to get it all the way to the pin, I assume it clobbers timing constraints, as surely the system clock is already at the smartpins for NCO updates etc.?
I think it relates to testing closure and any signal that is not clock qualified becomes special-case.
Possibly only for that case (maybe 2/3 isn't a good example).
Say you wanted to update 7/8 (70MHz, remember if silicon hits 160 MHz we're quite likely to want to update at 145 or 148 or whatever HDMI is):-
AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCCCCCCDDDDDDDDDDDDEEEEEEEEEEEEEFFFFFFFFFFFFGGGGGGGGGGGGgggggggggggg
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~____________
~~~~~~______~~~~~~______~~~~~~______~~~~~~______~~~~~~______~~~~~~______~~~~~~__________________
note that middle signal is like a 10MHz repeat, doubling it only gives you 20 MHz, not 70 MHz
jmg do you know whether that si5351 can have different outputs at the same frequency but phase shifted?
Ok. If I delay that NCO signal by ~10 ns, wouldn't the doubler produce the right waveform?
Depends what is 'right'.
A problem with Streamer and Clock, is ideally the streamer needs to advance on the passive clock edge and be stable on the active one. With a generated clock, that can work up to pin-sampling speeds, but if you want to stream at SysCLK, you really need to drive the same signal that shifts the streamer to a pin.
You could design the streamer to advance on both edges (DDR), and then externally do edge->pulse in XOR, and that would have correct phase, (with usual RC caveats)
... which is just the NCO signal from above Anded with the system clock
I would avoid using NCO terminology, and use Clock Enable instead.
To me a NCO signal can stretch while in either state, which is not what you want.
You can generate a Clock Enable signal from SysCLK (100%) down, and that Clock enable can be Gated with SysCLK to give a pin-copy of the effective applied Streamer Clock. (even tho the streamer does not actually gate the clock, it just uses a Clock Enable)
Last night I wrote a reply but did not post it. I was so disheartened by the replies against it, it kept me awake most of the night. Today I put it out of my mind, deciding to ignore it.
It seems like we have a fabulous new vehicle design. It has 4 doors with 5 seats, 2 in the front and 3 in the rear. But we want to give every occupant equal access to their seat. There is a lazy Susan (a spinning disc like you often find in Chinese restaurants) that rotates all seats past one door. So this way, every occupant has equal access via the same single door. Of course the occupants must wait in line for their turn to access their seat. This is the hub.
Another way is to access the seats via the four doors, so four occupants have equal access to their seats, but there is no way the occupant of the middle back seat has equal access. The result... Remove the 5th seat. After all, who wants a 5th seat anyway?
No one wants to upset the 4 main occupants' access to permit the vehicle to carry 5 occupants! The idea of allowing the 5th occupant to enter the vehicle by cooperating with either of the two other rear occupants was considered, but was rejected!
I think we have jumped over something and we are mixing arguments... again, please correct me if I am wrong. The idea that I found very compelling... and is worth bending the rules for... was cog signaling. Suddenly the conversation switched over to LUT sharing, which I think is a different kettle of fish:). The original idea was to be able to send a certain number of bits from one cog to the next one, without going through the hub. I don't think this has been ruled out.
In my mind, it isn't so much a car as it is like being on an elevator going in the wrong direction... it would be nice to jump through a door between elevators rather than have to go to the middle floor get out and then wait for who knows how long for the other elevator to arrive.
To make the car analogy work... imagine being in the driver's seat with the kids telling you to turn right and your wife telling you to go forward... and the only thing you can think is "I need a new car":)
yee-haw!
Now I'm nervous. Get to testing people. If there are remaining bugs, we gotta find them.
Thinking about this effect, there could be a place for two simple modes of HUB interaction ?
Mode F is as now, where the COG stalls for its eggbeater slot, then immediately continues. The net delay incurred varies.
Mode D (new) takes a fixed 16c for HUB, and the actual hub transfer occurs somewhere inside that 16c.
Mode F is used for fastest operations, where jitter is tolerated, and code can phase-lock for more looping speed.
Mode D naturally has no added jitter, so makes other code simpler, at a slight average speed cost.
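The trade-off between the two modes can be put into numbers with a tiny Python sketch. It assumes a 16-clock hub window and that every phase is equally likely; the function names are hypothetical and only the hub wait is modelled, not the rest of the instruction time.

```python
from statistics import mean


def mode_f_delays(window=16):
    """Stall for each possible phase when the cog continues as soon as its slot arrives."""
    return list(range(window))


def mode_d_delays(window=16):
    """Every access is padded to the full window, whatever the phase."""
    return [window] * window


def summarize(delays):
    """Average delay and jitter (spread between best and worst case)."""
    return {"mean": mean(delays), "jitter": max(delays) - min(delays)}
```

Under these assumptions Mode F averages 7.5 clocks with up to 15 clocks of jitter, while Mode D always costs 16 clocks with no jitter.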
https://www.maximintegrated.com/en/app-notes/index.mvp/id/3327
That doubles input frequency with just a comparator and an XOR gate.
Maybe that'd give 80 MHz clock from 40 MHz input for up to 80 MHz streaming?
hehe, Maxim sell comparators, so no surprise they used one...
You can frequency double with just an XOR gate and a delay element (Maxim uses their product + 5 other parts).
The 1G57/58/97/98 configurable gates include Schmitt triggers, so a simple RC+XOR can frequency double.
Note besides adding parts, this adds an external delay, and does not scale with PLL choices that well, so is less than ideal.
Some delay might be good for setup time on one edge and shouldn't matter if triggering on other edge, right?
For example, rolnib d,s,#2 works fine,
but
mov myvar,#2
rolnib d,s,myvar doesn't work.
The compiler suggests using a '#' and when I put that in front of the variable (of course) it doesn't work.
Yes, it has to be a constant, since it's only a 3-bit field.
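For readers who want to see why the nibble index has to be a constant, here is a small Python model of the operation as I understand it (shift D left by one nibble and insert nibble N of S at the bottom). The exact semantics are an assumption, and `rolnib_variable` is only a hypothetical illustration of one workaround: pre-shift the source by a run-time amount, then always take nibble 0.

```python
MASK32 = 0xFFFF_FFFF


def rolnib(d, s, n):
    """Model of rolnib d,s,#n: rotate nibble n of s into the bottom of d."""
    if not 0 <= n <= 7:
        raise ValueError("nibble index is a 3-bit constant (0..7)")
    nibble = (s >> (4 * n)) & 0xF
    return ((d << 4) & MASK32) | nibble


def rolnib_variable(d, s, n_var):
    """Same result with a run-time index: shift s first, then use nibble 0."""
    shifted = (s >> (4 * (n_var & 7))) & MASK32
    return rolnib(d, shifted, 0)
```

For example, `rolnib(0x12345678, 0xABCDEF01, 2)` and `rolnib_variable(0x12345678, 0xABCDEF01, 2)` both give `0x2345678F` in this model.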
Exactly.
It specs 333ps per step on phase, and it looks like +/- 45ns is the legal range. (appx +/- 127?)