Making the case for expanded ATN
Seairth
Posts: 2,474
(moved to a separate thread)
@cgracey, were you able to change the attention events to support the 16 inputs, as I suggested earlier? This could be used along with WRLUT* to ACK. If you keep the attention event a single ORed input, then the writer cannot determine who is ACKing.
About knowing who caused the attention (or LUT write): how do we parse the who information? I could see it being rapidly overwritten and lost. I think it's unrealistic to know who did something, but realistic to know that someone did something.
Just to make sure we are on the same page about the implementation I'm suggesting...
ATN Register
- 16-bit register, where each bit represents an incoming ATN signal from a cog
- Bits are set high by 16 incoming ATN signals (one from each cog)
- Bits stay high until GETATN is called
- Bits are set low when GETATN is called (and no incoming ATN signal is high)
ATN Event
- The existing event
- Set high when any one of the incoming ATN signals is high
- Set low by either WAITATN or GETATN (and no incoming ATN signal is high)
WAITATN
- Blocks the cog until an ATN event occurs
- Clears ATN event
- Does not interact with ATN register
GETATN D [WZ]
- Loads ATN register into D[15:0], optionally set Z if ATN register is zero
- Clears ATN event
- Sets ATN register to $00_00 (except for incoming ATN signals that are high)
COGATN D
- Signals the ATN event on every cog whose bit is set in the mask in D
- Sets the ATN bit corresponding to this cog's index on all masked cogs to high
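To make the semantics above concrete, here is a rough behavioural model in Python (not PASM, since the instructions are still only proposed). It is only a sketch under assumptions: the `Cog` class and `cogatn` function are names invented for illustration, ATN signals are treated as one-shot pulses (so the "incoming signal still high" case is not modelled), and blocking in WAITATN is represented by a return value.

```python
class Cog:
    """Toy model of one cog's ATN state (illustrative names, not real hardware)."""

    def __init__(self, index):
        self.index = index
        self.atn_register = 0    # 16 sticky bits, one per signalling cog
        self.atn_event = False   # the existing single ORed event

    def waitatn_would_block(self):
        """Model of WAITATN: True if the cog would block, else clear the event only."""
        if not self.atn_event:
            return True
        self.atn_event = False   # the ATN register is left untouched
        return False

    def getatn(self):
        """Model of GETATN D WZ: return (D, Z), then clear register and event."""
        d = self.atn_register & 0xFFFF
        self.atn_register = 0
        self.atn_event = False
        return d, d == 0


def cogatn(sender, cogs, mask):
    """Model of COGATN D: signal every cog whose bit is set in mask."""
    for target in cogs:
        if mask & (1 << target.index):
            target.atn_register |= 1 << sender.index  # record who signalled
            target.atn_event = True
```

With this model, cogs 3 and 5 each signalling cog 0 leaves bits 3 and 5 set in cog 0's register until its next GETATN, while WAITATN alone only clears the event.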
We have already discussed the trivial use case, where the cog doesn't care who is signalling it. In this case, use of WAITATN (for blocking) or GETATN WZ (for polling, contents of D are ignored) is sufficient. But, when a cog needs to know which other cogs are getting its attention, this is where the 16-bit ATN register comes into play.
For instance, suppose a cog wants to be sure that only specific other cogs can get its attention:
loop    WAITATN                 ' wait for an ATN event (clears event)
        GETATN  t0              ' gets ATN register into t0 (clears ATN register)
        AND     t0, mask        ' checks to see if ANY of the masked bits are set
        TJZ     t0, #loop       ' none set (i.e. ATN event was from a non-permitted cog), loop
                                ' otherwise, continue on...
Or, if a cog wants to wait until ALL other (permitted) cogs have signalled it:
        MOV     barrier, mask   ' initialize barrier with mask
loop    WAITATN                 ' wait for an ATN event (clears event)
        GETATN  t0              ' gets ATN register into t0 (clears ATN register)
        ANDN    barrier, t0     ' clear the signalled bits from barrier
        TJNZ    barrier, #loop  ' barrier still has outstanding cog bits, loop
                                ' otherwise, continue on...
Or, if one cog is feeding data to the next available worker cog(s) on demand:
loop    WAITATN                 ' wait for an ATN event (clears event)
        GETATN  t0              ' gets ATN register into t0 (clears ATN register)
send    BOTONE  t1, t0          ' get a cog index
        CLRB    t0, t1          ' clear the bit
        CALL    #send_data      ' send data to the cog in t1 (e.g. via HUB, WRLUTX/S, or smart pin)
        TJNZ    t0, #send       ' if there was more than one ATN signal, loop
        JMP     #loop           ' go back to waiting for an ATN event
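For anyone who wants to check the barrier logic without hardware, here is a minimal Python sketch of the same accumulation. The name `barrier_wait` is made up for illustration, and the list of values stands in for what successive GETATN calls would return.

```python
def barrier_wait(mask, atn_reads):
    """Return how many GETATN reads it takes until every cog in mask has signalled.

    atn_reads is a sequence of 16-bit values, one per simulated GETATN.
    Returns None if the barrier is still open after the last read.
    """
    barrier = mask
    for count, ready in enumerate(atn_reads, start=1):
        barrier &= ~ready          # same effect as ANDN barrier, ready
        if barrier == 0:
            return count           # all permitted cogs have checked in
    return None
```

Signals from cogs outside the mask simply have no effect on `barrier`, which is why the PASM loop does not need a separate AND against the mask.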
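The inner send loop is just "peel off the lowest set bit until none are left". A Python sketch of that, assuming BOTONE returns the index of the lowest set bit (that is my reading of it here, not a confirmed definition); `lowest_set_bit` and `dispatch` are hypothetical names.

```python
def lowest_set_bit(value):
    """Index of the lowest set bit of a non-zero integer (assumed BOTONE behaviour)."""
    return (value & -value).bit_length() - 1


def dispatch(atn_bits, send_data):
    """Call send_data(cog) once for each cog whose bit is set, lowest index first."""
    served = []
    while atn_bits:
        cog = lowest_set_bit(atn_bits)
        atn_bits &= ~(1 << cog)    # same effect as CLRB t0, t1
        send_data(cog)
        served.append(cog)
    return served
```

For example, `dispatch(0b10010100, print)` would serve cogs 2, 4 and 7 in that order.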
Taking this a bit further, suppose you have the following:
- Cog 0 is receiving data from Cogs 1-4
- Cogs 1-4 are sending data to Cog 0 via WRLUTX, using the same address, but each to a different byte (so that they can be safely ORed together)
- Ideally, Cogs 1-4 will be ready to send data to Cog 0 at exactly the same time. But, it is possible that one or more of those cogs could be delayed.
If it's important for Cog 0 to receive all 4 bytes at exactly the same time, then we can use the barrier example above:
loop    MOV     barrier, mask   ' initialize barrier with mask
wait    WAITATN                 ' wait for an ATN event (clears event)
        GETATN  ready           ' gets ATN register into ready (clears ATN register)
        ANDN    barrier, ready  ' clear the ready bits from barrier
        TJNZ    barrier, #wait  ' barrier still has outstanding cog bits, loop
        COGATN  mask            ' tell Cogs 1-4 that they can all write their data (on the same clock cycle)
        CALL    #process_data   ' do something interesting
        JMP     #loop           ' go back to waiting for Cogs 1-4 to get their next byte of data ready
But, it might be that Cog 0 is willing to process the cogs that are ready, as they become ready:
loop    WAITATN                 ' wait for an ATN event (clears event)
        GETATN  ready           ' gets ATN register into ready (clears ATN register)
        AND     ready, mask     ' make sure we are only looking at ready signals from Cogs 1-4
        TJZ     ready, #loop    ' some other cog signalled us; ignore
        COGATN  ready           ' tell the "ready" cogs that they can all write their data (on the same clock cycle)
        CALL    #process_data   ' do something interesting (ready indicates which bytes are written)
        JMP     #loop           ' go back to waiting for Cogs 1-4 to get their next byte of data ready
There are a number of other scenarios that I could have shown, but hopefully this is enough to show the utility of having the 16-bit ATN register. None of the code is long or complicated, and I don't know how you would be able to do this as efficiently without the ATN register.
Comments
WRLUTX/S and COGATN are very different beasts. They can clearly complement each other, but I don't think they can be effectively rolled together.
Except this prevents ATN from being used for anything other than shared LUT writes. I fully expect that ATN will be used in scenarios other than LUT manipulation.
Certainly in Microcontroller Land, it is normal to be able to identify the Actual Source of any merged flags, by reading another register if needed.
This should have a relatively low logic cost to include?
Another use I can think of for this is a Watchdog COG (or is that WatchCOG?), where a background COG checks for periodic ATN from running COGs. In this case, it is very important to know the set of ATN-generating COGs.
ATN alone is not enough.
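A rough Python sketch of the watchdog bookkeeping, assuming the watchdog cog can read a 16-bit "who signalled" value (the expanded ATN register discussed in this thread). The function names, the `last_seen` list and the time units are all invented for illustration.

```python
def record_atn(atn_bits, last_seen, now):
    """Stamp the current time for every cog whose ATN bit is set."""
    for cog in range(16):
        if atn_bits & (1 << cog):
            last_seen[cog] = now


def overdue_cogs(watched_mask, last_seen, now, timeout):
    """Return the watched cogs that have not signalled within timeout ticks."""
    return [
        cog
        for cog in range(16)
        if watched_mask & (1 << cog) and now - last_seen[cog] > timeout
    ]
```

With only a single ORed ATN flag, `record_atn` would have nothing to stamp against, which is the point being made above.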
Brilliant!
A read-mode LUT-Any could safely use this trigger method, but a write-mode one has to write something to trigger, and that changes the destination register.
If so, there are few reasons for combining the features.
This is what I'm working on right now. It's the last detail on my list.
I have a question: Does a cog really care WHO pinged him? It seems to me that if he cares about WHO, he also must care about WHAT, because he's not going to do the same thing for everyone, right? There is always some customization to his response. There has to be DATA somewhere to base his response on to each cog pinging him for attention. That data will have to live in the hub (or smart pins, maybe).
This seems like what would likely happen: A cog gets pinged and, in this case, he needs to know what to do about it (most apps won't care about WHO, as that will be a constant). He has to get some data from hub to discover what is wanted. What if he had a string of longs in hub, where each long would be written beforehand by the cog pinging him. Once pinged by anyone, he could do a 'SETQ #15' + 'RDLONG startreg,hubaddr' to pull in 16 longs that he could then check for, say, non-0 values that indicate there's a request from the cog related to that long. This might seem like work, but it's no more work than sifting through 16 bits and dealing with each one - and THEN looking for the data. I just don't know if there's any real value in adding lots more logic and flops into this COGATN feature.
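As a sketch of the scan step in that hub-based scheme, here it is in Python rather than PASM. The 16-element list stands in for the longs pulled in by 'SETQ #15' + 'RDLONG', zero is taken to mean "no request" as suggested above, and `pending_requests` is a hypothetical name.

```python
def pending_requests(mailboxes):
    """Map cog index -> request long for every non-zero mailbox slot."""
    if len(mailboxes) != 16:
        raise ValueError("expected one long per cog (16 total)")
    return {cog: value for cog, value in enumerate(mailboxes) if value != 0}
```

The receiver would presumably also have to write zero back to each slot it has handled, which costs further hub accesses not shown here.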
Maybe the feature is already what it needs to be? A cog just knows that someone pinged him, not who pinged him.
BTW Can SETQ be used with RDBYTE ? I presume this will have quite a bit of initial latency given the egg beater interaction with word and byte.
SETQ only works with RDLONG, since it's not using the FIFO, just the egg-beater. It's longs or nothing. The good part is that block r/w's don't interfere with the FIFO, so they work during hub exec.
Yes, if the logic cost is too high, you can use either, or both, of Pin Cell or HUB links (or, there is a chance a speed-paranoid design has a paired COG already, in which case it has LUT path & signals too, to ONE companion).
As you say, RxCOG can get ATN, then check an active Pin-Cell for Data, and use no-data, or Data value as varying response indicators.
In the above code snippets, the "mask" is purely an implementation detail in the code, not a baked-in part of the ATN mechanism.
This sounds a lot like an enhanced version of mailboxes. COGATN would be the equivalent of "you've got mail". Having the cog fetch hub data allows for both shared and dedicated use of the cog. For shared use the 16 longs could act as individual mailboxes assigned to the other cogs.
There are a variety of reasons why you would only care about who and never about what. For example:
* @jmg's example of a watchdog cog that reacts to any watched cog that hasn't ATNed within a timeout period.
* synchronization across multiple cogs (the classic "barrier")
* workers in a pool indicating when they are "done".
(Notice that these are all primarily many-to-one communications.) Without the expanded ATN, here's how you would likely have to handle each of those with the existing hardware.
For the watchdog, it looks like the most efficient way to do this would be to use the new smart pin mode. You would use up to 15 pins, where each watched cog would send a one-byte event (the actual value is irrelevant; it just takes the fewest clock cycles). The watchdog cog would use WAITPAN/WAITPBN, followed by an inspection of INA/INB to see who signaled it. It would then have to call PINACK on each signaled pin to clear it.
For the barrier, you could again use the new smart pin mode. Each cog would "signal" the arbiter cog as with the watchdog, then wait for an ATN. In this case, the arbiter would use WAITPAE/WAITPBE to block until all participants have signaled. Then it would turn right around and ATN all of the participants. Technically, you could also use another smart pin instead of ATN, where the arbiter would call PINACK right after calling PINSETY.
For the worker pool, smart pins would again be the likely approach. In this case, I can certainly see where it might actually be preferred to the expanded ATN mechanism, as you could tunnel data between the worker and scheduler via the pin.
There is a caveat to all of these examples: you are relying on the pin to be a proxy of the signalling cog. If another cog writes to the wrong pin, there is no way to detect it. Yes, this is likely a bug, and would hopefully never happen. Comparatively, this is something that the expanded ATN can trivially detect.
Having said all that, if we are willing to utilize smart pins in this fashion and are willing to live with the extra overhead of working with the smart pins, then the expanded ATN is probably not necessary.
This is all predicated on using ATN for only one-to-one or one-to-many communications. In that case, yes, there is no need for the expanded ATN. But, as soon as you have a many-to-one situation, your example above becomes non-workable, or at least very complicated. For instance, how do you distinguish between a LONG that hasn't changed as opposed to one that has been updated with the same value as before? This also now requires the receiving cog to keep up to 16 longs locally as a copy to compare against. And that SETQ+RDLONG will, on average, take ~24 clock cycles. Of course, you could do bytes instead, which would reduce the average down to ~13 clocks. This also ignores the fact that all of the other cogs (the ones writing to the hub) are incurring the hub latency overhead, which can be significant if all you are trying to do is signal (in a many-to-one situation).
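To illustrate the "same value as before" problem, here is a tiny Python sketch. It assumes the receiver detects requests by comparing the 16 freshly read longs against a local copy; `changed_slots` is a made-up name.

```python
def changed_slots(previous, current):
    """Indices whose long differs between the local copy and the fresh read."""
    return [i for i, (old, new) in enumerate(zip(previous, current)) if old != new]


previous = [0] * 16
previous[4] = 7            # cog 4's earlier request, already handled

current = list(previous)
current[4] = 7             # cog 4 writes the same value again: a new request
current[6] = 1             # cog 6 writes a fresh request

detected = changed_slots(previous, current)
```

Here `detected` contains only slot 6; cog 4's repeated request is invisible unless the protocol adds a sequence number or a clear-after-read step.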
And you keep talking about the effort of sifting through 16 bits. Do any of the code examples above look like an effort? I suspect that a hub-oriented solution would look a lot more complicated in each one of those examples. As I mentioned in the last reply, smart pins may be a better approach for some of these examples, but that also has some overhead.
* Inside each cog, add two 32-bit "sticky" registers, one for each IN port.
* Add an instruction (e.g. SETINA/SETINB D/#, where D is a mask) that allows IN bits to either be driven directly from the pin or from the associated sticky bit in the register (to avoid a one-clock delay when using the sticky bit, IN would be the logical-or of the pin and the current sticky bit value).
* The incoming pin signal always sets the associated sticky bit.
* Smart pins no longer hold IN high. Instead, they pulse the line. If a cog wants the same IN behavior as before, it uses SETIN. However, if the design is such that the cog is always waiting on the pin, then it's not necessary to watch the sticky bit; the pulse will be sufficient.
* PINACK goes away. In its place, you need another instruction (which could still be called PINACK) that clears sticky bit(s) (e.g. PINACKA/PINACKB D/#, where D is a mask). Actually, what the instruction really does is re-capture the current state of the pin. This has the effect of setting the sticky bit back to zero if the pin doesn't happen to be high at the time that PINACK is called. Otherwise, the sticky bit stays stuck.
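A behavioural sketch of the sticky-bit idea in Python, for one IN port. Everything here is an assumption layered on the bullets above: the `StickyIn` class, the per-clock `clock()` call, and modelling SETINA/SETINB as a plain mask attribute.

```python
class StickyIn:
    """Toy model of one cog's sticky register for a single 32-bit IN port."""

    def __init__(self):
        self.sticky = 0       # one sticky bit per pin
        self.use_sticky = 0   # SETINA/SETINB-style mask: 1 = IN also sees the sticky bit

    def clock(self, pins):
        """Advance one clock with the given pin levels; return the IN value seen."""
        self.sticky |= pins   # an incoming pin signal always sets its sticky bit
        # IN is the pin itself, ORed with the sticky bit where the mask enables it
        return pins | (self.sticky & self.use_sticky)

    def pinack(self, mask, pins):
        """New-style PINACK: re-capture the current pin state on the masked bits."""
        self.sticky = (self.sticky & ~mask) | (pins & mask)
```

Because each cog would own its own `StickyIn` state, acknowledging in one cog leaves the sticky bits in every other cog alone.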
With these changes, I think you could use the new smart pin mode almost entirely in place of the ATN mechanism. And it covers the expanded version I was suggesting.
Also, in some cases (as noted above), the other smart pin modes can be used without PINACK. Further, because the smart pin "attention" bit is local to each cog, it is now possible for multiple cogs to RDPIN (or whatever it's called) and PINACK without clearing the "attention" bit in other cogs!
Also, this enhances the basic pin capabilities, independent of the ATN and other smart pin usage.
(edit: I just realized that the SETIN needs to be two instructions: SETINA and SETINB. Same for the new PINACK. Updated to reflect this.)
(edit edit: since the old PINACK would go away, which was really just a specialized PINSETM, the new pin mode could have its own mode instead of overloading another pin mode.)
yeeessss... but the Smart Pin cell, is a seriously capable piece of silicon, to be using as a simple flag ?
I'd want my smart pin cells doing .. Smart Pin Stuff...
This direction really needs the > 64 Pin Cell structure I suggested earlier.
Consuming 15 Smart Pin cells for watchdog use, is a lot of smart silicon gone...
A simpler link-only subset of Pin cells above 64, uses the same highway and commands, but does not steal Smart Pins any time you want to use it.
The Link-only cell would be a lot smaller, and could be faster, end-end.
Sure, I want my smart pins doing smart pin stuff too! But I think this is another case of using extreme examples to justify features. Most designs are going to have very little inter-cog communication or signalling. For example, a typical design is likely to watchdog only a cog or two, if at all. And most designs are not going to use all of the smart pin cells. For example, I expect that smart pins associated to the boot I/O to be largely unused after boot. So why not make better use of the resources that are already there?
I know, I know. After all that talk about enhancing ATN, I now seem to be arguing exactly the opposite. Here's the thing, though. When I first made that argument, LUT sharing was different than it is now, the new smart pin mode didn't exist, etc. Given the current design, I still think the expanded ATN would be useful, but I think there is another approach that can give the same bang for the buck while using other existing hardware (with a few tweaks). And, we just happen to gain some new capabilities in the process! To me, that's a win-win.
I like the boolean signaling ATN, as that does not consume a pin.
If, as Chip mentions, full signaling is too much logic, then using Pin Cells for that extra info makes sense to me, but I think using a Pin Cell for simple Boolean tasks is a serious waste.
That's a good point, but maybe someone wants to use serious BAUD speed after boot, in which case they will need Smart pins.
Best BAUD I can find today, is an EXAR part at 1,2,4 channels at 15MBd with HS-USB backbone.
I've made a case elsewhere for using that device, on P2 Eval Kits, as it gives the highest information link speeds, over Multiple UARTS, and a great real-test device with wide dynamic range.
Other parts are cheaper, but I'd reserve the cheapest slow parts for a P2-Module/Breakout minimalist board.
Or, they might want USB Boot, again, full smart pins are needed. I'd rate USB boot as highly likely, as the last cab on the rank.
Would it be expensive to add some simple digital 'not so smart' pins to the smart pin interface? Even if they were no faster than the smart pins, this could give a useful tool for signaling states between cogs.
Same question about the locks. Sure, with one lock you can simulate multiple ones, but it costs execution time.
Would it be at least possible to have 32 locks instead of 16?
curious.
Mike
I personally dislike having USB chips on almost every Propeller board I can buy. I like the PropPlug idea way better than built-in USB. We cannot reuse the pins for other purposes (say, LEDs), and there are all those reset issues when USB gets enumerated or when serial data is sent with no USB host connected.
A jumper for reset/programming like on the spinneret could help, but still.
Sure, that high-speed EXAR part sounds interesting, but it should be detachable, like the PropPlug. Not mandatory.
But I am just a hobbyist, Not able to fabricate PCBs yet. So I might be wrong here.
Mike