The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Sapieha · 2015-07-21 09:25

Hi Chip.

Thanks

Cluso99 · 2015-07-21 10:06

Chip, this sounds really good to me. We can set the priority. And we have a way to signal between other cogs via hub reads and writes. With a WAITINT instruction, we can effectively put a cog to sleep while it waits on another cog.

As for pin edge hold off, I would be happy to not have any hold off, other than to not trigger while in the edge interrupt handler (or not until a return from the edge interrupt).

For a long time I have been after a way to signal between cogs to simplify the current polled method, including extra internal pins. But this makes the simplest way, and we can also minimise power while waiting. Congratulations!

There is a side benefit to this mechanism... It effectively defines a fixed long for each cog which is something we have never been able to agree upon.

potatohead · 2015-07-21 10:28

The fixed priority resolves a lot of confusion. Now we are back to a well defined behavior people can design around, and where the priority isn't optimal, people can always employ more than one COG. Same for some rapid edge capture cases.

Without adding features and a lot of complex cases, this seems optimal overall.

Sometimes I think we center in on one COG, forgetting how they can be used together. Of course, the benefit of that is where we can maximize a COG, the benefits are likely to multiply by up to 16!

As mentioned some features do depend on specific COGS. That's going to have to be something users manage. "driver works in COGS 3,4,5" kind of thing. IMHO, worth it for the supervisory / message features. It will be worth it to think about employing this one in common code, just so we keep objects highly portable.

The breakpoint is a very nice "freebie"

Is the 20 bit address inclusive of the COG memory addresses? Breakpoint in COG and HUB EXEC code? A few of us are going to be really happy about this.

Overall, this doesn't seem like feature creep. It's more like a couple cycles were needed to center in on optimal functionality.

jmg · 2015-07-21 10:58

Here are the modes for SETINTx:
0000 off
0001 timer interrupt
0010 transfer rollover interrupt
0011 transfer block wrap interrupt
0100 breakpoint interrupt
0101 pin pos-edge interrupt
0110 pin neg-edge interrupt
0111 pin any-edge interrupt
1000 read mem interrupt
1001 write mem interrupt
What is the setup & granularity of the mem interrupts?
Could those also be used as Breakpoints?
Can they be used to redirect off-chip access, thru a memory manager/serial XIP handler?
Look forward to some example code that exercises all of this fully.

potatohead · 2015-07-21 11:26

The mem interrupts are centered on the low HUB addresses only and take advantage of the HUB being 16 memories, addressed by lower nibble. The basis for the egg beater, basically.

One long per COG.

Have each cog get a signal when the correspondingly-numbered hub RAM (of
which there are 16) gets written to at its first address.

This amounts
to a little combinatorial logic that feeds a flop that goes out to each
cog. If some cog writes within $0..$3, cog 0 gets a pulse. If some cog
writes to $4..$7, cog 1 gets a pulse, and so on, up to $3C..$3F causing
cog 15 to get a pulse.

Reads pulse too.

It's just a signal mechanism that takes a few LE's to deliver some nice features. We can code things that the very expensive message passing / masking system in "hot" did.
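
To make the address decode concrete, here is a minimal Python sketch of the mapping described above. It assumes the pulse depends only on the hub byte address falling in $00..$3F; the function name is made up for illustration and is not anything from the chip.

```python
def cog_for_hub_address(addr):
    """Return the cog pulsed by an access to hub byte address `addr`.

    Assumption: only the first 16 longs ($00..$3F) are special, one long
    (four bytes) per cog. Any other address pulses no cog (None).
    """
    if 0x00 <= addr <= 0x3F:
        return addr >> 2   # four bytes per long, one long per cog
    return None


for addr in (0x00, 0x03, 0x04, 0x3C, 0x3F, 0x40):
    print(hex(addr), cog_for_hub_address(addr))
```

So a write anywhere in $4..$7 lands on cog 1, and anything at $40 or above signals nobody.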

Seairth · 2015-07-21 11:35

If interrupts are being extended to include hub addresses, why not include an interrupt for LOCK as well?

Seairth · 2015-07-21 11:46

First, though, PJV had made some request that we could know when a certain hub location was being written to by another cog. The setup and address comparison for that would have been way too complex. We talked about this and found that a much simpler goal would get us there: Have each cog get a signal when the correspondingly-numbered hub RAM (of which there are 16) gets written to at its first address. This amounts to a little combinational logic that feeds a flop that goes out to each cog. If some cog writes within $0..$3, cog 0 gets a pulse. If some cog writes to $4..$7, cog 1 gets a pulse, and so on, up to $3C..$3F causing cog 15 to get a pulse. This would take about 100 LE's for a full 16-cog implementation, which is 0.07% of the total digital logic. But we could even make it better by signalling reads, too, not just writes, for a likely increase of 16 LE's. The point of all this being that we could use these RDZ/WRZ signals for interrupts, waiting, and polling. Cogs could use this mechanism to fully handshake asynchronous 32-bit data streams between them (in the background, even).

I'm not sure I understand the value of the "read" interrupt.
If cog_0 and cog_1 are using this mechanism, cog_0 would get an interrupt for a write to $0 (by cog_1). cog_0 would then read $0, which would generate a read interrupt on cog_0. cog_1 is never notified that cog_0 read $0. To do that, cog_0 would have to read $1 (in addition to $0).
Conversely, cog_1 could write a value to $1, then wait for an interrupt that cog_0 read it. However, no write interrupt was sent to cog_0, so it wouldn't know to go read $1.
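
The problem can be seen in a small Python model. This is only a sketch of the scheme as I understand it, where an access to special long n always pulses cog n and nobody else; the class and method names are hypothetical.

```python
from collections import defaultdict


class FixedPulseHub:
    """Model where any access to special long n pulses cog n only."""

    def __init__(self):
        self.longs = [0] * 16
        self.events = defaultdict(list)   # cog number -> list of (kind, long)

    def write(self, by_cog, index, value):
        # by_cog is shown for readability; the hardware signal ignores it
        self.longs[index] = value
        self.events[index].append(("write", index))

    def read(self, by_cog, index):
        self.events[index].append(("read", index))
        return self.longs[index]


hub = FixedPulseHub()
hub.write(by_cog=1, index=0, value=123)   # cog_1 sends to cog_0
value = hub.read(by_cog=0, index=0)       # cog_0 picks it up

print(hub.events[0])   # [('write', 0), ('read', 0)]
print(hub.events[1])   # []
```

cog_0 sees both the write and its own read, while cog_1 sees nothing at all.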

potatohead · 2015-07-21 11:51

Wouldn't this be used with a different HUB memory address, or block of addresses for the actual data transfer?

Meaning the read / write is just two channels of handshaking, and/or a data address mailbox. A controlling COG is likely to use one or the other of its own read / write. Controlled COGS may use both.

Seairth · 2015-07-21 12:26

How many bits are available in the LOCK register? Here's my thought:

256 bits, which effectively gives us 16 bits per cog
Hub interrupt when LOCK register changes (any bit)
Add LOCKGET instruction to test whether a bit is set (captured in C)
Add WAITLCK instruction to halt a cog until the LOCK register changes

With these, you could have "soft interrupts". For instance, suppose cog_0 wants to call a routine in cog_1. To do so, cog_0 would set hub memory to whatever parameters are required, then set the associated interrupt LOCK bit. cog_0 could then go on to do other things while waiting for cog_1 to complete. Or it could call WAITLCK to wait it out.

In the meantime, cog_1 would have an ISR on INTLOCK, which would then use LOCKGET to see which of its routines (if any) have been triggered. When the routine is finished, it writes the results to hub memory and clears the associated LOCK bit.
cog_0 gets notified that cog_1 is complete in one of three possible ways:

Pending WAITLCK
INTLOCK ISR
polling with LOCKGET

The reason for 256 bits is that this would provide parity with the number of cogs in the system (16^2), though the actual usage is entirely up to the developer.
Another use case is similar to the HUB RAM read/write that Chip described above. In this case, the routine works as follows:

cog_0 writes to hub RAM (any address, not just the first address).
cog_0 sets bit $10 on LOCK (assuming the developer partitioned $00-$0F for cog_0 and $10-$1F for cog_1).
cog_1's INTLOCK ISR wakes up and uses LOCKGET to test $10. If set, it then reads the associated hub RAM, then clears bit $10.
cog_0, as above, can use one of three different methods to see that cog_1 has "received" the data.

Anyhow, you get the idea. Though it's true that these routines would require multiple hubops, the fact that LOCK updates can be handled by an ISR would make the overall process much more efficient than using only polling. I'd even go so far as to suggest that you don't really need the hub RAM read/write interrupts (or, at least the read interrupt) if this were available.
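
A rough Python sketch of the "soft interrupt" flow, in case it helps. Everything here is an assumption for illustration: the 256-bit LOCK register, the 16-bits-per-cog partition, and the LockRegister / lock_bit names. The "changes" counter just stands in for the proposed LOCK-changed interrupt.

```python
class LockRegister:
    """Toy model of the proposed LOCK register with change notification."""

    def __init__(self, bits=256):
        self.bits = bits
        self.value = 0
        self.changes = 0              # stands in for the INTLOCK event

    def _update(self, new_value):
        if new_value != self.value:
            self.value = new_value
            self.changes += 1

    def set(self, n):
        self._update(self.value | (1 << n))

    def clear(self, n):
        self._update(self.value & ~(1 << n))

    def get(self, n):                 # models the proposed LOCKGET
        return bool((self.value >> n) & 1)


def lock_bit(cog, routine):
    """Developer-chosen partition: 16 lock bits per cog."""
    return cog * 16 + routine


hub = {}                              # stands in for hub RAM
locks = LockRegister()

# cog_0: place parameters in hub RAM, then raise cog_1's routine 0 (bit $10)
hub["params"] = (2, 3)
locks.set(lock_bit(1, 0))

# cog_1 ISR: test the bit, run the routine, write the result, clear the bit
if locks.get(lock_bit(1, 0)):
    a, b = hub["params"]
    hub["result"] = a + b
    locks.clear(lock_bit(1, 0))

# cog_0: poll (or wait) until the bit is clear again
done = not locks.get(lock_bit(1, 0))
print(done, hub["result"], locks.changes)
```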

Seairth · 2015-07-21 12:41

Wouldn't this be used with a different HUB memory address, or block of addresses for the actual data transfer?

Meaning the read / write is just two channels of handshaking, and/or a data address mailbox. A controlling COG is likely to use one or the other of its own read / write. Controlled COGS may use both.

My point was that the "read" interrupt, if I am understanding it correctly, is not very useful. cog_1 only knows if something has read $4-$7.
* How do any of the other cogs know that they should be reading one of those bytes?
* How does cog_1 know which byte was read?
* How does cog_1 know if the intended recipient was the one that performed the read?

As you point out, this mechanism is most likely going to be used as a "mailbox", with the real data sitting somewhere else. As a result, one-way message flows (detectable via the "write" interrupt) make sense. But, if you want any ACK or flow-control capability, you can't use "read" interrupts. You end up having to implement two-way messaging, where each cog is writing to the other cog's address. Note that this approach would only support a cog communicating with up to 4 other cogs (which is probably enough for most use cases).
Regardless, the "read" interrupt doesn't seem to be very useful.

kwinn · 2015-07-21 13:33

Wouldn't this be used with a different HUB memory address, or block of addresses for the actual data transfer?

Meaning the read / write is just two channels of handshaking, and/or a data address mailbox. A controlling COG is likely to use one or the other of its own read / write. Controlled COGS may use both.

My point was that the "read" interrupt, if I am understanding it correctly, is not very useful. cog_1 only knows if something has read $4-$7.
* How do any of the other cogs know that they should be reading one of those bytes?
* How does cog_1 know which byte was read?
* How does cog_1 know if the intended recipient was the one that performed the read?

As you point out, this mechanism is most likely going to be used as a "mailbox", with the real data sitting somewhere else. As a result, one-way message flows (detectable via the "write" interrupt) make sense. But, if you want any ACK or flow-control capability, you can't use "read" interrupts. You end up having to implement two-way messaging, where each cog is writing to the other cog's address. Note that this approach would only support a cog communicating with up to 4 other cogs (which is probably enough for most use cases).
Regardless, the "read" interrupt doesn't seem to be very useful.

This whole discussion is leading to some very interesting developments that will make the P2 more useful. Since the silicon cost of adding the read interrupt is so small I think it should be included. I would be very surprised if someone does not come up with a great use for it. Cog to cog communication has been requested in the past, and the read interrupt could be used for that.

Sapieha · 2015-07-21 14:02

Hi All.

As I see it, the way Chip is building interrupts on the PX opens it up to programming any type of companion chip, such as an MMU, MPU, and others that need signaling between the main CPU and the companion chip.

And that opens the way to any type of OS that can run on the PX, so it will be possible to run Linux, Unix, and other advanced OSes.

ctwardell · 2015-07-21 14:06

I assume the read/write interrupts won't be triggered for a given cog if it reads/writes its own trigger location.
Is this correct?
C.W.

potatohead · 2015-07-21 16:02

You know, it might make sense for it to trigger. That's like a BRK or a self-triggered interrupt feature, but it might be silly, or that one thing people bump into that didn't get used... too.

cgracey · 2015-07-21 16:25

Chip, this sounds really good to me. We can set the priority. And we have a way to signal between other cogs via hub reads and writes. With a WAITINT instruction, we can effectively put a cog to sleep while it waits on another cog.

As for pin edge hold off, I would be happy to not have any hold off, other than to not trigger while in the edge interrupt handler (or not until a return from the edge interrupt).

For a long time I have been after a way to signal between cogs to simplify the current polled method, including extra internal pins. But this makes the simplest way, and we can also minimise power while waiting. Congratulations!

There is a side benefit to this mechanism... It effectively defines a fixed long for each cog which is something we have never been able to agree upon.

WAITINT would be a nice addition.

cgracey · 2015-07-21 16:28

...The breakpoint is a very nice "freebie"

Is the 20 bit address inclusive of the COG memory addresses? Breakpoint in COG and HUB EXEC code?...

Yes, addresses from $0..$7FC ($0..$1FF in terms of registers) are cog execution addresses. Above that range are hub execution addresses. You could have a breakpoint anywhere.
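
As a quick illustration of that split, here is a small Python sketch that classifies a 20-bit breakpoint address. The function name is hypothetical, and treating every byte address below $800 as a cog address (with the low two bits dropped to get the register number) is my assumption, not something stated above.

```python
COG_EXEC_END = 0x800   # assumption: byte addresses below this are cog execution


def classify_breakpoint(addr):
    """Return ('cog', register) or ('hub', byte_address) for a 20-bit address."""
    if not 0 <= addr < (1 << 20):
        raise ValueError("breakpoint address must fit in 20 bits")
    if addr < COG_EXEC_END:
        return ("cog", addr >> 2)   # $0..$7FC maps to registers $0..$1FF
    return ("hub", addr)


print(classify_breakpoint(0x000))   # ('cog', 0)
print(classify_breakpoint(0x7FC))   # ('cog', 511)
print(classify_breakpoint(0x800))   # ('hub', 2048)
```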

cgracey · 2015-07-21 16:31

Here are the modes for SETINTx:
0000 off
0001 timer interrupt
0010 transfer rollover interrupt
0011 transfer block wrap interrupt
0100 breakpoint interrupt
0101 pin pos-edge interrupt
0110 pin neg-edge interrupt
0111 pin any-edge interrupt
1000 read mem interrupt
1001 write mem interrupt
What is the setup & granularity of the mem interrupts?
Could those also be used as Breakpoints?
Can they be used to redirect off-chip access, thru a memory manager/serial XIP handler?
Look forward to some example code that exercises all of this fully.

I suppose if they allowed address ranges, the mem interrupts would be halfway to being able to interdict what could become off-chip accesses, but they are not that rich and I think it's too much to attempt at this point. Interesting idea, though.

cgracey · 2015-07-21 16:34

If interrupts are being extended to include hub addresses, why not include an interrupt for LOCK as well?

What are you thinking here?

cgracey · 2015-07-21 16:36

First, though, PJV had made some request that we could know when a certain hub location was being written to by another cog. The setup and address comparison for that would have been way too complex. We talked about this and found that a much simpler goal would get us there: Have each cog get a signal when the correspondingly-numbered hub RAM (of which there are 16) gets written to at its first address. This amounts to a little combinational logic that feeds a flop that goes out to each cog. If some cog writes within $0..$3, cog 0 gets a pulse. If some cog writes to $4..$7, cog 1 gets a pulse, and so on, up to $3C..$3F causing cog 15 to get a pulse. This would take about 100 LE's for a full 16-cog implementation, which is 0.07% of the total digital logic. But we could even make it better by signalling reads, too, not just writes, for a likely increase of 16 LE's. The point of all this being that we could use these RDZ/WRZ signals for interrupts, waiting, and polling. Cogs could use this mechanism to fully handshake asynchronous 32-bit data streams between them (in the background, even).

I'm not sure I understand the value of the "read" interrupt.
If cog_0 and cog_1 are using this mechanism, cog_0 would get an interrupt for a write to $0 (by cog_1). cog_0 would then read $0, which would generate a read interrupt on cog_0. cog_1 is never notified that cog_0 read $0. To do that, cog_0 would have to read $1 (in addition to $0).
Conversely, cog_1 could write a value to $1, then wait for an interrupt that cog_0 read it. However, no write interrupt was sent to cog_0, so it wouldn't know to go read $1.

Oh, I didn't realize that! We will need to select 1-of-16 read signals for a proper read interrupt. Any reason not to mux the incoming write signals, as well?
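
Here is a small Python model of what selecting 1-of-16 read signals would buy. It is a sketch under assumptions: write alerts stay fixed (long n pulses cog n), while each cog picks which long's reads it wants to hear about. The class and method names are invented for illustration.

```python
class MuxedHub:
    """Write to long n alerts cog n; a read of long n alerts every cog
    that has selected long n as its read-alert source."""

    def __init__(self, cogs=16):
        self.longs = [0] * cogs
        self.read_select = [None] * cogs   # None = no read alert selected
        self.write_alerts = [0] * cogs
        self.read_alerts = [0] * cogs

    def select_read(self, cog, index):
        self.read_select[cog] = index

    def write(self, index, value):
        self.longs[index] = value
        self.write_alerts[index] += 1

    def read(self, index):
        for cog, selected in enumerate(self.read_select):
            if selected == index:
                self.read_alerts[cog] += 1
        return self.longs[index]


hub = MuxedHub()
hub.select_read(cog=1, index=0)   # cog_1 watches reads of cog_0's long
hub.write(0, 99)                  # cog_1 writes to cog_0's long
got = hub.read(0)                 # cog_0 reads it

print(got, hub.write_alerts[0], hub.read_alerts[1], hub.read_alerts[0])
```

With the selection in place, the writer (cog_1) gets the read alert and the reader (cog_0) no longer alerts itself.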

cgracey · 2015-07-21 16:43

I assume the read/write interrupts won't be triggered for a given cog if it reads/writes its own trigger location.
Is this correct?
C.W.

It would happen, unless some circuitry masked it away.

cgracey · 2015-07-21 16:46

How many bits are available in the LOCK register? Here's my thought:

256 bits, which effectively gives us 16 bits per cog
Hub interrupt when LOCK register changes (any bit)
Add LOCKGET instruction to test whether a bit is set (captured in C)
Add WAITLCK instruction to halt a cog until the LOCK register changes

With these, you could have "soft interrupts". For instance, suppose cog_0 wants to call a routine in cog_1. To do so, cog_0 would set hub memory to whatever parameters are required, then set the associated interrupt LOCK bit. cog_0 could then go on to do other things while waiting for cog_1 to complete. Or it could call WAITLCK to wait it out.

In the meantime, cog_1 would have an ISR on INTLOCK, which would then use LOCKGET to see which of its routines (if any) have been triggered. When the routine is finished, it writes the results to hub memory and clears the associated LOCK bit.
cog_0 gets notified that cog_1 is complete in one of three possible ways:

Pending WAITLCK
INTLOCK ISR
polling with LOCKGET

The reason for 256 bits is that this would provide parity with the number of cogs in the system (16^2), though the actual usage is entirely up to the developer.
Another use case is similar to the HUB RAM read/write that Chip described above. In this case, the routine works as follows:

cog_0 writes to hub RAM (any address, not just the first address).
cog_0 sets bit $10 on LOCK (assuming the developer partitioned $00-$0F for cog_0 and $10-$1F for cog_1).
cog_1's INTLOCK ISR wakes up and uses LOCKGET to test $10. If set, it then reads the associated hub RAM, then clears bit $10.
cog_0, as above, can use one of three different methods to see that cog_1 has "received" the data.

Anyhow, you get the idea. Though it's true that these routines would require multiple hubops, the fact that LOCK updates can be handled by an ISR would make the overall process much more efficient than using only polling. I'd even go so far as to suggest that you don't really need the hub RAM read/write interrupts (or, at least the read interrupt) if this were available.

I think I see what you are proposing, but I'm not fully grasping it, yet.

cgracey · 2015-07-21 17:19

You know, it might make sense for it to trigger. That's like a BRK or a self-triggered interrupt feature, but it might be silly, or that one thing people bump into that didn't get used... too.

I think the idea, normally, would be to write back 0 to tell the other cog that you got it, but with a read alert, he will already know that you got it, so no need to write any 0 back. Once he sees that you read it, he can just write the next value to you and wait for you to get it.
As Seairth pointed out, we will need to select which read we want to interrupt on, so here's a new instruction:
SETIRDL D/# - set interrupt read-long location (4 bits)

Once that instruction executes, you can access that read-alert through another instruction:
GETRDL - if WC/WZ, writes captured read-alert to flag(s), else waits for read-alert
There will be a write-alert instruction, too, aside from the interrupt option:
GETWRL - if WC/WZ, writes captured write-alert to flag(s), else waits for write-alert
This way, you can have one cog doing this:
WRLONG data,other_cogs_special_location 'write data
GETRDL 'wait for other cog to read it
<loop>
While another cog does this:
GETWRL 'wait for other cog to write my cog's special location
RDLONG data,my_cogs_special_location 'read data
<loop>
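
For anyone who wants to play with the handshake before there is hardware, here is a rough Python analogue using two threads as the two cogs. It is only a sketch: the Mailbox class and its method names mirror the instruction names above for readability and are otherwise made up, and threading.Event stands in for the captured alerts.

```python
import threading


class Mailbox:
    """One special long plus its write-alert and read-alert flags."""

    def __init__(self):
        self.value = 0
        self.write_alert = threading.Event()
        self.read_alert = threading.Event()

    def wrlong(self, value):
        self.value = value
        self.write_alert.set()

    def rdlong(self):
        value = self.value
        self.read_alert.set()
        return value

    def getwrl(self):                 # wait for a write alert, then clear it
        self.write_alert.wait()
        self.write_alert.clear()

    def getrdl(self):                 # wait for a read alert, then clear it
        self.read_alert.wait()
        self.read_alert.clear()


def sender(box, items):
    for item in items:
        box.wrlong(item)              # write data
        box.getrdl()                  # wait for the other cog to read it


def receiver(box, count, out):
    for _ in range(count):
        box.getwrl()                  # wait for the other cog to write
        out.append(box.rdlong())      # read data


box = Mailbox()
items = [10, 20, 30]
received = []
worker = threading.Thread(target=receiver, args=(box, len(items), received))
worker.start()
sender(box, items)
worker.join()
print(received)   # [10, 20, 30]
```

Neither side ever writes a 0 back; the read alert alone tells the sender it can send the next value.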

Seairth · 2015-07-21 17:22

Oh, I didn't realize that! We will need to select 1-of-16 read signals for a proper read interrupt. Any reason not to mux the incoming write signals, as well?

That would certainly allow the writer to acknowledge a read by another cog, as well as make the communication a bit more flexible. This looks much more like the 16-slot mailbox that people have talked about.

Since the interrupt granularity is at 32 bits, there will be very limited use to do byte or word reads/writes to these addresses.

Rayman · 2015-07-21 17:36

I was hopeful when Chip chimed back in that he was close to a final design...
Seems we're still in the adding features phase though...

rod1963 · 2015-07-21 17:41

This interrupt scheme is growing more and more complex. It makes the one on the 68K series look simple by comparison.

cgracey · 2015-07-21 18:28

I realized while working these things out that there are seven discrete events that can be interrupted on:
1) timer reload
2) pin edge
3) transfer rollover
4) transfer block wrap
5) write to cog's special long
6) read from any cog's special long
7) execution address hit
All these discrete events are open-loop, requiring no feedback.
Four of them require setup instructions, though:

SETIMER D/# - set 32-bit timer period, 0=off, generates reload event
SETEDGE D/# - set edge and pin to %ee_pppppp (decouples edge from interrupt mode)
SETREAD D/# - set which of the first sixteen longs generates a read alert
SETEXEC D/# - set execution address for breakpoint alert
Now these seven events can be used for interrupts, with the following modes via SETINTx:
000 off
001 timer interrupt
010 edge interrupt
011 transfer rollover interrupt
100 transfer block wrap interrupt
101 read mem interrupt
110 write mem interrupt
111 execution address interrupt
These seven events can also be captured into flops and made available to polling instructions, so that interrupts aren't even needed, if you don't want them:
GETIMER
GETEDGE
GETROLL
GETWRAP
GETREAD
GETWRIT
GETEXEC
For these polling instructions, if WC only is used, the event's flop state is put into C and the flop is cleared. If WZ is used, the event is waited for and Q is used as a timeout against CNT (Z=1 if a timeout occurred). If neither WC nor WZ is used, the instruction just waits indefinitely for the event.
Oh, and by having pin-edge events poll-able AND wait-able, we can get rid of WAITPX, WAITPR, and WAITPF, which involved their own 64-to-1 pin mux and were bumping critical-path. Implementation of these ideas may cause a net drop in LE's. Getting those events standing on their own, outside of the interrupt modes, makes a lot of good things possible.
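
To pin down the poll and wait-with-timeout behavior, here is a small Python sketch. It is an illustration only: the names are invented, the CNT comparison is reduced to a simple tick loop, and clearing the flop when a wait succeeds is my assumption since only the WC case is spelled out above.

```python
class EventFlop:
    """One captured event, set by hardware and consumed by polling."""

    def __init__(self):
        self.flag = False

    def trigger(self):
        self.flag = True

    def poll(self):
        """Models GETxxx WC: return the flop state in C and clear the flop."""
        c = self.flag
        self.flag = False
        return c


def wait_with_timeout(flop, fire_at, cnt, q):
    """Models GETxxx WZ: wait for the event until CNT reaches Q.

    fire_at is the tick at which the simulated event occurs.
    Returns (z, cnt) where z is True if the wait timed out.
    """
    while cnt < q:
        if cnt == fire_at:
            flop.trigger()
        if flop.flag:
            flop.flag = False     # assumption: a successful wait clears the flop
            return False, cnt
        cnt += 1
    return True, cnt


flop = EventFlop()
flop.trigger()
print(flop.poll(), flop.poll())                          # True False
print(wait_with_timeout(EventFlop(), fire_at=5, cnt=0, q=10))    # (False, 5)
print(wait_with_timeout(EventFlop(), fire_at=50, cnt=0, q=10))   # (True, 10)
```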

potatohead · 2015-07-21 18:39

Just let it play out guys. It is one new feature that needs a use case think through. And that looks messy, because it is.

The end product will be simple and robust.

Seairth · 2015-07-21 19:00

For these polling instructions, if WC only is used, the event's flop state is put into C and the flop is cleared. If WZ is used, the event is waited for and Q is used as a timeout against CNT (Z=1 if a timeout occurred). If neither WC nor WZ is used, the instruction just waits indefinitely for the event.

Wouldn't it be more consistent with other instructions to call these WAITxxx and reverse the role of C and Z?

cgracey · 2015-07-21 19:07

For these polling instructions, if WC only is used, the event's flop state is put into C and the flop is cleared. If WZ is used, the event is waited for and Q is used as a timeout against CNT (Z=1 if a timeout occurred). If neither WC nor WZ is used, the instruction just waits indefinitely for the event.

Wouldn't it be more consistent with other instructions to call these WAITxxx and reverse the role of C and Z?

Well, GETQX and GETQY wait for hub CORDIC results.
Oh, do you mean have the versions that don't use WC renamed to WAITxxx? That might be better, I see.

Seairth · 2015-07-21 19:30

I think I see what you are proposing, but I'm not fully grasping it, yet.

I wrote up another explanation, but haven't posted it because I don't know if it would make what I had said prior any more clear. With the current round of changes you have proposed, I don't know whether to push the lock thing any further. I do think it would provide an overall better approach than "special registers". But I also want the P2 image to get finished enough to release for testing and play.
At the very least, I suggest adding a "LOCKGET D" that simply allows the current value of a lock to be determined. That way, locks can be used as a semaphore or as a set of mutexes and/or events. With LOCKGET, a cog can see if a flag is set (if the recipient of the event) or cleared (if the sender of the event). If I recall, the current implementation has 32 lock bits. With that, one can easily implement up to 16 message-passing "channels" using locks and hub memory. Of course, it will require polling, but at least it will be possible.
