I see no reason why a single 160 MIPS processor could not be divvied up among 16 in the manner you are suggesting, by a method similar to the hub slot assignment/scheduling that was discussed before the "lazy susan" hub came along. Instead of 16 cogs, each with its own registers and CPU, a single fast CPU could switch between 16 register blocks using a register block assignment table. It could work exactly like the current cogs do.
I don't quite understand your comment. How would you get the full 160 MIPS for a single thread if you divvied up the processor using 16 register blocks? Are you suggesting that each register block would have a variable time-slot allocation instead of assigning the time-slots equally?
Like I said, the real challenge would be doing this for a 1,600 MIPS machine like the P2.
The P-2 isn't a 1600 MIPS machine unless you're doing marketing speak. Not unless Chip has found a way to run a highly parallelized app over all 16 with no performance hit.
MIPS-wise, I'll wait until the CoreMark numbers come out for the P-2 cogs.
Kwinn:
What you described is feasible. It's pretty much an RTOS, as opposed to the old-school round-robin scheduler that Chip implemented on the P-1 because interrupts weren't possible.
You can do what you propose with, say, a BeagleBone Black and an RTOS like Neutrino, and you're in business.
I see no reason why a single 160 MIPS processor could not be divvied up among 16 in the manner you are suggesting, by a method similar to the hub slot assignment/scheduling that was discussed before the "lazy susan" hub came along. Instead of 16 cogs, each with its own registers and CPU, a single fast CPU could switch between 16 register blocks using a register block assignment table. It could work exactly like the current cogs do.
It can, but the devil is in the details. Pipelines & caches complicate things greatly, as they help push up straight line speed, but at the expense of jumps. Remove pipelines and caches and the MHz pool drops, but the design gets simpler.
I have data here on an 8-bit MCU that used a 3-way time-slice. It had 3 copies of the registers and 3 copies of local RAM, and IIRC the local RAM of one was mapped to the extended RAM of the others.
One flaw it had was that the mux was always fixed at 1:3, as abcabc; you could not ask for abababab or abbabbabbabb, which was a strange oversight given how simple that would have been to allow.
I don't think the parts are manufactured any more.
I don't quite understand your comment. How would you get the full 160 MIPS for a single thread if you divvied up the processor using 16 register blocks? Are you suggesting that each register block would have a variable time-slot allocation instead of assigning the time-slots equally?
Not exactly. The CPU would have a lookup table to tell it which one of the 16 register blocks to execute an instruction from, and it would step through the table sequentially. To emulate what the cogs do now, the table would contain entries with the 4-bit register block pointer, and it would be some multiple of 16 entries long. If every second entry pointed to register block 1, then that block would get 80 MIPS.
In essence this would be multiplexing the CPU between the 16 register blocks, with the ability to select how much of the CPU time each block gets.
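Just to make the arithmetic concrete, here is a little C model of such an assignment table. The 160 MIPS figure, the 32-entry table length and the fill pattern are assumptions for illustration only:

#include <stdio.h>

#define SLOTS      32            /* table length: any multiple of 16      */
#define BLOCKS     16            /* register blocks, i.e. virtual cogs    */
#define TOTAL_MIPS 160           /* assumed throughput of the single CPU  */

int main(void)
{
    int table[SLOTS], count[BLOCKS] = {0};
    int rr = 0;                  /* round-robin over the other blocks     */

    /* Fill the assignment table: every second slot goes to block 1,
       the remaining slots are shared round-robin by the other blocks.    */
    for (int s = 0; s < SLOTS; s++) {
        if (s & 1) {
            table[s] = 1;
        } else {
            table[s] = rr;
            rr = (rr + 1) % BLOCKS;
            if (rr == 1)
                rr++;            /* block 1 already has its slots         */
        }
    }

    /* The "scheduler" just steps through the table; the MIPS a block
       gets is simply its share of the slots.                             */
    for (int s = 0; s < SLOTS; s++)
        count[table[s]]++;

    for (int b = 0; b < BLOCKS; b++)
        printf("block %2d: %2d/%d slots = %5.1f MIPS\n",
               b, count[b], SLOTS, (double)TOTAL_MIPS * count[b] / SLOTS);

    return 0;
}

Run it and block 1 comes out at 80 MIPS, which is the "every second entry" case described above.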
Like I said, the real challenge would be doing this for a 1,600 MIPS machine like the P2.
That would be a challenge for a single CPU. All things considered, I still think the current multiple-core approach is the way to go.
The P-2 isn't a 1600 MIPS machine unless you're doing marketing speak. Not unless Chip has found a way to run a highly parallelized app over all 16 with no performance hit.
MIPS-wise, I'll wait until the CoreMark numbers come out for the P-2 cogs.
Kwinn:
What you described is feasible. It's pretty much an RTOS, as opposed to the old-school round-robin scheduler that Chip implemented on the P-1 because interrupts weren't possible.
You can do what you propose with, say, a BeagleBone Black and an RTOS like Neutrino, and you're in business.
I wouldn't call it an RTOS, although it is similar to how multitasking is done. More like a hardware-based variable time slicer without the interrupts.
It can, but the devil is in the details. Pipelines & caches complicate things greatly, as they help push up straight line speed, but at the expense of jumps. Remove pipelines and caches and the MHz pool drops, but the design gets simpler.
I agree. Just saying the idea was possible, not that it was good. I prefer the simpler approach.
I have data here on an 8-bit MCU that used a 3-way time-slice. It had 3 copies of the registers and 3 copies of local RAM, and IIRC the local RAM of one was mapped to the extended RAM of the others.
One flaw it had was that the mux was always fixed at 1:3, as abcabc; you could not ask for abababab or abbabbabbabb, which was a strange oversight given how simple that would have been to allow.
I don't think the parts are manufactured any more.
I would be interested in seeing the data on that MCU if it is possible. I often wondered why no one had come out with a micro that had two or more sets of registers (including the program counter) and hardware to switch between them. The closest that I was aware of was the alternate/shadow registers on the Z80 that made multitasking and interrupt handling a bit easier. Adding variable time slicing to that would have been the icing on the cake.
I often wondered why no one had come out with a micro that had two or more sets of registers (including the program counter) and hardware to switch between them. The closest that I was aware of was the alternate/shadow registers on the Z80 that made multitasking and interrupt handling a bit easier. Adding variable time slicing to that would have been the icing on the cake.
Google "hochip LS2051".
The 8051 has 4 banks of registers, and the Z8 has a register frame pointer, but both interrupt in the usual sense.
The HoChip parts had a hard time-sliced 3-way 'core' design.
Hi
This old Texas Instruments processor was interesting in its time... it used context switching, which I would think would make interrupts fast to service.
Architecture (from the wiki)
The TMS9900 has three internal 16-bit registers: the Program counter (PC), Status register (ST), and Workspace Pointer register (WP). The WP register points to a base address in external RAM where the processor's 16 general-purpose user registers (each 16 bits wide) are kept. This architecture allows for quick context switching; e.g. when a subroutine is entered, only the single workspace register needs to be changed instead of requiring registers to be saved individually.
Addresses refer to bytes with big-endian ordering convention. The TMS9900 is a classic 16-bit machine with an address space of 2^16 bytes (65,536 bytes or 32,768 words).
There is no concept of a stack and no stack pointer register. Instead, branch instructions exist that save the program counter to a register and change the register context. The 16 hardware and the 16 software interrupt vectors each consist of a pair of PC and WP values, so the register context switch is automatically performed by an interrupt as well.
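A toy C model of that workspace-pointer trick, in case it helps. The BLWP/RTWP behaviour sketched here (old WP, PC and ST saved into R13-R15 of the new workspace) is from memory, so treat it as an illustration rather than a reference:

#include <stdint.h>
#include <stdio.h>

static uint16_t mem[0x8000];     /* 64 KB of RAM seen as 32,768 words     */
static uint16_t WP, PC, ST;      /* the only true on-chip registers       */

/* "Register" n of the current context is just a word in RAM. */
static uint16_t *reg(int n) { return &mem[(WP >> 1) + n]; }

/* BLWP-style switch: fetch a new WP and PC from a two-word vector and
   stash the old context in R13..R15 of the *new* workspace.              */
static void context_switch(uint16_t vector)
{
    uint16_t old_wp = WP, old_pc = PC, old_st = ST;
    WP = mem[vector >> 1];
    PC = mem[(vector >> 1) + 1];
    *reg(13) = old_wp;
    *reg(14) = old_pc;
    *reg(15) = old_st;
}

/* RTWP-style return: restore the saved context (WP last, since reg()
   is relative to the current WP).                                        */
static void context_return(void)
{
    ST = *reg(15);
    PC = *reg(14);
    WP = *reg(13);
}

int main(void)
{
    WP = 0x0100; PC = 0x2000;            /* pretend this is the main code */
    mem[0x0040 >> 1]       = 0x0200;     /* vector word 0: handler WP     */
    mem[(0x0040 >> 1) + 1] = 0x3000;     /* vector word 1: handler PC     */

    context_switch(0x0040);              /* interrupt or BLWP-style call  */
    printf("in handler:  WP=%04X PC=%04X saved PC=%04X\n",
           (unsigned)WP, (unsigned)PC, (unsigned)*reg(14));
    context_return();
    printf("back again:  WP=%04X PC=%04X\n", (unsigned)WP, (unsigned)PC);
    return 0;
}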
Instead of 16 cogs, each with its own registers and CPU, a single fast CPU could switch between 16 register blocks using a register block assignment table. It could work exactly like the current cogs do.
Logically, if your single CPU is fast enough it can be made to do the same work as many slower CPU's.
BUT that assumes you can actually make a single CPU that is 16 times faster than those slower CPU's. Reality is you can not.
The P-2 isn't a 1600 MIPS machine unless you're doing marketing speak. Not unless Chip has found a way to run a highly parallelized app over all 16 with no performance hit.
Sure it is. Sure, it may not be possible to realize all that performance for big algorithms. But that is not the point of the device. If you want that, get a big honking Intel chip.
I don't think CoreMark is an appropriate benchmark for Propeller-like devices. Would you use CoreMark to evaluate FPGAs?
It's pretty much an RTOS, as opposed to the old-school round-robin scheduler that Chip implemented on the P-1 because interrupts weren't possible.
It's not that interrupts are not possible. Interrupts are purposely not a feature of the Propeller. Interrupts are an ugly hack used to give the illusion that a single-CPU machine is actually a multi-processing device. Interrupts come from the time when a CPU was a big expensive thing and you were only likely to ever have one. Interrupts are complicated and break item 3) in my list of requirements for determinism (see earlier post). Interrupts can only meet my requirement 2) to a limited extent, up to the point where you have overloaded your CPU and start blocking out your peer tasks on the machine.
It's not that interrupts are not possible. Interrupts are purposely not a feature of the Propeller. Interrupts are an ugly hack used to give the illusion that a single-CPU machine is actually a multi-processing device. Interrupts come from the time when a CPU was a big expensive thing and you were only likely to ever have one. Interrupts are complicated and break item 3) in my list of requirements for determinism (see earlier post). Interrupts can only meet my requirement 2) to a limited extent, up to the point where you have overloaded your CPU and start blocking out your peer tasks on the machine.
Interesting. I was always under the impression that interrupts existed because you need an efficient way for a largely synchronous device to interact with a largely asynchronous outside world. In other words, you need interrupts when you can't know when an event will occur. Once hardware interrupts were added, it was suddenly obvious that this technique could be used to simulate a multi-processing device. But I don't think this later observation would have come along if interrupts hadn't been first added to handle asynchronous events.
This is actually one area of the Propeller "philosophy" that I struggle with. Because there aren't interrupts (in the traditional sense), you have basically two alternative approaches:
Polling
Blocking
Polling provides the opportunity to check more than one input, perform other tasks, etc. But it also requires semi-predictable events. If the event is truly unpredictable, then polling is very expensive, both in terms of power consumption and loss of effective processing time.
In those cases, you can switch to blocking. In this case, the cog just stalls out and waits for an external event. While this doesn't necessarily make good use of processing resources, it has the benefit that it consumes much less power than polling in a tight loop. Further, you can get more accurate timing, since you aren't performing multi-clock jumps, tests, etc.
But what solution is there that is both power-efficient and processing-efficient? While interrupts may have their own issues, this is one thing that I think they are very good at.
Now, here's the thing: you could actually implement very slow interrupts on the Propeller. How?
In COG A, write code that blocks. When it wakes up, the only thing it does is set a flag in the hub.
In COG B, run a modified SPIN interpreter that checks for the flag at the beginning of the instruction loop, then jumps to some other behaviour.
Yes, it's ugly. No, it's not nearly as efficient as hardware interrupts. But that is essentially what real interrupt hardware is doing: it has a very small circuit that is doing what COG A is doing, and the CPU is essentially doing what COG B is doing.
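Here is a rough C sketch of that two-cog arrangement, with pthreads standing in for cogs and the pin event faked. Only the flag-in-hub idea comes from the description above; the names and timing are made up:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int hub_flag;                       /* the flag long in hub RAM */

static int pin_went_high(void)                    /* fake WAITxxx on a pin    */
{
    usleep(100 * 1000);                           /* pretend the edge arrived */
    return 1;
}

static void *cog_a(void *arg)                     /* the "interrupt" cog      */
{
    (void)arg;
    while (pin_went_high())
        atomic_store(&hub_flag, 1);               /* just raise the flag      */
    return NULL;
}

static void *cog_b(void *arg)                     /* the "interpreter" cog    */
{
    (void)arg;
    for (int i = 0; i < 50; i++) {
        if (atomic_exchange(&hub_flag, 0))        /* check once per "opcode"  */
            puts("soft interrupt: run the handler, then carry on");
        /* ... execute one normal interpreter instruction here ...            */
        usleep(10 * 1000);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, cog_a, NULL);
    pthread_create(&b, NULL, cog_b, NULL);
    pthread_join(b, NULL);                        /* cog_a is killed at exit  */
    return 0;
}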
Now, for those who are used to writing ISRs, you know that you should minimize how long you are in the ISR (particularly if you have disabled interrupts). The same rule applies to COG A in the above example (and really, any time you are blocking/polling), where the "interrupts" are essentially "disabled" because the cog is running code other than the actual blocking WAITxxx.
The nice thing about the Propeller approach is that you can now run the ISR-specific code in the cog that's performing the WAITxxx. This is great, in that it does not cause other code to temporarily stop running when an external event occurs (as happens in hardware interrupt CPUs). Unfortunately, it also means that you have to dedicate a cog to every individual interrupt (or possibly a couple highly-coupled interrupts), which is potentially expensive (even if you have 16 cogs).
I've lost track of what the P2 smart pins will be capable of, but I hope they will be able to capture the active-high or active-low state of a pin (i.e. set the bit high when the desired state is encountered), which is then "cleared" when the value is read by a cog. With this single feature, it should be possible to consolidate multiple ISRs in a single COG. The main code would simply WAIT on a set of pins for any one of them to change. Then it would execute the appropriate ISR for those that changed. In the meantime, the smart pins will still capture any other asynchronous events, so that when the cog returns to its WAITxxx, other asynchronous events will not be lost. You can even implement prioritized ISRs, entirely in software. And, of course, you can still dedicate more cogs when you want faster response times, etc. In fact, I would probably expect this to become a common design approach in P2, where you spread out your ISRs across the unused cogs (if even necessary).
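If the smart pins do end up with that capture-and-clear-on-read behaviour, the consolidated "ISR" cog might be shaped something like this. Everything here is invented (the wait call, the pin assignments, the handlers); it is only meant to show the prioritised dispatch idea:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical hardware behaviour: block until at least one watched pin
   has latched an event, return the latched bits and clear them. Faked
   here with a canned sequence so the sketch actually runs.                */
static uint32_t wait_and_clear_latched(uint32_t mask)
{
    static const uint32_t fake_events[] = { 0x1, 0x6, 0x3 };
    static unsigned n = 0;
    return fake_events[n++ % 3] & mask;
}

static void isr_uart_rx(void) { puts("uart rx"); }
static void isr_timer(void)   { puts("timer");   }
static void isr_button(void)  { puts("button");  }

int main(void)
{
    const uint32_t UART_RX = 1u << 0;      /* invented pin assignments     */
    const uint32_t BUTTON  = 1u << 1;
    const uint32_t TIMER   = 1u << 2;
    const uint32_t ALL     = UART_RX | BUTTON | TIMER;

    for (int round = 0; round < 3; round++) {        /* for (;;) on chip   */
        uint32_t pending = wait_and_clear_latched(ALL);  /* the WAITxxx    */

        if (pending & UART_RX) isr_uart_rx();            /* priority 1     */
        if (pending & TIMER)   isr_timer();              /* priority 2     */
        if (pending & BUTTON)  isr_button();             /* priority 3     */

        /* Anything that fires while a handler runs stays latched in the
           pins and is picked up on the next pass, so nothing is lost.      */
    }
    return 0;
}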
Okay, I think I started to ramble a bit. I'm not sure this is coherent, but I'm too lazy to take the time to edit it. So, here's the thing. If P2 smart pins have a capture mode, then I think the whole "propeller is missing interrupts" argument is effectively moot. Further, I suspect the "propeller doesn't need interrupts" argument will finally be true.
The last version of the P2 before it got restarted had something that was almost like interrupts. You could define a COG task that would sit in a polling loop waiting for an event, and then it would act upon it. Since it was in a separate COG task, it had no impact on the other COG tasks, and each COG task contained its own hardware environment. So it was almost like an interrupt, but not as good.
Interrupts and COGS are really not mutually exclusive. There's no reason that both cannot be supported, other than some esoteric philosophical aversion to it. A chip that supported both features would be much more powerful than one that did not. And if a COG already supports tasks it would be trivial to add interrupt support.
The last version of the P2 before it got restarted had something that was almost like interrupts. You could define a COG task that would sit in a polling loop waiting for an event, and then it would act upon it. Since it was in a separate COG task, it had no impact on the other COG tasks, and each COG task contained its own hardware environment. So it was almost like an interrupt, but not as good.
Interrupts and COGS are really not mutually exclusive. There's no reason that both cannot be supported, other than some esoteric philosophical aversion to it. A chip that supported both features would be much more powerful than one that did not. And if a COG already supports tasks it would be trivial to add interrupt support.
P2 also supported more options for starting COGS. With 16 of them, it becomes more realistic to have a supervisor COG able to start, start at an address, pause, or stop other COGS in response to events it's evaluating.
As for the 1600 MIPS, just take a look at GPU devices. They can really crank through highly parallel problems. The Propeller has similar attributes. It can toss a ton of sprites around, or render a scene in real time, for example. Baggers, Eric Ball and others demonstrated this, using the COGS for dynamic drawing at high efficiency. It can also do a lot of math if needed, but then again it has to, given the lack of higher order math functions. (multiply being the most obvious) If one wants to push around a lot of serial streams, Props do that nicely too. And they do it easily and dynamically. Ask Coley about that one.
The one negative impact interrupts have is the strong potential to reduce the ease of swapping COG tasks out and/or combining them. The strong COG isolation in the P1 is awesome, and one feature that really differentiates it from most other devices. Whatever gets done, I really want to see that continue to be on the table.
Re: Smart Pins
Yeah, having a latch with a few modes would be pretty great. Seconded.
Your impression about interrupts sounds correct to me.
Traditionally there is a program running around on a single CPU doing whatever it does. Then we have some external event that should be handled "now". Could be data arriving that needs saving in a buffer before it's too late, could just be the user hitting a button to say "hey, stop running that program". Whatever.
That's how it was done with the hardware available. But what is it we actually wanted to do? Here is my take:
1) We have some software, the main application, that we would like to have running. We need a CPU for that.
2) We have some other software, the interrupt handler, that we would like to have run when an event occurs. We need a CPU for that.
3) So far we need 2 CPU's but CPU's were big and expensive and we can't afford that. What to do? Of course, add a signal that diverts our CPU from running the app code to running the interrupt code. Bingo, we have invented the interrupt. We have made it look like we have 2 CPU's when we only have one.
So far so good. But the illusion breaks when you push the timing. What happens when you add another event to the system? Now you might have 2 events that need handling "now". Can't be done, you only have one CPU. One of them has to fail.
OK, no probs, we can fix that by introducing buffers into our I/O devices. Or what about having the I/O device use DMA to stuff data from fast event streams into memory and then issue an interrupt when it's done.
Layer upon layer of hardware complexity can be added to the I/O system to make this single-CPU-with-interrupts thing work.
BUT, wait a minute. Why not replace all that buffering and DMA and interrupt circuitry with, well, a CPU that can do that?
Enter the ideas behind the Propeller, the multi-core XMOS devices and others. Not to mention the Intel 8089 I/O coprocessor for the 8086 back in the day. And the Transputer chips from INMOS.
Your impression about interrupts sounds correct to me.
Traditionally there is a program running around on a single CPU doing whatever it does. Then we have some external event that should be handled "now". Could be data arriving that needs saving in a buffer before it's too late, could just be the user hitting a button to say "hey, stop running that program". Whatever.
That's how it was done with the hardware available. But what is it we actually wanted to do? Here is my take:
1) We have some software, the main application, that we would like to have running. We need a CPU for that.
2) We have some other software, the interrupt handler, that we would like to have run when an event occurs. We need a CPU for that.
3) So far we need 2 CPU's but CPU's were big and expensive and we can't afford that. What to do? Of course, add a signal that diverts our CPU from running the app code to running the interrupt code. Bingo, we have invented the interrupt. We have made it look like we have 2 CPU's when we only have one.
So far so good. But the illusion breaks when you push the timing. What happens when you add another event to the system? Now you might have 2 events that need handling "now". Can't be done, you only have one CPU. One of them has to fail.
OK, no probs, we can fix that by introducing buffers into our I/O devices. Or what about having the I/O device use DMA to stuff data from fast event streams into memory and then issue an interrupt when it's done.
Layer upon layer of hardware complexity can be added to the I/O system to make this single-CPU-with-interrupts thing work.
BUT, wait a minute. Why not replace all that buffering and DMA and interrupt circuitry with, well, a CPU that can do that?
Enter the ideas behind the Propeller, the multi-core XMOS devices and others. Not to mention the Intel 8089 I/O coprocessor for the 8086 back in the day. And the Transputer chips from INMOS.
Of course, even with the Propeller we have some of those things. For example, take a look at FullDuplexSerial. It essentially does DMA to hub memory but it does it in software using a complete general purpose processor that probably uses a lot more power and die area than dedicated hardware would use.
As far as I recall XMOS chips do have interrupts. Although I was never quite sure why.
You see the preferred way to program an XMOS is using the XC language. XC is like C but with extensions for parallelism.
In XC everything is event driven. Your threads will block on some pin change or time out or whatever and then continue when that event occurs. Rather like writing COG code and using WAITxxx. In XC you have many threads, perhaps run on the hardware "round-robin" task switcher within a core, perhaps on different cores. In XC there is no concept of "interrupt".
RE: FullDuplexSerial. Yep, a cog like that can be seen as performing DMA. A COG is a smart DMA controller. Smart enough that let's just call it a processor. Besides we cannot distinguish it from the other 7 DMA engines/processors on the chip.
Certainly a CPU uses more hardware and power than a simple dedicated DMA engine. I don't care, the point is to have the flexibility of a CPU over the dedicated use of some counters, buffers and memory read/write logic.
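The "cog as a smart DMA engine" point, reduced to a C sketch. The ring buffer stands in for hub RAM and the byte array stands in for the pin sampling, both invented for the example:

#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 16                              /* "hub RAM" ring buffer    */

struct ring {
    volatile uint8_t  data[RING_SIZE];
    volatile unsigned head, tail;                 /* producer / consumer      */
};

/* What the "DMA cog" does: move bytes from the wire into hub RAM. No
   interrupts anywhere, just a dedicated processor shovelling data.          */
static void dma_cog(struct ring *r, const uint8_t *wire, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        while ((r->head + 1) % RING_SIZE == r->tail)
            ;                                     /* buffer full: wait        */
        r->data[r->head] = wire[i];               /* like a wrbyte to hub     */
        r->head = (r->head + 1) % RING_SIZE;
    }
}

/* What the application cog does later, at its own pace. */
static int ring_get(struct ring *r, uint8_t *out)
{
    if (r->tail == r->head)
        return 0;                                 /* nothing buffered         */
    *out = r->data[r->tail];
    r->tail = (r->tail + 1) % RING_SIZE;
    return 1;
}

int main(void)
{
    struct ring r = {0};
    const uint8_t wire[] = "hello";
    uint8_t c;

    dma_cog(&r, wire, 5);                         /* the "receiver" runs      */
    while (ring_get(&r, &c))                      /* the application drains   */
        putchar(c);
    putchar('\n');
    return 0;
}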
For example, take a look at FullDuplexSerial. It essentially does DMA to hub memory but it does it in software using a complete general purpose processor that probably uses a lot more power and die area than dedicated hardware would use.
More die area, certainly, but the die area is already there; it's not like we're adding it just to run a UART, and it can be used for other things. As to power consumption, remember that FDS spends most of its time in waitcnts, so its power consumption is negligible compared with other apps.
Although they're much more than that, it's helpful to think of the Prop's cogs as microprogrammed "channel" processors, similar to those used in the mainframes of yore. PASM, with its conditional flag setting and conditional execution of any instruction does border on microcode, after all.
As far as I recall XMOS chips do have interrupts. Although I was never quite sure why.
You see the preferred way to program an XMOS is using the XC language. XC is like C but with extensions for parallelism.
In XC everything is event driven. Your threads will block on some pin change or time out or whatever and then continue when that event occurs. Rather like writing COG code and using WAITxxx. In XC you have many threads, perhaps run on the hardware "round-robin" task switcher within a core, perhaps on different cores. In XC there is no concept of "interrupt".
RE: FullDuplexSerial. Yep, a cog like that can be seen as performing DMA. A COG is a smart DMA controller. Smart enough that let's just call it a processor. Besides we cannot distinguish it from the other 7 DMA engines/processors on the chip.
Certainly a CPU uses more hardware and power than a simple dedicated DMA engine. I don't care, the point is to have the flexibility of a CPU over the dedicated use of some counters, buffers and memory read/write logic.
Yes, I understand all of this and I like the idea of having general purpose processors as well. I just wanted to point out that some of the techniques you mentioned that were added to single processors to deal with latency issues still exist on the Propeller but they are implemented in software.
Surely it can't wait on time, because then it would miss edges on the incoming serial. It can't wait on edges on pins, because then it would miss the time for its tx edges.
Basically it thrashes between the rx and tx coprocesses, polling for the tx time or the rx pin.
Or am I remembering this horribly wrongly?
Logically, if your single CPU is fast enough it can be made to do the same work as many slower CPU's.
BUT that assumes you can actually make a single CPU that is 16 times faster than those slower CPU's. Reality is you can not.
True, and that is the main impetus behind multiple cores and parallel processing. On the other hand, being able to distribute a variable portion of the processing power of one very fast CPU among several tasks with no overhead has some advantages over multiple slower CPUs that cannot share CPU power.
Whether it is worth the complexity or has a market is another matter.
@ Dave Hein
Interrupts and COGS are really not mutually exclusive. There's no reason that both cannot be supported, other than some esoteric philosophical aversion to it. A chip that supported both features would be much more powerful than one that did not. And if a COG already supports tasks it would be trivial to add interrupt support.
I agree 100%. Giving each cog the ability to execute some code when one or more pins change state would be a great feature. Having a cog stopped while waiting for an event that happens randomly but requires a fast response is a waste of resources, as is a polling loop that has to execute at high frequency to guarantee the response time.
@ potatohead
The one negative impact interrupts have is the strong potential to reduce the ease of swapping COG tasks out and/or combining them. The strong COG isolation in the P1 is awesome, and one feature that really differentiates it from most other devices. Whatever gets done, I really want to see that continue to be on the table.
Re: Smart Pins
Yeah, having a latch with a few modes would be pretty great. Seconded.
+1 for latch modes.
With 16 cogs I don't see how allowing each of them to have a single interrupt would have an impact on swapping cog tasks or combining them. If anything, I would think having an interrupt available would make combining tasks easier. If you want to swap cog tasks then don't use interrupts in the cog or tasks being swapped.
As it turns out, you're right: not a waitcnt to be seen. It could have been written with waitcnts, though. You only need to sample the input pin at three times the baud rate for reliability. It doesn't have to be sampled continuously.
One thing is a context save of some sort. Could be a stack or register bank, or an instruction to write state somewhere...
Need a trigger of some sort and a means to configure it.
A vector pointing at the alternate execution path.
Honestly, I much prefer the ability to start cogs at given addresses, and without reloading them.
Then the interrupt becomes a software construct, and there are no unplanned execution paths.
And that seems to be the differentiator. On a P1 all execute paths are planned, known. The last design discussion on P2 was also that way.
Write a block of code and know what it will do.
The egg beater HUB complicates this some, but I'm not sure just how that will play out yet. Nobody is, until Chip gets the first image sorted.
Somebody here has quite recently presented a full duplex serial that does use WAITCNT as a timer tick from which both rx and tx are driven. The claim was that it was twice as fast as the old FDS. I can't recall who that was or where it is.
Aside: How is that fibonacci analysis going?
-Phil
Ah! It was you.
That is a really neat bit of code.
I'm curious why you call rx twice in your sync loop?
And similarly for the tx lines.
What a wonderful piece of code, again, from you. I love it. Thank you!
@Heater,
You need new glasses. In the sync part RX is called three times and TX once.
Enjoy!
Mike
Good grief, so I do, and so it is. As it happens I recently managed to smash my new progressive specs, and the old pair I'm using don't work anything like as well as they used to.
So, it's all clear, as it were. This is all about giving more time to the rx "thread".
That really is a fine piece of code.
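For anyone following along, the scheduling trick boils down to a loop shaped like this. This is only a C sketch; the tick pacing and the rx/tx bodies are placeholders, and the 3:1 servicing ratio is the only part taken from the discussion above:

#include <stdio.h>

/* Placeholder coprocess steps: in the real driver each call advances the
   receive or transmit state machine by one step.                          */
static void rx_step(void) { /* sample rx pin, shift bit into current byte */ }
static void tx_step(void) { /* drive tx pin with the next bit to send     */ }

int main(void)
{
    for (int tick = 0; tick < 8; tick++) {     /* for (;;) on the real cog */
        /* waitcnt-style pacing would go here: wait for the next tick.     */
        rx_step();                             /* rx serviced three times  */
        rx_step();
        rx_step();
        tx_step();                             /* tx serviced once         */
        printf("tick %d: rx x3, tx x1\n", tick);
    }
    return 0;
}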
I am still confused about why people miss interrupts on the Propeller. Quite easy to do: take one of those 8 cores and think of it as an interrupt handler. WAITPNE uses a mask. While waiting, the COG does not use much power. If you need timed events independent of pin changes, use a counter and some unused pin.
Event occurs, the cog wakes up from WAITPNE and - well - there is your interrupt handler, the one you would have to write on any other system anyway. So what am I supposed to be missing here? And I still have 7 other independent cores running.
On a multicore chip an interrupt does not need to interrupt the main program at all. It (the task needed) just needs to be done.
Maybe some 'interrupt' object in the OBEX?
Enjoy!
Mike
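Such an object would not need to be much. Here is the shape of it as a C sketch; the pin, the mask and the handler are all invented, and the pin sampling is faked so it runs on a PC. On the chip the busy-wait would just be a WAITPNE:

#include <stdint.h>
#include <stdio.h>

/* Fake pin sampling so the sketch runs on a PC; the watched pin goes high
   on one sample out of every four.                                         */
static uint32_t read_ina(void)
{
    static const uint32_t fake[] = { 0, 0, 1u << 8, 0 };
    static unsigned i = 0;
    return fake[i++ % 4];
}

static void handler(void)              /* the "ISR", living in its own cog   */
{
    puts("pin event handled; the other cogs never noticed");
}

int main(void)
{
    const uint32_t mask  = 1u << 8;    /* watch hypothetical pin P8          */
    const uint32_t state = 0;          /* wait for it to leave this state    */

    for (int events = 0; events < 2; events++) {
        while ((read_ina() & mask) == state)
            ;                          /* on the chip this is the WAITPNE    */
        handler();                     /* do the work right here, no vector, */
    }                                  /* no context save, no main program   */
    return 0;
}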