That would cause me to look for another approach. If the existing implementation allows me to get to RAM in 2 cycles (I can supposedly start a COG in every single SLOT and do a RD ... that's 16 RDs in 32 cycles ... using some scheme), I should be able to implement a dedicated indirected table in its own RAM of arbitrary length and take no additional hit as the length is increased. At least four things have to happen at each increment: (1) The counter needs to be decremented. (2) The table needs to be addressed. (3) The table needs to be read. (4) The COG needs to be selected. Those four things don't change as the length of the table is increased.
And for something like audio, you grab a big chunk of data from the HUB (a QUAD read), and then your COG loop outputs it independently of the HUB. Buffer out the jitter, essentially. All that is needed is for the HUB access throughput to be higher than the rate at which you need to stream the data.
12.5 MHz access rate, and on each access you can get a byte, long, QUAD, word, etc...
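For a rough sense of scale (my own arithmetic, using the figures above): a QUAD is four longs, i.e. 16 bytes, so a 12.5 MHz access rate gives on the order of 200 MB/s of hub bandwidth per cog, while even 24-bit stereo audio at 192 kHz needs only about 1.2 MB/s. A small buffer in COG RAM easily rides out the slot-to-slot jitter.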
That would cause me to look for another approach. If the existing implementation allows me to get to RAM in 2 cycles (I can supposedly start a COG in every single SLOT and do a RD ... that's 16 RDs in 32 cycles ... using some scheme), I should be able to implement a dedicated indirected table in its own RAM of arbitrary length and take no additional hit as the length is increased. At least four things have to happen at each increment: (1) The counter needs to be decremented. (2) The table needs to be addressed. (3) The table needs to be read. (4) The COG needs to be selected. Those four things don't change as the length of the table is increased.
I believe both (1) and (2) would slow down as the table got larger. In the first case, every doubling of the table increases the counter by one bit, increasing the carry chain, and therefore the settle time of the overall counter. In the second case, you end up with a larger addressing mux that also has a slightly longer settle time. And, if one feeds the other, the mux must wait for the counter to settle before it can settle, making their effects additive.
The reason you are seeing drastic timing changes with larger registers and muxes is a limitation of the FPGA. The FPGAs use 4-input lookup tables (newer ones from Xilinx use 6-input; not sure about Altera). So the moment you need to go wider, it needs to daisy-chain another LUT. This will not be the case in the real P1+/P2. Counters can use forward carry to minimise delays. Chip is full of tricks to do this - remember he designed the original P1 using logic.
One or two supercogs would be great. Some time back I suggested using 2 cogs with one using odd unused slots and the other using even unused slots.
The only problem I see with fixing Cog 0 as the supercog is that each cog has a specific set of DAC pins. So if you wanted to use Cog 0's DACs you probably don't want it to be the supercog. That is why I suggested the Table#2 idea - to allocate each slot to a secondary cog. If you allocate them all to 0, then Cog 0 becomes the supercog. But you could set up table#2 as 0,1,0,1.... and cogs 0 and 1 would get the even and odd unused slots.
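For illustration only, a minimal Verilog sketch of how such a Table#2 fallback might look (all module and signal names are invented, not from any actual P1+/P2 source): the slot's normal owner keeps priority, and the secondary table's cog gets the slot only when the owner doesn't want it.

```verilog
// Hypothetical "Table#2" arbiter: the primary (round-robin) cog keeps its
// slot; if it isn't requesting the hub, the slot falls through to the
// secondary cog programmed for that slot. All entries 0 => cog 0 is the
// supercog; 0,1,0,1,... splits spare slots between cogs 0 and 1.
module slot_arbiter(
    input  wire        clk,
    input  wire [3:0]  slot_index,      // current hub slot (0..15)
    input  wire [3:0]  primary_cog,     // normal round-robin owner of this slot
    input  wire [15:0] cog_wants_hub,   // per-cog hub request lines
    input  wire        t2_wr,           // write strobe for a table#2 entry
    input  wire [3:0]  t2_addr,
    input  wire [3:0]  t2_cog,
    output wire [3:0]  granted_cog      // cog that actually gets this slot
);
    reg [3:0] table2 [0:15];            // secondary owner per slot

    always @(posedge clk)
        if (t2_wr) table2[t2_addr] <= t2_cog;

    assign granted_cog = cog_wants_hub[primary_cog] ? primary_cog
                                                    : table2[slot_index];
endmodule
```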
As long as the slot table is not too big, it gives maximum flexibility. I thought 32 was nice, because you can selectively downgrade some cogs while keeping the majority at the standard 1:16 rate (i.e. 2:32).
A supercog (or two) would increase the performance of a cog. Of course it depends on what the other cogs are doing. But usually the main program does not require deterministic operation, just more hub access on average, to run faster. Why artificially limit its use in the name of determinism? If the P1+/P2 could end up being not only the peripheral but also the main processor, then we have a clear winner for some jobs = more sales.
FWIW We are all familiar with P1. If cog0 was a supercog, imagine how much faster SPIN would perform if running in cog0. Same for LMM.
Remember, most cogs are only running smart I/O blocks and do not require a lot of hub bandwidth.
The one issue I see here is that objects (cogs) need another way to signal that data is available (rather than via the hub) so they can minimise hub accesses when not required. e.g. FDX waits on the hub for bytes to send. If it could be signalled that the output pointer has been updated via an alternate method (like PortD or similar) then the object would not need to sit reading the hub pointer. This would free more slots for the supercog. We only really need 1 or 2 bits per cog to do this, so a 32-bit "or"ed bus would be ideal. Perhaps we could even put the cog into low power by some wait method.
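A sketch of what that might look like in Verilog (purely illustrative - no such port exists today, and the names are made up): every cog drives its own 32-bit contribution, and each cog reads back the OR of all of them, so a producer can raise a flag without the consumer polling hub RAM.

```verilog
// Hypothetical OR'ed signalling bus: each cog contributes 32 bits and all
// cogs see the OR of every contribution. A cog could wait on one of these
// bits instead of repeatedly reading a hub pointer.
module ored_signal_bus #(
    parameter NCOGS = 16
)(
    input  wire [NCOGS*32-1:0] cog_out,  // 32 bits driven by each cog
    output reg  [31:0]         bus_in    // what every cog reads back
);
    integer i;
    always @* begin
        bus_in = 32'b0;
        for (i = 0; i < NCOGS; i = i + 1)
            bus_in = bus_in | cog_out[i*32 +: 32];
    end
endmodule
```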
I believe both (1) and (2) would slow down as the table got larger. In the first case, every doubling of the table increases the counter by one bit, increasing the carry chain, and therefore the settle time of the overall counter. In the second case, you end up with a larger addressing mux that also has a slightly longer settle time. And, if one feeds the other, the mux must wait for the counter to settle before it can settle, making their effects additive.
This really is starting to look like FUD. There are no phasing issues regarding sequencing through the table. Use a 10-bit counter (potential for a 1024-element table). It will only take one clock to decrement. There is no carry involved. Pipeline as you need: allow a new address to be placed onto the table bus on one clock and the RAM to be read by the next clock. It is absolutely ridiculous to think this all can't go on in parallel with the 2-cycle instruction time the slot is actually using to access the main bus (the exclusive resource), even if you have to get ready for it 10 clocks ahead. Once it's running, you're getting a new COG selected from the table every clock or two, as you require.
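To make that concrete, here is a minimal Verilog sketch of the proposed sequencer, assuming a 10-bit down-counter, a programmable reload (the modulus), and a small RAM holding one 4-bit cog number per slot. Module and signal names are invented for illustration; this is not taken from any actual P1+/P2 source.

```verilog
// Hypothetical slot sequencer: decrement a slot counter, use it to address
// a table RAM, and register the cog number read out. The registered read
// is the one-clock pipeline stage discussed above; the reload value (the
// modulus) is applied when the counter reaches zero.
module slot_table #(
    parameter AW = 10                    // 10-bit index -> up to 1024 slots
)(
    input  wire          clk,
    input  wire          rst,
    input  wire [AW-1:0] reload,         // modulus - 1 (index of the last slot)
    input  wire          wr_en,          // table write strobe
    input  wire [AW-1:0] wr_addr,
    input  wire [3:0]    wr_cog,
    output reg  [3:0]    cog_sel         // cog granted the next hub slot
);
    reg [3:0]    table_ram [0:(1<<AW)-1];
    reg [AW-1:0] slot;                   // current slot index

    always @(posedge clk) begin
        if (wr_en)
            table_ram[wr_addr] <= wr_cog;

        if (rst || slot == {AW{1'b0}})
            slot <= reload;              // a modulus change takes effect here
        else
            slot <= slot - 1'b1;

        cog_sel <= table_ram[slot];      // registered (pipelined) table read
    end
endmodule
```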
The reason you are seeing drastic timing changes with larger registers and muxes is a limitation of the FPGA. The FPGAs use 4-input lookup tables (newer ones from Xilinx use 6-input; not sure about Altera). So the moment you need to go wider, it needs to daisy-chain another LUT. This will not be the case in the real P1+/P2. Counters can use forward carry to minimise delays. Chip is full of tricks to do this - remember he designed the original P1 using logic.
Sure, but the FPGA can run > 200MHz on 32b simple counters, so it is best to not impact that speed, as FPGA emulation is an important test point. The delays are not really counter issues, as I have 32b counters > 200MHz.
The speed impacts will also exist in ASICs, as the logic trees grow, but they should be well outside critical paths.
This really is starting to look like FUD. There are no phasing issues regarding sequencing through the table. Use a 10-bit counter (potential for a 1024-element table). It will only take one clock to decrement. There is no carry involved. Pipeline as you need: allow a new address to be placed onto the table bus on one clock and the RAM to be read by the next clock. It is absolutely ridiculous to think this all can't go on in parallel with the 2-cycle instruction time the slot is actually using to access the main bus (the exclusive resource), even if you have to get ready for it 10 clocks ahead. Once it's running, you're getting a new COG selected from the table every clock or two, as you require.
I think we are talking about two different things here. I'm not talking about the number of clock cycles, rather the duration of a single cycle (which I believe is what jmg was referring to in his timing analysis). A simple 10-bit ripple counter will take approximately twice as long to settle as a simple 5-bit ripple counter (there is a carry between every additional bit). Likewise a 1024:1 mux is going to effectively have something like five 2:1 mux levels more than a 32:1 mux, which also roughly doubles the settle time. If either or both of these are in the critical path, they will definitely slow the clock (and therefore lower fMax). The trick is to figure out how to minimize the impact on the critical-path timing.
And that's what Cluso was alluding to when he mentioned the use of carry look-ahead techniques and Chip's work on the P1. And if you can split these operations over multiple clock cycles (which I think is what you were alluding to), then that also potentially minimizes the impact on the critical path. (This, for example, is why the ALU in the P1+ is active for two clock periods. Due to the lateness of some of its inputs, it cannot settle within a single clock cycle unless you reduce fMax.)
There are also practical considerations in how you would maintain such a large table. The data bus between the cogs and the hub is 128 bits wide. This allows up to 32 4-bit values to be transferred in a single instruction cycle (and, as this is a hubop, within a single hub slot). To set a 1024-element table, you would have to perform the hubop 32 times. And while you are doing that, the hub is iterating over the very same table, making it impossible to update the entire table as an atomic operation. Sure, you could add additional mechanisms to prevent this (locking the hub to the current cog while performing the updates would be one example), but this just adds more complexity (both in the circuitry and in the code), as well as potentially causing other unexpected side effects in the other cogs.
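To spell out the arithmetic behind that (simple math, not from any spec): 1024 entries x 4 bits = 4096 bits, and 4096 / 128 = 32 separate 128-bit writes - i.e. 32 hub slots consumed just to rewrite the table once.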
One or two supercogs would be great. Some time back I suggested using 2 cogs with one using odd unused slots and the other using even unused slots.
The only problem I see with fixing Cog 0 as the supercog is that each cog has a specific set of DAC pins. So if you wanted to use Cog 0's DACs you probably don't want it to be the supercog. That is why I suggested the Table#2 idea - to allocate each slot to a secondary cog. If you allocate them all to 0, then Cog 0 becomes the supercog. But you could set up table#2 as 0,1,0,1.... and cogs 0 and 1 would get the even and odd unused slots.
Correct, COG 0 alone makes no sense, as it cannot USE all that BW anyway (which is why it is not taken seriously).
Once you accept the need to have a user choice on supercog(s) - driven by Pin-DAC if nothing else - and that 2 COGs clearly make more sense, voila, you are now back at Table#2.
I'm confused, we must be talking at cross purposes. We started out weighing up the merits of interrupts vs multiple cores to handle external events and then you start discussing HUB sharing, slot timing, consequent jitter and parallel algorithms.
My simple point was that an interrupt requires processing power to handle it. That processing power can be "borrowed" from the currently running code on a single-core machine (a traditional interrupt handler), or you can have a dedicated CPU to handle it.
I see no situation where one would prefer to have one's code disrupted by an interrupt rather than just be able to keep running and have something else deal with it in parallel.
Except of course if it's just too big and/or expensive to have those extra cores. As it has been historically.
I do agree about the idea of a Propeller as an I/O processor for some other processor.
I think we are talking about two different things here. I'm not talking about the number of clock cycles, rather the duration of a single cycle (which I believe is what jmg was referring to in his timing analysis). A simple 10-bit ripple counter will take approximately twice as long to settle as a simple 5-bit ripple counter (there is a carry between every additional bit). Likewise a 1024:1 mux is going to effectively have something like five 2:1 mux levels more than a 32:1 mux, which also roughly doubles the settle time. If either or both of these are in the critical path, they will definitely slow the clock (and therefore lower fMax). The trick is to figure out how to minimize the impact on the critical-path timing.
And that's what Cluso was alluding to when he mentioned the use of carry look-ahead techniques and Chip's work on the P1. And if you can split these operations over multiple clock cycles (which I think is what you were alluding to), then that also potentially minimizes the impact on the critical path. (This, for example, is why the ALU in the P1+ is active for two clock periods. Due to the lateness of some of its inputs, it cannot settle within a single clock cycle unless you reduce fMax.)
There are also practical considerations in how you would maintain such a large table. The data bus between the cogs and the hub is 128 bits wide. This allows up to 32 4-bit values to be transferred in a single instruction cycle (and, as this is a hubop, within a single hub slot). To set a 1024-element table, you would have to perform the hubop 32 times. And while you are doing that, the hub is iterating over the very same table, making it impossible to update the entire table as an atomic operation. Sure, you could add additional mechanisms to prevent this (locking the hub to the current cog while performing the updates would be one example), but this just adds more complexity (both in the circuitry and in the code), as well as potentially causing other unexpected side effects in the other cogs.
Well, I had expected to put off climbing the FPGA/Verilog learning curve a little longer, but the time has come. My platform is the Terasic Cyclone V GX Starter board. I'll see how well my gut feelings are serving me. After refamiliarizing myself with the pipelined multiplier example I'll try to implement a RAM table of configured length and preset COG sequence, indexed by a 10-bit counter with a fixed reset value (the modulus). This will read the table and return the COG number next to access the shared resource ... e.g. the bus. If this is rocket science, I will become a rocket scientist. The model itself is far from rocket science.
Re: Setting up the table: That can be a second step for study. For now, I can assume that the table is populated and the modulus is set during the initial load and startup by COG0, using some magic new SPIN instructions ... kind of like we set _clkmode and _xinfreq now. It will be like the configuration steps I go through with brand-x. Later on I can explore what my issues are in manipulating this modulus and table while the HUB is sequencing. There will be the obvious things to avoid, like stepping on oneself as the table is changed. But my gut tells me it is possible and simple. Something like COGSLOT(cog,slot). Using this model it shouldn't matter whether the COG has been started or not. I'm assuming the SLOT sequencer can detect whether the COG in the SLOT has been initiated and idle through it if it has not. It must be doing just that now in P1.
This is probably something someone experienced can accomplish in less than an hour. It could take me weeks.
I'll try to implement a RAM table of configured length indexed by a 10-bit counter with a fixed reset value (the modulus). This will read the table and return the COG number next to run. If this is rocket science, I will become a rocket scientist.
Enough conjecture! First just code the counters, and you will find that a 10-bit counter will be slower than a 5-bit counter.
Later on I can explore what my issues are in manipulating this modulus and table while the HUB is sequencing.
Change of ReLoad is simple and safe, as it applies when Counter = 0000, i.e. it applies at the next scan.
Change of Slot Alloc will need dual-port memory, but I'm not sure FPGAs can do asymmetric dual-port, which would give an atomic write. Fast run-time changes to this will be useful. (Write port 128 bits wide, read port 4 bits wide.)
With block RAM I have a more complex dual-table working at > 300MHz in Lattice, which is roughly equivalent to Cyclone IV.
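For what it's worth, a behavioural Verilog sketch of that asymmetric arrangement (hypothetical names; whether a given FPGA's block RAM will actually infer a 128-bit write / 4-bit read port pair depends on the vendor and tools, which is exactly the open question above):

```verilog
// Hypothetical asymmetric-port slot table: one 128-bit write fills 32
// 4-bit entries at once (so a whole 32-slot group updates atomically),
// while the sequencer reads one 4-bit entry per clock.
module slot_table_wide #(
    parameter AW = 10                    // 1024 x 4-bit entries
)(
    input  wire           clk,
    input  wire           wr_en,
    input  wire [AW-6:0]  wr_row,        // which group of 32 entries to write
    input  wire [127:0]   wr_data,       // 32 packed 4-bit cog numbers
    input  wire [AW-1:0]  rd_addr,       // entry the sequencer wants
    output reg  [3:0]     rd_cog
);
    reg [3:0] table_ram [0:(1<<AW)-1];
    integer i;

    always @(posedge clk) begin
        if (wr_en)
            for (i = 0; i < 32; i = i + 1)
                table_ram[wr_row*32 + i] <= wr_data[i*4 +: 4];
        rd_cog <= table_ram[rd_addr];    // registered 4-bit read port
    end
endmodule
```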
I'm confused, we must be talking at cross purposes. We started out weighing up the merits of interrupts vs multiple cores to handle external events and then you start discussing HUB sharing, slot timing, consequent jitter and parallel algorithms.
No we didn't. I addressed your question of what was a large application and why I would favor interrupts over COG sequencing for those.
My simple point was that an interrupt requires processing power to handle it. That processing power can be "borrowed" from the currently running code on a single-core machine (a traditional interrupt handler), or you can have a dedicated CPU to handle it.
And my point is that the Propeller is not going to be good at that. I wouldn't consider the Propeller for anything above layer 2.
I do.
I see no situation where one would prefer to have one's code disrupted by an interrupt rather than just be able to keep running and have something else deal with it in parallel.
Except of course if it's just too big and/or expensive to have those extra cores. As it has been historically.
It's using the right tool for the job. An example strategy might be the FPGA chips now coming with an SoC. Some things are better done there than in the FPGA. They also come with hardware SPI and I2C and even some DSP modules. What the FPGA then brings to the party is a fabric for connecting all that stuff.
I do agree about the idea of a Propeller as an I/O processor for some other processor.
I look at it two ways. One is as you say. The other is kind of as a little operating system taking output from other specialized chips (like the brand-x PIC24FJ... Intelligent Integrated Analog chip) and delivering it to other chips (like SRAM). If the Propeller had SPI or I2C modules, that would be a plus. But supposedly I could load such modules in a COG. I would prefer bullet proof modules in hardware though.
I still don't get it. Can you suggest an example where having code execution stopped in order to do something else is preferable to having execution continue while some other hardware, like another core, handles the event?
Saying you can see such situations without telling us what they might be is not helpful.
I'm not sure why we are introducing network layer terminology here. But as an analogy we can say that it depends on what you are doing. Perhaps a Prop is just providing some interface to some other system and is hence down a "layer 2". Or perhaps the Prop is the entire system, as is often the case, then it occupies all layers of the stack.
I would prefer bullet proof modules in hardware though.
What is not "bullet proof" about creating a device in software on a processor as compared to building it in logic gates? From a purely theoretical point of view any logical function that can be designed with a sea of gates can be written as a program. So give the software solution is fast enough we are good to go.
If you really want dedicated hardware peripherals then the Prop is not the choice for you. There are thousands of other devices out there that provide that already.
Do I care? It's running in parallel. It only has to be as fast as the HUB rotation sequencing (now 2 clocks).
Checking Counters is a good way to check your tool flows, so you know the speed of counters, and Muxes, then you can reality-check what the tools tell you.
HUB scan rate is also not 2 clocks, but perhaps you don't care about that either ?
Re counters: Will keep that in mind. My FPGA adventure starts today. I'm not a quick study.
Re HUB scan clocks: Came up with the 2 cycles by logic. HUB rotates in 32 clocks. Stops 16 times in a rotation. That's 2 clocks per stop.
With no detailed model (no ... the diagram on the datasheet is not a detailed model), I conceive my own model. And then you guys poke holes in my conceptual model and I refine it. My current model conception is different than my first conception. I presume that's progress. Without knowing how COGs are selected onto the bus to accomplish things like RD and WR, I'm left guessing ... and that means saying "how would I do it?"
But before going into hibernation with FPGA/Verilog, I will leave with this: I remain unconvinced that an indirect time slot/COG table of substantial length (say 1000) and variable modulus is not better and simpler and handles more use cases (i.e. all of them) than anything else I've seen described here.
Happy learning!
I still don't get it. Can you suggest an example where having code execution stopped in order to do something else is preferable to having execution continue while some other hardware, like another core, handles the event?
Stopping is never preferable to continuous, uninterrupted operation. But my dialog started here with the impression that the Propeller would benefit by not having to ever wait more than the number of clocks required for the instruction. Initially I thought doubling the slots (and the hardwired cogs) made matters worse. I still do. I'm not comfortable with the impossibility of doing an atomic RD, update, WR ... especially when I know it's possible to have that capability with the Propeller using an indirected time slot/cog table for cog sequencing. I'm really uncomfortable with the concept of so-called "super cogs".
Saying you can see such situations without telling us what they might be is not helpful.
No, but going into a long statistical analysis based in conjecture that no one reads and generates more questions than it answers is not helpful either.
I'm not sure why we are introducing network layer terminology here.
It's the nature of the network layer model. Things at layer 1 are down on the hardware and more predictable than any of the upper layers. As you move up, the complications of the application defy predictability.
Obviously you are doing different kinds of things at the different layers. But if you only have a hammer, everything looks like a nail.
Perhaps a Prop is just providing some interface to some other system and is hence down a "layer 2". Or perhaps the Prop is the entire system, as is often the case, then it occupies all layers of the stack.
Again, the higher you move up the stack, the less predictable the code becomes. RTOS does what it can to give predictability, but can only do so much.
What is not "bullet proof" about creating a device in software on a processor as compared to building it in logic gates?
Timing. And predictability is not the issue ... predictability in general practice is. As long as software is changeable, it's not bullet proof. An SPI in hardware would have no concern about what the software was doing. There's always been a tendency to commit software to hardware as soon as it becomes stable and widely used ... and for obvious reasons. Arguing against myself, I like the concept of the software radio.
From a purely theoretical point of view any logical function that can be designed with a sea of gates can be written as a program. So given the software solution is fast enough, we are good to go.
That "given" is a real reach. I'm not ready for a philosophical discussion or religion.
If you really want dedicated hardware peripherals then the Prop is not the choice for you. There are thousands of other devices out there that provide that already.
Bingo!
I'm now off to FPGA/Verilog land. I've made my points. A word to the wise is sufficient. I'll continue looking in occasionally but will diligently try to discipline myself to stay out of the loop.
I remain unconvinced that an indirect time slot/COG table of substantial length (say 1000) and variable modulus is not better and simpler and handles more use cases (i.e. all of them) than anything else I've seen described here.
Of course a larger table is more granular than a smaller one. That is self evident.
As a larger variant, it cannot be called 'simpler'.
However, you do get diminishing returns, and once you go below 3% slot / 50% share levels, the 'gains' are at the asymptote.
A large table has costs in atomic access, silicon area, time, and even just the code needed to change/manage it.
Your 500 bytes is a LOT of useful dual-port RAM, and many would rather see some flipped to use between COGs, for example. i.e. you can over-solve a problem.
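For reference, the arithmetic behind that figure: 1000 entries x 4 bits = 4000 bits = 500 bytes.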
1000 entry tables? 500 bytes of RAM required?
The whole point of Heater's solution was that it is the simplest possible way to solve a perceived problem. Why do some people insist on massively overcomplicating this? Do they really not want any solution at all to emerge?
Ross.
Makes you wonder, doesn't it.
And it makes Heater's solution - which requires no tables and no messing about - look even more awesome!
Ross.
True for applications consisting of one large chunk of code served by 15 tiny slaves. Not so awesome for a few mid-sized code blocks. Being able to split the functions an application needs between cogs is one of the big pluses of using the Propeller.
That particular use case is the only mainstream use case that anyone has been able to clearly identify. It simply doesn't make sense to design a chip to cater for other edge cases just because people "think" they might be handy.
Ross.
That may be the case now, however that may have more to do with the limited size of hub memory and having only 8 cogs to work with. With 16 cogs and 512K it will be practical to do more by dividing tasks even further.
An application that processes multiple inputs could then be divided into 3 levels. Hub tasks would consist of the main program and a few tasks that process incoming data, display results, summarize and store data, etc. Each of them would need a cog and hub access. Some of those tasks would also need another cog for the high-speed I/O, and hub access to pass the data to its hub task.
I want the new chip now!
Ross.
That particular use case is the only mainstream use case that anyone has been able to clearly identify. It simply doesn't make sense to design a chip to cater for other edge cases just because people "think" they might be handy.
If Chip had taken that attitude when first considering the Propeller, he probably wouldn't have made it. It's always easy to dismiss "edge cases" when there's no precedent for such things on existing hardware. One of the things that makes the Propeller so great is that it breaks the rules of the "mainstream" mould, allowing us to redefine what's "mainstream" and what's an "edge case". Don't go pigeonholing the Propeller's applicability until we have actual silicon.