Instead <snip>You have a system ClockTick register (CT) and the three user Clock Tick registers CT1, CT2 and CT3 and your instructions are all nice again:
GETCT (get clock ticks)
ADDCT1, ADDCT2, ADDCT3 - add S/# to D (and store result in CT1/2/3)
WAITCT1, WAITCT2, WAITCT3
POLLCT1, POLLCT2, POLLCT3
CNT just ceases to be.
That works too...
ClockTick register (CT) is easy to follow, and the primary opcode action is not the 'squashed' word.
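To make the proposed naming concrete, here is a rough Python model of how GETCT, ADDCT1, POLLCT1 and WAITCT1 might behave. It is only a sketch: the class name, the register dictionary, and the "target reached" rule (a signed 32-bit comparison of CT against CT1) are assumptions for illustration, not something stated in the thread.

```python
MASK32 = 0xFFFFFFFF  # registers are modeled as 32-bit values


class CogTimerModel:
    """Toy model of the system tick counter CT and one user target, CT1."""

    def __init__(self, ct=0):
        self.ct = ct & MASK32  # free-running system tick counter
        self.ct1 = 0           # user clock tick target register
        self.regs = {}         # hypothetical cog registers, keyed by name

    def getct(self, d):
        # GETCT D: copy the current tick count into D
        self.regs[d] = self.ct

    def addct1(self, d, s):
        # ADDCT1 D,S: D := D + S, and the same sum becomes the CT1 target
        total = (self.regs.get(d, 0) + s) & MASK32
        self.regs[d] = total
        self.ct1 = total

    def pollct1(self):
        # Assumption: the target counts as reached once (CT - CT1),
        # read as a signed 32-bit difference, is zero or positive.
        return ((self.ct - self.ct1) & MASK32) < 0x80000000

    def waitct1(self):
        # WAITCT1: let the counter run until the target is reached.
        # Returns the number of ticks that elapsed while waiting.
        waited = 0
        while not self.pollct1():
            self.ct = (self.ct + 1) & MASK32
            waited += 1
        return waited
```

With the counter started just below the 32-bit rollover, GETCT followed by ADDCT1 with an offset of $20 gives a target that has wrapped past zero, and the wait still lasts exactly $20 ticks.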
This smells the best!
I will make this change in the next release. GETCNT will become GETCT.
Did something change in the interim, or does RDLUT remain the only instruction that takes three clock cycles to execute?
And if so, by latching (or extending) the destination's D address and the data value read from LUT RAM for one more cycle, couldn't you use the unused WR slot in the next instruction's GET phase to move the data to its final destination?
Everything but hub and WAITxxx instructions are two clocks, if I'm not missing anything.
Cool. Weren't all the branch instructions four clocks? I must have missed a change somewhere.
You're right. My mind is a steel sieve.
Both cog RAM access slots are used at the end of RDLUT, but even if there was one free, the data forwarding circuitry still needs to have the result in case the next instruction were to read the location that received the RDLUT value.
Sorry, I forgot that the data forwarding circuit needs the result ready in case the next instruction uses it.
Either way, being able to span linearly from two to four cycles per resolved instruction is useful for timing and syncing purposes (at the cost of using some available scratchpad location as a dummy destination address).
As a side note, it will also ease the coding for two threads (or COGs) doing cooperative work on an odd/even short cycle count basis, e.g., early/late data sampling.
$1F8 PTRA pointer B to hub RAM
$1F9 PTRB pointer A to hub RAM
Also, I think I'd prefer "SETCNT1" to "ADDCNT1".
"add" seems to me to imply adding something to CNT1 and I didn't see any way to set its initial value.
But, after I read it, I see that you are setting CNT1 to sum of S and D.
So, you are doing an "ADD", but to me, the "SET" is the important part there...
Question:
Do all cogs toggle output pins at the exact same time still? I remember a long time ago that this was one of the initial features of P2-Hot. Is it still in there?
I'll fix that typo. Thanks.
Because instructions take at least two cycles, cogs will not always update pins at the same time, unless you coordinate them to do so.
I think I recall that in P1 there are different propagation delays between different cogs and I/O pins...
Think I remember that was fixed for P2 at one point...
On the P1, those are sub-clock differences that come from propagation delays through the OR gates.
On the P2, there are flops after the OR gates to align everything.
I know we are waiting on smart pins, but the smart thing to do when combining the outputs would be to AND them rather than OR them, so that any low produces a low. Many signals are active low, so multiple cogs could leave the pin high and another could take it low. The same goes for serial transmit.
What!? Peter, you're asking for something, on a chip wide basis, that is just trading one problem for another. You're not getting anything improved by doing it.
And systems like that use negative logic or default to a high-output, both are counter-intuitive.
You have two cogs calling the same routines at different times to activate a chip select. That chip select is active low but the OR outputs mean that the line needs to idle high but that blocks the other cog. Two or more cogs call a serial transmit, say for debugging, and this is of course "active low" as it needs to idle high but the transmit is blocked because it is OR'ed. There is nothing counter-intuitive about this, if only one cog wants to set an output it works the same since the inputs of the "AND" (or negated input NOR) that aren't being driven would be "float high" just as the current OR is "float low" if you want to think of it that way.
I've had plenty of experience with the Prop and countless hardware designs and this one thing would have made such a difference to how I can have multiple cogs share I/O. It's a very important point and surely anyone who has had much to do with Prop design must know this.
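The shared chip-select scenario can be sketched in a few lines of Python. This is only an illustrative model of the two combining schemes being argued about; the function names are made up here, and each list entry stands for the level one cog contributes to the pin.

```python
def combine_or(levels):
    """Current scheme: the pin is the OR of every cog's contribution."""
    pin = 0
    for level in levels:
        pin |= level
    return pin


def combine_and(levels):
    """Proposed scheme: the pin is the AND of every cog's contribution."""
    pin = 1
    for level in levels:
        pin &= level
    return pin


# Two cogs share an active-low chip select and both hold it high at idle.
# Cog A now tries to assert it by driving low while cog B is still idle.
cog_a, cog_b = 0, 1

blocked = combine_or([cog_a, cog_b])    # stays high: cog B's idle 1 wins
asserted = combine_and([cog_a, cog_b])  # goes low: any low wins
```

For a single driving cog the two schemes behave the same, as long as undriven inputs are taken as 0 under OR and as 1 under AND.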
You have two cogs calling the same routines at different times to activate a chip select. That chip select is active low but the OR outputs mean that the line needs to idle high but that blocks the other cog. Two or more cogs call a serial transmit, say for debugging, and this is of course "active low" as it needs to idle high but the transmit is blocked because it is OR'ed.
Yes, other MCUs have pullups and Port modes, and open-Drain designs that give a natural OR.
Chip-Select, default async serial, and I2C are all active-low in action.
Most INT signals from Peripherals are Active-Low.
CPLDs have one method to support both 'Default H and Default L' mixed in a design, and that is a Pin-Polarity fuse.
I/O inversion would be more suitable as that then allows the internals to hold positive logic. UARTs use inverters, I2C is open-collector, pull-downs also another option, the list goes on ... None of that has anything to do with the Cog combining logic. Messing with the OR structure will just be annoying.
I suspect there are plenty of I/O pin config options there right now in the custom layout of the pin structures, but we are still waiting on Smartpins to provide access to its config, and of course the real silicon. The FPGA doesn't have the same arrangement, so using/testing of that functionality will be limited at best.
Speaking of which, I wonder how much need there is for having the direction control inside every Cog now? We could have a PortC in those addresses.
The direction control in every cog is necessary for fast mode changes, such as with fast I2C, one-wire interfaces, and capacitive "button" tricks.
However, this got me thinking: why can't it be implied by writing to OUTx or reading from INx? Read-after-write might require a couple of extra clock cycles to let the input settle before capture, but this is no worse than the time taken to set DIRx. Write-after-read should be two clock cycles faster than the current approach.
This would free up two registers, though I don't know if we would be able to get an internal PortC in return.
Edit: oh, wait. That won't work. There'd be no way to indicate which bits in the read or write are significant. Hence the need for DIRx.
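The point in the edit above can be shown with a small Python sketch. It assumes 32-bit OUTx/DIRx words and a hypothetical helper name; it is not a description of the actual pin logic.

```python
def pins_driven(out_word, dir_word):
    """Return {bit: level} for the pins this cog actually drives.

    Only bits set in dir_word are significant; the other bits of
    out_word are ignored, which is the information a bare write to
    OUTx could not carry on its own.
    """
    return {
        bit: (out_word >> bit) & 1
        for bit in range(32)
        if (dir_word >> bit) & 1
    }


# The same OUT word means different things under different DIR masks.
two_pins = pins_driven(0b0101, 0b0011)  # drives bit 0 high and bit 1 low
no_pins = pins_driven(0b0101, 0b0000)   # drives nothing at all
```

Without the mask there would be no way to tell whether the 0 in bit 1 means "drive low" or "leave this pin alone".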
Should this chip be called something other than "Propeller 2"? It seems like it is diverging from P1 in many ways and claiming they are the same architecture may be misleading.
Comments
Put another way:
SETEDG #cfg1
' some time passes and the edge event flag is set
SETEDG #cfg2
POLLEDG
In the above code, is POLLEDG necessary to clear the pending event flag from "cfg1", or does the second SETEDG do that automatically?
Edit: never mind. I just read Chip's documentation, which clearly states that it is cleared by the SETxxx instruction.
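As a rough Python model of that answer: reconfiguring the event clears any pending flag, so the POLLEDG in the snippet is not needed just to discard the stale event. The class and the `trigger` method are hypothetical stand-ins for the hardware, and the assumption that polling also clears the flag comes from the wording of the question, not from the documentation itself.

```python
class EdgeEvent:
    """Toy model of one edge event with a pending flag."""

    def __init__(self):
        self.config = None
        self.flag = False

    def setedg(self, config):
        # SETEDG: install a new configuration. Per the documentation
        # cited above, this also clears the pending event flag.
        self.config = config
        self.flag = False

    def trigger(self):
        # Hypothetical stand-in for the hardware detecting an edge.
        self.flag = True

    def polledg(self):
        # POLLEDG: report whether the event fired.
        # Assumption: polling clears the flag as a side effect.
        was_set = self.flag
        self.flag = False
        return was_set


event = EdgeEvent()
event.setedg("cfg1")
event.trigger()                # some time passes and the edge event flag is set
event.setedg("cfg2")           # the stale flag from cfg1 is cleared here
stale_seen = event.polledg()   # nothing pending from cfg1 any more
```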
Henrique
I forgot that RDLUT takes three clocks.
LUT exec does run at regular speed though, I assume.
LUT exec runs at full speed, yes.
The reason RDLUT takes three clocks:
1) issue LUT read
2) capture LUT data
3) done, mux data through result mux, fan out to all result captures
The reason WRLUT takes two clocks:
1) issue LUT write
2) done
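The two stage lists above can be turned into a small cycle-count helper in Python. This is only a bookkeeping sketch: the two-clock default for other instructions follows the earlier remark in this thread, and hub, WAITxxx and branch instructions are deliberately not modeled.

```python
# Stage lists copied from the explanation above, one stage per clock.
LUT_STAGES = {
    "RDLUT": ["issue LUT read", "capture LUT data", "mux result and fan out"],
    "WRLUT": ["issue LUT write", "done"],
}

# Assumption: every other instruction passed in is an ordinary
# two-clock instruction (no hub, WAITxxx, or branch instructions).
DEFAULT_CLOCKS = 2


def clocks_for(mnemonic):
    """Clock count for one instruction under the assumptions above."""
    stages = LUT_STAGES.get(mnemonic)
    return len(stages) if stages else DEFAULT_CLOCKS


def total_clocks(program):
    """Sum the clock counts for a list of mnemonics."""
    return sum(clocks_for(mnemonic) for mnemonic in program)


# An RDLUT into a dummy destination adds one odd clock to a sequence.
padded = total_clocks(["ADD", "RDLUT", "WRLUT"])
```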
The hub exec FIFO-level bug was fixed and GETCNT was renamed to GETCT.
This section was a big help!! (OK, it certainly gave me plenty to play with this afternoon... I don't think anybody at work noticed!)
EDIT: Only 10 COGS on the 1-2-3! What's this world coming to!!! Time for those A9 boards!!
Saw this typo:
Also, I think I'd prefer "SETCNT1" to "ADDCNT1".
"add" seems to me to imply adding something to CNT1 and I didn't see any way to set its initial value.
But, after I read it, I see that you are setting CNT1 to sum of S and D.
So, you are doing an "ADD", but to me, the "SET" is the important part there...
Question:
Do all cogs toggle output pins at the exact same time still? I remember a long time ago that this was one of the initial features of P2-Hot. Is it still in there?
I'll fix that typo. Thanks.
Because instructions take at least two cycles, cogs will not always update pins at the same time, unless you coordinate them to do so.
I think I recall that in P1 there are different propagation delays between different cogs and I/O pins...
Think I remember that was fixed for P2 at one point...
On the P1, those are sub-clock differences that come from propagation delays through the OR gates.
On the P2, there are flops after the OR gates to align everything.
The "Prop123_A7_Prop2_v3.rbf" still has 11 working cogs not 10.
That sounds right. It's the DE2-115 that has 10.
How about having AND instead of OR?
Wondering if it would be nice to have aliases:
addct1 = addcnt
waitct1 = waitcnt
Then those who don't use interrupts can keep syntax simple...
I don't see how that is simpler.