Might it be possible (and worth it) to window the smart pins into a small block in the hub ram ???
One of the original ideas was to have pin independence so Cogs aren't competing for what shouldn't be a shared resource.
On the other hand, we are probably now talking more about config than throughput. Compared to the Prop1, we're ending up with a lot of automatic hardware that just needs configured now.
Yes, I think Chip is considering mainly Config/setup.
I think there still needs to be a simple, direct path to allow (for example) multiple COGS to SYNC, with SysCLK granularity, to a single shared external edge.
IIRC, that was via a MUX to swap-in a flag to the usual simple direct pin IO that is there now.
I found a better way to talk to smartpins: 4-bit pathways that can move 32 bits in 8 clocks.
...
This seems to make the best balance of performance and silicon.
Sounds good.
There is still a Flag to Pin path Mux, for shared polling for example ?
Is there an address field sent too, for 36 or 40b frames to the pins - some means to select which PinCell.reg is used would be needed ? - that will add a few more clocks ?
With a slowish read path, more buffering will be needed at the pins.
eg for the Toggle-at-match PWM variant mentioned above, that is 2 compare registers, and it is also useful to have at least 2 capture registers (usually shared with compare), to allow for _/= and =\_ capture, down to 1 sysclk wide.
I think Microchip has a small FIFO on some capture units, which of course is nice, but may be too costly here.
Are we going to need smart pins to talk to the external memory on the P123 A9?
What kind of benchmarks are we talking about for this?
One smartpin will be needed to output NCO MSB to generate the SDRAM clock, but the rest will be handled by instructions and the streamer.
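For anyone wanting to put numbers on the "NCO MSB as clock" idea, here is a minimal sketch. It assumes a 32-bit accumulator that adds once per system clock (the width is a guess based on WSLONG, and all function names are made up for illustration), in which case the MSB averages f_sys * adder / 2^32.

```python
# Sketch: frequency of an NCO's MSB output.
# ASSUMPTIONS: 32-bit accumulator, one add per system clock; the function
# names are hypothetical and not part of any P2 tool or instruction set.

NCO_BITS = 32

def nco_adder_for(f_out_hz: float, f_sys_hz: float) -> int:
    """Adder value whose MSB output averages f_out_hz."""
    return round(f_out_hz * (1 << NCO_BITS) / f_sys_hz)

def nco_frequency(adder: int, f_sys_hz: float) -> float:
    """Average MSB frequency produced by a given adder value."""
    return f_sys_hz * adder / (1 << NCO_BITS)

def nco_msb_stream(adder: int, clocks: int):
    """Yield the MSB on each clock, starting from a zeroed accumulator."""
    acc = 0
    for _ in range(clocks):
        acc = (acc + adder) & ((1 << NCO_BITS) - 1)
        yield acc >> (NCO_BITS - 1)
```

Adder values that are not a power of two still give the right average frequency, but individual edges land on whole system clocks, so there is up to one clock of jitter.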
For the newer clk-echo memories,(ISSI,Micron,Spansion et al) one more pin would be needed for the clk back.
Also, a simple Pin-Mux-reg is needed for DDR@Pins to wider Clk sync'd data paths inside P2
(or, a /2 at the pin, but that halves the data rate)
Looking at the data on those parts, this CLK-echo has other uses besides timing closure.
It will pause when data is not valid, and so neatly manages the latencies memories have on initial memory read, and the few-clock pauses some have on crossing page boundaries when streaming.
Perhaps one of the first questions I asked myself, when I was introduced to the Propeller, can now be answered.
Is there a fixed relationship between the 4 LSBs of the system counter and the COG that is getting its HUB slot?
I'm asking this because, if there is one (perhaps equality), previewing when a specific COG will get its slot or mentally following the nibbles parade to complete 32 bits, will become very simple and straightforward tasks.
Close, but not quite.
I think the HUB slot used is governed by the 4 lower bits of the variable R/W address, not the COG number.
This enables fast streaming, but access times are a little less deterministic.
I like the sound of NCO MSB output. Will give me the pixel clock I need to get LCD refresh rate up.
Also will allow output to DVI encoder and other things like that...
I found a better way to talk to smartpins: 4-bit pathways that can move 32 bits in 8 clocks.
These will be OR'd together from all cogs and won't use the hub for coordination.
Because they happen within an instruction, they don't require separate flops to work in the background.
Pins will feed back longs using the 3 LSBs of the system counter to time the nibbles.
How in the world do you keep innovating and developing, Chip? It tires my brain just to think of all that is involved in SmartPins, and yet it sits atop an enormous amount of work you've done already on P2, P2 Hot, and P2 mask-error. It is high time for a big payoff to come your way!
Smart pins will look at an incoming 3-bit code on every clock. If it is non-%000, this means the start of a command. The command length varies, according to this initial 3-bit code, which will be followed by some fixed number of 3-bit payloads:
Each cog outputs 3 bits per pin (3 x 32 = 96 bits). The cogs' sixteen sets of 96 bits get OR'd together to form a composite 96-bit set that gives each pin 3 bits. It's like how OUT and DIR signals are OR'd together, except 3 bits per pin, instead of 1.
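A small sketch of that OR'ing, in case it helps to see it spelled out. It assumes 32 pins (to match the "3 x 32 = 96" figure), 16 cogs, and that pin p sits at bits [3p+2:3p] of the 96-bit word; the bit layout and the names are illustrative guesses, not something stated in the post.

```python
# Sketch of the OR'd command bus: each cog drives 3 bits per pin and all
# cogs' outputs are OR'd into one composite word.
# ASSUMPTIONS: 32 pins, pin p at bits [3p+2 : 3p]; names are hypothetical.
from functools import reduce
from operator import or_

PINS, BITS_PER_PIN = 32, 3

def cog_word(pin: int, code: int) -> int:
    """96-bit word from a cog that sends a 3-bit code to one pin."""
    assert 0 <= pin < PINS and 0 <= code < 8
    return code << (pin * BITS_PER_PIN)

def composite(cog_words) -> int:
    """OR every cog's 96-bit word together (idle cogs contribute 0)."""
    return reduce(or_, cog_words, 0)

def pin_code(word: int, pin: int) -> int:
    """The 3 bits that a given pin sees on this clock."""
    return (word >> (pin * BITS_PER_PIN)) & 0b111
```

As with OUT and DIR, nothing arbitrates: if two cogs address the same pin on the same clock, the pin simply sees the OR of both codes, so software has to keep cogs from talking to one pin at once.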
I think two 32-bit registers are plenty for data setting, while we have a 6-bit mode and a 13-bit pin configuration. That should provide enough raw input conduit for anything. For example, that dual-output triangle-wave PWM mode could use one 32-bit register as an NCO adder value (frequency) and the other 32-bit register as two 16-bit words that provide the thresholds.
Anyway, I feel like this is snapping to grid now, and this part of the problem is maybe solved.
To input from pins, I was thinking that we could have 4 pins coming out of each smart pin and they could be correlated to the system counter's 3 LSBs, so that to read a smart pin, you would just gather nibbles from one of 64 sets of 4 pins, as the system counter's 3 LSBs ran from %000 to %111. I don't think that smart pins need to return more than 32 bits, but we could make them do so by using bit 3 of the system counter to parse longs.
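Here is a rough model of that read path. The nibble order is an assumption (counter LSBs %000 selecting the least-significant nibble; the post does not say), and the helper names are invented for the sketch.

```python
# Sketch: reading a smart pin's 32-bit result as 8 nibbles, one per clock,
# selected by the system counter's 3 LSBs.
# ASSUMPTION: counter LSBs %000 select the least-significant nibble.

def pin_nibble(value: int, counter: int) -> int:
    """Nibble a pin would present for the current system counter."""
    return (value >> (4 * (counter & 0b111))) & 0xF

def read_long(value: int, start_counter: int) -> int:
    """Gather 8 nibbles over 8 consecutive clocks, starting at any phase."""
    result = 0
    for clk in range(start_counter, start_counter + 8):
        result |= pin_nibble(value, clk) << (4 * (clk & 0b111))
    return result

def pin_nibble_2longs(longs, counter: int) -> int:
    """Variant using counter bit 3 to choose between two longs."""
    return pin_nibble(longs[(counter >> 3) & 1], counter)
```

Because each nibble's position is tied to the counter phase, the reader in this model can start on any clock and still reassemble the long after 8 clocks, provided the value is held stable for the duration.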
There is a Capture case of simple Pulse Width measurement where you could want to read 64b ie 2 x 32b values being time-stamps of Rise and Fall.
Some simple Arm/trigger logic is needed to ensure those stamps are on the same cycle.
If you have Capture and Clear option on one edge, then 2 x 32b captures can give Edge position and Period
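To make the arithmetic concrete, a minimal sketch assuming free-running 32-bit timestamps (names are hypothetical). With a capture-and-clear option the two captured values would already be the high time and the period, and only the final division would be needed.

```python
# Sketch: pulse width, period and duty from 32-bit capture timestamps.
# ASSUMPTIONS: free-running 32-bit counter, all stamps from the same cycle
# (which is what the arm/trigger logic is for); names are hypothetical.
MASK32 = 0xFFFFFFFF

def elapsed(start: int, end: int) -> int:
    """Clocks from start to end, correct across one counter wrap."""
    return (end - start) & MASK32

def pulse_metrics(rise: int, fall: int, next_rise: int):
    """High time, period and duty from three edge stamps of one cycle."""
    high = elapsed(rise, fall)
    period = elapsed(rise, next_rise)
    return high, period, high / period
```

The masked subtraction is what lets the stamps straddle a counter rollover without special handling.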
Wide dynamic range Frequency counting needs to capture Time and Fi Cycles on an Arm/trigger basis.
That's 2 x 32b captures and 2 counters, one for time, one for Fi Cycles.
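The calculation those two captures feed is short; a sketch, assuming both counters are armed and stopped on input (Fi) edges so that a whole number of input cycles spans the captured time. The function name and the 160 MHz figure in the usage note are only for illustration.

```python
# Sketch: reciprocal frequency counting from two gated captures.
# ASSUMPTION: both counters start and stop on Fi edges, so `fi_cycles`
# whole input cycles span exactly `sys_clocks` system clocks.

def measured_frequency(fi_cycles: int, sys_clocks: int, f_sys_hz: float) -> float:
    """Input frequency in Hz from cycle count and elapsed system clocks."""
    if sys_clocks <= 0:
        raise ValueError("no time elapsed between arm and trigger")
    return fi_cycles * f_sys_hz / sys_clocks
```

For example, 1000 input cycles captured over 160000 system clocks at 160 MHz works out to 1 MHz. Resolution depends on the elapsed time rather than on the input frequency, which is where the wide dynamic range comes from.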
Smart pins will look at an incoming 3-bit code on every clock. If it is non-%000, this means the start of a command. The command length varies, according to this initial 3-bit code, which will be followed by some fixed number of 3-bit payloads: ...
Hmmmm... Maybe you can use a similar technique to give variable-byte-length COG instructions so we get better code density! :-)
I think two 32-bit registers are plenty for data setting, while we have a 6-bit mode and a 13-bit pin configuration. That should provide enough raw input conduit for anything. For example, that dual-output triangle-wave PWM mode could use one 32-bit register as an NCO adder value (frequency) and the other 32-bit register as two 16-bit words that provide the thresholds.
That may be a little 'light' ?
PWM you need to set the Period and Thresholds, and 16b is maybe just enough if you have a prescaler too.
(some PWM control schemes keep the threshold fixed and vary the total period )
The period would be a function of the NCO frequency and the two 16-bit thresholds would get compared to NCO[30:15]. Well, before the comparison, the NCO value would be NOT'd if the MSB was set. That would give the triangle waveform. That would be sufficient, wouldn't it?
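A quick model of that folding, for anyone who wants to try values. It assumes a 32-bit NCO and that each output is high while the triangle is at or above its threshold; the compare polarity is a guess, and the names are hypothetical.

```python
# Sketch of the triangle-PWM idea: fold NCO[30:15] into a triangle by
# inverting it while the MSB is set, then compare against two thresholds.
# ASSUMPTIONS: 32-bit NCO; outputs are high while triangle >= threshold.
MASK32 = 0xFFFFFFFF

def triangle(nco: int) -> int:
    """16-bit triangle value derived from a 32-bit NCO accumulator."""
    ramp = (nco >> 15) & 0xFFFF          # NCO[30:15]
    if nco >> 31:                        # second half-period: count down
        ramp ^= 0xFFFF                   # bitwise NOT of the 16-bit field
    return ramp

def pwm_outputs(adder: int, thresh_a: int, thresh_b: int, clocks: int):
    """Yield (out_a, out_b) per clock for a free-running NCO."""
    nco = 0
    for _ in range(clocks):
        tri = triangle(nco)
        yield tri >= thresh_a, tri >= thresh_b
        nco = (nco + adder) & MASK32
```

With an adder of 1 << 24 the period is 256 clocks, and a threshold of $8000 gives a 50% output in this model. Note that the compares here are >= rather than ==, since the triangle value generally steps by more than 1 per clock.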
Hmmm, I'm not sure about how NCO multiple bits meshes with PWM.
That would have jitter, and not make for easy 'live' modulation of the period, and would need >= compares rather than = ?
In PWM designs, usually the setpoints (Periods, compares) buffer and update only on Counter=0, which is also not quite a NCO concept.
The period needs to be fully granular and stable for some modulation schemes. (ie allow 1024, 1003, 1047 or whatever)
It's true there would be 1 clock period of jitter for most values, but it would average out to be really precise. Adders would make it expensive.
In either case, you would have to update the thresholds synchronous to the period, right? Well, I can see where a single equality event could get around the need for that. Is that what you were implying?
Where I can see NCO/adder prescalers have issues, is they are ok for small increments, but if you want a shorter period than 16b, the values effectively left-justify eg adding 62.25 gives average period of 1020, but that skips many values as it adds, and so simple equality (which is smaller in logic) will not work.
Most PWMs always change counters by +/-1 and so can use the smaller == test on compares.
(and if they update on period-end, that ensures there is always a match)
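The skipped-values point can be shown with a small simulation. This is a sketch under assumptions: a 16-bit visible count with 2 fractional bits, and an increment of 64.25 per clock, which is the value that gives an average period of about 1020 over a 16-bit range (65536 / 64.25). The names are hypothetical.

```python
# Sketch: why == compares fail with a fractional adder but work with a
# +1 counter.
# ASSUMPTIONS: 16-bit visible count, 2 fractional bits, increment 64.25.
FRAC_BITS = 2
WRAP = 1 << (16 + FRAC_BITS)

def adder_counts(increment_q2: int, clocks: int):
    """Visible 16-bit values of an accumulator stepped by a Q2 increment."""
    acc, seen = 0, []
    for _ in range(clocks):
        seen.append(acc >> FRAC_BITS)
        acc = (acc + increment_q2) % WRAP
    return seen

def unit_counts(period: int, clocks: int):
    """Visible values of a counter that steps by 1 and wraps at `period`."""
    return [clk % period for clk in range(clocks)]

def equality_hits(counts, threshold: int) -> int:
    """How many clocks an `== threshold` comparator would fire."""
    return sum(1 for c in counts if c == threshold)
```

Over one period, `adder_counts(257, 1020)` visits 0, 64, 128, ... and never lands on a threshold such as 100, so an equality comparator would miss it, while `unit_counts(1020, 1020)` hits every value exactly once.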
I've got the initial modes planned out. They fit neatly into 5 bits. This can be easily expanded to whatever we need, to accommodate USB, for example.
instructions
--------------------------------------------------------------------------------------------------------------------------------------
WSBYTE  D/#,S/#     'write D[07:0] to pin S[5:0] data, mode dependent
WSWORD  D/#,S/#     'write D[15:0] to pin S[5:0] data, mode dependent
WSLONG  D/#,S/#     'write D[31:0] to pin S[5:0] data, mode dependent
WSMODE  D/#,S/#     'write D[31:0] to pin S[5:0] mode %MMMMM_FFFFCIOHHHLLL
RSBYTE  D,S/#       'read byte from pin S[5:0] into D, mode dependent
RSLONG  D,S/#       'read long from pin S[5:0] into D, mode dependent

A = IN from this pad, B = IN from other pad, B OUT = OUT to other pad

                                 pad     pad
MMMMM Description                DIR     OUT     Pattern                Setup                      Update
--------------------------------------------------------------------------------------------------------------------------------------
00000 OUT (default)              DIR     OUT
00001 B OUT                      DIR     B OUT
00010 CLK                        DIR     CLK
00011 * transitions              DIR     mode    update-period-repeat   WSBYTE=prescaler           WSLONG=transitions
00100 * duty                     DIR     mode    update-period-repeat   WSBYTE=prescaler           WSLONG=adder ~
00101 * nco                      DIR     mode    update-period-repeat   WSBYTE=prescaler           WSLONG=adder ~
00110 * pwm sawtooth 16:16       DIR     mode    update-period-repeat   WSBYTE=prescaler           WSLONG=F:T, WSWORD=T ~
00111 * pwm triangle 16:16       DIR     mode    update-period-repeat   WSBYTE=prescaler           WSLONG=F:T, WSWORD=T ~
01000 * count highs              DIR     ** OUT  period-update-repeat   WSLONG=period (0=cont)     RSLONG=count ~
01001 * count lows               DIR     ** OUT  period-update-repeat   WSLONG=period (0=cont)     RSLONG=count ~
01010 * count all edges          DIR     ** OUT  period-update-repeat   WSLONG=period (0=cont)     RSLONG=count ~
01011 * count positive edges     DIR     ** OUT  period-update-repeat   WSLONG=period (0=cont)     RSLONG=count ~
01100 * time highs               DIR     ** OUT  event-update-repeat                               RSLONG=count ~
01101 * time lows                DIR     ** OUT  event-update-repeat                               RSLONG=count ~
01110 * time highs/lows          DIR     ** OUT  event-update-repeat                               RSLONG=count ~ (MSB=state)
01111 * time positive edges      DIR     ** OUT  event-update-repeat                               RSLONG=count ~
10000 * DAC cog channel          DIR     OUT     event-update-repeat    WSLONG=period
10001 * DAC random per period    DIR     OUT     event-update-repeat    WSLONG=period
10010 * DAC 16-bit dither        DIR     OUT     event-update-repeat    WSLONG=period              WSWORD=value ~
10011 * DAC 16-bit pwm LSB       DIR     OUT     event-update-repeat    WSLONG=period              WSWORD=value ~
10100 * A-high inc, B-high dec   DIR     ** OUT  period-update-repeat   WSLONG=period (0=cont)     RSLONG=count ~
10101 * A-rise inc, B-rise dec   DIR     ** OUT  period-update-repeat   WSLONG=period (0=cont)     RSLONG=count ~
10110 * A-B encoder              DIR     ** OUT  period-update-repeat   WSLONG=period (0=cont)     RSLONG=count ~
10111 * pulse, wait B            DIR     mode    period-update-repeat   WSLONG=16:16 H:L period    RSLONG=last wait for B ~
11000 * sync tx byte, B clk      DIR     mode    transmit-wait-repeat   WSWORD=baud ***            WSBYTE=data ~~
11001 * sync tx long, B clk      DIR     mode    transmit-wait-repeat   WSWORD=baud ***            WSLONG=data ~~
11010 * sync rx byte, B clk      DIR     ** OUT  wait-receive-repeat    WSWORD=baud ***            RSBYTE=data ~
11011 * sync rx long, B clk      DIR     ** OUT  wait-receive-repeat    WSWORD=baud ***            RSLONG=data ~
11100 * async tx byte            DIR     mode    transmit-wait-repeat   WSWORD=baud                WSBYTE=data ~~
11101 * async tx long            DIR     mode    transmit-wait-repeat   WSWORD=baud                WSLONG=data ~~
11110 * async rx byte            DIR     ** OUT  wait-receive-repeat    WSWORD=baud                RSBYTE=data ~
11111 * async rx long            DIR     ** OUT  wait-receive-repeat    WSWORD=baud                RSLONG=data ~

*   DIR from cogs: 0=reset, 1=start; IN to cogs: 1=done; !OUT from cogs clears done
**  set %HHHLLL to %111111 (float/float) if your intent is to input
*** for tx, update data after B-rise; for rx, sample data before B-rise (delay input data by one clk)
~   data is buffered
~~  data is double buffered
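As a way of checking the bit budget of the WSMODE word, here is a small pack/unpack sketch. It assumes the pattern %MMMMM_FFFFCIOHHHLLL is written MSB-first, which makes an 18-bit word (5 mode bits plus 13 configuration bits). The field letters are treated as opaque labels, since their meanings are not spelled out above, and the helper names are invented.

```python
# Sketch: packing/unpacking the WSMODE word %MMMMM_FFFFCIOHHHLLL.
# ASSUMPTIONS: pattern is MSB-first (18 bits total); field letters are
# opaque labels; pack_mode/unpack_mode are hypothetical helpers.
FIELDS = (("M", 5), ("F", 4), ("C", 1), ("I", 1), ("O", 1), ("H", 3), ("L", 3))

def pack_mode(**values: int) -> int:
    """Build the mode word; unspecified fields default to 0."""
    word = 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | v
    return word

def unpack_mode(word: int) -> dict:
    """Split a mode word back into its fields."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out
```

For instance, `pack_mode(M=0b01000, H=0b111, L=0b111)` would select "count highs" with %HHHLLL set to %111111, which is the float/float setting the ** note asks for on inputs.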
Sounds good.
It would help to include the resource size for each mode.
I'm guessing 16b counters/setpoints for PWM ( & 16b prescale?) and 32b counters for Timers and capture ?
How many captures are there ?
eg can it capture both Period and mid edge, to extract duty cycle as M/P ?
Likewise narrow pulse width capture can capture _/= and =\_ into separate registers, to allow down to 1 SysCLK width capture.
Do the Sync Tx modes include 2w and 4w for Dual/Quad SPI ?
( HW that supports Dual SPI can also do JTAG, and P2 should make quite a good JTAG engine.)
A bit-count that covers 1..32 is the most flexible.
Note SPI is usually duplex, and code often decides to discard Rx, but it is there as part of the process.
Above list seems to be Tx or Rx ?
We could refine the capture modes to support short events, and also midpoint and period.
I don't have any immediate plans for 2w and 4w modes, but they could be added. Right now the smart pins are even/odd-paired for signal sharing. Handling 3 or 5 pins (2w/4w+clk) would need another topology. I will get this working first, and then it should be more obvious how to arrange more pins into a smart pin.
I'm anxious to see how much logic this will all take. There is a lot of sharing of flops, etc.
Yes, the wider modes overlap somewhat with the Streamer, but it will be important to stream 4w and 8w memories.
Logic needed will be interesting, as there are many of these.
Comments
WOW. <remembers scene in front of police station in the movie "Tank"> I think ya got me covered.
What kind of benchmarks are we talking about for this?
One smartpin will be needed to output NCO MSB to generate the SDRAM clock, but the rest will be handled by instructions and the streamer.
These will be OR'd together from all cogs and won't use the hub for coordination.
They are fast enough that no interrupts will be needed to optimally use them.
Because they happen within an instruction, they don't require separate flops to work in the background.
Pins will feed back longs using the 3 LSBs of the system counter to time the nibbles.
This seems to make the best balance of performance and silicon.
Sounds good.
There is still a Flag to Pin path Mux, for shared polling for example ?
Is there an address field sent too, for 36 or 40b frames to the pins - some means to select which PinCell.reg is used would be needed ? - that will add a few more clocks ?
With a slowish read path, more buffering will be needed at the pins.
eg for the Toggle-at-match PWM variant mentioned above, that is 2 compare registers, and it is also useful to have at least 2 capture registers (usually shared with compare), to allow for _/= and =\_ capture, down to 1 sysclk wide.
I think Microchip has a small FIFO on some capture units, which of course is nice, but may be too costly here.
For the newer clk-echo memories,(ISSI,Micron,Spansion et al) one more pin would be needed for the clk back.
Also, a simple Pin-Mux-reg is needed for DDR@Pins to wider Clk sync'd data paths inside P2
(or, a /2 at the pin, but that halves the data rate)
Looking at the data on those parts, this CLK-echo has other uses besides timing closure.
It will pause when data is not valid, and so neatly manages the latencies memories have on initial memory read, and the few-clock pauses some have on crossing page boundaries when streaming.
Good data examples are here ( see RWDS)
http://www.issi.com/US/product-flash.shtml
Perhaps one of the first questions I asked myself, when I was introduced to the Propeller, can now be answered.
Is there a fixed relationship between the 4 LSBs of the system counter and the COG that is getting its HUB slot?
I'm asking this because, if there is one (perhaps equality), previewing when a specific COG will get its slot or mentally following the nibbles parade to complete 32 bits, will become very simple and straightforward tasks.
Henrique
I think the HUB slot used is governed by the 4 lower bits of the variable R/W address, not the COG number.
This enables fast streaming, but access times are a little less deterministic.
On 0, all cogs can access addresses with nibbles equal to their cogid.
On clock 1, all cogs access address+1 modulo 15
Etc...
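The rotation described above can be modelled in a few lines. This is a sketch under assumptions: 16 cogs and 16 slots, so the window wraps modulo 16 (reading the "modulo 15" above as counting 0..15), with the slot chosen by a 4-bit nibble of the address. The names are hypothetical.

```python
# Sketch of the rotating hub access described above.
# ASSUMPTIONS: 16 cogs and 16 slots (wrap modulo 16); the slot is a 4-bit
# nibble of the address; slot_for/wait_clocks are hypothetical helpers.
COGS = 16

def slot_for(cog: int, clock: int) -> int:
    """Address nibble a cog can reach on a given clock."""
    return (cog + clock) % COGS

def wait_clocks(cog: int, clock: int, address_nibble: int) -> int:
    """Clocks a cog must wait until its window reaches address_nibble."""
    return (address_nibble - slot_for(cog, clock)) % COGS
```

In this model every cog sees a different slot on every clock, so sequential addresses stream with no waiting, while a random access waits anywhere from 0 to 15 clocks. That matches the "fast streaming, less deterministic" trade-off mentioned above.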
Also will allow output to DVI encoder and other things like that...
Will we be able to flip the polarity if needed?
Modulo 15 or 7, as they are being selected by the 3 LSBs of the system counter?
How in the world do you keep innovating and developing, Chip? It tires my brain just to think of all that is involved in SmartPins, and yet it sits atop an enormous amount of work you've done already on P2, P2 Hot, and P2 mask-error. It is high time for a big payoff to come your way!
That's what I understand. Need to go back to find the egg beater thread again. There was a cool chart showing it.
For instructions, it is just a longer cycle, due to their two clock time.
What is the Baud Rate formula for Sync/Async ?