The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

potatohead · 2016-04-17 04:41

I'm in favor of adjacent COG comms, provided it does not start another painful set of tradeoffs. COGS already need to be designated based on near PIN mapping.

(Greets all, I'm helping build a startup. Been reading and writing some PASM. The advanced comms is a topic I feel I don't add a lot of value to. I got a fun project just about done!

The comms look great, and I stand a chance of understanding them.)

@Chip: looks like we are really in a place to wrap it up. Can I put in the request for a write only flag for low RAM one more time?

The purpose is to prevent errant code from trashing on-chip dev tools. A couple of state changes done through COG ID should make unsetting that flag unlikely enough to be worth doing. If this causes any grief, or has more than a small amount of time attached to it, ignore this one.

@all, is the improved dual port LUT somehow mutually exclusive to the adjacent COG comms?

jmg · 2016-04-17 04:54

potatohead wrote: »

@all, is the improved dual port LUT somehow mutually exclusive to the adjacent COG comms?

No, not mutually exclusive.
I'm not sure where Chip is on this, but once you have Dual Port, it becomes easier to overlap portions Left/right to get quite large and user selectable buffers between COGS.
There may be some routing impacts, as this does spread the BUS more across the COG, so I'm not sure COG-COG Dual port is outside the critical timing path ?

Of course, I'd prefer the Smart Pins are polished & gaps closed, and handshakes added to Streamer before things like Dual Port are added, and that could even be a compile-time switch, as when the final die P&R is done, some elasticity on resource could be useful to pack things in.

JRetSapDoog · 2016-04-17 05:08

evanh wrote: »

Doh! How come no one corrected me on the not? I was babbling away happily about choosing double the HubRAM.

Regarding the die size, I'd think it's seriously constrained by the QFP100 package size. ...

It's easy to miss a "not" when reading. We tend to read what we expect (and we all expect good things of the next chip). And apparently we missed your misread, else we would have pounced all over you like a pack of wolves.

Thanks for your comments about the potential package size limitations. I briefly wondered about that when posting and should have explicitly asked.

So the package does sound like a big limiting factor. However, I did find a table on pg. 9 of the following Cirrus Logic .pdf that lists a 100-pin LQFP with a max die size of 10.48x10.48 mm in a 14x14 mm package (bond form 843-5036), at least if I interpreted it correctly. And that's right in the ball park of what it would take to double HUB RAM. But that doesn't mean such a package is available to Chip or necessarily a good choice. (I do like the 0.5mm lead pitch, though.)

https://www.cirrus.com/cn/pubs/misc/PackageGuide5.pdf

Thanks again for your comment, evanh.

evanh · 2016-04-17 05:43

potatohead wrote: »

... is the improved dual port LUT somehow mutually exclusive to the adjacent COG comms?

The suggestion was to make the dual-ported LUT optionally (when not used by the streamer) visible/addressable by a neighbour Cog. This would entirely supersede what Cluso has asked for.

evanh · 2016-04-17 06:11

JRetSapDoog wrote: »

... that lists a 100-pin LQFP with a max die size of 10.48x10.48 mm in a 14x14 mm package (bond form 843-5036), ...

Nice document. I do see many options for 100pin that have smaller maximums too. The other thing to remember is there will need to be room around the edge of the die for the GNDs to reach the underside heat spreader.

Ah, 75x75um for standard bonding pad dimensions.

ozpropdev · 2016-04-17 08:08

cgracey wrote: »

I want to see about having the streamer do 4/2/1-bit operations, not just 8/16/32.

That would be very useful.

In the streamer input mode we currently have the following modes available.

%ppp : 000 = pins 31..0    32 pins
       001 = pins 39..8    32 pins
       010 = pins 47..16   32 pins
       011 = pins 55..24   32 pins
       100 = pins 63..32   32 pins
       101 = pins 63..40   24 pins,  8 leading “0” bits
       110 = pins 63..48   16 pins, 16 leading “0” bits
       111 = pins 63..56    8 pins, 24 leading “0” bits

In the 24/16/8 bit mode the group is positioned at the top of PortB.
Currently P63 and P62 are the Rx/Tx pins, and P61 to P58 will be SPI/EEPROM/SD?.
This will limit these modes to nbits - 4 usable bits; the 8 bit mode will be affected in particular.
Can this group be moved elsewhere, maybe to the top of PortA?
The new 4/2/1 bit modes may require the same move too.
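
The %ppp table above follows a regular pattern: the bottom pin of the window moves up 8 pins per step, and the window is clipped at P63. A minimal Python sketch of that mapping, purely as an illustration (the function name is made up here and is not part of any P2 tool):

```python
def streamer_input_group(ppp):
    """Return (top_pin, bottom_pin, leading_zero_bits) for a 3-bit %ppp value.

    Derived only from the table in this post; not an official definition.
    """
    if not 0 <= ppp <= 7:
        raise ValueError("ppp must be a 3-bit value (0..7)")
    bottom = 8 * ppp              # each step moves the window up by 8 pins
    width = min(32, 64 - bottom)  # the window is clipped at pin 63
    top = bottom + width - 1
    return top, bottom, 32 - width


for ppp in range(8):
    top, bottom, zeros = streamer_input_group(ppp)
    width = top - bottom + 1
    print(f"%{ppp:03b}: pins {top}..{bottom}, {width} pins, {zeros} leading 0 bits")
```

This makes it easy to see that every narrow mode (%101 to %111) is anchored at P63, which is why the reserved pins at the top of PortB cut into those groups.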

jmg · 2016-04-17 22:12

jmg wrote: »

cgracey wrote: »

I don't have any doc written up on the clock modes, but they look like this:

%xxxxxx00 = RC Fast (20MHz)
%xxxxxx01 = RC Slow (20KHz)
%xxxxxx10 = XI input
%xxxxxx11 = XI input + PLL

%xxxx00xx = XI = float, XO = float
%xxxx01xx = XI = input, XO = float
%xxxx10xx = XI = input, XO = !XI, 1Mohm feedback, 15pF on XI/XO
%xxxx11xx = XI = input, XO = !XI, 1Mohm feedback, 30pF on XI/XO

%0000xxxx = PLL off
%0001xxxx = PLL, XI * 2
%0010xxxx = PLL, XI * 3
%0011xxxx = PLL, XI * 4
...
%1111xxxx = PLL, XI * 16

OK.
Is there a CMOS Clock in mode ?
What MHz range does Xtal target ?

You may need lower CL values, if you want 30MHz Xtal support & I'd avoid 15pF min setting.

Also, a mode with
XO = !XI, 1Mohm feedback, 0pF added on XI/XO
allows Clipped Sine TCXO's to be AC coupled, and also covers any user-C cases.

There is no field for Xtal Division ?

With a range of possible sources, and higher MHz being smaller these days, being able to divide the Xtal by a modest amount before PFD is useful. Something like 4-5b Xtal Divide and 6-7b PLL divide gives better PLL control, but still light logic.
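
As a rough illustration of why a crystal divider helps, the sketch below searches for the divide/multiply pair that lands closest to a target frequency. It assumes f_out = f_xtal * mult / div, with a 5-bit divider (1..32) and a 7-bit multiplier (1..128) taken from the upper end of the bit widths suggested above. Those ranges, the function name, and the 26MHz/160MHz figures are illustrative assumptions, and real PFD/VCO frequency limits are ignored.

```python
def best_pll_setting(f_xtal_hz, f_target_hz, max_div=32, max_mult=128):
    """Brute-force search for the (div, mult) pair closest to the target.

    Assumed model: f_out = f_xtal * mult / div. Returns (div, mult, f_out).
    """
    best = None
    for div in range(1, max_div + 1):
        for mult in range(1, max_mult + 1):
            f_out = f_xtal_hz * mult / div
            err = abs(f_out - f_target_hz)
            if best is None or err < best[0]:
                best = (err, div, mult, f_out)
    _, div, mult, f_out = best
    return div, mult, f_out


# Hypothetical case: a 26MHz source and a 160MHz target.
with_divider = best_pll_setting(26_000_000, 160_000_000)
# Multiply-only, limited to x16 as in the quoted mode table.
multiply_only = best_pll_setting(26_000_000, 160_000_000, max_div=1, max_mult=16)
print(with_divider)
print(multiply_only)
```

In this made-up case the divider allows an exact hit (26MHz / 13 * 80), while multiply-only can get no closer than 156MHz.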

I found more info and measurements on CL with Xtals, that may be useful.
Si5351A PLL has CL choices of 6,8,10pF and on the board tested, 8pF is +7.8ppm and 10pF -14.3ppm, which gives appx 11ppm/pF for that Crystal, and indicates it's probably a 9pF CL device.
Lower CL are more common, as that needs less Oscillator current.
Murata compact ones (2.0 x 1.6mm) are 24~48MHz in 11 stocked values @ Digikey, and 6pF CL
In 20ppm spec, 8 values from 24~32MHz are stocked, 6pF CL
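
The ppm/pF figure above can be reproduced with a two-point linear fit. This is only a sketch: it assumes the frequency error is linear in CL between the two measured points, which is an approximation, and the function names are made up for illustration.

```python
def pullability_ppm_per_pf(cl_a_pf, err_a_ppm, cl_b_pf, err_b_ppm):
    """Slope magnitude in ppm per pF between two (CL, error) measurements."""
    return (err_a_ppm - err_b_ppm) / (cl_b_pf - cl_a_pf)


def zero_error_cl_pf(cl_a_pf, err_a_ppm, slope_ppm_per_pf):
    """CL at which the linear fit crosses 0 ppm (error falls as CL rises)."""
    return cl_a_pf + err_a_ppm / slope_ppm_per_pf


# Measurements quoted above: 8pF -> +7.8ppm, 10pF -> -14.3ppm
slope = pullability_ppm_per_pf(8.0, 7.8, 10.0, -14.3)
cl0 = zero_error_cl_pf(8.0, 7.8, slope)
print(f"{slope:.2f} ppm/pF, zero error near {cl0:.1f} pF")
```

The fit gives roughly 11 ppm/pF with a zero crossing a little below 9pF, consistent with the guess that the crystal is a 9pF CL part.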

Intel uses these CL values; it looks like they target appx 7ppm per step & a nice 16 steps (+/- 50ppm range).

Intel D2000 MCU (20~33MHz Xtal spec, nominal 32MHz)
Trim Code corresponding to load capacitance on board.
10 pF is the default load cap and the corresponding default trim code is 0000.
0111b: 5.55 pF
0110b: 6.18 pF
0101b: 6.82 pF
0100b: 7.45 pF
0011b: 8.08 pF
0010b: 8.71 pF
0001b: 9.34 pF
0000b: 10 pF
1111b: 10.61 pF
1110b: 11.24 pF
1101b: 11.88 pF
1100b: 12.51 pF
1011b: 13.14 pF
1010b: 13.77 pF
1001b: 14.4 pF
1000b: 15.03 pF

Both SiLabs and Intel look to spec the CL that results, which makes for easier matching with crystal specs.
The very smallest packages have higher ESR, & a quick scan at Digikey shows the largish 3.2 x 2.5mm package has ESR as low as 40 ohms for 32MHz
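
The "appx 7ppm per step" and "+/- 50ppm" estimates can be checked from the trim table with a few lines of Python. This is a sketch under one loud assumption: it borrows the roughly 11 ppm/pF slope from the Si5351A board measurement above, which belongs to a different crystal, so the results are ballpark only.

```python
# Trim code -> load capacitance in pF, copied from the table above.
TRIM_CL_PF = {
    0b0111: 5.55, 0b0110: 6.18, 0b0101: 6.82, 0b0100: 7.45,
    0b0011: 8.08, 0b0010: 8.71, 0b0001: 9.34, 0b0000: 10.0,
    0b1111: 10.61, 0b1110: 11.24, 0b1101: 11.88, 0b1100: 12.51,
    0b1011: 13.14, 0b1010: 13.77, 0b1001: 14.4, 0b1000: 15.03,
}

PPM_PER_PF = 11.0  # ASSUMPTION: slope borrowed from the Si5351A measurement

values = sorted(TRIM_CL_PF.values())
steps = [b - a for a, b in zip(values, values[1:])]
mean_step_pf = sum(steps) / len(steps)
ppm_per_step = mean_step_pf * PPM_PER_PF
total_range_ppm = (values[-1] - values[0]) * PPM_PER_PF

print(f"mean step {mean_step_pf:.3f} pF, about {ppm_per_step:.1f} ppm per step")
print(f"total span about {total_range_ppm:.0f} ppm (+/- {total_range_ppm / 2:.0f} ppm)")
```

With that borrowed slope, the mean step of about 0.63pF works out to about 7ppm, and the full span to a little over 100ppm.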

cgracey · 2016-04-18 00:53

jmg wrote: »

jmg wrote: »

cgracey wrote: »

I don't have any doc written up on the clock modes, but they look like this:

%xxxxxx00 = RC Fast (20MHz)
%xxxxxx01 = RC Slow (20KHz)
%xxxxxx10 = XI input
%xxxxxx11 = XI input + PLL

%xxxx00xx = XI = float, XO = float
%xxxx01xx = XI = input, XO = float
%xxxx10xx = XI = input, XO = !XI, 1Mohm feedback, 15pF on XI/XO
%xxxx11xx = XI = input, XO = !XI, 1Mohm feedback, 30pF on XI/XO

%0000xxxx = PLL off
%0001xxxx = PLL, XI * 2
%0010xxxx = PLL, XI * 3
%0011xxxx = PLL, XI * 4
...
%1111xxxx = PLL, XI * 16

OK.
Is there a CMOS Clock in mode ?
What MHz range does Xtal target ?

You may need lower CL values, if you want 30MHz Xtal support & I'd avoid 15pF min setting.

Also, a mode with
XO = !XI, 1Mohm feedback, 0pF added on XI/XO
allows Clipped Sine TCXO's to be AC coupled, and also covers any user-C cases.

There is no field for Xtal Division ?

With a range of possible sources, and higher MHz being smaller these days, being able to divide the Xtal by a modest amount before PFD is useful. Something like 4-5b Xtal Divide and 6-7b PLL divide gives better PLL control, but still light logic.

I found more info and measurements on CL with Xtals, that may be useful.
Si5351A PLL has CL choices of 6,8,10pF and on the board tested, 8pF is +7.8ppm and 10pF -14.3ppm, which gives appx 11ppm/pF for that Crystal, and indicates it's probably a 9pF CL device.
Lower CL are more common, as that needs less Oscillator current.
Murata compact ones (2.0 x 1.6mm) are 24~48MHz in 11 stocked values @ Digikey, and 6pF CL
In 20ppm spec, 8 values from 24~32MHz are stocked, 6pF CL

Intel uses these CL values; it looks like they target appx 7ppm per step & a nice 16 steps (+/- 50ppm range).

Intel D2000 MCU (20~33MHz Xtal spec, nominal 32MHz)
Trim Code corresponding to load capacitance on board.
10 pF is the default load cap and the corresponding default trim code is 0000.
0111b: 5.55 pF
0110b: 6.18 pF
0101b: 6.82 pF
0100b: 7.45 pF
0011b: 8.08 pF
0010b: 8.71 pF
0001b: 9.34 pF
0000b: 10 pF
1111b: 10.61 pF
1110b: 11.24 pF
1101b: 11.88 pF
1100b: 12.51 pF
1011b: 13.14 pF
1010b: 13.77 pF
1001b: 14.4 pF
1000b: 15.03 pF

Both SiLabs and Intel look to spec the CL that results, which makes for easier matching with crystal specs.
The very smallest packages have higher ESR, & a quick scan at Digikey shows the largish 3.2 x 2.5mm package has ESR as low as 40 ohms for 32MHz

My understanding has been that to get 7.5pF loading capacitance, you need one 15pF on each crystal pin, since they are effectively in series, causing the net capacitance to be half of each cap. With this in mind, we actually have 7.5pF and 15pF options in the current design, which are about right. The main target is a 20MHz crystal.
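
A small Python sketch that ties the quoted mode byte to the series-capacitor reasoning above. It is not an official decoder: the field layout is read directly from the quoted table, the names are made up for illustration, and the PLL multiplier rule (field value + 1) is an assumption inferred from the rows shown, since the middle rows are elided in the table.

```python
CLOCK_SOURCE = {
    0b00: "RC Fast (20MHz)",
    0b01: "RC Slow (20KHz)",
    0b10: "XI input",
    0b11: "XI input + PLL",
}

# Capacitance added on EACH of XI and XO, in pF (None = no caps in that mode).
XTAL_PIN_CAP_PF = {0b00: None, 0b01: None, 0b10: 15.0, 0b11: 30.0}


def series_capacitance(c1_pf, c2_pf):
    """Two capacitors in series: C = C1*C2 / (C1 + C2)."""
    return c1_pf * c2_pf / (c1_pf + c2_pf)


def decode_clock_mode(mode):
    """Split an 8-bit mode value into the three fields of the quoted table."""
    if not 0 <= mode <= 0xFF:
        raise ValueError("mode must fit in 8 bits")
    source = mode & 0b11
    pins = (mode >> 2) & 0b11
    pll = (mode >> 4) & 0b1111
    pin_cap = XTAL_PIN_CAP_PF[pins]
    return {
        "source": CLOCK_SOURCE[source],
        "pin_cap_pf": pin_cap,
        # Net crystal load is the two pin caps in series (stray C ignored).
        "load_cap_pf": None if pin_cap is None else series_capacitance(pin_cap, pin_cap),
        # ASSUMPTION: multiplier = field + 1; %0000 means PLL off.
        "pll_multiplier": None if pll == 0 else pll + 1,
    }


info = decode_clock_mode(0b0111_10_11)
print(info)
```

For the example value, the 15pF-per-pin setting gives a 7.5pF net load, and a 20MHz crystal with the x8 field would come out at 160MHz under the assumed multiplier rule.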

jmg · 2016-04-18 01:22

cgracey wrote: »

My understanding has been that to get 7.5pF loading capacitance, you need one 15pF on each crystal pin, since they are effectively in series, causing the net capacitance to be half of each cap. With this in mind, we actually have 7.5pF and 15pF options in the current design, which are about right. The main target is a 20MHz crystal.

Broadly, yes, that is correct, but the CL values should be verified in a working circuit, as the feedback capacitance of the linear amplifier gets into the mix as well, as do bonding wires etc.
I would specify the actual effective CL on the data sheets, not some internal value that is only part of the story.

What range of Crystals is P2 designed to support? Is there an ESR max target value yet?

cgracey · 2016-04-18 16:19

Cluso99 wrote: »

Chip,
How do you propose to access these internal pins to the smart pins? Sounds smart to be able to use an FPGA I/O to connect to the internal smart pin pads!

But I cannot envision how you will do this. Will it just be a die with the external pads wired to a test frame, and internal pads also wired to the test frame?

BTW I presume that since Treehouse have now done the I/O block/ring, that the chip die dimensions are now frozen?

We will bring them out to the package pins as inputs and outputs. Then, there will be two SPECIAL pins that are the actual Prop2 I/O pins. All those other pins make up their inputs and outputs.

cgracey · 2016-04-18 16:21

Cluso99 wrote: »

Chip,
There are two guys who are extremely serious about wanting a P1V.
Perhaps they may be willing to fund the licence section to have a P1V placed inside the frame. What are the chances the frame will work in its basic mode for P1V?

For others, the P1V is...
* Prop 1 plus
* 64 I/O
* More hub ram
* Faster (160+ MHz)

Optional extras...
* LUT RAM for cog execution
* Security fuses (OTP)
* serial input on video logic
* Multiply & Divide
* Relative addressing
* Boot from Flash (instead of EEPROM)
* ??? OTP for boot code

We could work something out to use our I/O pins.

The harder part is getting the Verilog nailed down to realize those other features. That is 10x harder. If the Verilog is proven, it's easy to add the I/O's.

cgracey · 2016-04-18 16:24

evanh wrote: »

JRetSapDoog wrote: »

cgracey wrote: »

There's not enough room to double the hub RAM to 1MB, but...

Curiosity killed the cat, but I'm still curious about the feasibility of increasing the die size to make room for 1MB. I just reread Chip's musing (however brief, and assuming he wasn't kidding) about resorting to a 17x17mm die for P2-Hot to dissipate a possible 5W of heat. Okay, that sounds big and expensive, but what about a 10x10mm die size? ...

Doh! How come no one corrected me on the not? I was babbling away happily about choosing double the HubRAM.

Regarding the die size, I'd think it's seriously constrained by the QFP100 package size. Any talk of larger for the Prop2-Hot was when targeting a package with closer to 200 pins I'd guess. The Hot design did have a full extra 32 I/O.

The bigger the die gets, the lower the yield goes. To do 1MB, we really need a finer process, which costs a lot more.

Rayman · 2016-04-18 16:44

Sounds like P2 is close to having a heartbeat.

pmrobert · 2016-04-18 20:09

I could support the BeMicroCV which has a Cyclone V -A2 device on it, which holds more than the DE0-Nano. I think that board is only $49. It would do one or two cogs and several smart pins, maybe 64KB hub RAM. If you wanted that working, I could start doing compiles for it, along with the others.

I would be extremely pleased to see such a build for the BeMicroCV. Like many of us, I have an awaiting application... Thanks for keeping us in the loop, it is appreciated.

Brian Fairchild · 2016-04-20 06:56

This'll probably get lost in the noise over in the topic I posted it in, so...

cgracey wrote: »

It seems the main area of improvement could be in the simplification of the development environments.

Given that just about every uC coming onto the market now has internal debugging circuitry, either JTAG or proprietary, and associated debugging 'pods' for around the $30-$50 mark, how will the P2 stack up against them?

cgracey · 2016-04-20 06:58

Brian Fairchild wrote: »

This'll probably get lost in the noise over in the topic I posted it in, so...

cgracey wrote: »

It seems the main area of improvement could be in the simplification of the development environments.

Given that just about every uC coming onto the market now has internal debugging circuitry, either JTAG or proprietary, and associated debugging 'pods' for around the $30-$50 mark, how will the P2 stack up against them?

Our debugger is just a PropPlug. It can run at 2Mbaud.

Brian Fairchild · 2016-04-20 07:53

So the P2 has internal debug hardware or uses one of the COGs to control, ie single-step, other COGs?

cgracey · 2016-04-20 08:02

Brian Fairchild wrote: »

So the P2 has internal debug hardware or uses one of the COGs to control, ie single-step, other COGs?

Each cog has single-step and address-match breakpoints, as well as asynchronous breakpoint from other cogs.

Brian Fairchild · 2016-04-20 09:13

cgracey wrote: »

Each cog has single-step and address-match breakpoints, as well as asynchronous breakpoint from other cogs.

Which are accessible directly from a couple of pins which the PropPlug is connected to and therefore directly accessible from the debugger running on your development system?

cgracey · 2016-04-20 09:18

Brian Fairchild wrote: »

cgracey wrote: »

Each cog has single-step and address-match breakpoints, as well as asynchronous breakpoint from other cogs.

Which are accessible directly from a couple of pins which the PropPlug is connected to and therefore directly accessible from the debugger running on your development system?

With a little bit of helper software, triggered by the debug interrupt, yes.

Brian Fairchild · 2016-04-20 09:30

cgracey wrote: »

With a little bit of helper software, triggered by the debug interrupt, yes.

So the user has to drop a bit of extra code into their code before they can debug it? Doesn't that break determinism?

cgracey · 2016-04-20 09:33

Brian Fairchild wrote: »

cgracey wrote: »

With a little bit of helper software, triggered by the debug interrupt, yes.

So the user has to drop a bit of extra code into their code before they can debug it? Doesn't that break determinism?

No, upon each cog starting, a programmable vector is branched to which can set the breakpoint routine address, if there is one. It's stealthy.

evanh · 2016-04-20 10:05

The main bonus is the target Cog code packing is not disturbed. If timing is critical then the debugging will have to be incidentally gleaned in real time by sampling what's appearing in HubRAM or physical I/O.

Hand crafted debugging is always the best method anyway.

rjo__ · 2016-04-21 00:34

evanh wrote: »

The main bonus is the target Cog code packing is not disturbed. If timing is critical then the debugging will have to be incidentally gleaned in real time by sampling what's appearing in HubRAM or physical I/O.

Hand crafted debugging is always the best method anyway.

Single stepping through time critical processes...hmmm. So, there is no way to just freeze all of the cogs and look at the states of all the variables?... hmmm.

jmg · 2016-04-21 00:45

rjo__ wrote: »

Single stepping through time critical processes...hmmm. So, there is no way to just freeze all of the cogs and look at the states of all the variables?... hmmm.

If it really is time critical, the general approach is to break after that section, rather than single-step.
It is also common to capture tracking info into memory inside the time critical process (this adds a few SysCLKs),
or, if it is Pin-IO based, to have another COG do the capture.

rjo__ · 2016-04-21 01:09

ok... I got it. I am older and I have been between moments of perfect clarity.

Next time the light looks right, I am going to go back and look at a problem again and figure it out.
But... I wouldn't have to be thinking exactly correctly if I knew exactly what was happening at a particular moment in two different cogs... cycling through two different loops: STOP@ClockPlusX.

Sounds like an issue for the P3:)

jmg · 2016-04-21 02:01

rjo__ wrote: »

I wouldn't have to be thinking exactly correctly if I knew exactly what was happening at a particular moment in two different cogs... cycling through two different loops: STOP@ClockPlusX.

There could be different ways to manage multiple COGS debug.
With each Debug-stub supporting a PinCell UART, you probably could run multiple PC-Side sessions, and then use some absolute-time-stamp reporting to figure out how to align trace-reports.

Sounds like another use for the 4 UART device I suggested

I think you could even pair 2 EV-Boards, (idle one P2), to connect 4 more Debug Channels to the Active P2, and have 2 developers working on 4 Debug Channels each !
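
On the PC side, aligning trace reports by absolute time stamp could look something like the Python sketch below. Everything in it is hypothetical: the (timestamp, cog, message) record layout, the function names, and the sample data are assumptions for illustration, not a description of any existing P2 debug tool.

```python
import heapq


def merge_traces(*traces):
    """Merge per-cog traces into one timeline.

    Each trace is a list of (timestamp, cog_id, message) tuples that is
    already sorted by timestamp (a shared system clock count is assumed).
    """
    return list(heapq.merge(*traces, key=lambda rec: rec[0]))


def last_event_at(trace, t):
    """Most recent record in one trace at or before time t (None if none)."""
    last = None
    for rec in trace:
        if rec[0] > t:
            break
        last = rec
    return last


# Made-up sample data from two debug channels.
cog1 = [(100, 1, "loop start"), (250, 1, "pin high")]
cog2 = [(120, 2, "loop start"), (240, 2, "sample taken")]

merged = merge_traces(cog1, cog2)
for rec in merged:
    print(rec)

# What was each cog last seen doing at clock count 245?
print(last_event_at(cog1, 245), last_event_at(cog2, 245))
```

This does not freeze the cogs at a given clock, but it does give a view of what each one last reported around a chosen moment.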

rjo__ · 2016-04-21 02:09

If you got it... push it!!!

By the way, I don't believe in competing. I have always found success when I found a place where there was no competition.
I hate to create losers... and find no joy in "beating" anyone. My father taught me... "never stand in line. If there is a line, go somewhere else."

Chip is right "there" with this design, totally unique, hard to compare... and the answer to a lot of questions.

Does your solution exist in other comparable mcu's?

jmg · 2016-04-21 02:12

rjo__ wrote: »

Does your solution exist in other comparable mcu's?

Which 'solution' ?

cgracey · 2016-04-21 05:15

rjo__ wrote: »

If you got it... push it!!!

By the way, I don't believe in competing. I have always found success when I found a place where there was no competition.
I hate to create losers... and find no joy in "beating" anyone. My father taught me... "never stand in line. If there is a line, go somewhere else."

Chip is right "there" with this design, totally unique, hard to compare... and the answer to a lot of questions.

Does your solution exist in other comparable mcu's?

I agree with your dad. I'm going to tell my kids to stay out of lines, too. I have always avoided them, too, out of some survival sensibility.
