Well, I have my asbestos undies on, so here goes...
Where are we headed?
I really think the P2 needs, or better yet really needed, a mission statement to help define the goal.
It really no longer seems like the 'next step' above the P1; it is becoming something far bigger than that and leaves a large gap between itself and the P1.
For a lot of use cases the P1 was just shy of being enough: all that was really needed was more I/O, a larger HUB, and enough of a SERDES to at least allow full-speed SPI. Having this in a somewhat friendly package would have been a great next step above the P1.
What is being created now really worries me. In my mind it is a very bad case of scope creep. Every time Chip delivers on the latest thing someone has wished for, someone else throws in something like "Now that we have a modulated stator, all we need is amulite bearings and we can implement high speed USB!" Of course once we get that, it will be obvious that we are just shy of some other neat thing, and off we send Chip on another mission to build even more. When will it stop, and what will we have when we get there?
I really think this thing needs a bow put on it so we can call it done.
I know there is already a desire for a P3 that would go even farther than the P2. I think you need to determine what could be created that would fit the space between the P1 and P2 before moving to the P3.
What I personally would like to see is:
- The current P2 COGS
- Either 48 or 64 'user' I/O, plus the I/O for the EEPROM and serial. The number of I/O determined by package choices.
- 64k HUB; could be larger if it doesn't have a negative impact on die size or cost, but I think 64k is enough for this case.
- A simple SERDES to allow easy SPI (I could live with dropping this now that we have multitasking COGs)
- I could also see having 32 of the user I/O using the current P2 I/O and the remaining user I/O being a more simple 'digital only' I/O if that would help with package size and cost.
Chris Wardell
On a BGA for the P2, can there be a center area used for glue, so that an adapter board could hold the BGA but allow for subsequent reflows? Wouldn't a PCB like the one shown be hand soldered just as easily as the Prop QFP, using pads on the edge of the PCB? Or attach the BGA and add glue to the edge of the IC to secure it to the PCB adapter board. For low volumes and hobby use, a few bucks more for an adapter board is acceptable. There comes a point when you have to consider the audience, and a BGA allows for a broader range of use. A company can't grow with hobby-biased packaging on a product like the P2 that restricts larger volume sales. For the P2, I would maximize the potential at the expense of inconvenience for the hobbyist, the inconvenience being adapter boards.
Sheesh, go away from the forum for a day and the P2 world turns upside down.
My $0.02 on DDR2:
- I don't like BGA; it raises PCB costs for everyone
- love the idea of a dedicated DDR2 engine that, using a 16-bit memory bus, can approach 4*clkfreq memory read/write speed (edit: I mean in bytes/second)
- like the idea of empty hub slots being used to transfer between hub <-> ddr2
- love the idea of ddr2<->aux without going through the hub
- like cluso's shared cogs*128 byte new hub area (* but should be 512 bytes per page, ideally 2-4 pages per cog, more on this later)
- maybe we could get a bit bigger hub too
- HATE any more delays
Now if I understood Chip correctly, the maximum burst size is 8 words (16 bits each), which fits a quad of longs perfectly; I suspect that if the bursts are back to back, very little setup is needed between 8-word bursts.
If we can accept that determinism would have to depend on careful programming of all cogs, I'd suggest the following:
RDCXMEM inda_ptr [wc]
WRCXMEM inda_ptr [wc]
RDAXMEM aux_ptr [wc]
WRAXMEM aux_ptr [wc]
RDC/WRC would read/write cog memory, and could be in a REPx block with auto increment pointers
RDA/WRA would read/write aux memory, and could be in a REPx block with auto increment pointers
If WC is specified, the instruction will stall until the transfer completes
If WC is not specified, it will just schedule the memory op and continue
Implementation:
RDC/WRC/RDA/WRA just add a request to a FIFO queue, and a DDR2 engine just sits there, reading requests and fulfilling them in order of arrival. This would tend to distribute access fairly, in proportion to each cog's share of requests; about as close to determinism as we can reasonably get with XMM.
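To make the scheduling concrete, here's a minimal C model of the idea (all names invented; the real thing would be Verilog). The plain form just posts a request and continues; the WC form posts and then stalls until the engine flags completion:

#include <stdbool.h>
#include <stdint.h>

/* Model of the proposed DDR2 command FIFO (names hypothetical). */
typedef struct xmem_req {
    uint8_t  cog;       /* requesting cog, 0..7 */
    bool     write;     /* WRxXMEM vs RDxXMEM */
    bool     aux;       /* aux (RDA/WRA) vs cog ram (RDC/WRC) */
    uint32_t xaddr;     /* DDR2 address of the 8-word burst */
    uint16_t laddr;     /* cog/aux address */
    volatile bool done; /* set by the engine when fulfilled */
} xmem_req;

#define FIFO_DEPTH 16
static xmem_req *fifo[FIFO_DEPTH];
static int head, tail, count;

static void ddr2_engine_step(void); /* services one request; see below */

/* Post a request; stalls (letting the engine drain) while the FIFO is full. */
static void xmem_post(xmem_req *r)
{
    r->done = false;
    while (count == FIFO_DEPTH)
        ddr2_engine_step();
    fifo[tail] = r;
    tail = (tail + 1) % FIFO_DEPTH;
    count++;
}

/* WC form: schedule, then stall until the transfer completes. */
static void xmem_post_wc(xmem_req *r)
{
    xmem_post(r);
    while (!r->done)
        ddr2_engine_step();
}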
Other useful extension:
As mentioned above, pseudo-DMA to hub could be implemented using the above, or as an engine that does "background" transfers between DDR2 and hub using spare hub/DDR2 cycles; i.e., if the command FIFO is empty, it can block-move between hub and DDR2 using spare hub slots and empty DDR2 request slots.
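Continuing the sketch above (reusing its FIFO variables), the engine drains requests in arrival order, and whenever the FIFO is empty it steals the slot for one quad of the background hub <-> DDR2 block move:

#include <string.h>

static uint8_t ddr2_mem[1 << 20];   /* small stand-in for the 64MB part */
static uint8_t hub_mem[128 * 1024]; /* hub model */

/* One pending background block move (the pseudo-DMA above). */
static struct { uint32_t xaddr, haddr, quads; bool to_hub; } bg;

static void ddr2_engine_step(void)
{
    if (count > 0) {                      /* oldest request first */
        xmem_req *r = fifo[head];
        head = (head + 1) % FIFO_DEPTH;
        count--;
        /* one 8-word burst = 16 bytes to/from cog or aux ram
           (the cog-side transfer itself is elided in this model) */
        r->done = true;
    } else if (bg.quads) {                /* empty slot: background DMA */
        if (bg.to_hub)
            memcpy(hub_mem + bg.haddr, ddr2_mem + bg.xaddr, 16);
        else
            memcpy(ddr2_mem + bg.xaddr, hub_mem + bg.haddr, 16);
        bg.xaddr += 16;
        bg.haddr += 16;
        bg.quads--;
    }
}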
Advantages:
- fits "everything is equal" philosophy
- easy access to DDR2 for all cogs
- background DMA for large blocks
- careful system design can almost saturate DDR2 bus
- easy to use XMM, very good match for LMM
Disadvantages:
- no byte/word/long access to DDR2 (could be supported with extra Verilog; would be extremely nice to have for graphics, LMM, etc.)
- no hardware enforced determinism, requires planning (not a bad thing)
512-byte pages:
Almost everything in Linux, file systems, etc. uses 512-byte pages for historical reasons, so if there are cog-specific hub pages, ideally we would want 4+ per cog to support code, data and stack, plus one extra to reduce thrashing to memory.
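For scale, a quick check of the hub cost, assuming 4 pages per cog as suggested (the layout and names below are purely hypothetical):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE     512
#define PAGES_PER_COG 4      /* code, data, stack, spare */
#define NUM_COGS      8
#define PAGE_BASE     0x0000 /* wherever the shared area starts */

static uint32_t cog_page(int cog, int page)
{
    return PAGE_BASE + (uint32_t)(cog * PAGES_PER_COG + page) * PAGE_SIZE;
}

int main(void)
{
    /* 8 cogs x 4 pages x 512 bytes = 16KB of hub reserved */
    printf("total: %d bytes\n", NUM_COGS * PAGES_PER_COG * PAGE_SIZE);
    printf("cog 3 stack page: 0x%04x\n", cog_page(3, 2)); /* 0x1c00 */
    return 0;
}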
Now adding a 4-long cache to each cog for XMM would also be beneficial:
RDCXMEMC ins,ptra++ ' fetch next instruction into 'ins' (blocking version)
nop ' pipe delay, may not be needed
nop
ins nop ' the fetched instruction lands and executes here
Instant, pain-free, execution of LMM code from DDR2!
With the command FIFO I described above, you could have all eight cogs run different LMM code out of the DDR2 at the same time.
Note: An
EXECXMEMC ptra++
instruction would do the same work as the four instructions above; however, it would take the same number of clock cycles, without the benefit of two interesting delay slots that have a lot of potential.
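Going back to the 4-long cache idea, here's a rough C model (one 16-byte line per cog; names invented): sequential LMM code pays for one burst every four instructions, and the other three fetches hit the line:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;      /* DDR2 address of the cached quad */
    uint32_t line[4];  /* four longs */
    bool     valid;
} quad_cache;

/* Stub standing in for an 8-word DDR2 burst read. */
static uint32_t ddr2_read_long(uint32_t xaddr) { return xaddr; }

static uint32_t lmm_fetch(quad_cache *c, uint32_t pc)
{
    uint32_t tag = pc & ~0xFu;            /* quad-aligned address */
    if (!c->valid || c->tag != tag) {     /* miss: one burst refill */
        for (int i = 0; i < 4; i++)
            c->line[i] = ddr2_read_long(tag + 4u * i);
        c->tag = tag;
        c->valid = true;
    }
    return c->line[(pc >> 2) & 3];        /* hit: no DDR2 traffic */
}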
If the cogs can have direct access to the new DDR2 RAM, can a mode be added to run their programs directly from it?
That would eliminate the tiny cog ram bottleneck!
We have faster cogs and more I/O, and we need bigger programs to actually make use of the expanded capabilities, yet we still have to have an interpreter cog running to drip-feed programs into the cogs due to the lack of cog memory.
I see this change making the P2 an excellent cross between a microcontroller and a microcomputer, and, aside from the custom SoC route, that is something not currently available for small low-volume projects (like mine will be).
Think 4 cogs running Video, Mouse/Keyboard, Touchscreen, User I/O +++ 4 cogs running deterministic process control all coordinated by the common bus/memory! NOTHING else out there now can do that!
This is a great marketing opportunity for Parallax to fill (similar to their history with the BS and SX) a niche that will open up a lot of "oh my, look what I can do now...".
That said, with this type of package (BGA), Parallax would have to step up to the plate and offer a very cost-effective user module with the P2 + RAM installed, similar to the PropStick USB or the Microchip processor modules for their dev boards (which run ~$25.00).
I like Seairth's idea of bagging HUB RAM altogether in one version of the chip. It is a real estate hog and an implicit bottleneck despite all that has been done on P2 to minimize that.
Remember the old Star Trek series? I always found the management style to be quite interesting. There was a point in almost every episode where Captain Kirk would meet with his officers and listen to their thoughts and concerns, and then make a decision based on his experience, their input, and what was best for the ship and crew.
Sounds like we're at that point in this episode.
Sandy
Yep. Seconded.
Personally, I would link that goal to some roughly identified revenue potential as well.
The thing about DDR2 is that it inputs/outputs data on BOTH edges of the clock. Those would definitely be special I/O pins. And it might require a PLL to generate the timing, as the memory is clocked at 400MHz, transferring two words per clock. A single x16 DDR2 chip can read/write at the rate of 1600MB/s, whereas the current SDRAM only goes to 320MB/s, which is 1/5 the speed. The problem is going to be getting DDR2 data in and out of hub RAM. Instead of QUADs, we'd need OCTs, and at twice the rate (assuming a 200MHz system clock). It sure would be cool for several cogs to be able to make simultaneous memory requests and have them done faster than they could index into the results.
For this to work efficiently, we might need to expand the number of hub slots to 10, allocating turns like so:
cog0, cog1, cog2, cog3, DDR2, cog4, cog5, cog6, cog7, DDR2
This would give the DDR2 40MHz access to hub RAM with a system clock of 200MHz. If it could transfer 32 bytes (8 longs) per access, that would be 32x40MHz = 1280MB/s transfer. A 1080p/60 screen needs 594MB/s, so we would be able to write and read a whole screen in a screen period.
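Sanity-checking those figures in C, assuming the DDR2 gets two of the ten slots (which is what the 40MHz number implies):

#include <stdio.h>

int main(void)
{
    const double sysclk = 200e6;     /* system clock, Hz */
    const int slots = 10, ddr_slots = 2;
    const int bytes_per_turn = 32;   /* 8 longs per DDR2 hub access */

    double ddr_rate = sysclk * ddr_slots / slots;   /* 40 MHz */
    double ddr_bw   = ddr_rate * bytes_per_turn;    /* 1280 MB/s */

    /* 1080p60 with blanking: 2200 x 1125 pixels x 4 bytes x 60 Hz */
    double video = 2200.0 * 1125 * 4 * 60;          /* ~594 MB/s */

    printf("DDR2 <-> hub: %.0f MHz, %.0f MB/s\n", ddr_rate/1e6, ddr_bw/1e6);
    printf("1080p60 needs %.0f MB/s: read + write fits\n", video/1e6);
    return 0;
}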
I like the idea of DDR2.
I do not know how complex the hub circuitry is or how much silicon it uses, but I figure it as an 8-to-1 bidirectional mux.
Now if I had to add the DDR2, assuming each cog can execute only one access to the hub in its time window, I would do this:
- create the necessary memory controller (fed with its own PLL clock if necessary) that basically converts the DDR2 interface to an SRAM one and autonomously handles the refresh
- add a simple 2-way mux between the new memory controller and the hub SRAM
- add a second set of hub instructions (e.g. RDQUAD renamed to IRQUAD and its counterpart ERQUAD; WRQUAD renamed to IWQUAD and its counterpart EWQUAD ...)
- those new instructions will control the mux/swap between the Internal and External memory (between SRAM and memory controller)
cog0 \
cog1 |
cog2 |
cog3 \_____/ SRAM (Int)
cog4 / hub \ DDR2 (Ext)
cog5 |
cog6 |
cog7 /
In this way each cog will have the same bandwidth to the internal and external memory, and the current hub design should basically remain unchanged. If higher bandwidth is necessary, more than one cog can cooperate (each one synced to its "hub" window) in streaming the (whatever) memory contents to the same output pins, or the opposite (like we do now with the P1). Such cooperation can come from one task on each cog, leaving the cycles between hub windows free for other jobs. From a programming point of view it will be completely transparent to access one memory or the other; you only have to change the instruction.
If I have the right figures, that means @200MHz / 8 cycles per "hub/memory" access * quad (4*16) / 8 = 200MB/s read or write from internal or external memory. If the need is to copy from one to the other, then it halves (e.g. one hub window to read from external, next hub window to write to internal).
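Here's a sketch of that instruction-selected mux in C (all names hypothetical); the point is that the hub rotation itself stays untouched and the opcode picks the side:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum mem_side { MEM_INT, MEM_EXT };  /* I-instructions vs E-instructions */

static uint8_t hub_sram[128 * 1024]; /* internal hub */
static uint8_t ext_ram[1 << 20];     /* DDR2 behind the SRAM-like controller */

/* One hub window for one cog: move one 16-byte quad, side chosen by opcode. */
static void hub_window(enum mem_side side, bool is_write,
                       uint32_t addr, uint8_t quad[16])
{
    uint8_t *mem = (side == MEM_INT) ? hub_sram : ext_ram;
    if (is_write) memcpy(mem + addr, quad, 16);
    else          memcpy(quad, mem + addr, 16);
}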
IMHO 64 full I/O is more than enough, and if it helps, I would rather have 32 full I/O and 32 digital-only, thus perhaps gaining some silicon space for HW support for a universal SERDES that can make/assist in async/sync serial, USB and, if possible, 4-bit SD/flash (Cluso99 has some nice ideas about it).
... and, for users not interested in additional external RAM, perhaps a way to reutilize the special/dedicated 1.8V I/Os as a simple data exchange (clock synchronization) bus between two P2s. This could open markets in some niche sectors in high-availability systems, or where personal injury is of highest importance and two units are required to run the same program and verify each other before taking any action (e.g. car ABS control units, or some PLCs in the utilities industry, e.g. running gas turbines in power plants).
This is only the perspective of a person who does not know the internals and rules of silicon chip engineering/design.
Almost anything that distinguishes the P2 from commodity AVR, PIC, and ARM chips is a good thing! A high speed interface to 64MB DDR2 is clearly 'distinguishing.'
The microcontroller market is saturated with choices. It is not easy for a manufacturer to find profitable niches that have gone unnoticed all this time. Given that, there is a lot to be said for pursuing a course like Chip has - building what seems good to him, and what everyone else isn't building.
Certainly some FIFO-like pipeline will need to buffer the SDRAM bursts, which are narrow and fast, into opcode-sized bites. Perhaps an opcode could block on FIFO data ready, which would give the flow you are after? Execute-in-place is a good thing to aspire to.
"If we can accept that determinism would have to depend on careful programming of all cogs..."
Even with interleaving, you still have refresh and many preamble cycles to add into the mix, so you will not really know where the other COGs are in their phases, as that will move about. You can only work on average budgets.
I do not think strict determinism is something DDR2 will deliver.
However, with threads, you can drop the (small) strict-determinism stuff into local memory and give it a thread slot or two.
"Almost anything that distinguishes the P2 from commodity AVR, PIC, and ARM chips is a good thing! A high speed interface to 64MB DDR2 is clearly 'distinguishing.'"
64MB of memory to do what? Run through a VM on a COG that at best cannot exceed 200MHz, assuming the currently expected speed?
I really think the ARM has already won the race for having a large memory model to run OSes and such.
It seems like the place the Prop can shine is with Chip's very flexible I/O coupled with the power of the COG and its timers, so something like the ability to make the Prop an SPI slave would be worth adding.
It just seems like right now the P2 is trying to be everything to everyone, and that is almost always a recipe for failure.
C.W.
Hi-res video, for one. All the things you mention that the Prop is good for are just enhanced by having more memory for whatever the user needs. It certainly isn't exclusively for LMM OS like you seem to suggest. I love the idea of combining data sampling, processing, real-time control, and display in one chip.
"Hi-res video, for one. All the things you mention that the Prop is good for are just enhanced by having more memory for whatever the user needs..."
I wonder how many microcontroller applications actually need high-res video? It seems like a lot of the silicon on the P2 is being used to supply graphics capabilities. Isn't it more likely that someone who needs that sort of thing will just go with one of the SoCs with a built-in GPU, like the one used in the Raspberry Pi? Can the graphics hardware on the P2 compete with those GPUs? Will the P2 be cheaper than those SoCs?
"I love the idea of combining data sampling, processing, real-time control, and display in one chip."
I get that there are cases where it would be nice, but do they occur often enough to justify the increased cost and complexity?
I don't know the answer to that, but I really think Parallax needs to figure it out, and soon.
C.W.
I maintain that we can't predict all the ways a P2 might be used. Sci-fi authors are always trying to predict the future based on current trends. Look at an original Star Trek TV episode and you'll see several ways we've already greatly surpassed what was sheer fantasy a generation ago. There are other things that are as silly now as they were then. (Dilithium crystals providing warp propulsion??)
So, rather than predict the future - a dicey thing at best - why not provide a great and unique mix of features and strengths, and let others figure out how it works with their specific and often unique mix of needs?
"I wonder how many microcontroller applications actually need high-res video?"
The alternative is to go head-to-head with ARM designing commodity chips - a sure-fire way to go bankrupt.
"Isn't it more likely that someone who needs that sort of thing will just go with one of the SoCs with a built-in GPU like the one used in the Raspberry Pi?"
I think there is significant market space between touch controllers + [some fat OS + embedded board] and, say, the FTDI EVE FT800 ($5.80/1).
LCD screens are expanding rapidly, and the FT800 has a max ceiling of 512 x 512 (but more are coming?).
The SSD1963 ($7/1) is not as smart, but it can go a little higher, to 864x480x24-bit TFTs.
DDR2 means a P2 can swallow an SSD1963, and more, without breaking a sweat, and DDR2 bandwidth helps ease the Achilles heel of the P2, which is the drop in speed when you go off-cog.
Smart LCD displays are ~ $84/1 ($64/100), so the system prices can stand a P2, especially if the support framework is there.
Then there are ship-loads of instrumentation areas that a P2 can manage with Counters/Timers/maths plus good Display coupling.
Then there is this news:
http://ir.atmel.com/releasedetail.cfm?ReleaseID=809666
The Atmel device mentioned there is not cheap:
ATMXT1664S-CUR 1: $21.83 100: $12.38
However, one can and absolutely should link features to specifics based on things known to date. In the end, it's still a dice roll, as any risk is.
Without this, we run the very real risk of always chasing that idealized, "good enough" vision, which creeps along as technology does.
Now, one advantage Parallax has is its education/hobby business. Lots of great data there, and lots of current customers and great relationships too. The original P1 would not have been made according to market research. Chip, on the other hand, knew that a market could be teased out and prosper, and Parallax understands very well how to maximize that happening.
And so here we are.
I don't think the P2 would be made according to market research either. Again, Chip seems to know a market can be teased out and prosper. I believe this. I believe it because I see that the intersection of real-time control, robust input/output capability, display, etc. is a compelling one, and it's currently done with system-on-chip designs, Linux, etc.
A P2 will make this lean, much as the P1 did for simple multi-core processing.
Additionally, Parallax has bootstrapped the P1 into a business and they understand how to do that with P2.
So where is the upper boundary? With P1, that upper boundary got baked in due to the layout being all custom. Decisions made early on contributed to the final design.
With P2, things are more fluid and this means a few things:
1. It means some change cycles can be entertained.
2. It increases the risk of chasing the imaginary or just out of reach kinds of things... scope creep, etc...
3. It means a subsequent iteration can be done in a fraction of the time.
4. Though the time cost can shrink to make a P2 possible, the dollar cost remains out there, funded by the existing business. At some point, that may not be a favorable equation as things tend to improve over time. Remaining static only works for so long, and the "so long" is nicely insulated by the niche Parallax is in, but only insulated partially, not completely.
One comment here mentioned moving the HUB to external memory entirely. This is something I want to see myself, but it's also P3 type discussion, given what we've all hashed out so far.
A well-funded P3 effort is going to punch well above its weight. I'm excited to see that, but it's also going to take growth in experience, tools, people, and the niche Parallax owns right now to see success, and that growth comes from the P2, which should nicely expand things.
I worry about that balance, particularly if the sweet spot for P2 is somehow missed or time passes it by. And I've seen this happen a time or two. Not in this particular product space, but I don't think the dynamics are that far off.
Ok, so that's all a big poo-poo.
Now, to be completely fair, one way this research can be done and mixed in with development like we see happening here is by incorporating target users and maximizing things for them. Of the current team working together on this, do we have a sufficiently representative sample for this to work properly, or are some additional targets needed? I'm not saying expand it, but I am saying get the data on missing targets and use that as a check and balance on what is happening right now.
Designing the perfect P2 for Baggers, for example, would lean hard on graphics, ease of development, data storage options, math. Designing the perfect P2 for Bill or JMG may well see a focus on "execute in place" from larger memory sets, and arguably many prospective C users would prefer this.
If I'm wrong on those examples, just roll with it for a moment and correct me later in the thread. No worries.
So then, taking just those two, how does that align with prospective education users, as in people who will move from P1 to P2? And how does that contrast with the potential new users that would come along for the ride? With new being people who would resonate with the ones I just mentioned?
Given that rough guess data, feature investment priority and potential returns can be roughly quantified and that all can tell us what makes sense given it's realized over X amount of time, etc...
That's all. It's a discussion nobody wants to have when there is so much fun building a great thing, but it is one everybody must have when the party looks to be coming to an end all too soon. Having it from time to time is just taking care of "the party": making sure the fun stays connected to keeping it all somewhat viable and sustainable.
I'm not going to mention it further for a while. Either it's worth thinking about, or not, and if not, I'm completely happy to continue on with no worries other than I really would like to see chips sooner rather than later.
One cannot predict the future. However, "make it awesome and it's a sure thing" plays out poorly fairly regularly and it does so simply due to the fact that "awesome" just isn't resonant with enough others to pay off. And it's possible to understand a lot of that now, and we should understand that now so that the risk is reduced and with that, costs and time to release in like kind.
I was just looking at the Micron 512Mb x16 DDR2 interface. That's a lot of pins! No wonder Chip is talking about BGA. Frankly, the more I see what's required to get DDR2 working, the less enthusiastic I am about the idea.
I think the thing we are all jumping at is the idea of having more memory without giving up a cog to access it. Is there no other way? Can we trade DACs for more HUB RAM? Add per-cog hardware support for serial SRAM (with DMA to/from AUX)? Shrink HUB RAM and increase per-cog AUX RAM?
@ potatohead: Very well reasoned and articulated. I can't take issue with a thing. Clearly, adding a DDR2 interface to P2 would make sense only if it could be done quickly, simply, robustly. Perhaps my selfish secret motivation is that I want to make sure the feature gets "out there" so I can buy it! P3 is still pie in the sky.
This is getting crazy... the P2 is getting too big... a good price, more speed, more hub RAM, more pins, and some code protection would be a great start for the P2. I am afraid the P2 is becoming the definition of second-product syndrome.
But you'd still have to have code to get the program loaded into the external RAM in the first place. If the P2 were instead to provide hardware support for SPI (esp. dual/quad SPI) such that each cog could set up DMA between an external flash memory and AUX, you'd be able to do all of that LMM stuff without ever having to touch external ram (or hub ram, even).
Taking this further, if P2 were to increase the inter-cog bus width and exchange some HUB RAM with more per-cog AUX RAM, I think we'd find the cogs quite capable of running large programs with very little need of shared resources!
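A register-level sketch of what such a per-cog channel could look like; every field name here is invented for illustration:

#include <stdint.h>

/* Hypothetical per-cog DMA channel: stream quad-SPI flash into AUX
   without touching hub RAM at all. */
typedef struct {
    uint32_t flash_addr;   /* source in external SPI flash */
    uint16_t aux_addr;     /* destination in this cog's AUX */
    uint16_t longs;        /* transfer length */
    uint8_t  lanes;        /* 1, 2 (dual) or 4 (quad) data lines */
    volatile uint8_t busy; /* hardware clears when the burst lands */
} spi_dma;

/* Queue the next 512-byte code page while the cog keeps executing
   the current one out of AUX. */
static void aux_page_fill(spi_dma *ch, uint32_t src, uint16_t dst)
{
    ch->flash_addr = src;
    ch->aux_addr   = dst;
    ch->longs      = 128;  /* 512 bytes */
    ch->lanes      = 4;
    ch->busy       = 1;
}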
I'll be honest. I like the P2 as it is. Maybe I'd like to see some of these cool things in the future, but I'd really like a P2 in my hands this summer.
I think major redesigns late in a project are a dangerous thing. At some point, you need to freeze, then have fun with the next version.
"This is getting crazy... the P2 is getting too big..."
My input is purely from customers who buy chips for production uses, and I second this motion and have stated the same to Chip. More changes bring us closer to a near-impossible business model due to increased design time cycles and inability to generate revenue and market from smaller step-wise improvements. Our production customers have begged for the following:
- More RAM, more code space
- More I/O
- Faster speed
- Internal A/D
- Freedom to run C kernel or Spin interpreter
- To get the project finished
There's one thing that I'd like which isn't really a feature at all. I'd like to be able to synthesize the whole design in a single piece of software. I understand this would be a step in the backwards direction and isn't really all that practical, though.
It is dangerous territory to encourage more improvements and additions at this point.
Ken Gracey
GREAT post.
To me this whole P2 thing is a mess. It's been really great to see that we can have input and insight into the P2 but the kitchen is so crowded and noisy and it's way past dinner time. Leave the cook alone and let's get dinner on the table.
Seairth,
DMA or not, SPI flash/RAM is limited to <50MB/sec, less than 1/6th of SDRAM and 1/16th of DDR2.
All,
I think everyone here has noticed I like high tech... high performance.
And I think many will be surprised that I think we should stick with SDRAM and TQFP-128 and no major architectural changes for P2. Let's save that stuff for a P3 - heck we can start playing with that as soon as P2 works
Given that the transistor level changes will take Beau a few months, Chip probably has time for a nice SERDES, CRC instructions and other small tweaks he thinks of.
The latest version of the docs shows a great deal of power coming... and I want to play with it ASAP.
Frankly, the delay from P1 to P2 will be about 8 years; that is too long a design cycle these days. I am hoping it will only be 1-2 years after the P2... and we will have a P3.
I think P3 will be a great time to look into BGA, DDR2/3/4/5, another die shrink (500MHz+ ... maybe 1GHz), more cogs, more hub, more cog memory etc.
Ok, those of you who have fallen on the floor by my advocating NOT putting in DDR2 can pick yourselves up now.
Well, what we've got at this point is low-risk to fabricate and it will do SDRAM, which is a lot better than nothing. I just need to get some layout changes to Beau and it can all come together soon.
DDR2 would take some more engineering, which may not be that much time on my side, but it would take Beau a few months to do the layout changes. Maybe it would be best to build the chip we've got and follow on soon after with the DDR2 version, which would either have to be in a BGA package or have only around 48 I/Os.
The DDR2 would be a blast to use because you'd have super big-memory bandwidth, but it would likely mean a BGA for both the Prop2 and the DDR2 memory. Of course, Parallax would make an integrated module at a reasonable price.
The current trajectory has both the Prop2 and the SDRAM in gull-wing packages, and if you don't want the SDRAM, you've got that many more full-function I/Os. We'd make a nice module for this system, too.
"I'd like to be able to synthesize the whole design in a single piece of software."
I'm not sure I understand what you mean by this, but it sounds intriguing. Can you explain in more detail?
He means making a licensable chunk of HDL for use in FPGAs.
Having been the first to say "Hell yeah!" to the idea of native external RAM, I'd like to dial it back a bit.
I don't see any reason to go messing with the Hub, or trying to arrange symmetrical real time access for Bob help us Cogs. Those things are fully baked and probably shouldn't change much.
What I envisioned was the ability for ONE cog -- since the cogs are all the same, you'd have a register to pick which one -- to have exclusive access to the DDR. This is really all you need. If you've got the bandwidth, maybe make a second channel, but make it something that only comes into play when a cog accesses a hub address beyond the real hub RAM, and then only if it's been given access.
This gets you the ability to run really large code, almost all of which is single threaded and does not need deterministic timing. With a second channel maybe it gives you a big video frame buffer too. As for the rest, what makes the Prop shine is mostly small drivers that need the timing features that made the P1 so unique but which generally fit in cogs and even the current 32K Hub. The ability to run something on the scale of a real operating system in one cog while using the other seven for things like USB, video, and other hardware-in-software drivers would add unbelievable value. You could then build what amounts to a single-chip computer with a mix of features found nowhere else.
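The trigger could be a simple address decode, sketched here with invented names: the DDR path only engages for addresses past the top of real hub RAM, and only for the owning cog:

#include <stdbool.h>
#include <stdint.h>

#define HUB_TOP 0x20000u       /* top of real hub RAM (example value) */

static uint8_t ddr_owner;      /* register selecting the one DDR cog */

/* Hub address decode for one access: everything below HUB_TOP behaves
   exactly as today; beyond it, only the owner reaches the DDR. */
static bool routes_to_ddr(uint8_t cog, uint32_t addr)
{
    return addr >= HUB_TOP && cog == ddr_owner;
}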
P.S. On the subject of BGA, I'd suggest having a BGA for the external memory version, since you'd probably want to market it as a module with the DDR included anyway, and a no-external-RAM version of the same chip in some friendlier package with the memory pins left in the package. Similar philosophy to offering a DIP40 version of the P1 for the do-it-yourselfers.