
Propeller II update - BLOG


Comments

  • Bill Henning Posts: 6,445
    edited 2013-11-29 15:53
    I for one would be thrilled with a gull-wing P2 and module soon, followed (as closely as possible <grin>) by a BGA/DDR2 module ... you had me drooling at all the extra bandwidth!
    cgracey wrote: »
    Well, what we've got at this point is low-risk to fabricate and it will do SDRAM, which is a lot better than nothing. I just need to get some layout changes to Beau and it can all come together soon.

    DDR2 would take some more engineering, which may not be that much time on my side, but it would take Beau a few months to do the layout changes. Maybe it would be best to build the chip we've got and follow on soon after with the DDR2 version, which would either have to be in a BGA package or have only around 48 I/O's.

    The DDR2 would be a blast to use because you'd have super big-memory bandwidth, but it would likely be a BGA for both the Prop2 and for the DDR2 memory. Of course, Parallax would make an integrated module at a reasonable price.

    The current trajectory has both the Prop2 and the SDRAM in gull-wing packages, and if you don't want the SDRAM, you've got that many more full-function I/O's. We'd make a nice module for this system, too.
  • Ken Gracey Posts: 7,387
    edited 2013-11-29 15:57
    cgracey wrote: »
    He means making a licensable chunk of HDL for use in FPGA's.

    That would be neat and it is something I think we need to do, but what I really meant was the ability to synthesize our I/O frame (the manual layout "pin" with the A/D hardware) in the same software we use to synthesize the core, eliminating the step of combining the two pieces and improving the potential for a successful foundry run. But I understand it's not practical and that we'd actually lose some of our special P2 features if a full-die synthesis was the only way to do the design.
  • Bill Henning Posts: 6,445
    edited 2013-11-29 16:05
    potatohead wrote: »
    Bill, you are right where I am on it.

    Ship this one. It is going to take our brains and a lot of development work to make it sing. We will learn a lot from that, all of which goes right back into P3.

    Given what is known now, P3 could be 4 years, during which we rock hard on P2, growth happens, everybody gets ready to take it to the next level.

    This excites me far more than a stretch tech goal does for P2. It is going to do a ton more. Just remember what we all thought on the last run.

    Frankly, the recent instruction changes and optimizations were totally worth it. To me, that just tuned up what got done. It was worth it because the assumption was that a new synthesis/layout cycle would not be significantly impacted.

    Let's keep it that way. We all will be better for having done it.

    As for the current FPGA work, I say we expose those development tools people want so that the whole thing boils down to a great simulator. We all use our boards to help us fully exploit P2 quickly.

    Pick new boards for P3 and carry on right after we freeze P2.

    I am willing to bet the insight gained by bootstrapping P2 into business, the way it happened with P1, will pay off very nicely on P3; and if it is done well, P3 will be nicely funded, meaning it too can be great, and the party continues with far fewer worries.

    First P3 discussion, external RAM as HUB? Looking forward to it. :)

    We are in agreement :)

    There is absolutely no reason why P3 development should not start as soon as the shuttle-run P2 chips are verified and the P2 production chips are ordered.

    And the idea of using FPGA tools to implement extreme debugging tools is a great one.

    There is much to be said for iterative development.

    Think of a P2.5 that is the result of the tweaks we all think of once we start torturing P2.

    It should be a much shorter development cycle, because by that time all existing instructions, etc., will be verified.

    Maybe the P3 designation should be reserved for a major architectural change - for example, going 64-bit?

    Regarding hub vs. external memory...

    Internal will always be a lot faster, and is deterministic.

    I do wonder about dividing up the address space, allocating, say, the first 2GB to on-chip memory and 2GB to external DDR2/3/4/?, with some caching... but that is a future discussion.
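    A minimal sketch of the split Bill floats here, assuming a flat 32-bit address space (the boundary and names are illustrative, not a P3 spec):

        ON_CHIP_LIMIT = 0x8000_0000       # first 2GB of a 32-bit space -> internal RAM

        def is_external(addr):
            # Top bit selects the cached external DDR2/3/4 region.
            return addr >= ON_CHIP_LIMIT

        assert not is_external(0x0000_1000)   # lands in on-chip RAM
        assert is_external(0x8000_1000)       # goes through the DDR cache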
  • cgracey Posts: 14,133
    edited 2013-11-29 16:05
    Ken Gracey wrote: »
    That would be neat and it is something I think we need to do, but what I really meant was the ability to synthesize our I/O frame (the manual layout "pin" with the A/D hardware) in the same software we use to synthesize the core, eliminating the step of combining the two pieces and improving the potential for a successful foundry run. But I understand it's not practical and that we'd actually lose some of our special P2 features if a full-die synthesis was the only way to do the design.

    We'd have to characterize our I/O pins and memories in such a way that they could be ingested by a chip-level synthesis tool. Then, there could be "closure" across all aspects of the design. Right now, we are integrating synthesized logic into our full-custom frame that contains all the I/O's, I/O busses, and memories. The barrier between the synthesis tool and our custom layout is a place where problems can creep in, because there is only limited full-chip verification possible.
  • Bill Henning Posts: 6,445
    edited 2013-11-29 16:05
    Maybe the tools will improve enough in time for P3?
    Ken Gracey wrote: »
    That would be neat and it is something I think we need to do, but what I really meant was the ability to synthesize our I/O frame (the manual layout "pin" with the A/D hardware) in the same software we use to synthesize the core, eliminating the step of combining the two pieces and improving the potential for a successful foundry run. But I understand it's not practical and that we'd actually lose some of our special P2 features if a full-die synthesis was the only way to do the design.
  • cgracey Posts: 14,133
    edited 2013-11-29 16:08
    Bill Henning wrote: »
    Maybe the tools will improve enough in time for P3?

    When we get much below 180nm, it may be impractical for us to design our own pins and memories anymore. The number of design rules increases horrendously as things shrink. For 350nm, I think there were about 300 rules. For 40nm, there may be 4,000 rules.
  • jmg Posts: 15,155
    edited 2013-11-29 16:11
    potatohead wrote: »
    I don't feel good about the RAM change, based on those things. Primary reason: Too big of a jump.

    I would manage the DDR support as a stub/fork, so that the present FPGA P2 stays stable, with the bug fixes and serial patches already in the pipeline completed.

    The DDR2 step is not that large a jump - it's really an overgrown multiplexer and state machine.

    However, it does need to talk to another chip, which is where it morphs into a 'balls in the air' problem.

    What could be practical is to do the 64 full I/O + 1.8V digital/memory I/O split that Chip mentioned, which buys more hub memory (how much?) and allows some steps toward faster digital I/O.
    Maybe OnSemi has DDR2 IP that would make this much more of a cut/paste task?

    OR

    Given that the idea of [P2 chip + small Cyclone V] has already been floated, perhaps that could become [P2 chip + small Cyclone V + DDR2].
    You can buy [small Cyclone V + DDR3] right now, and maybe the SW layers are not too different between DDR2 and DDR3?
    (Altera specs DDR3 and DDR2 at the same peak MHz on Cyclone V.)

    A key question then becomes: how fast can the P2 FPGA image run in a Cyclone V?
    Certainly, Altera's DDRx support is already done and runs at 400MHz, so one testing pathway exists right now.
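    For rough context, the peak-rate arithmetic behind the DDR2-vs-SDRAM comparison, using assumed datasheet-style figures (x16 parts, SDR at 200 MHz, DDR2 at the 400 MT/s jmg cites):

        # Peak (not sustained) rates: bytes/s = transfers/s x bus width in bytes.
        sdr_bw  = 200_000_000 * 2   # x16 SDR SDRAM @ 200 MHz             -> 400 MB/s
        ddr2_bw = 400_000_000 * 2   # x16 DDR2 @ 400 MT/s (200 MHz clock) -> 800 MB/s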
  • Cluso99 Posts: 18,069
    edited 2013-11-29 16:19
    Please excuse errors - in the car on a Xoom and it's hopeless to correct typos.

    I understand Ken's and others' concerns, so let's put it into perspective.

    Chip came up for air, and part of that release was to look around at some things of interest. He realised HD video requires more bandwidth and found cheap DDR2 RAM, but it's only in BGA. He wondered whether it is possible to use it in P2.

    So we spent the day musing over possibilities. Nothing has been decided. Chip is relaxing in his own way. If it turns out there are some real advantages, then I am sure Chip and Ken will discuss it.

    So let's consider the possibilities in the mindset that nothing may come to pass.

    How long will DDR2 chips be available?
    Can the DDR state m/c also handle other types of SDRAM?

    Could we have a second 128-bit bus going to each cog (in parallel with, and like, the hub/cog 128-bit bus)?
    By multiplexing these two buses at each cog end, we can retain the normal hub/cog interface with 1:8 clock access. The second bus would go to the DDR state m/c. This gives the DDR its own hub-style bus that it would own. With a few new instructions, a cog could request r/w block transfers to/from DDR memory. The state m/c would have a small RAM buffer which could be windowed into hub.

    Now, the second bus could transfer an 8-word block (quad long) in 1 clock. The state m/c can access each cog for 7:8 clocks! Of course this has a rolling effect - due to the existing hub interface. But if there is no cog transfer taking place, the state m/c has a full 8-cycle access, requiring a little more logic.

    With this second bus we can put DDR in/out of cog RAM or AUX memory.

    This creates a huge bandwidth available to be shared ad hoc between cogs and the DDR. It would only be limited by the design of the state m/c and the cog instructions.

    We would have an awesome video and XMM interface! And I am not sure that it is that complex. It's a much more versatile interface than widening the hub bus.

    Now, if the ddr
  • cgracey Posts: 14,133
    edited 2013-11-29 16:23
    Cluso99 wrote: »
    Please excuse errors - in the car on a Xoom and it's hopeless to correct typos.

    I understand Ken's and others' concerns, so let's put it into perspective.

    Chip came up for air, and part of that release was to look around at some things of interest. He realised HD video requires more bandwidth and found cheap DDR2 RAM, but it's only in BGA. He wondered whether it is possible to use it in P2.

    So we spent the day musing over possibilities. Nothing has been decided. Chip is relaxing in his own way. If it turns out there are some real advantages, then I am sure Chip and Ken will discuss it.

    So let's consider the possibilities in the mindset that nothing may come to pass.

    How long will DDR2 chips be available?
    Can the DDR state m/c also handle other types of SDRAM?

    Could we have a second 128-bit bus going to each cog (in parallel with, and like, the hub/cog 128-bit bus)?
    By multiplexing these two buses at each cog end, we can retain the normal hub/cog interface with 1:8 clock access. The second bus would go to the DDR state m/c. This gives the DDR its own hub-style bus that it would own. With a few new instructions, a cog could request r/w block transfers to/from DDR memory. The state m/c would have a small RAM buffer which could be windowed into hub.

    Now, the second bus could transfer an 8-word block (quad long) in 1 clock. The state m/c can access each cog for 7:8 clocks! Of course this has a rolling effect - due to the existing hub interface. But if there is no cog transfer taking place, the state m/c has a full 8-cycle access, requiring a little more logic.

    With this second bus we can put DDR in/out of cog RAM or AUX memory.

    This creates a huge bandwidth available to be shared ad hoc between cogs and the DDR. It would only be limited by the design of the state m/c and the cog instructions.

    We would have an awesome video and XMM interface! And I am not sure that it is that complex. It's a much more versatile interface than widening the hub bus.

    Now, if the ddr

    Imagine P3 where we have way bigger cog RAM. It would be great for the DDR2/3 driver to DMA right into it. Maybe there's no need for something like AUX RAM, either. Simplification would be good and can be achieved. We're just in tighter spaces right now.
  • Bill Henning Posts: 6,445
    edited 2013-11-29 16:28
    I am imagining a P3 with 64-bit cogs.

    - easy instruction and address space expansion
    - cleaner, more regular instruction set
    - a hub would still be useful for inter-cog communications, or perhaps something like transputer channels, but faster

    I think AUX would still be valuable, as it could have separate ports for video reads and RAM reads/writes, without having to six-port (shudder) the cog memory.
    cgracey wrote: »
    Imagine P3 where we have way bigger cog RAM. It would be great for the DDR2/3 driver to DMA right into it. Maybe there's no need for something like AUX RAM, either. Simplification would be good and can be achieved. We're just in tighter spaces right now.
  • Cluso99 Posts: 18,069
    edited 2013-11-29 16:29
    Damn - the Xoom lost the plot.

    If the hub has a small (say 1KB) dual-port RAM where the second port goes to the state m/c, we have another mechanism for small cost.

    Lastly, it may save some more silicon if the ADC were removed from the DDR pins. Having 64 GPIO pins would be fine. It might even be acceptable to only have DACs on 32 pins - either 0-31 or 32-63.

    The DDR state m/c should be able to do both DDR2 and SDRAM (as was previously going to be used?).
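    Putting rough numbers to the second 128-bit bus described two posts up, assuming the 200 MHz system clock used elsewhere in the thread (illustrative only):

        CLK = 200_000_000                        # assumed system clock, Hz
        BLOCK_BYTES = 16                         # one 128-bit block per clock

        ddr_bus_bw = CLK * BLOCK_BYTES * 7 // 8  # state m/c owns 7 of 8 slots: 2.8 GB/s
        hub_bw     = CLK // 8 * BLOCK_BYTES      # unchanged 1:8 hub window:    400 MB/s
        # The raw bus far outruns the ~800 MB/s a x16 DDR2 can feed, so the
        # memory, not the proposed bus, would be the limit.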
  • jmg Posts: 15,155
    edited 2013-11-29 16:39
    Cluso99 wrote: »
    Lastly, it may save some more silicon if the ADC were removed from the DDR pins. Having 64 GPIO pins would be fine. It might even be acceptable to only have DACs on 32 pins - either 0-31 or 32-63.

    Some numbers on how much RAM you can gain for each removed analog ADC/DAC pin feature would be good here.

    128K hub RAM is still rather small... the latest Microchip part has 512K of RAM and flash.
  • Bill Henning Posts: 6,445
    edited 2013-11-29 16:44
    jmg wrote: »
    Some numbers on how much RAM you can gain for each removed analog ADC/DAC pin feature would be good here.

    128K hub RAM is still rather small... the latest Microchip part has 512K of RAM and flash.

    I looked at the errata and went OUCH!
  • ozpropdev Posts: 2,792
    edited 2013-11-29 17:11
    I think the latest tweaks have made a HUGE difference to P2.
    Having used the virtual P2 for a while now, I think back to the code I wrote at the
    start of the experience versus what I am writing now: vastly better performance, reduced code size, etc.

    The small window available to Chip to add a few instructions doesn't seem to be too big a deal for the guru.
    If they don't eat up silicon and turn out to be a page of Verilog code, then it's worth it.
    If all that's left is the SPI/SERDES and it's not too big a job, then that should be the LAST addition.
    IMHO, it's time to "lock and load".

    DDRx and enhancements are for P2.5/P2+
    Architecture changes are P3

    When we get stuck into P3 using the FPGA I believe the lessons learnt from our current P2 experience
    will rapidly produce a working P3 conceptual model.

    As much as I want the P2 to do some more stuff, I could use the current P2 already and will make
    the switch to P2.5/P2+ when it appears.

    Put simply, as soon as I can buy chips (whatever the final chip is) I will start buying them.
  • Ramon Posts: 484
    edited 2013-11-29 19:55
    cgracey wrote: »
    I have a question for you all:

    Would it be a good idea to reduce the universal-purpose I/O pin count down to 64, from 92, in order to provide enough fast 1.8V I/Os to talk to a x16 DDR2 SDRAM?

    I ask because once I started playing around with big memory (32MB in a 3.3V SDRAM), I could see right away that the Prop2 would be able to do a lot of exceptional things if we had big, fast, cheap memory. The sweet spot for SDRAM seems to be the 64MB DDR2, which costs about $2. It's a little cheaper than 32MB of 3.3V SDRAM that we are on target for, but can be read and written twice as fast, which makes 1080p all-points-addressable graphics a lot more practical. It needs 1.8V I/O's, though.

    By doing this, we would free up lots of silicon area which is just routing the DAC busses. There's actually as much area committed to routing the DAC busses as the core takes. We could free up 50% more room for the core by cutting the DAC busses in half.

    This may not be practical to do at this point, because of all the manual layout involved, but I'm just thinking about it. What do you guys say about this?

    I think that is a good change, if the DAC/ADC capability is not completely lost. One of the big points of the Propeller II is that its pins are both digital and analog (each can be used as an ADC or DAC). That is something very unique that I think can also have a lot of exceptional applications.

    Think also about packaging. P1 has 40 pins and P2 would have 128 pins. That is a big jump. I think the sweet spot for packaging (nowadays) is a 44-, 48-, 52- or 64-pin TQFP/VQFP, because of the small size, easy soldering, and lower production cost than BGA. The problem is how to predict the packaging that will be available in 3-5 years. (Maybe in 3-5 years the packaging companies can easily do a double-die SRAM + P2.)
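    A quick check of the 1080p claim in the quote above, assuming 60 Hz scan-out (pixel depth left as a variable):

        def scanout_bw(width, height, bytes_per_pixel, hz=60):
            """Bytes/s needed just to refresh the display, before any drawing."""
            return width * height * bytes_per_pixel * hz

        print(scanout_bw(1920, 1080, 3))  # ~373 MB/s at 24 bpp
        print(scanout_bw(1920, 1080, 4))  # ~498 MB/s at 32 bpp
        # Either figure eats most or all of a 400 MB/s SDR SDRAM, leaving little
        # or nothing for drawing; DDR2's roughly doubled rate is what makes
        # all-points-addressable 1080p "a lot more practical".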
  • Ramon Posts: 484
    edited 2013-11-29 20:13
    Kerry S wrote: »
    I see this change making the P2 an excellent cross between a Microcontroller and a Microcomputer and, aside from the custom SoC route, that is something not currently available for small low volume projects (like mine will be).

    Think 4 cogs running Video, Mouse/Keyboard, Touchscreen, User I/O +++ 4 cogs running deterministic process control all coordinated by the common bus/memory! NOTHING else out there now can do that!

    This is a great marketing opportunity for Parallax to fill in (similar to their history with the BS and SX) a niche that will open up a lot of "oh my, look what I can do now...".

    I have quite the same feeling.

    But do not think only in digital terms. If each pin can also have a high-resolution, high-speed ADC/DAC, that will be the definitive industrial IC.
  • Ramon Posts: 484
    edited 2013-11-29 20:30
    cgracey wrote: »
    When we get much below 180nm, it may be impractical for us to design our own pins and memories anymore. The number of design rules increases horrendously as things shrink. For 350nm, I think there were about 300 rules. For 40nm, there may be 4,000 rules.

    Chip, how about a multi-die IC? The core synthesized in any process below 180nm, and the I/O part done by your team in the 180nm process that all of you know pretty well.

    If Parallax is considering open-source development for the core, you can keep the I/O part closed source. What do the packaging companies say about this?
  • Kerry S Posts: 163
    edited 2013-11-30 01:03
    While I agree with everyone about wanting a P2 chip yesterday, and avoiding any more delays, I am concerned about this memory issue.

    I would rather wait an additional 30-60-90 days and get a chip that can really deliver on its capabilities instead of getting one much sooner, only to realize after it is out that it just cannot deliver what it should.

    Why do I say that?

    The P2 is a huge leap forward from the P1, with 3 times the I/O available and much more complex I/O at that. The P1 seems to have started as a cool educational concept that grew into a very capable and flexible controller. What is the purpose of the P2? Why do we have all of that powerful, flexible I/O if we cannot drive it with equally impressive programming? Why do we have fantastic video if we cannot use it for our user interface? What will it take to fully utilize the chip's capabilities in the real world for commercial/industrial applications? Sure, we have 160MHz and 1 instruction per cycle for most cog PASM instructions, but will we ever really be able to get that? From what I have read, and maybe I missed something somewhere, we are still limited to 512 instructions for native PASM in each cog. That was TIGHT for a P1 with 32 I/O. Now we have 96, with each cog trying to push up to 4 threads to drive that power? Will it be possible to do that using PASM drivers that fit in 512 instructions? So now we have a processor that in theory is GREAT, but when pushed we end up with the following:

    We lose one cog right off the bat because we need it to run either a Spin or C interpreter to handle programs too big to fit into a cog. Before we load a program we are down to 7 cogs with 96 I/O to utilize. Add in a great user interface system, which the P2 should be able to deliver, and some impressive commercial/industrial-style process control, and now we have that ONE cog trying to interpret 7 complex programs and drip-feed our 7 worker cogs. Has anyone actually done a study to see what real-world performance we will get when pushed to the limit? How much time are those 7 cogs going to spend waiting for the next instruction?

    And is that not the point of the Propeller design concept, that you can do amazing things when pushing it to the limit? If we are hamstrung by cog memory limitations what good is all of the fancy I/O going to be?

    Right now if you want a great user interface you have to use either a PC or a SoC like a CubieBoard2 or a Raspberry Pi (neither of those is suited for a commercial product, and getting your own custom SoC made is not a real possibility for most of us). If you want serious I/O you end up with a single-core Microchip controller. Need both at the same time? Well, you can put expensive I/O boards into your PC, or you can try to do it with a parallel or serial connection between the two. Either way it is a complex setup, both electrically and software-wise. Programming a PIC32 to do real-time process control is very difficult, as you have to either write your own code or buy a 3rd-party RTOS and add that level of cost/complexity.

    Why is the P2 a revolutionary product? Because you can do it all on a single chip. NO complex OS needed, NO interrupts to worry about stalling, just very simple "here are my 8 cogs independently programmed yet fully coordinated"!

    I will say it again NOTHING ELSE OUT THERE CAN TOUCH THAT.

    My dream P2? Eight 32-bit cogs driving 64 flexible I/O pins, with 8MB of DDR2 RAM dedicated to each cog that they run their programs directly out of, their cog RAM converted into 512 programmable data registers, and hub RAM used just for inter-cog data sharing.

    Ultimately it is up to Chip and Ken to decide... what is the purpose of this chip? Is it an incremental improvement on the P1 or is it a revolution in the way we think about system design?

    Personally I think they are on the verge of an amazing revolution...
  • Seairth Posts: 2,474
    edited 2013-11-30 01:08
    DMA or not, SPI flash/RAM is limited to <50MB/sec - less than 1/6th of SDRAM and 1/16th of DDR2

    Agreed. However, I was suggesting this to meet a functional need, not a performance need. Frankly, we waste quite a lot of our cog memory with essentially static instructions. This is exactly the kind of thing that Harvard architectures avoid. And, no, I'm not suggesting we adopt a Harvard architecture. Instead, I'm suggesting that there's an easy way we could get some of the same benefits, thereby allowing larger code to be run "directly" while also allowing more of the cog memory (and hub memory, for that matter) to be used for data.

    Note that the suggestion to use DMA with the AUX registers was to try to keep the approach within the current constraints of the hardware; I'm sure there are technically superior approaches with more extensive hardware changes.
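    One reading of Seairth's suggestion is an overlay scheme: DMA a whole block of native code into cog RAM, run it at full speed, and swap blocks as needed. A toy sketch with hypothetical names (not actual P2 hardware calls):

        COG_RAM_LONGS = 512                 # a cog's 512 longs of RAM
        cog_ram = [0] * COG_RAM_LONGS

        def load_overlay(big_mem, base, length):
            """Copy one block of native code into cog RAM (the 'DMA' step)."""
            cog_ram[0:length] = big_mem[base:base + length]
            # ...then jump to cog_ram[0]. Only the copy costs time; unlike a
            # per-instruction LMM fetch loop, the code then runs at full speed.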
  • Cluso99 Posts: 18,069
    edited 2013-11-30 03:35
    Chip,
    If you only provided DACs on the first 64 I/O pins, is there much work involved in removing the rest to reclaim useful space?
    Could it also be worth considering only providing the DACs on 32 I/O pins? I am basing this on the assumption that VGA/HD only requires 3 pins with DACs - is this correct???
    Could you add more hub in this space?

    Would there be any benefit (space) if the first 64 I/O pins only had ADC capability? Is there much work involved - I presume this could be quite a job as this is the part Beau did?

    I would expect that the video circuitry, including all the pixel maths blocks, takes considerable space too. Since this is all in the synthesised logic (presumed), might it be possible to only provide 2 or 3 copies of this block, and a multiplexer to each cog, with a simple mechanism for a cog to claim a video block (like locks)?
    With the current limits of hub ram, and bandwidth of SDRAM, it would be reasonably unlikely that any users will require more than 2-3 video circuits.

    These are just a few thoughts which might be totally impractical.

    I thought of a fairly simple mechanism (IMHO) to provide cogs with an optional extra hub cycle if available. However, looking at the bandwidth available to the SDRAM (max 200MHz x word = 400MB/s), this is the same as a cog's access to hub (200MHz / 8 x quad long = 400MB/s), so there really seems no point. If you think otherwise, I'll explain.

    BTW, my biggest concern with DDR2 was continued supply. It is rather old technology already, so how long will it be available?
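    Cluso99's equality two paragraphs up, spelled out under his stated assumptions (200 MHz clock, x16 SDRAM, one quad-long hub transfer per 8-clock slot):

        sdram_bw = 200_000_000 * 2         # 16-bit word per clock          = 400 MB/s
        hub_bw   = 200_000_000 // 8 * 16   # 16-byte quad long per 8 clocks = 400 MB/s
        assert sdram_bw == hub_bw          # an extra hub cycle buys nothing here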
  • rjo__ Posts: 2,114
    edited 2013-11-30 05:02
    Everybody has thought it. So, I might as well say it.

    What if the P2 turns out to be impossible to build?

    What if there is an interface issue (between the analog end and the fully digital end), which turns out to be extremely difficult to debug?

    How much time and how many shuttle passes would it take to make: "…that's enough, we can't do this anymore" a reality?

    Chip's latest musings and the resulting conversations could be the path we are on, whether we like it or not. It is entirely possible that a P3 could be ready before a P2 shuttle run succeeds.

    I agree that at this point, the P2 needs to be:

    fully defined,
    fully debugged,
    wrapped up,
    and shuttled out,

    but the consideration and movement toward a fully digital design might be the fastest way to actually get silicon out the door.

    While display capabilities might have driven the initial consideration of the memory architecture, we should consider the potential for marketable applications that these design changes make possible.

    My favorite examples: machine vision. It should be cheap; it isn't. With a P3, robust machine vision is possible without the need for network-level arbitration or very expensive parts.

    Real-time 3D vision… Unless you cheat or go to 4D, it actually requires 3 cameras (to eliminate refractive artifacts)… with cog-level memory support, the P3 could support two sets without breaking a sweat.
  • ctwardell Posts: 1,716
    edited 2013-11-30 05:40
    Kerry S wrote: »
    While I agree with everyone about wanting a P2 chip yesterday, and avoiding any more delays, I am concerned about this memory issue.

    I would rather wait an additional 30-60-90 days and get a chip that can really deliver on its capabilities instead of getting one much sooner, only to realize after it is out that it just cannot deliver what it should.

    Why do I say that?

    The P2 is a huge leap forward from the P1, with 3 times the I/O available and much more complex I/O at that. The P1 seems to have started as a cool educational concept that grew into a very capable and flexible controller. What is the purpose of the P2? Why do we have all of that powerful, flexible I/O if we cannot drive it with equally impressive programming? Why do we have fantastic video if we cannot use it for our user interface? What will it take to fully utilize the chip's capabilities in the real world for commercial/industrial applications? Sure, we have 160MHz and 1 instruction per cycle for most cog PASM instructions, but will we ever really be able to get that? From what I have read, and maybe I missed something somewhere, we are still limited to 512 instructions for native PASM in each cog. That was TIGHT for a P1 with 32 I/O. Now we have 96, with each cog trying to push up to 4 threads to drive that power? Will it be possible to do that using PASM drivers that fit in 512 instructions? So now we have a processor that in theory is GREAT, but when pushed we end up with the following:

    We lose one cog right off the bat because we need it to run either a Spin or C interpreter to handle programs too big to fit into a cog. Before we load a program we are down to 7 cogs with 96 I/O to utilize. Add in a great user interface system, which the P2 should be able to deliver, and some impressive commercial/industrial-style process control, and now we have that ONE cog trying to interpret 7 complex programs and drip-feed our 7 worker cogs. Has anyone actually done a study to see what real-world performance we will get when pushed to the limit? How much time are those 7 cogs going to spend waiting for the next instruction?

    And is that not the point of the Propeller design concept, that you can do amazing things when pushing it to the limit? If we are hamstrung by cog memory limitations what good is all of the fancy I/O going to be?

    Right now if you want a great user interface you have to use either a PC or a SoC like a CubieBoard2 or a Raspberry Pi (neither of those is suited for a commercial product, and getting your own custom SoC made is not a real possibility for most of us). If you want serious I/O you end up with a single-core Microchip controller. Need both at the same time? Well, you can put expensive I/O boards into your PC, or you can try to do it with a parallel or serial connection between the two. Either way it is a complex setup, both electrically and software-wise. Programming a PIC32 to do real-time process control is very difficult, as you have to either write your own code or buy a 3rd-party RTOS and add that level of cost/complexity.

    Why is the P2 a revolutionary product? Because you can do it all on a single chip. NO complex OS needed, NO interrupts to worry about stalling, just very simple "here are my 8 cogs independently programmed yet fully coordinated"!

    I will say it again NOTHING ELSE OUT THERE CAN TOUCH THAT.

    My dream P2? Eight 32-bit cogs driving 64 flexible I/O pins, with 8MB of DDR2 RAM dedicated to each cog that they run their programs directly out of, their cog RAM converted into 512 programmable data registers, and hub RAM used just for inter-cog data sharing.

    Ultimately it is up to Chip and Ken to decide... what is the purpose of this chip? Is it an incremental improvement on the P1 or is it a revolution in the way we think about system design?

    Personally I think they are on the verge of an amazing revolution...

    This is almost all P3 stuff.

    Nothing in the current discussion for P2 involves going beyond 512 longs for COG memory. That would require expanding beyond 32-bit instructions to expand the address capability.

    Chip already mentioned going to 64 bit instruction size for P3, that's when we can really expand things.

    The current P2 design, with SDRAM, provides plenty of RAM for many applications.
    Sure, there are things that would be nice to have in some use cases, but right now we have nothing, so it has no application in ANY use case.

    What Chip has designed is already amazing and goes way beyond P1.

    We have to keep in mind that a series of chips that provide incremental improvement is far better than constantly redefining the 'perfect chip' that is never delivered.

    C.W.
  • Dave Hein Posts: 6,347
    edited 2013-11-30 06:01
    ctwardell wrote: »
    Nothing in the current discussion for P2 involves going beyond 512 longs for COG memory. That would require expanding beyond 32-bit instructions to expand the address capability.
    Accessing more than 512 longs does not require expanding beyond 32-bit instructions. There have been many suggestions for addressing much more than 512 longs without increasing the instruction width. However, I'm hoping that all the discussion since Chip posted "The Big Update is done!!!" applies to P3 and not P2.
  • ctwardell Posts: 1,716
    edited 2013-11-30 06:13
    Dave Hein wrote: »
    Accessing more than 512 longs does not require expanding beyond 32-bit instructions. There have been many suggestions for addressing much more than 512 longs without increasing the instruction width. However, I'm hoping that all the discussion since Chip posted "The Big Update is done!!!" applies to P3 and not P2.

    I don't recall any of them being considered, at least by Chip, for implementation on the P2 - maybe I missed something.

    C.W.
  • David Betz Posts: 14,511
    edited 2013-11-30 06:24
    ctwardell wrote: »
    I don't recall any of them being considered, at least by Chip, for implementation on the P2 - maybe I missed something.

    C.W.
    I agree. Leave these major changes like DDR for P3. In fact, I'd rather that more thought be put into allowing compiled high-level languages to run in native mode as a priority for P3. LMM is a neat hack but there really needs to be better support for compiled languages in any future Propeller chip. I'm disappointed that it isn't in P2 but I hope it is at the front of the list for P3.
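    For readers new to the term: LMM (Large Memory Model) keeps a tiny fetch-execute kernel in cog RAM and streams native instructions from hub RAM, trading speed for code size. A toy sketch of the idea, with an invented opcode set:

        def lmm_run(hub_mem, pc=0):
            """Toy LMM kernel: fetch one 'instruction' per pass from big memory."""
            regs = [0] * 16
            while True:
                op, d, s = hub_mem[pc]   # each fetch costs a hub access...
                pc += 1                  # ...which is why LMM runs slower than
                if op == "mov":          # code resident in cog RAM
                    regs[d] = s
                elif op == "add":
                    regs[d] += regs[s]
                elif op == "ret":
                    return regs[d]

        # mov r0,2 / mov r1,3 / add r0,r1 / ret r0  ->  5
        assert lmm_run([("mov", 0, 2), ("mov", 1, 3), ("add", 0, 1), ("ret", 0, 0)]) == 5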
  • cgracey Posts: 14,133
    edited 2013-11-30 09:44
    No matter what, every pin will have a 9-bit DAC and a delta-sigma ADC. Those are built-in and all fit underneath the power rings.

    What is taking 20% of the chip area, or 10 square mm, is the huge DAC bus for DAC signals that come out of cogs and update on every system clock or video clock. If we eliminate that bus, we'll have double the space we have now for the core, which would be a huge deal.

    The limitation caused by getting rid of that huge DAC bus would be that certain pins would now be tied to certain cogs for outputting new DAC data on every clock. It wouldn't affect anything else, like static DAC updates or ADC's. It would just mean that for outputting video or analog CTR signals, certain pins would relate to certain cogs. Is that a limitation that we can live with?

    Such a change would allow all 1.8V pin logic to be synthesized within the core and drastically reduce timing complexities for I/O. Also, we could put things like parallel DAC updating and DAC dither into the core-side logic for each pin. We could even have ADC tallies computed per pin, or PWM output. This stuff is all very simple to do, actually. I'm also aware that what we currently have for the core logic is bigger than our last fab attempt, and probably won't fit into the old space. We're going to have to eliminate at least part of that huge DAC bus. I'm thinking of getting rid of it, altogether. What do you guys think?
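    On the "DAC dither" idea: a first-order error-feedback sketch, assuming a 9-bit DAC asked to approximate a 12-bit target by alternating adjacent codes (parameters invented for illustration):

        def dac_dither(target12, n=8):
            """Emit n 9-bit DAC codes whose average approximates a 12-bit target."""
            err, codes = 0, []
            for _ in range(n):
                want = target12 + err        # carry quantization error forward
                code = min(want >> 3, 511)   # truncate 12 bits -> 9-bit code
                err = want - (code << 3)     # remember what was lost
                codes.append(code)
            return codes

        codes = dac_dither(100)              # alternates between codes 12 and 13
        assert abs(sum(c << 3 for c in codes) / len(codes) - 100) < 8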
  • Heater. Posts: 21,230
    edited 2013-11-30 10:04
    I think you are making me very nervous.

    Removing that bus and the other rearrangements you have mentioned sound like a huge change and you are saying it has to be done because the core logic has grown with all the other changes going on recently.

    Now, I've always loved the idea that on the Prop every pin is equal, every cog is equal, and any cog can drive any pin just the same. It is possible that this ideal does not scale, which seems to be what you have run into. If the practical upshot of removing that bus is that most of the Prop's regularity remains but only video is tied to a particular cog, then I could live with that. After all, are we ever likely to want more than one video output on a Prop at a time?

    Any chance of more HUB RAM if the bus is removed?
  • cgracey Posts: 14,133
    edited 2013-11-30 10:18
    Heater. wrote: »
    I think you are making me very nervous.

    Removing that bus and the other rearrangements you have mentioned sound like a huge change and you are saying it has to be done because the core logic has grown with all the other changes going on recently.

    Now, I've always loved the idea that on the Prop every pin is equal, every cog is equal, and any cog can drive any pin just the same. It is possible that this ideal does not scale, which seems to be what you have run into. If the practical upshot of removing that bus is that most of the Prop's regularity remains but only video is tied to a particular cog, then I could live with that. After all, are we ever likely to want more than one video output on a Prop at a time?

    Any chance of more HUB RAM if the bus is removed?

    We really are out of room. We can't grow the die anymore, because it is at the limit of what will go into the biggest-die-pad TQFP-128 package. Even the pad frame is maxed out and can't accommodate any more I/O's or power pins. It just seems like an exorbitant expense to have 20% of the die spent on a giant data bus of 288 signals that circle the chip and are likely to be only partially used even in a very busy application.

    About more hub RAM: the next step would be to double what we've got, and that would eat up the entire 10 square mm. We could double or quadruple the AUX RAMs, though.
  • Heater. Posts: 21,230
    edited 2013-11-30 10:32
    Well, if it won't fit it won't fit. So there is no way out.
    The way you put it that bus does sound like a gross waste of space.

    Still makes me nervous though. We are going to get a rash of suggestions for things to fill the freed-up space with. More tweaks and features and...
  • ctwardell Posts: 1,716
    edited 2013-11-30 10:32
    cgracey wrote: »
    No matter what, every pin will have a 9-bit DAC and a delta-sigma ADC. Those are built-in and all fit underneath the power rings.

    What is taking 20% of the chip area, or 10 square mm, is the huge DAC bus for DAC signals that come out of cogs and update on every system clock or video clock. If we eliminate that bus, we'll have double the space we have now for the core, which would be a huge deal.

    The limitation caused by getting rid of that huge DAC bus would be that certain pins would now be tied to certain cogs for outputting new DAC data on every clock. It wouldn't affect anything else, like static DAC updates or ADC's. It would just mean that for outputting video or analog CTR signals, certain pins would relate to certain cogs. Is that a limitation that we can live with?

    Such a change would allow all 1.8V pin logic to be synthesized within the core and drastically reduce timing complexities for I/O. Also, we could put things like parallel DAC updating and DAC dither into the core-side logic for each pin. We could even have ADC tallies computed per pin, or PWM output. This stuff is all very simple to do, actually. I'm also aware that what we currently have for the core logic is bigger than our last fab attempt, and probably won't fit into the old space. We're going to have to eliminate at least part of that huge DAC bus. I'm thinking of getting rid of it, altogether. What do you guys think?

    It sounds like there is little choice based on die size, and increasing the amount of the chip that can be synthesized is an upshot.

    I think any changes that improve the chance for success with minimal additional shuttle runs need to be strongly considered.

    C.W.