Propeller II update - BLOG

KC_Rob · 2013-11-25 09:24

Ken Gracey wrote: »

Hello Devin - it could be a while yet....
In the meantime, design around Propeller 1!

Well, I can now honestly say that I'm doing my part. I just got my first design win for the Propeller 1; the PO came in Friday! I've been trying to get a Prop designed in a real product for a while now (it had nearly become a running joke with some of my clients), and finally, a couple of weeks ago, work came in that was ideally suited to the Prop's capabilities, so much so that it was practically a slam-dunk making the case for it.

While not much, this is a start. In fact, very shortly I'll be meeting with the same client about another project that the P1 could be a good candidate for.

potatohead · 2013-11-25 09:57

Very cool!

If you can shar win details someday, please do.

Seairth · 2013-11-25 12:45

KC_Rob wrote: »

I just got my first design win for the Propeller 1; the PO came in Friday!

Congratulations! I know what that feels like. I've been trying to convince my company similarly for several internal projects. This gives me hope that I'll convince them yet!

KC_Rob · 2013-11-25 14:16

Seairth wrote: »

Congratulations! I know what that feels like. I've been trying to convince my company similarly for several internal projects. This gives me hope that I'll convince them yet!

Thanks! Yes, hang in there. Something will eventually turn up that the Prop is undeniably well-suited for. In the meantime, if my experience is anything to go by, you and your Propeller obsession may become the butt of a few jokes.

KC_Rob · 2013-11-25 14:17

potatohead wrote: »

Very cool!

If you can shar win details someday, please do.

Thank you! I will share some of those details very soon!!

potatohead · 2013-11-25 15:17

...err "share" Thanks. Looking forward to it.

Oldbitcollector (Jeff) · 2013-11-26 11:38

Could we get a "status" update on when the next test run will be?

Thanks
Jeff

cgracey · 2013-11-26 11:50

Oldbitcollector (Jeff) wrote: »

Could we get a "status" update on when the next test run will be?

Thanks
Jeff

We've got to make transistor-size changes to the layout and resynthesize the core. I estimate new test chips in 4 months.

Cluso99 · 2013-11-26 13:54

cgracey wrote: »

We've got to make transistor-size changes to the layout and resynthesize the core. I estimate new test chips in 4 months.

Ouch! Does that mean you have to redo all the sram blocks, etc (all the sections Beau did)?

BTW Chip, DO you ever sleep???

Yanomani · 2013-11-26 14:07

Hi Cluso99

I'm not sure about Chip, but I believe that the lack of sleep is a common feature, and even essential, in our electro-digital fraternity.

Except when I'm sick, four to five hours of sleep are enough for me. It has been that way since I was eleven years old, i.e. for a long time now.

Yanomani

cgracey · 2013-11-26 15:01

Cluso99 wrote: »

Ouch! Does that mean you have to redo all the sram blocks, etc (all the sections Beau did)?

BTW Chip, DO you ever sleep???

Not the SRAM cells, themselves, but the driving logic needs modifications. We need to shrink the width of the PMOS logic transistors by about 1/4th, as this new process requires less P width to match N width drive strength.

Cluso99 · 2013-11-26 15:23

cgracey wrote: »

Not the SRAM cells, themselves, but the driving logic needs modifications. We need to shrink the width of the PMOS logic transistors by about 1/4th, as this new process requires less P width to match N width drive strength.

Thanks Chip.
Is this a time consuming job?
I presume Beau will be doing this.
Does this mean there will be more usable silicon or is it irrelevant in the scale of other parts of the die?

cgracey · 2013-11-26 15:42

Cluso99 wrote: »

Thanks Chip.
Is this a time consuming job?
I presume Beau will be doing this.
Does this mean there will be more usable silicon or is it irrelevant in the scale of other parts of the die?

It doesn't free up any space, just tucks the PMOS transistor finger down a little. If we were to start over, it would save a little space.

There are probably between 100 and 200 instances that will need this change. It will take Beau a little while.

Baggers · 2013-11-28 12:09

cgracey wrote: »

There are probably between 100 and 200 instances that will need this change. It will take Beau a little while.

Good luck Beau!

cgracey · 2013-11-28 12:11

I have a question for you all:

Would it be a good idea to reduce the universal-purpose I/O pin count down to 64, from 92, in order to provide enough fast 1.8V I/Os to talk to a x16 DDR2 SDRAM?

I ask because once I started playing around with big memory (32MB in a 3.3V SDRAM), I could see right away that the Prop2 would be able to do a lot of exceptional things if we had big, fast, cheap memory. The sweet spot for SDRAM seems to be the 64MB DDR2, which costs about $2. It's a little cheaper than 32MB of 3.3V SDRAM that we are on target for, but can be read and written twice as fast, which makes 1080p all-points-addressable graphics a lot more practical. It needs 1.8V I/O's, though.

By doing this, we would free up lots of silicon area which is just routing the DAC busses. There's actually as much area committed to routing the DAC busses as the core takes. We could free up 50% more room for the core by cutting the DAC busses in half.

This may not be practical to do at this point, because of all the manual layout involved, but I'm just thinking about it. What do you guys say about this?

localroger · 2013-11-28 12:23

cgracey wrote: »

Would it be a good idea to reduce the universal-purpose I/O pin count down to 64, from 92, in order to provide enough fast 1.8V I/Os to talk to a x16 DDR2 SDRAM?

I like this idea. The main reason 32 pins has been so restrictive is when we try to use them for memory bus and I/O too. I can hardly think of a situation where 64 wouldn't be enough general purpose I/O, but standardized big RAM would be a huge win.

jmg · 2013-11-28 12:45

cgracey wrote: »

I have a question for you all:

Would it be a good idea to reduce the universal-purpose I/O pin count down to 64, from 92, in order to provide enough fast 1.8V I/Os to talk to a x16 DDR2 SDRAM?
...
This may not be practical to do at this point, because of all the manual layout involved, but I'm just thinking about it. What do you guys say about this?

Memory and memory bandwidth are emerging as a clear bottleneck in P2 (as the core gets more powerful, and faster), and the obvious next step is to improve off-chip memory speed and cost.

Anyone who really wants 96 DACs can buy 2 P2's , they will be a rarer customer than someone wanting more memory.
Everyone wants more memory!

You may be able to use the extra space to increase on chip memory, and/or, give some HW support to burst-code-reading.

Also check DDR3, not sure what the details are wrt DDR2, but I see the $49 BeMicro has DDR3 ?
(In another thread, someone has BeMicro CV + Nios II/f + DDR3 working)

cgracey · 2013-11-28 12:57

jmg wrote: »

Memory and memory bandwidth are emerging as a clear bottleneck in P2 (as the core gets more powerful, and faster), and the obvious next step is to improve off-chip memory speed and cost.

Anyone who really wants 96 DACs can buy 2 P2's , they will be a rarer customer than someone wanting more memory.
Everyone wants more memory!

You may be able to use the extra space to increase on chip memory, and/or, give some HW support to burst-code-reading.

Also check DDR3, not sure what the details are wrt DDR2, but I see the $49 BeMicro has DDR3 ?
(In another thread, someone has BeMicro CV + Nios II/f + DDR3 working)

The complication with DDR3 is that it runs at only 1.5V, which means it works best with a process smaller than 180nm. It is a little more price competitive than DDR2, and will become more so. DDR3 would be better suited to a Prop3.

ozpropdev · 2013-11-28 13:18

cgracey wrote: »

.. which makes 1080p all-points-addressable graphics a lot more practical.

64 I/O and improved graphics potential sounds like a good balance.

Cluso99 · 2013-11-28 13:26

cgracey wrote: »

I have a question for you all:

Would it be a good idea to reduce the universal-purpose I/O pin count down to 64, from 92, in order to provide enough fast 1.8V I/Os to talk to a x16 DDR2 SDRAM?

I ask because once I started playing around with big memory (32MB in a 3.3V SDRAM), I could see right away that the Prop2 would be able to do a lot of exceptional things if we had big, fast, cheap memory. The sweet spot for SDRAM seems to be the 64MB DDR2, which costs about $2. It's a little cheaper than 32MB of 3.3V SDRAM that we are on target for, but can be read and written twice as fast, which makes 1080p all-points-addressable graphics a lot more practical. It needs 1.8V I/O's, though.

By doing this, we would free up lots of silicon area which is just routing the DAC busses. There's actually as much area committed to routing the DAC busses as the core takes. We could free up 50% more room for the core by cutting the DAC busses in half.

This may not be practical to do at this point, because of all the manual layout involved, but I'm just thinking about it. What do you guys say about this?

I would say definitely go for it!!!

The restriction all of us have found on P1 is that of memory space - as soon as we take pins for external memory we lose a lot of valuable pins. But still having 64 GPIO would be absolutely fantastic!! Twice the speed is also impressive. I would rather wait another month to get this!

The big problem I see currently with the ARM/etc processors is the complex interrupt mechanism and all the registers. It's (almost) impossible to get anything done in a guaranteed deterministic way. Surely we could find some killer apps for the P2 with this memory.

Would the DDR2 pins be dedicated or usable for something else? (I don't care if that is all they are useful for as 64 GPIO is fine)
I presume you would make the DDR2 access/refresh in logic so the cog driver would be simple.
Does this mean we could get some more hub memory too?

jmg · 2013-11-28 13:55

cgracey wrote: »

Would it be a good idea to reduce the universal-purpose I/O pin count down to 64, from 92, in order to provide enough fast 1.8V I/Os to talk to a x16 DDR2 SDRAM?

You could also investigate what stacked die requires ?

eg Nuvoton are in almost exactly this space : 128 pin parts, with Stacked DDR memory.
http://www.nuvoton.com/NuvotonMOSS/Community/ProductInfo.aspx?tp_GUID=66feb925-4931-4d99-938d-9a6f89fc0ac6

jazzed · 2013-11-28 14:08

Hi Chip.

Are you having Turkey Pizza today? We're having Thai Turkey.

Why does limiting to 64 IO pins buy so much? Would limiting to 70 be useful?
64 IO would be fine, but 70 total IO pins would be much nicer (80 total pins?).

Why 70 total pins? 64 totally free "user IO" + 6 special purpose IO.

Serial debug port and SPI flash should be on dedicated pins so that "user IO" can be free of junk.
You don't need "user IO" on those pins except for maybe programmable pull-up/down resistors.

It blows total symmetry, but requiring total symmetry is kind of silly anyway to me at least.

What package are you considering?

QFP100 14mm 0.5mm pitch, QFP80 14mm 0.65mm pitch, QFP80 12mm 0.5mm pitch ?

Happy Thanksgiving to everyone.
--Steve

cgracey · 2013-11-28 14:19

Cluso99 wrote: »

I would say definitely go for it!!!

The restriction all of us have found on P1 is that of memory space - as soon as we take pins for external memory we lose a lot of valuable pins. But still having 64 GPIO would be absolutely fantastic!! Twice the speed is also impressive. I would rather wait another month to get this!

The big problem I see currently with the ARM/etc processors is the complex interrupt mechanism and all the registers. It's (almost) impossible to get anything done in a guaranteed deterministic way. Surely we could find some killer apps for the P2 with this memory.

Would the DDR2 pins be dedicated or usable for something else? (I don't care if that is all they are useful for as 64 GPIO is fine)
I presume you would make the DDR2 access/refresh in logic so the cog driver would be simple.
Does this mean we could get some more hub memory too?

The thing about DDR2 is that it inputs/outputs data on BOTH edges of the clock. Those would definitely be special I/O pins. And it might require a PLL to generate the timing, as the memory is clocked at 400MHz, transferring two words per clock. A single x16 DDR2 chip can read/write at the rate of 1600MB/s, whereas the current SDRAM only goes to 320MB/s, which is 1/5 the speed. The problem is going to be getting DDR2 data in and out of hub RAM. Instead of QUADs, we'd need OCTs, and at twice the rate (assuming a 200MHz system clock). It sure would be cool for several cogs to be able to make simultaneous memory requests and have them done faster than they could index into the results.

For this to work efficiently, we might need to expand the number of hub slots to 10, allocating turns like so:

0) cog0
1) cog1
2) cog2
3) cog3
4) ddr2
5) cog4
6) cog5
7) cog6
8) cog7
9) ddr2

This would give the DDR2 40MHz access to hub RAM with a system clock of 200MHz. If it could transfer 32 bytes (8 longs) per access, that would be 32x40MHz = 1280MB/s transfer. A 1080p/60 screen needs 594MB/s, so we would be able to write and read a whole screen in a screen period.
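
The slot arithmetic above can be checked with a short sketch. It is only an illustration in Python, not anything from the actual design: the names `HUB_SLOTS` and `slot_rate_hz` are made up here, and it assumes one hub slot per system clock, which is how the 40MHz figure appears to be derived.

```python
# Illustrative model of the proposed 10-slot hub rotation (all names are hypothetical).
SYSTEM_CLOCK_HZ = 200_000_000              # 200MHz system clock, from the post
HUB_SLOTS = ["cog0", "cog1", "cog2", "cog3", "ddr2",
             "cog4", "cog5", "cog6", "cog7", "ddr2"]
BYTES_PER_DDR2_ACCESS = 32                 # 8 longs per access, from the post
SCREEN_1080P60_BYTES_PER_S = 594_000_000   # 1080p/60 figure quoted in the post


def slot_rate_hz(owner, slots=HUB_SLOTS, clock_hz=SYSTEM_CLOCK_HZ):
    """Hub accesses per second for one owner, assuming one slot per clock."""
    return clock_hz * slots.count(owner) // len(slots)


ddr2_rate_hz = slot_rate_hz("ddr2")                      # 2 of 10 slots
ddr2_bytes_per_s = ddr2_rate_hz * BYTES_PER_DDR2_ACCESS
cog_rate_hz = slot_rate_hz("cog0")                       # 1 of 10 slots

# Writing and then reading a whole screen every frame needs twice the screen rate.
fits_write_and_read = 2 * SCREEN_1080P60_BYTES_PER_S <= ddr2_bytes_per_s

print(f"DDR2 hub access rate: {ddr2_rate_hz / 1e6:.0f}MHz")
print(f"DDR2 hub bandwidth:   {ddr2_bytes_per_s / 1e6:.0f}MB/s")
print(f"Per-cog hub rate:     {cog_rate_hz / 1e6:.0f}MHz")
print(f"Write + read 1080p/60 fits: {fits_write_and_read}")
```

Under these assumptions each cog's hub rate would drop from 25MHz (1 slot in 8) to 20MHz (1 slot in 10), which is the cost of giving the DDR2 two slots.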

Seairth · 2013-11-28 14:19

DDR2 would be good. Would this then become a hub resource? Also, what happens to Port C? Widen the inter-cog bus? Access to a variable-width shift register?

cgracey · 2013-11-28 14:23

jazzed wrote: »

Hi Chip.

Are you having Turkey Pizza today? We're having Thai Turkey.

Why does limiting to 64 IO pins buy so much? Would limiting to 70 be useful?
64 IO would be fine, but 70 total IO pins would be much nicer (80 total pins?).

Why 70 total pins? 64 totally free "user IO" + 6 special purpose IO.

Serial debug port and SPI flash should be on dedicated pins so that "user IO" can be free of junk.
You don't need "user IO" on those pins except for maybe programmable pull-up/down resistors.

It blows total symmetry, but requiring total symmetry is kind of silly anyway to me at least.

What package are you considering?

QFP100 14mm 0.5mm pitch, QFP80 14mm 0.65mm pitch, QFP80 12mm 0.5mm pitch ?

Happy Thanksgiving to everyone.
--Steve

We are having standard turkey. Thai turkey sounds good.

This would all have to fit into the current TQFP-128 w/.4mm pitch. That's not really enough pins, either. We need more. I hate to bring it up, but by going to a BGA (which is all DDR2 comes in), we could shrink the package and get more pins.

Cluso99 · 2013-11-28 15:30

cgracey wrote: »

We are having standard turkey. Thai turkey sounds good.

This would all have to fit into the current TQFP-128 w/.4mm pitch. That's not really enough pins, either. We need more. I hate to bring it up, but by going to a BGA (which is all DDR2 comes in), we could shrink the package and get more pins.

BGA - Don't know about that?? 0.4mm pitch has me worried enough, though I have become quite adept at 0.65mm pitch now. I do now have access to a T200C+ oven, so I just need to learn how BGA PCBs are designed. I have read a bit, though, and am eager to try. Was looking at the Cyclone V FPGA 256 pin 1mm BGA a little while ago.

But Parallax would presumably be building those PCIEx4 connector boards with DDR2, presuming the price was right (they should sell an unpopulated DDR2 version too). I now have one of those PCIEx4 connectors here and they look great. I could see a nice little bus concept going here.

I am not worried about more pins if we have the DDR2 specialised pins. 64 is fine, even if 5 of those had to be dedicated to TX/RX & SPI Flash.

Multiple cogs accessing the DDR2 might cause delays due to the requirement to setup the addresses. Might be worth having a few blocks of 128-bit wide memory as some form of cache that each cog could access with a similar method to rd/wraux without the need to go via hub. Then the hub order would not require changing.

Here's a possibility. Let's say you had an SRAM for a pseudo DDR2 cache, say 1KB total.
Therefore, 1/8 KB = 128B allocated to each cog
Therefore, each cog has 8 * QuadLongs of DDR2 cache

Each cog has hub access 1 in 8 clocks. What if, on 4 of the other clocks, it had QuadLong access to its DDR2 cache? Now you can transfer a lot of data.
So, at 200MHz, 50% (4 in 8 clocks), 4*128 bits = 100MHz 512 bits = 512Mbps.

Could you make this work???
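
The cache proposal above can be worked through numerically. This is a rough Python sketch under the stated assumptions only (1KB of cache split evenly over 8 cogs, a QuadLong being 4 longs of 4 bytes, and cache access on 4 of every 8 clocks at 200MHz); the variable names are made up for illustration.

```python
# Illustrative arithmetic for the per-cog DDR2 cache idea (names are hypothetical).
CACHE_BYTES = 1024                 # 1KB total cache, from the post
COGS = 8
QUADLONG_BYTES = 4 * 4             # 4 longs x 4 bytes = 128 bits
SYSTEM_CLOCK_HZ = 200_000_000
CACHE_CLOCKS = 4                   # clocks with cache access...
ROTATION_CLOCKS = 8                # ...out of each 8-clock hub rotation

bytes_per_cog = CACHE_BYTES // COGS                    # cache share per cog
quadlongs_per_cog = bytes_per_cog // QUADLONG_BYTES    # QuadLongs held per cog

# One QuadLong transfer per allowed clock.
transfers_per_s = SYSTEM_CLOCK_HZ * CACHE_CLOCKS // ROTATION_CLOCKS
cache_bytes_per_s = transfers_per_s * QUADLONG_BYTES
cache_bits_per_s = cache_bytes_per_s * 8

print(f"{bytes_per_cog}B per cog = {quadlongs_per_cog} QuadLongs")
print(f"{transfers_per_s / 1e6:.0f}M QuadLong transfers/s per cog")
print(f"{cache_bytes_per_s / 1e6:.0f}MB/s = {cache_bits_per_s / 1e9:.1f}Gbit/s per cog")
```

Worked this way, the per-cog peak comes out at 1600MB/s (12.8Gbit/s), which is well above the 512Mbps written in the post, so that figure may be a slip or may be counting something different.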

JRetSapDoog · 2013-11-28 15:32

Hi, Chip. Your musing seems like the perfect compromise in the best sense of the word (much the same way as a cheetah is a "compromise" in nature), perfectly positioning the P2 between a standard microcontroller and an application processor. Best of both worlds! That would probably really provide an alternative to bloated OS solutions for many apps, too (which is something that seems important to you, and us, even if we don't yet realize it).

So, would the DDR2 circuitry still have to be programmed and occupy a cog, or is there any chance it could be fully built-in (as risky as that might sound)? I mean, could generalized circuitry be designed that could allow cogs to share (divvy up) the DDR2 data? That would seem ideal (if we're going for the ideal compromise). Or, with the extra real estate opened up, could you add a 9th cog dedicated to taking care of this (though 9 isn't a "nice" number)?

I think I agree with Jazzed about having some more pins above the 64 GPIO, if possible. In fact, I can foresee such pins being dedicated to I/O tasks only w/o ADC/DAC and so on, reducing their silicon area. Doing so might push us into a BGA package, but if Parallax (and others) can make it a module with the DDR2 and provide I/O around the periphery (or wherever), we'd be set. And it seems we won't be able to avoid BGA anyway due to the DDR2. But a solderable or socketable module puts us all back in the game. Anyway, hope you can strive to provide 64 fully unencumbered I/O pins (no *footnotes necessary about "after boot-up"); that would be good for marketing, as well.

Lastly, we'd still have our beautiful symmetry for the 64 GPIO's and the best-of-class orthogonal instructions and deterministic timing (and the Prop-64 would be back, though on steroids). What's not to like? Yeah, increased risks and delays. But it would seem that Parallax would be wise to release a chip to dominate as big a (hopefully ever-expanding) niche as possible. It could/would be the "go-to" chip in that area. With apologies to McDonald's, I'm liking it! But I think I should wake from this dream and start my day (someone pinch me and tell me I'm not dreaming).

Edit: Eliminating (reducing) the off-chip memory bottleneck could really open things up, such that certain app ideas wouldn't have to be cast aside due to bumping up hard against it (in an "almost feasible" way). If off-chip memory is present, it's good to exploit it.

Cluso99 · 2013-11-28 15:47

cgracey wrote: »

The thing about DDR2 is that it inputs/outputs data on BOTH edges of the clock. Those would definitely be special I/O pins. And it might require a PLL to generate the timing, as the memory is clocked at 400MHz, transferring two words per clock. A single x16 DDR2 chip can read/write at the rate of 1600MB/s, whereas the current SDRAM only goes to 320MB/s, which is 1/5 the speed.

Totally disregarding setup, DDR2 access at 400MHz * 2 edges * 16 bits === 100MHz @ 128-bits

That is nicely 50% of the P2 clock. So it would work nicely with using 4 of the 8 hub cycles.

Is the AUX RAM (CLUT) multi-ported? I ask because the cog can access it as well as the video now, and the hub via rd/wrquads.
Could the cog access the DDR2 cache over the port as the hub in 4 of those other cycles?
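
The equivalence claimed above (400MHz x 2 edges x 16 bits being the same as 100MHz of 128-bit transfers) can be checked as follows. This is a minimal Python sketch that, like the post, ignores setup time entirely; the names are illustrative only.

```python
# Illustrative check of the DDR2 raw rate against 128-bit transfers (names are hypothetical).
DDR2_CLOCK_HZ = 400_000_000     # memory clock, from the quoted post
EDGES_PER_CLOCK = 2             # data moves on both clock edges
BUS_WIDTH_BITS = 16             # x16 part
QUADLONG_BITS = 128
SYSTEM_CLOCK_HZ = 200_000_000   # assumed P2 system clock
ROTATION_CLOCKS = 8             # clocks per hub rotation

ddr2_bits_per_s = DDR2_CLOCK_HZ * EDGES_PER_CLOCK * BUS_WIDTH_BITS
ddr2_bytes_per_s = ddr2_bits_per_s // 8

# How many 128-bit transfers per second carry the same data?
quad_transfers_per_s = ddr2_bits_per_s // QUADLONG_BITS

# How many of the 8 clocks in a hub rotation would that occupy?
clocks_needed_per_rotation = quad_transfers_per_s * ROTATION_CLOCKS // SYSTEM_CLOCK_HZ

print(f"DDR2 raw rate: {ddr2_bytes_per_s / 1e6:.0f}MB/s")
print(f"Equivalent 128-bit transfers: {quad_transfers_per_s / 1e6:.0f}MHz")
print(f"Clocks needed per 8-clock rotation: {clocks_needed_per_rotation}")
```

In practice the sustained rate would be lower, since row activation, refresh, and read/write turnaround are all left out here.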

jmg · 2013-11-28 16:07

cgracey wrote: »

We need more. I hate to bring it up, but by going to a BGA (which is all DDR2 comes in), we could shrink the package and get more pins.

BGA is quite a large morph, and it has ripple effects in higher FAB costs and more PCB layers.

Nuvoton obviously decided the gull wing package was worth a lot, in lower prices and especially lower finished product prices.
( which was why they took the trouble for dual die TQFP design example above )

The alternative is to flip mindsets, and treat P2 like a WiFi or GPS module.
There, you make a compact, high layer count carrier PCB of smallest practical size, with P2+DDR2 mounted.
Changes of DDR2 size are relatively easy on this path.

GPS modules sell for close to $10 1+, so maybe this is practical ?

Cluso99 · 2013-11-28 16:08

What if you restricted the DDR2 access to only 2 cogs (say 6&7)?
I cannot see this being a problem although it is a little away from everything being equal.

This would help the wiring of the bus bits from DDR2 and minimise a set of cache. Each of those cogs could have access on 4 of their hub cycles, and ensure those 4 cycles were opposite to those of its brother. Then, the full DDR2 access speed would be possible.
For video, more likely than not, one of those cogs would be displaying, the other updating. Both would have almost (less due to setup) 800MB/s each - WOW!!!
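
One way to picture the two-cog split is as an alternating schedule over the 8-clock hub rotation. The Python sketch below assumes "opposite" cycles means the two cogs simply alternate clock by clock, and that the full 1600MB/s DDR2 rate is available; both are assumptions for illustration, and the names are made up.

```python
# Illustrative alternating DDR2 schedule for two cogs (names are hypothetical).
ROTATION_CLOCKS = 8
DDR2_COGS = (6, 7)                    # the two cogs suggested in the post
DDR2_BYTES_PER_S = 1_600_000_000      # peak x16 DDR2 rate quoted earlier


def ddr2_owner(clock):
    """Cog that owns the DDR2 port on a given clock, alternating between the two."""
    return DDR2_COGS[clock % len(DDR2_COGS)]


schedule = [ddr2_owner(clock) for clock in range(ROTATION_CLOCKS)]
slots_per_cog = {cog: schedule.count(cog) for cog in DDR2_COGS}
bytes_per_s_per_cog = {
    cog: DDR2_BYTES_PER_S * slots // ROTATION_CLOCKS
    for cog, slots in slots_per_cog.items()
}

print("Schedule:", schedule)
for cog in DDR2_COGS:
    print(f"cog{cog}: {slots_per_cog[cog]} of {ROTATION_CLOCKS} clocks, "
          f"{bytes_per_s_per_cog[cog] / 1e6:.0f}MB/s")
```

Because the two cogs never own the port on the same clock, the port is busy on every clock and each cog sees half of the peak rate, before any setup overhead.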
