
The New 16-Cog, 512KB, 64 analog I/O Propeller Chip


Comments

  • Putting aside variations to what has been put to us: given that the current P2 cogs now have these interrupt enhancements along with smart pins, they can "multitask" a lot more easily. And given that memory is not easy to add externally, I vote for the logical choice of 8 cogs and 512K RAM - if this means real silicon, of course.

    BTW, I also thought, why not 12 cogs then? But eight P2 cogs + smart pins should be more than sufficient.

  • What is the reason for 8 OR 16? I'd rather have 9, 10, or 11 if a few extra COGs would fit. So far the reason for 8 or 16 seems to be purely cosmetic.
  • Heater.Heater. Posts: 21,230
    Any number of COGs that is not a power of two is not going to fit well with the "egg beater" HUB access mechanism.

    Recall that the "egg beater" is not the old Prop 1 style round-robin access to HUB memory. It has been enhanced to make use of free HUB access time slots and to allow many COGs to access the HUB at the same time, thus increasing COG-to-HUB bandwidth.

    This whole egg beater mechanism is based on splitting RAM up into many blocks and having the low-order bits (4?) of the memory address select the block in question.

    This all leads to the fact that a non-power-of-two number of COGs will not play well. Or at least it will be a mess.
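
    Here's a toy model of why that is - just a sketch of my understanding, not the actual Verilog; the 16-slice rotation and the address-to-slice mapping are assumptions on my part:

    // Toy model of the "egg beater" hub rotation (illustration only).
    #include <stdio.h>
    #include <stdint.h>

    #define NCOGS 16   /* number of cogs = number of RAM slices; a power of two */

    /* Low-order bits of the long address pick the RAM block. */
    static int slice_of(uint32_t addr) { return (int)((addr >> 2) & (NCOGS - 1)); }

    /* Which slice cog 'cog' may access on clock tick 't' as the beater rotates. */
    static int slice_for(int cog, uint32_t t) { return (int)(((uint32_t)cog + t) & (NCOGS - 1)); }

    int main(void) {
        uint32_t addr = 0x00012344;   /* some hub address */
        int cog = 3;
        for (uint32_t t = 0; t < NCOGS; t++) {   /* at most one full rotation */
            if (slice_for(cog, t) == slice_of(addr)) {
                printf("cog %d reaches slice %d after %u clock(s)\n",
                       cog, slice_of(addr), (unsigned)t);
                break;
            }
        }
        /* With a power-of-two NCOGS the '& (NCOGS - 1)' masks suffice; with,
           say, 12 cogs the slice select becomes a modulo - messy in hardware. */
        return 0;
    }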
  • Heater.Heater. Posts: 21,230
    edited 2015-08-14 13:13
    I have tried to refrain from commenting on P2 threads for a long time, mostly because any talk of new features or changes (cough - interrupts) makes me nervous. I only see delays in all that.

    But, whilst I'm here: if push comes to shove, I put my vote in for keeping the RAM and reducing the COGs.

    My reasoning is as follows:

    1) C/C++ and other compiled languages will be much more viable on the P2 with its bigger RAM space and HUB execution. However, PASM instructions are big, so the more RAM the better. In fact I see no reason for not having a Spin compiler that generates PASM for HUB exec rather than byte codes. Such a Spin compiler would allow the PASM in its objects to run as COG code at the same time as the Spin parts!

    2) I am assuming that in-COG code can run in the same COG that is running HUB code - that is to say, the new interrupt/event system allows that. (Is this true?)

    3) In general it seems the new interrupt/event system will allow more functionality to be put into a COG. Small things can be combined into one COG rather than needing a COG each.

    @ Chip,

    Some questions:

    1) Am I right in saying COG code can run, as events/interrupts whilst HUB execution is in progress?

    2) Can the interrupt/event mechanism work with HUB exec code?

    3) Can I use the interrupt/event mechanism as a purely event driven programming model?

    What I mean by the last question is:

    a) We have events, pin changes, time outs, signals from other COGS, etc.

    b) Each of those events will have some handler code attached to them.

    c) When events fire the appropriate event handler is run. That event handler does what it has to do, as fast as possible, and then terminates.

    d) When the event handler is done the COG drops back to "do nothing", HALT, low power mode.

    In the event driven model, there are no priorities or preemption. Nothing ever gets interrupted. It is not an "interrupt" model. There is no "background" loop endlessly running that needs to be interrupted. All processing is done in response to events only.

    Event driven programming removes all kinds of hassle with sharing data between tasks or interrupt handlers that could fire off at any time. Only one thing runs at a time and it runs to completion.
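
    In C-ish form, the model I have in mind looks like this. A minimal sketch only: wait_for_event() is a stand-in for whatever WAITxxx primitive the COG actually provides; here it is simulated so the sketch runs.

    // Run-to-completion event loop: no priorities, no preemption,
    // no background loop. (Sketch only; primitives are hypothetical.)
    #include <stdio.h>

    typedef void (*handler_t)(void);

    enum { EV_PIN, EV_TIMEOUT, EV_SIGNAL, EV_COUNT };

    static void on_pin(void)     { puts("pin change handled"); }
    static void on_timeout(void) { puts("timeout handled"); }

    static handler_t handlers[EV_COUNT] = { on_pin, on_timeout, 0 };

    // Stand-in for the hardware wait: the COG would HALT in low-power
    // mode here until an event fires.
    static int wait_for_event(void) {
        static const int fake[] = { EV_PIN, EV_TIMEOUT, -1 };  // simulated events
        static int i = 0;
        return fake[i++];
    }

    int main(void) {
        int ev;
        while ((ev = wait_for_event()) >= 0) {   // -1 ends the simulation
            if (ev < EV_COUNT && handlers[ev])
                handlers[ev]();                  // handler runs to completion,
        }                                        // then the COG drops back to HALT
        return 0;
    }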

    4) I would like to see such an event driven programming model also work with HUB exec code. Is that possible?
  • evanhevanh Posts: 15,126
    Questions 1 and 2 should be a yes. Chip has been pretty careful to keep everything available to HubExec.

    Questions 3 and 4 could quite cleanly be covered by a kernel-like environment: use the interrupt hardware to manage the pending events, then have the user routines run only cooperatively with the event system.
  • cgracey wrote: »
    I've been running some numbers and I think that we are going to be just fine with 16 cogs and 512KB, after all:
    die outline		8mm x 8mm = 64 mm2
    pad frame		7.25mm x 0.75mm x 4 = 21.8 mm2
    interior		64 - 21.8 = 42.2 mm2
    
    16 of 8192x32 SP RAM	16 x 1.57 mm2 = 25.1 mm2
    16 of 512x32 DP RAM	16 x 0.292 mm2 = 4.7 mm2
    16 of 256x32 SP RAM	16 x 0.095 mm2 = 1.5 mm2
    16384x8 ROM		0.3 mm2
    memories		25.1 + 4.7 + 1.5 + 0.3 = 31.6 mm2
    
    logic area		interior 42.2 - memories 31.6 = 10.6 mm2 for logic
    
    gates allowance		120k/mm2 x 0.65 utilization x 10.6 mm2 = 827k gates
    

    We'll only need about 300k gates, so planning for an 8 x 8 mm die with a 0.75 mm thick pad ring will give us all the room we need.

    Sorry for the false alarm. Thanks for all your inputs. The consensus seems to be that big RAM is important.

    P.S. You can figure the area savings of eliminating 8 cogs by cutting in half the 512x32 and 256x32 RAM areas (that comes to 2.35 and 0.75 mm2) and taking away the area of 150k gates (150k/827k x 10.6 mm2 = 1.92 mm2). Summing that all up: 2.35 + 0.75 + 1.92 = 5.0 mm2. That only saves about 8% of the die area, which is not much. Better to keep 16 cogs.

    Okay. So there's enough space. I still say that you go with 8 cogs:
    • As you pointed out, there is still the cost issue.
    • As others pointed out, this fits the existing 1-2-3 FPGA, which means that it can be tested just as it would be in an ASIC.
    • As others pointed out, the hub access speed would effectively double. This would mean that a cog would stall for no more than 7 clock cycles for random hubops, compared to 15 for both P1 and the 16-cog P2 (see the arithmetic sketch at the end of this post).
    • Also, I think this would allow you to halve the instruction streamer's cache, as well as cut the hub data and interrupt lines in half.
    • Heck, you could just fit it into a 7mm2 die, increasing your overall yield.

    As pointed out by several others, the new cogs are much better designed for multitasking. It seems to me that there are two key areas where this will be important, with respect to the number of cogs necessary to perform a job:
    • User-level code: keyboard, mouse, serial, LEDs, LCDs, etc. This is all running in "user time", which is really slow. As a result, if a single cog can effectively run all of this stuff, then that leaves 7 other cogs free.
    • I/O drivers: with P1, it was sometimes necessary to use two cogs to support a single I/O device. If those devices can now be supported with a single cog, then that also "frees up" cogs. (Note: in some cases, it might now even be possible to put multiple unrelated I/O drivers in a single cog.)

      I'm not saying, by the way, that a 16 cog version should never exist. Instead, make that a variant that you release later if the need really does arise (i.e. a commercial company willing to buy large quantities). If your math is correct, then you should still have the space, and bumping the design back up to 16 should be a trivial matter.
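
      For reference, the stall figures above are just the worst-case rotation wait - a sketch of the arithmetic, on the assumption that a cog waits at most one full egg-beater rotation for a random hub address:

      worst-case hub wait  = slices - 1 clocks
      16 cogs / 16 slices  = 16 - 1 = 15 clocks
       8 cogs /  8 slices  =  8 - 1 =  7 clocks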
  • TorTor Posts: 2,010
    Another + for 8/512KB if that turns out to be an issue after all - with such powerful cogs more RAM is more important than having 16 cogs (I will have problems finding use for them all) for my own not-very-important hobby projects.
  • AribaAriba Posts: 2,682
    It's true that the cogs are now more powerful and allow more things to be done in software tasks.
    But that does not help with faster peripherals, like VGA, USB, Ethernet.
    For VGA video you need 3 cogs to stream R, G and B data concurrently. Same for component video.
    As I understand it, one cog can only drive one fast DAC (they are directly connected), so there is no way to do a color VGA driver in one cog.

    If you have two VGA outputs (the 1-2-3 board has 2 connectors) you will need 6 cogs, which leaves only 2 for the application and all the other drivers in an 8-cog P2.

    Andy
  • Heater.Heater. Posts: 21,230
    Yes, but nobody has VGA monitors any more. The whole video thing is some kind of historical artefact.
  • Bill HenningBill Henning Posts: 6,445
    edited 2015-08-14 17:25
    8 (or 12) cogs with 512KB

    Edit: After posting I read that it stays at 16 cogs / 512KB as there is room after all... better :)
    cgracey wrote: »
    I've got the whole chip (minus the smart pins) compiling on the Cyclone V -A9 device now. It's using 60% of the FPGA logic.

    A single-cog Prop2 compiles in 4 minutes and has an Fmax of ~120MHz on the Cyclone IV. It's about 105MHz on the Cyclone V, which always seems to be slower than the Cyclone IV.

    Here's the crazy thing, though: When I do a full-chip compile with 16 cogs on the Cyclone V -A9 device, the critical paths become flop-to-flop interconnect delays, with no logic in-between. These paths connect the hub RAMs' inputs and the CORDIC's results. I think on the ASIC, this wiring delay won't be such a problem. These paths lower the full-chip Fmax to ~82MHz on the -A9. We'll probably just run it at 80MHz on the FPGA, then. I'm sure we could go to 100MHz, too, without any problems, given likely workbench temperatures.

    I have paved the way for adding the block-r/w instructions and will add them next. The big impediment to implementation is out of the way, so it should be easy.

    Here is a big question for you guys:

    We are planning on this:

    16 cogs w/ 512KB hub RAM

    If things turn out overly-big silicon-wise, and we need to reduce the chip size, which of the following would be better?

    16 cogs w/256KB hub RAM, lots of cogs and less RAM
    -or-
    8 cogs w/512KB hub RAM, fewer cogs and faster memory access

    Hopefully, we can get it all in there. With 16 of these new cogs, we are using about 60% of the logic that only 8 of the P2-Hot cogs would have required.

  • @Ariba

    I thought it would be possible to stream hub data to pins in parallel (but maybe not to DACs?). That had been brought up for the purpose of driving things like LCDs with parallel interfaces.
  • potatoheadpotatohead Posts: 10,253
    edited 2015-08-14 17:43
    I thought one COG can drive 4 fast DACs, and that we went ahead with pin groups linked to specific COGs for that purpose. Streaming longs into 8-bit DACs makes for great video support. Pretty sure we decided on that path very early on in this new design.

    @Heater: First, that's not true at all. But it's growing more true every day. So, let's consider the point valid anyway.

    Offering analog signals means being able to drive just about anything out there right now. We've generalized this too, meaning there isn't a dedicated video circuit anymore. It's just streaming signals in and out. Video is an ARTIFACT of the current design. Big change from P1.

    Putting HDMI on the chip is expensive, and it requires a license, and it requires DRM compliance features too. Not to mention HDMI itself is moving along, and will be a legacy thing just like every other legacy thing out there in short order. Display Port seems to have the lead in robust high definition displays at the moment...

    Anyone wanting HDMI can use a converter chip on their board, or run through a device to do the same. Analog output does anything we want it to. HDMI output won't.

    And doing analog is dirt cheap, robust, etc...

    All of this means technically we aren't offering a video system anymore! The event system will actually be used to code up what a WAITVID did before. So, no video capability, no arguments over it, right?

    It just so happens that making analog signals has been made really easy, and bidirectional. :)

    I personally don't want HDMI anywhere near the internals of the P2. And I'm quite pleased with the general direction of streaming signals in and out. Makes a TON of sense and will be useful in a ton of ways --in fact, all those ways we thought of when we saw the video subsystem for the first time. Chip is getting this right.


  • koehler wrote: »
    Kickstarting has come up before, no way it'll ever fund.

    I think we'd be talking at least another 500K, though Ken and Chip would know more.
    Probably much more than that to be frank.

    And then, I expect the fab/die costs would go up, yields?

    Way too much risk to really contemplate.


    8/512K would seem to be the best option since we have hubexec, basic interrupts and smartpins.

    It probably would be a long shot, but then again, nothing ventured nothing gained. Even if it didn't reach its stretch goals, it would simply front Parallax some funds in the form of pre-orders, so they don't stand to lose anything.

    It would be interesting to know what the costs would look like, though I'd be surprised if it was significantly more, percentage-wise, than the planned process. My understanding is that overall, die costs go down somewhat (assuming you use the process shrink mainly to reduce the size of your die), as you're now fitting more die on a wafer, reducing the man-hours required to fabricate a given number of chips. Parallax also wouldn't be going to the most cutting-edge process, but rather one that is also mature, so I don't think there would be much of a difference in terms of yield.

  • Instead of using a PII for video, use a Beagleboard or Raspberry Pi and let the PII handle the I/O. They'll do video much better and more easily than a Prop can. If you don't want to do that, use a CPLD and a couple of big SRAMs to drive a VGA display.

    But like Heater said, VGA is slowly going away. I see a lot more interest among hobbyists in using micros (PIC32s and ARMs) to drive 7" flat screens. They are dirt cheap and plentiful.

  • Given we moved video to software, people will just make their choices as they see fit. No dedicated video subsystem in this one.

    Those who want it can code it, and a lot of other useful analog I/O things.

  • Heater.Heater. Posts: 21,230
    edited 2015-08-14 18:41
    potatohead,

    I said nothing about using HDMI. On thinking about it for two seconds I would never suggest adding HDMI to the Prop.

    I'm glad to hear that such high speed I/O streaming has been generalized rather than being specific to video.

    I do question the whole idea of driving video from a Propeller like that. As Rod says if you want a nice display just slap a Pi or something cheap, fast and easy in there to manage it. For slow and simple there are many options as well, as used by the Arduino guys.

    But, as you say, the choice is there if anyone wants to do that. Only, should "driving video" be a driving force in the direction of Propeller design?
  • potatoheadpotatohead Posts: 10,253
    edited 2015-08-14 20:20
    Don't do it then. :p

    I like the generalization for this reason. I have fun stuff planned, and have grown tired of the various "video doesn't make sense" discussions.

    When nice options get done in software, we will see what makes sense and what does not.
  • mindrobotsmindrobots Posts: 6,506
    edited 2015-08-14 18:51
    Chip likes video. :smile:

    3D texture mapping on Prop II
    - from April 2010.
  • If we keep 16 COGS that gets done in software too.
  • potatoheadpotatohead Posts: 10,253
    edited 2015-08-14 19:15
    Yes Heater, simple VGA support covers everything without having dedicated, single-use circuits.

    There are lots of kinds of signals people might want, and a streamer capable of good VGA is good for all those cases.

    The same goes for a group of pins for an LCD, which we also have in there.


    Doing this well while allowing the COG to do some processing when possible is one reason for the interrupts. It helps make a video signal, as well as a ton of other useful things.

    People talked about general purpose signaling non-stop. Now we have that. End of discussion.
  • Heater.Heater. Posts: 21,230
    Potatohead,

    Like I said, I'm all in favour of general purpose hardware support for speeding up all kinds of things that are never going to be fast enough with plain software bit banging.

    End of discussion.

    Except... mindrobots has just reminded us of the "3D texture mapping on Prop II..."

    So back to my question: should video be a driving force in the Propeller design?

    Is that 3D stuff baked and ready to go? Or do we have to see another long delay whilst that gets moved over to the new design? A wait for a feature that almost nobody is ever going to use except for demos?
  • potatoheadpotatohead Posts: 10,253
    edited 2015-08-14 19:34
    Left behind in P2-Hot. Good decision. That subsystem was way over the top. Fun though.

    For this one, the spec is VGA signal support only.

    All else is software. Composite, etc...
  • AribaAriba Posts: 2,682
    potatohead wrote: »
    I thought one COG can drive 4 fast DACS, and that we went ahead with pin groups linked to specific COGS for that purpose. Streaming longs into 8 bit DACS makes for great video support. Pretty sure we decided on that path very early on in this new design.
    ....
    I must have missed this. It's easy to lose track of what gets implemented and what not.
    So if you can drive 4 fast DACs with one cog, then I have no problem with an 8-cog P2.

    Doing video with a Propeller is one of the things that make it fun. I don't care if this is historical or not. Needing to connect a Raspberry Pi for video is the opposite of fun... not to speak of the additional cost, the supply current and the software effort.

    @mark
    Parallel LCDs need quite a few pins if you want a lot of colors. For sure it depends a lot on the application what type of display is best.

    Andy
  • potatoheadpotatohead Posts: 10,253
    edited 2015-08-14 20:30
    I agree.

    Frankly, I want nothing to do with a Pi and the mess that involves to get a display I have meaningful control of.

    On P2, no OS, drop in the features needed, test, then write application, done.

    Linux is awesome, but it is way more hassle than it is worth to use as a simple display.

    Besides, doing it on P2 means tricks. I love video tricks and I have some component video ones planned.

    The over the top system in P2-Hot was fun, but excessive. This move to more COGs and general-use functions very seriously improves the usefulness of the chip for signal cases, not just video out. Good move, and it will prove to be competitive too.

    And there was one other issue: lean, software video allows for improvement over time and some tuning for particular display use cases. The P2-Hot system was mostly baked. Very useful, but not flexible. IMHO, it was one layer too removed from software to make sense. What we've got now is the minimum needed, mostly like P1 minus the hardware color circuits, and minus WAITVID itself. Really, it is almost one layer too close to software. But I think it's just fine. Adding the interrupt "event" system helps big. Prior to that, doing a signal was going to monopolize a COG, and some things might have been difficult or required multiple COGs. Now we have very useful COGs, and that means packing a good amount of functionality into one, and that means a robust OBEX type model is still on the table and relevant. Game on, man. This is the stuff we all really liked about P1.

    And remember, the interrupts are local to the COG in just about all cases. Unless people choose to make a mess of it, it will remain very easy to drop an object in and go. That was true on P1 too. I know I made some messes I didn't need to. No different on this one, and the majority use will be to replace the tasker, manage signals in and out, and respond to things. Given the powerful DACs, this is exactly what we want. And with so many? Killer, if you ask me.

    People will need video kernels, much like they did on P1, and once those are done, a lot of stuff will be 'drop in' like it is on P1. And that got used a lot too. No reason to believe any different. And we made P1 do all sorts of stuff that it really wasn't designed for. That's the advantage of having software driven video. No reason to believe this one will be any different in that respect either.

    Lots of DACs, math, streamers, fast HUB? That's a playground!

    I am going to learn a lot.
  • Based on the way things are shaping up, the P2 is looking like a proper successor to the P1. Chip is hitting the right buttons in this revision imo, and the end result should be something he should be really proud of and something I believe most of us will be quite pleased with. I'm also excited to see how the smart pins development unfolds.
  • jmgjmg Posts: 15,140
    Ariba wrote: »
    As I understand it: one cog can only drive one fast DAC (they are direct connected), so there is no way to do a color VGA driver in one cog.
    There is a 256-entry CLUT, and that can stream to pins, NCO-paced, and from Chip's earlier comments I think that can be to either 3 x DACs, or 24 pins for a parallel LCD (or 16 or 8 or 4? for byte/nibble-wide memories) - and that can capture the other way, for cameras.


  • jmgjmg Posts: 15,140
    Heater. wrote: »
    Yes, but nobody has VGA monitors any more. The whole video thing is some kind of historical artefact.
    Analog Video maybe, but 'the whole video thing' is certainly not going away.
    I see a large market for P2 driving LCDs directly - along the lines of FTDI EVE custom devices.

    Most TVs I've seen, right down to 16" ones, have multiple inputs:
    Composite/Component/VGA/HDMI, all done in a custom TV chip - thus VGA is really just the connector cost.

    The other significant video market for P2 is higher-end character insertion / video mixing of security camera feeds.
    That might not have the high pixel counts of HDMI, but nor does it need them, and the volumes here are large.

    I'm not sure every pin needs a 75 Ohm DAC, but ADCs on every pin are more common in MCUs.
  • Cluso99Cluso99 Posts: 18,066
    edited 2015-08-14 23:01
    Chip,
    Can the internal oscillator be connected to the PLL to generate the P2 clock?
    From what I recall, the PLL can basically be any multiple. So what I am thinking is that it may be possible to calibrate the internal oscillator using an external source (USB or Serial) and then adjust the PLL multiple to get a rounded frequency source, without requiring an external xtal. I understand the RC oscillator may not be the most stable/accurate, but perhaps this might make such use possible in some areas.

    PS I hope you haven't forgotten the special instruction(s) for USB. I'd rather wait for the Smart Pins first as that may resolve some of the things required for USB FS.
  • jmgjmg Posts: 15,140
    Cluso99 wrote: »
    Chip,
    Can the internal oscillator be connected to the PLL to generate the P2 clock?
    From what I recall, the PLL can basically be any multiple. So what I am thinking is that it may be possible to calibrate the internal oscillator using an external source (USB or Serial) and then adjust the PLL multiple to get a rounded frequency source, without requiring an external xtal. I understand the RC oscillator may not be the most stable/accurate, but perhaps this might make such use possible in some areas.
    That could be tricky.
    The MCUs that use USB frame-locking for a Xtal-less PLL have trimmable oscillators.
    The LSB of those is typically in the 0.1~0.2% region.

    A PLL divider from 120MHz down to a 12MHz sample rate is only 10, giving quanta steps of 10%, which is far too large. Even a 480MHz PLL would give 2.5% steps.
    Cluso99 wrote: »
    PS I hope you haven't forgotten the special instruction(s) for USB. I'd rather wait for the Smart Pins first as that may resolve some of the things required for USB FS.

    A simple 2-FF filter on the Pin-WAIT opcodes would allow an INT trigger on SE0, to give a concrete frame start point.
    One other approach would be to interleave the 75 Ohm DACs and a Serial Support Cell.
    That halves the number of 75 Ohm DACs, and adds more low-level serial support - like bit stuff/unstuff, a digital edge-PLL state engine, and FIFOs on the serial UARTs.

    Smart pins may not always be one-per-pin; there may be places where a cluster grouping is sensible. E.g. a typical UART needs two pins: if that were placed on every single pin, you could only actually use half of that hardware. Pin groups of pairs or quads may reduce the die cost of a pin cell.
  • Heater - "Yes, but nobody has VGA monitors any more. The whole video thing is some kind of historical artifact."

    I'm concerned that the same fate might be in store for the 180nm technology. Since it was introduced 15 years ago in 2000, many improvements have been made. By the end of this year we are expected to have 22nm commercial technologies available.

    The analogy is that 180nm is like an old car... if something goes wrong, even something slight, it might be hard to find replacement parts, whereas something designed with 22nm might be more readily available. While there are still chip foundries capable of 180nm right now, that could change in just a couple of years, and you can bet they will be harder and harder to find.

    Although 22nm is more expensive, it is also 67 times more dense than the 180nm technology, and I certainly bet that the price is NOT 67 times more than the current 180nm price.
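
    Back of the envelope, assuming density scales with the inverse square of the feature size:

    (180nm / 22nm)^2 = 8.18^2 ≈ 67x the density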

    As far as smart pins being done: the pins were completed and tested during the November 2011 test-die tapeout and targeted a specific process. What happened? Well, the target process changed, and it changed several times. Why? Cost? ... I'm not sure exactly. ... As a result these "smart pins" needed to be completely redesigned and retargeted at the current (whatever it might be) process.

    Here is an analogy for different processes vs. technology:

    technology = PIE
    process = "The flavor of your PIE"

    So while the technology has been constant at 180nm, the guy that wants the Cherry PIE keeps getting Lemon Meringue PIE, and the merchant claims that there is no difference between the two.

    -- Good luck
