We're looking at 5 Watts in a BGA!

David Betz · 2014-04-05 21:00

Electrodude wrote: »

Then what would happen if two tasks wanted the hub at the same time? A queue or some other mess? Round robin would be probably bad because of how tasks are implemented. Blocking is the simplest way.

I assume round robin access to the hub.

cgracey · 2014-04-05 21:00

Rayman wrote: »

Strange they didn't ask for more pins... That's what I really want most, I think.

I think Ken forgot to mention pins. We've talked about that, Ken and I.

mindrobots · 2014-04-05 21:03

cgracey wrote: »

The Prop2 cogs have some state-machine features that make it able to do some things that no number of Prop1 cogs are going to accomplish:

RGB-based video with color-space conversion
Pin transfer to/from WIDEs - fast external memory I/O
Goertzel algorithm in CTRs with dithered DAC output
CORDIC computer for circular functions
Single-clock MAC with auto-increment-and-wrapping pointers

These cogs are sleek and make the Prop1 cogs look like farm tractors.

If you add HUBEXEC to the P1 cogs, do they at least look like nicely styled minivans or SUVs?

I see that as a big WIN for both technical reasons and also marketing reasons. The 2KB ceiling on COG space is hard to swallow if you don't "get it".

David Betz · 2014-04-05 21:07

Bill Henning wrote: »

There is one fly in the ointment.

Please see my benchmark thread.

P1E with simple hubexec (no I & D caches, no quad or wide, no pointer instructions etc) will only provide approximately 25% speed increase over LMM.

Now 25% is not to be sneezed at, but a P2 cog hubexec is roughly 4x-15x faster than P16E32 LMM that uses FCACHE, and hubexec does not need FCACHE support in the compiler.

I was going to say that I expected hub exec to be implemented in the P1 using the same RDLONGC-type cache as is used in P2, although probably with only one cache line, but then I guess P1 doesn't have quad-wide access to hub memory.

rogloh · 2014-04-05 21:08

rogloh wrote: »

Interesting idea. I also wonder just like the P1 variant if the power budget allowed the P2 hub to be run at 200MHz while the COGs go at 100MHz and just what that might open up....

Chip, do you think such an idea is possible for P2?

cgracey · 2014-04-05 21:25

rogloh wrote: »

Chip, do you think such an idea is possible for P2?

It would burn a lot of power. Might as well make them sync'd so that people can run things faster if they want to.

Bill Henning · 2014-04-05 21:26

VERY interesting idea. Allows a lot of possibilities...

1) 2 cycle hub access for the 4 cogs, reducing hubexec, task and thread latency even more
2) 8 cycle per task hub access (if the four tasks get round-robin access to the cog's hub cycles)

Other nice side effects:

3) better precision for timers
4) 66Mbps for the UARTs
5) 100MHz SPI master
6) 66MHz SPI slave

Probably a lot more I can't think of right now!
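The figures in the list above appear consistent with a 200MHz hub clock feeding 100MHz cogs. A minimal sketch of that arithmetic follows; the function name `hub_timing` is hypothetical, and reading the 100MHz and 66MHz figures as the hub clock divided by 2 and by 3 is an assumption, not something stated in the post.

```python
def hub_timing(hub_mhz=200.0, cog_mhz=100.0, n_cogs=4, n_tasks=4):
    """Derive hub-slot spacing (in cog clocks) and a few divided clock rates."""
    # Assumption: each cog owns one hub slot out of every n_cogs hub clocks.
    cog_clocks_per_slot = n_cogs * cog_mhz / hub_mhz
    return {
        # Spacing between hub slots for one cog, measured in cog clocks.
        "cog_clocks_per_hub_slot": cog_clocks_per_slot,
        # Spacing per task if n_tasks tasks share the cog's slots round-robin.
        "cog_clocks_per_task_slot": cog_clocks_per_slot * n_tasks,
        # Assumed readings of the SPI master and UART / SPI slave figures.
        "hub_clock_div_2_mhz": hub_mhz / 2,
        "hub_clock_div_3_mhz": hub_mhz / 3,
    }


if __name__ == "__main__":
    for name, value in hub_timing().items():
        print(f"{name}: {value:.2f}")
```

With the hub clock set equal to the cog clock (`hub_mhz=100.0`), the same function gives 4 cog clocks between hub slots, which is the baseline the 2-cycle figure is being compared against.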

Edit:

just saw Chip's reply to your message. Yep, it would burn more power. But how much power is burned is fully controllable by the chip.

That's why, although I am fine with a 4 core 100MHz P2 - and far prefer it to a P16E32 - unless there is a physical die size limit, I'd prefer a 100MHz 8 cog P2, as managing the power envelope is not the chip manufacturer's responsibility, but the board designer's. It's easy not to use 4 cogs. It's impossible to stuff an additional 4 cogs into the package built with 4 cogs.

rogloh wrote: »

Interesting idea. I also wonder just like the P1 variant if the power budget allowed the P2 hub to be run at 200MHz while the COGs go at 100MHz and just what that might open up....

rogloh · 2014-04-05 21:26

cgracey wrote: »

It would burn a lot of power. Might as well make them sync'd so that people can run things faster if they want to.

Ok, thanks Chip. Was sort of dreaming of 50MIP COGs hitting hub cycles every single time and what that might open up for hub exec....

Ken Gracey · 2014-04-05 21:30

cgracey wrote: »

I think Ken forgot to mention pins. We've talked about that, Ken and I.

Exactly right - forgot about that for some reason.

- more pins
- more RAM
- code protect
- A/D
- higher speed

That's the complete list.

Ken Gracey

Phil Pilgrim (PhiPi) · 2014-04-05 21:36

Ken Gracey wrote:

... - A/D ...

Per pin (always seemed a bit extravagant)? Or can the analog level simply be gated to one of two cog-resident ADCs, more or less like the counters are allocated?

-Phil

David Betz · 2014-04-05 21:38

Ken Gracey wrote: »

Exactly right - forgot about that for some reason.

- more pins
- more RAM
- code protect
- A/D
- higher speed

That's the complete list.

Ken Gracey

I see efficient execution of C is no longer on your list. I guess that means hub exec isn't needed. Even with just these features an enhanced P1 would still be interesting. I'd like to see a SPI boot device instead of I2C though. That would allow XMM code execution without any additional parts.

cgracey · 2014-04-05 21:40

Phil Pilgrim (PhiPi) wrote: »

Per pin (always seemed a bit extravagant)? Or can the analog level simply be gated to one of two cog-resident ADCs, more or less like the counters are allocated?

-Phil

These are just simple integrating delta-sigma converters that output a 0/1 bit stream that is proportional to the voltage.

Phil Pilgrim (PhiPi) · 2014-04-05 21:44

Ah, so. So no flash or successive-approximation hardware involved. Got it.

-Phil

jmg · 2014-04-05 21:44

rogloh wrote: »

Interesting idea. I also wonder just like the P1 variant if the power budget allowed the P2 hub to be run at 200MHz while the COGs go at 100MHz and just what that might open up....?

I'm not certain you can run the HUB at 200MHz in 180nm - that's very fast for big memory.

The code Chip has for the P1E does use 2-clock COGs, but when I asked about that same design for P2, he said P2 is too far along for that change (i.e. it is non-trivial)
- it is quite a lot of changes to the Verilog to change to 2 cycles, but it would shrink the P2 COG memory area.

An alternative is to use the existing COG clock gating to limit COGs to 50%, which is almost the same end result.
(Think of it as 2 x 4 COG P2s time-sharing.)

Moving to 4 COGS at 100% as Chip has done, is also a similar 50% Power profile, but saves a lot of die area.

Of course, total COG code space has halved, and Timers have halved too, both of which may need a clean-up pass.

cgracey · 2014-04-05 21:47

Phil Pilgrim (PhiPi) wrote: »

Ah, so. So no flash or successive-approximation hardware involved. Got it.

-Phil

These pins will be used in whatever chip we build. You can use a Prop1 CTR to add up the 1's over a period of time. In Prop2, it does it for you and delivers a result at whatever rate you want.
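As a rough illustration of "adding up the 1's over a period of time", here is a minimal software sketch of a first-order delta-sigma modulator followed by a ones counter. It is a model of the general technique, not of the actual pad circuit or of a Prop1 CTR; the 3.3V reference, the sample count, and all function names are assumptions.

```python
def delta_sigma_bits(v_in, v_ref=3.3, n=1000):
    """First-order delta-sigma model: the density of 1s tracks v_in / v_ref."""
    integrator = 0.0
    bits = []
    for _ in range(n):
        # Feed back the full reference whenever the previous output bit was 1.
        feedback = v_ref if bits and bits[-1] else 0.0
        integrator += v_in - feedback
        bits.append(1 if integrator > 0 else 0)
    return bits


def count_ones(bits):
    """What a counter would accumulate over the measurement window."""
    return sum(bits)


def estimate_voltage(v_in, v_ref=3.3, n=1000):
    """Scale the ones count back to volts."""
    return v_ref * count_ones(delta_sigma_bits(v_in, v_ref, n)) / n


if __name__ == "__main__":
    for v in (0.5, 1.0, 2.5):
        print(f"{v:.2f} V in, {estimate_voltage(v):.3f} V estimated")
```

A longer window (larger `n`) gives finer resolution at the cost of a slower result rate, which is the trade-off behind "delivers a result at whatever rate you want".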

Phil Pilgrim (PhiPi) · 2014-04-05 21:52

In the now-traditional P1 two-pin sigma-delta ADC, we get to choose a value for the feedback resistor. This enables very high-impedance inputs, such as an antenna. How does this work under the new regime?

-Phil

cgracey · 2014-04-05 21:56

Phil Pilgrim (PhiPi) wrote: »

In the now-traditional P1 two-pin sigma-delta ADC, we get to choose a value for the feedback resistor. This enables very high-impedance inputs, such as an antenna. How does this work under the new regime?

-Phil

It's locally clocked in the pad, with a super fast comparator, so no latency like on Prop1. That circuit is super tight and the feedback resistance is 5 megohms. The C is something like 10pF. You would approve.

And in the Prop2 CTRs there are complete Goertzel integrators for very discriminating spectral analysis. Reasons like that are why I think we should make Prop2 first. It has 8 years of really in-depth development behind it.

jmg · 2014-04-05 21:57

cgracey wrote: »

It's locally clocked in the pad, with a super fast comparator, so no latency like on Prop1. That circuit is super tight and the feedback resistance is 5 megohms. The C is something like 10pF. You would approve.

You could still build a 2 pin version, if you wanted to, tho right ?

Phil Pilgrim (PhiPi) · 2014-04-05 21:59

10 megs at least, or no deal! (Just kidding; sounds nice.)

-Phil

cgracey · 2014-04-05 22:00

jmg wrote: »

You could still build a 2 pin version, if you wanted to, tho right ?

There is an external feedback mode in the pads, where you can add your own R and C. The big performance increase comes from locally clocking the feedback using a fast comparator. It was designed to do this at 200MHz!

Phil Pilgrim (PhiPi) · 2014-04-05 22:03

Chip,

One thing I've found to be extremely useful is an I/Q demodulator, based upon the counters' logic inputs combining a sigma-delta clock output with a pair of local oscillators 90 degrees out of phase. How might this be accomplished with the new ADCs?

-Phil

jmg · 2014-04-05 22:03

cgracey wrote: »

There is an external feedback mode in the pads, where you can add your own R and C. The big performance increase comes from locally clocking the feedback using a fast comparator. It was designed to do this at 200MHz!

Is the fast comparator still used in the external feedback mode in the pads, where you can add your own R and C?
Sounds like that covers all the bases.

Can users get at both input pins of these fast comparators (for differential work)?
What common mode range do they have?

Ken Gracey · 2014-04-05 22:04

koehler wrote: »

Ken,

We know you're lurking about, how about throwing us a bone and help us help you?

You're probably sitting back, hoping Chip can work his usual magic and make this problem go away for the most part.
However, it looks like we're at a full stop, with probably 4 Cogs.

The problem some of us see is that going ahead with 4 Cogs means multithreading is now required, whereas in the past we had something akin to lightweight, disposable Cogs.

The premise of the Prop has been simplicity, and interruptless, single-threading functionality.

The P2 option now available appears to jettison that simplicity, and require multi-threading. Without useful interrupts at that.

The P1b/P16 may be an option, although less powerful, it seems to be closer to the Prop's raison d'

David Betz · 2014-04-05 22:04

Bill Henning wrote: »

P1E with simple hubexec (no I & D caches, no quad or wide, no pointer instructions etc) will only provide approximately 25% speed increase over LMM.

Why do you say only 25% faster?

cgracey · 2014-04-05 22:09

Phil Pilgrim (PhiPi) wrote: »

Chip,

One thing I've found to be extremely useful is an I/Q demodulator, based upon the counters' logic inputs combining a sigma-delta clock output with a pair of local oscillators 90 degrees out of phase. How might this be accomplished with the new ADCs?

-Phil

You just have a Prop2 CTR do a Goertzel operation on an ADC bit stream coming from a pin. It will accumulate I and Q for as long as you want. Then, capture those values and do a complex ARCTAN with the CORDIC solver. You get power and phase back (rho, theta). The cool thing is, you can have the counter output an 18-bit dithered reference sine or cosine out of a DAC while all this is going on, so you can make phase-locked systems. I imagine you'd even be able to resolve time-of-flight with a pulsed LED into a photodiode, down to micrometers.
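The accumulate-then-convert flow described here can be sketched in software. The version below is a direct single-frequency I/Q correlation (the quantity a Goertzel filter computes), with `math.atan2` and `math.hypot` standing in for the CORDIC solver; it is not the hardware's implementation, and the function names and the test tone are assumptions made for illustration.

```python
import math


def iq_accumulate(samples, cycles):
    """Correlate samples with cos/sin references spanning `cycles` whole periods."""
    n = len(samples)
    i_acc = 0.0
    q_acc = 0.0
    for k, s in enumerate(samples):
        angle = 2.0 * math.pi * cycles * k / n
        i_acc += s * math.cos(angle)  # in-phase accumulator
        q_acc += s * math.sin(angle)  # quadrature accumulator
    return i_acc, q_acc


def to_polar(i_acc, q_acc):
    """Software stand-in for the CORDIC step: (I, Q) to (rho, theta)."""
    return math.hypot(i_acc, q_acc), math.atan2(q_acc, i_acc)


def make_tone(n, cycles, phase):
    """Hypothetical test input: a unit cosine delayed by `phase` radians."""
    return [math.cos(2.0 * math.pi * cycles * k / n - phase) for k in range(n)]


if __name__ == "__main__":
    i_acc, q_acc = iq_accumulate(make_tone(1000, 10, 0.5), 10)
    rho, theta = to_polar(i_acc, q_acc)
    print(f"rho = {rho:.3f}, theta = {theta:.3f} rad")
```

Because the window covers a whole number of reference periods, any constant offset in the input (such as the average level of a 0/1 bit stream) cancels out of both accumulators.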

Ken Gracey · 2014-04-05 22:09

Phil Pilgrim (PhiPi) wrote: »

Chip,

One thing I've found to be extremely useful is an I/Q demodulator, based upon the counters' logic inputs combining a sigma-delta clock output with a pair of local oscillators 90 degrees out of phase. How might this be accomplished with the new ADCs?

-Phil

Wow, Phil is here and finally getting into the swing of things! Welcome to the early adopter party, Phil Pilgrim! I know I can poke ya a bit without causing you to leave the thread.

Ken Gracey

jmg · 2014-04-05 22:12

Bill Henning wrote: »

If you go for a 4 cog P2, would it be difficult to add a round-robin mode to the tasks' access to the hub? (ie cog gets 1/4 hub cycles, tasks get at those slots in a round-robin manner)

The reason that I ask, is if that can be made deterministic, the tasks would be fully deterministic (except CORDIC / MUL / DIV) - just as deterministic as P1 cogs.

Yes, fully deterministic will matter more with fewer COGs

I'm not sure how the cache interacts, but the Power Profile and Hub mapping I suggested can include Task bits (that map is now removed from COG), and now you can allocate Power/Task/Hub across all users, in a simple Table-granular manner.
Would need a single tiny RAM of 60/120 bytes for 3%/1.5% granularity control of Power and Hub BW.
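The round-robin scheme in Bill's quote can be modelled in a few lines to show why it would be deterministic. This is a minimal sketch assuming one hub slot per clock, 4 cogs, and 4 tasks per cog; the function names are hypothetical and nothing here reflects the actual Verilog.

```python
def hub_slot_owner(clock, n_cogs=4, n_tasks=4):
    """Return the (cog, task) pair that owns the hub slot at this clock."""
    cog = clock % n_cogs                  # cogs rotate every clock
    task = (clock // n_cogs) % n_tasks    # tasks rotate once per cog rotation
    return cog, task


def slot_period(n_cogs=4, n_tasks=4):
    """Clocks between consecutive hub slots for any one task."""
    return n_cogs * n_tasks


def worst_case_wait(n_cogs=4, n_tasks=4):
    """Longest a task can wait if it just missed its slot."""
    return slot_period(n_cogs, n_tasks) - 1


if __name__ == "__main__":
    slots = [c for c in range(64) if hub_slot_owner(c) == (1, 2)]
    print("cog 1, task 2 owns clocks:", slots)
    print("period:", slot_period(), "worst-case wait:", worst_case_wait())
```

Since the owner depends only on the clock count, every task sees a fixed period and a bounded worst-case wait, regardless of what the other tasks are doing.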

cgracey · 2014-04-05 22:15

Aside from doubling the complexity of a Prop1 cog, hub exec would have the effect of creating a lot more memory conduit to accommodate a 256-bit hub bus, whereas it would only need to be 32 bits wide, sans hub exec.

Phil Pilgrim (PhiPi) · 2014-04-05 22:25

Ken Gracey wrote:

Wow, Phil is here and finally getting into the swing of things! Welcome to the early adopter party, Phil Pilgrim! I know I can poke ya a bit without causing you to leave the thread.

LOL! I do have my weaknesses, and signal processing is one of them.

While you might be loopy from cold meds, I'm a bit grumpy -- in case no one noticed -- from a bad back (dang kayak rack). When I turned 50, it seemed like the new 30; but now 65 seems like the new 80! But it's all good, and I know I'm among friends.

-Phil

jmg · 2014-04-05 22:25

cgracey wrote: »

Aside from doubling the complexity of a Prop1 cog, hub exec would have the effect of creating a lot more memory conduit to accommodate a 256-bit hub bus, whereas it would only need to be 32 bits wide, sans hub exec.

Solutions then sound like a mix of P1E & P1HE COG designs to keep Die size down, and not all COGS need HubExec

If we also consider the P1HE as a P2, then you could mix P1E and P2(HE) COGs in any combination that makes sense.

examples : A minimal device, that is still above P1, could be
1 x P2(HE) COG + 8 P1E COGS,
and a more maximal, Power Envelope variant could be something like
3 x P2(HE) COG + 16 P1E COGS

Of course, we still do not know P1E power figures, so the P1E side is more open ended.
