We're looking at 5 Watts in a BGA!

Comments

  • David BetzDavid Betz Posts: 13,288
    edited 2014-04-05 - 21:00:29
    Then what would happen if two tasks wanted the hub at the same time? A queue or some other mess? Round robin would probably be bad because of how tasks are implemented. Blocking is the simplest way.
    I assume round robin access to the hub.
  • cgraceycgracey Posts: 11,145
    edited 2014-04-05 - 21:00:53
    Rayman wrote: »
    Strange they didn't ask for more pins... That's what I really want most, I think.

    I think Ken forgot to mention pins. We've talked about that, Ken and I.
  • mindrobotsmindrobots Posts: 6,506
    edited 2014-04-05 - 21:03:02
    cgracey wrote: »
    The Prop2 cogs have some state-machine features that make it able to do some things that no number of Prop1 cogs are going to accomplish:

    RGB-based video with color-space conversion
    Pin transfer to/from WIDEs - fast external memory I/O
    Goertzel algorithm in CTRs with dithered DAC output
    CORDIC computer for circular functions
    Single-clock MAC with auto-increment-and-wrapping pointers

    These cogs are sleek and make the Prop1 cogs look like farm tractors.

    If you add HUBEXEC to the P1 cogs, do they at least look like nicely styled minivans or SUVs?

    I see that as a big WIN for both technical reasons and also marketing reasons. The 2KB ceiling on COG space is hard to swallow if you don't "get it".
    MOV OUTA, PEACE
    Rick
    "I've stopped using programming languages with Garbage Collection, they keep deleting my source code!!"
  • David BetzDavid Betz Posts: 13,288
    edited 2014-04-05 - 21:07:46
    There is one fly in the ointment.

    Please see my benchmark thread.

    P1E with simple hubexec (no I & D caches, no quad or wide, no pointer instructions etc) will only provide approximately 25% speed increase over LMM.

    Now 25% is not to be sneezed at, but a P2 cog hubexec is roughly 4x-15x faster than P16E32 LMM that uses FCACHE, and hubexec does not need FCACHE support in the compiler.
    I was going to say that I expected hub exec to be implemented in the P1 using the same RDLONGC-type cache as is used in P2, although probably with only one cache line. But then I guess P1 doesn't have quad-wide access to hub memory.
  • roglohrogloh Posts: 1,068
    edited 2014-04-05 - 21:08:29
    rogloh wrote: »
    Interesting idea. I also wonder just like the P1 variant if the power budget allowed the P2 hub to be run at 200MHz while the COGs go at 100MHz and just what that might open up....

    Chip, do you think such an idea is possible for P2?
  • cgraceycgracey Posts: 11,145
    edited 2014-04-05 - 21:25:24
    rogloh wrote: »
    Chip, do you think such an idea is possible for P2?


    It would burn a lot of power. Might as well make them sync'd so that people can run things faster if they want to.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-05 - 21:26:26
    VERY interesting idea. Allows a lot of possibilities...

    1) 2 cycle hub access for the 4 cogs, reducing hubexec, task and thread latency even more
    2) 8 cycle per task hub access (if the four tasks get round-robin access to the cog's hub cycles)

    Other nice side effects:

    3) better precision for timers
    4) 66Mbps for the UARTs
    5) 100MHz SPI master
    6) 66MHz SPI slave

    Probably a lot more I can't think of right now!

    Edit:

    just saw Chip's reply to your message. Yep, it would burn more power. But how much power is burned is fully controllable by the chip.

    That's why, although I am fine with a 4 core 100MHz P2 - and far prefer it to a P16E32 - unless there is a physical die size limit, I'd prefer a 100MHz 8 cog P2, as managing the power envelope is not the chip manufacturer's responsibility, but the board designer's. It's easy not to use 4 cogs. It's impossible to stuff an additional 4 cogs into a package built with 4 cogs.
    rogloh wrote: »
    Interesting idea. I also wonder just like the P1 variant if the power budget allowed the P2 hub to be run at 200MHz while the COGs go at 100MHz and just what that might open up....
    www.mikronauts.com / E-mail: mikronauts _at_ gmail _dot_ com / @Mikronauts on Twitter
    RoboPi: The most advanced Robot controller for the Raspberry Pi (Propeller based)
  • roglohrogloh Posts: 1,068
    edited 2014-04-05 - 21:26:39
    cgracey wrote: »
    It would burn a lot of power. Might as well make them sync'd so that people can run things faster if they want to.
    Ok, thanks Chip. Was sort of dreaming of 50MIP COGs hitting hub cycles every single time and what that might open up for hub exec....
  • Ken GraceyKen Gracey Posts: 6,420
    edited 2014-04-05 - 21:30:08
    cgracey wrote: »
    I think Ken forgot to mention pins. We've talked about that, Ken and I.

    Exactly right - forgot about that for some reason.

    - more pins
    - more RAM
    - code protect
    - A/D
    - higher speed

    That's the complete list.

    Ken Gracey
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 22,341
    edited 2014-04-05 - 21:36:36
    Ken Gracey wrote:
    ... - A/D ...
    Per pin (always seemed a bit extravagant)? Or can the analog level simply be gated to one of two cog-resident ADCs, more or less like the counters are allocated?

    -Phil
    “Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.” -Antoine de Saint-Exupery
  • David BetzDavid Betz Posts: 13,288
    edited 2014-04-05 - 21:38:29
    Ken Gracey wrote: »
    Exactly right - forgot about that for some reason.

    - more pins
    - more RAM
    - code protect
    - A/D
    - higher speed

    That's the complete list.

    Ken Gracey

    I see efficient execution of C is no longer on your list. I guess that means hub exec isn't needed. Even with just these features an enhanced P1 would still be interesting. I'd like to see a SPI boot device instead of I2C though. That would allow XMM code execution without any additional parts.
  • cgraceycgracey Posts: 11,145
    edited 2014-04-05 - 21:40:13
    Per pin (always seemed a bit extravagant)? Or can the analog level simply be gated to one of two cog-resident ADCs, more or less like the counters are allocated?

    -Phil


    These are just simple integrating delta-sigma converters that output a 0/1 bit stream that is proportional to the voltage.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 22,341
    edited 2014-04-05 - 21:44:23
    Ah, so. So no flash or successive-approximation hardware involved. Got it.

    -Phil
  • jmgjmg Posts: 13,555
    edited 2014-04-05 - 21:44:29
    rogloh wrote: »
    Interesting idea. I also wonder just like the P1 variant if the power budget allowed the P2 hub to be run at 200MHz while the COGs go at 100MHz and just what that might open up....?

    I'm not certain you can run the HUB at 200MHz in 180nm - that's very fast for big memory.

    The code Chip has for the P1E does use 2-clock COGs, but when I asked about that same design for P2, he said P2 is too far along for that change (i.e. it is non-trivial)
    - it is quite a lot of changes to the Verilog to change to 2 cycles, but it would shrink the P2 COG memory area.

    An alternative is to use the existing COG clock gating to limit COGs to 50%, which is almost the same end result.
    (Think of it as 2 x 4 COG P2s time-sharing.)

    Moving to 4 COGS at 100% as Chip has done, is also a similar 50% Power profile, but saves a lot of die area.

    Of course, total COG code space has halved, and Timers have halved too, both of which may need a clean-up pass.
  • cgraceycgracey Posts: 11,145
    edited 2014-04-05 - 21:47:23
    Ah, so. So no flash or successive-approximation hardware involved. Got it.

    -Phil


    These pins will be used in whatever chip we build. You can use a Prop1 CTR to add up the 1's over a period of time. In Prop2, it does it for you and delivers a result at whatever rate you want.
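
    The bit-stream idea above can be sketched in software. This is only an illustrative model, assuming an idealized first-order modulator with a constant input; the function names are hypothetical and nothing here describes the actual pad circuit.

```python
def delta_sigma_bits(level, n_bits):
    """Idealized first-order delta-sigma modulator for a constant input.

    level is the input as a fraction of full scale (0.0 to 1.0).
    Returns a list of 0/1 bits whose density of 1s tracks level.
    (Assumed model for illustration, not the real hardware.)
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be within 0.0..1.0")
    integrator = 0.0
    bits = []
    for _ in range(n_bits):
        integrator += level          # integrate the input
        if integrator >= 1.0:        # comparator trips: emit a 1
            bits.append(1)
            integrator -= 1.0        # feedback removes one full-scale unit
        else:
            bits.append(0)
    return bits


def count_ones(bits):
    """Add up the 1s seen during the sampling window, as a counter would."""
    return sum(bits)


def estimate_level(bits):
    """Convert the count back into a fraction of full scale."""
    return count_ones(bits) / len(bits)
```

    For example, `count_ones(delta_sigma_bits(0.25, 1024))` gives 256, one quarter of the window. A longer window gives finer resolution at a lower result rate.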
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 22,341
    edited 2014-04-05 - 21:52:28
    In the now-traditional P1 two-pin sigma-delta ADC, we get to choose a value for the feedback resistor. This enables very high-impedance inputs, such as an antenna. How does this work under the new regime?

    -Phil
  • cgraceycgracey Posts: 11,145
    edited 2014-04-05 - 21:56:03
    In the now-traditional P1 two-pin sigma-delta ADC, we get to choose a value for the feedback resistor. This enables very high-impedance inputs, such as an antenna. How does this work under the new regime?

    -Phil

    It's locally clocked in the pad, with a super fast comparator, so no latency like on Prop1. That circuit is super tight and the feedback resistance is 5 megohms. The C is something like 10pF. You would approve.
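
    As a rough feel for those values, here is a small calculation. It assumes the quoted R and C form a plain single-pole RC network, which is a simplification and may not match the real pad; the function names are hypothetical.

```python
import math


def rc_time_constant(r_ohms, c_farads):
    """Time constant tau = R * C, in seconds."""
    return r_ohms * c_farads


def rc_corner_hz(r_ohms, c_farads):
    """Corner frequency of a single-pole RC, f = 1 / (2 * pi * R * C)."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)


# Values quoted in the post: about 5 megohms and about 10 pF.
tau = rc_time_constant(5e6, 10e-12)    # about 50 microseconds
corner = rc_corner_hz(5e6, 10e-12)     # roughly 3.2 kHz
```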

    And in the Prop2 CTRs there are complete Goertzel integrators for very discriminating spectral analysis. Reasons like that are why I think we should make Prop2 first. It has 8 years of really in-depth development behind it.
  • jmgjmg Posts: 13,555
    edited 2014-04-05 - 21:57:58
    cgracey wrote: »
    It's locally clocked in the pad, with a super fast comparator, so no latency like on Prop1. That circuit is super tight and the feedback resistance is 5 megohms. The C is something like 10pF. You would approve.

    You could still build a 2 pin version, if you wanted to, tho right ?
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 22,341
    edited 2014-04-05 - 21:59:10
    10 megs at least, or no deal! (Just kidding; sounds nice.)

    -Phil
  • cgraceycgracey Posts: 11,145
    edited 2014-04-05 - 22:00:12
    jmg wrote: »
    You could still build a 2 pin version, if you wanted to, tho right ?

    There is an external feedback mode in the pads, where you can add your own R and C. The big performance increase comes from locally clocking the feedback using a fast comparator. It was designed to do this at 200MHz!
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 22,341
    edited 2014-04-05 - 22:03:14
    Chip,

    One thing I've found to be extremely useful is an I/Q demodulator, based upon the counters' logic inputs combining a sigma-delta clock output with a pair of local oscillators 90 degrees out of phase. How might this be accomplished with the new ADCs?

    -Phil
  • jmgjmg Posts: 13,555
    edited 2014-04-05 - 22:03:55
    cgracey wrote: »
    There is an external feedback mode in the pads, where you can add your own R and C. The big performance increase comes from locally clocking the feedback using a fast comparator. It was designed to do this at 200MHz!

    Is the fast comparator still used in the external feedback mode in the pads, where you can add your own R and C?
    Sounds like that covers all the bases.

    Can users get at both input pins of these fast comparators (for differential work)?
    What common-mode range do they have?
  • Ken GraceyKen Gracey Posts: 6,420
    edited 2014-04-05 - 22:04:39
    koehler wrote: »
    Ken,

    We know you're lurking about, how about throwing us a bone and help us help you?

    You're probably sitting back, hoping Chip can work his usual magic and make this problem go away for the most part.
    However, it looks like we're at a full stop, with probably 4 Cogs.

    The problem some of us see is that going ahead with 4 Cogs means multithreading is now required, whereas in the past we had something akin to lightweight, disposable Cogs.

    The premise of the Prop has been simplicity, and interruptless, single-threading functionality.

    The P2 option now available appears to jettison that simplicity, and require multi-threading. Without useful interrupts at that.

    The P1b/P16 may be an option, although less powerful, it seems to be closer to the Prop's raison d'
  • David BetzDavid Betz Posts: 13,288
    edited 2014-04-05 - 22:04:44
    P1E with simple hubexec (no I & D caches, no quad or wide, no pointer instructions etc) will only provide approximately 25% speed increase over LMM.
    Why do you say only 25% faster?
  • cgraceycgracey Posts: 11,145
    edited 2014-04-05 - 22:09:22
    Chip,

    One thing I've found to be extremely useful is an I/Q demodulator, based upon the counters' logic inputs combining a sigma-delta clock output with a pair of local oscillators 90 degrees out of phase. How might this be accomplished with the new ADCs?

    -Phil


    You just have a Prop2 CTR do a Goertzel operation on an ADC bit stream coming from a pin. It will accumulate I and Q for as long as you want. Then, capture those values and do a complex ARCTAN with the CORDIC solver. You get power and phase back (rho, theta). The cool thing is, you can have the counter output an 18-bit dithered reference sine or cosine out of a DAC while all this is going on, so you can make phase-locked systems. I imagine you'd even be able to resolve time-of-flight with a pulsed LED into a photodiode, down to micrometers.
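
    The accumulate-then-convert flow described above can be modelled in software. The sketch below uses a direct single-bin I/Q correlation, which yields the same frequency bin a Goertzel filter targets, followed by a rectangular-to-polar step done with `math.atan2` and `math.hypot` rather than CORDIC. The function names and the mapping of bits to -1/+1 are assumptions for illustration, not a description of the Prop2 hardware.

```python
import math


def iq_accumulate(bits, cycles_per_sample):
    """Correlate a 0/1 bit stream with a reference cosine and sine.

    cycles_per_sample is the reference frequency divided by the bit rate.
    Returns the accumulated (I, Q) pair.
    """
    i_acc = 0.0
    q_acc = 0.0
    for n, bit in enumerate(bits):
        x = 1.0 if bit else -1.0          # map 0/1 to -1/+1 to drop the DC offset
        angle = 2.0 * math.pi * cycles_per_sample * n
        i_acc += x * math.cos(angle)      # in-phase accumulator
        q_acc += x * math.sin(angle)      # quadrature accumulator
    return i_acc, q_acc


def to_polar(i_acc, q_acc):
    """Rectangular to polar: returns (rho, theta) with theta in radians."""
    return math.hypot(i_acc, q_acc), math.atan2(q_acc, i_acc)
```

    With this sign convention, an input that lags the reference cosine by some angle comes back with theta approximately equal to that lag. Accumulating over more samples narrows the effective bandwidth, which is where the discrimination comes from.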
  • Ken GraceyKen Gracey Posts: 6,420
    edited 2014-04-05 - 22:09:55
    Chip,

    One thing I've found to be extremely useful is an I/Q demodulator, based upon the counters' logic inputs combining a sigma-delta clock output with a pair of local oscillators 90 degrees out of phase. How might this be accomplished with the new ADCs?

    -Phil

    Wow, Phil is here and finally getting into the swing of things! Welcome to the early adopter party, Phil Pilgrim! I know I can poke ya a bit without causing you to leave the thread.

    Ken Gracey
  • jmgjmg Posts: 13,555
    edited 2014-04-05 - 22:12:12
    If you go for a 4 cog P2, would it be difficult to add a round-robin mode to the tasks' access to the hub? (i.e. cog gets 1/4 hub cycles, tasks get at those slots in a round-robin manner)

    The reason that I ask, is if that can be made deterministic, the tasks would be fully deterministic (except CORDIC / MUL / DIV) - just as deterministic as P1 cogs.

    Yes, fully deterministic will matter more with fewer COGs
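
    To make the determinism argument concrete, here is a toy model of the proposed scheme. It assumes hub slots are simply numbered in order, rotate across the cogs, and that each cog hands its successive slots to its tasks in turn. All names are hypothetical and this is not how any actual Prop2 logic is known to work.

```python
def slot_owner(hub_slot, n_cogs=4, n_tasks=4):
    """Return the (cog, task) pair that owns a given hub slot number."""
    cog = hub_slot % n_cogs       # slots rotate across the cogs
    turn = hub_slot // n_cogs     # how many turns this cog has already had
    task = turn % n_tasks         # each turn goes to the next task in order
    return cog, task


def task_period(cog_period, n_tasks=4):
    """Interval between one task's hub slots, in the same units as cog_period."""
    return cog_period * n_tasks


def slots_for(cog, task, n_slots, n_cogs=4, n_tasks=4):
    """List every slot number below n_slots owned by one (cog, task) pair."""
    return [s for s in range(n_slots)
            if slot_owner(s, n_cogs, n_tasks) == (cog, task)]
```

    In this model each task's slots are evenly spaced, so its worst-case hub wait is fixed and known in advance. Using the figures mentioned earlier in the thread, a 2-cycle cog hub window shared by four tasks gives `task_period(2)`, i.e. 8 cycles per task.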

    I'm not sure how the cache interacts, but the Power Profile and Hub mapping I suggested can include Task bits (that map is now removed from COG), and now you can allocate Power/Task/Hub across all users, in a simple Table-granular manner.
    Would need a single tiny RAM of 60/120 bytes for 3%/1.5% granularity control of Power and Hub BW.
  • cgraceycgracey Posts: 11,145
    edited 2014-04-05 - 22:15:02
    Aside from doubling the complexity of a Prop1 cog, hub exec would have the effect of creating a lot more memory conduit to accommodate a 256-bit hub bus, whereas it would only need to be 32 bits wide, sans hub exec.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 22,341
    edited 2014-04-05 - 22:25:08
    Ken Gracey wrote:
    Wow, Phil is here and finally getting into the swing of things! Welcome to the early adopter party, Phil Pilgrim! I know I can poke ya a bit without causing you to leave the thread.
    LOL! I do have my weaknesses, and signal processing is one of them.

    While you might be loopy from cold meds, I'm a bit grumpy -- in case no one noticed -- from a bad back (dang kayak rack). When I turned 50, it seemed like the new 30; but now 65 seems like the new 80! But it's all good, and I know I'm among friends. :)

    -Phil
  • jmgjmg Posts: 13,555
    edited 2014-04-05 - 22:25:17
    cgracey wrote: »
    Aside from doubling the complexity of a Prop1 cog, hub exec would have the effect of creating a lot more memory conduit to accommodate a 256-bit hub bus, whereas it would only need to be 32 bits wide, sans hub exec.

    Solutions then sound like a mix of P1E & P1HE COG designs to keep die size down, as not all COGs need HubExec.

    If we also consider the P1HE as a P2, you can then mix P1E and P2(HE) COGs in any combination that makes sense.

    Examples: a minimal device, still above P1, could be
    1 x P2(HE) COG + 8 P1E COGS,
    and a more maximal, Power Envelope variant could be something like
    3 x P2(HE) COG + 16 P1E COGS

    Of course, we still do not know P1E power figures, so the P1E side is more open ended.
This discussion has been closed.