Consensus on the P16X32B? - Page 6 — Parallax Forums

Consensus on the P16X32B?


Comments

  • W9GFOW9GFO Posts: 4,010
    edited 2014-04-05 22:51
    I may be totally out of line and presumptuous -- I often am! -- but I think the idea of a P+1 (or P1+ or whatever) is to get some additional revenue flowing into Rocklin so they can do the P2 the way it should be done.

    I think I am seeing a pattern here. I know I am generalizing but it looks like the people that tend to use the Propeller to control robotic type things are in favor of a P1 variant whilst those that are more into fancy, complex* programming are in favor of the P2. I feel like there is a genuine need for a chip that can do what the proposed P16X32B can do while the P2 is more like a dream chip that can do things well beyond what is needed for most robotic applications. Now the dream chip is being hobbled due to cost restraints and the P16X32B may not come to be in order to try to fulfill the now compromised dream. That seems like a huge risk to me, one that even if it works out will not be the P2 that you really want it to be.

    *complex compared to what I am used to
  • RossHRossH Posts: 5,479
    edited 2014-04-05 23:28
    W9GFO wrote: »
    I think I am seeing a pattern here. I know I am generalizing but it looks like the people that tend to use the Propeller to control robotic type things are in favor of a P1 variant whilst those that are more into fancy, complex* programming are in favor of the P2. I feel like there is a genuine need for a chip that can do what the proposed P16X32B can do while the P2 is more like a dream chip that can do things well beyond what is needed for most robotic applications. Now the dream chip is being hobbled due to cost restraints and the P16X32B may not come to be in order to try to fulfill the now compromised dream. That seems like a huge risk to me, one that even if it works out will not be the P2 that you really want it to be.

    *complex compared to what I am used to

    Yes, I think you have summarised it very well. I would also add that those who just want a micro-controller want the P16X32B. Those who want something that can also pinch-hit as micro-processor want the P2.

    Ross.
  • msrobotsmsrobots Posts: 3,709
    edited 2014-04-05 23:53
    +1 for W9GFO.

    But I just noticed that @Hippy had to come out of lurking mode.

    If we heat this up some more maybe even Pullmoll chimes in again?

    Good times are coming!

    Enjoy!

    Mike
  • potatoheadpotatohead Posts: 10,261
    edited 2014-04-06 00:17
    @Ross, I think that's astute.

    Yes, I agree, but with some qualifiers. I think there is a trend to do more.

    Currently, this means either:

    1. Stepping up to more complex micro-controllers

    2. Coupling with a real micro-processor

    3. Using multiple micro-controllers

    4. Adding on dedicated devices

    5. simply not doing some desirable things

    , or

    some combination of the above involving:

    1. An operating system

    2. Complicated or "pro" grade tools, and this I generally refer to as "thick"

    3. Increased dependence on libraries

    , resulting in:


    1. More expensive designs

    2. More complicated designs

    3. More failed designs.

    I've been thinking for a while now, the P2 may well occupy some of that space nicely. It's got some great features that can be employed without an OS, potentially reduced BOM, lean programming environment (if desired, this is not always a plus), etc...

    Going to 4 COGS limits this, but not a whole lot. I would personally want to do 5+ COGS and incorporate the clock per COG idea JMG put out there to allow for the maximum range of power / performance / usability options. But, if we can't do that reasonably, we can make 4 COGS do a lot!

    It doesn't change the equation much, and if we somehow get more RAM for the trade, or maybe higher speed operation, then it's a nice overall win.

    There isn't anything out there with that kind of mix, and I think there should be. Using a PI, or Android, QNX, whatever SOC device with micro-controllers hanging off of it for doing this and that precision thing is a nice option, but it does require an OS, BOM part count, complexity, etc...

    I believe there is a range of tasks growing in importance that stretch just above what we consider a micro-controller, but just below where a CPU or computer, or micro-processor, SOC type solution is optimal.

    On the other hand, the P1E variant, extends micro-controller some, and this is good. But it really doesn't get us to that space in-between where the P2 really could shine. And I'm really torn on that aspect of things.

    I'm torn, because growing revenue with a P1E may well dip into P1 sales without adding too much we need to expand the scope of potential users, our community here, etc... But it's a nice win for those "in the family" and likely one for education.

    Of course, I think the P2 can do education clocked and powered properly too, so that's a wash IMHO, with a slight nod toward the P1E for the programming tools requiring less change, being more familiar, etc... and with a slight nod in the other direction for the P2, with the sexy pins, killer video sub-system, etc...

    And on that note, Chip really did something nice with the video system. It's insane fast, flexible, and if we are gonna talk things like instrumentation, something that flat out rules! Yes, it's analog, and I consider that an advantage, in that we avoid the IP mess, can support just about anything ever made to display, and we can always use a package for HDMI, where the painful IP / compliance issues are owned by somebody else, not us, or those who build systems, a net savings in a lot of cases. Our analog output is so good that we won't have trouble on a digital conversion. Heck, on lower frequency scan rate devices, we can clock pixels up so fast as to output 10 bit grey scale, if we want to, which gets us into medical level display quality! (Did this on a TV already, and it rocks hard, and if the thing actually will exist, I'll show and tell one day)

    On the education note, the DAC / smart pins, powerful math, video, audio, etc... make for some pretty sexy experiences. Yes, we can do all that on a Pi, but we can teach people HOW TO DO THAT themselves, hands on too. Worth something, I think.

    @Hippy! Hey man. Long time no see. Always meant to apologize for putting an "i" on Hippy way back when. It got stuck in my brain when I was reading on a crappy device, and I know I typed it way too many times. Sorry. :) Good to see you here, and I trust all is well?
  • Brian FairchildBrian Fairchild Posts: 549
    edited 2014-04-06 00:46
    Another +1 for W9GFO and for this...
    RossH wrote: »
    Yes, I think you have summarised it very well. I would also add that those who just want a micro-controller want the P16X32B. Those who want something that can also pinch-hit as micro-processor want the P2.

    ...and for this...
    ctwardell wrote: »
    This is part of the "Propeller Conundrum". What makes sense for COG's and hub access when using the COG's as general purpose and parallel computing elements is not the same as what makes sense for using the COG's as peripheral drivers.

    I gave up having any interest in the P2, other than a morbid curiosity, about a year ago when feature creep crept in big time.

    David Betz wrote: »
    I've heard a number of people say that they had to go to two Propeller chips for a project. I wonder if that was because of running out of COGs? Hub memory? Pins? If it turns out to be pins or hub memory then even a P1E with 512k of hub memory and 64 pins would probably satisfy that need.

    For me, doing what I consider to be fairly 'normal' embedded microcontroller stuff, it's always been too few cogs, followed by memory and I/O in short order. I also had a bad time on the first unit where a P1 was designed in as the sole processor where toolchain limits caused all sorts of grief which meant a new PCB was required with an additional processor (not P1) to do the actual work. This means that whilst I use P1 commercially, it's never running the main code and usually ends up doing character VGA display with keyboard and mouse and SD card.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-06 01:10
    All,

    The reasons for my wanting P2 first are:

    - MUCH higher C (hubexec) performance than LMM or "simple" hubexec on P16
    - MUCH higher hub bandwidth
    - capable of 1080p video output
    - MUCH faster VM's for Spin / Z80 / ZOG etc
    - very high speed UARTS
    - hopefully SERDES
    - tasks allowing using 1 cog as 4 baby cogs
    - threads allowing easy porting of C code using select() for tcp/ip, usb endpoints
    - MUL / DIV / MAC
    - AUX as video buffer, fast stack

    I actually started my P1/P16/P32/P2 benchmark thread to see if I could quantify the differences, and I did.

    Most of the above points are simply impossible in the P16/P32 with the stated goal of cogs identical or nearly so to P1.

    I've often run out of pins, hub, cogs, and wanted better C performance.

    This does not even consider P2 features that I find less important, but others find very important, that I may get good use out of later, such as CORDIC.

    Not having P2 features will limit me a LOT, and it will limit Parallax's customers a lot too.

    While I wanted a 92 pin 200MHz eight cog P2, I can make good use of a 64 pin 100MHz four cog P2. It allows much more than a 200MHz, 100MIPS (for cog code) P16
  • cgraceycgracey Posts: 14,208
    edited 2014-04-06 02:02
    evanh wrote: »
    A question for Chip (and excuse me if it's already been answered in the other thread): How come the enhanced P1 is so easy all of a sudden? Wasn't this in the too messy bin a year or two back?


    I don't know about it being too messy, but the Prop1 cog is almost nothing. It's so cheap to build that you can do things like WAITCNT without having to worry about anything else, especially if you've got lots of cogs.

    I estimate that a two-clock Prop1 cog, with its memory, would take about 0.37 square mm of area in this 180nm process, whereas a current Prop2 cog with its memories needs about 4.2 square mm, based on OnSemi's last numbers. That means 16 of these Prop1 cogs would take the same area as 1.4 Prop2 cogs. OnSemi estimates a Prop2 cog to take about 850mW at 160MHz. If we scale that by 1.4 for 16 Prop1 cogs, we get 1.19W. With our clock a little faster at 200MHz and our ALU settling over two clocks, we can probably take 1/4 off that, giving us ~900mW. Add 100mW for the 256KB hub RAM and we are burning 1W - for 1,600 MIPS. The exposed-pad TQFP package has a Tja of ~20, so we could operate in an ambient temperature environment of 125C and have a die temperature of under 145C, which would support this speed. This would be a viable chip.

    I would probably add a 16x16 multiplier to the cogs to give them a math boost.

    One other thing: By keeping the hub memory data busses at just 32 bits (as opposed to 256 like in Prop2), we avoid tons of mux'ing everywhere, but at the expense of hub exec. To keep power down, we'd want just one hub RAM turning on at once, anyway - not eight!
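The arithmetic in the estimate above can be retraced step by step. The following is only a sketch in Python using the figures quoted in the post; it assumes that "Tja" means junction-to-ambient thermal resistance in degrees C per watt and that the area ratio is rounded to 1.4 before scaling power, as the post does. The function and constant names are made up for illustration.

```python
# Back-of-envelope reproduction of the estimate above. All inputs are the
# figures quoted in the post; the unit for Tja (deg C per watt) is an assumption.
P1_COG_AREA_MM2 = 0.37        # two-clock Prop1 cog with memory, 180nm
P2_COG_AREA_MM2 = 4.2         # current Prop2 cog with memories
P2_COG_POWER_W = 0.850        # quoted Prop2 cog power at 160MHz
N_COGS = 16
CLOCK_MHZ = 200
CLOCKS_PER_INSTRUCTION = 2
HUB_RAM_POWER_W = 0.100       # quoted allowance for 256KB hub RAM
THETA_JA_C_PER_W = 20.0       # assumed meaning of "Tja of ~20"
AMBIENT_C = 125.0


def estimate_p16():
    """Retrace the area, power, thermal and MIPS steps of the estimate."""
    # 16 Prop1 cogs expressed as a number of Prop2 cogs by area.
    area_ratio = N_COGS * P1_COG_AREA_MM2 / P2_COG_AREA_MM2
    # The post rounds the ratio to 1.4 and scales power by it.
    scaled_power = P2_COG_POWER_W * round(area_ratio, 1)
    # "Take 1/4 off" for the two-clock ALU at the faster clock.
    cog_power = scaled_power * 0.75
    total_power = cog_power + HUB_RAM_POWER_W
    # Die temperature rise is power times thermal resistance.
    die_temp = AMBIENT_C + total_power * THETA_JA_C_PER_W
    mips = N_COGS * CLOCK_MHZ // CLOCKS_PER_INSTRUCTION
    return {
        "area_ratio": area_ratio,
        "scaled_power_w": scaled_power,
        "total_power_w": total_power,
        "die_temp_c": die_temp,
        "mips": mips,
    }


if __name__ == "__main__":
    for name, value in estimate_p16().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the total lands just under 1W and the die temperature just under 145C, which agrees with the rounded numbers in the post.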
  • BaggersBaggers Posts: 3,019
    edited 2014-04-06 02:16
    This post may sound like it's baiting an argument, but it's not really.
    But... Bill, I have to say this, as I've quietly read the banter from you guys, and for someone who has been quoting use technical proofing in arguments, I now feel I have to state a few things. :)

    Yes, whilst I agree the new P2 is an awesome beast, and that parallax do need to get this out to the public, I can't help but think the P16X32B is in my honest opinion the best route!

    Here are my reasons!

    It looks like Bill is wanting a stand alone computer out of his next prop design, and I'll admit, I'd like to do a stand alone computer with the next Prop chip, whichever one comes out too!

    But looking at the facts.
    P2 computer cog usage
    Cog 1, TV/VGA driver + keyboard + mouse + SD driver
    Cog 2, SDRAM driver for max performance
    Cog 3, GPU to assist feeding TV driver, Audio only taking a slight portion of the cog speed, leaving most for GPU
    Cog 4, CPU single thread for max performance

    Now although that sounds a good setup, here's the P16X32B setup

    Cog 1, TV/VGA driver
    Cog 2, Keyboard + mouse driver
    Cog 3, SD driver for max performance
    Cog 4-8, GPU assist for TV/VGA driver (bearing in mind these are running 5x faster than a P1 cog, and the GPU side of the PropGFX only had 6 cogs running, whereas this will effectively have 20 P1 cogs running).
    Cog 9 SDRAM driver if you want extra RAM
    Cog 10 Audio driver, for a couple of SID chips and Chip's speech driver, or whatever audio driver you want, sample player, mod tracker
    Cog 11 CPU, agreed it doesn't have cordic etc, but as Bill is looking at Z80, that doesn't have any maths either! plus emulating VMs is inherently slow compared to just plain PASM.
    Cog 12-16 are 5 extra COGS running at 100 MIPS that you can throw at anything!
    Say you wanted your computer to have a camera, one could read the data from the camera and feed it to the GPUs.
    Maybe you could throw 1 ( or a few ) at image recognition / motion sensing / face recognition / voice recognition.
    You could even throw another CPU in, like the old arcade games that used to have 3 Z80's in etc.

    So as you can see, the Cogs in a P16X32B may not be as awesome with all the features of the P2, but its WOW factor is in the number of 100MIPS cogs that are available to throw at anything you want to throw them at.
  • Heater.Heater. Posts: 21,230
    edited 2014-04-06 02:43
    Everybody who wants 16 or 32 COGs in their Propellers have a look at Amdahl's Law. http://en.wikipedia.org/wiki/Parallel_computing

    Amdahl's law tells you what increase in performance you can expect as you parallelize your program more and more. If there is any part of your code that cannot be done in parallel then there is an upper limit to the speed up you can achieve by adding processors. In our case that part that cannot be done in parallel is HUB access.

    Look at the graph on the page linked above.

    What we see there is that adding processors gives you less and less increase in performance until eventually adding a CPU has no discernible effect whatsoever!

    Example:

    If 75% of your code is happening within a COG and 25% has to go to HUB we see that 8 processors only give you a speed up of 3 times over a single processor. Already not impressive.

    Going from 8 to 16 gets you a speed up of about 3.5 times. Those 8 extra COGs are only doing the work of half a COG !!!. That's awfully wasteful.

    Going from 16 to 32 COGs gets you a total speed up of only about 3.75. You are pretty much throwing silicon at the problem for no benefit at all at this point.
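The example figures above come from reading a graph, so a quick calculation may help. This is a minimal sketch of the standard Amdahl's law formula in Python, taking the 75% in-COG / 25% HUB split from the example as the parallel and serial fractions. Treating HUB access as the serial part is the modelling assumption made in this post, and the function name is just illustrative.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's law: speedup = 1 / (serial + parallel / N)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


if __name__ == "__main__":
    # 75% of the work runs inside COGs, 25% is serialised (HUB access).
    for cogs in (1, 8, 16, 32):
        print(cogs, "COGs ->", round(amdahl_speedup(0.75, cogs), 2))
    # With a 25% serial part the speedup can never exceed 1 / 0.25 = 4.
```

The exact values come out a little below the rounded ones quoted above (roughly 2.9, 3.4 and 3.7 for 8, 16 and 32 COGs), but the shape of the argument is the same: the curve flattens toward a ceiling of 4.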

    On a P1 the effect of Amdahl's law is somewhat hidden from view because of the round-robin HUB access. Even if you are only using one COG you are only getting one eighth the HUB access bandwidth. In effect you have been "pre-crippled" by the HUB. So when you fire up those other COGs you think you are getting linearly more performance. It's an illusion.

    Jumping from 8 to 16 or 32 COGs would make you painfully aware of Amdahl as the HUB access cycle has to be stretched beyond 8.

    Now you can argue that your programs are more parallel than that so it's not quite so bad. You can argue that more COGs is not all about speed but about the simplicity of not having to use interrupts. It's about being able to mix lots of objects without needing an RTOS. And so on.

    All valid points. But Amdahl should still be taken into consideration.

    @Baggers: "but its WOW factor is in the number of 100MIPS cogs, that are available to throw at anything you want to throw them at." That WOW factor should not be sounding so exciting now. It will soon fade when you realize how little performance you can actually realize from 32 COGS.

    I'm still sitting on the fence here. A sea of simple P1 style COGs does sound very attractive. Especially as it continues the Prop tradition of making it super easy to add functionality, e.g. Spin objects. As Chip says, a P1 COG is so small and simple that throwing one at a simple task that you need is no big deal, like adding just another 7400 to your old logic designs.

    On the other hand performance suffers badly, as Bill points out. A PII style COG will outrun many P1 style COGs from a computer perspective. Amdahl tells me going to 16 or 32 COGs is a waste of space and power. Worse, it "pre-cripples" applications that only need 8 or fewer COGs.

    And a lot of nice, indeed previously "essential" features go out the window. Like code signing/protection.
  • cgraceycgracey Posts: 14,208
    edited 2014-04-06 02:55
    Heater. wrote: »
    Everybody who wants 16 or 32 COGs in their Propellers have a look at Amdahl's Law. http://en.wikipedia.org/wiki/Parallel_computing


    But, consider, Heater, that with 16 cogs you get a hub cycle every 8 instructions. That's the same as Prop1.

    I think 32 cogs would probably be extreme overkill, but with a hub cycle allocation table, things like keyboard drivers could get 1:64 access, which would be totally adequate. Same for serial ports.
  • jmgjmg Posts: 15,175
    edited 2014-04-06 02:58
    Heater. wrote: »
    Going from 8 to 16 gets you a speed up of about 3.5 times. Those 8 extra COGs are only doing the work of half a COG !!!. That's awfully wasteful.

    Not only that, but the P1E needs an OnSemi power simulation pass, so more accurate figures are available for each fork.
    Needed are W/MHz and estimated MHz abilities with Vcc.

    The Power Envelope has clearly become very important.

    Heater. wrote: »
    On the other hand performance suffers badly, as Bill points out. A PII style COG will outrun many P1 style COGs from a computer perspective. Amdahl tells me going to 16 or 32 COGs is a waste of space and power. Worse, it "pre-cripples" applications that only need 8 or fewer COGs.

    Another summary that effectively makes a case for a mix of P2 and P1E COGs.

    HUB bandwidth dilution effects can be managed with Hub Slot Mapping, and Power Envelope effects can be managed with COG and Power mapping.
    Then it is a choice of how many of each COG are needed, and if more than one combination mix is worthwhile.
    Memory Sizes and speeds come into play as well.
  • ColeyColey Posts: 1,110
    edited 2014-04-06 03:10
    How will you all feel if Parallax produces the P2 and it's a bust?
    What effect will that have on Parallax as a whole?

    If Chip thinks P1+ is 'easy' to do then Parallax should do it, get the chips rolling out the door and get some breathing space to finish P2 properly.

    I would be fairly certain that Parallax doesn't make the majority of its income off the people in this forum, sorry guys, there's just not enough of you!

    Parallax should listen instead to those who contribute most to that income and make a chip for their market.

    One thing business has taught me is to not have all of your eggs in one basket.

    Diversity is important, multiple revenue streams are important, most of all operating profit is paramount.

    P2 is an unknown quantity in the general marketplace, at best it's a bet, sure the people on this forum want it but like I said there simply aren't enough of us!

    I would love to see a P2 out there, I would love for it to be a success.

    Right now it's a huge mess and we have contributed to that with all this bloody feature creep.

    I was excited when I thought P2 was coming, I even remember Ken saying at the Elektor show in Eindhoven a few years back that it was only 'a few months away'.

    All I think of now when someone says P2 is this....

    [image: thehomer.jpg]


    Parallax whatever you decide to do, please do it quickly.

    Coley
  • Brian FairchildBrian Fairchild Posts: 549
    edited 2014-04-06 04:00
    cgracey wrote: »
    ...things like keyboard drivers could get 1:64 access, which would be totally adequate. Same for serial ports.
    And that access would be a full 32-bit word ie 4 full bytes from a serial port.
  • Heater.Heater. Posts: 21,230
    edited 2014-04-06 04:05
    Chip,
    ...consider, Heater, that with 16 cogs you get a hub cycle every 8 instructions. That's the same as Prop1.
    Indeed, one can certainly tweak around with the parameters of Amdahl's law and get more favourable results. It's just that it will bite you eventually as surely as any other physical law.


    Reducing the percentage of non-parallel activity is the main parameter here. In this case you are saying that HUB access is twice as fast relative to the P1 so Amdahl kicks in later. Sounds good.


    But what about going above 16? Amdahl tells me that 32 is not a good idea. Don't forget your COGs are already having HUB access rationed even if you only use one. "Pre-crippled" as I said. The Amdahl effect does not show up in the Prop until you go past 8 COGs in a P1 like system or 16 in this new approach.


    And as I said, Amdahl is not the be all and end all of the argument. As in your example there are many tasks you might want that really don't need to be fast. Like that keyboard. Here you might be more interested in the P1 like simplicity of being able to just throw another object into the mix and not worry about it. Or they may need to be fast but use very little HUB time. Like, well, I don't know, something.
  • jmgjmg Posts: 15,175
    edited 2014-04-06 04:10
    Heater. wrote: »
    Reducing the percentage of non-parallel activity is the main parameter here. In this case you are saying that HUB access is twice as fast relative to the P1 so Amdahl kicks in later. Sounds good.

    The HUB slot bottle-neck can be improved with mapping, (ie just those that NEED it, still 100% deterministic), and further improved with an alternate means for COGS to communicate a few variables to each other. ( ie more parallel activity on your scale)
    That reserves HUB for 'big data'
  • Brian FairchildBrian Fairchild Posts: 549
    edited 2014-04-06 04:27
    jmg wrote: »
    ...an alternate means for COGS to communicate a few variables to each other. ( ie more parallel activity on your scale)
    Cluso suggested a simple pair of 32-bit registers linking adjacent COGs (and I guess logically wrapping COG 0 to COG 15) so that each COG could read and write to the COG on either side. Such a mechanism could work as both a semaphore and pointer into hub for COGs to exchange data. By linking directly, a COG could prepare the way for retrieving the data from Hub so that once it had hub access it could go and grab the data directly rather than having to first pick up the semaphore and pointer from the hub before it got to the data. When COGs were being used together this would greatly speed up operation and synchronisation.
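The neighbour-link idea described above can be pictured with a small software model. This is only a sketch in Python, not a description of any actual hardware proposal: the class `CogRing`, its method names, and the choice of one outgoing register per direction per COG are assumptions made for illustration. It shows the wrap-around from COG 15 to COG 0 and a COG handing a hub pointer to its neighbour without going through the hub.

```python
class CogRing:
    """Toy model of 32-bit link registers between adjacent COGs in a ring."""

    MASK32 = 0xFFFFFFFF

    def __init__(self, n_cogs=16):
        self.n_cogs = n_cogs
        # to_right[i] is written by COG i and read by COG (i + 1) % n.
        self.to_right = [0] * n_cogs
        # to_left[i] is written by COG i and read by COG (i - 1) % n.
        self.to_left = [0] * n_cogs

    def right_of(self, cog):
        return (cog + 1) % self.n_cogs

    def left_of(self, cog):
        return (cog - 1) % self.n_cogs

    def send_right(self, cog, value):
        self.to_right[cog] = value & self.MASK32

    def send_left(self, cog, value):
        self.to_left[cog] = value & self.MASK32

    def read_from_left(self, cog):
        """Read what the left-hand neighbour sent to this COG."""
        return self.to_right[self.left_of(cog)]

    def read_from_right(self, cog):
        """Read what the right-hand neighbour sent to this COG."""
        return self.to_left[self.right_of(cog)]


if __name__ == "__main__":
    ring = CogRing()
    # COG 15 passes a (hypothetical) hub buffer address to COG 0.
    ring.send_right(15, 0x0000_4000)
    print(hex(ring.read_from_left(0)))
```

In this model a COG can only reach its two neighbours, so which COG runs which task starts to matter.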
  • RossHRossH Posts: 5,479
    edited 2014-04-06 04:45
    Cluso suggested a simple pair of 32-bit registers linking adjacent COGs (and I guess logically wrapping COG 0 to COG 15) so that each COG could read and write to the COG on either side.

    This is a possibility, but personally I'd rather not have this solution. It means you have to worry about the geometry of the cogs a lot more than you do now. I'd rather have every cog capable of talking to every other cog, in a shared bus arrangement.

    Ross.
  • Bob Lawrence (VE1RLL)Bob Lawrence (VE1RLL) Posts: 1,720
    edited 2014-04-06 05:10
    Maybe this has been suggested before but I haven't seen it.

    How about a settable hub access option per cog? Two options based on a number of cycles. Of course the options for the number of cycles are yet to be determined; however, as an example:


    option 1: Set cog to 4 cycle access - Hub access as soon as possible.
    option 2: Set cog to 8 cycle access. Hub access every second rotation.
  • Brian FairchildBrian Fairchild Posts: 549
    edited 2014-04-06 05:19
    Maybe this has been suggested before but I haven't seen it.

    Don't know if you've seen Bill's suggestion...
    In the other thread I suggested a 64 entry vector, of five bit fields, indexed by a 6 bit counter running from the 200Mhz clock for assigning hub cycles.

    Default would be 0-31,0-31 (cogid) for who gets the slot, but it could be re-programmed for deterministic timing at whatever grain needed.
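To make the table idea above concrete, here is a minimal sketch in Python of a 64-entry slot table holding 5-bit cog IDs and indexed by a 6-bit counter, with the default 0-31,0-31 pattern. The helper names (`default_table`, `slot_owner`, `slot_share`, `reassign`) and the cog numbers in the example are invented for illustration, and the model only counts slots; it says nothing about real timing or how the table would be programmed.

```python
from collections import Counter

TABLE_SIZE = 64   # entries, indexed by a 6-bit counter
N_COGS = 32       # each entry is a 5-bit cog ID


def default_table():
    """Default allocation: cog IDs 0-31 followed by 0-31 again."""
    return [slot % N_COGS for slot in range(TABLE_SIZE)]


def slot_owner(table, clock_count):
    """Cog that owns the hub on this clock (counter wraps at 6 bits)."""
    return table[clock_count & (TABLE_SIZE - 1)]


def slot_share(table):
    """Fraction of all hub slots owned by each cog in the table."""
    counts = Counter(table)
    return {cog: counts[cog] / len(table) for cog in sorted(counts)}


def reassign(table, from_cog, to_cog, keep=1):
    """Return a new table where from_cog keeps `keep` slots and
    hands the rest of its slots to to_cog."""
    new_table = list(table)
    kept = 0
    for index, owner in enumerate(new_table):
        if owner == from_cog:
            if kept < keep:
                kept += 1
            else:
                new_table[index] = to_cog
    return new_table


if __name__ == "__main__":
    table = default_table()
    # Hypothetical roles: cog 5 is a slow keyboard driver, cog 0 is video.
    table = reassign(table, from_cog=5, to_cog=0, keep=1)
    shares = slot_share(table)
    print("keyboard cog share:", shares[5])
    print("video cog share:", shares[0])
```

The table stays fully deterministic because the same 64-slot pattern repeats forever. Where a cog's slots fall within the pattern affects its worst-case wait, which this sketch does not model.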
  • localrogerlocalroger Posts: 3,452
    edited 2014-04-06 06:11
    Why I particularly favor the P1E if we can get it faster and cheaper:

    The main barrier to using external static RAM on P1 is that it takes all the pins. With 64 I/O you have enough pins to use external static RAM and still have pins for I/O. This gives you the possibility of large video buffers and large fast business logic program storage along with the kind of mixed I/O we're used to having now. Sure it's not as fast as P2 would be, but it would be a lot faster than anything we can manage now.

    16 cogs might be too many but Chip has said they are cheap, and we do tend to find uses for stuff. With the speed not being up to P2 we will probably still need things like multi-cog video drivers, service cogs to access that external RAM, and so on. And while it's nice that you can implement a serial port and a fast IIC driver in the same cog (I've done it out of necessity) it's also nice just to be able to do both functions the easy way because there are plenty of cogs. That is an efficiency the Amdahl standard does not measure, and it's very useful.
  • Dave HeinDave Hein Posts: 6,347
    edited 2014-04-06 07:46
    jmg wrote: »
    That 4 COG ~4 watts is for 180MHz, and 100% usage, so a 50% typical Power Envelope control will meet ~2 Watts.
    A perhaps even more typical Power Profile setting of
    180MHz on 1 COG, 180MHz/2 on 1 COG, 180MHz/4 on 1 COG, 180MHz/8 on 1 COG
    I make as ~1.909W

    The same usage profiles at 100MHz gives ~ 0.7277W


    The 100MHz case is comfortably under ~2W (4 COG) at 100% Power envelope, and I make the
    100% & ~2W point appx 120MHz. on all COGs, so parallax could spec this as 120 MOP / COG continuous capable, or maybe ~500 MOP total ?
    jmg, I agree with your comments. So a question for the other people that are promoting a P1+. If a 4-cog P2 is more capable than any P1+ we could envisage why are we even talking about a P1+? Any time wasted on discussing a P1+ is just delaying the completion of the P2.
  • Brian FairchildBrian Fairchild Posts: 549
    edited 2014-04-06 07:56
    Dave Hein wrote: »
    If a 4-cog P2 is more capable than any P1+ we could envisage why are we even talking about a P1+?
    Because raw numbers tell only part of the story. And if raw numbers are to be the driver then the P2 falls short of many other devices out there.

    For me, the biggest single draw of a P1+ is its simplicity. The last time I looked the P2 has 496 instructions and it's still not finished. The P1 has 83. Add a few more to round out the P1 and it's got maybe 100 tops.

    The second big draw to a P1+ is COG count. COGs for me are a means to an end. They're the bits that in other processors I don't need to think about. In your P2 I get 4 of them; in the P1+ as envisaged I get 16 of them. So that's 4x better isn't it?
  • jazzedjazzed Posts: 11,803
    edited 2014-04-06 08:42
    Dave Hein wrote: »
    So a question for the other people that are promoting a P1+. If a 4-cog P2 is more capable than any P1+ we could envisage why are we even talking about a P1+? Any time wasted on discussing a P1+ is just delaying the completion of the P2.
    Maybe the "current" P2 is not the P2 they wanted in the first place?

    A bigger P1 with the few things that customers asked for according to Ken's list seems good enough. It is likely that a lot of people may feel that they will never master the P2 instruction set or the threading that it takes to make 1 COG behave like 4.

    The P1 has a certain simplistic beauty that is hard to beat and many proppeople like it for that. It is slim, clean, and does the job it was intended to do instead of getting in the way.

    I like several P2 features, but the full symmetry philosophy with all the advanced features is really uncalled for. I'm sorry but, not every pin needs an ADC/DAC (just the simple things), and not every COG needs CORDIC, video generator, and hubexec.

    Chip has been on another great adventure, and he certainly gained a new perspective on P1 after the trip through the P2 design. Some of the advanced ideas may be leveraged in the end - after all it was hard won and expensive knowledge; it should be selectively applied somewhere at least. But going hog wild again won't get it done.
  • John AbshierJohn Abshier Posts: 1,116
    edited 2014-04-06 09:01
    Yes, I think you have summarised it very well. I would also add that those who just want a micro-controller want the P16X32B. Those who want something that can also pinch-hit as micro-processor want the P2.

    My thoughts exactly. Let me expand on running out of cogs

    Cog 1 - Main program
    Cog 2 - Motor driver
    Cog 3 - Encoder
    Cog 4 - Full duplex serial 4 port
    Cog 5 - Telemetry
    Cog 6 - ADC chip driver for analog sensors

    So far that is 75% of cogs and no cogs dedicated to sensors. Often my sensor cog not only reads the sensors but filters their output. Ultrasonic sensors often need a cog because of the relatively long time waiting for an echo return.

    Cog 7 - floating point (I know, I know, I don't understand the problem, but I was nursed on FORTRAN.)

    Reference reading multiple sensors in one cog. I do some of that. Depending on the sensors and required refresh rate, it can make the programming difficult. I have been spoilt by the Propeller. Coming to the Propeller from GCC and AVR was refreshing.

    John Abshier
  • Heater.Heater. Posts: 21,230
    edited 2014-04-06 09:04
    Jazzed,

    Well put.

    Still on the fence here. I love the KISS of the P1 style. I love raw performance of the PII style and some newer features. I'm nervous about the PII style complexity. 500 instructions is nuts. I'd hate to have only 4 COGS.

    What worries me is that we, well I mean Chip, may now have done the research required to determine what can be done within the confines of the process and budget available, that has taken years, now we have to put time into the development of an actual device....
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-06 09:15
    Chip,

    Whatever chip you build, I'll use :)

    You know that.

    I do like the idea of a simpler, cheaper device for a lot of uses. P1E will be great for a lot of applications.

    I will miss 1080p and many other P2 features until P2+ shows up, but I am certain that P1E will be fun.

    At the risk of being flamed (AGAIN) - and as I said, I will use whatever you build - would the following fit?

    - 64 entry hub slot allocation table (128 would give even more control)
    - 32 simple P1 cogs
    - 512KB ram

    My biggest objections to P1 variants were:

    - only 25MB/sec hub bandwidth on the 16 cog version
    - hubexec being only barely faster than LMM
    - no nice hirez video
    - no fast SDRAM

    After reading the post I am replying to carefully, I thought that the features I will miss most can be somewhat eased by the 32 cog / 512KB / hub bandwidth controlled variation above.

    I know, 32 cogs means more reduced hub bandwidth per cog.

    But does it have to?

    One thing we learned from the Prop1 is that the vast majority of Obex drivers (excluding video) only need a tiny fraction of their potential hub bandwidth. Some of you may have noticed that I was pushing hungry/green/recycle/mooch/(name of the day) to recover otherwise unused cycles.

    With the hub slot allocation table above, if we need a high bandwidth cog, we assign it more slots.

    Ok, the implementation details limit how much a cog can use, but we should be able to increase it at least four fold usably for hub hungry cogs, by taking slots away from slow drivers via the slot allocation table.

    Without the slot allocation table, more than 16 cogs makes no sense.

    With the table... easy to load P1 objects from hubex, we just constrain their pipe to what bandwidth they really need, to give the unused bandwidth to cogs that need it (video, signal capture/generation etc)

    This way:

    - No need for a serial driver. Simple bit-banger per port, assign it only one hub cycle in 64.
    - Simple PS/2 keyboard driver, no need for combo driver. 1/64 hub
    ...

    Simplifies Obex, reduces resource contention.
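A minimal sketch of how such a slot allocation table might divide hub access, written in Python purely for illustration. The table size (64) and cog count (32) come from the proposal above; the function names, the specific cog assignments, and the simple sequential fill order are assumptions, and a real table would also care about how a cog's slots are spaced for latency.

```python
from fractions import Fraction

TABLE_SIZE = 64  # entries in the proposed hub slot allocation table
N_COGS = 32      # cog count from the proposal


def build_slot_table(slots_per_cog, table_size=TABLE_SIZE):
    """Return a list of table_size entries, each a cog id or None (idle slot).

    slots_per_cog maps a cog id to the number of hub slots it is granted.
    Slots are filled sequentially; spacing for latency is ignored here.
    """
    needed = sum(slots_per_cog.values())
    if needed > table_size:
        raise ValueError(f"{needed} slots requested, only {table_size} available")
    table = []
    for cog, count in slots_per_cog.items():
        table.extend([cog] * count)
    table.extend([None] * (table_size - needed))
    return table


def hub_share(table, cog):
    """Fraction of all hub slots owned by the given cog."""
    return Fraction(table.count(cog), len(table))


# Start from an even split: 64 slots / 32 cogs = 2 slots each.
requests = {cog: 2 for cog in range(N_COGS)}

# Hypothetical assignment: cog 0 is a video driver, cogs 1-6 are slow
# drivers (bit-banged serial, PS/2 keyboard, ...) that need 1 slot in 64.
for slow_cog in range(1, 7):
    requests[slow_cog] = 1          # frees one slot per slow driver
requests[0] = 2 + 6                 # video cog picks up the six freed slots

table = build_slot_table(requests)
even_share = Fraction(2, TABLE_SIZE)

print("video cog share:", hub_share(table, 0))
print("slow driver share:", hub_share(table, 1))
print("video vs even split:", hub_share(table, 0) / even_share)
```

In this made-up allocation the video cog ends up with four times the even-split share while each slow driver keeps one slot in 64, which is the kind of trade described above.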

    Satisfies the "Cogs must be the same as P1" crowd.
    cgracey wrote: »
    I don't know about it being too messy, but the Prop1 cog is almost nothing. It's so cheap to build that you can do things like WAITCNT without having to worry about anything else, especially if you've got lots of cogs.

    I estimate that a two-clock Prop1 cog, with its memory, would take about 0.37 square mm of area in this 180nm process, whereas a current Prop2 cog with its memories needs about 4.2 square mm, based on OnSemi's last numbers. That means 16 of these Prop1 cogs would take the same area as 1.4 Prop2 cogs. OnSemi estimates a Prop2 cog to take about 850mW at 160MHz. If we scale that by 1.4 for 16 Prop1 cogs, we get 1.19W. With our clock a little faster at 200MHz and our ALU settling over two clocks, we can probably take 1/4 off that, giving us ~900mW. Add 100mW for the 256KB hub RAM and we are burning 1W - for 1,600 MIPS. The exposed-pad TQFP package has a Tja of ~20, so we could operate in an ambient temperature environment of 125C and have a die temperature of under 145C, which would support this speed. This would be a viable chip.
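The arithmetic in the estimate above can be re-traced with a short Python sketch. All input figures are the ones quoted in the post; reading "Tja of ~20" as a junction-to-ambient thermal resistance of about 20 °C/W, and treating a two-clock cog at 200MHz as 100 MIPS, are assumptions made here, and the variable names are illustrative only.

```python
# Figures quoted in the post (estimates, not measurements).
P1_COG_AREA_MM2 = 0.37      # two-clock Prop1 cog with memory, 180nm
P2_COG_AREA_MM2 = 4.2       # current Prop2 cog with memories
P2_COG_POWER_MW = 850.0     # OnSemi estimate at 160MHz
HUB_RAM_POWER_MW = 100.0    # 256KB hub RAM
N_COGS = 16
CLOCK_MHZ = 200
CLOCKS_PER_INSTRUCTION = 2  # assumption: "two-clock" cog
THETA_JA_C_PER_W = 20.0     # assumption: "Tja of ~20" means ~20 C/W
AMBIENT_C = 125.0

# Area of 16 Prop1 cogs, expressed in Prop2-cog equivalents.
area_ratio = N_COGS * P1_COG_AREA_MM2 / P2_COG_AREA_MM2

# The post rounds the ratio to 1.4 and scales Prop2 cog power by it.
cogs_power_mw = P2_COG_POWER_MW * 1.4

# "Take 1/4 off" for the two-clock ALU, then add the hub RAM.
reduced_power_mw = cogs_power_mw * 0.75
total_power_w = (reduced_power_mw + HUB_RAM_POWER_MW) / 1000.0

# Die temperature = ambient + thermal resistance * dissipated power.
die_temp_c = AMBIENT_C + THETA_JA_C_PER_W * total_power_w

total_mips = N_COGS * CLOCK_MHZ // CLOCKS_PER_INSTRUCTION

print(f"area ratio: {area_ratio:.2f} Prop2 cogs")
print(f"cog power before reduction: {cogs_power_mw:.0f} mW")
print(f"total power: {total_power_w:.3f} W")
print(f"die temperature: {die_temp_c:.1f} C")
print(f"total MIPS: {total_mips}")
```

Under these assumptions the numbers land where the post says: roughly 1.4 Prop2 cogs of area, about 1W total, and a die temperature just under 145C.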

    I would probably add a 16x16 multiplier to the cogs to give them a math boost.

    One other thing: By keeping the hub memory data busses at just 32 bits (as opposed to 256 like in Prop2), we avoid tons of mux'ing everywhere, but at the expense of hub exec. To keep power down, we'd want just one hub RAM turning on at once, anyway - not eight!
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-06 09:20
    Baggers,

    LOL! I know you don't bait.

    ONE of the many uses I want is stand-alone HMI nodes and stand-alone dev stations.

    There is nothing like being able to easily patch a customer's controller right in the field, without carrying a lot of specialized programming gear.

    Others are high resolution graphics, fast arbitrary signal generation/capture, etc., at performance levels not reachable by P1 variants.

    I listed why I wanted P2 a bit earlier in this thread; mostly it is because it allows me to do far more than any P1 variant. That costs gates and watts.

    That does not mean I won't use P1 variants - that's plain silly - until P2 shows up.
    Baggers wrote: »
    This post may sound like it's baiting an argument, but it's not really.
    But... Bill, I have to say this, as I've quietly read the banter from you guys, and for someone who has been calling for technical proof in arguments, I now feel I have to state a few things. :)

    Yes, whilst I agree the new P2 is an awesome beast, and that Parallax do need to get this out to the public, I can't help but think the P16X32B is, in my honest opinion, the best route!

    Here are my reasons!

    It looks like Bill is wanting a stand alone computer out of his next prop design, and I'll admit, I'd like to do a stand alone computer with the next Prop chip, whichever one comes out too!

    But looking at the facts.
    P2 computer cog usage
    Cog 1, TV/VGA driver + keyboard + mouse + SD driver
    Cog 2, SDRAM driver for max performance
    Cog 3, GPU to assist feeding TV driver, Audio only taking a slight portion of the cog speed, leaving most for GPU
    Cog 4, CPU single thread for max performance

    Now although that sounds a good setup, here's the P16X32B setup

    Cog 1, TV/VGA driver
    Cog 2, Keyboard + mouse driver
    Cog 3, SD driver for max performance
    Cog 4-8, GPU assist for TV/VGA driver (bearing in mind these are running 5* faster than a P1 cog, and the GPU side of the PropGFX only had 6 cogs running, whereas this will effectively have 20 P1 cogs running).
    Cog 9 SDRAM driver if you want extra RAM
    Cog 10 Audio driver, for a couple of SID chips and Chip's speech driver, or whatever audio driver you want, sample player, mod tracker
    Cog 11 CPU, agreed it doesn't have cordic etc, but as Bill is looking at Z80, that doesn't have any maths either! plus emulating VMs is inherently slow compared to just plain PASM.
    Cog 12-16 is 5 extra COGS running at 100 MIPS, that you can throw at anything!
    Say you wanted your computer to have a camera, one could read the data from the camera and feed it to the GPUs.
    Maybe you could throw 1 ( or a few ) at image recognition / motion sensing / face recognition / voice recognition.
    You could even throw another CPU in, like the old arcade games that used to have 3 Z80s in them, etc.

    So as you can see, the Cogs in a P16X32B may not be as awesome with all the features of the P2, but its WOW factor is in the number of 100 MIPS cogs that are available to throw at anything you want to throw them at.
  • BaggersBaggers Posts: 3,019
    edited 2014-04-06 10:07
    It's funny, as much as I do love the beefed up P16X32B idea, the more I think about it, the more it makes me think: why not put 8 P2 cogs in the 180nm P2? Why not just rate it to a clock speed of 50-80MHz but let people who know and are willing to use the full potential overclock it to what it is totally capable of!
    Cos in real world apps, even with an 8 Cog app, I can't see it burning up what the worst case scenario is in watts, but should they do so, they know they've done it by overclocking and will have taken precautionary measures, heat sink etc, whatever means necessary to keep it from melting the board/chip :)

    What I do know though, is that I feel sorry for Chip and Ken, having to make this decision of which way to go!

    Especially with most P2 dev on stand-by until a final decision is made.

    I know whatever chip they release we will all enjoy using, as even if it's a 4cog P2, there's nothing stopping us just using an extra chip in our designs! especially with Chip's fast inter prop comms!
  • GordonMcCombGordonMcComb Posts: 3,366
    edited 2014-04-06 10:53
    W9GFO wrote: »
    I think I am seeing a pattern here. I know I am generalizing but it looks like the people that tend to use the Propeller to control robotic type things are in favor of a P1 variant whilst those that are more into fancy, complex* programming are in favor of the P2.

    I'd tend to agree with this general assessment.

    In the end it has to come down to rational pragmatism, not only in the interest of positive cash flow, but in how the chips are marketed. Every product is the result of compromise. But if the P2 is to be whittled down in order to make it usable, there is a very real risk of making it second-best in the eyes of a buying public. Not all customers have been involved in the long discussion of the chip's development.

    As a simple example, a side-by-side comparison will show the newly envisioned P2 as having half the cores of the P1. How much extra work will it take to educate the buyer that the P2 is actually the more powerful processor? We shouldn't underestimate that buyers tend to make snap judgments based on only summary information.

    Assuming the numbers work for Parallax, having another product in the P1 chain will hopefully serve to generate income that will in turn open better alternatives for getting the P2 out the door, with features that truly show its generational growth. This is not an either/or situation. The world can function with both a P1+ and P2.
  • Tracy AllenTracy Allen Posts: 6,664
    edited 2014-04-06 11:07
    I haven't been following the P2 development closely. Chip told me early on that it would not allow micro-power operation. Sorry, if it can't come in below 20 µA static drain with awareness of outside events as in RCslow, it is of little use for most of my projects, and for my customers.

    That said, I'd happily contribute $$$ to help fund the intermediate chip, whatever that might be. (I'd pay considerably more for an equity stake in Parallax. I'm counting on them to have a future!) Conversation at the Gracey family picnics must be quite interesting. I'd want Chip to do whatever he finds most exciting and for Ken to have something that advances the bread and butter business and the education program, and for Chuck Sr. to see the happy medium. Symmetry, simplicity, please. Under those terms, even with higher current, I'd want to try out the product as given and explore its features. New applications.

    A design I did, now in production, uses 2 P8X32s, one to handle data acquisition/control and one to handle communications. It could possibly be done with a single P1B with twice the number of pins. But just thinking about it, the potential board layout gives me a headache. The DAQ and Comms functions are logically and physically quite separate, and two lines for inter-prop communication works fine. A major requirement is low static current.