Propeller II update - BLOG - Page 111 — Parallax Forums

Propeller II update - BLOG


Comments

  • cgracey Posts: 14,155
    edited 2013-12-01 15:51
    Cluso99 wrote: »
    Thanks for the link Chip - you brightened up my day!

    It seems that their expectations are wide open out there.
  • cgracey Posts: 14,155
    edited 2013-12-01 16:04
    Regarding Bill's question about what can fit into 10 square more millimeters of silicon...

    1) Almost another 128KB of single-port hub RAM, but not quite. I need to talk to Beau about this.

    2) Double the logic of the last tapeout... WAIT A MINUTE... what if we added 4 more cogs (with their associated memories). That might comfortably fit!!!

    WHAT ABOUT FOUR MORE COGS ????

    It would mean:
    1:12 hub timing - LMM would work 1:3 with RDQUAD
    200MHz*12 = 2400 MIPS
    Fast DAC updates per 8 pins, instead of 12 pins
    3 watts?
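
    As a rough check of the figures above, here is a small Python sketch. It assumes one hub slot per cog per rotation, one instruction per clock at 200 MHz, and that each RDQUAD delivers four longs (four LMM instructions) per hub window. The function and variable names are illustrative only.

```python
from fractions import Fraction

CLOCK_MHZ = 200        # target clock quoted in the post
LONGS_PER_RDQUAD = 4   # assumption: a quad read fetches four 32-bit longs


def hub_figures(cogs: int) -> dict:
    """Back-of-envelope numbers for a round-robin hub with one slot per cog."""
    hub_window = cogs  # clocks between successive hub slots for one cog
    return {
        "hub_timing": f"1:{hub_window}",
        "peak_mips": CLOCK_MHZ * cogs,  # assumes one instruction per clock per cog
        "lmm_rate": Fraction(LONGS_PER_RDQUAD, hub_window),  # instructions per clock
    }


for cogs in (8, 12):
    figures = hub_figures(cogs)
    print(cogs, figures["hub_timing"], figures["peak_mips"], figures["lmm_rate"])
```

    With 12 cogs this gives 1:12 hub timing, 2400 MIPS peak, and an LMM rate of 1/3, against 1/2 for 8 cogs.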
  • Kerry S Posts: 163
    edited 2013-12-01 16:08
    Would it be possible to have Cog 0 be a special case where it can use any free (unclaimed) hub access so that we can try to push the interpreter performance? How much of a difference would it make?
  • Bill Henning Posts: 6,445
    edited 2013-12-01 16:13
    Oh what a choice!

    I think we are about to have two camps, reminiscent of discussions a couple of years ago...

    A) 12 cog / 126KB hub camp

    B) 8 cog / <256KB hub camp

    I think I am leaning towards (A)

    but could we have 1:16 hub slots?

    12 slots for the 12 cogs

    4 spare slots to be allocated to cogs that need more bandwidth

    This means four cogs could be 1:8, with 8 cogs at 1:16

    Or two cogs at 3:16, with 10 at 1:16...

    Or one cog at 5:16, with 11 at 1:16
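
    The three splits above can be checked with a short Python sketch. It is only an illustration of the bookkeeping: the dictionary format (slots per cog mapped to the number of cogs with that share) and the function names are made up for this example, not a proposed hardware interface.

```python
TOTAL_SLOTS = 16  # hub slots per rotation in the 1:16 proposal
COGS = 12         # cogs in the 12-cog option


def slots_used(allocation: dict) -> int:
    """allocation maps slots-per-cog -> number of cogs with that share."""
    return sum(slots * count for slots, count in allocation.items())


def is_valid(allocation: dict) -> bool:
    """True if every cog is covered and exactly all 16 slots are handed out."""
    return sum(allocation.values()) == COGS and slots_used(allocation) == TOTAL_SLOTS


examples = {
    "four cogs at 1:8, eight at 1:16": {2: 4, 1: 8},
    "two cogs at 3:16, ten at 1:16": {3: 2, 1: 10},
    "one cog at 5:16, eleven at 1:16": {5: 1, 1: 11},
}

for label, allocation in examples.items():
    print(label, is_valid(allocation))
```

    Each split accounts for all 12 cogs and all 16 slots.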

    cgracey wrote: »
    Regarding Bill's question about what can fit into 10 square more millimeters of silicon...

    1) Almost another 128KB of single-port hub RAM, but not quite. I need to talk to Beau about this.

    2) Double the logic of the last tapeout... WAIT A MINUTE... what if we added 4 more cogs (with their associated memories). That might comfortably fit!!!

    WHAT ABOUT FOUR MORE COGS ????

    It would mean:
    1:12 hub timing - LMM would work 1:3 with RDQUAD
    200MHz*12 = 2400 MIPS
    Fast DAC updates per 8 pins, instead of 12 pins
    3 watts?
  • Cluso99 Posts: 18,069
    edited 2013-12-01 16:18
    cgracey wrote: »
    Regarding Bill's question about what can fit into 10 square more millimeters of silicon...

    1) Almost another 128KB of single-port hub RAM, but not quite. I need to talk to Beau about this.

    2) Double the logic of the last tapeout... WAIT A MINUTE... what if we added 4 more cogs (with their associated memories). That might comfortably fit!!!

    WHAT ABOUT FOUR MORE COGS ????

    It would mean:
    1:12 hub timing - LMM would work 1:3 with RDQUAD
    200MHz*12 = 2400 MIPS
    Fast DAC updates per 8 pins, instead of 12 pins
    3 watts?

    WHAT ABOUT FOUR MORE COGS ????
    Wait just a bit, have to wipe down my screen - that's what I get for reading this with a mouthful of coffee ;)
    Are you serious???

    256KB Hub or 12 Cogs - What a dilemma? Need time to think.

    I had been thinking about a small block of memory at the center of the die for all cogs to access. I thought 16 * 32+1 bits (now 32+2). Bill would prefer 32*.
    Then I thought each cog has a write block of 16* or 32* 32+2, and all cogs can read, one at a time sequentially (no determinism).

    But your options have me stumped. More coffee required (well actually I am worse than that - I drink real coke)
  • Cluso99 Posts: 18,069
    edited 2013-12-01 16:20
    Glad to see the pace is slowing. The last 4 hours have been chaotic!
  • User Name Posts: 1,451
    edited 2013-12-01 16:21
    Heater. wrote: »
    If such users are no longer a concern then go ahead make it as complicated and impenetrable as you like.

    I don't think the slot sharing is as impenetrable as you make it out to be. It isn't half as impenetrable to me as is your multithreading idea. And to make it perfectly clear, I don't wish to toss out either of them!

    Heater. wrote: »
    Sadly I could argue that that is also true of the Prop II as it stands.

    I dispute this. But even if it were true, that's all the more reason not to pass up a virtual freebie like slot sharing.

    I will also say that the quickest way to end up with a Homermobile is to attempt to idiotproof a simple and clean idea like Ray/Chip/Bill/CW/et al. have proposed.
  • ctwardell Posts: 1,716
    edited 2013-12-01 16:28
    cgracey wrote: »
    Regarding Bill's question about what can fit into 10 square more millimeters of silicon...

    1) Almost another 128KB of single-port hub RAM, but not quite. I need to talk to Beau about this.

    2) Double the logic of the last tapeout... WAIT A MINUTE... what if we added 4 more cogs (with their associated memories). That might comfortably fit!!!

    WHAT ABOUT FOUR MORE COGS ????

    It would mean:
    1:12 hub timing - LMM would work 1:3 with RDQUAD
    200MHz*12 = 2400 MIPS
    Fast DAC updates per 8 pins, instead of 12 pins
    3 watts?

    Is this just to throw Ken off track...

    C.W.
  • Bill Henning Posts: 6,445
    edited 2013-12-01 16:31
    Chip,

    what would be the easiest, least time consuming, least risky use of the newly freed up area?

    I think that is the path to be taken, for this first P2.

    Whatever that path is - be it more cogs, more hub, more aux - I am sure we will find uses for it :)
  • jmg Posts: 15,173
    edited 2013-12-01 16:38
    Bill Henning wrote: »
    Oh what a choice!

    I think we are about to have two camps, reminiscent of discussions a couple of years ago...

    A) 12 cog / 126KB hub camp

    B) 8 cog / <256KB hub camp

    I think I am leaning towards (A)

    but could we have 1:16 hub slots?

    12 slots for the 12 cogs

    4 spare slots to be allocated to cogs that need more bandwidth

    This means four cogs could be 1:8, with 8 cogs at 1:16

    Or two cogs at 3:16, with 10 at 1:16...

    Or one cog at 5:16, with 11 at 1:16

    I would agree on bandwidth-flexible slots, needed for any # COGS.

    12 COGS ?? - remember we now have Multi-tasking, so 8 is already the new 12 +...

    Need some examples of what can be done with 12.MT, that cannot be done with 8.MT ?

    Examples of what can be done with 256k, that cannot be done with 128k might be easier to find ?

    That's more fonts for a start, and more Display List..., or might now make JavaScript fit, or many more....

    More important than extra COGS would be HW QuadSPI support, so cheap FLASH can better feed the COGS we have.
  • potatohead Posts: 10,261
    edited 2013-12-01 16:41
    That is my question:

    4 more with a fleshed out SERDES, or with the one that is there now?
  • rabaggett Posts: 96
    edited 2013-12-01 16:46
    User Name wrote: »
    I will also say that the quickest way to end up with a Homermobile is to attempt to idiotproof a simple and clean idea like Ray/Chip/Bill/CW/et.al. have proposed.

    AMEN!
    It's already good style to use variables or constants to designate I/O in objects so they don't conflict, and it's the user's responsibility to avoid those conflicts. This is widely accepted.

    It would be good style to use variables when setting up cog slot allocations so they don't conflict, and it would be the user's responsibility to avoid those conflicts, whether by judicious selection of the cog in which the object should run, by using a variable to select the options, or by some combination of these.
  • Bill Henning Posts: 6,445
    edited 2013-12-01 16:46
    I'd be perfectly happy with bigger hub, and 8 cogs, but since you asked for examples:

    Advantages to cogs:

    - more raw computational power
    - due to new multipliers and dividers, greater potential to enter DSP space
    - more counters
    - more uart / serdes modules
    - more high bandwidth video / signal outputs
    - more precise deterministic timing for I/O's

    Advantages to more hub: (in addition to what you stated)

    - larger LMM code without SDRAM
    - more buffers (for bitmaps, SD sectors, data gathering, etc)
    - more bandwidth per cog (due to fewer cogs)

    Regarding QSPI support:

    Chip seems to suggest we will get 200MHz (woohoo!)

    With the new nibble instructions, and counters, we can handle 100MHz QSPI in software (to AUX/cog memory)

    Fastest QSPI flash I've seen is 104MHz.
    jmg wrote: »
    I would agree on bandwidth-flexible slots, needed for any # COGS.

    12 COGS ?? - remember we now have Multi-tasking, so 8 is already the new 12 +...

    Need some examples of what can be done with 12.MT, that cannot be done with 8.MT ?

    Examples of what can be done with 256k, that cannot be done with 128k might be easier to find ?

    That's more fonts for a start, and more Display List..., or might now make JavaScript fit, or many more....

    More important than extra COGS would be HW QuadSPI support, so cheap FLASH can better feed the COGS we have.
  • cgracey Posts: 14,155
    edited 2013-12-01 16:47
    ctwardell wrote: »
    Is this just to throw Ken off track...

    C.W.

    I'm serious!

    This is low-risk and it's the easiest way to make all that space useful. Changing to 12 cogs is simple. The biggest pain is remapping the COGINIT instruction, which is just a lot of busy work.
  • cgracey Posts: 14,155
    edited 2013-12-01 16:49
    potatohead wrote: »
    That is my question:

    4 more with a fleshed out SERDES, or with the one that is there now?

    With the new SERDES - it is not going to be many gates, at all.
  • cgracey Posts: 14,155
    edited 2013-12-01 16:57
    How come you guys want a QSPI SERDES so badly? I know you could access off-chip Flash quickly, but to what end? Would you dump it to hub for other cogs to use? Why all the keen interest in this feature?
  • Cluso99 Posts: 18,069
    edited 2013-12-01 16:58
    Chip,
    We have a 128bit bus to the hub. Only the r/w quad instructions can use this wide bus and that limits us to the quad buffer.
    The read instructions effectively take 2 clocks to setup.

    So for max hub to cog block transfers we would perform (I know we can just do n*4 and one RDLONGC but it illustrates the point better)...
    REPS #n,#4 'n=no of quad longs to transfer
    NOP
    RDLONGC INDA++,PTRA++ ' 1st time syncs to hub (3+ clocks); others 3+ clocks
    RDLONGC INDA++,PTRA++ ' 1 clock
    RDLONGC INDA++,PTRA++ ' 1 clock
    RDLONGC INDA++,PTRA++ ' 1 clock
    So this means in the loop we could execute 2 additional 1 clock instructions???

    Or we could do this...
    REPS #n,#5 'n=no of quad longs to transfer
    NOP
    RDQUADC PTRA ' 1+ clocks (first time 1..8 clocks)
    RDLONGC INDA++,PTRA++ ' 1 clock
    RDLONGC INDA++,PTRA++ ' 1 clock
    RDLONGC INDA++,PTRA++ ' 1 clock
    RDLONGC INDA++,PTRA++ ' 1 clock
    So this means in the loop we could execute 3 additional 1 clock instructions???

    How do we read into AUX from HUB and how quickly can that be done?
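
    The spare-instruction counts asked about above follow from a simple clock budget, sketched here in Python. The sketch assumes an 8-clock hub window (1:8 hub timing with 8 cogs) and takes the per-instruction clock counts from the comments in the two loops at their minimum values. Whether the hardware really behaves this way is exactly the open question, so treat the numbers as a sanity check rather than a timing guarantee.

```python
def spare_clocks(instruction_clocks, hub_window=8):
    """Clocks left in one hub window after the listed instructions have run."""
    used = sum(instruction_clocks)
    if used > hub_window:
        raise ValueError("loop body does not fit in one hub window")
    return hub_window - used


# First loop: one RDLONGC that waits on the hub (3 clocks) plus three 1-clock RDLONGCs.
rdlongc_loop = [3, 1, 1, 1]

# Second loop: RDQUADC (1 clock once in sync) plus four 1-clock RDLONGCs.
rdquadc_loop = [1, 1, 1, 1, 1]

print(spare_clocks(rdlongc_loop))   # spare single-clock slots in the first loop
print(spare_clocks(rdquadc_loop))   # spare single-clock slots in the second loop
print(spare_clocks(rdquadc_loop, hub_window=12))  # same loop under 1:12 hub timing
```

    Under these assumptions the first loop leaves 2 spare clocks and the second leaves 3, matching the guesses above. With a 12-clock window the second loop would leave 7.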
  • ozpropdev Posts: 2,792
    edited 2013-12-01 16:59
    cgracey wrote: »
    I'm serious!

    This is low-risk and it's the easiest way to make all that space useful. Changing to 12 cogs is simple. The biggest pain is remapping the COGINIT instruction, which is just a lot of busy work.

    12 Cogs!
    I think I just had my first (and hopefully last) heart attack.

    I was thinking the extra die space might be useful as extra AUX ram to expand SETRACE capacity.
    I'm scrapping that idea now.

    I like Bill's 16 time slot model too.

    You the man Chip!
    Happy days :)
  • Bill Henning Posts: 6,445
    edited 2013-12-01 17:00
    If this is the low risk path... take it!

    Just think of the advertising advantages:

    "Introducing the Parallax Propeller 2"

    - up to 2,400MIPS of processing power
    - twelve 32 bit cores, each with MAC's, CORDIC engines, each with up to four threads of execution
    - 24x 32bit timer/counters
    - 24x UARTs
    - 24x SPI ports (serdes)
    - 12x vga/component/ntsc/pal video or signal generation ports
    - 48x 9 bit high speed (200MHz) DAC's
    - 92x 9/18 bit low speed (12.5MHz) DAC's
    - 92x 9/18 bit Sigma-Delta ADC's, up to XXMHz at 10 bits
    - 92x digital I/O

    Obviously not all ports at the same time, but people are used to that.

    All of a sudden, to those who want "hard" peripherals, there is quite a list

    Marketdroids delight...
    cgracey wrote: »
    I'm serious!

    This is low-risk and it's the easiest way to make all that space useful. Changing to 12 cogs is simple. The biggest pain is remapping the COGINIT instruction, which is just a lot of busy work.
  • Bill Henning Posts: 6,445
    edited 2013-12-01 17:02
    I think a tight software loop would be able to max out QSPI since you added the nibble instructions.
    cgracey wrote: »
    How come you guys want a QSPI SERDES so badly? I know you could access off-chip Flash quickly, but to what end? Would you dump it to hub for other cogs to use? Why all the keen interest in this feature?
  • Cluso99 Posts: 18,069
    edited 2013-12-01 17:06
    cgracey wrote: »
    How come you guys want a QSPI SERDES so badly? I know you could access off-chip Flash quickly, but to what end? Would you dump it to hub for other cogs to use? Why all the keen interest in this feature?
    I don't know. If I want more memory, SDRAM or SRAM is the WTG. And you (Parallax) will have a module with SDRAM.

    I presume we can use x16 as well as x8 or x32 SDRAM with the hw?
  • jmg Posts: 15,173
    edited 2013-12-01 17:06
    Bill Henning wrote: »
    I'd be perfectly happy with bigger hub, and 8 cogs, but since you asked for examples:

    Advantages to cogs:

    - more raw computational power

    Only in total; the per-cog average hub bandwidth is actually slightly less.

    Bill Henning wrote: »
    - due to new multipliers and dividers, greater potential to enter DSP space

    New as in 4 more sets? This still needs dispersed code, so I'm less sure this is a real market target.

    Bill Henning wrote: »
    - more counters
    - more uart / serdes modules

    Always good to have, but we are still waiting on exactly what is in the new counters. I think they have already doubled?

    Bill Henning wrote: »
    - more high bandwidth video / signal outputs

    The current design can have 32 high-bandwidth video lines (4x8), and even the slower DAC pathway was looking fine for ATE usages.

    Bill Henning wrote: »
    - more precise deterministic timing for I/O's

    Not following this. fSYS is the same, so ns per clock is the same?

    I'll add another advantage to cogs:
    - Can allocate 8 COGS for SW and 4 COGS for intelligent IO and peripherals

    Bill Henning wrote: »
    Regarding QSPI support:
    Chip seems to suggest we will get 200MHz (woohoo!)
    With the new nibble instructions, and counters, we can handle 100MHz QSPI in software (to AUX/cog memory)
    Fastest QSPI flash I've seen is 104MHz.

    QSPI in hardware will allow the expensive COGS to be used for real SW work, not wiggling pins.
    Hopefully, it is in the improved SERDES mentioned.
  • ctwardell Posts: 1,716
    edited 2013-12-01 17:07
    If we go with 12 COGS and ~126K Hub we get around 10.5K of Hub per COG

    If we go with 8 COGS and ~254K Hub we get almost 32K of Hub per COG

    I think sticking with 8 COGS and increasing hub is a better balance of resources.

    C.W.
  • ctwardell Posts: 1,716
    edited 2013-12-01 17:12
    Bill Henning wrote: »
    - 12x vga/component/ntsc/pal video or signal generation ports
    - 48x 9 bit high speed (200MHz) DAC's
    - 92x 9/18 bit low speed (12.5MHz) DAC's
    - 92x 9/18 bit Sigma-Delta ADC's, up to XXMHz at 10 bits
    - 92x digital I/O

    Yeah, but the minute you add the SDRAM pretty much half of those items go 'poof' since the HS DAC's are pin locked.

    C.W.
  • Cluso99 Posts: 18,069
    edited 2013-12-01 17:16
    cgracey wrote: »
    I'm serious!

    This is low-risk and it's the easiest way to make all that space useful. Changing to 12 cogs is simple. The biggest pain is remapping the COGINIT instruction, which is just a lot of busy work.
    Sold!

    I like Bill's 1:16 method with 4 free.

    Do we have a little space to put a small block of ram between cogs?
  • jmg Posts: 15,173
    edited 2013-12-01 17:17
    cgracey wrote: »
    How come you guys want a QSPI SERDES so badly? I know you could access off-chip Flash quickly, but to what end? Would you dump it to hub for other cogs to use? Why all the keen interest in this feature?

    The appeal is to get closer to Execute-in-Place; using a whole COG just to wiggle pins wastes a huge amount of silicon.

    Best to let HW manage the BITs and the SW manage the Bytes/Words.

    A Prop is already relatively memory-starved, and QSPI Flash is cheap and available, so making the most of that bandwidth helps mitigate the Prop's memory issues.

    It is not just code fetch: things like font and icon fetch could be done in place for many display uses.

    As a reference, the FT800 has 256K RAM and > 282k of Fonts in ROM
  • Bill Henning Posts: 6,445
    edited 2013-12-01 17:19
    True, but when you enable the external bus on any microcontroller, a lot of pins go poof. Maybe a P4 will have a PGA512 package with pins for everything :)
  • jmg Posts: 15,173
    edited 2013-12-01 17:28
    ctwardell wrote: »
    If we go with 12 COGS and ~126K Hub we get around 10.5K of Hub per COG
    If we go with 8 COGS and ~254K Hub we get almost 32K of Hub per COG
    I think sticking with 8 COGS and increasing hub is a better balance of resources.

    There are also other combinations possible
      8 COG  ~256K    =>  32     kHR/COG
      9 COG  ~223K    =>  24.833 kHR/COG
     10 COG  ~191K    =>  19.1   kHR/COG
     11 COG  ~158.5K  =>  14.40  kHR/COG
     12 COG  ~125K    =>  10.5   kHR/COG
    

    10 COGs with 191k is also sounding like a good combination :)
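
    The per-cog figures in the table are simply hub size divided by cog count, which a few lines of Python can recompute. The hub sizes below are the approximate values from the table, so the results are approximate too; the names are illustrative.

```python
def kb_per_cog(hub_kb: float, cogs: int) -> float:
    """Hub RAM available per cog if it were shared out evenly."""
    return hub_kb / cogs


# (cogs, approximate hub size in KB) as listed in the table
options = [(8, 256), (9, 223), (10, 191), (11, 158.5), (12, 125)]

for cogs, hub_kb in options:
    print(f"{cogs:2d} COG  ~{hub_kb}K  =>  {kb_per_cog(hub_kb, cogs):.2f} KB/COG")
```

    Recomputed this way, the 9-cog and 12-cog rows come out slightly lower than listed (about 24.8 and 10.4). The 10.5 figure corresponds to the ~126K hub size quoted from ctwardell rather than ~125K.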
  • cgracey Posts: 14,155
    edited 2013-12-01 17:30
    The sad thing is, we don't have enough room to double the hub RAM from 128KB to 256KB. We probably have enough room for ~220KB total, if we don't add anything else. I remember Beau and I exploring this a while ago.

    I need to find a graphic of the die layout. I know there are several on this thread, but where? I'm in the house today, so I've only got a laptop with nothing special on it.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2013-12-01 17:32
    Cluso99 wrote:
    This is the same with the P2 and hub slots. Do you really want to limit what it is capable of ???
    Frankly, I don't even care anymore. I don't care how many hub slots a cog gets, how many cogs there are, or even whether multitasking is in the mix, for that matter. What I do care -- and care deeply about -- is seeing a company and the people who work there who do matter to me -- and that I depend upon to a large extent for my livelihood -- get dragged down by an overly expensive, insanely long development cycle that has no end in sight and that's open to way too much input from people who will have virtually no impact on the chip's ultimate sales. The P2's development has to be more than an expensive, seven-year-and-counting hobby, or before long we won't even have a P1 to talk about. How many more "just two hour" mods will it take, Chip? It's time to end this insanity and get the P2 out the door, while there still is a door.

    -Phil

    P.S. In case you didn't catch it, subtlety and diplomacy are not my strong suits. And I'll probably regret this in the morning. For the time being, though, it felt good to get it off my chest.