Full-chip integration at On Semi


Comments

  • jmg Posts: 14,493
    Cluso99 wrote: »
    With all the changes done at OnSemi to the pad frame, has the pad frame shuttle, due about now, been a waste?

    Interesting question, in the other thread, Chip has said this
    "The test chip is electrically equivalent to the pads in the final chip. There have just been a lot of metal routing improvements."

    So no, they are no longer truly identical, but hopefully the changes are all layout improvements, with gains in areas like ESD and noise floor, and the electrical operation of the test chip is still 100% valid.
  • jmg Posts: 14,493
    Roy Eltham wrote: »
    To clarify my stance.
    First and foremost, if things can work, as is, with 8 cogs, 512KB, and 160Mhz (the current plan/design), then get that done.

    However, my understanding of what Chip has shared so far is that in order to meet 160Mhz the added buffers/etc. cause it to be too big for the package. It's something around double the base logic size when adding the stuff needed to meet 160Mhz. Chip is going to have them see what the size would be if they target 120Mhz, and if the size shrinks enough then they might end up with room to add cogs and/or memory. So, in that case, my vote is more memory (even if it's not all the way to 1MB). I think it's less risk to go more memory than more cogs, because more cogs means changing the timings for the egg beater memory stuff, etc.

    My hope is that this first P2 version does well enough for Parallax that they can do other variants in the near future. Like within a couple years, instead of 12+ years.

    With no valid numbers yet on package, price, or MHz, any shift from 8 cogs/512 KB is mere speculation.
  • My vote would be 16 cogs too, ideally. It would make up for the lost hardware tasking of P2-Hot.

    But... it'd be a risky late change. And we can get >8 cogs, plenty of pins, and 135+ MHz from a P1V in a cheap FPGA, so 120 MHz feels too slow. If I were Chip, I'd just run with what we have and step in the direction of greatest demand after the first variant is made.
  • I think many of you forget that the 16-cog, 1 MB version is what was being tested until fairly recently, when it failed to fit into the A9 FPGA. IMHO, therefore, 16 cogs/1 MB has been verified, so there is no risk/work here other than a larger die.
  • The previous target was 16 cogs with 512 KB of hub RAM. It was reduced to 8 cogs because 16 cogs would not fit in the silicon. I don't believe the decision was made based on the A9 FPGA.
  • Cluso99,
    There were changes made after it was reduced to 8 cogs that were fixes to bugs caused by the timing differences with the egg beater. So going back to 16 cogs means that whatever was done in those changes will need to be redone/undone and tested. Also, there were a great deal of other changes post reduction to 8 cogs, that are untested at 16 cogs. I think it's safer to stay at 8 cogs.

  • jmg Posts: 14,493
    Roy Eltham wrote: »
    I think it's safer to stay at 8 cogs.

    This is Chip's file log for FPGA builds - UPDATED 1 January 2018 - Version 31
    			  smart
    	           cogs	  pins	RAM	Freq	CORDIC	Filename
    	         +---------------------------------------------------------------------------
    Prop123-A9       |  16	   8	 1024k	80MHz	Yes	Prop123_A9_Prop2_16cogs_v31.rbf
    Prop123-A9       |  8	  64	  512k	80MHz	Yes	Prop123_A9_Prop2_8cogs_v31.rbf **
    BeMicro-A9       |  16	   8	 1024k	80MHz	Yes	BeMicro_A9_Prop2_16cogs_v31.jic *
    BeMicro-A9       |  8	  64	  512k	80MHz	Yes	BeMicro_A9_Prop2_8cogs_v31.jic */**
    Prop123-A7       |  4	  38	  512k	80MHz	Yes	Prop123_A7_Prop2_v31.rbf
    DE2-115          |  4	  34	  256k	80MHz	Yes	DE2_115_Prop2_v31.pof *
    BeMicro-A2       |  1	   8	  128k	80MHz	No	BeMicro_A2_Prop2_v31.jic *
    DE0-Nano         |  1	   8	   32k	80MHz	No	DE0_Nano_Prop2_v31.jic
    DE0-Nano Bare    |  1	   8	   32k	80MHz	No	DE0_Nano_Bare_Prop2_v31.jic
    
    *  These images always map SD card pins {CSn,CLK,DO,DI} into P[61:58].
    ** These images represent the logic and memory that will be built in silicon.
    

    Looks like there is a 16 COG build of the latest v31 code, with just 8 smart pins, but I think the testing focus has been on 8 COGs, as that is the silicon target.
  • cgracey Posts: 13,076
    edited 2018-03-04 - 23:16:30
    I can't remember what bug popped up at 8 cogs, but I do remember that it was some sleeper issue whose fix could only help all other cases. It had something to do with COGID, if I recall. So, 16 cogs is not known to be risky in any way.

    Our original goal was 16 cogs and 512KB of hub. That got knocked down to 8 cogs.
  • 16 cogs would be neat because it would allow for lots of concurrent programs. 8 cogs are usually enough, but when you run out, you have to start thinking a lot harder about how to divide things, forcing early optimizations. It would be nice to never hit the limit, and 16 cogs would do it.
  • jmg Posts: 14,493
    cgracey wrote: »
    I know that without timing constraints, the design is amounting to 334k cells.
    To meet the 160MHz requirement, it's looking like 723k cells.
    Those extra cells are mainly buffers/inverters, but they increase the logic area from 11 sq mm to 16 sq mm.
    We need to evaluate what kind of savings can be had at 120MHz....

    Did OnSemi report 120MHz timing at that 334k cells, or is that some other estimated MHz value ?
    Did OnSemi indicate a 'usual range' for 'cells added to meet timing' ?
  • jmg wrote: »
    Did OnSemi report 120MHz timing at that 334k cells, or is that some other estimated MHz value ?
    Did OnSemi indicate a 'usual range' for 'cells added to meet timing' ?

    I don't know what timing goal the 334k cells were good for. It might have been functional, only.
  • Confession: Pretty much dumb when it comes to the P2 development.

    BUT if I understand this thread, isn't this the first possibility of the previously discussed "family" of P2 products?

    One with greater speed but fewer resources and vice-versa?

  • Mickster wrote: »
    BUT if I understand this thread, isn't this the first possibility of the previously discussed "family" of P2 products?

    One with greater speed but fewer resources and vice-versa?

    Exactly. The open question is which one will be the first member: 16 cogs, 1 MB hub, and 120 MHz, or 8 cogs, 512 KB hub, and 160 MHz.

    Even with the much-anticipated sales, it will take a while until the next version can be produced.

    In the end, business constraints will play out here, since the 16-cog version will be more expensive than the 8-cog version. It remains to be seen by how much.

    I think 16 cogs and 1 MB are far more attractive than 8 cogs and 512 KB, but maybe those customers keeping Parallax alive, buying more chips per order than I will need in my whole life, have a different opinion.

    @Chip and @Ken should ask their prime customers what they would like to see as a first family member and produce that version.

    It has to be sustainable to go forward, and by now this investment needs to bring in returns to do so.

    Mike
  • jmg Posts: 14,493
    cgracey wrote: »
    16 cogs would be neat because it would allow for lots of concurrent programs. 8 cogs are usually enough, but when you run out, you have to start thinking a lot harder about how to divide things, forcing early optimizations. It would be nice to never hit the limit, and 16 cogs would do it.

    Given that the area mentioned is only a ratio of 11:16, I wonder whether there is an intermediate solution of 12 COGs and perhaps 786,432 bytes?
    That's 12 memory tiles of 64 KB each and a 16-slot eggbeater, where 4 slots can be duplicated to give 4 COGs double the bandwidth.
  • jmg wrote: »
    cgracey wrote: »
    16 cogs would be neat because it would allow for lots of concurrent programs. 8 cogs are usually enough, but when you run out, you have to start thinking a lot harder about how to divide things, forcing early optimizations. It would be nice to never hit the limit, and 16 cogs would do it.

    Given the area mentioned is only a ratio of 11:16, I wonder is there an intermediate solution point of 12 COGs, & perhaps 786432 Bytes ?
    That's 12 memories of 64kB each tile, and a 16 slot eggbeater, where 4 slots can duplicate to give 4 COGs double the bandwidth.

    For the whole thing to work, cogs must be a power-of-two.
  • jmg Posts: 14,493
    cgracey wrote: »
    For the whole thing to work, cogs must be a power-of-two.
    Is that strictly true?
    I can see that slots need to be a power of two, and the memory mux uses the LSB 2 or 3 bits, but I think the 16 slots do not have to be 1:1 COG mapped.
    It seems they can be empty (same as a not-yet-started COG), or dual mapped.
    With unlimited space it is natural to have COGs = SLOTS, but with a finite die-space ceiling, other solutions may fit better?
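
    The dual-mapping idea above can be pictured with a small table (a purely illustrative sketch: the slot assignments below are hypothetical, and only the 16-slot count and 12-cog figure come from the discussion):

```python
# Hypothetical 16-slot eggbeater table servicing 12 cogs. The slot count
# stays a power of two (so the hub address mux is just low address bits),
# but slots need not map 1:1 to cogs: here cogs 0-3 own two slots each
# (double bandwidth) and cogs 4-11 own one. This mapping is jmg's
# suggestion in this thread, not actual P2 silicon behaviour.
from collections import Counter

SLOTS = 16
slot_to_cog = [0, 1, 2, 3, 4, 5, 6, 7,
               0, 8, 9, 10, 11, 1, 2, 3]

def cog_at(tick):
    """Which cog owns the hub on a given clock tick (rotor advances one slot per tick)."""
    return slot_to_cog[tick % SLOTS]

# Accesses per cog over one full rotation:
bandwidth = Counter(cog_at(t) for t in range(SLOTS))
```

    Any slot could also hold a "no cog" sentinel, which would behave like a not-yet-started COG's slot does today.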

  • It is like with horses. There are the work horses, able to pull a wagon over the plains, plow your field, and pull your car out of a ditch. Slower, but powerful. And then there are the other horses, which you use for hunting, checking the livestock, and fast transportation. A lighter, more agile horse. Both are very valuable animals, one for each job.

    Same goes for this decision now. Slower but more powerful workhorse or faster but less versatile riding horse?

    Mike

  • jmg wrote: »
    cgracey wrote: »
    For the whole thing to work, cogs must be a power-of-two.
    Is that strictly true ?
    I can see that slots need to be power-of-two, and the memory mux is LSB 2 or 3 bits, but I think the 16 slots do not have to be 1:1 COG mapped.
    Seems they can be empty (same as a not-started-yet COG), or dual mapped.
    With unlimited space it is natural to have COGs = SLOTS, but with finite die-space ceiling, other solutions may fit better ?

    Yes, that is actually correct. The hub RAM instances need to be a power-of-two, but not the cogs.
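
    The power-of-two requirement on hub RAM instances follows from how the eggbeater interleaves memory: consecutive longs live in consecutive tiles, so the tile select is simply the low bits of the long address. A minimal sketch of the idea, assuming the 16-tile count from this thread (the function itself is illustrative, not the silicon's exact logic):

```python
# Interleaved ("eggbeater") hub addressing sketch: with a power-of-two
# tile count, the tile select and the in-tile offset fall out of a
# simple bit split of the long address. N_TILES matches the 16-instance
# hub layout discussed in this thread.
N_TILES = 16

def tile_and_offset(long_addr):
    """Map a hub long address to (RAM tile index, address within that tile)."""
    return long_addr % N_TILES, long_addr // N_TILES

# Consecutive longs land in consecutive tiles, spreading bandwidth:
# long 0 -> tile 0, long 1 -> tile 1, ..., long 16 -> tile 0 again.
```

    With a non-power-of-two tile count, that split would need a real divide instead of a bit slice, which is why the instance count is constrained while the cog count is not.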
  • Hi Chip

    As for the currently designed chip (8 COGs, 64 smart pins, 512kB HUB, 160MHz), what are the die dimensions?
  • jmg Posts: 14,493
    cgracey wrote: »
    Yes, that is actually correct. The hub RAM instances need to be a power-of-two, but not the cogs.

    Yup, and that's the count of hub RAM instances, not their size - meaning 512 KB is made up of 16 x 32 KB memory areas.
    Those areas could add one address bit and be 40 KB, 48 KB, 56 KB, etc. each; they do not have to jump from 32 KB all the way to 64 KB.
    40 KB tiles map to a 640 KB HUB, and then we have 768 KB, 896 KB, etc.
    Adding an address bit does have a small memory-MHz impact, so once that is done, the incentive is to add as much memory as will physically fit.

    Illustrating that MCU memory size does not have to be binary, I note that SiLabs recently released a part with 40 KB of flash, and Microchip a 48 KB-flash AVR.
    There must be some package/die-area factor at work here (as in the P2) that makes them select a non-binary step.
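
    The tile arithmetic above is easy to check (the tile sizes and the 16-instance count are from this thread; the script is just a sanity check):

```python
# Hub RAM totals for 16 RAM instances (tiles) of various sizes.
# Adding one address bit to a 32 KB tile permits anything up to 64 KB,
# so intermediate hub sizes like 640 KB become possible without jumping
# straight from 512 KB to 1024 KB.
TILES = 16

totals = {tile_kb: TILES * tile_kb for tile_kb in (32, 40, 48, 56, 64)}
for tile_kb, total_kb in totals.items():
    print(f"{tile_kb} KB tiles x {TILES} -> {total_kb} KB hub")
```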
  • Yanomani wrote: »
    Hi Chip

    As for the currently designed chip (8 COGs, 64 smart pins, 512kB HUB, 160MHz), what are the die dimensions?

    About 8.5 x 8.5 mm.
  • jmg wrote: »
    Yup, and that's the count of hub RAM instances, not their size - means 512kB is made up of 16 x 32k Memory areas.
    Those areas could add one address bit, and be 40k,48k,56k etc each, they do not have to jump from 32k all the way to 64k.
    40k maps to 640k HUB, then we have 768k, 896k etc.
    Adding an address bit does have a small Memory MHz impact, so once that is done, the incentive is to add as much memory as will physically fit.

    Illustrating that MCU memory size does not have to be binary, I note that SiLabs recently released a 40kF part, and Microchip a 48kF AVR.
    There must be some package/die area factor at work here (as in P2) that makes them select a non-binary step.

    I don't know if On Semi's ONC18 RAM compilers can do non-power-of-two word counts.
  • hippy wrote: »
    Add me to the 16 Cog, 1 MB RAM, 120 MHz preference.

    Doubling cogs, doubling RAM will attract and facilitate more people than a 30% increase in speed will. IMO.
    Hey hippy,
    Nice to see you are still lurking
  • cgracey wrote: »
    Question:

    Would 16 cogs and 1 MB of Hub be worth a speed reduction from 160 MHz to 120 MHz?

    I am for 1 MB RAM .....

    .... but, as @Cluso99 also noted, if possible I would sacrifice the additional 8 cogs (area) in favor of an integrated 512 KB flash or FRAM with a (Q)SPI interface: just two dies inside a common package, tied to the same external pins. With this there would be no need for any other boot option in the internal ROM. A state machine can copy the first 2 KB of flash contents to RAM and start the cog.
    The flash can be programmed through the cogs, but also via the external pins while keeping the Prop in reset.
  • Heater. Posts: 21,233
    edited 2018-03-05 - 13:33:40
    What? Give up COGs for FLASH? That is a lot of nice functionality and performance to be throwing away just for the sake of saving a tiny FLASH chip.
  • Cluso99 Posts: 16,660
    edited 2018-03-05 - 13:14:42
    dMajo wrote: »
    I am for 1MB ram .....

    .... but as also @Cluso99 noted if possible I would sacrifice the additional 8 cogs (area) in benefit of an integrated 512KB flash or fram with (Q)SPI interface. Just 2 dies inside common package tied to the same external pins. With this no need for any other boot option in the internal ROM. A state machine can copy the first 2KB of flash contents to ram and start the cog.
    The flash can be programmed through the cogs but also via the external pins while keeping the prop in reset.
    -1
    Not me.

    I'd like to see an internal SPI flash, but not at the expense of a significant number of cogs or hub RAM.

    Hoping to hear more from Chip after some more discussions with OnSemi this week.
  • For me, if it had 64 I/Os and 16 Cogs, I wouldn't care too much about RAM or speed.
    Most things that need 8 cogs could probably be done with the P1.
    But that is just me....

    Most important is to get something... anything... produced. If it is 8 cogs, then so be it...
    We can worry about extending it later (either more cogs or more RAM... maybe both as separate chips), but let's not quibble about such things when we are this close.

    Bean
  • Like several others, I want to see the lowest risk thing be done, in the most reasonable time.

    If that ends up being 16 cogs, 512 GB of RAM great!

    Interrupts do make the cogs a lot more powerful than they first appear. But feature conflicts do sometimes limit what a cog can do.

    We have a ton of killer I/O on this chip, and I think that favors the cogs more than the RAM.

    I also think using the cogs in an easy, let's call it lazy way, favors having more cogs.

    Having more cogs, is likely to mean being able to dedicate more of them, or just one or two of them, to an external memory without feeling like doing that takes too many COGS. Big win, if true.

    Perhaps that can be done in a standard easy way.

    While more on-chip RAM is obviously desirable, the truth is that this chip, like the P1, will make use of external RAM, and unlike the P1, it will do so with only a small penalty rather than the larger one we saw on the P1.

    The Hub execute functionality really changes things in my opinion.

    And the biggest change will be larger programs. And in that sense no amount of on-chip ram, unless it were very large, will meet that demand.

    I think we will fairly rapidly adopt external memory, given that it's easy, fairly standard and works with few issues. More cogs seems to favor that scenario.
  • I view the discussion about COGs vs RAM as not relevant at this stage. Sorry to be the wet blanket, but that's what I am.

    While I realize the discussion about the variants is spurred by the ease of making simple changes to the Verilog and layout here and there, I don't think it's that easy.

    We need to talk about things we can change, efforts we can influence to make the P2 a success.

    For example, we don't have the resources to do all the software/documentation and examples like we did for the P1. We will need to establish an open location for creating this documentation, whether a Wiki/GitHub or Google doc (I'm not certain what's the best platform). Do we have much interest in collaborating on a crowd-sourced effort in this way?

    Ken Gracey
  • potatohead wrote: »
    If that ends up being 16 cogs, 512 GB of RAM great!

    I'll take 512GB too! :D