Regarding Bill's question about what can fit into 10 more square millimeters of silicon...
1) Almost another 128KB of single-port hub RAM, but not quite. I need to talk to Beau about this.
2) Double the logic of the last tapeout... WAIT A MINUTE... what if we added 4 more cogs (with their associated memories). That might comfortably fit!!!
WHAT ABOUT FOUR MORE COGS ????
It would mean:
1:12 hub timing - LMM would work 1:3 with RDQUAD
200MHz*12 = 2400 MIPS
Fast DAC updates per 8 pins, instead of 12 pins
3 watts?
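A quick sketch of the arithmetic behind those figures. It assumes each cog retires one instruction per clock (which is what 200MHz*12 = 2400 MIPS implies) and reads "1:3 with RDQUAD" as one RDQUAD delivering four longs per 12-clock hub window. Both readings are assumptions, and the function names are made up for illustration.

```python
CLOCK_MHZ = 200  # system clock quoted in the post


def peak_mips(cogs, clock_mhz=CLOCK_MHZ):
    """Peak MIPS, assuming every cog retires one instruction per clock."""
    return cogs * clock_mhz


def lmm_clocks_per_instruction(hub_window_clocks, longs_per_fetch=4):
    """Clocks per LMM instruction, assuming one RDQUAD per hub window
    supplies `longs_per_fetch` instructions (assumed reading of '1:3')."""
    return hub_window_clocks / longs_per_fetch


print(peak_mips(12))                   # 12 cogs at 200 MHz
print(lmm_clocks_per_instruction(12))  # 1:12 hub timing
print(lmm_clocks_per_instruction(8))   # 1:8 hub timing, for comparison
```

Under these assumptions, 12 cogs gives the 2400 MIPS peak quoted above, and the LMM ratio goes from 1:2 at 1:8 hub timing to 1:3 at 1:12.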
Would it be possible to have Cog 0 be a special case where it can use any free (unclaimed) hub access so that we can try to push the interpreter performance? How much of a difference would it make?
WHAT ABOUT FOUR MORE COGS ????
Wait just a bit, have to wipe down my screen - that's what I get for reading this with a mouthful of coffee
Are you serious???
256KB Hub or 12 Cogs - What a dilemma? Need time to think.
I had been thinking about a small block of memory at the center of the die for all cogs to access. I thought 16 * 32+1 bits (now 32+2). Bill would prefer 32*.
Then I thought each cog has a write block of 16* or 32* 32+2, and all cogs can read, one at a time sequentially (no determinism).
But your options have me stumped. More coffee required (well actually I am worse than that - I drink real coke)
If such users are no longer a concern then go ahead make it as complicated and impenetrable as you like.
I don't think the slot sharing is as impenetrable as you make it out to be. It isn't half as impenetrable to me as is your multithreading idea. And to make it perfectly clear, I don't wish to toss out either of them!
Sadly I could argue that that is also true of the Prop II as it stands.
I dispute this. But even if it were true, that's all the more reason not to pass up a virtual freebie like slot sharing.
I will also say that the quickest way to end up with a Homermobile is to attempt to idiotproof a simple and clean idea like Ray/Chip/Bill/CW/et.al. have proposed.
AMEN!
It's already good style to use variables or constants to designate I/O in objects so they don't conflict, and it's the user's responsibility to avoid those conflicts. This is widely accepted.
It would be good style to use variables when setting up cog slot allocations so they don't conflict, and it would be the user's responsibility to avoid those conflicts. Whether this be by judicious selection of the cog in which the object should run, or by using a variable to select the options, or some combination of these.
I'd be perfectly happy with bigger hub, and 8 cogs, but since you asked for examples:
Advantages to more cogs:
- more raw computational power
- due to new multipliers and dividers, greater potential to enter DSP space
- more counters
- more uart / serdes modules
- more high bandwidth video / signal outputs
- more precise deterministic timing for I/O's
Advantages to more hub: (in addition to what you stated)
- larger LMM code without SDRAM
- more buffers (for bitmaps, SD sectors, data gathering, etc)
- more bandwidth per cog (due to fewer cogs)
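To put a rough number on the bandwidth-per-cog point, here is a small sketch. It assumes the 128-bit hub bus and 200 MHz clock mentioned elsewhere in this thread, a plain round-robin rotation with one slot per cog, and a full quad transfer in every slot; the function name is hypothetical.

```python
BUS_BYTES = 128 // 8      # 128-bit hub bus = 16 bytes per hub access
CLOCK_HZ = 200_000_000    # assumed 200 MHz system clock


def hub_bytes_per_second_per_cog(slots_in_rotation, slots_owned=1):
    """Peak hub bandwidth for one cog, assuming a full 16-byte quad
    transfer in every slot the cog owns."""
    return BUS_BYTES * CLOCK_HZ * slots_owned / slots_in_rotation


for cogs in (8, 12):
    mb_per_s = hub_bytes_per_second_per_cog(cogs) / 1e6
    print(f"{cogs} cogs: {mb_per_s:.1f} MB/s per cog")
```

The total hub bandwidth is the same either way; only the share per cog changes.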
Regarding QSPI support:
Chip seems to suggest we will get 200MHz (woohoo!)
With the new nibble instructions and counters, we can handle 100MHz QSPI in software (to AUX/cog memory)
This is low-risk and it's the easiest way to make all that space useful. Changing to 12 cogs is simple. The biggest pain is remapping the COGINIT instruction, which is just a lot of busy work.
How come you guys want a QSPI SERDES so badly? I know you could access off-chip Flash quickly, but to what end? Would you dump it to hub for other cogs to use? Why all the keen interest in this feature?
Chip,
We have a 128bit bus to the hub. Only the r/w quad instructions can use this wide bus and that limits us to the quad buffer.
The read instructions effectively take 2 clocks to setup.
So for max hub-to-cog block transfers we would perform the following (I know we can just do n*4 and one RDLONGC, but this illustrates the point better)...
        REPS    #n,#4             ' n = number of quad longs to transfer
        NOP
        RDLONGC INDA++,PTRA++     ' 1st time syncs to hub (3+ clocks); others 3+ clocks
        RDLONGC INDA++,PTRA++     ' 1 clock
        RDLONGC INDA++,PTRA++     ' 1 clock
        RDLONGC INDA++,PTRA++     ' 1 clock
So this means in the loop we could execute 2 additional 1 clock instructions???
Or we could do this...
        REPS    #n,#5             ' n = number of quad longs to transfer
        NOP
        RDQUADC PTRA              ' 1+ clocks (first time 1..8 clocks)
        RDLONGC INDA++,PTRA++     ' 1 clock
        RDLONGC INDA++,PTRA++     ' 1 clock
        RDLONGC INDA++,PTRA++     ' 1 clock
        RDLONGC INDA++,PTRA++     ' 1 clock
So this means in the loop we could execute 3 additional 1 clock instructions???
How do we read into AUX from HUB and how quickly can that be done?
This is low-risk and it's the easiest way to make all that space useful. Changing to 12 cogs is simple. The biggest pain is remapping the COGINIT instruction, which is just a lot of busy work.
12 Cogs!
I think I just had my first (and hopefully last) heart attack.
I was thinking the extra die space might be useful as extra AUX ram to expand SETRACE capacity.
I'm scrapping that idea now.
- up to 2,400MIPS of processing power
- twelve 32 bit cores, each with MAC's, CORDIC engines, each with up to four threads of execution
- 24x 32bit timer/counters
- 24x UARTs
- 24x SPI ports (serdes)
- 12x vga/component/ntsc/pal video or signal generation ports
- 48x 9 bit high speed (200Mhz) DAC's
- 92x 9/18 bit low speed (12.5MHz) DAC's
- 92x 9/18 bit Sigma-Delta ADC's, up to XXMhz at 10 bits
- 92x digital I/O
Obviously not all ports at the same time, but people are used to that.
All of a sudden, to those who want "hard" peripherals, there is quite a list
How come you guys want a QSPI SERDES so badly? I know you could access off-chip Flash quickly, but to what end? Would you dump it to hub for other cogs to use? Why all the keen interest in this feature?
I don't know. If I want more memory, SDRAM or SRAM is the WTG. And you (Parallax) will have a module with SDRAM.
I presume we can use x16 as well as x8 or x32 SDRAM with the hw?
I'd be perfectly happy with bigger hub, and 8 cogs, but since you asked for examples:
Advantages to cogs::
- more raw computational power
Only in total, per-cog average HUB-BW is actually slightly less.
- due to new multipliers and dividers, greater potential to enter DSP space
new as in 4 more sets ? This still needs dispersed code, so I'm less sure this is a real market target ?
- more counters
- more uart / serdes modules
Always good to have, but we are still waiting on exactly what is in the new Counters - I think they already doubled ?
- more high bandwidth video / signal outputs
Current design can have 32 high bandwidth Video lines (4x8), and even the slower DAC pathway was looking fine for ATE usages.
- more precise deterministic timing for I/O's
Not following this - fSYS is the same, so ns per clock is the same ?
I'll add another Advantages to cogs::
- Can allocate 8 COGS for SW and 4 COGS for intelligent IO and Peripherals
Regarding QSPI support:
Chip seems to suggest we will get 200MHz (woohoo!)
With the new nibble instructions and counters, we can handle 100MHz QSPI in software (to AUX/cog memory)
Fastest QSPI flash I've seen is 104MHz.
QSPI in hardware will allow the expensive COGS to be used for real SW work, not wiggling pins.
Hopefully, it is in the improved SERDES mentioned.
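For scale, a rough sketch of what those QSPI clock rates mean in raw throughput. It assumes four data lines, one bit per line per clock (single data rate), and ignores command, address and dummy-cycle overhead, so real sustained rates would be lower. The function name is hypothetical.

```python
def qspi_bytes_per_second(clock_hz, data_lines=4, edges_per_clock=1):
    """Raw QSPI payload rate: bits per clock across all data lines, in bytes.
    Ignores command/address/dummy cycles (assumption)."""
    return clock_hz * data_lines * edges_per_clock / 8


for mhz in (100, 104):
    rate = qspi_bytes_per_second(mhz * 1_000_000)
    print(f"{mhz} MHz QSPI: {rate / 1e6:.0f} MB/s, {rate / 4e6:.0f} M longs/s")
```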
- 12x vga/component/ntsc/pal video or signal generation ports
- 48x 9 bit high speed (200Mhz) DAC's
- 92x 9/18 bit low speed (12.5MHz) DAC's
- 92x 9/18 bit Sigma-Delta ADC's, up to XXMhz at 10 bits
- 92x digital I/O
Yeah, but the minute you add the SDRAM pretty much half of those items go 'poof' since the HS DAC's are pin locked..
This is low-risk and it's the easiest way to make all that space useful. Changing to 12 cogs is simple. The biggest pain is remapping the COGINIT instruction, which is just a lot of busy work.
Sold!
I like Bill's 1:16 method with 4 free.
Do we have a little space to put a small block of RAM between cogs?
How come you guys want a QSPI SERDES so badly? I know you could access off-chip Flash quickly, but to what end? Would you dump it to hub for other cogs to use? Why all the keen interest in this feature?
The appeal is getting closer to Execute-in-Place; using a whole COG just to wiggle pins wastes a huge amount of silicon.
Best to let HW manage the bits and SW manage the bytes/words.
A Prop is already relatively memory-starved, and QSPI Flash is cheap and available, so making the most of that bandwidth helps mitigate the Prop's memory issues.
It is not just code fetch; things like font and icon fetch could be done in place for many display uses.
As a reference, the FT800 has 256K RAM and > 282k of Fonts in ROM
True, but when you enable the external bus on any microcontroller, a lot of pins go poof. Maybe a P4 will have a PGA512 package with pins for everything
If we go with 12 COGS and ~126K Hub we get around 10.5K of Hub per COG
If we go with 8 COGS and ~254K Hub we get almost 32K of Hub per COG
I think sticking with 8 COGS and increasing hub is a better balance of resources.
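The per-cog figures above are simple averages, which can be checked with a short sketch (the function name is made up; hub RAM is shared, so this is only a rule of thumb and not a hard partition):

```python
def hub_kb_per_cog(hub_kb, cogs):
    """Average hub RAM per cog in KB, if the hub were split evenly."""
    return hub_kb / cogs


for cogs, hub_kb in [(12, 126), (8, 254)]:
    share = hub_kb_per_cog(hub_kb, cogs)
    print(f"{cogs} cogs, ~{hub_kb}K hub: {share:.2f}K per cog")
```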
The sad thing is, we don't have enough room to double the hub RAM from 128KB to 256KB. We probably have enough room for ~220KB total, if we don't add anything else. I remember Beau and I exploring this a while ago.
I need to find a graphic of the die layout. I know there are several on this thread, but where? I'm in the house today, so I've only got a laptop with nothing special on it.
This is the same with the P2 and hub slots. Do you really want to limit what it is capable of ???
Frankly, I don't even care anymore. I don't care how many hub slots a cog gets, how many cogs there are, or even whether multitasking is in the mix, for that matter. What I do care -- and care deeply about -- is seeing a company and the people who work there who do matter to me -- and that I depend upon to a large extent for my livelihood -- get dragged down by an overly expensive, insanely long development cycle that has no end in sight and that's open to way too much input from people who will have virtually no impact on the chip's ultimate sales. The P2's development has to be more than an expensive, seven-year-and-counting hobby, or before long we won't even have a P1 to talk about. How many more "just two hour" mods will it take, Chip? It's time to end this insanity and get the P2 out the door, while there still is a door.
-Phil
P.S. In case you didn't catch it, subtlety and diplomacy are not my strong suits. And I'll probably regret this in the morning. For the time being, though, it felt good to get it off my chest.
It seems that their expectations are wide open out there.
I think we are about to have two camps, reminiscent of discussions a couple of years ago...
A) 12 cog / 126KB hub camp
B) 8 cog / <256KB hub camp
I think I am leaning towards (A)
but could we have 1:16 hub slots?
12 slots for the 12 cogs
4 spare slots to be allocated to cogs that need more bandwidth
This means four cogs could be 1:8, with 8 cogs at 1:16
Or two cogs at 3:16, with 10 at 1:16...
Or one cog at 5:16, with 11 at 1:16
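The splits above can be sanity-checked with a small sketch of the bookkeeping. It assumes every cog keeps one base slot and the spare slots are handed out on request; the function and its `extra` parameter are hypothetical, and it does not model where in the 16-slot rotation each slot lands.

```python
def allocate_slots(cogs=12, total_slots=16, extra=None):
    """Return slots per cog: one base slot each, plus extras taken
    from the spare pool. `extra` maps cog number -> extra slots."""
    extra = extra or {}
    spare = total_slots - cogs
    if spare < 0:
        raise ValueError("fewer slots than cogs")
    if any(cog not in range(cogs) for cog in extra):
        raise ValueError("unknown cog number")
    if sum(extra.values()) > spare:
        raise ValueError("not enough spare slots")
    return [1 + extra.get(cog, 0) for cog in range(cogs)]


# Four cogs at 2:16 (i.e. 1:8), eight cogs at 1:16
print(allocate_slots(extra={0: 1, 1: 1, 2: 1, 3: 1}))
# Two cogs at 3:16, ten at 1:16
print(allocate_slots(extra={0: 2, 1: 2}))
# One cog at 5:16, eleven at 1:16
print(allocate_slots(extra={0: 4}))
```

Each of the three cases uses exactly 16 slots. A true 1:8 cog would also need its two slots spaced eight apart in the rotation, which this sketch leaves out.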
Is this just to throw Ken off track...
C.W.
what would be the easiest, least time consuming, least risky use of the newly freed up area?
I think that is the path to be taken, for this first P2.
Whatever that path is - be it more cogs, more hub, more aux - I am sure we will find uses for it
I would agree on bandwidth-flexible slots, needed for any # COGS.
12 COGS ?? - remember we now have Multi-tasking, so 8 is already the new 12 +...
Need some examples of what can be done with 12.MT, that cannot be done with 8.MT ?
Examples of what can be done with 256k, that cannot be done with 128k might be easier to find ?
That's more fonts for a start, and more Display List..., or might now make JavaScript fit, or many more....
More important than extra COGS would be HW QuadSPI support, so cheap FLASH can better feed the COGS we have.
4 more with a fleshed out SERDES, or with the one that is there now?
I'm serious!
This is low-risk and it's the easiest way to make all that space useful. Changing to 12 cogs is simple. The biggest pain is remapping the COGINIT instruction, which is just a lot of busy work.
With the new SERDES - it is not going to be many gates, at all.
I like Bill's 16 time slot model too.
You the man Chip!
Happy days
Just think of the advertising advantages:
"Introducing the Parallax Propeller 2"
- up to 2,400MIPS of processing power
- twelve 32 bit cores, each with MAC's, CORDIC engines, each with up to four threads of execution
- 24x 32bit timer/counters
- 24x UARTs
- 24x SPI ports (serdes)
- 12x vga/component/ntsc/pal video or signal generation ports
- 48x 9 bit high speed (200Mhz) DAC's
- 92x 9/18 bit low speed (12.5MHz) DAC's
- 92x 9/18 bit Sigma-Delta ADC's, up to XXMhz at 10 bits
- 92x digital I/O
Obviously not all ports at the same time, but people are used to that.
All of a sudden, to those who want "hard" peripherals, there is quite a list
Marketdroids delight...
If we go with 8 COGS and ~254K Hub we get almost 32K of Hub per COG
I think sticking with 8 COGS and increasing hub is a better balance of resources.
C.W.
Yeah, but the minute you add the SDRAM pretty much half of those items go 'poof' since the HS DAC's are pin locked..
C.W.
There are also other combinations possible
10 COGs with 191k is also sounding like a good combination