I like Bill's suggestion that all cogs start with 1:16 slots.
But if we are going to permit allocation of the other 4 slots as 1:8 or 1:6 or 1:4 to some cogs, we would need to change the order for different implementations.
a 1:16 needs to be...
0 1 2 3 4 5 6 7 8 9 10 11 x x x x
a 1:8 (2:16) needs to be like...
0 1 2 3 4 5 6 7 0 8 9 10 11 x x x (0 gets 1:8, others unspecified)
a 1:4 (4:16) needs to be like...
0 1 2 3 0 4 5 6 0 7 8 9 0 10 11 x
a 5:16 needs to be like...
0 1 2 0 3 4 0 5 6 0 7 8 0 9 10 11
The gift and yield can still be used.
Free slots can still be used.
But I think the cog order (round-robin order) will need to be set up before other cogs are started, say by cog 0???
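For what it's worth, the patterns above all seem to follow one rule: give the favoured cog evenly spaced positions in a 16-slot rotation, fill the gaps with the remaining cogs in order, and mark whatever is left over as free (x). Below is a minimal Python sketch of that rule. The name `hub_slot_order` and its parameters are made up for illustration, it assumes 12 cogs with only one cog getting extra slots, and it says nothing about how the hardware would actually do this.

```python
def hub_slot_order(priority_slots, priority_cog=0, n_cogs=12, n_slots=16):
    """Build one hub rotation as a list of cog numbers and "x" (free slot).

    priority_cog gets priority_slots evenly spaced slots; every other cog
    gets exactly one slot. All names here are illustrative assumptions.
    """
    max_priority = n_slots - (n_cogs - 1)  # slots left after one per other cog
    if not 1 <= priority_slots <= max_priority:
        raise ValueError(f"priority_slots must be in 1..{max_priority}")

    table = [None] * n_slots
    # Spread the favoured cog's slots as evenly as integer positions allow.
    for i in range(priority_slots):
        table[i * n_slots // priority_slots] = priority_cog

    # Fill the gaps with the remaining cogs in ascending order, then "x".
    others = (cog for cog in range(n_cogs) if cog != priority_cog)
    for pos in range(n_slots):
        if table[pos] is None:
            table[pos] = next(others, "x")
    return table


for k in (1, 2, 4, 5):
    print(f"{k}:16 ->", " ".join(str(s) for s in hub_slot_order(k)))
```

With 12 cogs there are only 4 spare slots, so 5:16 is the most one cog can get while every other cog still keeps its 1:16.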
Am I correct that the die size is pretty much set and locked by the I/O circuits that aren't part of what gets synthesized?
So the space freed up by removing the DAC Bus pretty much has to be used or wasted because actually shrinking the die would make Beau do a ton of high risk rework?
If that's the case then it does make sense to figure out the best use of this space.
Chip,
What's involved in moving the outer blocks (cogs and hub) around to squeeze another 4 cogs in? I presume this is similar to squeezing in another ~100KB of Hub?
Might it be easier not to modify this section, and instead place another block of ~100KB of hub in the centre of the die with a simpler mechanism for all cogs to access it? Perhaps this is stupid, or cannot be done for some other reason, but it's an extension of the small shared block I suggested earlier.
I realize it would look weird to have a not power-of-2 Hub RAM but I think I'd rather see 196K Hub than 12 cogs. We talked about more cogs a long time ago and pretty much came to the conclusion that more cogs aren't all that. Plus the Hub gets slower unless you go with some complicated slot sharing scheme. And if we're not going to have the DDR2 big external memory then Hub RAM is the bottleneck.
As far as I can tell, based on what Chip said so far, the options are:
1) more AUX ram for every cog
2) more HUB memory (~220KB total)
3) more COGs (12)
These can be mixed, so 10 COGs is also a valid choice. It gives a HUB RAM outcome between the 8-cog and 12-cog choices.
(9 and 11 also exist, but 10 is more human-natural)
What happens if instead of filling that extra space with more logic, you instead use a smaller die with the existing logic? Would that allow you to lower the price of the P2?
Yes, but I believe the peripherals are hand-laid-out, so it is not easy to simply push a 'shrink' button.
Unfortunately as I understand it, that would add a HUGE delay, with Beau having to do a ton of work - so that's out.
Beau already has to do some work due to the transistor size change, which also required the elimination of the wide DAC bus - which is why there is some die area to be filled.
During this delay, before the next shuttle run, Chip can make some tweaks without further delaying the schedule (i.e. USB helper instructions, a CRC helper, perhaps a SERDES) and use the new empty space for more hub RAM / aux RAM / cogs (or a mix of hub/aux/cogs), in such a way as to not really increase risks. I am strongly advocating taking the least-risk, no-delay route.
I think it makes sense for the C guys and people doing LMM too. When the HUB is more roomy, more can be put on a fast execute path and there is room to manage caching, etc... and maybe fit a GREAT kernel and libraries in there with room to do stuff.
IMHO, this is pretty important given the work they've done. And we are seeing some adoption and a need out there too.
No way. I think that had to be some generalized statement. Maybe Bill has Z80 / 6502 on the brain after this ROMP! And it was a total ROMP too. Sheesh. My head is spinning.
Since there are some improvements about to happen, either by adding more COGs, AUX or HUB memory, are there any power distribution concerns in the near horizon?
Will they demand the use of a different encapsulation, bringing a center pad to concentrate ground connections, thus freeing some more pins, at the periphery, to enable a better 1.8V and 3.3V distribution grid design?
Yanomani
I think we'll stick with the current TQFP-128 package, as we are already set up to use it. If we just add more hub RAM, there shouldn't be much increase in current draw. We'll keep the pinout the same as it's been.
As potatohead said, it was a generalized statement for microcontroller code - that holds true for a lot of Cortex M0's, M3's, PIC32MX's etc.
By the way, there is another problem with using the AUX RAM as a stack. In C (and I think also in Spin) you can take the address of a local variable. How do you express the address of a location in AUX RAM? How would a pointer to AUX RAM work? C kind of assumes a linear address space. We've simulated that in PropGCC by saying that hub memory starts at 0x00000000 and that external memory starts at 0x20000000. Would we have to carve out another address range for AUX RAM? That means that to dereference a pointer we would have to range check the value to see which memory to access. This is likely to negate much of the advantage of the fast AUX RAM. However, I'm happy to be proven wrong. :-)
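To make the range-check cost concrete, here is a small Python model of what every pointer dereference would have to do if AUX RAM got its own address window. The hub and external bases are the ones mentioned above; `AUX_BASE` and the helper names `classify_pointer` and `load_byte` are purely hypothetical, picked only for illustration, and this is not how PropGCC actually does anything.

```python
HUB_BASE = 0x0000_0000  # hub memory base, as described in the post
AUX_BASE = 0x1000_0000  # HYPOTHETICAL window for AUX RAM (an assumption)
EXT_BASE = 0x2000_0000  # external memory base, as described in the post


def classify_pointer(addr):
    """Return (region, offset) for an address in the assumed memory map."""
    if addr >= EXT_BASE:
        return "ext", addr - EXT_BASE
    if addr >= AUX_BASE:
        return "aux", addr - AUX_BASE
    return "hub", addr - HUB_BASE


def load_byte(memories, addr):
    """Dereference a byte pointer; note the range check on every access."""
    region, offset = classify_pointer(addr)
    return memories[region][offset]


# Tiny stand-in memories, just to exercise the dispatch.
memories = {"hub": bytearray(16), "aux": bytearray(16), "ext": bytearray(16)}
memories["aux"][4] = 0x5A
print(hex(load_byte(memories, AUX_BASE + 4)))
```

The two comparisons in `classify_pointer` are the overhead in question: they run on every load and store, even when the pointer turns out to be a plain hub address.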
We have 128KB, and a possibility of another ~100KB
We have 8 cogs, with aux
We create two P2's on the die, each with...
4 cogs
each with 110KB hub (or whatever 1/2 of total ~220KB can be)
small block of 32 long dual-port/fifo ram between the two P2s
Each cog now gets 1:4 slots. Only extra - allow each cog an option to use available slots.
Possibly 2nd 4x cogs do not have video mode at all, if that saves enough space to make both 128KB hubs.
This sounds cool. What's to prevent expanding this to an entire two dimensional array of COG groups in a future Px chip? This kind of scheme might scale better than the current single hub for all COGs.
Will they demand the use of a different encapsulation, bringing a center pad to concentrate ground connections, thus freeing some more pins, at the periphery, to enable a better 1.8V and 3.3V distribution grid design?
I think there were die-paddle size caveats on Centre-Pad lead frames.
However, Google finds an FD3298F, which has 128 pins, 0.4mm, and a medium-sized centre pad, one that would allow vias inside the gull lead ring. From the outside it looks good, but the die space may be wrong.
As I recall, a lot of library functions use local variables, and arguments, without taking their address.
For those functions, it is an easy win. (easy examples: str*, mem* functions)
For applications that need a large stack, and functions where the address of an argument or local variable is taken, the easy fix is to not use the aux stack.
Using the AUX for stack is most useful for "classical" microcontroller applications that need to fit in the hub.
I think we'll stick with the current TQFP-128 package, as we are already set up to use it. If we just add more hub RAM, there shouldn't be much increase in current draw. We'll keep the pinout the same as it's been.
Did you ever find a centre-pad TQFP-128 offering that would fit the die?
Google finds FD3298F, but does not show the die cavity of course.
You still have the same problem even if you stick to just hub memory since you'd have to distinguish between a pointer to a hub location and a pointer to an AUX RAM location. You'd still need to do range checking on each pointer dereference. I wonder how Chip handles this in Spin2?
You're right that you could have a different memory model that uses AUX RAM as a stack but disallows taking the address of locals. It might be more difficult to implement debugging with this memory model though. What might work is to just keep the linkage information in AUX RAM but keep all of the stack variables in hub memory. I'm not sure if that would give you enough of a speed advantage from AUX RAM though.
Comments
So the space freed up by removing the DAC Bus pretty much has to be used or wasted because actually shrinking the die would make Beau do a ton of high risk rework?
If that's the case then it does make sense to figure out the best use of this space.
C.W.
Something strange is happening. I am agreeing with @heater again.
Please avoid feature creep. As I said to Ken once this is about sustainability. We need to stop asking for more features. This thing needs to get out and it is already a BIG milestone away from Prop I.
On any given complex problem for the Prop I you will usually run out of memory, pins or cogs.
Prop II now has already more memory and more pins. Just get more cogs and some SERDES. CRC would be nice. Please do not overdo it.
The OBEX (and the implicit working of different objects TOGETHER) is essential for ALL software-designed HW. Better to have more cogs than a mix of some faster and some slower ones, with the trouble that brings.
For things like video and fast sampling we already do this on Prop I with multiple cogs and syncing of them. It's doable, it's working, it's understandable and it's FUN. You see a multicore working.
And isn't that multicore thing what drew MOST of us to the Propeller?
More cogs please.
Enjoy!
Mike
Whatever is the lowest risk, least amount of work, least amount of elapsed time, is what should be done (to fill the newly emptied die area).
Shrinking the I/O surrounding the synthesized core would be a BIG delay, so that's out.
Like Roger, I don't care about a nice power-of-two hub size or number of cogs.
I think none of us wants to see untried, new circuitry for the newly free area (10 mm²).
As far as I can tell, based on what Chip said so far, the options are:
1) more AUX ram for every cog
2) more HUB memory (~220KB total)
3) more COGs (12)
We have been having fun talking about it, but Chip will decide which is easiest and least risky to implement.
While we may disagree on details... I think all of us can agree...
We want our P2's as soon as possible!!!!!!
These can be mixed, so 10 COGs is also a valid choice. It gives a HUB RAM outcome between the 8-cog and 12-cog choices.
(9 and 11 also exist, but 10 is more human-natural)
You already have more COGs in Prop 2, thanks to multi-tasking/threading (a fairly late addition).
(Have you really run out of P2 COGs?)
Agreed Bill.
More Hub or more COGS or both, I'm happy with whatever we end up with
There's been a lot of talk about feature creep.
To refresh my memory I just had a quick look at the P2 preliminary spec sheet.
To this point (before extra cog/ram discussions) the current spec P2 seems to match
this old spec (updated circa 2011?) pretty well.
The only obvious new feature is multi-tasking, and it's already implemented and running.
External RAM access is mentioned in there, so nothing new there.
So apart from SERDES everything else seems to have been covered in the spec.
Just a reminder of where we are and where we've come from.
Ozpropdev
I think that is most useful and expands the cases where a system can get by without external memory, which keeps lots of I/O available.
With the multitasking I really think 8 COGS is enough for now and avoids messing around too much with hub timing.
C.W.
This makes sense for the VIDEO guys out there.
In one of my programs I am generating a 2-color 800x600 VGA image.
The image buffer gobbles up nearly half of the HUB RAM (60K).
That's just one example of a need for HUB RAM expansion.
But... I love COGs too.
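The 60K figure is easy to check: 2 colors means 1 bit per pixel, so 800x600 pixels pack into 60,000 bytes. Here is a quick Python sketch of that arithmetic. The helper name `framebuffer_bytes` is made up, and it assumes a tightly packed buffer with no padding per line and the 128KB hub discussed in this thread.

```python
def framebuffer_bytes(width, height, colors):
    """Bytes needed for a packed frame buffer with the given color count."""
    bits_per_pixel = max(1, (colors - 1).bit_length())  # 2 colors -> 1 bit
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


HUB_BYTES = 128 * 1024  # current 128KB hub

size = framebuffer_bytes(800, 600, 2)
print(size, "bytes")                      # 60000 bytes
print(f"{size / HUB_BYTES:.1%} of hub")   # a bit under half of 128KB
```

The same function shows why more colors at this resolution are out of reach with 128KB: at 4 colors (2 bits per pixel) the buffer alone would be 120,000 bytes.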
Yeah. Great post. And the truth is SERDES should have been in there. I feel good about it, and this place we are converging on.
We have 128KB, and a possibility of another ~100KB
We have 8 cogs, with aux
We create two P2's on the die, each with...
- 4 cogs
- each with 110KB hub (or whatever 1/2 of total ~220KB can be)
- small block of 32 long dual-port/fifo ram between the two P2s
Each cog now gets 1:4 slots. Only extra - allow each cog an option to use available slots.
Possibly 2nd 4x cogs do not have video mode at all, if that saves enough space to make both 128KB hubs.
Name: A DualxQuad-Core 32bit processor P2^2