You still have the same problem even if you stick to just hub memory since you'd have to distinguish between a pointer to a hub location and a pointer to an AUX RAM location. You'd still need to do range checking on each pointer dereference. I wonder how Chip handles this in Spin2?
You're right that you could have a different memory model that uses AUX RAM as a stack but disallows taking the address of locals. It might be more difficult to implement debugging with this memory model though. What might work is to just keep the linkage information in AUX RAM but keep all of the stack variables in hub memory. I'm not sure if that would give you enough of a speed advantage from AUX RAM though.
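To make that concrete, here's roughly what every generic pointer dereference would have to do if AUX were simply mapped above hub. This is only a sketch in C with made-up address ranges, not how Spin2 or PropGCC actually lay anything out:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout: 128 KB of hub RAM with a small AUX window mapped just
 * above it.  None of these numbers come from the real chip; they only show
 * why a generic pointer needs a range check on every dereference. */
#define HUB_BYTES  (128u * 1024u)
#define AUX_LONGS  256u
#define AUX_BASE   HUB_BYTES              /* first "address" of the AUX window */

static uint32_t hub[HUB_BYTES / 4];       /* stand-ins for the two memories    */
static uint32_t aux[AUX_LONGS];

/* On real silicon the two branches would emit different instructions
 * (a hub read vs. an AUX read); here they just index different arrays. */
static uint32_t deref(uint32_t ptr)
{
    if (ptr >= AUX_BASE)
        return aux[(ptr - AUX_BASE) / 4]; /* AUX access path */
    return hub[ptr / 4];                  /* hub access path */
}

int main(void)
{
    hub[10] = 111;
    aux[3]  = 222;
    printf("%u %u\n", (unsigned)deref(40), (unsigned)deref(AUX_BASE + 12));  /* 111 222 */
    return 0;
}
```

That per-dereference branch, and the two different load paths behind it, is the cost I'm worried about.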
In the Spin2 compiler, I know variables by type, so I know which instructions to execute to read them from hub, cog register, or AUX memory.
Really? I thought Spin was pretty much untyped. Can I take the address of a local variable and then pass it as a parameter to another function and have that function understand that it is a pointer to AUX RAM rather than hub?
Actually, instead of disallowing taking the address of locals, I was thinking of it as an optimization applied where the address of a local is never taken.
It makes passing arguments, and using arguments/locals easy when there is no need for a deep stack - and is much faster than hitting the hub for arguments/locals. It also saves a lot of register spilling and re-loading from GCC. Most control code for microcontrollers does not use deep stacks, so there it is a definite win.
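Put another way, the optimization being discussed is the usual escape-analysis one. A hedged C sketch of the distinction, nothing Propeller-specific:

```c
/* Locals whose address is never taken can live in registers / AUX, while an
 * address-taken local has to sit somewhere a pointer can reach (hub, under
 * the current model). */
#include <stdio.h>

static void read_into(int *dst) { *dst = 42; }

static int no_escape(int a, int b)
{
    int t = a * b;          /* address never taken: register or AUX is fine */
    return t + 1;
}

static int escapes(void)
{
    int t = 0;
    read_into(&t);          /* &t escapes, so t needs an addressable home   */
    return t;
}

int main(void)
{
    printf("%d %d\n", no_escape(6, 7), escapes());   /* 43 42 */
    return 0;
}
```

Only locals like `t` in `escapes()` would be forced into hub; everything else could stay in registers or AUX.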
What happens if instead of filling that extra space with more logic, you instead use a smaller die with the existing logic? Would that allow you to lower the price of the P2?
Or how about adding some internal debugging pads between the core part and the I/O part to reduce risk of a new shuttle failure? Something that can be used together with the ION machine that was successfully used to test/modify P1 die.
Are you suggesting two mutually exclusive compilation modes where either the entire program would use the AUX stack model or it would use the hub stack model but not both? It would be difficult to allow both in the same program without having some sort of decoration on the prototype to indicate which convention was being used.
Two stack models would be a good compromise in ease of implementation vs. performance gains.
I'd even go further and say that the AUX-stack model would mostly be useful in Hub-only mode (and cog mode); but in those cases, it should be a significant win in both speed and code size.
Code that needs XMM mode is more likely to need a stack too big for AUX stack, and implementing mix-mode usage within the same executable is not easy, as you say.
It is untyped, though the compiler knows where variables are. To access the AUX variables by index, you would use 'AUXR[index]' in Spin. Spin would give you the address of a local in AUX by '@localvariable'. You'd just better use that index in an AUXR[index] expression. It's a bit like when my nephews saw my mom in the Apple App Store (on her iPad) trying to find handbags from Macy's: right search, wrong store.
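In C terms (purely illustrative, not how Spin2 is implemented), the hazard is that what you get back is an index that only means something to the AUX accessor; the helpers below are made up:

```c
#include <stdint.h>
#include <stdio.h>

/* Not Spin, just the same idea in C: aux_store()/aux_fetch() are made-up
 * helpers standing in for AUXR[index]. */
static uint32_t aux[256];                       /* stand-in for this cog's AUX */

typedef uint32_t aux_index_t;                   /* an index, NOT a hub address */

static void     aux_store(aux_index_t i, uint32_t v) { aux[i] = v; }
static uint32_t aux_fetch(aux_index_t i)              { return aux[i]; }

int main(void)
{
    aux_index_t local = 5;                      /* what '@localvariable' hands back  */
    aux_store(local, 123);                      /* fine: used with the AUX accessor  */
    /* uint32_t *p = (uint32_t *)local;            wrong: 5 is not a hub address     */
    printf("%u\n", (unsigned)aux_fetch(local)); /* prints: 123 */
    return 0;
}
```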
I think there were die-paddle size caveats on Centre-Pad lead frames.
However, Google finds an FD3298F, which has 128 pins at 0.4mm pitch and a medium-sized Centre Pad, one that would allow vias inside the gull-wing lead ring. From the outside it looks good, but the die space may be wrong.
This place has been getting crazy in the last few days. Too many new choices and risk...
My only suggestion, if you want a device shipped anytime soon, is to restrict your changes to things that are the lowest risk, or that would either save Parallax large amounts of money or increase the P2's target application market so significantly that it warrants the increased risk of failure or delay in shipping. If you are going to do anything, scaling SRAM size seems relatively lower risk than adding completely new features at this point, at least to me.
That being said, where are the areas that really limit the P2 today and could potentially restrict its market? IMO it's unlikely to be PASM capabilities, COG resources, or clock speed; more likely it's device cost and power usage (both of which are a little hard to do much about at this point, so you need to look at other things). The other areas in which I see the P2 may be limited are: streaming fast I/O to high-bandwidth serial interfaces (eg QSPI, Ethernet, even USB); performing data transactions quickly between COGs, due to the hub bottleneck; and accessing shared external memory that contains high-resolution framebuffers for video, other large data sets such as those used by image processing in robotics/industrial applications, and/or large amounts of high-level code (ie. all the stuff that won't fit into the 124kB hub RAM). We've already analyzed some shared-memory performance in a previous post and identified a pretty big bottleneck if one uses SDRAM for this. It's not good, and we are really starving all those COGs of raw performance. Yes, a VM directly attached to dedicated and exclusive SRAM can probably still run pretty fast; it is really only when you need to share it with other COGs, or have to execute from SDRAM with its higher access latencies, that you start to run into these performance issues, some of which relate to the hub and some to the memory technology itself.
Chip, could the die space freed up by your intended DAC bus wiring changes be used to increase the AUX RAM depth, and at the same time make some amount of it (eg half) 8-ported RAM shared by all the COGs? I'm thinking an 8-port RAM could open up a lot of new possibilities/applications for high-speed data transfers between driver COGs working in parallel, and could significantly increase performance over the hub RAM method alone. Some shared AUX RAM would allow single-cycle random access by multiple COGs; to the programmer it basically acts like a very fast mini hub RAM with no noticeable hub-arbiter delay, ie always deterministic, which is very useful. I can see VMs using it to communicate with other COGs easily, either for rapidly passing around code or data (including video lines), albeit requiring some coordinated way not to all write to the same address in the same cycle (note the hub method effectively requires that too). Software can always sort that part out, so for simplicity the hardware could just arbitrate any simultaneous writes with a fixed priority based on COG ID, for example.
Could we get at least a 2k x 32, or ideally even larger, 8-ported AUX RAM space shared by all 8 COGs out of the extra chip space available? I'm not saying make all the AUX RAM 8-ported; you'd want to keep the existing amount of dual-ported AUX RAM as-is for exclusive COG use, but extend its depth so that some of the increased address space becomes 8-ported RAM. Or, if there is space to spare, 16-ported RAM with the video generators reading it too might be even better, but that is probably asking for too much.
I'm thinking this could be a reasonable use of the newly available die space, and would improve performance in some cases for high-speed drivers and I/O when multiple COGs are involved. I don't know how much risk including 8-port RAM adds over regular AUX RAM in your P2 design; only you would know that. The other easy option is simply increasing hub RAM size, but that doesn't increase performance directly, only indirectly in some cases by increasing the amount of code or data that will fit before any external RAM is required (still good in many cases). Some mix of both options may be good too, but I'm thinking an octal-port SRAM accessible from all COGs via AUX RAM would be rather useful for quite a few new things we could dream up over time, and would also help alleviate some of the hub bottleneck problem; increasing hub RAM alone won't change the bottlenecks.
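To make the arbitration idea concrete, here's a rough software model of what I mean by fixed priority on COG ID. Obviously the real thing would be a small priority encoder on the write port, not code, and the sizes and names here are just placeholders:

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_COGS 8

typedef struct {
    int      wants_write;                /* this cog asserts a write this cycle */
    uint32_t addr;
    uint32_t data;
} write_req_t;

static uint32_t shared_aux[2048];        /* hypothetical 2K x 32 shared window  */

/* One "clock cycle" of writes: every write lands, but if two cogs hit the same
 * address in the same cycle, the lowest COG ID wins (applied highest-ID first
 * so the lowest ID's data is what ends up stored). */
static void clock_writes(const write_req_t req[NUM_COGS])
{
    for (int cog = NUM_COGS - 1; cog >= 0; cog--)
        if (req[cog].wants_write)
            shared_aux[req[cog].addr] = req[cog].data;
}

int main(void)
{
    write_req_t req[NUM_COGS] = {0};
    req[3] = (write_req_t){ 1, 100, 0xAAAA };
    req[5] = (write_req_t){ 1, 100, 0xBBBB };    /* same address, same cycle   */
    clock_writes(req);
    printf("0x%X\n", (unsigned)shared_aux[100]); /* 0xAAAA: cog 3 won          */
    return 0;
}
```

Reads wouldn't need any arbitration at all if each COG really gets its own port.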
It's the cases without external memory, and those using LMM / compiled languages, etc., that bias me toward the HUB.
In my experience we don't need much HUB space at all.
In 24KB I can fit:
- C++ classes (with templating and inheritance)
- SD card, serial, I2C
- JSON
- 7 Cogs
- 9 distinct HW peripheral ICs
I've achieved this through some basic size optimizations:
- C++ inlining
- CMM, -Os
- no fcache
- no standard libraries
- no exceptions
- linker garbage collection
- Careful coding.
So, I don't think that more hub is needed in well crafted code.
Of course, if it was available for free then sure! Or more cogs. Or both. But it's not too difficult to fit complex programs in a fraction of the space.
Edit: If I compile the application with no optimization (LMM, no -Os, no GC, etc.) then the binary size is 106K. If I add in -Os and GC (but keep LMM) it gets down to 49K.
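For reference, the sort of PropGCC invocation that list boils down to is roughly this (flag spellings from memory, so check the propgcc docs before copying):

```
propeller-elf-g++ -Os -mcmm \
    -fno-exceptions -fno-rtti \
    -ffunction-sections -fdata-sections \
    -Wl,--gc-sections \
    -o app.elf main.cpp
```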
If I understand correctly, hub has a single interface, and aux has a 1.5 port interface (r/w hub/cog and r video gen). So aux will take a little more space per KB than hub.
If we have an extra 100KB hub space available, then that may equal 80KB aux (best educated guess)
Currently we have 256 aux longs.
Hub: Slower access, but available to all.
Cog: Private access, but lots faster.
If we increased (doubled) aux to 512 longs (nice 9bit address) that's an extra 256*8*4=8KB, still leaving ~90KB extra for Hub.
If we increased aux to 8KB = 2048 longs, this uses 8*8K - 256*8*4 = 56KB, leaving perhaps 24KB extra for hub
If we increased aux to 12KB, this would use 8*12KB - 8KB = 88KB, so this is perhaps borderline.
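Quick sanity check of those numbers (plain C, 8 cogs, 4 bytes per long):

```c
#include <stdio.h>

int main(void)
{
    const int cogs = 8, today_longs = 256;
    const int today = cogs * today_longs * 4;     /* 8 KB of AUX today          */
    const int sizes[] = { 512, 2048, 3072 };      /* proposed longs per cog     */
    for (int i = 0; i < 3; i++) {
        int extra = cogs * sizes[i] * 4 - today;
        printf("%4d longs/cog -> %2d KB extra AUX\n", sizes[i], extra / 1024);
    }
    return 0;   /* prints 8, 56 and 88 KB extra respectively */
}
```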
If we had big cogs, what would we gain?
(1) A big video buffer, but not big enough for a whole HD screen.
(2) A big stack for GCC / spin / etc, as well as variable space too.
(3) A Very Fast Overlay/Instruction Cache for LMM/XMM
QUESTIONS:
(1) Could the SPA/SPB be placed into Cog space $1F0-1F1 and be used like INDA/INDB, so instructions like AND/OR/XOR could also operate directly on Aux?
The basics have been proved with INDA/INDB, so is this simple to do or not?
(2) Could the RD/WR QUAD instruction(s) be extended to execute the transfer of a block of quad-longs in the background?
If either/both of these are easily solved, I would vote for the biggest Aux memory as I think it would provide real processing benefits, more than the hub would.
Otherwise I am on the fence leaning towards a mix of say 192KB hub (+64KB) and 3KB*8 Aux (+2KB*8), leaving a little on the die for later expansion.
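To show what question (2) would buy us, here's a hedged C sketch of double-buffering against a background block transfer; start_block_read()/wait_block_read() are imaginary stand-ins for a background RD/WR QUAD engine, not real instructions or library calls:

```c
#include <stdint.h>

#define BLOCK_LONGS 16

/* Imaginary stand-ins for a background block-transfer engine. */
extern void start_block_read(uint32_t *dst, uint32_t hub_addr, int longs);
extern void wait_block_read(void);
extern void process(const uint32_t *block, int longs);

/* Double-buffer: process one block while the next is fetched in the background. */
void stream(uint32_t hub_addr, int blocks)
{
    static uint32_t buf[2][BLOCK_LONGS];
    int cur = 0;

    start_block_read(buf[cur], hub_addr, BLOCK_LONGS);
    for (int i = 0; i < blocks; i++) {
        wait_block_read();                       /* block i has now landed     */
        if (i + 1 < blocks)                      /* start fetching block i + 1 */
            start_block_read(buf[cur ^ 1],
                             hub_addr + (uint32_t)(i + 1) * BLOCK_LONGS * 4,
                             BLOCK_LONGS);
        process(buf[cur], BLOCK_LONGS);          /* overlaps with that fetch   */
        cur ^= 1;
    }
}
```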
Man I go away for a bit and you guys go off the deep end!
I think this hub slot sharing business needs to go away or be very limited; I'm not even sure it really buys us anything (I mean in real actual usage cases, not just on paper). Chip, I think this really flies in the face of one of your main goals! The chip should be easy and fun to work with, not have weird gotchas and problems due to over-complication. Every time we talk you gripe about how flaky and annoying your desktop machine is; this kind of thing you are thinking about doing leads you down that slippery slope. If you want COGs to have faster bandwidth to HUB memory then figure out a way that does it uniformly for all of them, all the time, not this weird one-COG-gives-up-a-timeslot-to-another-COG scheme. Heater is right, it'll lead to all sorts of problems with object sharing, and no amount of guidelines or rules will stop that. Seriously, just say no to this timeslot sharing/giving-up/whatever!
As for what to do with extra die space, the obvious answer is more HUB memory; the chip is STARVED for HUB memory. Nothing else will help the chip more than having more HUB memory. AUX memory is exactly the size it needs to be; it doesn't need to be any bigger. Maybe if HUB or COG memory were already sufficiently large then we could start talking about making AUX bigger, but seriously, until HUB memory is at least an order of magnitude bigger, it's not big enough. I can't believe anyone would want anything else. We have 8 super-duper awesome powerhouse COGs sharing a measly 126K of HUB RAM. Adding more COGs would be silly.
Hi Roy, I totally agree that more hub RAM is useful as well, and having more will help plenty of applications fit into the device. It just won't increase data transfer speeds between COGs in transaction-oriented applications, such as cache drivers reading from external memory shared by multiple COGs (holding large data/video and VM code), or high-speed I/O drivers that want to use multiple COGs and need to share data between them quickly for even higher raw performance.
I saw this as one area where the P2 can get hit hard today, since it always has to negotiate every COG-to-COG data transfer through the hub memory pathway with its 8-cycle boundaries. A small amount of RAM shared between COGs could help alleviate this issue if the COGs can read/write it as quickly as they can AUX RAM. IMO the rest of the device appears to be pretty well considered, albeit with limited internal memory space in general, given its die size restrictions and choice of semiconductor process.
Having both some shared (8 port) RAM and lots more hub RAM would be the best of both worlds.
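For comparison, the path we're stuck with today is a hub mailbox, something like the sketch below. The layout and names are made up; it's just to show that every word pays hub-window latency twice, once for the writer and once for the reader:

```c
#include <stdint.h>

typedef struct {
    volatile uint32_t flag;                 /* 0 = empty, 1 = full */
    volatile uint32_t data;
} mailbox_t;

/* Producer cog: every store here is a hub access, so it waits for a hub slot. */
static void send(mailbox_t *m, uint32_t value)
{
    while (m->flag)                         /* wait until the last word is drained */
        ;
    m->data = value;
    m->flag = 1;
}

/* Consumer cog: same story in the other direction. */
static uint32_t receive(mailbox_t *m)
{
    while (!m->flag)
        ;
    uint32_t v = m->data;
    m->flag = 0;
    return v;
}

int main(void)
{
    static mailbox_t m;                     /* lives in hub RAM, visible to both cogs */
    /* Single-threaded demo; on the real chip send() and receive()
       would run in two different cogs. */
    send(&m, 0xCAFE);
    return receive(&m) == 0xCAFE ? 0 : 1;
}
```

With a small shared AUX window, that same handshake would be single-cycle on both sides.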
HUB RAM is functionality. Lots of it. Easily. You can put code in it that does stuff. You can write that code in C or Spin. HUB RAM enables many functionalities that are otherwise impossible.
AUX RAM is not. It won't get used by compilers and it won't get used by most programmers. It's special purpose. It is not generally useful. As Roy said, it's big enough already.
More COGs is a waste. We can already combine functionalities into a single COG with threads. There's no point having more CPUs to run code that you can't have because it won't fit in RAM. That's all out of balance.
It seems strange how the biggest advocates of more HUB RAM are the same people who are fanatically against letting the COGs that need it have faster access to it...
At some point the extreme orthogonality places an upper bound on what is feasible. As we move forward there are going to be things that would be nice to have, but the burden of giving every COG that feature is going to explode the die size.
While it's nice to say every cog is the same, it eventually becomes silly that the COG running the main app or a video driver has to be identical to the COG just blinking a few LEDs.
It's like saying because one vehicle in a convoy is a truck, all the others have to be trucks as well even though they are only carrying the family groceries instead of a load of televisions.
This is what I would do with the surplus die area:
1. Onboard 1.8V regulator, so the chip could be powered from 3.3V only.
2. Fatter power distribution buses. This would give less-than-optimal board layouts a better chance of success.
Anyway, that's what I think your OEM customers would appreciate more than enhanced feature-set complexity.
I thought the Propeller concept was to make a very capable general-purpose processor family which uses soft peripherals.
The moment you make it difficult for developers, particularly commercial developers, they will look elsewhere. Why would I choose a processor where I can't freely use a 6-channel 32-bit timer/PWM module alongside a USB device module without considering how my processor will allocate these things called 'slots' and how it will access system RAM?
The moment you try to make something excellent at everything you are doomed to failure.
2) COMPARISON WITH OTHER PROCESSORS
There is a fundamental difference between the P2 and other processor families. The others are available now; the P2 won't be on any distributor's shelf for at least 12 months.
Reading the 20 or so pages of posts that appeared over the weekend reminded me of an old proverb, "A bird in the hand is worth two in the bush", which www.phrases.org.uk tells me has the following meaning...
"It's better to have a lesser but certain advantage than the possibility of a greater one that may come to nothing."
Am I the only one that is a bit worried about removing the big DAC bus and connecting the fast DACs to certain cogs?
How can an object choose a pin-group for its VGA output, for example? It will need to start the VGA driver in a certain cog, but what if this cog is already used? I think without a totally variable mapping of the DAC pin-groups, the whole object concept of Spin is broken.
We may recommend, in the documentation of the object, to start the video driver as one of the first objects and use high pin numbers for the video connection.
But if you use many objects with fast DACs (audio, function generators and so on) then the cog allocation can only work if you start the drivers in exactly the right order.
It sounds cool to me too!
Something like two parenthesis-shaped groups of four COGs each!
To me, it tastes like an oval-shaped pizza! With an extra-long rectangular slice in the middle, i.e., the block of dual-port FIFO RAM! Very cool!
Yanomani
http://www.fujielectric.com/company/tech/pdf/r53-3/07.pdf
and another one here
https://www.intersil.com/en/support/packaginginfo/details.html?type=PLASTIC&fam=LQFP&pod=Q128.14X14A
That Intersil package had a die pad that was 6.9 x 6.9mm. Our die is 7.3 x 7.3mm, which is too large for all the exposed-pad leadframes that our packaging vendor offers. We could get a custom leadframe designed, but it's a few $10k's.
Thanks jmg, I've looked at this and found the same difficulties about die-space info.
Chip
I've noticed it too: the 6.9 x 6.9 versus your current die size limitation. What the heck!
Where is my new Dremel tool set???
Yanomani
Sounds like the simplest least risk and most beneficial option.
We might then be able to get that JavaScript or Micro Python engine in there.
More RAM helps fast LMM C code a lot.
More RAM means less need to add an external RAM which is complex, big, slow and eats pins.
12 COGs with the tiny existing RAM seems very unbalanced. It lowers HUB access bandwidth. It invokes Amdahl's law.
To echo Phil and Co. Let's freeze this design and get it out the door.
I second that.
The C language will love that.
The external size may not quite equate to the internal die area?
It might be possible to order some parts packaged in an EP, with not quite the right landing shape, to check the impact of EP bonding.
Even a TQFP144-EP could give useful info?
I did find this
Infineon :
Package          Leadframe Code       Ex x Ey (mm)
PG-LQFP-100-8    C66065-A6837-C021    7.5 x 7.5
PG-LQFP-144-16   C66065-A7014-C005    7.5 x 7.5
PG-LQFP-144-4    C66065-A6730-C021    9 x 9
PG-LQFP-144-13   C66065-A6730-C033    8 x 8
PG-LQFP-144-23   C66065-A7014-C022    7.5 x 7.5
and of course, there is always a BGA option ...
Some companies do one die, and bond into two packages.
jmg
I believe that an uneven thermal gradient should be totally avoided, both during soldering and under operating conditions.
Also, electric field mismatches are to be expected with a landing area smaller than the die, so that must be avoided too.
If we could find a source for the 7.5 x 7.5 version with 128 pins, then the current die could surely be placed evenly over the exposed pad's internal frame surface, but since it will not fit exactly, I suspect that a tooling setup fee will apply, then $$$.
Yanomani
P.S.
Despite the wasted silicon area, and the consequent costs, if the masks were spaced on a 7.5 x 7.5 grid, then some area would be freed up on sample dies for Chip's and Beau's signatures (and Ken's compliance seal too).
What a bonus for a winner run!
Yanomani
(ie for 8 cogs, it's 100kB extra hub RAM, vs 50% extra cogs, vs how much additional stack RAM?)
More HUB RAM.
I haven't exhausted COGs ever since PropellerGCC inline ASM and fcache support was ready. That makes fast serial IO, I2C, SPI, etc... easy to run all in one COG easing the requirement for more COGs. The only reason to use COGs now is for background operations that can't be blocked such as rxReady(), video-out, or a simple speedup. Everything else can be located in HUB RAM.
Worst case:
COG0 Main program: serial IO, I2C, SPI.
COG1 VGA
COG2 Keyboard
COG3 Mouse
COG4 SDcard
COG5 Optional FPU
COG6 Optional XMM Cache
COG7 Free
I can see where COG count could easily be a problem in SPIN without in-line ASM and fcache.
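For example, here's a bit-banged SPI byte send in plain C. In LMM mode PropGCC's FCACHE pulls a small loop like this into cog RAM so it runs at near-PASM speed, which is why I don't need a separate COG for it. Pin numbers are made up; OUTA/DIRA are the usual propeller.h register names:

```c
#include <propeller.h>

#define PIN_CLK  0        /* made-up pin assignments */
#define PIN_MOSI 1

/* Bit-bang one byte out, MSB first.  The loop is small enough for FCACHE
 * to run it from cog RAM. */
static void spi_send_byte(unsigned value)
{
    for (int i = 7; i >= 0; i--) {
        if (value & (1u << i))
            OUTA |=  (1u << PIN_MOSI);      /* data bit high */
        else
            OUTA &= ~(1u << PIN_MOSI);      /* data bit low  */
        OUTA |=  (1u << PIN_CLK);           /* clock high    */
        OUTA &= ~(1u << PIN_CLK);           /* clock low     */
    }
}

int main(void)
{
    DIRA |= (1u << PIN_CLK) | (1u << PIN_MOSI);   /* both pins as outputs */
    spi_send_byte(0xA5);
    return 0;
}
```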
Say 2K longs for AUX and the remainder added to HUB. (~86k)
How does this fit in with the SPx pointers and AUX instructions?
Edit: Even expand to 512 longs for 9 bit compatibility.