Would 4 x (4 cogs & 128KB) with a parallel interface between them work?
Cluso99
Posts: 18,069
It has been suggested on another thread, and I think some time ago too. Expanding on this a little...
4 sets of...
(a) 4 simple cogs
(b) 128KB hub @ 200MHz (maybe 160MHz)
(c) Only byte/word/long accesses to hub
(d) 1:4 slot access gives 1:2 instructions (note RD/WRxxxx take >= 2 instructions anyway): ~50 MIPS per cog in hubexec at 200MHz (~40 at 160MHz)
(e) hubexec would be simple (and could even be the only method) - no ICache, just run direct from hub
(f) requires relative/direct jmp/call/ret for hubexec and perhaps a larger internal LIFO (call/ret/push/pop) (say 16+ levels instead of 4)
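The hubexec figure in (d) can be checked with a quick calculation. This is only a rough sketch: it assumes round-robin hub access and exactly one instruction fetched per hub slot, and the function name is made up for illustration.

```python
def hubexec_mips(hub_clock_mhz: float, cogs_per_hub: int = 4) -> float:
    """Rough hubexec rate per cog, assuming one instruction fetch per hub slot."""
    # With round-robin access, each cog gets one slot every `cogs_per_hub` clocks.
    return hub_clock_mhz / cogs_per_hub


for clock in (200, 160):
    print(f"{clock} MHz hub: ~{hubexec_mips(clock):.0f} MIPS per cog")
```

Under those assumptions this gives about 50 MIPS at 200MHz and 40 MIPS at 160MHz, matching the numbers in (d).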
Now what would be required is a simple high-speed parallel interface between the 4 sets - maybe another tiny hub, or a small set of FIFOs between sets.
Might it be possible to configure the hub blocks initially between the 4 sets of 4 cores? Maybe in 32KB blocks???
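To make the 32KB-block idea concrete, here is a minimal sketch of what an initial configuration might look like. The 16-block total (4 x 128KB) and the function name are assumptions for illustration, not anything that exists in a design.

```python
BLOCK_KB = 32
TOTAL_BLOCKS = 16  # assumed: 4 sets x 128KB = 512KB in total


def configure_hub(blocks_per_set):
    """Return the hub size in KB for each set, given 32KB blocks per set."""
    if any(n < 0 for n in blocks_per_set):
        raise ValueError("block counts must be non-negative")
    if sum(blocks_per_set) != TOTAL_BLOCKS:
        raise ValueError(f"expected {TOTAL_BLOCKS} blocks in total")
    return [n * BLOCK_KB for n in blocks_per_set]


print(configure_hub([4, 4, 4, 4]))   # equal split: 128KB per set
print(configure_hub([10, 2, 2, 2]))  # one large hub, three small ones
```

The second call shows the kind of lopsided split that would suit one set needing a big buffer, with the other three sets left with 64KB each.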
Seriously, might this work ???
Postedit:
Still keeping the ADC, DAC, Security, GCC that have all been deemed essential.
Comments
It's not going to be cheaper and it's not going to be smaller, but it's much more likely to be sooner!
Maybe another improved version follows sooner rather than later though.
There was some space left over in the die, but nowhere near enough for 1MB of hub.
What if the first block had 512KB, and the other 3 blocks each had say 8KB or 16KB - could/would this work ??
One problem I see is that the DACs are on restricted pins. Could we live with that for video (i.e. where the 512KB hub would be required) ??
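For reference, the totals for this variant compared with the 4 x 128KB proposal can be tallied as below. This is plain arithmetic with a made-up helper name, and it says nothing about whether either layout fits the die.

```python
def total_hub_kb(main_kb: int, small_kb: int, small_sets: int = 3) -> int:
    """Total hub RAM for one large block plus several small ones."""
    return main_kb + small_sets * small_kb


baseline = 4 * 128  # the equal-split proposal, 512KB
for small in (8, 16):
    total = total_hub_kb(512, small)
    print(f"512KB + 3 x {small}KB = {total}KB ({total - baseline}KB over 4 x 128KB)")
```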
If we had hubexec as I explained above, and a largish LIFO, could we live with a smaller cog space of 1KB (256 longs) ??
We're talking separate ideas then.
Any major deviation from a single central hub is pretty far out there. The physical layout becomes a factor (it already is). One could go down the nearest-neighbour grid-connected route, or maybe islands of small hubs with linkage to a hierarchy above that. There will be tons of other variations, but I don't see how any alternative would be easier to complete than the current P16 (P1+).
Actually, the grid arrangement, like any design that has a large RAM block per core, would reconfigure the Cog to need only a small collection of general registers. CogRAM as we know it would vanish.
That would make more sense than an equal split to 128KB, and is really a variant on the common 'more COG RAM' idea.
COG RAM, being 4-ported, will be about 2x the area of other RAM (let's call it IDATA).
That does give choices of where to split, but one problem with pulling down from P1, is compatibility is broken.
More local IDATA, combined with a HUB Slot table, would boost the MIPS in two ways, as the HUB slots could be used more for Data than for CODE, since code would tend to be more local.
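As a rough illustration of what a slot table buys, the sketch below assumes the hub steps through a table of cog numbers, one entry per clock, and works out each cog's share of hub accesses. The table format and names are hypothetical, not a description of any actual design.

```python
from collections import Counter


def slot_shares(slot_table, hub_clock_mhz=200):
    """Millions of hub accesses per second for each cog in the table."""
    counts = Counter(slot_table)
    # Each table entry is one hub clock; a cog's share is its fraction of entries.
    return {cog: hub_clock_mhz * n / len(slot_table)
            for cog, n in sorted(counts.items())}


round_robin = [0, 1, 2, 3]
weighted = [0, 1, 0, 2, 0, 1, 0, 3]  # cog 0 gets every other slot

print(slot_shares(round_robin))
print(slot_shares(weighted))
```

With the weighted table, a cog that mostly moves data gets twice the round-robin rate, while cogs running mainly from local memory give up slots they would not have used.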
This does make COGS less equal, and would need more project-planning to select COGs.
I guess this needs some use-examples, showing code that fails to fit into a COG, but can run nicely in a COG with 2-4x the local CODE ability.
I was a little loose in my terminology.
The point I was trying to make is that the register memory is 4-ported (but implemented as muxed 2 ports on 2 cycles).
The additional IDATA.Code memory (your 8KB~16KB) does not need as many ports, so it can be more compact.
No, the opcode count has not (yet) been broken.
Reducing code space seems a backward step. I guess a design could always augment any reduced registers with additional IDATA.Code memory, but you did not actually say that.
It still makes it less P1-like.
Not quite.
Where this bites is in the DAC mapping, which Chip has stated is COG-fixed (and thus pin-fixed). That makes being able to select any COG more important.
That is one of the driving factors behind the Table mapping. You do not have to pre-select pins early in a design.
If Pins constrain COGs and COGs have varying (non-mappable) resources, a user can find an unexpected PCB redesign on the agenda. Not so nice.