Would 4 x (4 cogs & 128KB) and parallel interface between them work ???

Cluso99 · 2014-05-09 22:27

It has been suggested on another thread, and I think some time ago too. Expanding on this a little...

4 sets of...
(a) 4 simple cogs
(b) 128KB hub @ 200MHz (maybe 160MHz)
(c) Only byte/word/long accesses to hub
(d) 1:4 slot access gives 1:2 instructions (note RD/WRxxxx take >= 2 instructions anyway) ~50(40) MIPS per cog in hubexec
(e) hubexec would be simple (and could even be the only method) - no ICache, just run direct from hub
(f) required relative/direct jmp/call/ret for hubexec and perhaps larger internal LIFO (call/ret/push/pop) (say 16+ levels instead of 4)
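
A quick check of the numbers in (b) and (d), as a minimal Python sketch. It assumes round-robin hub slots, one instruction long fetched per slot, and two clocks per instruction (inferred from the "1:4 slot access gives 1:2 instructions" ratio). The function names are hypothetical.

```python
def hubexec_mips(hub_mhz: float, cogs_per_hub: int = 4) -> float:
    """Instructions per microsecond per cog when every instruction
    is fetched from hub and each cog gets one slot per rotation."""
    return hub_mhz / cogs_per_hub


def instructions_per_slot(cogs_per_hub: int = 4,
                          clocks_per_instruction: int = 2) -> float:
    """How many instruction times pass between a cog's hub slots.
    clocks_per_instruction=2 is an assumption, not a confirmed figure."""
    return cogs_per_hub / clocks_per_instruction


print(hubexec_mips(200))        # 50.0
print(hubexec_mips(160))        # 40.0
print(instructions_per_slot())  # 2.0
```

So a cog running purely from hub would see a slot every second instruction time, which is where the ~50 (40) MIPS figure comes from.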

Now what would be required is a simple HS parallel interface between the 4 sets - maybe another tiny hub, or a small set of fifos between sets.

Might it be possible to configure the hub blocks initially between the 4 sets of 4 cores? Maybe in 32KB blocks???
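
To make the 32KB-block idea concrete, here is a small sketch of what a start-up allocation could look like. It assumes the total stays at 4 x 128KB = 512KB (16 blocks of 32KB) and that every block must belong to exactly one set. `allocate_hub` is a hypothetical helper, not anything that exists in the design.

```python
BLOCK_KB = 32
TOTAL_KB = 4 * 128                    # 512KB across all four sets (assumed)
TOTAL_BLOCKS = TOTAL_KB // BLOCK_KB   # 16 blocks


def allocate_hub(blocks_per_set):
    """Return the hub size in KB for each of the 4 sets,
    given how many 32KB blocks each set is assigned."""
    if len(blocks_per_set) != 4:
        raise ValueError("expected one block count per set (4 sets)")
    if any(n < 0 for n in blocks_per_set):
        raise ValueError("block counts cannot be negative")
    if sum(blocks_per_set) != TOTAL_BLOCKS:
        raise ValueError("all 16 blocks must be assigned exactly once")
    return [n * BLOCK_KB for n in blocks_per_set]


print(allocate_hub([4, 4, 4, 4]))    # [128, 128, 128, 128]
print(allocate_hub([13, 1, 1, 1]))   # [416, 32, 32, 32]
```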

Seriously, might this work ???

Postedit:
Still keeping the ADC, DAC, Security, GCC that have all been deemed essential.

evanh · 2014-05-09 22:41

As a cheaper/smaller variant of the current P16? I'd say great for an iteration but not now.

Cluso99 · 2014-05-09 22:51

evanh wrote: »

As a cheaper/smaller variant of the current P16? I'd say great for an iteration but not now.

evanh: What's missing or doesn't work for you? "Not now" doesn't help at all.

It's not going to be cheaper and it's not going to be smaller, but it's much more likely to be sooner!
Maybe another improved version follows sooner rather than later though.

Cluso99 · 2014-05-09 23:17

Let's presume Chip/we can solve the comms between the 4 sets (blocks)...

There was some space left over in the die, but nowhere near enough for 1MB of hub.

What if the first block had 512KB, and the other 3 blocks each had say 8KB or 16KB each - could/would this work ??
One problem I see is the DACs are on restricted pins. Could we live with that for video (ie where the 512KB hub would be required) ??
If we had hubexec as I explained above, and a largish LIFO, could we live with a smaller cog space of 1KB (256 longs) ??
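
For scale, the totals implied by this split can be tallied quickly. This is just arithmetic on the figures above; the names are made up for illustration.

```python
BIG_HUB_KB = 512        # first block
SMALL_BLOCKS = 3        # the other three blocks


def total_hub_kb(small_hub_kb: int) -> int:
    """Total hub RAM in KB for one 512KB block plus three small blocks."""
    return BIG_HUB_KB + SMALL_BLOCKS * small_hub_kb


print(total_hub_kb(8))    # 536
print(total_hub_kb(16))   # 560

# A 1KB cog space is 256 longs of 4 bytes each.
print(256 * 4)            # 1024
```

Either way the total stays a little over half of the 1MB that would not fit.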

evanh · 2014-05-09 23:35

Cluso99 wrote: »

It's not going to be cheaper and it's not going to be smaller, ...

We're talking separate ideas then.

Any major deviation from a single central hub is pretty far out there. The physical layout becomes a factor (it already is). One could go down the nearest-neighbour grid-connected route, or maybe islands of small hubs with linkage to a hierarchy above that. There will be tons of other variations, but I don't see how any alternative would be easier to complete than the current P16 (P1+).

Actually, the grid arrangement, as would any design that has a large RAM block per core, would reconfigure the Cog to only need a small collection of general registers. CogRAM as we know it would vanish.

jmg · 2014-05-10 00:19

Cluso99 wrote: »

What if the first block had 512KB, and the other 3 blocks each had say 8KB or 16KB each - could/would this work ??

That would make more sense than an equal split into 128KB blocks, and is really a variant on the common 'more COG RAM' idea.

Cluso99 wrote: »

One problem I see is the DACs are on restricted pins. Could we live with that for video (ie where the 512KB hub would be required) ??
If we had hubexec as I explained above, and a largish LIFO, could we live with a smaller cog space of 1KB (256 longs) ??

COG RAM, being 4 ported, will be about 2x the area of other ram ( let's call it IDATA )

That does give choices of where to split, but one problem with pulling down from P1, is compatibility is broken.

More local IDATA, combined with a HUB Slot table would boost the MIPS in two ways, as the HUB slots could be used more for Data than CODE, which would tend to be more local.
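
A rough illustration of what a slot table changes, as a Python sketch. The table format here (a list of cog numbers, one per hub slot, repeated forever) is an assumption for illustration only, not the proposed hardware format.

```python
from collections import Counter
from fractions import Fraction


def slot_share(table):
    """Fraction of hub slots each cog receives from a repeating slot table."""
    counts = Counter(table)
    return {cog: Fraction(n, len(table)) for cog, n in counts.items()}


round_robin = [0, 1, 2, 3]             # every cog gets 1 slot in 4
weighted = [0, 1, 0, 2, 0, 1, 0, 3]    # cog 0 gets every second slot

print(slot_share(round_robin)[0])   # 1/4
print(slot_share(weighted)[0])      # 1/2
print(slot_share(weighted)[3])      # 1/8
```

A cog that runs its code from local memory can then give up slots it does not need, and a data-heavy cog can take them.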

This does make COGS less equal, and would need more project-planning to select COGs.

I guess this needs some use-examples, showing code that fails to fit into a COG, but can run nicely in a COG with 2-4x the local CODE ability.

Cluso99 · 2014-05-10 01:32

jmg wrote: »

COG RAM, being 4 ported, will be about 2x the area of other ram ( let's call it IDATA )

What??!! Cog ram is now 2 ported (the new P1+/P2/P16)

jmg wrote: »

That does give choices of where to split, but one problem with pulling down from P1, is compatibility is broken.

P1 compatibility is already broken.

jmg wrote: »

More local IDATA, combined with a HUB Slot table would boost the MIPS in two ways, as the HUB slots could be used more for Data than CODE, which would tend to be more local.

With only 4 cogs sharing a block, no hub slot table is necessary.

jmg wrote: »

This does make COGS less equal, and would need more project-planning to select COGs.

Surely we have come to the conclusion that not all cogs are, or need to be, equal. Even heater has come to acknowledge this. Let the user plan a little more. The planning required here would be minimal, but not quite as simple as just adding cogs/objects.

jmg wrote: »

I guess this needs some use-examples, showing code that fails to fit into a COG, but can run nicely in a COG with 2-4x the local CODE ability.

I think most of us who have used the P1 in a number of projects would understand this without examples. Obviously the larger hub the better, but remember on these smaller hub cogs there would be no video taking up the hub.

jmg · 2014-05-10 01:56

Cluso99 wrote: »

What??!! Cog ram is now 2 ported (the new P1+/P2/P16)

I was a little loose in my terminology.
The point I was trying to make is the register memory is 4 ported (but implemented as muxed 2 ports on 2 cycles).
The additional IDATA.Code memory ( your 8KB ~16KB ) does not need as many ports, so it can be more compact.
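
To show what "muxed 2 ports on 2 cycles" means, here is a toy Python model. It assumes the four accesses per instruction are instruction fetch, source read, destination read and result write, and the way they are paired onto the two clocks is a guess for illustration. It ignores pipelining entirely.

```python
class DualPortRam:
    """Register RAM that allows at most two accesses per clock."""

    def __init__(self, longs=512):
        self.mem = [0] * longs
        self.used = 0

    def tick(self):
        # Start a new clock: both ports are free again.
        self.used = 0

    def _claim_port(self):
        if self.used >= 2:
            raise RuntimeError("more than two accesses in one clock")
        self.used += 1

    def read(self, addr):
        self._claim_port()
        return self.mem[addr]

    def write(self, addr, value):
        self._claim_port()
        self.mem[addr] = value & 0xFFFFFFFF


def run_add(ram, pc, d, s):
    """Model ADD D,S as four accesses spread over two clocks."""
    ram.tick()                    # clock 1
    _opcode = ram.read(pc)        # access 1: instruction fetch
    src = ram.read(s)             # access 2: source read
    ram.tick()                    # clock 2
    dst = ram.read(d)             # access 3: destination read
    ram.write(d, dst + src)       # access 4: result write


ram = DualPortRam()
ram.mem[10], ram.mem[11] = 5, 7
run_add(ram, pc=0, d=10, s=11)
print(ram.mem[10])   # 12
```

Memory that only ever feeds one instruction fetch per cycle needs none of this, which is the point about it being more compact.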

Cluso99 wrote: »

P1 compatibility is already broken.

No, the opcode count has not (yet) been broken.
Reducing code space seems a backward step, but I guess a design could always augment any reduced registers with additional IDATA.Code memory, but you did not actually say that.
It still makes it less-P1 like.

Cluso99 wrote: »

Surely we have come to the conclusion that not all cogs are, or need to be, equal. Even heater has come to acknowledge this.

Not quite.
Where this bites is in the DAC mapping, and the way Chip has stated that it is COG-fixed (and thus pin-fixed). That makes being able to select any COG more important.
That is one of the driving factors behind the Table mapping. You do not have to pre-select pins early in a design.

If pins constrain COGs and COGs have varying (non-mappable) resources, a user can find an unexpected PCB redesign on the agenda. Not so nice.
