Hub Slot Mapping
Bill Henning
I originally proposed 128 slots, but 64 slots would be enough (128 still provides finer control).
9 bits per entry, defined as:
Rccccmmmm
R - reset table index counter (for jmg)
cccc - cog this hub cycle is assigned to
mmmm - if cog cccc does not need the slot, give it to mooching cog mmmm (great idea, Ray)
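To make the entry format concrete, here is a minimal Python sketch of packing and decoding one 9-bit entry and deciding who gets the slot. It assumes the fields are laid out most-significant first exactly as written (R in bit 8, cccc in bits 7..4, mmmm in bits 3..0), and that R means "wrap to entry 0 after this one". The function names are made up for illustration; none of this is actual P2 logic.

```python
def pack_entry(reset, cog, mooch):
    """Pack R, cccc, mmmm into one 9-bit value (assumed layout: R | cccc | mmmm)."""
    if reset not in (0, 1) or not 0 <= cog <= 15 or not 0 <= mooch <= 15:
        raise ValueError("field out of range")
    return (reset << 8) | (cog << 4) | mooch


def unpack_entry(entry):
    """Return (reset, cog, mooch) from a 9-bit entry."""
    return (entry >> 8) & 1, (entry >> 4) & 0xF, entry & 0xF


def grant(entry, needs_hub):
    """Pick the cog that gets this hub cycle.

    needs_hub is the set of cog ids requesting the hub this cycle.
    The assigned cog wins; otherwise the moocher gets it; otherwise nobody.
    """
    _, cog, mooch = unpack_entry(entry)
    if cog in needs_hub:
        return cog
    if mooch in needs_hub:
        return mooch
    return None


def next_index(index, entry, table_size):
    """Advance the table index; an entry with R set wraps back to 0 (assumed meaning)."""
    reset, _, _ = unpack_entry(entry)
    return 0 if reset else (index + 1) % table_size
```

For example, an entry with R=1, cog 3 and moocher 7 packs to 0x137, and if only cog 7 wants the hub on that cycle, cog 7 gets the slot.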
Personally, I strongly prefer a table size that is a multiple of 16, as reset/top cog will break Obex; however, it can be useful in bare-metal cases.
The table should have at least 64 entries so low and medium speed drivers can donate most of their slots.
It must be writable by all cogs, as otherwise all objects will need modification before use.
If objects carry info on their needs (e.g. "needs 4/64 slots"), the loader could make the table automatically.
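As a rough illustration of the loader idea, the sketch below builds a 64-entry table from per-cog slot requests such as "4 of 64". The spreading strategy (largest request first, evenly strided, linear probing on collisions) and the names `build_table`, `filler_cog` and `mooch_cog` are assumptions made for this example, not something specified in the proposal.

```python
def build_table(requests, size=64, filler_cog=0, mooch_cog=0):
    """Build a slot table from {cog: slots_needed}.

    Returns a list of (cog, mooch) pairs, one per table entry.
    Unrequested slots go to filler_cog; every entry names mooch_cog
    as the moocher (both are hypothetical parameters).
    """
    if sum(requests.values()) > size:
        raise ValueError("more slots requested than the table holds")
    owners = [None] * size
    # Place the biggest requests first so they get the most even spacing.
    for cog, count in sorted(requests.items(), key=lambda kv: -kv[1]):
        for k in range(count):
            idx = (k * size) // count          # ideal evenly spaced position
            while owners[idx] is not None:     # collision: take the next free slot
                idx = (idx + 1) % size
            owners[idx] = cog
    return [(filler_cog if c is None else c, mooch_cog) for c in owners]


if __name__ == "__main__":
    table = build_table({1: 4, 2: 16}, filler_cog=0, mooch_cog=5)
    print([cog for cog, _ in table][:8])
```

A real loader would also have to decide what to do when requests conflict with timing requirements; this sketch only shows that the table can be derived mechanically from the declared slot counts.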
Leaving this huge performance boost out due to fear is limiting the app space the chip could compete in, therefore dumb.
Comments
Works for me, but right now my test cases show mooch has penalties, i.e. the idea is good, shame about the speed.
I am about to try use-more-silicon, but that makes larger tables more costly, and larger tables are slower...
An appeal of 32x is that a WrQUAD can achieve a full atomic set/change.
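The arithmetic behind the 32x point can be sketched as follows, assuming a quad is four 32-bit longs (128 bits) and each of the 32 entries is a bare 4-bit cog id with no mooch field. The nibble ordering and the name `pack_quad` are illustrative assumptions only.

```python
def pack_quad(slots):
    """Pack 32 four-bit cog ids into four 32-bit longs (one quad, 128 bits).

    Assumed ordering: slot 0 is the low nibble of long 0.
    """
    if len(slots) != 32:
        raise ValueError("expected exactly 32 slots")
    longs = [0, 0, 0, 0]
    for i, cog in enumerate(slots):
        if not 0 <= cog <= 15:
            raise ValueError("cog id out of range")
        longs[i // 8] |= cog << (4 * (i % 8))
    return longs
```

Since the whole table fits in one quad under these assumptions, a single write replaces every entry at once, which is what makes the update atomic. A 64-entry table, or one with 9-bit entries, would not fit.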
Simpler mooch is possible:
1) only two cogs can mooch; the even cog can use spare even cycles, the odd cog can use spare odd cycles
2) four cogs can mooch, but each only from unused slots of cogs with the same cogid mod four
Both should be faster in logic.
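One way to read the two simpler schemes is sketched below. It assumes a plain 16-slot round-robin where cycle n belongs to cog n mod 16, so "even cycle" and "cog id mod 4" can both be derived from the cycle number. The function name and the `moochers` list are hypothetical: pass two cog ids for scheme 1, four for scheme 2.

```python
def simple_mooch_grant(cycle, needs_hub, moochers):
    """Decide who gets a hub cycle under the simplified mooch rules.

    cycle     - running hub cycle number (owner assumed to be cycle % 16)
    needs_hub - set of cog ids requesting the hub this cycle
    moochers  - list of 2 or 4 cog ids; moochers[k] may only take spare
                cycles whose number is congruent to k mod len(moochers)
    """
    if len(moochers) not in (2, 4):
        raise ValueError("expected 2 or 4 moochers")
    owner = cycle % 16
    if owner in needs_hub:
        return owner                      # the owner always has priority
    candidate = moochers[cycle % len(moochers)]
    return candidate if candidate in needs_hub else None
```

The point of both schemes is that the candidate moocher is fixed by a couple of low bits of the cycle counter, so no per-slot table lookup is needed to find it.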
Sure, but atomic handling needs to be implemented, and it has area and speed costs.
If it can be made atomic, comes in off the critical path, and has negligible silicon cost, then x64 is fine.
The speed problem is not in the mapping of COGid (once you have to change 1 bit, there is no speed cost to change all 4)
- the speed issue is in the fetch-time checking and indirection.
See my other post
http://forums.parallax.com/showthread.php/155561-A-32-slot-Approach-(was-An-interleaved-hub-approach)?p=1265817&viewfull=1#post1265817
More parallel logic did not really help, but another pipeline stage to resolve F_NeedsHUB and delay the AllocCOG result does manage to give (Mooch_x15 & x32) @ ~207.684MHz, just a bit slower than x32.
These pipelines (hopefully) tap-into, and run in Parallel with existing pipelines and opcode 'need info now' timings.