A 32-slot Approach (was: An interleaved hub approach)
Seairth
Posts: 2,474
(COPIED/MOVED FROM THE P1+ THREAD. If this is actually an already-explored idea, my apologies. I've lost track of a few of them.)
The default is 1:16, but there is a mode, not discussed in any novice learning books, that will do 1:8 for cogs 0-7 and 1:32 for cogs 8-15.
For example 0-7 are used for VGA + sprites, cogs 8-15 are for joystick/keyboard/uart etc.
One extra mode is enough to keep it simple; no 1:4 or 1:2 could possibly be needed.
That split won't work. Cogs 0-7 alone would take all the hub slots. You can, however, do:
4 @ 1:8
4 @ 1:16
8 @ 1:32
One such pattern (but by no means the only one) would be 01234567012389AB012345670123CDEF, or:
              111111 11112222 22222233
   01234567 89012345 67890123 45678901
  ------------------------------------
0 |H       |H       |H       |H
1 | H      | H      | H      | H
2 |  H     |  H     |  H     |  H
3 |   H    |   H    |   H    |   H
  ------------------------------------
4 |    H   |        |    H   |
5 |     H  |        |     H  |
6 |      H |        |      H |
7 |       H|        |       H|
  ------------------------------------
8 |        |    H   |        |
9 |        |     H  |        |
A |        |      H |        |
B |        |       H|        |
C |        |        |        |    H
D |        |        |        |     H
E |        |        |        |      H
F |        |        |        |       H
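As a quick sanity check on the pattern above, here is a minimal Python sketch (the function names are mine, purely for illustration) that counts how many of the 32 slots each cog receives and confirms that the slot budget adds up to exactly one full hub rotation. It also shows why the quoted 8 @ 1:8 plus 8 @ 1:32 split cannot fit.

```python
from collections import Counter
from fractions import Fraction

# One hex digit (cog id) per hub slot, 32 slots in total.
PATTERN = "01234567012389AB012345670123CDEF"

def slot_ratios(pattern):
    """Map each cog to N, where the cog gets 1 slot in every N (1:N)."""
    counts = Counter(int(ch, 16) for ch in pattern)
    return {cog: len(pattern) // n for cog, n in sorted(counts.items())}

def hub_load(ratios):
    """Fraction of all hub slots consumed by a list of 1:N ratios."""
    return sum(Fraction(1, n) for n in ratios)

ratios = slot_ratios(PATTERN)
groups = Counter(ratios.values())
print(dict(groups))                      # {8: 4, 16: 4, 32: 8}
print(hub_load(ratios.values()))         # 1

# The rejected split: cogs 0-7 at 1:8 and cogs 8-15 at 1:32.
print(hub_load([8] * 8 + [32] * 8))      # 5/4, i.e. more slots than exist
```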
I suppose Chip could just hard-code the interleave pattern and the user would either run in the standard 16-cycle mode, or enable the interleaved mode and load cogs accordingly (HUBCYCA is default, HUBCYCB is interleaved).
Or, if you had a "HUBCYCB D, S", where D pointed to a register that contained 4 4-bit fields for the 1:8 slots and 4 4-bit fields for the 1:16 slots, and S pointed to a register that contained 8 4-bit fields for the 1:32 slots (where the 4-bit fields indicated the cog number), you could provide the custom interleave (including doubling some cogs and starving other cogs, particularly those that aren't running).
For instance, you could get 4 cogs @ 1:8, 8 cogs @ 1:16, and 4 cogs that have no access (which is particularly okay if they aren't running).
        HUBCYCB map48, map16

map48   long    $01234567
map16   long    $89AB89AB
              111111 11112222 22222233
   01234567 89012345 67890123 45678901
  ------------------------------------
0 |H       |H       |H       |H
1 | H      | H      | H      | H
2 |  H     |  H     |  H     |  H
3 |   H    |   H    |   H    |   H
  ------------------------------------
4 |    H   |        |    H   |
5 |     H  |        |     H  |
6 |      H |        |      H |
7 |       H|        |       H|
  ------------------------------------
8 |        |    H   |        |    H
9 |        |     H  |        |     H
A |        |      H |        |      H
B |        |       H|        |       H
C |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx
D |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx
E |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx
F |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx
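To make the D/S expansion concrete, here is a rough Python model of how a hypothetical HUBCYCB could turn the two registers into the 32-slot sequence. The field order (most significant nibble first) and the placement of the fields are my assumptions, inferred from the two diagrams; nothing here describes real hardware.

```python
def nibbles(reg):
    """Split a 32-bit value into eight 4-bit fields, most significant first."""
    return [(reg >> shift) & 0xF for shift in range(28, -4, -4)]

def hubcycb_table(d, s):
    """Expand D (4 x 1:8 fields, 4 x 1:16 fields) and S (8 x 1:32 fields)
    into a 32-entry slot table laid out like the diagrams above."""
    d8, s8 = nibbles(d), nibbles(s)
    fast, mid = d8[:4], d8[4:]
    return fast + mid + fast + s8[:4] + fast + mid + fast + s8[4:]

def as_hex(table):
    return "".join(f"{cog:X}" for cog in table)

# Cogs 8..B listed twice in S: they are promoted to 1:16, cogs C..F get nothing.
print(as_hex(hubcycb_table(0x01234567, 0x89AB89AB)))
# -> 01234567012389AB01234567012389AB
```

Passing $89ABCDEF as S instead reproduces the first pattern, 01234567012389AB012345670123CDEF.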
Actually, with this, you could even leave the first 8 cogs on the same schedule (4 @ 1:8 and 4@1:16) and swap out the last 8 as demand required. And so on!
Oh, what fun!
Comments
Close - the mapping idea is sound, but I think this needs a 32 entry variable table, rather than a ROM table, or splintered tables.
A table of 32x can use WRQUAD to change, and allows 50% resource sharing.
The symmetric 32x variable table also solves/avoids the all-COGS-are-not-actually-quite-equal issues some have raised.
Any COGid can be placed in any order. All cogs are interchangeable.
Here are some examples from another thread, done using a 32x Table+Reload, set for no-jitter.
Key point: HUB Slot Scan Rate (SSR) is not just about bandwidth; it also impacts tight code-loop granularity.
Some of the many possible operational [Table+Reload] choices are
With a COG map, Table+Reload, I can get further on a PAL Video + USB timing Budget
eg I have a MOD 28 scanner, a PAL fSYS at 38x Burst, (1.684775125e8) and I have a 12MHz USB Data Sync DPLL that can lock +/- 0.285 % with a centre point that is ~ 15ppm off.
I have 2 'fast' COGS running @ 20ns SSR & 14 'slow' cogs @ 140ns SSR, both jitter free.
(the fast and slow refer only to the SSR, in all other aspects COGS are of course identical)
The Fast COG SSR, can co-operate with my DPLL in a way that is impossible with a locked 1:16 rate.
Other SW issues may occur, but the timing budget moves to 'possible' on this simple example.
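The 2-fast/14-slow case can be modelled in a few lines of Python. This is only a sketch under my own assumptions: a 5 ns slot time (inferred from 20 ns being 4 slots), cogs 0 and 1 as the fast pair, and a simple alternating layout. It shows that a scan which reloads at 28 gives every cog a single, constant gap, while the same entries run out to a fixed 32 make the slow cogs jitter.

```python
SLOT_NS = 5  # assumed slot time; not stated explicitly in the post

def build_table(fast, slow):
    """Even slots cycle through the fast cogs, odd slots walk the slow cogs."""
    return [fast[(i // 2) % len(fast)] if i % 2 == 0 else slow[i // 2]
            for i in range(2 * len(slow))]

def gaps(table, reload, cog):
    """Set of slot distances between consecutive grants to `cog`
    when the scan counter wraps after `reload` entries."""
    hits = [i for i in range(reload) if table[i] == cog]
    return {(hits[(k + 1) % len(hits)] - hits[k]) % reload or reload
            for k in range(len(hits))}

table28 = build_table(fast=[0, 1], slow=list(range(2, 16)))

print({g * SLOT_NS for g in gaps(table28, 28, 0)})   # {20}  fast cog, ns
print({g * SLOT_NS for g in gaps(table28, 28, 5)})   # {140} slow cog, ns

# Same entries padded out to a fixed 32-slot scan (no reload value):
table32 = table28 + table28[:4]
print(sorted(gaps(table32, 32, 2)))                  # [4, 28] -> jitter
```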
Then the following (and more) patterns could be accomplished:
And then there are all the asymmetrical combinations!
1:8 for 4 cogs = 100% boost for them, hub access spaced evenly apart in a 3:24 fashion
and then
1:24 for the remaining 12 cogs for a rather small drop of 33%
So the 4 Boost-Cogs share every other slot (for a total of 12 out of 24).
The other 12 get one each every 24 slots. interleaved with the 4 Boost cogs.
With just these 2 modes (1:16 and 3:24/1:24) fixed in hardware, it will be easy to understand: hubex will state that this "main routine" uses 3:24,
and other generic hubex routines should state that they work fine with 1:16 and/or 1:24 hub access.
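The percentages quoted for the 3:24/1:24 mode are easy to check. In the Python sketch below, the choice of cogs 0-3 as the boost cogs and the exact slot order are my assumptions for illustration only.

```python
from fractions import Fraction

BASE = Fraction(1, 16)  # share of hub slots per cog in the default 1:16 mode

def change_vs_default(share):
    """Percent change in hub bandwidth relative to 1:16."""
    return float((share - BASE) / BASE * 100)

boost_share = Fraction(3, 24)   # 4 boost cogs, effectively 1:8
other_share = Fraction(1, 24)   # remaining 12 cogs

print(change_vs_default(boost_share))            # 100.0
print(round(change_vs_default(other_share), 1))  # -33.3

# One possible 24-slot layout: boost cogs on even slots, the rest on odd slots.
boost, others = [0, 1, 2, 3], list(range(4, 16))
pattern = "".join(
    f"{boost[(i // 2) % 4] if i % 2 == 0 else others[i // 2]:X}"
    for i in range(24)
)
print(pattern)  # 0415263708192A3B0C1D2E3F
```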
I think the instruction can use a 16 entry table (two registers), which I feel would give enough combinations. I'd expect that the two registers would actually get remapped to an internal 32-element table. So $01234567, $89ABCDEF would actually end up looking like $04182539061A273B041C253D061E273F (based on the other interleaved pattern I just posted). Then the hub just loops over the elements to select the next cog.
(Don't you like how I casually throw the "just" in there like this is no effort at all? Yes, I know this isn't trivial.)
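For what it's worth, the remapping in that example can be written down compactly. The rule in this Python sketch is reverse-engineered from the single hex string above (even slots take the 1:8 fields, slots with index % 4 == 1 take the 1:16 fields, the rest take the 1:32 fields), so treat it as an assumption rather than a specification.

```python
def nibbles(reg):
    """Split a 32-bit value into eight 4-bit fields, most significant first."""
    return [(reg >> shift) & 0xF for shift in range(28, -4, -4)]

def remap_interleaved(d, s):
    """Expand the 16-entry table (D, S) into the internal 32-entry table."""
    d8, s8 = nibbles(d), nibbles(s)
    table = []
    for slot in range(32):
        if slot % 2 == 0:                 # 1:8 band: D fields 0..3
            table.append(d8[(slot // 2) % 4])
        elif slot % 4 == 1:               # 1:16 band: D fields 4..7
            table.append(d8[4 + (slot // 4) % 4])
        else:                             # 1:32 band: S fields 0..7
            table.append(s8[slot // 4])
    return table

internal = remap_interleaved(0x01234567, 0x89ABCDEF)
print("".join(f"{cog:X}" for cog in internal))
# -> 04182539061A273B041C253D061E273F
```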
That would work too. I prefer the 2^n groupings (if it wasn't obvious by my other thread). The 8/16/32 gives three general bands to work within, as well as the ability to promote some of those slots to a faster band (at the cost of fewer cogs being active and/or some having no hub access). Your approach would obviously allow some of the same "promotion" schemes, too.
SETHUBM #n
This sets the hub to one of three modes:
0 : 16@1:16
1 : 4@1:8, 4@1:16, 8@1:32
2 : 4@1:8, 12@1:24
(any other groupings?)
SETHUBP D, S
This sets the 32-entry hub pattern table, according to the current hub mode:
0: D contains the first 8 1:16 slots, S contains the second 8 1:16 slots
1: D contains the 4 1:8 slots and 4 1:16 slots, S contains the 8 1:32 slots
2: D contains the 4 1:8 slots and the first 4 1:24 slots, S contains the last 8 1:24 slots
The nice thing here is that the hub itself is simply iterating over the 32 entry table at all times. All the SETHUBM is doing is defining the translation used for the 16-entry table passed by SETHUBP (see two posts back for the example of mode #1).
And, of course, the chip would start up in mode 0 (16@1:16), with the internal map set to $01234567_89ABCDEF_01234567_89ABCDEF (the equivalent of SETHUBP $01234567, $89ABCDEF).
SETHUBP D
points to 4 registers that contain the entire 32-entry map. No need for SETHUBM at all. And, obviously, the chip would still start up with $01234567_89ABCDEF_01234567_89ABCDEF.
This seems so simple and is super easy to understand. And, if SETHUBP is never called, then you get exactly what the current P1+ implements.
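A tiny Python model of this single-instruction version, assuming the four registers are read most significant nibble first (my assumption; `unpack_map` and `hub_grants` are illustrative names, not anything Chip has defined):

```python
from itertools import cycle, islice

# Power-on default: the equivalent of today's 1:16 round-robin.
DEFAULT_MAP = (0x01234567, 0x89ABCDEF, 0x01234567, 0x89ABCDEF)

def unpack_map(regs):
    """Flatten four 32-bit registers into a 32-entry list of cog ids."""
    return [(reg >> shift) & 0xF for reg in regs for shift in range(28, -4, -4)]

def hub_grants(regs, n_clocks):
    """Cog granted the hub on each of the next n_clocks slots.
    The hub simply loops over the 32-entry map forever."""
    return list(islice(cycle(unpack_map(regs)), n_clocks))

print(hub_grants(DEFAULT_MAP, 20))
# [0, 1, 2, ..., 15, 0, 1, 2, 3] : each cog still gets 1 slot in 16
```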
Okay. This is my vote. No more variations or other ideas from me. Assuming that this could be implemented without too much circuitry and without affecting max_clk, this is enough (IMHO).
Quite a time back I proposed a 32 slot table so that we could give some cogs only 1:32 while others received a lot more. But it was howled down with noise.
By having a configurable table (Table#1) of 32 slots (where the default is set 0...15, 0..15 meaning no change) the user can configure what he desires.
Now if there were an additional table (Table#2) of 32 slots to be used when the cog given the slot in Table#1 does not require it, then we could effectively do a programmable level of slot-pairing where donation is priority or submissive. A simple form of mooching can also be achieved. Just 2 levels would be sufficient. With careful decoding in an earlier clock, timing could be resolved.
IMHO, this would not be very difficult to implement, not a lot of silicon, and extremely flexible. Default maintains the status quo.
but none of them are on my list, which has 14, 12, 10 MOD scans.
That's why a ROM table is not sufficient.
If we left this out and just stuck with Table #1, I think that would open up so many possibilities on its own. Then, once we've all had some real experience under our belt, we could decide if it makes sense to include Table #2 as part of P2 (or some other future chip). Would you be heartbroken if only Table #1 was implemented?
It also needs a single ReLoad value field (5b), to cover no-jitter scan cases that do not divide into 16. - see my examples above 14,12,10. ( this is also why a rom-choice-table is not enough. )
easy to say... the key is if this type of pipeline signal really is available.
In one of my pair tests I simply assumed it was, but it is not that easy to extract.
You could pipeline this more, with dual-port table, so the next-slot mapping can be read, and that would help remove this
from the Memory address path, where it really does not want to be.
I get what you are saying, but that's hub access jitter, not I/O jitter (yes, I know they can sometimes be dependent on each other). I'm not saying it's not important. I just see it as an extra bit that, if it were not present on P1+, we would still be able to do a lot of neat stuff with just the 32-entry table.
Not quite - without this, the 32 table is locked to 32, which seriously limits the jitter free choices.
The 5 bit register this needs to fix, uses far less silicon than the table it then allows to work properly.
viz The examples I give above, cannot be jitter-free without this ReLoad value.
With fully random COG Alloc, there is no 'top' to extract this from so it needs to be user-supplied.
The reasons for going to all this effort, is to control both Alloc and jitter on Slot Scan Rates.
Tight Software loops will be SSR granular, and jitter-free loops are very commonly needed, so designers need to define Jitter Free SSR.
As for setup time, each hub instruction already takes longer because it has setup time, so my presumption is that this signal will be available early, while the operands are being fetched.
I would be quite happy for Table#1 and Table#2 to both be 16 slots, and be able to be either one combined 32 slot table or 2 16 slot tables. In fact, just a 16 slot table satisfies where I see the USB FS problem - but there are so many uses for having slot tables.
There is a side benefit from soft tables (it has been mentioned, but worth restating) - since some cogs have special DAC pins, by being able to re-order the cogs in the table may benefit co-operating cogs doing video modes.
FWIW presume Table#1 is unchanged (ie each cog has its own slot), setting Table#2 all to "0" would mean that cog 0 would be given every unused slot (mooching). By setting Table#2 to 0,1,0,1,0,1... would give cog 0 every unused even slot and cog 1 every unused odd slot (2 cogs sharing mooching). This just shows the benefits of Table#2.
Another case..
Table#1 0,1,2,3,4,5,6,7,8,1,A,B,C,D,E,F
Table#2 0,9,2,3,4,5,6,7,8,9,A,B,C,D,E,F
gives cog pairing 1 & 9: Cog 1 gets priority use of both slots 1 & 9. If they are unused, then Cog 9 gets the dregs.
note that cogs 0,2-8,A-F do not donate their slots, so no mooching has been given. Their slots just remain unused.
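The pairing case above can be expressed as a two-level lookup. In this Python sketch, `wants_hub` (the set of cogs with a hub instruction pending on that slot) and the `grant` function are hypothetical stand-ins for whatever signal the real pipeline would provide.

```python
TABLE1 = [0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
          0x8, 0x1, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF]
TABLE2 = [0x0, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
          0x8, 0x9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF]

def grant(slot, wants_hub):
    """Return the cog that gets this slot, or None if it goes unused.
    Table#1 has priority; Table#2 is only consulted if the Table#1 cog
    does not need the hub on this slot."""
    primary = TABLE1[slot % len(TABLE1)]
    secondary = TABLE2[slot % len(TABLE2)]
    if primary in wants_hub:
        return primary
    if secondary in wants_hub:
        return secondary
    return None

print(grant(1, {1, 9}))   # 1    cog 1 has priority on slot 1
print(grant(9, {1, 9}))   # 1    ... and on slot 9
print(grant(9, {9}))      # 9    cog 9 gets the slot only when cog 1 is idle
print(grant(2, {9}))      # None cog 2 does not donate its slot
```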
An appeal of that is it still fits in a WRQUAD, and keeps the tables smaller.
I tried this, and got some unexpectedly low speeds...
A 32x table, with Default switch, and ReLoad, reports on my FPGA build as 240.500MHz, which is similar to other P&R values for my other scanners.
If I add the pipeline check of 'Will next opcode use HUB', ie basically one more level of lookup, and a Top/Bottom half mux using that decision result, plus a x32 or x16/x16 choice, then P&R reports 131.423MHz
The slot mapper needs to run > 200MHz to not impact system speeds.
ie 4.158ns changes to 7.609ns, an adder of 3.451ns for a 4 bit lookup and 2 choice Mux
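The period figures follow directly from the reported fMAX values; a short Python check of the arithmetic (`period_ns` is just an illustrative helper):

```python
def period_ns(f_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / f_mhz

direct = period_ns(240.500)     # x32 table with Default switch and ReLoad
indirect = period_ns(131.423)   # plus 'next opcode uses HUB' lookup and mux
budget = period_ns(200.0)       # the > 200MHz target for the slot mapper

print(round(direct, 3))             # 4.158
print(round(indirect, 3))           # 7.609
print(round(indirect - direct, 3))  # 3.451 ns added by the extra level
print(indirect > budget)            # True: 7.609 ns misses the 5 ns budget
```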
I had expected an impact, but not quite that much.
The OnSemi process may give different (better?) figures than this Lattice FPGA, but it should still be proportional. - and values have seemed in the right ballpark.
I'm not sure what you are trying to say, but a direct mapper is applying MapF[SC], whilst an indirect mapper has to
use MapF[SC] to extract F_NeedsHUB as (NextOPCisHUB[MapF[SC]]), and that selects Upper/Lower.
So there will be logic costs to this.
I am not arguing against Table2. Or the notion that setting up intervals other than 2^ cannot be precisely done without it. I just think that Table1 alone still gives us a lot of new opportunities.
Unless jmg's definition of Table1 differs from my simple 32-entry table, and I'm not seeing the difference (this has happened once before), then that's not true. With any of the 16 cogs allowed in any of the 32 slots, there are 16^32 combinations that could be assigned. There is never an unused slot. At worst, a slot is assigned to a cog that's not running. I don't see how Table2 will make any difference to the ability to use all slots.
Just so we are all on the same page, here is what the HUBP table (That's what I'll call it, in case jmg's Table1 is actually something different) would look like for various scenarios:
$01234567_89ABCDEF_01234567_89ABCDEF
$00000000_00000000_00000000_00000000
$34343434_34343434_34343434_34343434
$01273457_6FF7FFF7_0FF7FFF7_FFF7FFF7
$C010D020_C030D040_C010D050_C060D070
$01020310_20310203_10203102_0310203F
(Yes, I realize this is the sort of scenario where you want Table2.)
$01023405_67012304_56712304_FFFFFFFF
$01230123_40414243_01230123_50515253
And so on. Again, I'm not saying that the 32-entry table is sufficient for all cases, but I think it provides so much more than we can do with the simple 1:16 round-robin approach, that I think we'll still be figuring out new combinations when the P2 (or P3) gets released.
Yup. Any Cog ID in Any Slot - but I'd add some examples of 28/24/20 scans, to make it crystal clear that is a vital part of the Scanning Choices.
Correct (but it needs to support Reload, eg 20/24/28/32 etc scans).
It does not affect the ability to use all slots - the mapping can be changed, after init, at any time.
Being able to easily modify mapping for a Cluster of COGS will be a common use case.
Mooch is a subtle variant, that allows a pre-defined modify to be Automatic, ie a fetch-time decision.
The single table SW control means one of the set of COGS in that cluster has to modify the mapping
- Optional re-config of the 32x to 16+16 has appeal, but right now my testcases are showing a significant timing hit from the second level of indirection.
I will try other use-more-silicon approaches, to see if the speed can be improved.
Nope. That requires your Table2, I believe. If you want those examples, you add them! I was sticking to strictly one 32-slot table, as in my original proposal/conclusion/epiphany/etc.
:P
Any Cog in any slot is not Table 2 related. Mooch does need either a 2nd, or a split, table.
To make it easier to follow, I use two lines, one for 'slow' and one for 'fast' - in a final Map those are all in one table row.
Here are some examples of No-Jitter, more granular Slot Repetition Rate choices <>32
Some more numbers :
From above :
x32 Mapper ~ 240MHz ( 50% Alloc possible & appx speed of non-pair config choice )
x16/x16 OPCisHUB ~ 131.423MHz (Auto Mooch variant)
Added a Boolean re-mapper, use-more-silicon, but (hopefully) lower delays - seems routing delays kill this.
(and the LUT cost was also high - added hundreds of LUT4s.)
x16/x16 ReOrder OPCisHUB ~ 105.809MHz and Logic is much higher
( also get 90.066MHz and double the logic, if re-order table is 32x, but Mooch scan is <=16, so can use 16 re-order table )
This shows a move from x16 to x32 tables, costs ~ 16MHz in fMAX
add another pipeline
x16/x16 ReOrder pipeline2 ~ 141.263MHz - better, but still costly.
return to x16/x16 OPCisHUB but add 2nd pipeline
x16/x16 OPCisHUB ~ 207.684MHz This has config for x32 and x16/x16 Auto Mooch variant
The two pipelines* needed to get tolerable speed, impose these assumptions.
a) The Opcode pipeline can provide an 'early boolean', showing the NEXT opcode will need the HUB (ie I have a HUB opcode in pipeline signal)
b) The AllocCOG (4b) is not needed in the first clock of that opcode, but is valid for 2nd and further clocks.
Larger tables do have a speed cost, as well as a resource cost.
* These pipelines (hopefully) tap-into, and run in Parallel with, existing pipelines and opcode 'need info now' timings, so have no fSys penalties.
This ~207.684MHz includes 2 Config control Bits for
i) Force Default 1:16 operation (ignores Table, can be applied at any time)
ii) Operate as either x32 or x16/x16 Mooch Pair : 2nd COG is used IF first does not need slot.
It is more likely that your 'n' modulo counter is affecting the timing. But as you know, the counter can be reversed and run down to zero (the user doesn't see this).
All: There seems to be a misconception of what my Table#2 is for. It is for the case where the cog in Table#1 does not require the allocated slot (this time around). Table#2 allows the unused slot to be offered to another cog. This permits a mooching dedication of each slot, rather than mooching all slots to one cog. But for my co-operating pairs of cogs, I can give priority of one or both slots to the primary cog, and if not required it will be offered to the other cog in the pair.
So this gives ultimate flexibility. I don't believe this should add any significant timing as everything would be decoded in parallel in the previous clock. Hub accesses will be setting up their read/write in the clock before the hub transaction takes place. I am sure Chip will be able to solve this if he decides to implement this.
So you're the culprit ;-)
I thought someone posted the table idea a while back but I was not sure. It's such a simple, elegant, and flexible method of controlling hub access. Can't understand why it didn't get more discussion then.
Not too enthused about adding a second table though. Not because of the extra silicon, but because of the extra complexity and time it will take to access it. Look ahead decoding for that would not be all that easy either. Better IMHO to use the silicon for a 64 entry table instead.
No surprises really, part of this behaves as RAM, and larger RAM is always slower than smaller RAM as it has more decode-trees.
Counters also clock slower, the larger they get, but 4-5 bit counters will be well above fSys
(32b counters are another story.)
Yup, I coded it so when in Pair/Mooch mode, it checks Table1.COG first and if it does not need the slot right then, it flops to Table2.COG, which may, or may not use it. In this mode, the 32x folds into 2 x 16, and Reload <= 16.
In 32x mode, Reload is <= 32
Easy to 'believe' anything, I prefer hard numbers.
The Tools tell me I can get > 200MHz, for a 'mooch' option, but only by using two pipeline levels.
They also give ~240MHz for the simpler 32x/ReLoad Mapper.
I think that is ok, as they should not add to present pipelines & I expect the HUB-Slot-Value is not needed before 2nd clk in the opcode.