A 32-slot Approach (was: An interleaved hub approach)

Seairth · 2014-05-06 18:09

(COPIED/MOVED FROM THE P1+ THREAD. If this is actually an already-explored idea, my apologies. I've lost track of a few of them.)

tonyp12 wrote: »

The default is 1:16, but there is a mode, not discussed in any novice learning books, that will do 1:8 for cogs 0-7 and 1:32 for cogs 8-15.
For example 0-7 are used for VGA + sprites, cogs 8-15 are for joystick/keyboard/uart etc.
One extra mode is enough to keep it simple; no 1:4 or 1:2 could possibly be needed.

That split won't work. Cogs 0-7 alone would take all the hub slots. You can, however, do:

4 @ 1:8
4 @ 1:16
8 @ 1:32

One such pattern (but by no means the only one) would be 01234567012389AB012345670123CDEF, or:

              111111 11112222 22222233
   01234567 89012345 67890123 45678901
  ------------------------------------
0 |H       |H       |H       |H       
1 | H      | H      | H      | H      
2 |  H     |  H     |  H     |  H     
3 |   H    |   H    |   H    |   H    
--------------------------------------
4 |    H   |        |    H   |        
5 |     H  |        |     H  |        
6 |      H |        |      H |        
7 |       H|        |       H|        
--------------------------------------
8 |        |    H   |        |        
9 |        |     H  |        |        
A |        |      H |        |        
B |        |       H|        |        
C |        |        |        |    H   
D |        |        |        |     H  
E |        |        |        |      H 
F |        |        |        |       H
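The slot arithmetic behind this can be checked with a short Python sketch. The helper names `slot_budget` and `slots_per_cog` are made up for illustration and are not part of any proposed instruction set; the only inputs are the ratios and the pattern string quoted above.

```python
from collections import Counter
from fractions import Fraction

def slot_budget(groups):
    """Sum the share of hub slots claimed by (cog_count, period) groups."""
    return sum(Fraction(count, period) for count, period in groups)

def slots_per_cog(pattern):
    """Count how many slots of `pattern` each cog (one hex digit per slot) owns."""
    return Counter(int(ch, 16) for ch in pattern)

# 8 cogs at 1:8 plus 8 cogs at 1:32 asks for more than 100% of the slots.
print(slot_budget([(8, 8), (8, 32)]))           # 5/4
# The 4 / 4 / 8 split uses exactly all of them.
print(slot_budget([(4, 8), (4, 16), (8, 32)]))  # 1

pattern = "01234567012389AB012345670123CDEF"
counts = slots_per_cog(pattern)
print(counts[0], counts[4], counts[0xF])        # 4 2 1
```

Cogs 0-3 own 4 of the 32 slots (1:8), cogs 4-7 own 2 (1:16), and cogs 8-F own 1 each (1:32).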

I suppose Chip could just hard-code the interleave pattern and the user would either run in the standard 16-cycle mode, or enable the interleaved mode and load cogs accordingly (HUBCYCA is default, HUBCYCB is interleaved).

Or, if you had a "HUBCYCB D, S", where D pointed to a register that contained 4 4-bit fields for the 1:8 slots and 4 4-bit fields for the 1:16 slots, and S pointed to a register that contained 8 4-bit fields for the 1:32 slots (where the 4-bit fields indicated the cog number), you could provide the custom interleave (including doubling some cogs and starving other cogs, particularly those that aren't running).

For instance, you could get 4 cogs @ 1:8, 8 cogs @ 1:16, and 4 cogs that have no access (which is particularly okay if they aren't running).

        HUBCYCB map48, map16

map48   long $01234567
map16   long $89AB89AB

              111111 11112222 22222233
   01234567 89012345 67890123 45678901
  ------------------------------------
0 |H       |H       |H       |H       
1 | H      | H      | H      | H      
2 |  H     |  H     |  H     |  H     
3 |   H    |   H    |   H    |   H    
--------------------------------------
4 |    H   |        |    H   |        
5 |     H  |        |     H  |        
6 |      H |        |      H |        
7 |       H|        |       H|        
--------------------------------------
8 |        |    H   |        |    H   
9 |        |     H  |        |     H  
A |        |      H |        |      H 
B |        |       H|        |       H
C |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx
D |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx
E |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx
F |xxxxxxxx|xxxxxxxx|xxxxxxxx|xxxxxxxx

Actually, with this, you could even leave the first 8 cogs on the same schedule (4 @ 1:8 and 4@1:16) and swap out the last 8 as demand required. And so on!

Oh, what fun!

jmg · 2014-05-06 18:20

Seairth wrote: »

I suppose Chip could just hard-code the interleave pattern and the user would either run in the standard 16-cycle mode, or enable the interleaved mode and load cogs accordingly (HUBCYCA is default, HUBCYCB is interleaved).

Or, if you had a "HUBCYCB D, S", where D pointed to a register that contained 4 4-bit fields for the 1:8 slots and 4 4-bit fields for the 1:16 slots, and S pointed to a register that contained 8 4-bit fields for the 1:32 slots (where the 4-bit fields indicated the cog number), you could provide the custom interleave (including doubling some cogs and starving other cogs, particularly those that aren't running).

Close - the mapping idea is sound, but I think this needs a 32 entry variable table, rather than a ROM table, or splintered tables.
A table of 32x can use WRQUAD to change, and allows 50% resource sharing.

The symmetric 32x variable table also solves/avoids the all-COGS-are-not-actually-quite-equal issues some have raised.
Any COGid can be placed in any order. All cogs are interchangeable.

Here are some examples from another thread, done using a 32x Table+Reload, set for no-jitter.

Key point: HUB Slot Scan Rate (SSR) is not just about Bandwidth; it also impacts tight code-loop granularity.

Some of the many possible operational [Table+Reload] choices are

COGS No, @ Slot Sample Rate, some coverage options, Any-COGid allowed
 2 @ 20ns SRR(fast) & 14 @ 160ns SSR(slow) 
 2 @ 20ns SRR(fast) & 14 @ 140ns SSR(slow) 
 3 @ 30ns SRR(fast) & 12 @ 120ns SSR(slow) 
 4 @ 40ns SRR(fast) & 12 @ 120ns SSR(slow)
 5 @ 50ns SRR(fast) & 10 @ 100ns SSR(slow)
 2 @ 20ns SRR(fast) & 10 @ 100ns SSR(slow)
 reference Default  is 16 @ 80ns SSR (slow)

Unallocated COGs are not in the Hub Scan, but they can still be used for other tasks.

With a COG map, Table+Reload, I can get further on a PAL Video + USB timing Budget

eg I have a MOD 28 scanner, a PAL fSYS at 38x Burst, (1.684775125e8) and I have a 12MHz USB Data Sync DPLL that can lock +/- 0.285 % with a centre point that is ~ 15ppm off.

I have 2 'fast' COGS running @ 20ns SRR & 14 'slow' cogs with @ 140ns SSR, both jitter free.

(the fast and slow refer only to the SSR, in all other aspects COGS are of course identical)

The Fast COG SSR, can co-operate with my DPLL in a way that is impossible with a locked 1:16 rate.
Other SW issues may occur, but the timing budget moves to 'possible' on this simple example.

Seairth · 2014-05-06 18:26

Actually, if this were the interleave pattern:

              111111 11112222 22222233
   01234567 89012345 67890123 45678901
  ------------------------------------
8 |H       |H       |H       |H       
8 |  H     |  H     |  H     |  H     
8 |    H   |    H   |    H   |    H   
8 |      H |      H |      H |      H 
--------------------------------------
16| H      |        | H      |        
16|     H  |        |     H  |        
16|        | H      |        | H      
16|        |     H  |        |     H  
--------------------------------------
32|   H    |        |        |        
32|       H|        |        |        
32|        |   H    |        |        
32|        |       H|        |        
32|        |        |   H    |        
32|        |        |       H|        
32|        |        |        |   H    
32|        |        |        |       H

Then the following (and more) patterns could be accomplished:

4@1:8, 4@1:16, 8@1:32 ($01234567, $89ABCDEF)
4@1:8, 8@1:16 ($01234567, $89AB89AB)
2@1:4, 4@1:16, 8@1:32 ($01012345, $6789ABCD)
2@1:4, 2@1:8, 8@1:32 ($01012323, $456789AB)
2@1:4, 2@1:8, 4@1:16 ($01012323, $45674567)
2@1:4, 4@1:8 ($01012323, $45454545)
4@1:4 ($01012222, $33333333)

And then there are all the asymmetrical combinations!
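As a sanity check on the list above, here is a small Python sketch that works out the access spacing each cog would see. It assumes the first hex digit of D names the cog for the top row of the diagram, and so on down the rows; `ROWS` and `cog_gaps` are hypothetical names used only for this illustration.

```python
# Slot positions of each row in the interleave diagram, top to bottom:
# four 1:8 rows, four 1:16 rows, eight 1:32 rows.
ROWS = (
    [range(start, 32, 8) for start in (0, 2, 4, 6)]
    + [range(start, 32, 16) for start in (1, 5, 9, 13)]
    + [range(start, 32, 32) for start in (3, 7, 11, 15, 19, 23, 27, 31)]
)

def cog_gaps(d, s):
    """Map each cog named in the two longs to the set of gaps between its slots."""
    nibbles = f"{d:08X}{s:08X}"            # 16 hex digits, one per row
    owned = {}
    for row, digit in zip(ROWS, nibbles):
        owned.setdefault(int(digit, 16), []).extend(row)
    gaps = {}
    for cog, slots in owned.items():
        slots.sort()
        wrapped = slots[1:] + [slots[0] + 32]   # close the 32-slot loop
        gaps[cog] = sorted({b - a for a, b in zip(slots, wrapped)})
    return gaps

print(cog_gaps(0x01012323, 0x45454545))
# {0: [4], 1: [4], 2: [8], 3: [8], 4: [8], 5: [8]}
```

A single value in a cog's list means its accesses are evenly spaced, so `[4]` is a jitter-free 1:4 and `[8]` a jitter-free 1:8.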

tonyp12 · 2014-05-06 18:34

On the other thread I stated a new idea:
1:8 for 4 cogs = 100% boost for them, hub access spaced evenly apart in a 3:24 fashion
and then
1:24 for the remaining 12 cogs for a rather small drop of 33%

So the 4 Boost-Cogs share every other slot (for a total of 12 out of 24)
The other 12 get one each every 24 slots. interleaved with the 4 Boost cogs.

Having just these 2 modes (1:16 and 3:24/1:24) fixed in hardware it will be easy to understand and hubex will state that this "main routine" uses 3:24
and other generic hubex routines should state that they work fine in either 1:16 or/and 1:24 hub access

              111111 11112222
   01234567 89012345 67890123
  ----------------------------
8 |H       |H       |H       |
8 |  H     |  H     |  H     |
8 |    H   |    H   |    H   |
8 |      H |      H |      H |
------------------------------
24| H      |        |        |
24|   H    |        |        |
24|     H  |        |        |
24|       H|        |        |
24|        | H      |        |
24|        |   H    |        |
24|        |     H  |        |
24|        |       H|        |
24|        |        | H      |
24|        |        |   H    |
24|        |        |     H  |
24|        |        |       H|

Seairth · 2014-05-06 18:48

jmg wrote: »

Close - the mapping idea is sound, but I think this needs a 32 entry variable table, rather than a ROM table, or splintered tables.
A table of 32x can use WRQUAD to change, and allows 50% resource sharing.

I think the instruction can use a 16 entry table (two registers), which I feel would give enough combinations. I'd expect that the two registers would actually get remapped to an internal 32-element table. So $01234567, $89ABCDEF would actually end up looking like $04182539061A273B041C253D061E273F (based on the other interleaved pattern I just posted). Then the hub just loops over the elements to select the next cog.

(Don't you like how I casually throw the "just" in there like this is no effort at all? Yes, I know this isn't trivial.)
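The remapping can be sketched in a few lines of Python. This is only an illustration of the translation, not how the hardware would do it; `SLOT_TO_NIBBLE` and `expand` are invented names, and the slot-to-row assignment is read off the interleave diagram posted at 18:26.

```python
# For each of the 32 slots, which of the 16 nibbles owns it
# (nibbles 0-7 come from D, nibbles 8-15 from S, most significant digit first).
SLOT_TO_NIBBLE = []
for slot in range(32):
    if slot % 2 == 0:                  # even slots: the four 1:8 rows
        SLOT_TO_NIBBLE.append((slot % 8) // 2)
    elif slot % 4 == 1:                # slots 1, 5, 9, 13, ...: the four 1:16 rows
        SLOT_TO_NIBBLE.append(4 + (slot % 16) // 4)
    else:                              # slots 3, 7, 11, ...: the eight 1:32 rows
        SLOT_TO_NIBBLE.append(8 + slot // 4)

def expand(d, s):
    """Expand the two 8-nibble registers into the 32-entry slot table (hex string)."""
    nibbles = f"{d:08X}{s:08X}"
    return "".join(nibbles[i] for i in SLOT_TO_NIBBLE)

print(expand(0x01234567, 0x89ABCDEF))
# 04182539061A273B041C253D061E273F
```

The printed string matches the table quoted above, so the diagram and the hex value agree with each other.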

Seairth · 2014-05-06 19:04

tonyp12 wrote: »
On the other thread I stated a new idea:
1:8 for 4 cogs = 100% boost for them, hub access spaced evenly apart in a 3:24 fashion
and then
1:24 for the remaining 12 cogs for a rather small drop of 33%

So the 4 Boost-Cogs share every other slot (for a total of 12 out of 24)
The other 12 get one each every 24 slots. interleaved with the 4 Boost cogs.
              111111 11112222
   01234567 89012345 67890123
  ----------------------------
8 |H       |H       |H       |
8 |  H     |  H     |  H     |
8 |    H   |    H   |    H   |
8 |      H |      H |      H |
------------------------------
24| H      |        |        |
24|   H    |        |        |
24|     H  |        |        |
24|       H|        |        |
24|        | H      |        |
24|        |   H    |        |
24|        |     H  |        |
24|        |       H|        |
24|        |        | H      |
24|        |        |   H    |
24|        |        |     H  |
24|        |        |       H|

That would work too. I prefer the 2^n groupings (if it wasn't obvious from my other thread). The 8/16/32 gives three general bands to work within, as well as the ability to promote some of those slots to a faster band (at the cost of fewer cogs being active and/or some cogs having no hub access). Your approach would obviously allow some of the same "promotion" schemes, too.

Seairth · 2014-05-06 19:25

Generalizing this a bit, suppose you had the following instructions:

SETHUBM #n

This sets the hub to one of three modes:
0 : 16@1:16
1 : 4@1:8, 4@1:16, 8@1:32
2 : 4@1:8, 12@1:24

(any other groupings?)

SETHUBP D, S

This sets the 32-entry hub pattern table, according to the current hub mode:

0: D contains the first 8 1:16 slots, S contains the second 8 1:16 slots
1: D contains the 4 1:8 slots and 4 1:16 slots, S contains the 8 1:32 slots
2: D contains the 4 1:8 slots and the first 4 1:24 slots, S contains the last 8 1:24 slots

The nice thing here is that the hub itself is simply iterating over the 32 entry table at all times. All the SETHUBM is doing is defining the translation used for the 16-entry table passed by SETHUBP (see two posts back for the example of mode #1).

And, of course, the chip would start up in mode 0 (16@1:16), with the internal map set to $01234567_89ABCDEF_01234567_89ABCDEF (the equivalent of SETHUBP $01234567, $89ABCDEF).

Seairth · 2014-05-06 19:31

And, then this could get even more general, I suppose (and jmg might have suggested this, but I'm not sure):

SETHUBP D

D points to 4 registers that contain the entire 32-entry map. No need for SETHUBM at all. And, obviously, the chip would still start up with $01234567_89ABCDEF_01234567_89ABCDEF.
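To make the four-register layout concrete, here is a minimal Python sketch of packing a 32-entry map into four longs and reading one slot back. The nibble order (slot 0 in the most significant nibble of the first long) is an assumption chosen to match the way the maps are written in this thread; `pack_map` and `cog_for_slot` are hypothetical helper names.

```python
def pack_map(table):
    """Pack 32 cog numbers (0-15) into four 32-bit longs, slot 0 in the top nibble."""
    if len(table) != 32 or any(not 0 <= cog <= 15 for cog in table):
        raise ValueError("need 32 entries in the range 0..15")
    longs = []
    for start in range(0, 32, 8):
        value = 0
        for cog in table[start:start + 8]:
            value = (value << 4) | cog
        longs.append(value)
    return longs

def cog_for_slot(longs, slot):
    """Look up which cog owns `slot` (the slot counter wraps at 32)."""
    word = longs[(slot % 32) // 8]
    shift = 28 - 4 * (slot % 8)
    return (word >> shift) & 0xF

default_map = pack_map(list(range(16)) * 2)
print([f"${value:08X}" for value in default_map])
# ['$01234567', '$89ABCDEF', '$01234567', '$89ABCDEF']
print(cog_for_slot(default_map, 17))   # 1
```

With the default map, slot n simply belongs to cog n mod 16, which is the existing 1:16 round robin.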

Seairth · 2014-05-06 19:39

Huh. I think I just went full circle here (i.e. I think this approach was already suggested by someone else long, long ago).

This seems so simple and is super easy to understand. And, if SETHUBP is never called, then you get exactly what the current P1+ implements.

Okay. This is my vote. No more variations or other ideas from me. Assuming that this could be implemented without too much circuitry and without affecting max_clk, this is enough (IMHO).

Cluso99 · 2014-05-06 19:51

Now we are getting somewhere!!
Quite a time back I proposed a 32 slot table so that we could give some cogs only 1:32 while others received a lot more. But it was howled down with noise.

By having a configurable table (Table#1) of 32 slots (where the default is set 0...15, 0..15 meaning no change) the user can configure what he desires.
Now if there were an additional table (Table#2) of 32 slots to be used when the cog given the slot in Table#1 does not require it, then we could effectively do a programmable level of slot-pairing where donation is priority or submissive. A simple form of mooching can be achieved as well. Just 2 levels would be sufficient. With careful decoding in an earlier clock, timing could be resolved.

IMHO, this would not be very difficult to implement, not a lot of silicon, and extremely flexible. Default maintains the status quo.

jmg · 2014-05-06 20:06

Seairth wrote: »

I think the instruction can use a 16 entry table (two registers), which I feel would give enough combinations...

but none of them are on my list, which has 14, 12 and 10 MOD scans.
That's why a ROM table is not sufficient.

COGS No, @ Slot Sample Rate, some coverage options, Any-COGid allowed
 2 @ 20ns SRR(fast) & 14 @ 160ns SSR(slow) 
 2 @ 20ns SRR(fast) & 14 @ 140ns SSR(slow) 
 3 @ 30ns SRR(fast) & 12 @ 120ns SSR(slow) 
 4 @ 40ns SRR(fast) & 12 @ 120ns SSR(slow)
 5 @ 50ns SRR(fast) & 10 @ 100ns SSR(slow)
 2 @ 20ns SRR(fast) & 10 @ 100ns SSR(slow)
 reference Default  is 16 @ 80ns SSR (slow)

Seairth · 2014-05-06 20:10

Cluso99 wrote: »

Now if there were an additional table (Table#2) of 32 slots to be used when the cog given the slot in Table#1 does not require it, then we could effectively do a programmable level of slot-pairing where donation is priority or submissive. A simple form of mooching can be achieved as well. Just 2 levels would be sufficient. With careful decoding in an earlier clock, timing could be resolved.

If we left this out and just stuck with Table #1, I think that would open up so many possibilities on its own. Then, once we've all had some real experience under our belt, we could decide if it makes sense to include Table #2 as part of P2 (or some other future chip). Would you be heartbroken if only Table #1 was implemented?

jmg · 2014-05-06 20:26

Cluso99 wrote: »

Now we are getting somewhere!!
Quite a time back I proposed a 32 slot table so that we could give some cogs only 1:32 while others received a lot more. But it was howled down with noise.

By having a configurable table (Table#1) of 32 slots (where the default is set 0...15, 0..15 meaning no change) the user can configure what he desires.

It also needs a single ReLoad value field (5b), to cover no-jitter scan cases that do not divide into 16. - see my examples above 14,12,10. ( this is also why a rom-choice-table is not enough. )

Cluso99 wrote: »

... With careful decoding in an earlier clock, timing could be resolved.

easy to say... the key is if this type of pipeline signal really is available.
In one of my pair tests I simply assumed it was, but it is not that easy to extract.

You could pipeline this more, with dual-port table, so the next-slot mapping can be read, and that would help remove this
from the Memory address path, where it really does not want to be.

Seairth · 2014-05-06 20:51

jmg wrote: »

It also needs a single ReLoad value field (5b), to cover no-jitter scan cases that do not divide into 16. - see my examples above 14,12,10. ( this is also why a rom-choice-table is not enough. ).

I get what you are saying, but that's hub access jitter, not I/O jitter (yes, I know they can sometimes be dependent on each other). I'm not saying it's not important. I just see it as an extra bit that, if it were not present on P1+, we would still be able to do a lot of neat stuff with just the 32-entry table.

jmg · 2014-05-06 21:05

Seairth wrote: »

I get what you are saying, but that's hub access jitter, not I/O jitter (yes, I know they can sometimes be dependent on each other). I'm not saying it's not important. I just see it as an extra bit that, if it were not present on P1+, we would still be able to do a lot of neat stuff with just the 32-entry table.

Not quite - without this, the 32 table is locked to 32, which seriously limits the jitter free choices.
The 5-bit register needed to fix this uses far less silicon than the table it then allows to work properly.

viz The examples I give above, cannot be jitter-free without this ReLoad value.
With fully random COG Alloc, there is no 'top' to extract this from so it needs to be user-supplied.

The reason for going to all this effort is to control both Alloc and jitter on Slot Scan Rates.
Tight Software loops will be SSR granular, and jitter-free loops are very commonly needed, so designers need to define Jitter Free SSR.

Cluso99 · 2014-05-06 21:41

Without Table#2 we don't get to use the otherwise unused slots. This method gives the unused slots a second chance of being used - it can be thought of as a single level form of mooching.

As for setup time, each hub instruction already takes longer because it has setup time, so my presumption is that this signal will be available early, while the operands are being fetched.

I would be quite happy for Table#1 and Table#2 to both be 16 slots, and be able to be either one combined 32 slot table or 2 16 slot tables. In fact, just a 16 slot table satisfies where I see the USB FS problem - but there are so many uses for having slot tables.

There is a side benefit from soft tables (it has been mentioned, but worth restating) - since some cogs have special DAC pins, by being able to re-order the cogs in the table may benefit co-operating cogs doing video modes.

FWIW presume Table#1 is unchanged (ie each cog has its own slot), setting Table#2 all to "0" would mean that cog 0 would be given every unused slot (mooching). By setting Table#2 to 0,1,0,1,0,1... would give cog 0 every unused even slot and cog 1 every unused odd slot (2 cogs sharing mooching). This just shows the benefits of Table#2.

Another case..
Table#1 0,1,2,3,4,5,6,7,8,1,A,B,C,D,E,F
Table#2 0,9,2,3,4,5,6,7,8,9,A,B,C,D,E,F
gives cog pairing 1 & 9: Cog 1 gets priority use of both slots 1 & 9. If they are unused, then Cog 9 gets the dregs.
note that cogs 0,2-8,A-F do not donate their slots, so no mooching has been given. Their slots just remain unused.
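The two-table behaviour described here can be modelled in Python as follows. This is only a behavioural sketch under the assumption that the hub knows, per slot, which cogs are requesting access; `grant` and `wants_hub` are invented names, and nothing here says anything about the real timing or logic cost.

```python
def grant(slot, wants_hub, table1, table2):
    """Return the cog that receives `slot`, or None if the slot goes unused.

    `wants_hub` is the set of cogs requesting hub access in this slot.
    Table#1 has priority; Table#2 is only consulted if that cog declines.
    """
    primary = table1[slot % len(table1)]
    if primary in wants_hub:
        return primary
    secondary = table2[slot % len(table2)]
    if secondary in wants_hub:
        return secondary
    return None

# The pairing example above: cog 1 has priority on slots 1 and 9, cog 9 gets the dregs.
TABLE1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 1, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF]
TABLE2 = [0, 9, 2, 3, 4, 5, 6, 7, 8, 9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF]

print(grant(9, {1, 9}, TABLE1, TABLE2))   # 1  (cog 1 wins its priority slot)
print(grant(9, {9}, TABLE1, TABLE2))      # 9  (cog 1 idle, cog 9 takes the slot)
print(grant(3, {9}, TABLE1, TABLE2))      # None (cog 3 does not donate)

# Single-cog mooching: Table#1 unchanged, Table#2 all zeros.
print(grant(5, {0}, list(range(16)), [0] * 16))   # 0
```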

jmg · 2014-05-06 23:53

Cluso99 wrote: »

I would be quite happy for Table#1 and Table#2 to both be 16 slots, and be able to be either one combined 32 slot table or 2 16 slot tables. In fact, just a 16 slot table satisfies where I see the USB FS problem - but there are so many uses for having slot tables.

An appeal of that is it still fits in a WRQUAD, and keeps the tables smaller.

I tried this, and got some unexpectedly low speeds...

A 32x table, with Default switch, and ReLoad, reports on my FPGA build as 240.500MHz, which is similar to other P&R values for my other scanners.

If I add the pipeline check of 'Will next opcode use HUB' - ie basically one more level of lookup, and a Top/Bottom half mux using that decision result, plus a x32 or x16/x16 choice, then P&R reports 131.423MHz
The slot mapper needs to run > 200MHz to not impact system speeds.

ie 4.158ns changes to 7.609ns, an adder of 3.451ns for a 4 bit lookup and 2 choice Mux
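The period figures follow directly from the reported fMAX values; a quick Python check, using only the numbers quoted in this post (the `period_ns` helper is just for illustration):

```python
def period_ns(fmax_mhz):
    """Convert a maximum clock frequency in MHz to a clock period in ns."""
    return 1000.0 / fmax_mhz

direct = period_ns(240.500)    # 32x table with Default switch and ReLoad
mooch = period_ns(131.423)     # with the extra lookup and Top/Bottom mux
budget = period_ns(200.0)      # the > 200MHz target for the slot mapper

print(f"{direct:.3f} ns -> {mooch:.3f} ns, added {mooch - direct:.3f} ns")
# 4.158 ns -> 7.609 ns, added 3.451 ns
print(f"budget at 200MHz: {budget:.3f} ns")
# budget at 200MHz: 5.000 ns
```

So the direct mapper fits inside the 5 ns budget, while the indirect version overshoots it by roughly 2.6 ns on this FPGA.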

I had expected an impact, but not quite that much.

The OnSemi process may give different (better?) figures than this Lattice FPGA, but it should still be proportional. - and values have seemed in the right ballpark.

Cluso99 · 2014-05-07 00:06

jmg wrote: »

An appeal of that is it still fits in a WRQUAD, and keeps the tables smaller.

I tried this, and got some unexpectedly low speeds...

A 32x table, with Default switch, and ReLoad, reports on my FPGA build as 240.500MHz, which is similar to other P&R values for my other scanners.

If I add the pipeline check of 'Will next opcode use HUB' - ie basically one more level of lookup, and a Top/Bottom half mux using that decision result, plus a x32 or x16/x16 choice, then P&R reports 131.423MHz
The slot mapper needs to run > 200MHz to not impact system speeds.

ie 4.158ns changes to 7.609ns, an adder of 3.451ns for a 4 bit lookup and 2 choice Mux

I had expected an impact, but not quite that much.

The OnSemi process may give different (better?) figures than this Lattice FPGA, but it should still be proportional. - and values have seemed in the right ballpark.

The code has done a sequential rather than parallel circuit. Both maps should be done in parallel, and the result muxed.

jmg · 2014-05-07 00:12

Cluso99 wrote: »

The code has done a sequential rather than parallel circuit. Both maps should be done in parallel, and the result muxed.

I'm not sure what you are trying to say, but a direct mapper is applying MapF[SC], whilst an indirect mapper has to
use MapF[SC] to extract F_NeedsHUB as (NextOPCisHUB[MapF[SC]]), and that selects Upper/Lower.
So there will be logic costs to this.

Cluso99 · 2014-05-07 01:28

jmg wrote: »

I'm not sure what you are trying to say, but a direct mapper is applying MapF[SC], whilst an indirect mapper has to
use MapF[SC] to extract F_NeedsHUB as (NextOPCisHUB[MapF[SC]]), and that selects Upper/Lower.
So there will be logic costs to this.

I could build it in logic, but not Verilog. Anyway, I am sure Chip will be able to do it comfortably, as this is simple compared to the amazing things he has done.

Seairth · 2014-05-07 05:36

jmg wrote: »

Not quite - without this, the 32 table is locked to 32, which seriously limits the jitter free choices.
The 5-bit register needed to fix this uses far less silicon than the table it then allows to work properly.

I am not arguing against Table2. Or against the notion that intervals other than 2^n cannot be set up precisely without it. I just think that Table1 alone still gives us a lot of new opportunities.

Cluso99 wrote: »

Without Table#2 we don't get to use the otherwise unused slots. This method gives the unused slots a second chance of being used - it can be thought of as a single level form of mooching.

Unless jmg's definition of Table1 differs from my simple 32-entry table, and I'm not seeing the difference (this has happened once before), then that's not true. There are 16^32 possible assignments (any of 16 cogs in each of 32 slots). There is never an unused slot. In the worst case, a slot is assigned to a cog that's not running. I don't see how Table2 will make any difference to the ability to use all slots.

Just so we are all on the same page, here is what the HUBP table (That's what I'll call it, in case jmg's Table1 is actually something different) would look like for various scenarios:

$01234567_89ABCDEF_01234567_89ABCDEF

default, 1:16 for all cogs

$00000000_00000000_00000000_00000000

one cog to rule them all

$34343434_34343434_34343434_34343434

two cogs (selected for their associated pins) get 1:2 (effectively no hub latency)

$01273457_6FF7FFF7_0FF7FFF7_FFF7FFF7

cog 0 gets 1:16
cogs 1-6 get 1:32
cog 7 gets 1:4 (it owns every fourth slot)
cogs 8-E don't get anything (may not be running)
cog F isn't running

$C010D020_C030D040_C010D050_C060D070

cog 0 gets 1:2
cog 1 gets 1:16
cogs 2-7 get 1:32
cogs 8-B get nothing (may not be running)
cogs C-D get 1:8 (1:4 relative to each other)
cogs E-F get nothing (may not be running)

$01020310_20310203_10203102_0310203F

cog 0 gets irregular timing of 2:2:3:2:3:2:3:2:3:2:3:2:3
cogs 1-3 get irregular timing of 5:5:5:5:5:7
cogs 4-E get nothing (may not be running)
cog F gets 1:32

(Yes, I realize this is the sort of scenario where you want Table2.)

$01023405_67012304_56712304_FFFFFFFF

cog 0 has fitted hubops at strategic spots (instead of tuning the code around the slots)
cogs 1-7 have relatively fast hub access and do not contain timing-critical sections (jitter is okay)
cogs 8-E get nothing (may not be running)
cog F gets one lump block (likely not running, as half the slots can never be accessed with two-clock instruction timing)

$01230123_40414243_01230123_50515253

cog 0 gets 4:5:7:4:5:7
cog 1 gets 4:6:6:4:6:6
cog 2 gets 4:7:5:4:7:5
cog 3 gets 4:8:4:4:8:4
cog 4 gets a burst of four accesses, 2 clocks apart, once per 32-clock scan
cog 5 gets the same burst (16 clocks out of phase with cog 4)
cogs 6-F get nothing (may not be running)

And so on. Again, I'm not saying that the 32-entry table is sufficient for all cases, but I think it provides so much more than we can do with the simple 1:16 round-robin approach, that I think we'll still be figuring out new combinations when the P2 (or P3) gets released.
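The per-cog timings listed for these maps can be derived mechanically. Below is a small Python sketch that does so, assuming the leftmost hex digit is slot 0 and that the hub loops over all 32 entries; `hubp_gaps` is a hypothetical helper name, not a proposed instruction.

```python
def hubp_gaps(table):
    """Return {cog: [gaps between successive slots]} for a 32-digit HUBP string."""
    digits = table.replace("$", "").replace("_", "")
    slots = {}
    for index, ch in enumerate(digits):
        slots.setdefault(int(ch, 16), []).append(index)
    return {
        cog: [b - a for a, b in zip(hits, hits[1:] + [hits[0] + len(digits)])]
        for cog, hits in slots.items()
    }

gaps = hubp_gaps("$01230123_40414243_01230123_50515253")
print(gaps[0])   # [4, 5, 7, 4, 5, 7]
print(gaps[1])   # [4, 6, 6, 4, 6, 6]
print(gaps[3])   # [4, 8, 4, 4, 8, 4]

print(hubp_gaps("$C010D020_C030D040_C010D050_C060D070")[0xC])   # [8, 8, 8, 8]
```

A list with all entries equal means jitter-free hub access for that cog; anything else is the kind of irregular spacing the examples above call out.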

jmg · 2014-05-07 12:24

Seairth wrote: »

Just so we are all on the same page, here is what the HUBP table (That's what I'll call it, in case jmg's Table1 is actually something different) would look like for various scenarios:

Yup. Any Cog ID in Any Slot - but I'd add some examples of 28/24/20 scans, to make it crystal clear that is a vital part of the Scanning Choices.

jmg · 2014-05-07 12:32

Seairth wrote: »

I just think that Table1 alone still gives us a lot of new opportunities.

Correct (but it needs to support Reload, eg 20/24/28/32 etc scans).

Seairth wrote: »

I don't see how Table2 will make any difference to the ability to use all slots.

It does not affect the ability to use all slots - the mapping can be changed, after init, at any time.
Being able to easily modify mapping for a Cluster of COGS will be a common use case.

Mooch is a subtle variant, that allows a pre-defined modify to be Automatic, ie a fetch-time decision.
The single table SW control means one of the set of COGS in that cluster has to modify the mapping

- Optional re-config of the 32x to 16+16 has appeal, but right now my testcases are showing a significant timing hit from the second level of indirection.
I will try other use-more-silicon approaches, to see if the speed can be improved.

Seairth · 2014-05-07 12:45

jmg wrote: »

Yup. Any Cog ID in Any Slot - but I'd add some examples of 28/24/20 scans, to make it crystal clear that is a vital part of the Scanning Choices.

Nope. That requires your Table2, I believe. If you want those examples, you add them! I was sticking to strictly one 32-slot table, as in my original proposal/conclusion/epiphany/etc.

:P

jmg · 2014-05-07 13:57

Seairth wrote: »

Nope. That requires your Table2, I believe.

What requires table 2 ?
Any Cog in any slot is not Table 2 related. Mooch does need either a 2nd, or a split, table.
To make it easier to follow, I use two lines, one for 'slow' and one for 'fast' - in a final Map those are all in one table row.

Seairth wrote: »

If you want those examples, you add them! I was sticking to strictly one 32-slot table, as in my original proposal/conclusion/epiphany/etc.

Here are some examples of No-Jitter, more granular Slot Repetition Rate choices <>32

Table: 2 @ 20ns SRR(fast) & 14 @ 140ns SSR(slow) 
|<------ Array ------------>|<------ Array ------------>| M28
00000000001111111111222222220000000000111111111122222222
01234567890123456789012345670123456789012345678901234567
 0 1 2 3 4 5 6 8 9 A B C D E 0 1 2 3 4 5 6 8 9 A B C D E         << Low BW 14 @ 7.142MHz, no jitter
F 7 F 7 F 7 F 7 F 7 F 7 F 7 F 7 F 7 F 7 F 7 F 7 F 7 F 7          << HiBW 2 @ 50MHz, no jitter, 20ns SRR 
                                                                 << Also 1 @ 100MHz, 10ns SRR
Table:  3 @ 30ns SRR(fast) & 12 @ 120ns SSR(slow) 
|<------ Array -------->|<------ Array -------->|  M24
000000000011111111112222000000000011111111112222
012345678901234567890123012345678901234567890123
 1 2 3 4 6 8 9 B C D E F 1 2 3 4 6 8 9 B C D E F        << Low BW 12 @ 8.333MHz, no jitter
0 5 A 0 5 A 0 5 A 0 5 A 0 5 A 0 5 A 0 5 A 0 5 A         << HiBW 3 @ 33.33MHz, no jitter 30ns SRR


Table:  4 @ 40ns SRR(fast) & 12 @ 120ns SSR(slow)
|<------ Array -------->|<------ Array -------->| M24
000000000011111111112222000000000011111111112222
012345678901234567890123012345678901234567890123
 1 2 3 5 6 7 9 A B D E F 1 2 3 5 6 7 9 A B D E F        << Low BW 12 @ 8.333MHz, no jitter
0 4 8 C 0 4 8 C 0 4 8 C 0 4 8 C 0 4 8 C 0 4 8 C         << HiBW 4 @ 25MHz, no jitter 40ns SRR

Table: 5 @ 50ns SRR(fast) & 10 @ 100ns SSR(slow)
|<------ Array ---->|<------ Array ---->| M20
0000000000111111111100000000001111111111
0123456789012345678901234567890123456789
 1 2 4 5 6 8 9 B C E 1 2 4 5 6 8 9 B C E         << Low BW 10 @ 10 MHz, no jitter 100ns SRR
0 3 7 A D 0 3 7 A D 0 3 7 A D 0 3 7 A D          << HiBW 5 @ 20MHz, no jitter 50ns SRR
 
Table:  2 @ 20ns SRR(fast) & 10 @ 100ns SSR(slow)
|<------ Array ---->|<------ Array ---->| M20
0000000000111111111100000000001111111111
0123456789012345678901234567890123456789
 1 2 4 5 6 8 9 B C E 1 2 4 5 6 8 9 B C E         << Low BW 10 @ 10 MHz, no jitter 100ns SRR
D F D F D F D F D F D F D F D F D F D F          << HiBW 2 @ 50MHz, no jitter 20ns SRR
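To illustrate why the scan length matters, here is a Python sketch that rebuilds the 3-fast/12-slow table above and compares a 24-slot scan with the same entries forced into a fixed 32-slot scan. The 5 ns slot time is inferred from the 16 @ 80ns default, and padding the 32-slot case by repeating the first eight entries is purely an assumption for the comparison; `build_scan` and `gaps` are hypothetical names.

```python
SLOT_NS = 5   # assumed slot time: the default 16-slot scan is quoted as 80ns

def build_scan(fast, slow):
    """Interleave fast cogs (even slots, round-robin) with slow cogs (odd slots)."""
    table = []
    for i, cog in enumerate(slow):
        table.append(fast[i % len(fast)])
        table.append(cog)
    return table

def gaps(table, cog):
    """Gaps between successive slots of `cog` when the scan wraps at len(table)."""
    hits = [i for i, owner in enumerate(table) if owner == cog]
    return [b - a for a, b in zip(hits, hits[1:] + [hits[0] + len(table)])]

fast = [0x0, 0x5, 0xA]
slow = [0x1, 0x2, 0x3, 0x4, 0x6, 0x8, 0x9, 0xB, 0xC, 0xD, 0xE, 0xF]

m24 = build_scan(fast, slow)            # scan counter reloads after 24 slots
print([g * SLOT_NS for g in gaps(m24, 0x0)])   # [30, 30, 30, 30]
print([g * SLOT_NS for g in gaps(m24, 0x1)])   # [120]

locked32 = m24 + m24[:8]                # same pattern in a scan locked to 32 slots
print([g * SLOT_NS for g in gaps(locked32, 0x0)])   # [30, 30, 30, 30, 30, 10]
```

With the 24-slot reload every cog sees a constant interval; locked to 32 slots, the fast cog picks up a short 10 ns gap at the wrap, which is the jitter the ReLoad value is meant to avoid.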

jmg · 2014-05-07 16:41

jmg wrote: »

I tried this, and got some unexpectedly low speeds...

A 32x table, with Default switch, and ReLoad, reports on my FPGA build as 240.500MHz, which is similar to other P&R values for my other scanners.

If I add the pipeline check of 'Will next opcode use HUB' - ie basically one more level of lookup, and a Top/Bottom half mux using that decision result, plus a x32 or x16/x16 choice, then P&R reports 131.423MHz

Some more numbers :

From above :
x32 Mapper ~ 240MHz ( 50% Alloc possible & appx speed of non-pair config choice )
x16/x16 OPCisHUB ~ 131.423MHz (Auto Mooch variant)

Added a Boolean re-mapper, use-more-silicon, but (hopefully) lower delays - seems routing delays kill this.
(and the LUT cost was also high - added hundreds LUT4.)

x16/x16 ReOrder OPCisHUB ~ 105.809MHz and Logic is much higher
( also get 90.066MHz and double the logic, if re-order table is 32x, but Mooch scan is <=16, so can use 16 re-order table )
This shows a move from x16 to x32 tables, costs ~ 16MHz in fMAX

add another pipeline
x16/x16 ReOrder pipeline2 ~ 141.263MHz - better, but still costly.

return to x16/x16 OPCisHUB but add 2nd pipeline

x16/x16 OPCisHUB ~ 207.684MHz This has config for x32 and x16/x16 Auto Mooch variant

The two pipelines* needed to get tolerable speed, impose these assumptions.

a) The Opcode pipeline can provide an 'early boolean', showing the NEXT opcode will need the HUB (ie I have a HUB opcode in pipeline signal)

b) The AllocCOG (4b) is not needed in the first clock of that opcode, but is valid for 2nd and further clocks.

Larger tables do have a speed cost, as well as a resource cost.

* These pipelines (hopefully) tap-into, and run in Parallel with, existing pipelines and opcode 'need info now' timings, so have no fSys penalties.

This ~207.684MHz includes 2 Config control Bits for
i) Force Default 1:16 operation (ignores Table, can be applied at any time)
ii) Operate as either x32 or x16/x16 Mooch Pair : 2nd COG is used IF first does not need slot.

Cluso99 · 2014-05-07 17:39

jmg: Something is severely wrong with your timing if changing from 16 to 32 slots changes the timings at all. Surely this is purely where a counter overflows back to zero.
It is more likely that your 'n' modulo counter is affecting the timing. But as you know, the counter can be reversed and run down to zero (the user doesn't see this).

All: There seems to be a misconception of what my Table#2 is for. It is for the case where the cog in Table#1 does not require the allocated slot (this time around). Table#2 allows the unused slot to be offered to another cog. This permits a mooching dedication of each slot, rather than mooching all slots to one cog. But for my co-operating pairs of cogs, I can give priority of one or both slots to the primary cog, and if not required it will be offered to the other cog in the pair.
So this gives ultimate flexibility. I don't believe this should add any significant timing as everything would be decoded in parallel in the previous clock. Hub accesses will be setting up their read/write in the clock before the hub transaction takes place. I am sure Chip will be able to solve this if he decides to implement this.
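
To make the Table#1 / Table#2 idea concrete, here is a small Python sketch that counts hub grants over one pass of two 32-slot tables. It is only a model of the behaviour described above; the pairing chosen for Table#2 (each cog's unused slot is offered to its neighbour) and the names `grants` and `wants` are assumptions for illustration.

```python
from collections import Counter

SLOTS = 32
# Table#1 default: cogs 0..15 twice, i.e. the status quo of 1:16 each.
table1 = [s % 16 for s in range(SLOTS)]
# Table#2 (hypothetical pairing): offer an unused slot to the partner cog,
# pairing 0<->1, 2<->3, and so on.
table2 = [c ^ 1 for c in table1]


def grants(table1, table2, wants):
    """Count hub grants per cog over one pass of the tables.

    wants(cog, slot) -> True if that cog has a hub access pending in that slot.
    """
    tally = Counter()
    for slot in range(len(table1)):
        first, second = table1[slot], table2[slot]
        if wants(first, slot):
            tally[first] += 1        # Table#1 cog has priority
        elif wants(second, slot):
            tally[second] += 1       # otherwise the slot is offered on
    return tally


# Cog 0 always wants the hub and cog 1 never does:
# cog 0 keeps its own 2 slots and picks up cog 1's 2 slots (4 of 32, i.e. 1:8).
busy0 = grants(table1, table2, lambda cog, slot: cog == 0)
```

When both cogs of the pair want the hub on every slot, each simply keeps its own two slots, so the default behaviour is unchanged.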

kwinn · 2014-05-07 17:47

Cluso99 wrote: »

Now we are getting somewhere!!
Quite a time back I proposed a 32 slot table so that we could give some cogs only 1:32 while others received a lot more. But it was howled down with noise.

By having a configurable table (Table#1) of 32 slots (where the default is set 0..15, 0..15 meaning no change) the user can configure what he desires.
Now if there were an additional table (Table#2) of 32 slots to be used when the cog given the slot in Table#1 does not require it, then we could effectively do a programmable level of slot-pairing where donation is priority or submissive. A simple form of mooching can be achieved also. Just 2 levels would be sufficient. With careful decoding in an earlier clock, timing could be resolved.

IMHO, this would not be very difficult to implement, not a lot of silicon, and extremely flexible. Default maintains the status quo.

So you're the culprit ;-)
I thought someone posted the table idea a while back but I was not sure. It's such a simple, elegant, and flexible method of controlling hub access. Can't understand why it didn't get more discussion then.

Not too enthused about adding a second table though. Not because of the extra silicon, but because of the extra complexity and time it will take to access it. Look ahead decoding for that would not be all that easy either. Better IMHO to use the silicon for a 64 entry table instead.

kwinn · 2014-05-07 17:50

+1 for sure. Keep it simple to implement, use, and understand.

Seairth wrote: »

If we left this out and just stuck with Table #1, I think that would open up so many possibilities on its own. Then, once we've all had some real experience under our belt, we could decide if it makes sense to include Table #2 as part of P2 (or some other future chip). Would you be heartbroken if only Table #1 was implemented?

kwinn · 2014-05-07 17:54

Absolutely! Such a simple tiny addition to the table makes it at least twice as good.

jmg wrote: »

It also needs a single ReLoad value field (5b), to cover no-jitter scan cases that do not divide into 16 (see my examples above: 14, 12, 10). This is also why a ROM-choice table is not enough.

jmg · 2014-05-07 17:56

Cluso99 wrote: »

jmg: Something is severely wrong with your timing if changing from 16 to 32 slots changes the timings at all. Surely this is purely where a counter overflows back to zero.
It is more likely that your 'n' modulo counter is affecting the timing. But as you know, the counter can be reversed and run down to zero (the user doesn't see this).

No surprises really, part of this behaves as RAM, and larger RAM is always slower than smaller RAM as it has more decode-trees.
Counters also clock slower, the larger they get, but 4-5 bit counters will be well above fSys
(32b counters are another story.)

Cluso99 wrote: »

All: There seems to be a misconception of what my Table#2 is for. It is for the case where the cog in Table#1 does not require the allocated slot (this time around). Table#2 allows the unused slot to be offered to another cog. This permits a mooching dedication of each slot, rather than mooching all slots to one cog. But for my co-operating pairs of cogs, I can give priority of one or both slots to the primary cog, and if not required it will be offered to the other cog in the pair.

Yup, I coded it so when in Pair/Mooch mode, it checks Table1.COG first and if it does not need the slot right then, it flops to Table2.COG, which may, or may not use it. In this mode, the 32x folds into 2 x 16, and Reload <= 16.

In 32x mode, Reload is <= 32
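
A quick way to see why the Reload value matters for no-jitter scans is the Python sketch below. It is a toy model, not the coded design: `scan` and `gaps` are hypothetical helpers, and `reload` is taken here to mean the number of table entries scanned before the slot counter wraps (the actual 5b field encoding is not specified above).

```python
def scan(table, reload, clocks):
    """Yield the cog granted on each clock; the slot counter wraps after `reload` entries."""
    if not 1 <= reload <= len(table):
        raise ValueError("reload must be between 1 and the table length")
    for clk in range(clocks):
        yield table[clk % reload]


def gaps(seq, cog):
    """Set of distances (in clocks) between successive grants to `cog`."""
    hits = [i for i, c in enumerate(seq) if c == cog]
    return {b - a for a, b in zip(hits, hits[1:])}


# 12 active cogs with Reload = 12: every cog sees a fixed 12-clock period.
table = [s % 16 for s in range(32)]
even = list(scan(table, 12, 96))

# Same 12 cogs squeezed into a fixed 16-entry scan by giving the spare
# 4 slots to cogs 0..3: those cogs get more bandwidth, but with jitter.
table16 = [s % 12 for s in range(16)]
uneven = list(scan(table16, 16, 96))
```

Here `gaps(even, 0)` is `{12}`, while `gaps(uneven, 0)` is `{4, 12}`: the fixed-length scan can only hand out the spare slots unevenly, which is the case the Reload field avoids.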

Cluso99 wrote: »

So this gives ultimate flexibility. I don't believe this should add any significant timing as everything would be decoded in parallel in the previous clock. Hub accesses will be setting up their read/write in the clock before the hub transaction takes place. I am sure Chip will be able to solve this if he decides to implement this.

Easy to 'believe' anything, I prefer hard numbers.
The Tools tell me I can get > 200MHz for a 'mooch' option, but only by using two pipeline levels.
They also give ~240MHz for the simpler 32x/ReLoad Mapper.

I think that is ok, as they should not add to present pipelines & I expect the HUB-Slot-Value is not needed before 2nd clk in the opcode.
