A 32-slot Approach (was: An interleaved hub approach)

kwinn · 2014-05-07 18:20

By George, I think you've got it. And jmg's addition of a reload counter is the icing on the cake. Instead of being limited to looping through all 32 slots you can specify how many slots to loop through. That means you can set it up so a cog can do evenly timed hub reads/writes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or 16 times for each loop through the 32 entry table. About as flexible as you can get.

Seairth wrote: »

I am not arguing against Table2. Or the notion that setting up intervals other than 2^ cannot be precisely done without it. I just think that Table1 alone still gives us a lot of new opportunities.

Unless jmg's definition of Table1 differs from my simple 32-entry table, and I'm not seeing the difference (this has happened once before), then that's not true. There are 16^32 combinations that could be assigned. There is never an unused slot. At worst case, a slot is assigned to a cog that's not running. I don't see how Table2 will make any difference to the ability to use all slots.

Just so we are all on the same page, here is what the HUBP table (That's what I'll call it, in case jmg's Table1 is actually something different) would look like for various scenarios:

$01234567_89ABCDEF_01234567_89ABCDEF
default, 1:16 for all cogs

$00000000_00000000_00000000_00000000
one cog to rule them all

$34343434_34343434_34343434_34343434
two cogs (selected for their associated pins) get 1:2 (effectively no hub latency)

$01273457_6FF7FFF7_0FF7FFF7_FFF7FFF7
cog 0 gets 1:16

cogs 1-6 get 1:32

cog 7 gets 1:4

cogs 8-E don't get anything (may not be running)

cog F isn't running

$C010D020_C030D040_C010D050_C060D070
cog 0 gets 1:2

cog 1 gets 1:16

cogs 2-7 get 1:32

cogs 8-B get nothing (may not be running)

cogs C-D get 1:8 (1:4 relative to each other)

cogs E-F get nothing (may not be running)

$01020310_20310203_10203102_0310203F
cog 0 gets irregular timing of 2:2:3:2:3:2:3:2:3:2:3:2:3

cogs 1-3 get irregular timing of 5:5:5:5:5:7

cogs 4-E get nothing (may not be running)

cog F gets 1:32

(Yes, I realize this is the sort of scenario where you want Table2.)

$01023405_67012304_56712304_FFFFFFFF
cog 0 has fitted hubops at strategic spots (instead of tuning the code around the slots)

cogs 1-7 have relatively fast hub access and do not contain timing-critical sections (jitter is okay)

cogs 8-E get nothing (may not be running)

cog F gets one lump block (likely not running, as half the slots can never be accessed with two-clock instruction timing)

$01230123_40414243_01230123_50515253
cog 0 gets 4:5:7:4:5:7

cog 1 gets 4:6:6:4:6:6

cog 2 gets 4:7:5:4:7:5

cog 3 gets 4:8:4:4:8:4

cog 4 gets a 2:2:2:2 burst every 24 clocks

cog 5 gets a 2:2:2:2 burst every 24 clocks (16 clocks out of phase with cog 4)

cogs 6-F get nothing (may not be running)

And so on. Again, I'm not saying that the 32-entry table is sufficient for all cases, but I think it provides so much more than we can do with the simple 1:16 round-robin approach, that I think we'll still be figuring out new combinations when the P2 (or P3) gets released.
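The ratios quoted for the example tables can be checked mechanically. The following is a minimal Python sketch, not anything from the P2 design: it assumes the leftmost hex digit is slot 0 and that one table entry equals one hub slot, and the names `parse_table` and `slot_gaps` are made up for illustration.

```python
from collections import defaultdict

def parse_table(text):
    """Turn '$01234567_...' into a list of 32 cog IDs, slot 0 first (assumed order)."""
    digits = text.lstrip("$").replace("_", "")
    if len(digits) != 32:
        raise ValueError("expected 32 hex digits")
    return [int(d, 16) for d in digits]

def slot_gaps(table):
    """Map each cog ID to the gaps (in slots) between its turns, wrap-around included."""
    positions = defaultdict(list)
    for slot, cog in enumerate(table):
        positions[cog].append(slot)
    n = len(table)
    gaps = {}
    for cog, pos in positions.items():
        # Distance from each turn to the next one; a lone entry wraps to itself (gap n).
        gaps[cog] = [(pos[(i + 1) % len(pos)] - p - 1) % n + 1
                     for i, p in enumerate(pos)]
    return gaps

if __name__ == "__main__":
    example = parse_table("$01230123_40414243_01230123_50515253")
    for cog, g in sorted(slot_gaps(example).items()):
        print(f"cog {cog:X}: " + ":".join(str(x) for x in g))
```

A cog that never appears in the table has no entry in the result, which corresponds to the "get nothing" lines in the examples.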

jmg · 2014-05-07 19:18

kwinn wrote: »

Better IMHO to use the silicon for a 64 entry table instead.

I tried a 64x table, just to see how the P&R reported (Target set to 206MHz ) - interesting numbers result
a) 32x, Reload Scan, Default Config Boolean -> 240.500MHz
b) 64x, Reload Scan, Default Config Boolean -> 168.407MHz
c) 64x, Reload Scan, remove Default Config -> 165.508MHz
d) 48x, Reload Scan, Default Config Boolean -> 192.530MHz

Usually c) would be expected faster than b) as there is less logic, but there is also P&R 'noise' in speed results.
The penalty seems a little magnified, maybe because 32x is above the target, and the others are below. I've seen P&R 'give up' if it cannot meet timing, and so there may be a figure where it attains higher, but not likely > 206MHz

A Default Config Boolean allows the table to be designed as a SRAM ASIC macro, as the contents are ignored in Default.

Whilst these may not reflect OnSemi ASIC too closely, they are FPGA P&R reports, so should track what Altera does.
ie be useful indicators on the FPGA file speed impacts of various design choices.
It is easy to drop well under the critical-path of ~ 200+MHz

The intermediate 48x has intermediate speed.
Counter is still 6b, but the memory-like muxes are smaller and the routing spread is also smaller.

kwinn wrote: »

Not too enthused about adding a second table though. Not because of the extra silicon, but because of the extra complexity and time it will take to access it. Look ahead decoding for that would not be all that easy either.

Yes, the tools reflect there is more logic, but crafted with two pipeline stages, I did manage to get a split 16x/16x choice option working @ 207.684MHz. ( Interesting that is slower than 32x, but faster than 64x )

Edit: added the 48x test case above (Default Config Boolean included).

Seairth · 2014-05-07 19:44

jmg wrote: »

Here are some examples of No-Jitter, more granular Slot Repetition Rate choices <>32

Ok! I get it now! "Pictures" make a big difference!

So, you could have

SETHUBP D, #n

where D points to a QUAD for the 32-entry table and #n indicates your "ReLoad" value (how many entries to iterate over before starting at the beginning of the table again). And the default would be $01234567_89ABCDEF, 16 (no need to iterate over all 32).
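As a rough behavioural model of that idea (a Python sketch only, with `hub_scanner` and `entries` as invented names, and assuming #n is a plain count of entries rather than a count-minus-one encoding), the scanner just walks the first n entries and wraps:

```python
from itertools import islice

def hub_scanner(table, entries):
    """Yield the cog ID granted in each successive hub slot.

    `entries` is the assumed meaning of the ReLoad value: how many table
    entries are scanned before the index returns to 0.
    """
    if not 1 <= entries <= len(table):
        raise ValueError("entries must be between 1 and the table length")
    index = 0
    while True:
        yield table[index]
        index = (index + 1) % entries

# Default: 0..F repeated, scanning only the first 16 entries.
DEFAULT_TABLE = list(range(16)) * 2
first_18 = list(islice(hub_scanner(DEFAULT_TABLE, 16), 18))

# A non-power-of-two case: scanning 12 entries gives cog 0 a jitter-free 1:3.
table = [0, 1, 2, 0, 3, 4, 0, 5, 6, 0, 7, 8] + [15] * 20
grants = list(islice(hub_scanner(table, 12), 36))
cog0_slots = [slot for slot, cog in enumerate(grants) if cog == 0]
```

The 20 trailing entries in the second table are never reached, so their contents do not matter in this model.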

Seairth · 2014-05-07 19:48

kwinn wrote: »

Better IMHO to use the silicon for a 64 entry table instead.

I'd stick to 32 entries. With 16 cogs, this makes the table 128 bits wide, which just happens to be the width of the cog-hub datapath. Given the "SETHUBP D, #n" instruction, though, I don't know how you'd pass the 5-bit ReLoad value.

jmg · 2014-05-07 19:52

Seairth wrote: »
Ok! I get it now! "Pictures" make a big difference!

So, you could have
SETHUBP D, #n
where D points to a QUAD for the 32-entry table and #n indicates your "ReLoad" value (how many entries to iterate over before starting at the beginning of the table again). And the default would be $01234567_89ABCDEF, 16 (no need to iterate over all 32).

Yup, correct.
I also coded it with a System config bit, that forces the default in Logic, so gives more choices in how the table is created, such as Asymmetric Dual Port SRAM ASIC 'cell', without needing RST init logic. (and a second config bit to swap Table from 32x to 2 x 16 for the Mooch cases)

Seairth · 2014-05-07 20:02

A few technical issues would need to be worked out:

What would happen if the ReLoad value was zero?
What would happen if the table was entirely set to a cog that wasn't running?
Should this affect the Hub Math access?
Should this affect other hubops like COGINIT, COGSTOP and SETHUBP?
Should this affect locks?

jmg · 2014-05-07 20:12

Seairth wrote: »

What would happen if the ReLoad value was zero?

The CogID in Table[0] is always presented.

Seairth wrote: »

What would happen if the table was entirely set to a cog that wasn't running?

That CogID would always be presented.
A Hub wait-implicit handshake would need a confirmation that Hub access was granted, before wait terminates.
As the ID matches no running COGS, & if they all use HUB access, eventually they would all be in hub_wait
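Both answers can be illustrated with a small Python sketch. It assumes the ReLoad value holds the last index scanned (so zero means only Table[0] is ever presented); `grants` and `starved_cogs` are hypothetical helper names, not part of any real tool.

```python
def grants(table, last_index, slots):
    """Cog ID presented in each of `slots` hub slots; the counter runs 0..last_index."""
    return [table[i % (last_index + 1)] for i in range(slots)]

def starved_cogs(table, last_index, running):
    """Running cogs that never appear in the scanned part of the table.

    In this model, any such cog that issues a hub operation waits forever.
    """
    scanned = set(table[: last_index + 1])
    return sorted(set(running) - scanned)

# ReLoad = 0: only Table[0] is presented, whatever else the table holds.
only_first = grants([5, 1, 2, 3], 0, 4)

# Whole table set to cog 9, which is not running: cogs 0-2 never get a slot.
stuck = starved_cogs([9] * 32, 31, running=[0, 1, 2])
```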

kwinn · 2014-05-07 21:04

Agreed, and the cog hub datapath width was the reason I suggested it initially. The 32 entry table really is the minimum size needed to be able to share hub bandwidth and still give a little to all 16 cogs.
I thought the instructions had 9 bit s and d fields, which is more than enough. Has this changed? If the "SETHUBP D, #n" instruction cannot be squeezed into a 32 bit instruction then perhaps D could come from the following memory location, or perhaps another instruction would be needed for that.

Seairth wrote: »

I'd stick to 32 entries. With 16 cogs, this makes the table 128 bits wide, which just happens to be the width of the cog-hub datapath. Given the "SETHUBP D, #n" instruction, though, I don't know how you'd pass the 5-bit ReLoad value.

kwinn · 2014-05-07 21:16

An alternative would be to copy the first or last 5 bits of the data quad that is written to the slot table. That would not be much of an inconvenience.

jmg · 2014-05-07 21:29

kwinn wrote: »

An alternative would be to copy the first or last 5 bits of the data quad that is written to the slot table. That would not be much of an inconvenience.

I'm not quite following ?

The suggestion was to use an opcode of form
SETHUBP D, #n
Where D is the base address of the Quad, and #n sets ReLoad.
#n here can be extracted from the opcode bits, and passed to a latch, and the Hub-path Quad can connect to the write-side of dual port memory 128x1 Write Side, 32 x 4 Read Side.

kwinn · 2014-05-07 23:03

jmg wrote: »

I'm not quite following ?

The suggestion was to use an opcode of form
SETHUBP D, #n
Where D is the base address of the Quad, and #n sets ReLoad.
#n here can be extracted from the opcode bits, and passed to a latch, and the Hub-path Quad can connect to the write-side of dual port memory 128x1 Write Side, 32 x 4 Read Side.

That was in response to the quote below. What I meant was that if there were not enough bits in the "SETHUBP D, #n" instruction for the 5 bits needed to specify the number of entries in the hub slot table we could use 5 bits from the quad that was written to the table to load the Reload register as well. A bit more complicated than having it as part of the SETHUBP instruction but manageable if it was needed. After all, the cogs can be specified in any order we want.

For instance we could use the 5 lsb's of the quad to specify 30 entries by having the last two cogs of the table be 15 and 13 (hex F and D ). By loading the 5 lsb's from those ( bin 1111 and 1101 ) we would load 11101 into the reload register.
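A quick Python check of that arithmetic, assuming the last table entry sits in the least-significant nibble of the quad and that the reload register holds the last index (so 29 means 30 entries). The first 30 entries of the example table are arbitrary filler, and `parse_quad` and `reload_from_quad` are invented names.

```python
def parse_quad(text):
    """Parse a '$xxxxxxxx_...' table string into a 128-bit integer."""
    return int(text.lstrip("$").replace("_", ""), 16)

def reload_from_quad(quad):
    """Return the 5 least-significant bits of the 128-bit table value."""
    return quad & 0b11111

# Last two entries are F and D, as in the example above.
quad = parse_quad("$01234567_89ABCDEF_01234567_89ABCDFD")
reload = reload_from_quad(quad)   # low byte $FD = %1111_1101, low 5 bits = %11101
entries_scanned = reload + 1      # assumed count-minus-one encoding
```

With 30 entries scanned, the two entries that carry the reload bits are never reached, so they cost nothing in this case.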

Seairth wrote: »

I'd stick to 32 entries. With 16 cogs, this makes the table 128 bits wide, which just happens to be the width of the cog-hub datapath. Given the "SETHUBP D, #n" instruction, though, I don't know how you'd pass the 5-bit ReLoad value.

jmg · 2014-05-07 23:12

kwinn wrote: »

What I meant was that if there were not enough bits in the "SETHUBP D, #n" instruction for the 5 bits needed to specify the number of entries in the hub slot table we could use 5 bits from the quad that was written to the table to load the Reload register as well. ....

? The general Opcode format has 2 x 9 bit S & D fields, so one specifies the Quad base, and the other can be an immediate #9b
- there seem to be spare spaces in the Opcode table, and #9b could also support the 2 control bits, if they were not mapped elsewhere.
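To show that the bits fit, here is a Python sketch of one possible packing of the 9-bit immediate. The bit positions are purely an assumption for illustration (the thread does not define them): reload in bits 4..0, the Default config bit in bit 5, the 32x vs 2 x 16 bit in bit 6.

```python
RELOAD_BITS = 5  # enough for a 32-entry table

def pack_immediate(reload, default_cfg=False, pair_mode=False):
    """Pack reload and two control bits into an immediate (hypothetical layout)."""
    if not 0 <= reload < (1 << RELOAD_BITS):
        raise ValueError("reload must fit in 5 bits")
    return reload | (int(default_cfg) << 5) | (int(pair_mode) << 6)

def unpack_immediate(imm):
    """Inverse of pack_immediate: (reload, default_cfg, pair_mode)."""
    return imm & 0x1F, bool(imm >> 5 & 1), bool(imm >> 6 & 1)

imm = pack_immediate(29, pair_mode=True)
```

Seven bits are used, so the value stays inside a 9-bit field with two bits to spare.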

Cluso99 · 2014-05-07 23:15

kwinn wrote: »

That was in response to the quote below. What I meant was that if there were not enough bits in the "SETHUBP D, #n" instruction for the 5 bits needed to specify the number of entries in the hub slot table we could use 5 bits from the quad that was written to the table to load the Reload register as well. A bit more complicated than having it as part of the SETHUBP instruction but manageable if it was needed. After all, the cogs can be specified in any order we want.

For instance we could use the 5 lsb's of the quad to specify 30 entries by having the last two cogs of the table be 15 and 13 (hex F and D ). By loading the 5 lsb's from those ( bin 1111 and 1101 ) we would load 11101 into the reload register.

Too complex. Just use a second instruction to limit the no of table entries. Much easier to explain.

Brian Fairchild · 2014-05-07 23:57

Have I got this right?

The "32 entry table plus reload" is

1) A 128-bit register, accessible by a WRLONG (and hence atomic), divided into 32 4-bit fields.
2) Plus a 5-bit counter reload value register
3) Each 4-bit field contains the number (0-15) of the core that will gain access on the next memory access cycle
4) A 5-bit counter, with its top value set by the reload value, scans the fields at the memory access clock rate
5) Loaded 0-15,0-15 on chip reset with a reload of 32
6) Any core can write to the registers to change the access pattern

jmg · 2014-05-08 01:59

Brian Fairchild wrote: »

Have I got this right?

The "32 entry table plus reload" is

1) A 128-bit register, accessible by a WRLONG (and hence atomic), divided into 32 4-bit fields.
2) Plus a 5-bit counter reload value register
3) Each 4-bit field contains the number (0-15) of the core that will gain access on the next memory access cycle
4) A 5-bit counter, with its top value set by the reload value, scans the fields at the memory access clock rate
5) Loaded 0-15,0-15 on chip reset with a reload of 32
6) Any core can write to the registers to change the access pattern

Yup, pretty much nailed it, from the operational viewpoint.
As described above, it can 'drop in' to replace the present Hub scanner, which is locked at 1:16.

There is also a variant option that splits the 32 into 2x16, which gives a pair of COGids for each scan. The first COGid in the pair gets first rights to the Slot; if it does not use it, the second COGid can.
That is rather tighter in timing, and presumes two levels of pipeline linkage with the total system. So it's a 'maybe' option.
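The selection rule of the 2x16 variant can be sketched in Python as below. This models only the choice made in one slot, not the pipeline timing; `primary`, `secondary` and `wants_hub` are illustrative names, and which half of the table acts as the first choice is an assumption.

```python
def pair_grant(primary, secondary, wants_hub, slot):
    """Return the cog granted hub access in `slot`, or None if the slot goes unused.

    primary, secondary: 16-entry lists of cog IDs (the two halves of the table).
    wants_hub: set of cog IDs that are requesting hub access in this slot.
    """
    i = slot % 16
    first, second = primary[i], secondary[i]
    if first in wants_hub:
        return first      # first COGid has first rights
    if second in wants_hub:
        return second     # otherwise the second COGid may take the slot
    return None

# Round-robin owners, with cog 0 mooching any slot its owner leaves idle.
primary = list(range(16))
secondary = [0] * 16
```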

Brian Fairchild · 2014-05-08 02:29

jmg wrote: »

Yup, pretty much nailed it, from the operational viewpoint.

Excellent, I've had 8 days with no real internet access recently and have got very behind with my forum reading.

I have to say I like this proposal as a nice simple clean way to both implement it and explain it.

One idea I've just had, and I'm happy to be shot down,...

1) Add a 32-bit central register, cleared to 0 on reset.
2) Each bit corresponds to one of the 32 access slots.
3) If a bit is set (by the programmer) it means "donate this slot to core 0 if the assigned core doesn't need it this time around".

This essentially gives a combination of operating modes...

1) The reset state is pure round-robin access.
2) Changing the table allows finer-grained control of access to suit what cores need to do.
3) Changing the register allows for non-deterministic access for the "void main (void)" core, affecting certain cores which can tolerate it.
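A Python sketch of the donate register idea, under the assumptions that bit i of the register corresponds to table slot i and that the beneficiary is fixed at core 0 as proposed; `donate_grant` and `wants_hub` are hypothetical names.

```python
def donate_grant(table, donate_mask, wants_hub, slot, beneficiary=0):
    """Return the core granted hub access in `slot`, or None if nobody takes it.

    table: list of core IDs, one per slot.
    donate_mask: integer whose bit i marks slot i as donatable (assumed bit order).
    wants_hub: set of core IDs requesting hub access in this slot.
    """
    i = slot % len(table)
    owner = table[i]
    if owner in wants_hub:
        return owner                      # assigned core always keeps its slot
    if (donate_mask >> i) & 1 and beneficiary in wants_hub:
        return beneficiary                # idle, donatable slot goes to core 0
    return None

table = list(range(16)) * 2
mask = (1 << 5) | (1 << 21)   # only core 5's two slots are donatable
```

With the mask at zero (the reset state) the function reduces to plain table lookup, matching mode 1 above.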

mark · 2014-05-08 10:51

Unless I'm misunderstanding, then apparently this form of hub window allocation is also more or less what I had in mind (I guess some of us were speaking different languages), but I certainly had no concept for low-level implementation. I would have been happy with only a handful of "hard coded" configurations to choose from, so it being flexible is certainly welcome! This just seems the most logical method for those that want hub bandwidth flexibility while still maintaining guaranteed bandwidth for all cogs. I don't see any drawback other than the programmer erroneously giving insufficient hub bandwidth to a cog object from the OBEX (though I maintain that the compiler could still possibly flag the issue).

jmg · 2014-05-08 12:29

mark wrote: »

jmg · 2014-05-08 12:38

Brian Fairchild wrote: »

Excellent, I've had 8 days with no real internet access recently and have got very behind with my forum reading.

I have to say I like this proposal as a nice simple clean way to both implement it and explain it.

One idea I've just had, and I'm happy to be shot down,...

1) Add a 32-bit central register, cleared to 0 on reset.
2) Each bit corresponds to one of the 32 access slots.
3) If a bit is set (by the programmer) it means "donate this slot to core 0 if the assigned core doesn't need it this time around".

This essentially gives a combination of operating modes...

1) The reset state is pure round-robin access.
2) Changing the table allows finer-grained control of access to suit what cores need to do.
3) Changing the register allows for non-deterministic access for the "void main (void)" core, affecting certain cores which can tolerate it.

That is pretty much the dual-choice (if-based) alternative in #26 (and that wholly encompasses the 'Super Cog 0' mirth thread option).

Being smarter, it can choose any COG, not just Cog0. Once you have to change the ID, it does not matter if you change it to 0000 or N; there is no speed difference, so there is no technical reason to be asymmetric or limiting.

Note that ANY conditional (if-based) COG selection does have a Speed Impact issue, (Yes even the 'Super Cog 0' mirth idea)

On my tests, the added impact of conditional (if-based) COG only just gets above 200MHz on FPGA, even with double pipelines. (P&R @ 207.684MHz)

So if-based needs an OnSemi process reality check, assuming Chip wants to try it.

A simpler 32x slot mapper easily meets speed, and it can be run-time modified to get close to an if-based auto-selecting choice.

A fixed mapper is more deterministic than Auto-Select, which is best kept out of tight control areas, but is fine for HMI areas, where humans do not care if repaints vary in time.

That is why my testcase did both choices, config time selectable.
Sometimes deterministic trumps average.

kwinn · 2014-05-08 22:04

What I suggested was a last ditch choice if there was no other way to load the "Reload" register. I would greatly prefer that the "SETHUBP D, #n" instruction could use the "D" field to specify the Quad to write to the hub slot table, and the "#n" immediate value gets loaded into the "Reload" register. That way everything for hub access gets set up with one single atomic instruction.

jmg wrote: »

? The general Opcode format has 2 x 9 bit S & D fields, so one specifies the Quad base, and the other can be an immediate #9b
- there seem to be spare spaces in the Opcode table, and #9b could also support the 2 control bits, if they were not mapped elsewhere.

jmg · 2014-05-08 22:15

kwinn wrote: »

What I suggested was a last ditch choice if there was no other way to load the "Reload" register. I would greatly prefer that the "SETHUBP D, #n" instruction could use the "D" field to specify the Quad to write to the hub slot table, and the "#n" immediate value gets loaded into the "Reload" register. That way everything for hub access gets set up with one single atomic instruction.

Ahh, I misread that message then.

kwinn · 2014-05-08 22:16

I agree. My first choice would be the single "SETHUBP D, #n" instruction that sets up both the table and the "Reload" counter. Second choice would be another instruction for the "Reload" register. The kludge in post 42 would be a distant third, and only if there was no other way to go about it.

Cluso99 wrote: »

Too complex. Just use a second instruction to limit the no of table entries. Much easier to explain.

kwinn · 2014-05-08 22:22

Brian Fairchild wrote: »

Excellent, I've had 8 days with no real internet access recently and have got very behind with my forum reading.

I have to say I like this proposal as a nice simple clean way to both implement it and explain it.

One idea I've just had, and I'm happy to be shot down,...

1) Add a 32-bit central register, cleared to 0 on reset.
2) Each bit corresponds to one of the 32 access slots.
3) If a bit is set (by the programmer) it means "donate this slot to core 0 if the assigned core doesn't need it this time around".

This essentially gives a combination of operating modes...

1) The reset state is pure round-robin access.
2) Changing the table allows finer-grained control of access to suit what cores need to do.
3) Changing the register allows for non-deterministic access for the "void main (void)" core, affecting certain cores which can tolerate it.

That's an excellent idea. Best of both worlds.

jmg · 2014-05-08 22:27

jmg wrote: »

On my tests, the added impact of conditional (if-based) COG only just gets above 200MHz on FPGA, even with double pipelines. (P&R @ 207.684MHz)

As another testcase, I moved this to FPGA block_ram, with a MUX on OP and Address.
That will be a little more like an ASIC, perhaps even optimistic now in values.

With double pipelines, and [Default] and [Pair] control booleans, it now reports P&R @ 278.319MHz

That looks comfortable enough to be off any critical path ( especially with the new, lower MHz targets )

Addit : Reading the reports led to some small tweaks/edits, and now it bumps to include Block-ram and distributed RAM: 6 (12 LUT4s), and MHz has gone over 300MHz.

Certainly makes [32x & 16x/16x] Table Allocator look very practical. Full control and fast.

Heater. · 2014-05-08 22:29

jmg,

When are you releasing the HDL for your version of the PII ?

jmg · 2014-05-08 22:35

kwinn wrote: »

That's an excellent idea. Best of both worlds.

Not quite, as you cannot define the CogID of the benefiting COG(s) - a single, locked target is all you have.

A split table allows any COG(s) to be boosted, and designs may commonly want to split this to two as COG0 cannot use all 5ns slots anyway.

With my block_ram tests (and Chip's lower MHz), there is looking to be enough margin to do 32x and 16x/16x choices.
Same table, just one bit splits it. (assumes the pipelines needed for any if-based design can mesh with the COG code )

jmg · 2014-05-08 22:38

Heater. wrote: »

jmg,

When are you releasing the HDL for your version of the PII ?

hehe, based on my current rate of progress, on the fractional test cases of Smart pin counter and Hub Alloc, I'd guess 2019, maybe 2021 ?

jmg · 2014-05-09 15:57

jmg wrote: »

FPGA block_ram, with a MUX on OP and Address.

Design A : block_ram on 
   Number of distributed RAM:   6 (12 LUT4s)  
   Number of ripple logic:      0 (0 LUT4s)
   Total number of LUT4s:      46
   Number of block RAMs:  1 out of 72 (1%)

Remove the block-ram switch, otherwise same code,

   Number of SLICEs:            43 out of 16632 (0%)
      SLICEs(logic/ROM):        31 out of 13428 (0%)
      SLICEs(logic/ROM/RAM):    12 out of  3204 (0%)
          As RAM:           12 out of  3204 (0%)
          As Logic/ROM:      0 out of  3204 (0%)
   Number of logic LUT4s:      45
   Number of distributed RAM:  12 (24 LUT4s)
   Number of ripple logic:      3 (6 LUT4s)
   Total number of LUT4s:      75
   Number of block RAMs:  0 out of 72 (0%)

Reported MHz is still > 350MHz, Dual port RAM based table

Even with block_ram off, the design has become more compact, and is faster than the original sea-of-LUT code.

Seems the 'coding style' needed to apply block_ram, is also better managed by the tools, even when no directives are given.
The rep mentions that both distributed RAM & ripple logic are now being used, hence the higher MHz.

Addit : Now this has ample margin in FPGA distributed RAM, here is
A 64x : 32x/32x :F16 Table

  MAP  Report:  357.526MHz is the maximum frequency for this preference.  ( was 420.698MHz )
   P&R  Report:  351.124MHz is the maximum frequency for this preference.  ( was 399.680MHz )
   
   Number of SLICEs:            61 out of 16632 (0%)
      SLICEs(logic/ROM):        37 out of 13428 (0%)
      SLICEs(logic/ROM/RAM):    24 out of  3204 (1%)
          As RAM:           24 out of  3204 (1%)
          As Logic/ROM:      0 out of  3204 (0%)
   Number of logic LUT4s:      54
   Number of distributed RAM:  24 (48 LUT4s)   was 12 (24 LUT4s)
   Number of ripple logic:      3 (6 LUT4s)    was 3 (6 LUT4s)
   Total number of LUT4s:     108              was  75
   Number of block RAMs:  0 out of 72 (0%)

64x is slower than 32x, but still is >> 200MHz

Note that FPGA Dual Port memory cannot wide-write, so a different means of loading will be required.

mark · 2014-05-10 07:25

I've been wondering if there even needs to be a table at all..

Instead, in the hub you could have 16 (or whatever the number of cogs is) 6 or 7-bit counters, each which feeds its output to one bank of a magnitude comparator (with the other bank being tied to a programmable register loaded with a value specifying how many counts must elapse before a cog gets hub access) of similar bit width, with the output of each comparator acting as a sort of "chip select" line for its respective cog, with the additional duty of being a counter reset-to-zero line.

The way it works is like this: initially, each counter is seeded with a value from 0 through #ofCogs, and the loadable registers for all of the comparators are loaded with #ofCogs. With this value, we have the standard round-robin access scheme. However, we can change that by loading different values instead. We could load the value 8 into half of the comparators, and 32 in the others, which would affect hub access accordingly... Or the value 4 in two of the registers, and 28 in the remainder. My math skills are fairly weak, so I can only come up with a few combinations without wanting to bash my head against the table, but I think you get the idea.

Initially I was thinking that all the counters could be seeded with a fixed value, but it seems that in order to avoid some difficulties (or in fact, it might even be mandatory), it would be desirable to be able to seed a variety of values, so there would need to be a user-loadable register for that too.

Judging by the schematics I looked at, it doesn't seem like such small counters and magnitude comparators require all that much logic. To me, it seems like the logic to seed the counters would be the most tricky, but who knows?
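The counter and comparator scheme can be tried out with a short Python model. This is a sketch under assumptions: each counter increments once per hub slot, the comparator fires when the counter equals the loaded value, and firing resets the counter; `comparator_scheme` and `requested_bandwidth` are invented names. One thing the model makes visible is that the loaded values must satisfy sum(1/value) <= 1, otherwise two comparators must fire in the same slot at some point. The 4-and-28 split sums to exactly 1, while eight cogs at 8 plus eight at 32 sums to 5/4 and would need some arbitration that the description does not cover.

```python
from fractions import Fraction

def comparator_scheme(periods, seeds, slots):
    """Return, for each hub slot, the list of cogs whose comparator fires."""
    counters = list(seeds)
    history = []
    for _ in range(slots):
        fired = []
        for cog, period in enumerate(periods):
            counters[cog] += 1
            if counters[cog] == period:   # comparator match: select cog, reset counter
                counters[cog] = 0
                fired.append(cog)
        history.append(fired)
    return history

def requested_bandwidth(periods):
    """Fraction of all hub slots the loaded values ask for in total."""
    return sum(Fraction(1, p) for p in periods)

# Standard round-robin: every comparator loaded with 16, counters seeded 0..15.
round_robin = comparator_scheme([16] * 16, range(16), 64)

# Two cogs at 1:4, fourteen at 1:28, with seeds chosen so that no two fire together.
split_periods = [4, 4] + [28] * 14
split_seeds = [3, 1] + [26 - 2 * k for k in range(14)]
split = comparator_scheme(split_periods, split_seeds, 56)
```

The seeds in the second case were picked by hand so that the two fast cogs take the even slots and the slow cogs share the odd ones, which supports the point that seeding needs to be programmable.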

kwinn · 2014-05-11 14:18

@mark
