New Hub Scheme For Next Chip

cgracey · 2014-05-13 13:07

Roy Eltham came by and we were talking about all this concern over hub sharing. He had a great idea of having the individual memories act as hubs, instead of just one hub controlling all the memories and serving just one cog at a time.

We happen to have 16 instances of hub RAM, anyway, that make up the total 512KB. By having them each be 8192 locations x 32 bits, and distributing contiguous long addresses among all of them, we could make it so that every cog could read or write a subsequent long on every clock - which is 4x faster than the RDQUAD/WRQUAD scheme. This also allows all memories to be 32 bits wide, instead of 128, which cuts memory power by 65%. Initial latency varies now by the long address, but once locked, longs can be read or written on every clock.

Here is a diagram I made:

Attachment not found.

tonyp12 · 2014-05-13 13:15

> great idea of having the individual memories act as hubs, instead of just one hub controlling all the memories
Sounds great. Many ideas have come from flipping the whole question of who serves what, and when.
Free thinkers who can see what the problem really is and then come up with a solution are just super.

cgracey · 2014-05-13 13:16

I just realized that we could do 16-long block reads/writes with NO latency by just starting on whatever long was currently available and wrapping to do the rest as they came.

jazzed · 2014-05-13 13:30

This is a wee bit confusing.

We need any cog to have random access to any hub address. Do the block accesses limit this in any way?

Almost Maker Faire time? Guess I'll miss it again this year ... too much to do.

ctwardell · 2014-05-13 13:33

I can see where this would work well for contiguous transfers, but it really isn't very good for random access.

The biggest drawback is the latency depends on where you are in the cycle when the request is made AND the low order nibble of the address.

Chris Wardell

tonyp12 · 2014-05-13 13:34

> We need any cog to have random access to any hub address
I think it's now more targeted at 16-long bursts, and isn't that what you should use the hub for anyway?
I guess you can still get a single long, but that may take as long as getting 16 of them.

cgracey · 2014-05-13 13:38

jazzed wrote: »

This is a wee bit confusing.

We need any cog to have random access to any hub address. Do the block accesses limit this in any way?

Almost Maker Faire time? Guess I'll miss it again this year ... too much to do.

In this scheme, your initial wait is now address-dependent, but no worse than before, on average. What this does is make it possible to quickly transfer longs to and from hub memory, once your latency is over.
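
As a rough illustration of the address-dependent wait described here, the following sketch models the initial latency. It assumes (this is not stated in the thread) that a cog's window advances by exactly one memory per clock and that the memory is selected by the low nibble of the long address; the names `initial_wait` and `average_wait` are made up for the example.

```python
NUM_BANKS = 16  # one hub RAM per low nibble of the long address


def initial_wait(long_addr, window):
    """Clocks until the RAM holding long_addr reaches this cog's window.

    Assumption: the window advances by one RAM per clock.
    """
    bank = long_addr & (NUM_BANKS - 1)
    return (bank - window) % NUM_BANKS


def average_wait():
    """Mean wait over every combination of target RAM and current window."""
    waits = [initial_wait(bank, window)
             for bank in range(NUM_BANKS)
             for window in range(NUM_BANKS)]
    return sum(waits) / len(waits)


if __name__ == "__main__":
    print(initial_wait(0x100D, window=6))
    print(average_wait())
```

Under this model the wait ranges from 0 to 15 clocks depending on both the address and the moment of the request, which is the dependence ctwardell points out.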

This would make LMM really efficient, since instructions can be read into the cog at twice the rate they can be executed. Hub exec is a lot cleaner to generate code for, though, since you don't have to be considerate of boundaries. Hub exec does complicate the cog quite a bit, though. It takes a whole slew of instructions to support. This scheme makes LMM really efficient, but doesn't do a whole lot, directly, for hub exec.

jmg · 2014-05-13 13:44

cgracey wrote: »

. Initial latency varies now by the long address, but once locked, longs can be read or written on every clock.

Very interesting, but to meet "longs can be read or written on every clock" implies a REPS opcode and AutoINC.
Does that mean there will be new support in other opcodes to allow this burst speed?

How does that speed change, with COG-Cycles-per-Address ?
I can see that a very fast COG, can 'spin in sync' with the Memory allocator, but as soon as it needs even 2 cycles to prepare, the speed falls to 16 SysClocks ?

Idea: Given the implied AutoINC, what about a choice on the Index++ ?

If the COG knows it can manage 6 cycles per write, it can add 6 on every WR/RD.
Now, the next HUB access matches the key, and the COG has BW.
Of course, the memory is now sparse, but both sides know how sparse, and they know the base address.

Another idea: Looking some more at this Sized index INC, if it can be numerically a little smarter then multiple add-loops can manage the sparse aspect.

Example of the LSB nibble of a hypothetical 3-spaced INC

= 0
= 3
= 6
= 9
= 12
= 15  6
= 2
= 5
= 8
= 11
= 14  5
= 1
= 4
= 7
= 10
= 13  5
= 0
= 3
= 6
= 9
= 12
= 15    <= 21 writes

Those +3 INCs do three circuits, and they have covered all LSBs, and then can advance.
Memory fills in an unusual order, but the end result is 21 writes into a span range of 31 bytes
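
The low-nibble pattern in this example can be reproduced with a small sketch. It only models the arithmetic of a fixed index increment modulo 16; the function names are hypothetical, and nothing here implies such an instruction exists.

```python
from math import gcd

NUM_BANKS = 16


def nibble_sequence(stride, count, start=0):
    """Low nibbles visited by `count` accesses that step by `stride` longs."""
    return [(start + i * stride) % NUM_BANKS for i in range(count)]


def covers_all_banks(stride):
    """True if 16 successive accesses touch 16 different low nibbles."""
    return len(set(nibble_sequence(stride, NUM_BANKS))) == NUM_BANKS


if __name__ == "__main__":
    print(nibble_sequence(3, 16))
    print([s for s in range(1, 16) if covers_all_banks(s)])
    print(gcd(3, NUM_BANKS))
```

A stride covers every low nibble exactly when it shares no common factor with 16, so odd strides such as 3 work while even strides such as 6 keep revisiting the same memories.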

jmg · 2014-05-13 13:46

cgracey wrote: »

I just realized that we could do 16-long block reads/writes with NO latency by just starting on whatever long was currently available and wrapping to do the rest as they came.

No added latency, I think you meant? It still takes 16 cycles, and I think it assumes an AutoINC somewhere in there too.

John Abshier · 2014-05-13 13:47

Chip, I don't understand. What if cog 0 wants to read $XXX0 and cog 1 wants to read $YYY0 where XXX not equal to YYY? Do all cogs have to read from the same $XXXn? Who sets $XXX?

John Abshier

tonyp12 · 2014-05-13 13:56

>Do all cogs have to read from the same $XXXn?
The opposite: no two cogs can read $xxxF at the same time; they all have to take turns reading the lower $0-F of an address.
It's like banks in SRAM, but it's bit 0 to bit 3 that does the bank switching, not the MSBs that you normally think of when it comes to banks.
You could say memory access is interleaved, so a cog can read/write to its heart's content because no other cog will be at this "bank" at this moment.

jmg · 2014-05-13 13:56

John Abshier wrote: »

Chip, I don't understand. What if cog 0 wants to read $XXX0 and cog 1 wants to read $YYY0 where XXX not equal to YYY? Do all cogs have to read from the same $XXXn? Who sets $XXX?

In the example above, COG0 gets immediate access, whilst COG1 has to wait until the allocator spins to have xxx0 pointing to it.
ALL COGS can do 'Cycle simultaneous' HUB writes, but to what are actually physically different memories, so each COG actually has turns to each Physical memory block.

I think if COG knows how fast it can think, and can match the LSB with what is about to arrive, it can get above 1/16 rates.
If it fails to match for any reason, it has to wait for the next spin-around.
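
A minimal sketch of the "each cog takes turns at each physical memory" idea follows. The rotation rule `(cog + clock) % 16` is an assumption chosen for illustration; the thread does not say which direction or offset the real allocator would use.

```python
NUM_COGS = 16
NUM_BANKS = 16


def bank_for_cog(cog, clock):
    """Memory visible to `cog` on `clock` (assumed simple rotation)."""
    return (cog + clock) % NUM_BANKS


def banks_at(clock):
    """Memory seen by every cog on one clock, indexed by cog number."""
    return [bank_for_cog(cog, clock) for cog in range(NUM_COGS)]


def clocks_until(cog, bank, clock):
    """Clocks `cog` waits from `clock` until `bank` reaches its window."""
    return (bank - bank_for_cog(cog, clock)) % NUM_BANKS


if __name__ == "__main__":
    print(banks_at(0))
    print(clocks_until(cog=0, bank=0, clock=0))
    print(clocks_until(cog=1, bank=0, clock=0))
```

On every clock the 16 cogs see 16 different memories, so all of them can access the hub at once; a cog that misses its memory simply waits for the rotation to bring it back.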

cgracey · 2014-05-13 13:57

John Abshier wrote: »

Chip, I don't understand. What if cog 0 wants to read $XXX0 and cog 1 wants to read $YYY0 where XXX not equal to YYY? Do all cogs have to read from the same $XXXn? Who sets $XXX?

John Abshier

One of the hub memories would have all $xxxx0 long addresses, while the next would have all $xxxx1 long addresses, and so on. Any cog that wanted to read $xxxxD, for example, would have to wait for that memory to come to his window. All longs at long addresses $xxxxD would be in the same physical memory block. Does that make sense?
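
The mapping described here can be sketched as a split of the long address into a memory number (the low nibble) and a location inside that memory (the remaining bits). The function names are hypothetical; the 16 memories of 8192 longs each come from the first post.

```python
NUM_BANKS = 16
BANK_DEPTH = 8192      # locations per memory, from the first post
BYTES_PER_LONG = 4


def split_long_address(long_addr):
    """Return (memory number, location inside that memory)."""
    return long_addr & (NUM_BANKS - 1), long_addr >> 4


def join_long_address(bank, row):
    """Inverse of split_long_address."""
    return (row << 4) | bank


def total_bytes():
    """Total hub RAM implied by 16 memories of 8192 longs."""
    return NUM_BANKS * BANK_DEPTH * BYTES_PER_LONG


if __name__ == "__main__":
    print(split_long_address(0x100D))
    print(total_bytes() // 1024, "KB")
```

Every long address ending in $D lands in memory 13, so consecutive long addresses always fall in consecutive memories.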

jazzed · 2014-05-13 14:06

Ok,

So if a cog is fetching a long from hub, and the long happens to be on (address mod 16*4) == cogid, then there is zero latency? Is a burst read possible?

I guess one concern is in getting enough instructions that don't need another HUB access before the next cycle.

Another concern is in calculating the offset for starting the new instruction assuming a burst. If we have to fetch a block, calculate the address, and jump to that address to execute an instruction that sucks some of the mips out of the execution rate.

I'd like to see Eric or David or other's opinions on how easily a code generator could take advantage of this though. If the COG instruction address to execute is automatically set in the cog by some special instruction it might not make any difference to the code generator.

New instruction maybe? "Atomic hub fetch and execute" ?

John Abshier · 2014-05-13 14:10

I think I grok it. It brought back a memory of programming on a mainframe, where it was vitally important to get row or column order access correct to avoid page faults.

John Abshier

jmg · 2014-05-13 14:12

jazzed wrote: »

Another concern is in calculating the offset for starting the new instruction assuming a burst. If we have to fetch a block, calculate the address, and jump to that address to execute an instruction that sucks some of the mips out of the execution rate.

From #3, I think if you can tolerate waiting 16 fSys, you can have a 16 opcode block, with known alignment.
So the operation would be 16 CLKs to load, then some time to 'work on' that block, until a new one is needed.

That would imply a RdBlock opcode, & that always takes 16 fSys, with no initial alignment cares.

Tubular · 2014-05-13 14:23

Neat idea, Roy. Great to see some "out of the box" thinking.

But with the 2 clock instruction cycle, wouldn't a cog see only every second memory address? Ie from a cog point of view, first instruction can access xxx0, second instruction can access xxx2, what about getting access to xxx1?

cgracey · 2014-05-13 14:39

jmg wrote: »

From #3, I think if you can tolerate waiting 16 fSys, you can have a 16 opcode block, with known alignment.
So the operation would be 16 CLKs to load, then some time to 'work on' that block, until a new one is needed.

That would imply a RdBlock opcode, & that always takes 16 fSys, with no initial alignment cares.

That's exactly right.

jazzed · 2014-05-13 14:43

jmg wrote: »

From #3, I think if you can tolerate waiting 16 fSys, you can have a 16 opcode block, with known alignment.
So the operation would be 16 CLKs to load, then some time to 'work on' that block, until a new one is needed.

That would imply a RdBlock opcode, & that always takes 16 fSys, with no initial alignment cares.

Doh! I guess that answers my block read question.

Whether it is usable or not for optimum performance is TBD.

cgracey · 2014-05-13 14:50

Tubular wrote: »

Neat idea, Roy. Great to see some "out of the box" thinking.

But with the 2 clock instruction cycle, wouldn't a cog see only every second memory address? Ie from a cog point of view, first instruction can access xxx0, second instruction can access xxx2, what about getting access to xxx1?

There would be special instructions RDBLOC/WRBLOC to handle the transfers of 16 longs. Regardless of the initially-available window, it would always take 16 clocks (+1 for the memory read delay).

Say you wanted to load cog addresses $1E0..$1EF from hub long addresses $1000..$100F and when the RDBLOC instruction started, window $xxx6 was available. No problem:

clock	hub	cog
-------------------
0	1006	1E6
1	1007	1E7
2	1008	1E8
3	1009	1E9
4	100A	1EA
5	100B	1EB
6	100C	1EC
7	100D	1ED
8	100E	1EE
9	100F	1EF
10	1000	1E0
11	1001	1E1
12	1002	1E2
13	1003	1E3
14	1004	1E4
15	1005	1E5
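
The table above can be regenerated with a short sketch. `rdbloc_schedule` is a made-up helper, and for simplicity it assumes both the hub and cog base addresses are 16-long aligned, as in this example.

```python
def rdbloc_schedule(hub_base, cog_base, first_window):
    """Return (clock, hub long address, cog address) for a 16-long block read.

    The transfer starts at whichever low nibble is currently available
    and wraps around within the 16-long block.
    """
    schedule = []
    for clock in range(16):
        offset = (first_window + clock) % 16
        schedule.append((clock, hub_base + offset, cog_base + offset))
    return schedule


if __name__ == "__main__":
    for clock, hub, cog in rdbloc_schedule(0x1000, 0x1E0, first_window=6):
        print(f"{clock}\t{hub:04X}\t{cog:03X}")
```

Whatever the starting window, all 16 longs are touched exactly once in 16 clocks.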

Hub bandwidth would be:

16 RAMs * 32 bits * 200MHz = 12.8GB/s

Cog bandwidth would be:

32 bits * 200MHz = 800MB/s
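
For anyone checking the arithmetic, here is the same calculation as a sketch, using the 200MHz clock and 16 x 32-bit memories quoted above (the function names are only for illustration):

```python
CLOCK_HZ = 200_000_000   # 200MHz, as quoted in the post
BYTES_PER_LONG = 4       # 32 bits
NUM_RAMS = 16


def cog_bandwidth():
    """Bytes per second for one cog moving one long per clock."""
    return BYTES_PER_LONG * CLOCK_HZ


def hub_bandwidth():
    """Bytes per second when all 16 memories are busy on every clock."""
    return NUM_RAMS * cog_bandwidth()


if __name__ == "__main__":
    print(cog_bandwidth() / 1e6, "MB/s per cog")
    print(hub_bandwidth() / 1e9, "GB/s total")
```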

jmg · 2014-05-13 15:05

cgracey wrote: »

Cog bandwidth would be:

32 bits * 200MHz = 800MB/s

That is for RDBLOC/WRBLOC only ? (which can Rd every SysClk)

What about the detail Tubular asked ? - it does seem the 100MOP/200MHz interactions get 'interesting' here..

["But with the 2 clock instruction cycle, wouldn't a cog see only every second memory address? Ie from a cog point of view, first instruction can access xxx0, second instruction can access xxx2, what about getting access to xxx1?"]

Is there AutoINC somewhere in the mix, and how tightly can multiple RDxx/WRxx opcodes execute ?

tonyp12 · 2014-05-13 15:07

>$1E0..$1EF from hub long addresses $1000..$100F
I guess both sides will be forced to a 16-long boundary? If so, the opcode only needs 5 bits for the cog dest and 14 bits for the hub to cover up to 1MB (16384*16*4 bytes).
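
A quick sketch of that field-width arithmetic, assuming a 512-long cog address space (implied by addresses such as $1E0..$1EF) and 16-long alignment on both sides; `bits_needed` is a hypothetical helper:

```python
LONGS_PER_BLOCK = 16
BYTES_PER_LONG = 4
COG_LONGS = 512          # assumption: 9-bit cog address space


def bits_needed(values):
    """Bits required to number `values` distinct items from zero."""
    return (values - 1).bit_length()


def hub_bytes_covered(block_bits):
    """Bytes addressable with `block_bits` bits of 16-long block number."""
    return (1 << block_bits) * LONGS_PER_BLOCK * BYTES_PER_LONG


if __name__ == "__main__":
    print(bits_needed(COG_LONGS // LONGS_PER_BLOCK))
    print(hub_bytes_covered(14))
```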

cgracey · 2014-05-13 15:17

jmg wrote: »

That is for RDBLOC/WRBLOC only ? (which can Rd every SysClk)

What about the detail Tubular asked ? - it does seem the 100MOP/200MHz interactions get 'interesting' here..

["But with the 2 clock instruction cycle, wouldn't a cog see only every second memory address? Ie from a cog point of view, first instruction can access xxx0, second instruction can access xxx2, what about getting access to xxx1?"]

Is there AutoINC somewhere in the mix, and how tightly can multiple RDxx/WRxx opcodes execute ?

I meant to convey that the RDBLOC/WRBLOC instructions process a long on every clock, taking 16 contiguous clocks to do so, giving them different timing than normal two-clock instructions.

For discrete contiguous RDLONG instructions, it might be good to order the memories differently, so that if RDLONG D,PTRx++ needs two clocks, you could hit most ascending addresses on time:

0, 8, 1, 9, 2, A, 3, B, 4, C, 5, D, 6, E, 7, F...

Phil Pilgrim (PhiPi) · 2014-05-13 15:20

cgracey wrote:

0, 8, 1, 9, 2, A, 3, B, 4, C, 5, D, 6, E, 7, F

That might affect determinism, since there's a double skip between 7 and 8. The only way around this is for the amount of the skip and the number of hub slots to be relatively prime. (Think about tightening lug nuts.)

-Phil
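
To make the skip visible, this sketch computes how many clocks separate each memory's window from the window of the next ascending memory, for a given rotation order. It assumes the window advances one position per clock; `ascending_gaps` is a hypothetical helper.

```python
LINEAR = list(range(16))
INTERLEAVED = [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]


def ascending_gaps(order):
    """Clocks from memory b's window to memory (b+1)'s next window, b = 0..15."""
    position = {bank: i for i, bank in enumerate(order)}
    n = len(order)
    gaps = []
    for bank in range(n):
        gap = (position[(bank + 1) % n] - position[bank]) % n
        gaps.append(gap or n)
    return gaps


if __name__ == "__main__":
    print(ascending_gaps(LINEAR))
    print(ascending_gaps(INTERLEAVED))
```

With the interleaved order most steps are 2 clocks apart, but 7 to 8 takes 3 clocks and F to 0 takes only 1. A perfectly uniform 2-clock step is not possible with 16 memories, since stepping by 2 positions only ever reaches 8 of them; that is the relatively-prime condition.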

cgracey · 2014-05-13 15:27

tonyp12 wrote: »

>$1E0..$1EF from hub long addresses $1000..$100F
I guess both sides will be forced to a 16-long boundary? If so, the opcode only needs 5 bits for the cog dest and 14 bits for the hub to cover up to 1MB (16384*16*4 bytes).

For the hub memory, that is so, but not for the cog. A 4-bit adder is no big deal.

cgracey · 2014-05-13 15:30

Phil Pilgrim (PhiPi) wrote: »

That might affect determinism, since there's a double skip between 7 and 8. The only way around this is for the amount of the skip and the number of hub slots to be relatively prime. (Think about tightening lug nuts.)

-Phil

In editing my post, I had extended the pattern so that you'd always get ascending addresses on time for the RDLONG, but it created periodic circumstances where the same bottom nibble appeared twice in groups of 16. Any group of 16 needs to have all unique bottom nibble addresses for RD/WRBLOC to work properly.

Electrodude · 2014-05-13 15:32

This looks really nice!

How do non-memory hubops work with this new scheme? How does timing for cognew's, locks, clockset, mathops, etc. work? Do I have to wait for slot 5 to mess with cog or lock 5 or set the clock to a setting ending in 5 or divide by something ending in 5 or does the main hub still work the same way it worked on the P1?

Also, I have two suggestions that could possibly be useful but might just get in the way if implemented.

Let's say that I want to do a rd/wrlong, and I have a 15 instruction window available to do it in, with no other hubops in that area. None of those 15 instructions would care about the value of the rd/wrlong. Can you add a way to ask for the rd/wrlong to automatically happen at the best place in that sequence?

first column is current memory bank
2: rdlongd data, $xxx9   ' rdlongd = delayed rdlong -- do it when memory 9 comes around
3: instruction 1
4: instruction 2
5: instruction 3
6: instruction 4
7: instruction 5
8: instruction 6
9: delayed rdlong from above happens here
A: instruction 7
B: instruction 8
C: instruction 9
D: instruction 10
E: instruction 11
F: instruction 12
0: instruction 13
1: instruction 14
2: instruction 15

The only problem with this is if one of those 15 instructions happens to be a hubop. Would this clear the delayed one or would it just happen and then the delayed one happen later? What if the one that shouldn't be there is at the same place the delayed rdlong happens? Which one loses? Or does the cog just wait for memory 9 to come around twice?

Also, can you make the order go 0,8,1,9,2,A,3,B,4,C,5,D,6,E,7,F? Interleaving the accesses like that would allow you to have a hubop every other instruction, which might be better as it's probably more realistic for a program to have data every 2 clocks, so it has time to figure out what to read or write. Something that wants consecutive access every clock can just use rd/wrblock, which would still behave the same way, although hubexec might be slower (unless it loads 16 instructions ahead). Chip just said this... (ninja'd maybe). And I forgot that instructions take 2 clocks...

electrodude

jmg · 2014-05-13 15:35

cgracey wrote: »

For discrete contiguous RDLONG instructions, it might be good to order the memories differently, so that if RDLONG D,PTRx++ needs two clocks, you could hit most ascending addresses on time:

0, 8, 1, 9, 2, A, 3, B, 4, C, 5, D, 6, E, 7, F...

Can you expand here - the list seems to have a pitch of +9 ,but you mention RDLONG D,PTRx++ needs two clocks ? I can see a PTRx++ here, is that adding 9 ? (with a 2 cycle opcode and a +9 pitch, I see a wait of 7 cycles every time ?)

eg suppose one codes
RDLONG Da,PTRx++ ' this one may wait
RDLONG Db,PTRx++ ' ideally, these ones do not wait
RDLONG Dc,PTRx++
RDLONG Dc,PTRx++
RDLONG Dc,PTRx++

What INC is needed for the fastest possible fall thru here ?

cgracey · 2014-05-13 15:41

jmg wrote: »

Can you expand here - the list seems to have a pitch of 9 ,but you mention RDLONG D,PTRx++ needs two clocks ? I can see a PTRx++ here, is that adding 9 ?

eg suppose one codes
RDLONG Da,PTRx++ ' this one may wait
RDLONG Db,PTRx++ ' ideally, these ones do not wait
RDLONG Dc,PTRx++
RDLONG Dc,PTRx++
RDLONG Dc,PTRx++

What INC is needed for the fastest possible fall thru here ?

It's adding 4 (1 long), but it won't wind up reading the wrong location, as it verifies every time that the bottom nibble of its address is in agreement with the current window.
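
The "verify the bottom nibble against the current window" behaviour can be sketched as a small simulation. Everything about the timing here is an assumption for illustration: the window moves one position per clock, an RDLONG occupies at least 2 clocks, and the memory order is passed in as a list. It is not a statement of the real instruction timing.

```python
LINEAR = list(range(16))
INTERLEAVED = [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]


def simulate_reads(start_byte_addr, count, order=LINEAR, instr_clocks=2):
    """Return (clock, byte address) for `count` back-to-back RDLONG D,PTRx++.

    Each read stalls until the memory selected by the low nibble of its
    long address is in the window, so it never reads the wrong location.
    """
    results = []
    ready = 0
    addr = start_byte_addr
    for _ in range(count):
        bank = (addr >> 2) & 0xF          # low nibble of the long address
        clock = ready
        while order[clock % 16] != bank:  # wait for the matching window
            clock += 1
        results.append((clock, addr))
        ready = clock + instr_clocks      # earliest start of the next read
        addr += 4                         # PTRx++ adds 4 bytes (1 long)
    return results


if __name__ == "__main__":
    print([c for c, _ in simulate_reads(0x4000, 4)])
    print([c for c, _ in simulate_reads(0x4000, 17, order=INTERLEAVED)])
```

Under these assumptions the plain 0..F order makes each following read wait for almost a full rotation, while the interleaved order lets most ascending reads complete every 2 clocks.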

jmg · 2014-05-13 15:46

cgracey wrote: »

It's adding 4 (1 long), but it won't wind up reading the wrong location, as it verifies every time that the bottom nibble of its address is in agreement with the current window.

I can see the wait detail, but cannot follow the delay with your values.

Can you enter the Address (optimal address?) and Cycles, for each line of

RDLONG Da,PTRx++ ' this one may wait
RDLONG Db,PTRx++ ' ideally, these ones do not wait
RDLONG Dc,PTRx++
RDLONG Dc,PTRx++
RDLONG Dc,PTRx++

mark · 2014-05-13 15:53

The 16 long burst transfer scheme is brilliant! Still trying to wrap my head around the implications of breaking up hub ram like this though. Doesn't this make it a bit more difficult to compute addresses of an array stored in, say, one hub ram "slice"?
