New Hub Scheme For Next Chip
cgracey
Posts: 14,209
Roy Eltham came by and we were talking about all this concern over hub sharing. He had a great idea of having the individual memories act as hubs, instead of just one hub controlling all the memories and serving just one cog at a time.
We happen to have 16 instances of hub RAM, anyway, that make up the total 512KB. By having them each be 8192 locations x 32 bits, and distributing contiguous long addresses among all of them, we could make it so that every cog could read or write a subsequent long on every clock - which is 4x faster than the RDQUAD/WRQUAD scheme. This also allows all memories to be 32 bits wide, instead of 128, which cuts memory power by 65%. Initial latency varies now by the long address, but once locked, longs can be read or written on every clock.
Here is a diagram I made:
[Diagram attachment not available.]
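To make the address distribution concrete, here is a minimal Python sketch (my own illustration, assuming the bottom nibble of a long address selects the RAM instance, as Chip confirms further down the thread):

def map_long_address(long_addr):
    # Bottom nibble selects one of the 16 RAM instances; the remaining
    # bits select one of the 8192 rows within that RAM.
    ram = long_addr & 0xF
    row = long_addr >> 4
    return ram, row

# 16 RAMs x 8192 rows x 4 bytes = 512KB total:
assert max(map_long_address(a)[1] for a in range(512 * 1024 // 4)) == 8191

# 16 consecutive longs touch all 16 RAMs exactly once:
print([map_long_address(a)[0] for a in range(16)])  # 0, 1, 2, ... 15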
Comments
Sounds great. Many ideas have been brought up by flipping the whole question of who serves what, and when.
Free thinkers who can see what the problem really is and then come up with a solution are just super.
We need any cog to have random access to any hub address. Do the block accesses limit this in any way?
Almost Maker Faire time? Guess I'll miss it again this year ... too much to do.
The biggest drawback is the latency depends on where you are in the cycle when the request is made AND the low order nibble of the address.
Chris Wardell
I think it's now more targeted at 16-long bursts, and isn't that what you should use the hub for anyway?
I guess you can still get a single long, but that may take as long as 16 of them.
In this scheme, your initial wait is now address-dependent, but no worse than before, on average. What this does is make it possible to quickly transfer longs to and from hub memory, once your latency is over.
This would make LMM really efficient, since instructions can be read into the cog at twice the rate they can be executed. Hub exec is a lot cleaner to generate code for, though, since you don't have to be considerate of boundaries. Hub exec does complicate the cog quite a bit, though. It takes a whole slew of instructions to support. This scheme makes LMM really efficient, but doesn't do a whole lot, directly, for hub exec.
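As a rough illustration of that address-dependent wait (my own model, assuming the window advances by one bottom-nibble per clock):

def initial_wait(addr_nibble, window_nibble):
    # Clocks until the rotating window reaches the requested nibble.
    return (addr_nibble - window_nibble) % 16

waits = [initial_wait(n, 0) for n in range(16)]
print(min(waits), max(waits), sum(waits) / 16)  # 0 15 7.5 - no worse on average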
Does that mean there will be new support in other opcodes to allow this burst speed?
How does that speed change with COG cycles per address?
I can see that a very fast COG can 'spin in sync' with the memory allocator, but as soon as it needs even 2 cycles to prepare, doesn't the speed fall to one access per 16 SysClocks?
Idea: given the implied AutoINC, what about a choice of the Index++ step?
If the COG knows it can manage 6 cycles per write, it can add 6 on every WR/RD.
Now the next HUB access matches the key, and the COG has full bandwidth.
Of course, the memory is now sparse, but both sides know how sparse, and they know the base address.
Another idea: looking some more at this sized index INC, if it can be made numerically a little smarter, then multiple add-loops can manage the sparse aspect.
Example of the LSB nibble of a hypothetical 3-spaced INC: those +3 INCs do three circuits of the window, covering all 16 LSBs, and then can advance.
Memory fills in an unusual order, but the end result is 21 writes into a span of 31 bytes.
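Enumerating the bottom nibble of such a +3-spaced INC (my own sketch of the idea, in Python) shows the circuits covering all 16 LSBs:

nibbles = [(i * 3) & 0xF for i in range(16)]
print(nibbles)  # 0, 3, 6, 9, 12, 15, 2, 5, 8, 11, 14, 1, 4, 7, 10, 13
# gcd(3, 16) == 1, so all 16 LSBs are hit before the pattern repeats;
# the running total wraps the 16-slot window three times along the way.
assert sorted(nibbles) == list(range(16))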
No added latency, I think you meant? It still takes 16 cycles, and I think it assumes an AutoINC somewhere in there too.
John Abshier
The opposite: no two cogs can read $xxxF at the same time; they all have to take turns reading the lower $0-$F of an address.
It's like banks in SRAM, but it's bit0-to-bit3 that does the bank switching, not the MSBs that you normally think of when it comes to banks.
You could say memory access is interleaved, so a cog can read/write to its heart's content because no other cog will be at this "bank" at this moment.
In the example above, COG0 gets immediate access, whilst COG1 has to wait until the allocator spins to have xxx0 pointing to it.
ALL COGs can do cycle-simultaneous HUB writes, but to what are actually physically different memories, so each COG takes turns at each physical memory block.
I think if a COG knows how fast it can think, and can match the LSB with what is about to arrive, it can get above 1/16 rates.
If it fails to match for any reason, it has to wait for the next spin-around.
One of the hub memories would have all $xxxx0 long addresses, while the next would have all $xxxx1 long addresses, and so on. Any cog that wanted to read $xxxxD, for example, would have to wait for that memory to come to his window. All longs at long addresses $xxxxD would be in the same physical memory block. Does that make sense?
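One way to picture the rotation (a sketch under my own assumption that each cog's window advances by one RAM per clock, offset by cog number):

def ram_for_cog(cog, clock):
    # Assumed rotation rule: cog c sees RAM (c + t) mod 16 at clock t.
    return (cog + clock) & 0xF

# Every clock, the 16 cogs occupy 16 distinct RAMs - no contention:
for t in range(32):
    assert len({ram_for_cog(c, t) for c in range(16)}) == 16
print("all 16 cogs hit distinct RAMs on every clock")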
So if a cog is fetching a long from hub, and the long happens to be on (address mod 16*4) == cogid, then there is zero latency? Is a burst read possible?
I guess one concern is in getting enough instructions that don't need another HUB access before the next cycle.
Another concern is in calculating the offset for starting the new instruction assuming a burst. If we have to fetch a block, calculate the address, and jump to that address to execute an instruction that sucks some of the mips out of the execution rate.
I'd like to see Eric's or David's or others' opinions on how easily a code generator could take advantage of this, though. If the COG instruction address to execute is automatically set in the cog by some special instruction, it might not make any difference to the code generator.
New instruction maybe? "Atomic hub fetch and execute" ?
John Abshier
From #3, I think if you can tolerate waiting 16 fSys, you can have a 16-opcode block, with known alignment.
So the operation would be 16 CLKs to load, then some time to 'work on' that block, until a new one is needed.
That would imply a RdBlock opcode, and that always takes 16 fSys, with no initial alignment concerns.
But with the 2-clock instruction cycle, wouldn't a cog see only every second memory address? I.e., from a cog's point of view, the first instruction can access xxx0, the second can access xxx2 - what about getting access to xxx1?
That's exactly right.
Doh! I guess that answers my block read question.
Whether it is usable or not for optimum performance is TBD.
There would be special instructions RDBLOC/WRBLOC to handle the transfers of 16 longs. Regardless of the initially-available window, it would always take 16 clocks (+1 for the memory read delay).
Say you wanted to load cog addresses $1E0..$1EF from hub long addresses $1000..$100F and when the RDBLOC instruction started, window $xxx6 was available. No problem:
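A hedged reconstruction of the transfer order (my sketch, assuming the block transfer simply starts at the available window and wraps around):

start_window = 6
for clk in range(16):
    nib = (start_window + clk) & 0xF
    print("clock %2d: hub $%04X -> cog $%03X" % (clk, 0x1000 + nib, 0x1E0 + nib))
# clock 0 moves $1006 -> $1E6, ..., clock 10 wraps to $1000 -> $1E0,
# and all 16 longs land in exactly 16 clocks regardless of the start window.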
Hub bandwidth would be:
16 RAMs * 32 bits * 200MHz = 12.8GB/s
Cog bandwidth would be:
32 bits * 200MHz = 800MB/s
That is for RDBLOC/WRBLOC only? (Which can read every SysClk.)
What about the detail Tubular asked? It does seem the 100MOPs/200MHz interactions get 'interesting' here...
["But with the 2-clock instruction cycle, wouldn't a cog see only every second memory address? I.e., from a cog's point of view, the first instruction can access xxx0, the second can access xxx2 - what about getting access to xxx1?"]
Is there AutoINC somewhere in the mix, and how tightly can multiple RDxx/WRxx opcodes execute?
I guess both sides will be forced to a 1*16 boundary? If so, the opcode only needs 5 bits for the cog destination and 14 bits for HUB to cover up to 1MB (16384*16*4 bytes).
I meant to convey that the RDBLOC/WRBLOC instructions process a long on every clock, taking 16 contiguous clocks to do so, giving them different timing than normal two-clock instructions.
For discrete contiguous RDLONG instructions, it might be good to order the memories differently, so that if RDLONG D,PTRx++ needs two clocks, you could hit most ascending addresses on time:
0, 8, 1, 9, 2, A, 3, B, 4, C, 5, D, 6, E, 7, F...
-Phil
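A quick check of why that ordering helps (my own sketch, assuming the window steps one slot per clock and a RDLONG issues every 2 clocks):

order = [0x0, 0x8, 0x1, 0x9, 0x2, 0xA, 0x3, 0xB,
         0x4, 0xC, 0x5, 0xD, 0x6, 0xE, 0x7, 0xF]
# At even clocks 0, 2, 4, ... the visible nibble ascends 0, 1, 2, ...,
# so back-to-back RDLONG D,PTRx++ hits each next long right on time:
print([order[t] for t in range(0, 16, 2)])  # 0, 1, 2, 3, 4, 5, 6, 7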
For the hub memory, that is so, but not for the cog. A 4-bit adder is no big deal.
In editing my post, I had extended the pattern so that you'd always get ascending addresses on time for the RDLONG, but it created periodic circumstances where the same bottom nibble appeared twice in groups of 16. Any group of 16 needs to have all unique bottom nibble addresses for RD/WRBLOC to work properly.
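That uniqueness requirement is easy to check mechanically; a small sketch (mine):

def all_windows_unique(pattern):
    # Every run of 16 consecutive slots in the repeated pattern must
    # contain all 16 bottom nibbles for RD/WRBLOC to finish in 16 clocks.
    seq = pattern * 3
    return all(len(set(seq[i:i + 16])) == 16 for i in range(len(pattern)))

print(all_windows_unique(list(range(16))))  # True: any rotation of a permutation
print(all_windows_unique([0, 8, 1, 9, 2, 10, 3, 11,
                          4, 12, 5, 13, 6, 14, 7, 15]))  # True as well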
How do non-memory hubops work with this new scheme? How does timing for cognew's, locks, clockset, mathops, etc. work? Do I have to wait for slot 5 to mess with cog or lock 5 or set the clock to a setting ending in 5 or divide by something ending in 5 or does the main hub still work the same way it worked on the P1?
Also, I have two suggestions that could possibly be useful but might just get in the way if implemented.
Let's say that I want to do a rd/wrlong, and I have a 15 instruction window available to do it in, with no other hubops in that area. None of those 15 instructions would care about the value of the rd/wrlong. Can you add a way to ask for the rd/wrlong to automatically happen at the best place in that sequence? The only problem with this is if one of those 15 instructions happens to be a hubop. Would this clear the delayed one or would it just happen and then the delayed one happen later? What if the one that shouldn't be there is at the same place the delayed rdlong happens? Which one loses? Or does the cog just wait for memory 9 to come around twice?
Also, can you make the order go 0,8,1,9,2,A,3,B,4,C,5,D,6,E,7,F? Interleaving the accesses like that would allow you to have a hubop every other instruction, which might be better as it's probably more realistic for a program to have data every 2 clocks, so it has time to figure out what to read or write. Something that wants consecutive access every clock can just use rd/wrblock, which would still behave the same way, although hubexec might be slower (unless it loads 16 instructions ahead). Chip just said this... (ninja'd maybe). And I forgot that instructions take 2 clocks...
electrodude
Can you expand here? The list seems to have a pitch of +9, but you mention RDLONG D,PTRx++ needs two clocks? I can see a PTRx++ here - is that adding 9? (With a 2-cycle opcode and a +9 pitch, I see a wait of 7 cycles every time?)
eg suppose one codes
RDLONG Da,PTRx++ ' this one may wait
RDLONG Db,PTRx++ ' ideally, these ones do not wait
RDLONG Dc,PTRx++
RDLONG Dd,PTRx++
RDLONG De,PTRx++
What INC is needed for the fastest possible fall thru here ?
It's adding 4 (1 long), but it won't wind up reading the wrong location, as it verifies every time that the bottom nibble of its address is in agreement with the current window.
Can you enter the address (optimal address?) and cycles for each line of:
RDLONG Da,PTRx++ ' this one may wait
RDLONG Db,PTRx++ ' ideally, these ones do not wait
RDLONG Dc,PTRx++
RDLONG Dd,PTRx++
RDLONG De,PTRx++
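The thread doesn't answer this here, but as a hedged model (mine - assuming 2-clock instructions, a one-slot-per-clock window, and PTRx++ advancing one long), the timing works out like this:

def run(order, n_reads=5):
    t, nib, waits = 0, 0, []
    for _ in range(n_reads):
        stall = 0
        while order[t % 16] != nib:  # wait until our nibble's window arrives
            t += 1
            stall += 1
        waits.append(stall)
        t += 2                       # the RDLONG itself takes 2 clocks
        nib = (nib + 1) & 0xF        # PTRx++ advanced one long
    return waits

ascending = list(range(16))
interleaved = [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
print(run(ascending))    # [0, 15, 15, 15, 15] - each RDLONG after the first stalls
print(run(interleaved))  # [0, 0, 0, 0, 0]     - Phil's order falls straight through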