New Hub Scheme For Next Chip

Roy Eltham · 2014-05-14 17:52

And what about doing multiple wrxxxx in a row? Now you have to do all the ugly flipflop thing many times so they can all avoid stalling? Seems bad to me...

jmg · 2014-05-14 17:54

cgracey wrote: »

I suppose so, but it would have the UART-like problems that you mentioned. It would be an ugly exception to how the cog wants to work.

Is there opcode room for a SNAPCNT

eg Maybe 9 immediate bits fits
upper 2 bits can 'attach' to each of R,W Hub access, which saves needing an extra code line
Lower 7 gives 1..128 choice in delays
>= 16 change the snap rate, and reset the snap timer. * 'impractical' values < 16 give more control choices, like Enable/reset/RunToSnap

tonyp12 · 2014-05-14 18:00

Is hub ram now treated the same way as cog, a "byte" is 32bits and +1 points to the next long?
And then of cource hub-address-robin is long address based?
Pros is that opcode for hub access only needs 18bit for 1MB range

jmg · 2014-05-14 18:00

Roy Eltham wrote: »

And what about doing multiple wrxxxx in a row? Now you have to do all the ugly flipflop thing many times so they can all avoid stalling? Seems bad to me...

Yes, I think buffering is like the UART I mentioned, as you say, at some stage you DO have to stall.
Hence the alternative of SNAPCNT - no extra buffers needed, and optionally 'attaches' to R/W
Effect is to slow them down.
That solves cycle determinism with random address, but is a little blunt in that it stalls - I think the practical repeat delays here will be ~32 (alllows room for Code+Access, like P1 does now)
That is 2x the P1, but if the SysCLK is 2x P1, maybe that is ok.

mark · 2014-05-14 18:02

Roy Eltham wrote: »

And what about doing multiple wrxxxx in a row? Now you have to do all the ugly flipflop thing many times so they can all avoid stalling? Seems bad to me...

I would think it should stall execution if you try performing a RD/WRx instruction again before the previous one had a chance to commit/read data from hub ram. Non-stalling mode would effectively have a single byte/word/long buffer.

jmg · 2014-05-14 18:04

tonyp12 wrote: »

Is hub ram now treated the same way as cog, a "byte" is 32bits and +1 points to the next long?
And then of cource hub-address-robin is long address based?
Pros is that opcode for hub access only needs 18bit for 1MB range

I think slots/R/W are always 32b, but any BYTE opcode could invisibly select the 32b,and extract/shift the BYTE from within that - so incrementing BYTE opcodes, would fetch the same 32b 4 times (16 fSys repeat) and then +1 to next 32b,for a 17 fSys repeat.
Write byte would need to RMW - did someone say Write opcodes were slower ?

tonyp12 · 2014-05-14 18:10

>but any BYTE opcode could invisibly select the 32b,and extract/shift the BYTE
If 18bits for address is pushing it for a opcode, how can there be room for 2more bits for lower address range?
So reading 4 adjacent bytes will take 64+ cycles, though the first rdbyte actually got 32bits and had all of them (if long aligned)
I say, let it be a 32bit only mcu.

jmg · 2014-05-14 18:18

tonyp12 wrote: »

>but any BYTE opcode could invisibly select the 32b,and extract/shift the BYTE
If 18bits for address is pushing it for a opcode, how can there be room for 2more bits for lower address range?
I say, let it be a 32bit only mcu.

Present opcode info says
RDBYTE D,PTRA/PTRB
so I'd guess a 32b PTRA gives your answer ?

jmg · 2014-05-14 18:22

[QUOTE=mark

mark · 2014-05-14 18:45

jmg wrote: »

That's more logic for the buffers, but I can see how that could help write interleave with code.

Yeah, it does require more logic. The other option is to overwrite the buffer, which obviously is a terrible idea. But yeah, the point is to not stall execution just because there's no good way to time random writes. Snapcnt gets close, but wastes cycles. Either way, I think it's important that we have one of these options at our disposal.

jmg wrote: »

However, Read I cannot see, because it cannot do anything until the value is valid, unless there is some flag checked for Read-Data-Register-Now-Valid, but that's looking clumsy.

non-stalled reads are only useful if you have a sort of pipelined loop going where you can do something like perform a RDx, do some work on some data that you requested earlier, then after at least 16 cycles, you get the more recent data you requested and on and on... Otherwise the option is to work with arrays limited to one page of ram, and sync to that the old fashioned way.

kwinn · 2014-05-14 18:49

Seems to me that a sort of token ring is virtually built in to this scheme. Since the hub address is [15 bit location][4 bit block] if you set up 15 sequential registers in the cog you could read/write a long from/to every other cog once for every rotation using the block read/write. All you really need is for the cogs to agree on a starting location in the ram block. The 4 bit block address would take care of reading/writing it in the correct ram block for the intended cog. So a cog to cog communication method is included with this plan.

dMajo wrote: »

Correct me if I am wrong.

Any cog sequentially or randomly accessing at addresses Adr*16+CogID(0..15) will be totally deterministic and will have 1/16 of hub bandwidth ... If all doing so nothing changes compared to traditional scheme (except hub isolates and reduce to 32K, so at least one cog should perform differently to allow data exchange).

Any cog sequentially or randomly accessing at addresses Adr*8+0..7 will also be totally deterministic with 1/8 of the hub bandwidth

To take advantage of the fast sequential bank/page/hubs reading perhaps a sort of token-ring communication between the hubs

Perhaps some inc/dec instructions/pointers can be provided that allows this kind of math

jmg · 2014-05-14 19:05

[QUOTE=mark

kwinn · 2014-05-14 19:26

Agreed. This scheme has tremendous potential to speed things up and maintain determinism.....provided the gating between cogs and ram can be done.

jmg wrote: »

Nope, determinism is not 'lost' as you seem to believe, - it depends on what your code is doing, & determinism has always depended on user code design, and what your code is doing.

Now, the factors are just a little different.

I can see many ways to code and know exactly how many cycles will be needed, and ways to gain much better than16N, when moving Data to or from the HUB.
(and I'm fairly sure that includes HUB to/from Pins, when Chip replies to my earlier questions )

There are also some simple improvements to Pointer Handling, that can work very well with this scheme, to remove additional code complexity, and give higher bandwidths to FIFO/Buffers.

Cluso99 · 2014-05-14 19:43

Chip,
Thanks for posting the Verilog code

I am amazed that Cyclone V is slower, but the LEs are different so guess that's a problem in the mix.

How does OnSemi's RAM compare to the one you and Beau did (silicon space) ?
I ask because I realised that more private cog memory could be added to each cog that would execute at full speed using a simple hubexec style
(see http://forums.parallax.com/showthread.php/155687-A-solution-to-increasing-cog-ram-works-with-any-hub-slot-method )
and wondered if there might be some more space on the die.
Alternately, it might be better to have larger cogs and smaller hub - at first I didn't like it, but it has grown on me and I am not sure of the optimal balance - maybe 256KB hub and 16*16KB cogs ???

jmg · 2014-05-14 19:44

kwinn wrote: »

Agreed. This scheme has tremendous potential to speed things up and maintain determinism.....provided the gating between cogs and ram can be done.

See Chip's update
http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip?p=1267735&viewfull=1#post1267735

kwinn · 2014-05-14 19:45

Would the gating for an equivalent barrel shifter have less heft? That is after all what it amounts to.

cgracey wrote: »

I got this new memory scheme implemented in Verilog.

It takes ~9,000 LE's, which is equivalent to ~2 cogs. It's hefty, for sure, but buys a lot of performance.

It's closing timing at ~140MHz, which is pretty decent, and on par with a cog's timing, so far. That means we can likely go 160MHz on the FPGA, but not 200MHz. The silicon will be able to reach 200MHz, though, and maybe higher.

I keep wondering if there is some way to collapse the complexity of this. It all comes down to lots of mux's, which are hard to speed up.

Here is the Verilog code that makes the memory system:

// hub mem

`include "hub_ram.v"

module hub_mem
(
input		clk,
input		ena,

output		[63:0]	slot,	// 16 sets of 4-bit slot code outputs

input		[15:0]	e,	// 16 enable inputs
input		[63:0]	w,	// 16 sets of 4 byte-write-enable inputs
input		[207:0]	a,	// 16 sets of 13 address inputs
input		[511:0]	d,	// 16 sets of 32 data inputs
output		[511:0]	q	// 16 sets of 32 data outputs
);


// slot rotator

reg [15:0][3:0] s;

always @(posedge clk or negedge ena)
if (!ena)
	s <= 64'hFEDCBA9876543210;
else
	s <= {s[0], s[15:1]};

wire [15:0][3:0] sx	= {s[14:0], s[15]};

assign slot		= s;


// memories

wire [15:0]		mem_e = e;
wire [15:0]  [3:0]	mem_w = w;
wire [15:0] [12:0]	mem_a = a;
wire [15:0] [31:0]	mem_d = d;

wire [15:0]		mem_ex;
wire [15:0]  [3:0]	mem_wx;
wire [15:0] [12:0]	mem_ax;
wire [15:0] [31:0]	mem_dx;

wire [15:0] [31:0]	mem_qx;


reg [15:0]		mem_c;
reg [15:0] [31:0]	mem_q;

reg [15:0]		mem_rd;
reg [15:0]  [3:0]	mem_qm;


genvar i;
generate
	for (i=0; i<16; i++)
	begin : hub_mem_gen

		assign mem_ex[i]	= mem_e[s[i]];		// ram perspective
		assign mem_wx[i]	= mem_w[s[i]];
		assign mem_ax[i]	= mem_a[s[i]];
		assign mem_dx[i]	= mem_d[s[i]];

		hub_ram hub_ram_(	.clk(clk),
					.e(mem_ex[i]),
					.w(mem_wx[i]),
					.a(mem_ax[i]),
					.d(mem_dx[i]),
					.q(mem_qx[i])	);

		always @(posedge clk or negedge ena)
		if (!ena)
			mem_c[i] <= 1'b0;
		else
			mem_c[i] <= mem_ex[i] && ~|mem_wx[i];

		always @(posedge clk)
		if (mem_c[i])
			mem_q[i] <= mem_qx[i];


		always @(posedge clk or negedge ena)		// cog perspective
		if (!ena)
			mem_rd[i] <= 1'b0;
		else
			mem_rd[i] <= mem_e[i] && ~|mem_w[i];

		always @(posedge clk)
		if (mem_rd[i])
			mem_qm[i] <= sx[i];

		assign q[i*32+31:i*32] = mem_q[mem_qm[i]];

	end
endgenerate

endmodule

Here is the Verilog code that makes an instance of an 8192x32 RAM w/byte-write:

// hub ram

module hub_ram
(
input			clk,
input			e,
input		  [3:0]	w,
input		 [12:0]	a,
input		 [31:0]	d,
output reg	 [31:0]	q
);


reg [7:0] ram3 [8191:0];
reg [7:0] ram2 [8191:0];
reg [7:0] ram1 [8191:0];
reg [7:0] ram0 [8191:0];


// 8192 x 32 ram with byte-write

always @(posedge clk)
if (e)
begin
	if (w[3]) ram3[a] <= d[31:24];
	if (w[2]) ram2[a] <= d[23:16];
	if (w[1]) ram1[a] <= d[15:8];
	if (w[0]) ram0[a] <= d[7:0];
	q <= {ram3[a], ram2[a], ram1[a], ram0[a]};
end

endmodule

mark · 2014-05-14 19:45

jmg wrote: »

Yes, Not pretty.

It's looking like a split opcode & more choices (I think someone else mentioned split opcodes ? ) wold be best
RDREQ - issues Read Address - (resets RDGET flags ?)
...
RDGET - Fetches RdReq result, stalls if not ready yet (must be paired with RDREQ)

While splitting the instructions like this might be "safer", a single instruction is more efficient for someone who understands the limitations and requirements of it. It hasn't been uncommon for people to try to reduce the number of instructions used in their code as much as possible in order to get it to fit in the limited cog ram, so that's why I'd hate for a read to require two instructions.

tonyp12 · 2014-05-14 19:49

>RDREQ - issues Read Address - (resets RDGET flags ?)
>RDGET - Fetches RdReq result, stalls if not ready yet (must be paired with RDREQ)

RDREQ only need the source address, its opcode could use a 18bit intermediate address
A 18bit movs type instruction is needed to self-modify it as to not having use: movs,shrl,movd
it still allows auto-inc/dec on its own address

RDGET gets 9bit for dest, but with no source there is room for other options like ROL, Mask etc.(replacing word and byte opcodes, as the options can emulate that)

I like it.

jmg · 2014-05-14 20:04

[QUOTE=mark

kwinn · 2014-05-14 20:05

You mean a RD/WR instruction that works similar to how the waitvid does? That is the RD/WR is held/buffered until access to the memory block comes around, but the cog can continue to execute the following instructions unless it is another R/W. A second R/W before the previous one is completed will then halt execution. That would be a very good thing to have.

[QUOTE=mark

kwinn · 2014-05-14 20:13

Chip does say it is possible, but I think it is better to have this be done by the R/W instruction rather than an added wait of some kind. Perhaps a "stall until completed" bit, or perhaps just always allow execution to continue unless a second R/W is encountered before the first one completes. Once the first one completes execution can continue until.........etc.

jmg wrote: »

Exactly, hence my suggestion of a SNAPCNT - I think it needs to be adjustable, for granularity, and 16 does not quite guarantee completion.

You also want a slight variant of WAITCNT, as the idea is to SNAP to some known rate, which is not quite what std WAITCNT does.

It is a little hard to encode into R/W as something needs to set & start this rate timer.
I think SNAPCNT can have 9 immediate bits for opcode support, so perhaps upper 2 bits can 'attach' to each of R,W, giving 1..128, and values < 16 give more control choices.

mark · 2014-05-14 20:17

kwinn wrote: »

You mean a RD/WR instruction that works similar to how the waitvid does? That is the RD/WR is held/buffered until access to the memory block comes around, but the cog can continue to execute the following instructions unless it is another R/W. A second R/W before the previous one is completed will then halt execution. That would be a very good thing to have.

Exactly.

kwinn · 2014-05-14 20:23

It would make contiguous R/W to a single memory block fast and simple. The first R/W is buffered, the cog continues execution and if a second R/W comes before the first one is completed execution is blocked. Sort of an automatic wait for a memory block.

Roy Eltham wrote: »

And what about doing multiple wrxxxx in a row? Now you have to do all the ugly flipflop thing many times so they can all avoid stalling? Seems bad to me...

kwinn · 2014-05-14 20:29

The wait for the read to be completed could be done executing useful instructions and/or a wait. It would be up to the programmer to make sure that there has been enough time for the data to be read. That's why they get the big bucks.

jmg wrote: »

That's more logic for the buffers, but I can see how that could help write interleave with code.

However, Read I cannot see, because it cannot do anything until the value is valid, unless there is some flag checked for Read-Data-Register-Now-Valid, but that's looking clumsy.

kwinn · 2014-05-14 20:32

Saw that. Good news indeed. Hope nothing pops up to change that.

jmg wrote: »

See Chip's update
http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip?p=1267735&viewfull=1#post1267735

kwinn · 2014-05-14 20:35

Yep, that would be great for writing data to hub in a short loop.

[QUOTE=mark

RossH · 2014-05-14 20:35

kwinn wrote: »

That's why they get the big bucks.

Ha! This explains why you guys insist on coming up with such complicated proposals - it's all a massive conspiracy to raise programmer's pay rates!

jmg · 2014-05-14 20:39

kwinn wrote: »

Chip does say it is possible, but I think it is better to have this be done by the R/W instruction rather than an added wait of some kind. Perhaps a "stall until completed" bit, or perhaps just always allow execution to continue unless a second R/W is encountered before the first one completes. Once the first one completes execution can continue until.........etc.

See my expanded description of SNAPCNT here
http://forums.parallax.com/showthread.php/155692-Nibble-Carry-Higher-speed-Buffers-FIFOs-using-new-HUB-Rotate

It does pretty much what you say - ie has a bit to attach the wait to WR or RD, so those opcodes apply the snap, but you need to define the delay value, so that is why a separate opcode (or config register) is suggested.

The SNAPCNT is quite useful outside of HUB loops, which is another reason to have an opcode for it.

jmg · 2014-05-14 20:42

kwinn wrote: »

The wait for the read to be completed could be done executing useful instructions and/or a wait. It would be up to the programmer to make sure that there has been enough time for the data to be read. That's why they get the big bucks.

Only the 'gain' you think this (expensive & risky) cycle counting is buying, is not real, as when the data does arrive, some cycles need to be stolen to move into the dest register in COG memory.
Better to use two opcodes, as that is very clear what is happening, when, and is safely edit-proof.

mark · 2014-05-14 20:48

kwinn wrote: »

Yep, that would be great for writing data to hub in a short loop.

It might not seem obvious at first, but it's also handy for when the loop is longer than 16 clocks. You have a deterministic loop w/hub write/read that takes 20 clocks? No problem!

New Hub Scheme For Next Chip

Comments