New Hub Scheme For Next Chip

tonyp12 · 2014-05-14 14:21

>experiment which eventually proves theory.

If Chip says the 4-bit column banking runs at 200 MHz across addresses $xxxx0 to $xxxxF, I take his word for it. If experimenting shows otherwise, then there is a hardware bug and a much bigger problem.

Say my cog, right at this moment, can only access hub addresses that end in $xxxxA, but I want to read address $8207.
I will have to wait until the $xxxx7 railcar shows up at my station.
Not much to prove as mathematically it's all there, as we know the robin speed and we know how many clocks a cog instruction takes.
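The railcar arithmetic above can be sketched in a few lines of Python. This is only a toy model, not Chip's implementation: it assumes 16 banks selected by the low address nibble and a window that moves up by one bank per clock. The name `clocks_until_bank` is made up for the illustration.

```python
BANKS = 16  # assumption: one hub RAM bank per value of the low address nibble


def clocks_until_bank(current_bank: int, target_bank: int) -> int:
    """Clocks to wait until target_bank rotates into this cog's window.

    Assumes the window advances by one bank per clock, in increasing order.
    """
    return (target_bank - current_bank) % BANKS


# The example from the post: the window is at nibble $A, the target is $8207.
wait = clocks_until_bank(0xA, 0x8207 & 0xF)
print(wait)  # 13
```

Under these assumptions the wait is never more than 15 clocks, and it is 0 when the wanted bank is already at the station.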

>Why would it not be linear ?. It needs to be predictable, and easy to visualize.
Maybe a common code routine will benefit from having two neighboring longs show up 2 clocks apart?
Or you just have to interleave your code (or hub variables) so the windows line up well.

By the way, how many clocks does a RDLONG take if there is a 0-clock stall? Is it like the P1, double the normal cog instruction, so 4 clocks for the P2?

And is hub access only for longs now? If so, the round-robin part of a hub byte address would be %xxxxx111100,
as the two lower bits are always zero (if you could refer to hub as byte addresses at all).
If the robin is for bytes, then RDLONG can only access at slots 0, 4, 8, C of the robin?
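To make the question concrete, here is a hedged Python sketch of the byte-address split that the %xxxxx111100 mask implies. The layout is an assumption taken from the question itself (2 bits of byte lane, 4 bits of bank, the rest a row inside the bank), and `split_byte_address` is a hypothetical helper, not anything from the chip.

```python
BANK_MASK = 0b111100  # the %xxxxx111100 bits from the question above


def split_byte_address(byte_addr: int):
    """Split a hub byte address into (row, bank, byte_lane).

    Assumed layout: bits [1:0] pick the byte within a long,
    bits [5:2] pick the bank, and the remaining upper bits pick the row.
    """
    byte_lane = byte_addr & 0x3
    bank = (byte_addr & BANK_MASK) >> 2
    row = byte_addr >> 6
    return row, bank, byte_lane


print(split_byte_address(0x8207))  # (520, 1, 3)
```

If this layout holds, byte address $8207 sits in bank 1, not bank 7, so which interpretation is right changes the wait considerably.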

jmg · 2014-05-14 14:28

Roy Eltham wrote: »

I look at compiler generated assembly code all the time. It's not nearly as bad as you are making it out to be. In a typical stream of 16 instructions you'll find on average 2-3 branches, and MOST of the time those branches are skipping over 1 or 2 instructions if they are taken. So you often end up executing more than half of those 16 instructions, quite often 2/3rds+.

I'd agree.
Compilers can even decide to copy different-size chunks as software optimisation improves, and the tools can report when across-block reloads are needed.
That will mean the most important inner loops can avoid reloads.

Do you know if relative addressing is still in the opcodes? That would make such code-moving a lot easier.

A relative Skip/Short jump, and this type of far-jump-avoid, will also help with Code designed to XIP from QuadSPI memory.
The problems are very similar, where a big jump has a higher relative cost, than a very small forward one.
Of course, QuadSPI has very large sizes, for low cost.

cgracey · 2014-05-14 14:40

I got this new memory scheme implemented in Verilog.

It takes ~9,000 LE's, which is equivalent to ~2 cogs. It's hefty, for sure, but buys a lot of performance.

It's closing timing at ~140MHz, which is pretty decent, and on par with a cog's timing, so far. That means we can likely go 160MHz on the FPGA, but not 200MHz. The silicon will be able to reach 200MHz, though, and maybe higher.

I keep wondering if there is some way to collapse the complexity of this. It all comes down to lots of mux's, which are hard to speed up.

Here is the Verilog code that makes the memory system:

// hub mem

`include "hub_ram.v"

module hub_mem
(
input		clk,
input		ena,

output		[63:0]	slot,	// 16 sets of 4-bit slot code outputs

input		[15:0]	e,	// 16 enable inputs
input		[63:0]	w,	// 16 sets of 4 byte-write-enable inputs
input		[207:0]	a,	// 16 sets of 13 address inputs
input		[511:0]	d,	// 16 sets of 32 data inputs
output		[511:0]	q	// 16 sets of 32 data outputs
);


// slot rotator

reg [15:0][3:0] s;

always @(posedge clk or negedge ena)
if (!ena)
	s <= 64'hFEDCBA9876543210;
else
	s <= {s[0], s[15:1]};

wire [15:0][3:0] sx	= {s[14:0], s[15]};

assign slot		= s;


// memories

wire [15:0]		mem_e = e;
wire [15:0]  [3:0]	mem_w = w;
wire [15:0] [12:0]	mem_a = a;
wire [15:0] [31:0]	mem_d = d;

wire [15:0]		mem_ex;
wire [15:0]  [3:0]	mem_wx;
wire [15:0] [12:0]	mem_ax;
wire [15:0] [31:0]	mem_dx;

wire [15:0] [31:0]	mem_qx;


reg [15:0]		mem_c;
reg [15:0] [31:0]	mem_q;

reg [15:0]		mem_rd;
reg [15:0]  [3:0]	mem_qm;


genvar i;
generate
	for (i=0; i<16; i++)
	begin : hub_mem_gen

		assign mem_ex[i]	= mem_e[s[i]];		// ram perspective
		assign mem_wx[i]	= mem_w[s[i]];
		assign mem_ax[i]	= mem_a[s[i]];
		assign mem_dx[i]	= mem_d[s[i]];

		hub_ram hub_ram_(	.clk(clk),
					.e(mem_ex[i]),
					.w(mem_wx[i]),
					.a(mem_ax[i]),
					.d(mem_dx[i]),
					.q(mem_qx[i])	);

		always @(posedge clk or negedge ena)
		if (!ena)
			mem_c[i] <= 1'b0;
		else
			mem_c[i] <= mem_ex[i] && ~|mem_wx[i];

		always @(posedge clk)
		if (mem_c[i])
			mem_q[i] <= mem_qx[i];


		always @(posedge clk or negedge ena)		// cog perspective
		if (!ena)
			mem_rd[i] <= 1'b0;
		else
			mem_rd[i] <= mem_e[i] && ~|mem_w[i];

		always @(posedge clk)
		if (mem_rd[i])
			mem_qm[i] <= sx[i];

		assign q[i*32+31:i*32] = mem_q[mem_qm[i]];

	end
endgenerate

endmodule

Here is the Verilog code that makes an instance of an 8192x32 RAM w/byte-write:

// hub ram

module hub_ram
(
input			clk,
input			e,
input		  [3:0]	w,
input		 [12:0]	a,
input		 [31:0]	d,
output reg	 [31:0]	q
);


reg [7:0] ram3 [8191:0];
reg [7:0] ram2 [8191:0];
reg [7:0] ram1 [8191:0];
reg [7:0] ram0 [8191:0];


// 8192 x 32 ram with byte-write

always @(posedge clk)
if (e)
begin
	if (w[3]) ram3[a] <= d[31:24];
	if (w[2]) ram2[a] <= d[23:16];
	if (w[1]) ram1[a] <= d[15:8];
	if (w[0]) ram0[a] <= d[7:0];
	q <= {ram3[a], ram2[a], ram1[a], ram0[a]};
end

endmodule
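For anyone who wants to poke at the RAM behaviour without an FPGA, here is a rough Python model of one hub_ram instance. It is only a sketch of the simulation semantics of the Verilog above (non-blocking assignments, so a read in the same clock as a write returns the old contents). What a given FPGA block RAM does on read-during-write may differ. The class name `HubRamModel` is made up.

```python
class HubRamModel:
    """Behavioral sketch of an 8192 x 32 RAM with four byte-write enables."""

    DEPTH = 8192

    def __init__(self):
        # lanes[0] models ram0 (bits 7:0) ... lanes[3] models ram3 (bits 31:24)
        self.lanes = [[0] * self.DEPTH for _ in range(4)]
        self.q = 0

    def clock(self, e: int, w: int, a: int, d: int) -> int:
        """Apply one rising clock edge and return the registered output q."""
        if not e:
            return self.q  # not enabled: q holds its previous value
        # Sample the old contents first, as non-blocking assignment would.
        old = 0
        for lane in range(4):
            old |= self.lanes[lane][a] << (8 * lane)
        # Then commit any enabled byte lanes.
        for lane in range(4):
            if (w >> lane) & 1:
                self.lanes[lane][a] = (d >> (8 * lane)) & 0xFF
        self.q = old
        return self.q


ram = HubRamModel()
ram.clock(1, 0b1111, 5, 0x11223344)    # write a full long
ram.clock(1, 0b0001, 5, 0xAABBCCDD)    # overwrite only the low byte
print(hex(ram.clock(1, 0b0000, 5, 0)))  # 0x112233dd
```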

jazzed · 2014-05-14 14:40

tonyp12 wrote: »

>experiment which eventually proves theory.

If Chip says the 4-bit column banking runs at 200 MHz across addresses $xxxx0 to $xxxxF, I take his word for it. If experimenting shows otherwise, then there is a hardware bug and a much bigger problem.

That's exactly why we need a theory of operation, which to date is incomplete. It will remain incomplete at least until Chip provides a very good description of how things work, and a bit file that implements his design.

Once we have a bit-file we can experiment all day and "confirm" or prove that what should happen does happen.

Roy Eltham · 2014-05-14 14:52

Nice Chip! Thanks for sharing the code!

jazzed · 2014-05-14 14:55

potatohead wrote: »

Actually, experiment confirms theory, until it doesn't, and then we find we need new theory.

Working through that, we get understanding. That is what science brings us on a basic level. Understanding.

Yup. Rather than confirm or prove, "provides support" is a more accurate wording.

Chip, that looks like progress. Thanks.

potatohead · 2014-05-14 15:07

[Embedded YouTube video: https://youtu.be/EYPapE-3FRw]

Of course, it's very hard to get more clear on this than Feynman did.

jmg · 2014-05-14 15:42

cgracey wrote: »

I got this new memory scheme implemented in Verilog.

It's closing timing at ~140MHz, which is pretty decent, and on par with a cog's timing, so far. That means we can likely go 160MHz on the FPGA, but not 200MHz. The silicon will be able to reach 200MHz, though, and maybe higher.

Have you pointed the tools at a Cyclone V part number? I'm interested in how Cyclone V compares with Cyclone IV.

mark · 2014-05-14 15:47

@Chip

Is there a possibility of having a mode where WRxxxx (except for block) and possibly RDxxxx not stall execution?

jmg · 2014-05-14 15:48

tonyp12 wrote: »

Say my cog, right at this moment, can only access hub addresses that end in $xxxxA, but I want to read address $8207.
I will have to wait until the $xxxx7 railcar shows up at my station.
Not much to prove as mathematically it's all there, as we know the robin speed and we know how many clocks a cog instruction takes.

>Why would it not be linear ?. It needs to be predictable, and easy to visualize.
Maybe a common code routine will benefit from having two neighboring longs show up 2 clocks apart?
Or you just have to interleave your code (or hub variables) so the windows line up well.

See my new thread on how to use the spoke address variance to give higher-bandwidth transfers for FIFOs/buffers.

It is not suited to code or true random-access arrays, but fast buffers/transfers tend to be a very common problem.

http://forums.parallax.com/showthread.php/155692-Nibble-Carry-Higher-speed-Buffers-FIFOs-using-new-HUB-Rotate?p=1267745&viewfull=1#post1267745

jmg · 2014-05-14 15:51

[QUOTE=mark

Heater. · 2014-05-14 15:52

potatohead,

Of course, it's very hard to get more clear on this than Feynman did.

We should not confuse the wants and needs of MCU/Propeller users with the laws of physics.

Interestingly, Feynman did design stuff for the "Thinking Machines" massively parallel computer that they did not actually use because it was thought too hard to make at the time. I wonder what it was?

potatohead · 2014-05-14 15:57

Totally.

There is an interview with Danny whatever his name is, which hints at some of those things. No details I've ever found online.

Now, I linked that because the basic method of science isn't really getting at proof, just better understanding and it's just a nit that I like to pick sometimes.

Wants / needs? And there is the "what is worth what?" discussion, ongoing, but on its way to some degree of consensus. Or, maybe it's just improving. Who knows?

The next FPGA is going to be a lot of fun!

mark · 2014-05-14 16:16

jmg wrote: »

I think this is the same problem as buffering a UART - works faster for the first char(s), but at some stage you always have to slow down when the buffer fills to the real hand-over rate.

There should be no buffer, so perhaps it should stall execution at the instruction if it's used again before the data had been committed to memory.

Post #181 gives my reasoning for why not stalling execution is desirable

"
I didn't see the implication of ctwardell's suggestion that hub writes (except for block writes, obviously) shouldn't stall execution at first, but I think there's a VERY good case for it not to given the partitioning of hub ram.

Consider the example again of a tight PASM loop which incorporates a WRBYTE

WRBYTE
INST 1
....
INST 6
loop

As it stands with this new hub scheme, the timing cannot be deterministic if the hub addresses used in the WRBYTE instruction are not limited to one page of memory. However, if WRXXXX instructions don't stall execution, then it can be! There will be jitter on the actual data write in hub ram, but I think this is less of a concern.

........

RDx is a bit more tricky, but we can still maintain loop timing determinism *IF* RDx doesn't stall execution AND we pipeline our loop so that we send a hub request for that data 16 clocks before we intend to use it. "
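A toy Python model of the argument quoted above, for anyone who wants to see the numbers. Everything here is an assumption made for illustration: 16 banks, a window that advances one bank per clock, a stall equal to the distance to the target bank, and a hypothetical `loop_clocks` parameter for the loop body. It is not a description of the actual hardware.

```python
BANKS = 16  # assumed number of hub RAM banks


def window_bank(t: int, cog: int = 0) -> int:
    """Bank visible to the cog at clock t (assumed to advance by 1 per clock)."""
    return (cog + t) % BANKS


def run_loop(banks, loop_clocks: int, stalling: bool):
    """Return (iteration start times, write commit times) for one loop run.

    banks lists the target bank of the WRBYTE in each iteration.
    """
    t = 0
    starts, commits = [], []
    for b in banks:
        starts.append(t)
        wait = (b - window_bank(t)) % BANKS
        commits.append(t + wait)
        # A stalling write delays the whole loop; a non-stalling one does not.
        t += loop_clocks + (wait if stalling else 0)
    return starts, commits


targets = [3, 9, 0, 14]
print(run_loop(targets, 16, stalling=True))   # ([0, 19, 41, 64], [3, 25, 48, 78])
print(run_loop(targets, 16, stalling=False))  # ([0, 16, 32, 48], [3, 25, 32, 62])
```

In this model the stalling loop starts its iterations at irregular times, while the non-stalling loop keeps a fixed 16-clock period and only the commit time jitters. Each commit still lands before the next iteration starts, which is the case the quoted post cares about.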

edit: can't figure out why the above quote keeps getting split. I edit it, but it keeps coming back for some strange reason.

jmg · 2014-05-14 16:35

[QUOTE=mark

jazzed · 2014-05-14 16:44

potatohead wrote: »

Now, I linked that because the basic method of science isn't really getting at proof, just better understanding and it's just a nit that I like to pick sometimes.

Must be one of them thar liberal-arts-science definitions ;-) Just joking of course. It's all part of the journey of discovery.

In the context of engineering we use the body of substantiated theory (that which we think we understand) to design things. We verify something works as expected (empirical observations, i.e. experience) within reason by knowing precisely how it should work (theory).

Yes, this FPGA bit-file should be fun .... Go Chip go!

Lawson · 2014-05-14 16:48

[QUOTE=mark

potatohead · 2014-05-14 17:11

jazzed wrote: »

Must be one of them thar liberal-arts-science definitions ;-) Just joking of course. It's all part of the journey of discovery.

In the context of engineering we use the body of substantiated theory (that which we think we understand) to design things. We verify something works as expected (empirical observations, i.e. experience) within reason by knowing precisely how it should work (theory).

Yes, this FPGA bit-file should be fun .... Go Chip go!

Indeed!

I would insert analysis, both classical and computer simulation to better understand dynamics prior to fabrication or programming in some cases, to reduce the empirical costs and variances and better understand failure conditions, modes, boundaries, etc...

"Prove it out" makes complete sense in engineering. Eliminating risk is primary, and there we see the science expanding what engineering has to work with, which is why we seriously have to respect the science, lest our engineers get stuck with a broken tool. --a risk unaccounted for, despite competent and otherwise inclusive engineering.

Which is why I pick that nit.

mark · 2014-05-14 17:13

jmg wrote: »

As you say, that only partly solves only Writes. So if someone has code where they have no idea or control over the Address LSB and they need it to be cycle-deterministic, then I think some other form of Granular-snap control is going to be needed.

It fully "solves" random writes to *any* address in memory from the loop's standpoint when said loop performs a WRxxxx every 16 clocks, which is not the case if WRx stalls execution as it currently does. If you try performing two WRxxxx instructions with less than 16 clocks between them, then sure, it might stall execution depending on whether or not the prior data was able to be committed to memory during that time. The point of this is to maintain deterministic timing of a loop which incorporates a WRx instruction every 16 clocks or more.

jmg · 2014-05-14 17:20

[QUOTE=mark

mark · 2014-05-14 17:25

Lawson wrote: »

Ah, so you're proposing breaking RDxxxx into two instructions? The first instruction tells the hub what address to read, and where the cog will store the value. The second instruction stalls if the hub hasn't executed the read yet, then stores the value. You could eliminate the finish instruction if the RDxxxxD instructions just always updated the cog memory location 16 clocks after the RDxxxxD instruction. Add a queue so that several RDxxxxD instructions could be in process at once, and Hub reads no-longer add execution jitter. (assuming that there isn't a low address nibble collision in those several RDxxxxD instructions) This would be similar to what the Mill processor is proposed to do. (and might require an exchange of license fees)

Marty

I wasn't really thinking of breaking it into two instructions, but rather have the ability to select between stalling and non-stalling RD/WRxxxx "modes". In stalling mode, RDx would function just as it always has on the original Propeller. In the non-stalling mode, you would perform an RDx, continue doing something else, and then after 16 clocks, you could grab the data from the destination register and do whatever you want with it. Keep in mind, the data might actually arrive in less than 16 clocks, but there's no way of knowing for sure without computing it based on the address of the data and the hub slot at the time of the request. As with WRx, this is just to be able to maintain a deterministic loop cycle (without unnecessary overhead) with an RDx instruction to addresses in a random memory bank every 16 sys clocks or more. Of course, it's not ideal as you're making data requests a while before you'll be able to do anything with it, but it's better than nothing.

Rayman · 2014-05-14 17:28

I think this concept is finally starting to sink through my skull (for better or worse...)

For deterministic timing, couldn't you just do a waitcnt after the read or write with the worst case clocks (16 plus one or two extra maybe)?

Maybe there could be a flag or special version of rd/wr that does this for you automatically?
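The padding idea can be sketched like this in Python. It is a guess at the arithmetic only: the worst case of 16 clocks and the margin of 2 are taken from the "16 plus one or two extra" wording above, and `pad_to_worst_case` is a made-up name, not a real instruction or library call.

```python
def pad_to_worst_case(stall_clocks: int, body_clocks: int,
                      worst: int = 16, margin: int = 2):
    """Return (fixed iteration budget, idle clocks to burn in the wait).

    The budget assumes the hub access never stalls longer than `worst` clocks.
    """
    if stall_clocks > worst:
        raise ValueError("stall exceeds the assumed worst case")
    budget = worst + margin + body_clocks
    idle = budget - (stall_clocks + body_clocks)
    return budget, idle


print(pad_to_worst_case(3, 14))   # (32, 15)
print(pad_to_worst_case(15, 14))  # (32, 3)
```

The iteration always takes the same budget, at the cost of idling through whatever part of the worst case was not needed.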

cgracey · 2014-05-14 17:31

jmg wrote: »

Have you pointed the tools at a Cyclone V part number? I'm interested in how Cyclone V compares with Cyclone IV.

The Cyclone IV compiles at 132.71 MHz, while the Cyclone V compiles at 126.63 MHz. These can change up and down with different seed values, but it seems to me that Cyclone V is usually a bit slower than Cyclone IV.

mark · 2014-05-14 17:32

jmg wrote: »

Sure, but you also want something that works for Reads too, which is the point behind my snapcnt suggestion.
I'm also not sure about 16 as a hard target number, as an INC by 1 of the address will give a spoke repeat of 17, so 16 would not work there.

I addressed the RDx issue. Granted, it's by no means perfect. snapcnt would work, but you'll be wasting potentially valuable cycles if I understand it correctly.

As far as I can tell, data will be committed to hub ram no more than 16 sysclocks after a WRx instruction is executed no matter what.
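The "spoke repeat of 17" remark quoted above can be checked with a small Python model. The assumptions are the usual ones for this thread and are not confirmed: 16 banks, the bank in the cog's window at clock t is t mod 16, and a hypothetical `min_gap` standing in for the clocks the cog needs between two hub accesses.

```python
BANKS = 16  # assumed number of banks; window at clock t is assumed to be t % BANKS


def access_gaps(bank_step: int, min_gap: int, count: int = 4):
    """Clocks between successive hub accesses when the bank moves by bank_step.

    Each access happens at the earliest clock that is at least min_gap after
    the previous one and at which the wanted bank is in the window.
    """
    t, bank = 0, 0
    gaps = []
    for _ in range(count):
        bank = (bank + bank_step) % BANKS
        earliest = t + min_gap
        nxt = earliest + (bank - earliest) % BANKS
        gaps.append(nxt - t)
        t = nxt
    return gaps


print(access_gaps(bank_step=0, min_gap=2))  # [16, 16, 16, 16]
print(access_gaps(bank_step=1, min_gap=2))  # [17, 17, 17, 17]
print(access_gaps(bank_step=2, min_gap=2))  # [2, 2, 2, 2]
```

In this model a fixed address repeats every 16 clocks, an address that steps by one bank repeats every 17, and stepping by two banks with a 2-clock gap never waits at all.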

cgracey · 2014-05-14 17:32

[QUOTE=mark

jmg · 2014-05-14 17:37

Rayman wrote: »

For deterministic timing, couldn't you just do a waitcnt after the read or write with the worst case clocks (16 plus one or two extra maybe)?

Exactly, hence my suggestion of a SNAPCNT - I think it needs to be adjustable, for granularity, and 16 does not quite guarantee completion.

You also want a slight variant of WAITCNT, as the idea is to SNAP to some known rate, which is not quite what std WAITCNT does.

It is a little hard to encode into R/W as something needs to set & start this rate timer.
I think SNAPCNT can have 9 immediate bits for opcode support, so perhaps upper 2 bits can 'attach' to each of R,W, giving 1..128, and values < 16 give more control choices.

jmg · 2014-05-14 17:41

cgracey wrote: »

The Cyclone IV compiles at 132.71 MHz, while the Cyclone V compiles at 126.63 MHz. These can change up and down with different seed values, but it seems to me that Cyclone V is usually a bit slower than Cyclone IV.

Wow, really? Is that at the same speed grades? Maybe the early-release Cyclone V speed files are conservative?

cgracey · 2014-05-14 17:42

It could probably be done with a bunch of flipflops, but it would be ugly.

jmg · 2014-05-14 17:46

cgracey wrote: »

It could probably be done with a bunch of flipflops, but it would be ugly.

I think that's a reply to Roy's comment (now deleted) about this question?
Is there a possibility of having a mode where WRxxxx (except for block) and possibly RDxxxx not stall execution?

cgracey · 2014-05-14 17:48

jmg wrote: »

I think that's a reply to Roy's comment (now deleted) about this question?
Is there a possibility of having a mode where WRxxxx (except for block) and possibly RDxxxx not stall execution?

I suppose so, but it would have the UART-like problems that you mentioned. It would be an ugly exception to how the cog wants to work.
