
New Hub Scheme For Next Chip


Comments

  • Roy Eltham Posts: 3,000
    edited 2014-05-14 17:52
    And what about doing multiple wrxxxx in a row? Now you have to do all the ugly flipflop thing many times so they can all avoid stalling? Seems bad to me...
  • jmg Posts: 15,175
    edited 2014-05-14 17:54
    cgracey wrote: »
    I suppose so, but it would have the UART-like problems that you mentioned. It would be an ugly exception to how the cog wants to work.

    Is there opcode room for a SNAPCNT?

    e.g. maybe 9 immediate bits fit:
    the upper 2 bits can 'attach' to each of R,W hub access, which saves needing an extra code line
    the lower 7 give a choice of 1..128 delays
    values >= 16 change the snap rate and reset the snap timer; 'impractical' values < 16 give more control choices, like Enable/Reset/Run-to-Snap
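    Purely as a sketch of the idea (module and signal names here are invented for illustration; nothing is confirmed for the chip), the decode of that 9-bit immediate might look like:

    // hypothetical decode of a 9-bit SNAPCNT immediate (sketch only)

    module snapcnt_decode
    (
    input        [8:0]  imm,         // 9-bit immediate from the opcode
    output              attach_rd,   // bit 8: also apply the snap to RDxxxx
    output              attach_wr,   // bit 7: also apply the snap to WRxxxx
    output       [6:0]  snap_rate,   // bits 6:0: snap period choice, 1..128
    output              set_rate     // values >= 16 (re)load the snap timer
    );

    assign attach_rd = imm[8];
    assign attach_wr = imm[7];
    assign snap_rate = imm[6:0];
    assign set_rate  = imm[6:0] >= 7'd16;   // values < 16 left for Enable/Reset/Run-to-Snap controls

    endmodule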
  • tonyp12 Posts: 1,951
    edited 2014-05-14 18:00
    Is hub ram now treated the same way as cog ram, i.e. a "byte" is 32 bits and +1 points to the next long?
    And then of course the hub address round-robin is long-address based?
    A pro is that the opcode for hub access only needs 18 bits for a 1MB range.
  • jmg Posts: 15,175
    edited 2014-05-14 18:00
    Roy Eltham wrote: »
    And what about doing multiple wrxxxx in a row? Now you have to do all the ugly flipflop thing many times so they can all avoid stalling? Seems bad to me...

    Yes, I think buffering is like the UART I mentioned; as you say, at some stage you DO have to stall.
    Hence the alternative of SNAPCNT - no extra buffers needed, and it optionally 'attaches' to R/W.
    The effect is to slow them down.
    That solves cycle determinism with random addresses, but is a little blunt in that it stalls - I think the practical repeat delays here will be ~32 (allows room for code + access, like the P1 does now).
    That is 2x the P1, but if the SysCLK is 2x the P1, maybe that is ok.
  • mark Posts: 252
    edited 2014-05-14 18:02
    Roy Eltham wrote: »
    And what about doing multiple wrxxxx in a row? Now you have to do all the ugly flipflop thing many times so they can all avoid stalling? Seems bad to me...

    I would think it should stall execution if you try performing a RD/WRx instruction again before the previous one had a chance to commit/read data from hub ram. Non-stalling mode would effectively have a single byte/word/long buffer.
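    For what it's worth, a one-deep buffer like that might look something like this, with a stall raised only when a second hub access arrives before the first has committed (all names invented for illustration; this is not Chip's design):

    // sketch of a one-deep posted-write buffer with stall-on-second-access

    module wr_buffer
    (
    input             clk,
    input             ena,
    input             hub_slot,    // high when this cog's hub slot comes around
    input             wr_issue,    // cog executes a WRxxxx this cycle
    input      [12:0] wr_addr,
    input      [31:0] wr_data,
    output reg        pending,     // a write is buffered, waiting for its slot
    output reg [12:0] buf_addr,
    output reg [31:0] buf_data,
    output            stall        // second hub op before the first commits
    );

    assign stall = wr_issue && pending && !hub_slot;

    always @(posedge clk or negedge ena)
    if (!ena)
        pending <= 1'b0;
    else if (wr_issue && !stall)
        pending <= 1'b1;           // accept the new write
    else if (hub_slot)
        pending <= 1'b0;           // buffered write committed to hub ram

    always @(posedge clk)
    if (wr_issue && !stall)
    begin
        buf_addr <= wr_addr;
        buf_data <= wr_data;
    end

    endmodule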
  • jmg Posts: 15,175
    edited 2014-05-14 18:04
    tonyp12 wrote: »
    Is hub ram now treated the same way as cog, a "byte" is 32bits and +1 points to the next long?
    And then of cource hub-address-robin is long address based?
    Pros is that opcode for hub access only needs 18bit for 1MB range

    I think slots/R/W are always 32b, but any BYTE opcode could invisibly select the 32b and extract/shift the BYTE from within it - so incrementing BYTE opcodes would fetch the same 32b 4 times (16 fSys repeat) and then +1 to the next 32b, for a 17 fSys repeat.
    A byte write would need a RMW - did someone say write opcodes were slower?
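    Just to show the extract/merge part (signal names invented; and since Chip's hub_ram later on this page has per-byte write enables, a plain byte write probably would not need the read-modify-write at all):

    // sketch: pulling a byte out of a 32-bit hub long, and the merge a
    // read-modify-write byte write would need

    module byte_lane
    (
    input      [31:0] hub_long,   // long fetched from the hub
    input       [1:0] byte_sel,   // low two address bits
    input       [7:0] wr_byte,    // byte to be written
    output      [7:0] rd_byte,    // extracted byte (rdbyte result)
    output     [31:0] wr_long     // long to write back for a RMW byte write
    );

    wire [4:0] shift = {byte_sel, 3'b000};   // byte_sel * 8

    assign rd_byte = hub_long >> shift;

    assign wr_long = (hub_long & ~(32'hFF << shift)) | ({24'h0, wr_byte} << shift);

    endmodule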
  • tonyp12 Posts: 1,951
    edited 2014-05-14 18:10
    >but any BYTE opcode could invisibly select the 32b and extract/shift the BYTE
    If 18 bits for the address is pushing it for an opcode, how can there be room for 2 more bits for the lower address range?
    So reading 4 adjacent bytes will take 64+ cycles, though the first rdbyte actually got 32 bits and had all of them (if long aligned).
    I say, let it be a 32-bit-only MCU.
  • jmg Posts: 15,175
    edited 2014-05-14 18:18
    tonyp12 wrote: »
    >but any BYTE opcode could invisibly select the 32b and extract/shift the BYTE
    If 18 bits for the address is pushing it for an opcode, how can there be room for 2 more bits for the lower address range?
    I say, let it be a 32-bit-only MCU.

    The present opcode info says
    RDBYTE D,PTRA/PTRB
    so I'd guess a 32b PTRA gives your answer?
  • jmg Posts: 15,175
    edited 2014-05-14 18:22
    mark wrote: »
  • mark Posts: 252
    edited 2014-05-14 18:45
    jmg wrote: »
    That's more logic for the buffers, but I can see how that could help write interleave with code.
    Yeah, it does require more logic. The other option is to overwrite the buffer, which obviously is a terrible idea. But yeah, the point is to not stall execution just because there's no good way to time random writes. Snapcnt gets close, but wastes cycles. Either way, I think it's important that we have one of these options at our disposal.

    jmg wrote: »
    However, Read I cannot see, because it cannot do anything until the value is valid, unless there is some flag checked for Read-Data-Register-Now-Valid, but that's looking clumsy.
    Non-stalled reads are only useful if you have a sort of pipelined loop going, where you can do something like perform a RDx, do some work on some data that you requested earlier, and then after at least 16 cycles you get the more recent data you requested, and so on... Otherwise the option is to work with arrays limited to one page of ram, and sync to that the old-fashioned way.
  • kwinn Posts: 8,697
    edited 2014-05-14 18:49
    Seems to me that a sort of token ring is virtually built into this scheme. Since the hub address is [15-bit location][4-bit block], if you set up 15 sequential registers in the cog you could read/write a long from/to every other cog once for every rotation using the block read/write. All you really need is for the cogs to agree on a starting location in the ram block. The 4-bit block address would take care of reading/writing it in the correct ram block for the intended cog. So a cog-to-cog communication method is included with this plan.
    dMajo wrote: »
    Correct me if I am wrong.

    Any cog sequentially or randomly accessing addresses Adr*16+CogID(0..15) will be totally deterministic and will have 1/16 of the hub bandwidth ... If all cogs do so, nothing changes compared to the traditional scheme (except each cog's hub view isolates and reduces to 32K, so at least one cog should behave differently to allow data exchange).

    Any cog sequentially or randomly accessing addresses Adr*8+0..7 will also be totally deterministic, with 1/8 of the hub bandwidth.

    To take advantage of the fast sequential bank/page/hub reading, perhaps a sort of token-ring communication between the hubs is needed.

    Perhaps some inc/dec instructions/pointers can be provided that allow this kind of math.
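    As a small illustration of that address split (names invented, and assuming the [15-bit location][4-bit block] layout above): stepping the block field walks through the 16 RAM blocks while the location stays put, which is all the "token ring" needs.

    module hub_addr_split
    (
    input      [14:0] location,   // offset within one 32KB block
    input       [3:0] block,      // which of the 16 RAM blocks (slot phase / cog ID)
    output     [18:0] hub_addr    // full hub address
    );

    assign hub_addr = {location, block};   // same value as location*16 + block

    endmodule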
  • jmg Posts: 15,175
    edited 2014-05-14 19:05
    mark wrote: »
  • kwinn Posts: 8,697
    edited 2014-05-14 19:26
    Agreed. This scheme has tremendous potential to speed things up and maintain determinism.....provided the gating between cogs and ram can be done.
    jmg wrote: »
    Nope, determinism is not 'lost' as you seem to believe - it depends on what your code is doing, and determinism has always depended on user code design.

    Now, the factors are just a little different.

    I can see many ways to code and know exactly how many cycles will be needed, and ways to gain much better than 16N, when moving data to or from the HUB.
    (and I'm fairly sure that includes HUB to/from Pins, when Chip replies to my earlier questions )

    There are also some simple improvements to Pointer Handling, that can work very well with this scheme, to remove additional code complexity, and give higher bandwidths to FIFO/Buffers.
  • Cluso99 Posts: 18,069
    edited 2014-05-14 19:43
    Chip,
    Thanks for posting the Verilog code :)
    I am amazed that the Cyclone V is slower, but the LEs are different, so I guess that's a problem in the mix.

    How does OnSemi's RAM compare to the one you and Beau did (silicon space) ?
    I ask because I realised that more private cog memory could be added to each cog that would execute at full speed using a simple hubexec style
    (see http://forums.parallax.com/showthread.php/155687-A-solution-to-increasing-cog-ram-works-with-any-hub-slot-method )
    and wondered if there might be some more space on the die.
    Alternatively, it might be better to have larger cogs and a smaller hub - at first I didn't like it, but it has grown on me and I am not sure of the optimal balance - maybe a 256KB hub and 16*16KB cogs???
  • jmg Posts: 15,175
    edited 2014-05-14 19:44
    kwinn wrote: »
    Agreed. This scheme has tremendous potential to speed things up and maintain determinism.....provided the gating between cogs and ram can be done.

    See Chip's update
    http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip?p=1267735&viewfull=1#post1267735
  • kwinn Posts: 8,697
    edited 2014-05-14 19:45
    Would the gating for an equivalent barrel shifter have less heft? That is after all what it amounts to.
    cgracey wrote: »
    I got this new memory scheme implemented in Verilog.

    It takes ~9,000 LE's, which is equivalent to ~2 cogs. It's hefty, for sure, but buys a lot of performance.

    It's closing timing at ~140MHz, which is pretty decent, and on par with a cog's timing, so far. That means we can likely go 160MHz on the FPGA, but not 200MHz. The silicon will be able to reach 200MHz, though, and maybe higher.

    I keep wondering if there is some way to collapse the complexity of this. It all comes down to lots of mux's, which are hard to speed up.


    Here is the Verilog code that makes the memory system:
    // hub mem
    
    `include "hub_ram.v"
    
    module hub_mem
    (
    input		clk,
    input		ena,
    
    output		[63:0]	slot,	// 16 sets of 4-bit slot code outputs
    
    input		[15:0]	e,	// 16 enable inputs
    input		[63:0]	w,	// 16 sets of 4 byte-write-enable inputs
    input		[207:0]	a,	// 16 sets of 13 address inputs
    input		[511:0]	d,	// 16 sets of 32 data inputs
    output		[511:0]	q	// 16 sets of 32 data outputs
    );
    
    
    // slot rotator
    
    reg [15:0][3:0] s;
    
    always @(posedge clk or negedge ena)
    if (!ena)
    	s <= 64'hFEDCBA9876543210;
    else
    	s <= {s[0], s[15:1]};
    
    wire [15:0][3:0] sx	= {s[14:0], s[15]};
    
    assign slot		= s;
    
    
    // memories
    
    wire [15:0]		mem_e = e;
    wire [15:0]  [3:0]	mem_w = w;
    wire [15:0] [12:0]	mem_a = a;
    wire [15:0] [31:0]	mem_d = d;
    
    wire [15:0]		mem_ex;
    wire [15:0]  [3:0]	mem_wx;
    wire [15:0] [12:0]	mem_ax;
    wire [15:0] [31:0]	mem_dx;
    
    wire [15:0] [31:0]	mem_qx;
    
    
    reg [15:0]		mem_c;
    reg [15:0] [31:0]	mem_q;
    
    reg [15:0]		mem_rd;
    reg [15:0]  [3:0]	mem_qm;
    
    
    genvar i;
    generate
    	for (i=0; i<16; i++)
    	begin : hub_mem_gen
    
    		assign mem_ex[i]	= mem_e[s[i]];		// ram perspective
    		assign mem_wx[i]	= mem_w[s[i]];
    		assign mem_ax[i]	= mem_a[s[i]];
    		assign mem_dx[i]	= mem_d[s[i]];
    
    		hub_ram hub_ram_(	.clk(clk),
    					.e(mem_ex[i]),
    					.w(mem_wx[i]),
    					.a(mem_ax[i]),
    					.d(mem_dx[i]),
    					.q(mem_qx[i])	);
    
    		always @(posedge clk or negedge ena)
    		if (!ena)
    			mem_c[i] <= 1'b0;
    		else
    			mem_c[i] <= mem_ex[i] && ~|mem_wx[i];
    
    		always @(posedge clk)
    		if (mem_c[i])
    			mem_q[i] <= mem_qx[i];
    
    
    		always @(posedge clk or negedge ena)		// cog perspective
    		if (!ena)
    			mem_rd[i] <= 1'b0;
    		else
    			mem_rd[i] <= mem_e[i] && ~|mem_w[i];
    
    		always @(posedge clk)
    		if (mem_rd[i])
    			mem_qm[i] <= sx[i];
    
    		assign q[i*32+31:i*32] = mem_q[mem_qm[i]];
    
    	end
    endgenerate
    
    endmodule
    


    Here is the Verilog code that makes an instance of an 8192x32 RAM w/byte-write:
    // hub ram
    
    module hub_ram
    (
    input			clk,
    input			e,
    input		  [3:0]	w,
    input		 [12:0]	a,
    input		 [31:0]	d,
    output reg	 [31:0]	q
    );
    
    
    reg [7:0] ram3 [8191:0];
    reg [7:0] ram2 [8191:0];
    reg [7:0] ram1 [8191:0];
    reg [7:0] ram0 [8191:0];
    
    
    // 8192 x 32 ram with byte-write
    
    always @(posedge clk)
    if (e)
    begin
    	if (w[3]) ram3[a] <= d[31:24];
    	if (w[2]) ram2[a] <= d[23:16];
    	if (w[1]) ram1[a] <= d[15:8];
    	if (w[0]) ram0[a] <= d[7:0];
    	q <= {ram3[a], ram2[a], ram1[a], ram0[a]};
    end
    
    endmodule
    
  • mark Posts: 252
    edited 2014-05-14 19:45
    jmg wrote: »
    Yes, not pretty.

    It's looking like a split opcode & more choices (I think someone else mentioned split opcodes?) would be best:
    RDREQ - issues the Read Address (resets RDGET flags?)
    ...
    RDGET - fetches the RDREQ result, stalls if not ready yet (must be paired with RDREQ)

    While splitting the instructions like this might be "safer", a single instruction is more efficient for someone who understands the limitations and requirements of it. It hasn't been uncommon for people to try to reduce the number of instructions used in their code as much as possible in order to get it to fit in the limited cog ram, so that's why I'd hate for a read to require two instructions.
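    As a sketch of what the split RDREQ/RDGET mechanism could boil down to on the cog side (all names invented for illustration; not an actual P2 implementation):

    // RDREQ posts an address and clears the ready flag; the hub slot returns
    // the data and sets it; RDGET stalls only while the flag is still clear.

    module rd_split
    (
    input             clk,
    input             ena,
    input             rdreq,       // RDREQ executed: post the hub address
    input      [17:0] req_addr,
    input             hub_valid,   // hub returns the requested long this cycle
    input      [31:0] hub_data,
    input             rdget,       // RDGET executed: collect the result
    output reg [17:0] pend_addr,   // address waiting for its hub slot
    output reg [31:0] result,
    output reg        ready,       // result valid, RDGET will not stall
    output            stall        // RDGET issued before the data came back
    );

    assign stall = rdget && !ready;

    always @(posedge clk or negedge ena)
    if (!ena)
        ready <= 1'b0;
    else if (rdreq)
        ready <= 1'b0;             // a new request invalidates the old flag
    else if (hub_valid)
        ready <= 1'b1;

    always @(posedge clk)
    begin
        if (rdreq)     pend_addr <= req_addr;
        if (hub_valid) result    <= hub_data;
    end

    endmodule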
  • tonyp12 Posts: 1,951
    edited 2014-05-14 19:49
    >RDREQ - issues the Read Address (resets RDGET flags?)
    >RDGET - fetches the RDREQ result, stalls if not ready yet (must be paired with RDREQ)

    RDREQ only needs the source address; its opcode could use an 18-bit immediate address.
    An 18-bit movs-type instruction would be needed to self-modify it, so as to avoid having to use movs, shrl, movd.
    It still allows auto-inc/dec on its own address.

    RDGET gets 9 bits for the dest, but with no source there is room for other options like ROL, mask, etc. (replacing the word and byte opcodes, as the options can emulate them).

    I like it.
  • jmg Posts: 15,175
    edited 2014-05-14 20:04
    mark wrote: »
  • kwinn Posts: 8,697
    edited 2014-05-14 20:05
    You mean a RD/WR instruction that works similarly to how waitvid does? That is, the RD/WR is held/buffered until access to the memory block comes around, but the cog can continue to execute the following instructions unless one of them is another R/W. A second R/W before the previous one has completed will then halt execution. That would be a very good thing to have.

    mark wrote: »
  • kwinn Posts: 8,697
    edited 2014-05-14 20:13
    Chip does say it is possible, but I think it is better to have this be done by the R/W instruction rather than by an added wait of some kind. Perhaps a "stall until completed" bit, or perhaps just always allow execution to continue unless a second R/W is encountered before the first one completes. Once the first one completes, execution can continue until the next one, etc.
    jmg wrote: »
    Exactly, hence my suggestion of a SNAPCNT - I think it needs to be adjustable, for granularity, and 16 does not quite guarantee completion.

    You also want a slight variant of WAITCNT, as the idea is to SNAP to some known rate, which is not quite what std WAITCNT does.

    It is a little hard to encode into R/W as something needs to set & start this rate timer.
    I think SNAPCNT can have 9 immediate bits for opcode support, so perhaps the upper 2 bits can 'attach' to each of R,W, with the lower 7 giving 1..128, and values < 16 giving more control choices.
  • mark Posts: 252
    edited 2014-05-14 20:17
    kwinn wrote: »
    You mean a RD/WR instruction that works similarly to how waitvid does? That is, the RD/WR is held/buffered until access to the memory block comes around, but the cog can continue to execute the following instructions unless one of them is another R/W. A second R/W before the previous one has completed will then halt execution. That would be a very good thing to have.

    Exactly.
  • kwinn Posts: 8,697
    edited 2014-05-14 20:23
    It would make contiguous R/W to a single memory block fast and simple. The first R/W is buffered, the cog continues execution, and if a second R/W comes before the first one has completed, execution is blocked. Sort of an automatic wait for a memory block.
    Roy Eltham wrote: »
    And what about doing multiple wrxxxx in a row? Now you have to do all the ugly flipflop thing many times so they can all avoid stalling? Seems bad to me...
  • kwinn Posts: 8,697
    edited 2014-05-14 20:29
    The wait for the read to complete could be spent executing useful instructions and/or a wait. It would be up to the programmer to make sure that there has been enough time for the data to be read. That's why they get the big bucks.
    jmg wrote: »
    That's more logic for the buffers, but I can see how that could help write interleave with code.

    However, Read I cannot see, because it cannot do anything until the value is valid, unless there is some flag checked for Read-Data-Register-Now-Valid, but that's looking clumsy.
  • kwinn Posts: 8,697
    edited 2014-05-14 20:32
    Saw that. Good news indeed. Hope nothing pops up to change that.
    jmg wrote: »
  • kwinn Posts: 8,697
    edited 2014-05-14 20:35
    Yep, that would be great for writing data to hub in a short loop.

    mark wrote: »
  • RossH Posts: 5,485
    edited 2014-05-14 20:35
    kwinn wrote: »
    That's why they get the big bucks.

    Ha! This explains why you guys insist on coming up with such complicated proposals - it's all a massive conspiracy to raise programmers' pay rates! :lol:
  • jmg Posts: 15,175
    edited 2014-05-14 20:39
    kwinn wrote: »
    Chip does say it is possible, but I think it is better to have this be done by the R/W instruction rather than by an added wait of some kind. Perhaps a "stall until completed" bit, or perhaps just always allow execution to continue unless a second R/W is encountered before the first one completes. Once the first one completes, execution can continue until the next one, etc.

    See my expanded description of SNAPCNT here
    http://forums.parallax.com/showthread.php/155692-Nibble-Carry-Higher-speed-Buffers-FIFOs-using-new-HUB-Rotate

    It does pretty much what you say - i.e. it has a bit to attach the wait to WR or RD, so those opcodes apply the snap - but you need to define the delay value, which is why a separate opcode (or config register) is suggested.

    The SNAPCNT is quite useful outside of HUB loops, which is another reason to have an opcode for it.
  • jmg Posts: 15,175
    edited 2014-05-14 20:42
    kwinn wrote: »
    The wait for the read to complete could be spent executing useful instructions and/or a wait. It would be up to the programmer to make sure that there has been enough time for the data to be read. That's why they get the big bucks.

    Only, the 'gain' you think this (expensive & risky) cycle counting is buying is not real, because when the data does arrive, some cycles need to be stolen to move it into the dest register in COG memory.
    Better to use two opcodes, as that makes it very clear what is happening and when, and it is safely edit-proof.
  • mark Posts: 252
    edited 2014-05-14 20:48
    kwinn wrote: »
    Yep, that would be great for writing data to hub in a short loop.

    It might not seem obvious at first, but it's also handy for when the loop is longer than 16 clocks. You have a deterministic loop w/hub write/read that takes 20 clocks? No problem!
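    (To make the 20-clock case concrete, assuming a one-deep buffer as described above: the write posted on one pass waits at most 16 clocks for its hub slot, so it has committed well before the next pass posts its write 20 clocks later - no stall ever occurs and the loop stays at exactly 20 clocks.)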