Nibble-Carry - Higher speed Buffers/FIFOs using new HUB Rotate

jmg · 2014-05-14 15:26

The Multi-Spoke-Spinning-wheel nature of this new Hub rotate has higher BW, but it does change the details.

examples (I've assumed no pipeline interactions with spoke-rates)
a) Linear Array Writes using DEC by 1 will be spaced 15 clocks,
Opcodes are 2N, so Code Loops up to 7 opcodes, will sync as 2*7+1w = 15 SysCLK
b) Same-Address (eg Polling) will be spaced 16 clocks.
Opcodes are 2N, so Code Loops up to 8 opcodes, will sync as 2*8+0w = 16 SysCLK
c) Linear Array Writes using INC by 1 will be spaced 17 clocks.
Opcodes are 2N, so Code Loops up to 8 opcodes, will sync as 2*8+1w = 17 SysCLK

ie if you know the Address-Change, you know the cycles = cycle-deterministic

What if the code misses that Next-Rotate aperture ? - and still needs cycle-deterministic operation, I think next-spoke is now

...
14+16*N (dec by 2 ) or
15+16*N (dec by 1 ) or
16+16*N (same-address eg HUB polling )
17+16*N (inc by 1), or
18+16*N (inc by 2), or
....

That is still deterministic, unless you have no idea of N, (ie very elastic code).
It is rare to have such elasticity and need cycle-determinism, but those cases could use *WAITxx to realign to hard real time.
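A minimal sketch of the spacing rule above, under the same assumption of no pipeline interactions with spoke-rates. The name `next_slot_spacing` and its parameters are made up for illustration; the function just picks the smallest spacing that matches the address change modulo 16 and that the loop code can actually reach.

```python
def next_slot_spacing(delta, code_clocks):
    """Clocks between two hub accesses whose nibble addresses differ by delta.

    Assumes the rotate advances one spoke per SysCLK over 16 spokes, so a
    matching slot recurs at (delta mod 16) + 16*N clocks. code_clocks is
    the time the loop body itself needs (2 clocks per opcode).
    """
    spacing = delta % 16
    if spacing == 0:
        spacing = 16          # same address: wait a full rotation
    while spacing < code_clocks:
        spacing += 16         # missed the aperture: next one is 16 later
    return spacing

for delta, opcodes in [(-1, 7), (0, 8), (1, 8), (-1, 10)]:
    print(delta, opcodes, next_slot_spacing(delta, 2 * opcodes))
```

With a known delta and a known upper bound on the loop time, the result is a single fixed number, which is the cycle-deterministic property being claimed.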

That change in Throughput Cycles, by Nibble-Change, can be used to our advantage if we can carefully choose the next-nibble Address to write to

Taking one example of Next-Nibble Address Adder set for a Starting Nibble address of 0xA, and a tight Code loop that is 3 Opcodes.(6 SysCLK), giving an Address Adder of 7

These are the Spoke Values ( always +7 Delta spaced) :
 I=7;  
  N = 0x000A   N=N+I;N%16  repeated gives 
ans = 0x0001
ans = 0x0008
ans = 0x000F
ans = 0x0006
ans = 0x000D
ans = 0x0004
ans = 0x000B
ans = 0x0002
ans = 0x0009
ans = 0x0000
ans = 0x0007
ans = 0x000E
ans = 0x0005
ans = 0x000C
ans = 0x0003
ans = 0x000A  ### Nibble_Carry
ans = 0x0001
ans = 0x0008
ans = 0x000F
ans = 0x0006
ans = 0x000D
ans = 0x0004
ans = 0x000B
ans = 0x0002

Note we now have high speed, 7 cycle cycle-deterministic capture/buffering.

Also notice how the RAM Block fills, and 16 writes later, it returns to the same Nibble.
At this point, a Nibble_Carry is generated, and the next block is selected, before the write/read continues.
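The listing above can be reproduced with a few lines of Python. This is only a software model of the proposed adder; `nibble_sequence` is an illustrative name, and the carry test (nibble back at its starting value) is my reading of where the Nibble_Carry is placed for this odd adder.

```python
def nibble_sequence(start, adder, count):
    """Return (nibble, carry) pairs for repeated N = (N + adder) % 16.

    carry is True when the nibble returns to the starting value, which is
    where the text places the Nibble_Carry.
    """
    out = []
    n = start
    for _ in range(count):
        n = (n + adder) % 16
        out.append((n, n == start))
    return out

seq = nibble_sequence(0xA, 7, 16)
print(" ".join(f"{n:X}{'#' if carry else ''}" for n, carry in seq))
# prints: 1 8 F 6 D 4 B 2 9 0 7 E 5 C 3 A#
```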

Yes, the Block fill is non-sequential, but in a RAM FIFO/Buffer that detail does not matter. Provided the read starts at the same nibble, and uses the same INC, the same coverage is guaranteed. (everyone scrambles their RAM pins on PCBs, right ?)
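To illustrate why the scrambled fill is harmless, here is a small sketch that writes one 16-slot block with start 0xA and adder 7, then reads it back with the same settings. The helper names are hypothetical and the block is just a Python list standing in for 16 hub locations.

```python
def scrambled_order(start, adder):
    """Order in which the 16 slots of one block are visited."""
    n, order = start, []
    for _ in range(16):
        n = (n + adder) % 16
        order.append(n)
    return order

def write_block(samples, start, adder):
    block = [None] * 16
    for slot, value in zip(scrambled_order(start, adder), samples):
        block[slot] = value
    return block

def read_block(block, start, adder):
    return [block[slot] for slot in scrambled_order(start, adder)]

samples = list(range(100, 116))
block = write_block(samples, 0xA, 7)
assert read_block(block, 0xA, 7) == samples   # same start + adder: order restored
assert block != samples                        # physical layout is scrambled
```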

The details mean FIFO/Buffer memory would be allocated in 16N Chunks, but it does not need to use 100% of that.( and tools are likely to have 16N support )

This is valid for Odd-number Deltas of 1,3,5,7,9,11,13,15, which imply CODE loops of any of ?,2,4,6,8,10,12,14 + 1 Wait. (Even numbers are more sparse; they can be supported, or ignored - not sure yet which is best.)
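The odd/even split follows from the adder's common factor with 16: an odd adder shares none, so it reaches all 16 nibbles, while an even adder only reaches 16/gcd(adder, 16) of them. A quick check, with `slots_covered` as an illustrative helper name:

```python
from math import gcd

def slots_covered(adder):
    """Distinct nibble values reached by N = (N + adder) % 16."""
    seen, n = set(), 0
    while n not in seen:
        seen.add(n)
        n = (n + adder) % 16
    return len(seen)

for adder in range(1, 16):
    assert slots_covered(adder) == 16 // gcd(adder, 16)
    print(adder, slots_covered(adder))
```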

Such address and Nibble_Carry handling can be done in SW, but the real key is to have simple HW do this Fast Buffer form of Spoke-Increment handling, leaving the Loop code to be smaller and faster.
REPS is an ideal opcode to pair here.

~~~~~~~~~~~~~ Separate SNAPCNT Control ~~~~~~~~~~~~~~
* To help porting P1 code, there may be scope to add a small variant of WAITxx (snapcnt) that is User-set Modulus. (including 16N, whatever size makes opcode sense - 9 bits ? )

This would also allow Even-Loop cases to be Granular managed, at the cost of needing the opcode,and it gives a cycle-deterministic control method for random-access uses.

eg 9 immediate bits in Opcode
Upper 2 bits can optionally 'attach' SNAPCNT to each of R,W Hub access, which saves needing an extra code line
Lower 7 bits gives 1..128 choice in delays
>= 16 change the snap rate, and reset the snap timer.
'impractical' values < 16 give more control choices, like Enable/reset/RunToSnap
SNAPCNT 0 removes the feature (default)
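One possible reading of that field layout, as a sketch only. The post does not say which attach bit is R and which is W, nor whether "0 removes the feature" refers to the whole immediate or just the low 7 bits, so the bit positions and the raw-value interpretation below are assumptions, and both function names are invented.

```python
def decode_snapcnt(imm9):
    """Split a 9-bit immediate as proposed: 2 attach bits + 7-bit value.

    Which attach bit means read and which means write is NOT specified in
    the post; the assignment below is an arbitrary choice for illustration.
    """
    imm9 &= 0x1FF
    value = imm9 & 0x7F
    attach_read = bool(imm9 & 0x080)    # assumed bit 7
    attach_write = bool(imm9 & 0x100)   # assumed bit 8
    if value == 0:
        kind = "off"            # SNAPCNT 0 removes the feature
    elif value < 16:
        kind = "control"        # Enable/reset/RunToSnap style codes
    else:
        kind = "modulus"        # sets the snap rate and resets the timer
    return kind, value, attach_read, attach_write

def clocks_until_snap(timer, modulus):
    """Stall needed so an access lands on the next multiple of modulus."""
    return (-timer) % modulus

print(decode_snapcnt(32), clocks_until_snap(20, 32))
```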

The stall might re-apply variably each time (Address-change is unknown) , so practical repeat delays here will be ~32 to allow room for [Some Code+Access-Stall], very like P1 does now.
That is 2x the P1, but if the SysCLK is 2x P1, maybe that is ok.

Addit: I see present WAIT Opcode
WAIT D/# (wait for some number of clocks, 0 same as 1)

SNAPCNT is really just a coarse version of that, with 'attach' options

koehler · 2014-05-14 18:19

jmg wrote: »

* To help porting P1 code, there may be scope to add a small variant of WAITxx (snapcnt) that is User-set Modulus. (including 16N, whatever size makes opcode sense - 9 bits ? )

So, this is the first time I think I've seen this question actually addressed, unless I missed a post somewhere.

Assuming this is the new path forward to complete product, what happens to all of the current OBEX objects?
Do they work as is?
Do they work only after porting?

jmg · 2014-05-14 18:25

koehler wrote: »

Assuming this is the new path forward to complete product, what happens to all of the current OBEX objects?
Do they work as is?
Do they work only after porting?

Very little is going to 'work as is' (maybe some HLL stuff?), but work after porting is realistic, and hence the snapcnt suggestion.

potatohead · 2014-05-14 19:20

Porting was always on the table for the next chip.

Cluso99 · 2014-05-14 19:47

jmg,
I don't understand this nibble scheme at all. What am I missing?

jmg · 2014-05-14 19:55

Cluso99 wrote: »

jmg,
I don't understand this nibble scheme at all. What am I missing?

Bummer, I thought it was crystal-clear.
See the section now titled
"Taking one example Address Adder set "
and follow the address Nibble sequence.

The 'magic' is in choosing a Nibble adder that matches the opcode delays, so there is minimal waits inserted.
Opcodes are 2 cy, and best Nibble adders are odd, so that means in practice a single cycle wait

Once you have that, all that is left is deciding when to make the Nibble-carry (ie advance to the next page)

This complements the BLOCK copy, which has a fairly chunky 18N impact on any code loop, whilst this adder scheme allows code-loops of a few opcodes to send data once per loop.

Cluso99 · 2014-05-14 20:18

jmg,
So do you mean having a nibble adder circuit that can be set after reset that will determine the hub interleaving factor?

jmg · 2014-05-14 20:28

Cluso99 wrote: »

jmg,
So do you mean having a nibble adder circuit that can be set after reset that will determine the hub interleaving factor?

Depends what you mean by "hub interleaving factor".

This is a Nibble Adder applied to the HUB PTR++, so does not affect other COGS at all - COGs can all have different values in their Nibble Adders if they want, no interactions.

It does not affect the Rotate scanner either, just 'co-operates' with it in a clever way, to boost bandwidth.

The user chooses the Adder value, (can be 1, to look like a 'normal' adder), but it can also be any other Nibble value, to give slots that are just-in-time matching to any tight code loop you care to design.

Cluso99 · 2014-05-14 21:23

So the hub ram can use different lower nibble addressing for each cog.
It is just like rewiring the lower 4 address lines through a lookup table.

Of course other cogs accessing this info would also have to be aware of this, because hub addressing (lowest nibble) may not be contiguous as stored in hub.

Is this what you mean?

jmg · 2014-05-14 22:01

Cluso99 wrote: »

So the hub ram can use different lower nibble addressing for each cog.
It is just like rewiring the lower 4 address lines through a lookup table.

Of course other cogs accessing this info would also have to be aware of this, because hub addressing (lowest nibble) may not be contiguous as stored in hub.

Is this what you mean?

That's very close. It's using an adder, not a lookup table ( but a lookup table can be an adder )
Yes, COGS using this would need to use the same Base address on their Buffer and the same adder value.

It can address the whole HUB memory this way, with the right Nibble-Carry, and so is suited to many apps that need to stream data quickly (where the video HW may not quite fit - eg Read is ??tbf?? via Video )

Example: Hopefully with REPS, Pin-info can Stream capture to memory at SysCLK/3 from a single COG
3 Phased COGS would sample at SysCLK.

That is quite close to what the earlier HUB Mapping table could deliver, which would have done this in 2 COGs, but with BW impact on other COGs

The NEW HubRotator means 3 COGS can stream in at 200MHz and the other COGS do not even know they are there..

I guess 3 more COGS could stream OUT data at 200MHz at the same time.
The more limited BW before, could not touch that. - and there are still 10 COGs with BW left !

Without this Nibble-Adder, those impressive numbers plummet, but a 4 bit adder (optional) has very low cost.
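The phased-COG claim is simple arithmetic, sketched here. It assumes each COG samples every 3 SysCLK and that COG k starts k clocks after COG 0; `sample_clocks` is an illustrative helper, and the 200MHz figure is the one used in the post.

```python
def sample_clocks(num_cogs, period, total_clocks):
    """Clock numbers sampled when cog k starts k clocks after cog 0."""
    hits = set()
    for phase in range(num_cogs):
        hits.update(range(phase, total_clocks, period))
    return hits

assert sample_clocks(3, 3, 48) == set(range(48))   # 3 phased cogs cover every SysCLK
assert len(sample_clocks(1, 3, 48)) == 16          # one cog alone: SysCLK/3

SYSCLK_HZ = 200_000_000   # figure used in the post
print(SYSCLK_HZ / 3)      # per-cog sample rate
```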

Cluso99 · 2014-05-14 22:24

jmg wrote: »

That's very close. It's using an adder, not a lookup table ( but a lookup table can be an adder )
Yes, COGS using this would need to use the same Base address on their Buffer and the same adder value.

It can address the whole HUB memory this way, with the right Nibble-Carry, and so is suited to many apps that need to stream data quickly (where the video HW may not quite fit - eg Read is ??tbf?? via Video )

Example: Hopefully with REPS, Pin-info can Stream capture to memory at SysCLK/3 from a single COG
3 Phased COGS would sample at SysCLK.

That is quite close to what the earlier HUB Mapping table could deliver, which would have done this in 2 COGs, but with BW impact on other COGs

The NEW HubRotator means 3 COGS can stream in at 200MHz and the other COGS do not even know they are there..

I guess 3 more COGS could stream OUT data at 200MHz at the same time.
The more limited BW before, could not touch that. - and there are still 10 COGs with BW left !

Without this Nibble-Adder, those impressive numbers plummet, but a 4 bit adder (optional) has very low cost.

What is the impact to other cogs just running standard (normal +1 adder I presume)? I would think none???

jmg · 2014-05-14 22:44

Cluso99 wrote: »

What is the impact to other cogs just running standard (normal +1 adder I presume)? I would think none???

Correct, they have no idea any of this is going on. They can be sending blocks back and forth.

Fingers crossed REPS can make this very compact loop stuff fly, as the Nibble adder can deliver the BW

jmg · 2014-05-15 14:38

Expanding with some examples of various Adder-Values and the resulting Nibble-Address, and when Nibble-Carry occurs (tagged ##) to advance the higher address bits.

// Pointer Adding Schemes, Nibble Carry handling 
// Can start from any Value and apply any Delta (and Sign) and COGS will scan the same 
//
// ## Tags Nibble Carry Rule is effectively "about to access any address twice" 
// - odd numbers can re-loop to fill all 16, even ones are less lucky, and more sparse*.
// In use, odd numbers will likely Round-Up to the next even number, as MOPS = SysCLK/2  = 1 wait applied.
// OR round down in SW, and Round UP in Wait. ie Wait is 3 Clocks, others are even
//                                                                            
//  I=3         I=5           I=9      I=15       I=12       I=6      I=11     I=13
//    ~~~~~~~~~~~~~~~~~~~  N=N+I;N%16  ~~~~~~~~~~~~~~~~~~~~~~~~~~~            
//  = 0           = 5         = 9       = 15       = 12       = 6     = 11     = 13
//  = 3           = 10        = 2       = 14       = 8        = 12    = 6      = 10
//  = 6           = 15        = 11      = 13       = 4        = 2     = 1      = 7
//  = 9           = 4         = 4       = 12       = 0        = 8     = 12     = 4
//  = 12          = 9         = 13      = 11       = 12 ##    = 14    = 7      = 1
//  = 15  6       = 14        = 6       = 10       = 8        = 4     = 2      = 14
//  = 2           = 3         = 15      = 9        = 4        = 10    = 13     = 11
//  = 5           = 8         = 8       = 8        = 0        = 0     = 8      = 8
//  = 8           = 13        = 1       = 7        = 12       = 6 ##  = 3      = 5
//  = 11          = 2         = 10      = 6        = 8        = 12    = 14     = 2
//  = 14  5       = 7         = 3       = 5        = 4        = 2     = 9      = 15
//  = 1           = 12        = 12      = 4        = 0        = 8     = 4      = 12
//  = 4           = 1         = 5       = 3        = 12       = 14    = 15     = 9
//  = 7           = 6         = 14      = 2        = 8        = 4     = 10     = 6
//  = 10          = 11        = 7       = 1        = 4        = 10    = 5      = 3
//  = 13  5L      = 0         = 0       = 0        = 0        = 0     = 0      = 0
//  = 0   17th ## = 5  ##     = 9 ##    = 15 ##    = 12       = 6     = 11 ##  = 13 ##
//  = 3   18      = 10        = 2       = 14       = 8        = 12    = 6      = 10
//  = 6   19      = 15        = 11      = 13       = 4        = 2     = 1      = 7
//  = 9   20      = 4         = 4       = 12       = 0        = 8     = 12     = 4
//  = 12  21      = 9         = 13      = 11       = 12       = 14    = 7      = 1
//  = 15          = 14        = 6       = 10       = 8        = 4     = 2      = 14
//  = 2           = 3         = 15      = 9        = 4        = 10    = 13     = 11
//  = 5           = 8         = 8       = 8        = 0        = 0     = 8      = 8
//  = 8           = 13        = 1       = 7        = 12       = 6     = 3      = 5
//  = 11          = 2         = 10      = 6        = 8        = 12    = 14     = 2
//  = 14          = 7         = 3       = 5        = 4        = 2     = 9      = 15
//  = 1           = 12        = 12      = 4        = 0        = 8     = 4      = 12
//  = 4           = 1         = 5                  = 12       = 14    = 15     = 9
//                                                                             = 6

* even ones are less lucky, and more sparse - such cases can still be useful, for things like Video, where you reset the base index every line.
Examples of Base-Index restarts, and following bursts and interleaves, Even-N values

ODD => Nibble_Carry every 16, very simple.
EVEN => need TWO adders Nibble + Column,
Nibble_Carry is always every 16 in all Numbers (full-page)
It also needs de-stutter as the process of Column incs to cover ALL memories, gives non-evenly spaced memory access.
The average rate is fSys/N

 ## Tags Scan PreLoad, and also is the Nibble Carry 16 writes later
 @@ Tags Column Carry  Col=(Col+1) & ColWRAP, ^^ is the last value of that Column before NextCol
 RowCtr & ColCtr spin wrapping, until Nibble_Carry exits that page fill.
 
NibbleIndex (ans)  is N+Column
> I=2;N=0;   > I=2;N=1;
> N=N+I;N%16 > N=N+I;N%16           RowCtr
ans = 2      ans = 3                  7 
ans = 4      ans = 5                  6 
ans = 6      ans = 7                  5 
ans = 8      ans = 9                  4 
ans = 10     ans = 11                 3 
ans = 12     ans = 13                 2 
ans = 14     ans = 15  ## 8w-Repeat   1 
ans = 0  ^^  ans = 1   ^^             0
[ns = 2  @@  ans = 3   @@             7
ans = 4      ans = 5
ans = 6      ans = 7

> I=6;N=0;    > I=6;N=1;
> N=N+I;N%16  > N=N+I;N%16           RowCtr
ans = 6       ans = 7                  7
ans = 12      ans = 13                 6
ans = 2       ans = 3                  5
ans = 8       ans = 9                  4
ans = 14      ans = 15                 3
ans = 4       ans = 5                  2
ans = 10      ans = 11  ## 8w-Repeat   1
ans = 0  ^^   ans = 1   ^^             0
[ns = 6  @@   ans = 7   @@             7
ans = 12      ans = 13]

> I=10;N=0;   > I=10;N=1;    
> N=N+I;N%16  > N=N+I;N%16            RowCtr 
ans = 10      ans = 11                 7
ans = 4       ans = 5                  6
ans = 14      ans = 15                 5
ans = 8       ans = 9                  4
ans = 2       ans = 3                  3
ans = 12      ans = 13                 2
ans = 6       ans = 7  ##  8w-Repeat   1    
ans = 0  ^^   ans = 1  ^^              0
[ns = 10 @@   ans = 11 @@              7
ans = 4       ans = 5 ]      

> I=14;N=0;  > I=14;N=1;
> N=N+I;N%16 > N=N+I;N%16          RowCtr
ans = 14     ans = 15                 7
ans = 12     ans = 13                 6
ans = 10     ans = 11                 5
ans = 8      ans = 9                  4
ans = 6      ans = 7                  3
ans = 4      ans = 5                  2
ans = 2      ans = 3  ## 8w-Repeat    1
ans = 0  ^^  ans = 1  ^^              0
[ns = 14 @@  ans = 15 @@              7
ans = 12     ans = 13]
ans = 10     ans = 11


> I=4;N=0;    > I=4;N=1;    > I=4;N=2;   > I=4;N=3;
> N=N+I;N%16  > N=N+I;N%16  > N=N+I;N%16 > N=N+I;N%16          RowCtr
ans = 4       ans = 5       ans = 6      ans = 7                3  
ans = 8       ans = 9       ans = 10     ans = 11               2 
ans = 12      ans = 13      ans = 14     ans = 15 ## 4w-Repeat  1 
ans = 0 ^^    ans = 1  ^^   ans = 2 ^^   ans = 3 ^^             0 
[ns = 4 @@    ans = 5  @@   ans = 6 @@   ans = 7 @@             3 
ans = 8       ans = 9       ans = 10     ans = 11                
ans = 12      ans = 13      ans = 14     ans = 15      ]

> I=12;N=0;  > I=12;N=1;   > I=12;N=2;   > I=12;N=3;
> N=N+I;N%16 > N=N+I;N%16  > N=N+I;N%16  > N=N+I;N%16
ans = 12     ans = 13      ans = 14      ans = 15                3
ans = 8      ans = 9       ans = 10      ans = 11                2
ans = 4      ans = 5       ans = 6       ans = 7  ## 4w-Repeat   1
ans = 0  ^^  ans = 1  ^^   ans = 2  ^^   ans = 3  ^^             0
[ns = 12 @@  ans = 13 @@   ans = 14 @@   ans = 15 @@             3
ans = 8      ans = 9       ans = 10      ans = 11
ans = 4      ans = 5       ans = 6       ans = 7  ]


> I=8;N=0;   > I=8;N=1;   > I=8;N=2;   > I=8;N=3;   > I=8;N=4;   > I=8;N=5;   > I=8;N=6;   > I=8;N=7;
> N=N+I;N%16 > N=N+I;N%16 > N=N+I;N%16 > N=N+I;N%16 > N=N+I;N%16 > N=N+I;N%16 > N=N+I;N%16 > N=N+I;N%16
ans = 8      ans = 9      ans = 10     ans = 11     ans = 12     ans = 13     ans = 14     ans = 15 ##   1
ans = 0 ^^   ans = 1 ^^   ans = 2  ^^  ans = 3  ^^  ans = 4  ^^  ans = 5  ^^  ans = 6  ^^  ans = 7  ^^   0 
[ns = 8 @@   ans = 9 @@   ans = 10 @@  ans = 11 @@  ans = 12 @@  ans = 13 @@  ans = 14 @@  ans = 15 @@   1
ans = 0      ans = 1      ans = 2      ans = 3      ans = 4      ans = 5      ans = 6      ans = 7       0 ]
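The Even-N tables above can be modelled with two nested counters. This is a sketch of my reading of the Row/Column scheme, not the Verilog mentioned later in the thread; `page_scan` is an invented name, and it ignores the de-stutter timing and only checks address coverage.

```python
from math import gcd

def page_scan(adder):
    """Visit all 16 nibble slots of a page for any adder 1..15.

    Rows step by the adder; when a column's row counter expires, the
    column offset advances by one (the '@@' Column Carry in the tables).
    """
    columns = gcd(adder, 16)      # 1 for odd adders
    rows = 16 // columns          # 8w, 4w or 2w repeat for even adders
    order = []
    for col in range(columns):
        n = col
        for _ in range(rows):
            n = (n + adder) % 16
            order.append(n)
    return order

for adder in (2, 6, 10, 14, 4, 12, 8, 7):
    assert sorted(page_scan(adder)) == list(range(16))
print(page_scan(6))
```

For adder 6 this gives 6, 12, 2, 8, 14, 4, 10, 0 and then 7, 13, 3, 9, 15, 5, 11, 1, matching the two columns of the I=6 table.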


best to lock-Start to a critical Start Value [Col.Add] so that from there /N pacing can de-stutter the samples.

jmg · 2014-05-16 18:39

cgracey wrote: »

I think the apprehension over this new hub memory scheme is overblown. It's true that there will be some jitter for random accesses, but the flip side is that by paying attention to the order you do your writes in, you can actually get higher throughput.

I've been thinking about the video mechanism all afternoon and I just realized that because it's going to be tied to the system clock, all the clock domain decoupling that has been part of video since Prop1 can go away. Now, there can be different instructions to do different video output streams. There's no longer a need to chain video commands, in other words. This means that we CAN have a 256 LUT by reading the pixels from hub, translating them via cog RAM into 32-bit patterns, and outputting them to the DACs. This simplifies video quite a bit.

There can now be a small set of video output instructions that get the job done in a simple way, outputting a whole visible scan line at a time:

VID32........32-bit hub-to-DAC mode at Fsys/N
VID16........16-bit hub-to-DAC mode at Fsys/N
VID8..........8-bit hub-to-LUT-to-DAC mode at Fsys/N
VID4..........4-bit hub-to-LUT-to-DAC mode at Fsys/N
VID2..........2-bit hub-to-LUT-to-DAC mode at Fsys/N
VID1..........1-bit hub-to-LUT-to-DAC mode at Fsys/N

( N= 1, 2, 3,... 64 )

Once these instructions are over, they can return the DAC states to whatever they were before, with a mapped DAC register holding the four 8-bit values. That way, horizontal sync's can be done with 'MOV DAC,dacstates' and 'WAIT clocks' instructions. This simplifies the video greatly. Because there is no decoupling, though, the cog will be busy while it generates the pixels.
...{merge}
Because these instructions would stall execution for their duration, the cog RAM can be used as a LUT.
...{merge}
I'm not sure yet. It might need to buffer a block first in 16 cycles, then transfer that to the shifter and load the next block. That would take a lot of flops, though. I need to map out the timing.

You said your nibble adder was good for odd-N values, right?

Odd-N values are relatively simple.- see the first table in #13. Even-N was more of a challenge.

I have updated the #13 table 2, to reflect the Even-N coding iterations.
The Even-N in table 1 shows the sparse issue, and table 2 shows how to manage Even-N nibble adder, so it is no longer sparse. ie fully scans all memory addresses in a page, before moving to the next page.

A few iterations were needed to handle the finer details, so that
* All Values of N have the same delay from launch
* A simple means to re-sync even cases to even-pitch the 2N average - can be at the pin-register.

End Result :
Support for Video Streaming at fSys/N where Odd and Even are supported, well suited to DMA use, (and not using a lot of flops.)
Logic cost is 2 Nibble adders, and a handful of 4 bit counters.
P&R reports 38 Slices, 70 LUT4's (includes wrapper test dummies) and 337.838MHz

This all swallows into a 'Smarter Adder' block that is used by DMA-HW when doing Rotate-Streaming, @ Fsys/N, and the same verilog code is used by a special WRVID Reg, PTR++ (that ensures SW write and HW read are both linear to the user )

The 'Smarter Adder' can also work with SW Write and SW-Read cases, but some details mean that is probably best limited to Odd-N.

For SW-Video modes, that have to use Even-N, (due to clock constraints), another simple choice is to change coding to vary PreLoad by Scan line.
Each line is sparse, but shifted until M lines use all addresses.
HW Video modes for Even-N would most likely be used, and they do not have to employ this Scan Line (but they could).

jmg · 2014-05-19 00:23

pasted from other thread, as this Topic is about Streaming at fSys/N

cgracey wrote: »

.... how to proceed with the video/hubexec buffering: A 20-stage FIFO in the cog that spits out hub longs at any rate at or below Fsys would simplify video quite a bit and provide hubexec instructions. And it would only take 640 flipflops.

I realized a FIFO can be built with each stage mux'ing either the hub read or the above FIFO stage into its inputs. Then we have a 5-bit counter to keep track of how many longs are stacked up, and what stage gets the next write. This would even bust out of the 16-long block constraint, since you can't really get started until you have your initial data, anyway. With this, you can get started with video or hub exec as soon as you get the first long, because the second and third are definitely following right behind. And if there's a stall in long consumption, there will be enough longs stacked up in the FIFO to allow loading to resume at the next identical hub opportunity (16 clocks later). This is a fix-all for hub long data needed at any rate, up to and including Fsys.

I've looked into a Tag Memory-FIFO, and cannot find a case where it needs to be > 16, so I think this can reduce in size, and also fit into Simple Dual Port Memory. (no MUX per stage needed, just DPRAM Wr & Rd)
All of which saves important die Space

I coded a simulator for the data-flow decisions, and allow for a separate read-trigger which allows 'arming' the FIFO, for any-cycle precise start of the Streaming to Pins/DAC/LUT

This revealed one special case where fSys/1 and one phase of Start/Read gives issues with Tag management, as here Wr and Rd are on the same clock edge.

It looks like that can be solved with a pass-thru mode, so all phases and fSys/N seem to be ok.

It is coded with a full size HUB Address that is preloaded and conditionally INCs, whose 4 LSBs address the RPRAM_WR, and a 5b Read Address that uses its MSB for compare and its 4 LSBs to address the RPRAM_RD; the smaller Read address INCs at fSys/N
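For readers unfamiliar with the extra-MSB trick, here is the textbook version of a 16-entry FIFO whose pointers carry one bit more than the RAM address. It is a generic sketch, not the simulator described above: the class name is invented, the write side here uses a plain 5-bit pointer rather than a full HUB address, and the pass-thru case for fSys/1 is not modelled.

```python
class TagFifo16:
    """16-entry FIFO with 5-bit pointers (4 address bits + 1 wrap bit)."""

    def __init__(self):
        self.ram = [0] * 16
        self.wr = 0   # 5-bit write pointer
        self.rd = 0   # 5-bit read pointer

    def empty(self):
        return self.wr == self.rd

    def full(self):
        # same RAM address, different wrap bit
        return (self.wr ^ self.rd) == 0x10

    def write(self, value):
        if self.full():
            return False
        self.ram[self.wr & 0xF] = value
        self.wr = (self.wr + 1) & 0x1F
        return True

    def read(self):
        if self.empty():
            return None
        value = self.ram[self.rd & 0xF]
        self.rd = (self.rd + 1) & 0x1F
        return value

fifo = TagFifo16()
for i in range(16):
    fifo.write(i)
print(fifo.full(), fifo.read(), fifo.read())
```

The MSB comparison is what distinguishes "full" from "empty" when the two 4-bit RAM addresses are equal.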
