Nibble-Carry - Higher speed Buffers/FIFOs using new HUB Rotate
jmg
The Multi-Spoke-Spinning-wheel nature of this new Hub rotate has higher BW, but it does change the details.
examples (I've assumed no pipeline interactions with spoke-rates)
a) Linear Array Writes using DEC by 1 will be spaced 15 clocks.
Opcodes are 2N, so Code Loops up to 7 opcodes will sync as 2*7+1w = 15 SysCLK
b) Same-Address (eg Polling) will be spaced 16 clocks.
Opcodes are 2N, so Code Loops up to 8 opcodes will sync as 2*8+0w = 16 SysCLK
c) Linear Array Writes using INC by 1 will be spaced 17 clocks.
Opcodes are 2N, so Code Loops up to 8 opcodes will sync as 2*8+1w = 17 SysCLK
ie if you know the Address-Change, you know the cycles = cycle-deterministic
What if the code misses that Next-Rotate aperture, and still needs cycle-deterministic operation? I think the next spoke is now at
... 14+16*N (dec by 2), or 15+16*N (dec by 1), or 16+16*N (same-address, eg HUB polling), or 17+16*N (inc by 1), or 18+16*N (inc by 2), or ... That is still deterministic, unless you have no idea of N (ie very elastic code).
It is rare to have such elasticity and need cycle-determinism, but those cases could use *WAITxx to realign to hard real time.
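As a rough check of those spacings, here is a minimal Python sketch. It assumes, like the examples above, no pipeline interactions: the rotator steps one spoke per SysCLK over 16 spokes, and the code stalls until the wanted spoke comes round. The function name and parameters are mine, not anything in the chip.

```python
SPOKES = 16  # hub slices visited by the rotator, one per SysCLK (assumed from the post)

def next_access_spacing(addr_delta: int, loop_clocks: int) -> int:
    """Clocks between two hub accesses whose nibble addresses differ by addr_delta.

    The spacing must be congruent to addr_delta modulo 16 and cannot be
    shorter than the code that runs between the two accesses.
    """
    if loop_clocks < 1:
        raise ValueError("loop_clocks must be at least 1")
    # Stall until the wanted spoke comes round again.
    wait = (addr_delta - loop_clocks) % SPOKES
    return loop_clocks + wait

print(next_access_spacing(-1, 14))  # DEC by 1, 7 opcodes
print(next_access_spacing(0, 16))   # same address, 8 opcodes
print(next_access_spacing(1, 16))   # INC by 1, 8 opcodes
print(next_access_spacing(-1, 16))  # DEC by 1, aperture missed by one opcode
```

A loop that overruns its aperture simply lands on the next N, which is the 16*N term in the list above.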
That change in Throughput Cycles, by Nibble-Change, can be used to our advantage if we can carefully choose the next-nibble Address to write to.
Taking one example of a Next-Nibble Address Adder set, for a Starting Nibble address of 0xA and a tight Code loop of 3 Opcodes (6 SysCLK) plus 1 wait, the Address Adder is 7.
These are the Spoke Values (always +7 Delta spaced):
I=7; N = 0x000A
N=N+I; N%16 repeated gives
ans = 0x0001
ans = 0x0008
ans = 0x000F
ans = 0x0006
ans = 0x000D
ans = 0x0004
ans = 0x000B
ans = 0x0002
ans = 0x0009
ans = 0x0000
ans = 0x0007
ans = 0x000E
ans = 0x0005
ans = 0x000C
ans = 0x0003
ans = 0x000A
### Nibble_Carry
ans = 0x0001
ans = 0x0008
ans = 0x000F
ans = 0x0006
ans = 0x000D
ans = 0x0004
ans = 0x000B
ans = 0x0002
Note we now have high speed, 7-cycle, cycle-deterministic capture/buffering.
Also notice how the RAM Block fills, and 16 writes later, it returns to the same Nibble.
At this point, a Nibble_Carry is generated, and the next block is selected, before the write/read continues.
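The address stepping with Nibble_Carry can be sketched in software like this. It is only an illustration of the idea, assuming an odd adder and a 16-aligned block base; the generator name and its arguments are made up for the example, and the real thing would be a small HW adder.

```python
def spoke_addresses(base: int, start_nibble: int, adder: int, count: int):
    """Yield `count` hub addresses, stepping the low nibble by `adder`.

    `base` is the 16-aligned block address. When the nibble returns to
    `start_nibble` the block is full, so a Nibble_Carry moves `base` on by 16.
    """
    nibble = start_nibble
    for _ in range(count):
        nibble = (nibble + adder) % 16
        yield base + nibble
        if nibble == start_nibble:  # Nibble_Carry: every nibble in this block was used
            base += 16

addrs = list(spoke_addresses(base=0, start_nibble=0xA, adder=7, count=18))
print([hex(a) for a in addrs])
```

The first 16 addresses stay inside the first block and the 17th is the first one in the next block. With an even adder the nibble comes back to its start early, so this simple carry rule would skip part of each block.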
Yes, the Block fill is non-sequential, but in a RAM FIFO/Buffer that detail does not matter. Provided the read starts at the same Nibble and uses the same INC, the same coverage is guaranteed. (everyone scrambles their RAM pins on PCBs, right ?)
The details mean FIFO/Buffer memory would be allocated in 16N Chunks, but it does not need to use 100% of that. (and tools are likely to have 16N support)
This is valid for Odd-number Deltas of 1,3,5,7,9,11,13,15, which imply CODE loops of any of ?,2,4,6,8,10,12,14 + 1Wait. (even numbers are more sparse; they can be supported, or ignored - not sure yet which is best)
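Why odd Deltas cover the whole block and even ones are sparse comes down to the gcd of the Delta with 16. A small Python sketch (the function name is mine):

```python
from math import gcd

def nibbles_covered(adder: int) -> int:
    """How many of the 16 nibbles a repeated `adder` step visits before repeating."""
    return 16 // gcd(adder % 16, 16)

for adder in range(1, 16):
    print(adder, nibbles_covered(adder))
```

Every odd adder gives 16, while 2, 4 and 8 give 8, 4 and 2 nibbles per block.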
Such address and Nibble_Carry handling can be done in SW, but the real key is to have simple HW do this Fast Buffer form of Spoke-Increment handling, leaving the Loop code to be smaller and faster.
REPS is an ideal opcode to pair here.
~~~~~~~~~~~~~ Separate SNAPCNT Control ~~~~~~~~~~~~~~
* To help porting P1 code, there may be scope to add a small variant of WAITxx (snapcnt) that is User-set Modulus. (including 16N, whatever size makes opcode sense - 9 bits ? )
This would also allow Even-Loop cases to be managed granularly, at the cost of needing the opcode, and it gives a cycle-deterministic control method for random-access uses.
eg 9 immediate bits in Opcode
Upper 2 bits can optionally 'attach' SNAPCNT to each of R,W Hub access, which saves needing an extra code line
Lower 7 bits gives 1..128 choice in delays
>= 16 change the snap rate, and reset the snap timer.
'impractical' values < 16 give more control choices, like Enable/reset/RunToSnap
SNAPCNT 0 removes the feature (default)
The stall might re-apply variably each time (Address-change is unknown), so practical repeat delays here will be ~32 to allow room for [Some Code+Access-Stall], very like P1 does now.
That is 2x the P1, but if the SysCLK is 2x P1, maybe that is ok.
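To make the proposed field split concrete, here is a hedged Python sketch of one possible decode. SNAPCNT is only a suggestion at this point, so everything here is hypothetical: the bit positions, which attach bit means R and which means W, and both function names are my assumptions. The lower 7 bits are kept as a raw 0..127 value.

```python
def decode_snapcnt(imm9: int) -> dict:
    """Split a 9-bit SNAPCNT immediate into the fields suggested above (hypothetical layout)."""
    if not 0 <= imm9 < 512:
        raise ValueError("expected a 9-bit immediate")
    attach = (imm9 >> 7) & 0b11  # upper 2 bits; R on bit 1 and W on bit 0 is assumed
    low7 = imm9 & 0x7F           # lower 7 bits, raw value
    if low7 == 0:
        mode = "off"             # SNAPCNT 0 removes the feature
    elif low7 < 16:
        mode = "control"         # Enable/reset/RunToSnap style choices
    else:
        mode = "rate"            # sets the snap rate and resets the snap timer
    return {
        "attach_read": bool(attach & 0b10),
        "attach_write": bool(attach & 0b01),
        "mode": mode,
        "value": low7,
    }

def clocks_to_snap(timer: int, modulus: int) -> int:
    """Clocks to wait until the snap timer next reaches a multiple of `modulus`."""
    return (-timer) % modulus

print(decode_snapcnt(0b10_0100000))
print(clocks_to_snap(20, 32))
```

With a modulus of 32, a loop that took 20 clocks of code plus stall would wait 12 more to realign.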
Addit: I see present WAIT Opcode
WAIT D/# (wait for some number of clocks, 0 same as 1)
SNAPCNT is really just a coarse version of that, with 'attach' options
Comments
So, this is the first time I think I've seen this question actually addressed, unless I missed a post somewhere.
Assuming this is the new path forward to complete product, what happens to all of the current OBEX objects?
Do they work as is?
Do they work only after porting?
Very little is going to 'work as is' (maybe some HLL stuff?), but work after porting is realistic, and hence the snapcnt suggestion.
I don't understand this nibble scheme at all. What am I missing?
Bummer, I thought it was crystal-clear.
See the section now titled
"Taking one example Address Adder set "
and follow the address Nibble sequence.
The 'magic' is in choosing a Nibble adder that matches the opcode delays, so there are minimal waits inserted.
Opcodes are 2 cy, and the best Nibble adders are odd, so that means in practice a single cycle wait
Once you have that, all that is left is deciding when to make the Nibble-carry (ie advance to the next page)
This complements the BLOCK copy, which has a fairly chunky 18N impact on any code loop, whilst this adder scheme allows code-loops of a few opcodes to send data once per loop.
So do you mean having a nibble adder circuit that can be set after reset that will determine the hub interleaving factor?
Depends what you mean by "hub interleaving factor".
This is a Nibble Adder applied to the HUB PTR++, so does not affect other COGS at all - COGs can all have different values in their Nibble Adders if they want, no interactions.
It does not affect the Rotate scanner either, just 'co-operates' with it in a clever way, to boost bandwidth.
The user chooses the Adder value, (can be 1, to look like a 'normal' adder), but it can also be any other Nibble value, to give slots that are just-in-time matching to any tight code loop you care to design.
It is just like rewiring the lower 4 address lines through a lookup table.
Of course other cogs accessing this info would also have to be aware of this, because hub addressing (lowest nibble) may not be contiguous as stored in hub.
Is this what you mean?
That's very close. It's using an adder, not a lookup table ( but a lookup table can be an adder )
Yes, COGS using this would need to use the same Base address on their Buffer and the same adder value.
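A small Python model of that agreement between writer and reader, for one 16-entry block. This is a sketch only; the helper names are invented, and it assumes an odd adder so every nibble is used once.

```python
def scrambled_order(start_nibble: int, adder: int):
    """Physical nibble used for each of the 16 logical positions in one block."""
    return [(start_nibble + (i + 1) * adder) % 16 for i in range(16)]

def write_block(data, start_nibble, adder):
    """Store 16 values in adder order, as the writing COG would."""
    block = [None] * 16
    for value, nibble in zip(data, scrambled_order(start_nibble, adder)):
        block[nibble] = value
    return block

def read_block(block, start_nibble, adder):
    """Fetch 16 values in adder order, as the reading COG would."""
    return [block[n] for n in scrambled_order(start_nibble, adder)]

data = list(range(100, 116))
block = write_block(data, 0xA, 7)
print(block)                      # stored order is scrambled
print(read_block(block, 0xA, 7))  # same base and adder give the data back in order
```

A reader using a different adder or start nibble sees the same values in the wrong order, which is why both sides have to share the Base and the adder.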
It can address the whole HUB memory this way, with the right Nibble-Carry, and so is suited to many apps that need to stream data quickly (where the video HW may not quite fit - eg Read is ??tbf?? via Video )
Example: Hopefully with REPS, Pin-info can Stream capture to memory at SysCLK/3 from a single COG
3 Phased COGS would sample at SysCLK.
That is quite close to what the earlier HUB Mapping table could deliver, which would have done this in 2 COGs, but with BW impact on other COGs
The NEW HubRotator means 3 COGS can stream in at 200MHz and the other COGS do not even know they are there..
I guess 3 more COGS could stream OUT data at 200MHz at the same time.
The more limited BW before could not touch that - and there are still 10 COGs with BW left !
Without this Nibble-Adder, those impressive numbers plummet, but a 4 bit adder (optional) has very low cost.
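The phased-COG arithmetic, as a quick Python check. It assumes each COG samples every 3 SysCLK, with the three COGs started one clock apart, and takes the 200MHz SysCLK figure from above. Function names are just for the example.

```python
def sample_clocks(cog_phase: int, n_cogs: int, samples: int):
    """SysCLK ticks at which one COG samples when running at SysCLK / n_cogs."""
    return [cog_phase + n_cogs * i for i in range(samples)]

def aggregate_rate_mhz(sysclk_mhz: float, clocks_per_sample: int, n_cogs: int) -> float:
    """Combined sample rate of n_cogs phased COGs, each sampling every clocks_per_sample."""
    return sysclk_mhz * n_cogs / clocks_per_sample

ticks = sorted(t for phase in range(3) for t in sample_clocks(phase, 3, 4))
print(ticks)                          # every SysCLK tick is sampled by exactly one COG
print(aggregate_rate_mhz(200, 3, 3))  # three COGs at SysCLK/3 each
```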
Correct, they have no idea any of this is going on. They can be sending blocks back and forth.
Fingers crossed REPS can make this very compact loop stuff fly, as the Nibble adder can deliver the BW
* even ones are less lucky, and more sparse - such cases can still be useful, for things like Video, where you reset the base index every line.
Examples of Base-Index restarts, and following bursts and interleaves, Even-N values
Odd-N values are relatively simple.- see the first table in #13. Even-N was more of a challenge.
I have updated the #13 table 2, to reflect the Even-N coding iterations.
The Even-N in table 1 shows the sparse issue, and table 2 shows how to manage Even-N nibble adder, so it is no longer sparse. ie fully scans all memory addresses in a page, before moving to the next page.
A few iterations were needed to handle the finer details, so that
* All Values of N have the same delay from launch
* A simple means to re-sync even cases to even-pitch the 2N average - can be at the pin-register.
End Result :
Support for Video Streaming at fSys/N where Odd and Even are supported, well suited to DMA use, (and not using a lot of flops.)
Logic cost is 2 Nibble adders, and a handful of 4 bit counters.
P&R reports 38 Slices, 70 LUT4's (includes wrapper test dummies) and 337.838MHz
This all swallows into a 'Smarter Adder' block that is used by DMA-HW when doing Rotate-Streaming, @ Fsys/N, and the same verilog code is used by a special WRVID Reg, PTR++ (that ensures SW write and HW read are both linear to the user )
The 'Smarter Adder' can also work with SW Write and SW-Read cases, but some details mean that is probably best limited to Odd-N.
For SW-Video modes, that have to use Even-N, (due to clock constraints), another simple choice is to change coding to vary PreLoad by Scan line.
Each line is sparse, but shifted until M lines use all addresses.
HW Video modes for Even-N would most likely be used, and they do not have to employ this Scan Line (but they could).
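The PreLoad-by-Scan-line idea can be checked with a few lines of Python. This sketch assumes the PreLoad shifts by 1 on each new line, which is my guess at the simplest scheme; the function names are invented.

```python
from math import gcd

def line_nibbles(preload: int, adder: int):
    """Set of nibbles one scan line touches with a fixed preload and adder."""
    return {(preload + i * adder) % 16 for i in range(16)}

def lines_for_full_coverage(adder: int) -> int:
    """M: number of shifted lines needed before every nibble has been used."""
    return gcd(adder % 16, 16)

def covered_after_lines(adder: int, lines: int):
    """Union of nibbles used when the preload shifts by 1 per line (assumed)."""
    covered = set()
    for line in range(lines):
        covered |= line_nibbles(line, adder)
    return covered

for adder in (2, 4, 6, 8):
    m = lines_for_full_coverage(adder)
    print(adder, m, len(covered_after_lines(adder, m)))
```

For an adder of 2 each line uses 8 nibbles and two lines fill the page; for 4 it takes four lines, and for 8 it takes eight.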
I've looked into a Tag Memory-FIFO, and cannot find a case where it needs to be > 16, so I think this can reduce in size, and also fit into Simple Dual Port Memory. (no MUX per stage needed, just DPRAM Wr & Rd)
All of which saves important die Space
I coded a simulator for the data-flow decisions, and allow for a separate read-trigger which allows 'arming' the FIFO, for any-cycle precise start of the Streaming to Pins/DAC/LUT
This revealed one special case where fSys/1 and one phase of Start/Read gives issues with Tag management, as here Wr and Rd are on the same clock edge.
It looks like that can be solved with a pass-thru mode, so all phases and fSys/N seem to be ok.
It codes with a full-size HUB Address that is preloaded and conditionally INCs, whose 4 LSBs address the RPRAM_WR, and a 5b Read Address that uses its MSB for compare while its 4 LSBs address the RPRAM_RD; the smaller Read Address INCs at fSys/N.