Comments
If we can ever make a P3, might as well make it 64 bits and have 3 larger register fields: two sources and one destination, with several fast indirect registers with auto inc/dec. Lots of things could be done at 22nm, or so.
ALU critical paths right?
Lol, 3 x 16-bit. 64 kWords (512 kBytes) of cogRAM.
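For anyone checking that arithmetic, here is a quick sketch in Python. The 64-bit word and the three 16-bit register fields come from the posts above; the "left over" figure is just the remainder and not a claim about any real encoding.

```python
# Back-of-envelope sizing for the hypothetical 64-bit P3 instruction word.
WORD_BITS = 64    # proposed register / instruction width
FIELD_BITS = 16   # each of the three register fields (two sources, one destination)

addressable_words = 2 ** FIELD_BITS                 # registers one field can reach
cogram_bytes = addressable_words * WORD_BITS // 8   # cogRAM size if fully populated
leftover_bits = WORD_BITS - 3 * FIELD_BITS          # what remains for opcode, flags, etc.

print(addressable_words // 1024, "kWords")
print(cogram_bytes // 1024, "kBytes")
print(leftover_bits, "bits left over")
```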
That feed the ALU, yes.
Oh, so the reduced number of addressable special registers in the Prop2 cogs helps make that faster compared to the Prop1 then?
EDIT: And that would mean a round power-of-two will be ideal number of addressable special registers. Maybe minus one to allow SRAM to be one of the mux sources.
The way things work with timing paths, powers-of-two muxes are not necessarily best, because of the varying distances signals must cross. The wiring/buffer delays can easily overshadow a 2-to-1 mux propagation delay.
All the special registers can be packed in tight. They're dedicated to serving the ALU.
It's a huge tug-of-war. The placer/router spends many hours trying to meet the timing requirement. There are many paths competing for the shortest routes. Everything can't be in the same place, but must get spread out across the die, of course.
cogRAM is certainly a long way out, and dimensionally large, in comparison. You're not going to get further away than that. Each ALU is going to be on one side of each SRAM block likely.
And the associated lutRAM block has to be nearby too. I can see why that needed an extra stage. Gives higher priority to placement around cogRAM.
Yes, the logic of each cog grouped right around its own cogRAM and lutRAM. All the RAMs were spread around the perimeter of the cell area, creating two axes of symmetry. This allowed the placer/router a big open area in the center to assemble and wire the ~600k cells.
I feel like just adding a signed version of QMUL and a way to get the middle 32 bits of the result in one instruction would significantly speed up fixed-point vector / DSP tasks
I just realised that if QX/QY were mapped to cogRAM addresses then pushing these results over the hub-op bus is complicated by the fact that there are two results at once. The updates from the Cordic registers to the cog registers would then have to be time-staggered, which makes the whole scheme all the more complicated. Same applies to writing to lutRAM.
Fixed point QMUL should work just by fetching the high 32 bits with GETQY.
No, it doesn't. That gets you the integer portion of the result only (assuming 16.16 inputs). And if the operands were signed, it's wrong to boot (unsigned and signed multiplication differ only in the upper half of the result)
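The point about the upper half can be checked with a small Python model. This is only a sketch of the arithmetic, not of the Cordic hardware; `to_signed32` and `halves` are helper names made up for the illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def halves(product):
    """Split a 64-bit product into (low 32, high 32), like QX and QY."""
    product &= MASK64
    return product & MASK32, product >> 32

a = 0xFFFF8000  # -0.5 when read as signed 16.16
b = 0x00038000  # +3.5 in 16.16

u_lo, u_hi = halves(a * b)                            # operands treated as unsigned
s_lo, s_hi = halves(to_signed32(a) * to_signed32(b))  # operands treated as signed

print(f"unsigned: high={u_hi:08x} low={u_lo:08x}")
print(f"signed:   high={s_hi:08x} low={s_lo:08x}")
```

The low halves agree and only the high halves differ, so any result window that reaches above bit 31 depends on whether the multiply was signed.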
In my ideal P3 you'd be able to fetch any 32 bits of the full 64 bit multiply result, something like:
    setq2   #16     ' shift result right by 16
    getqx   ans     ' now ans = correct value for 16.16 multiplication
and of course there would be both signed and unsigned multiply.
Of course if the P3 is a 64 bit processor then perhaps this becomes moot (unless it has a 64x64 -> 128 bit multiply!)
16.16 squared won't all fit in anything other than 64 bits. If you want all the high bits then GETQY is it.
Like floats, signed is extra logic handling. SHL fpn,#1 WC to extract and prep the unsigned value, and RCR fpn,#1 to reconstruct. Of course this isn't two's complement any longer. There are two zeros, for starters.
Or make it just one double operand instruction.
    getqz   hi32,#32
    getqz   mid32,#16
    getqz   lo32,#0
Problem is it likely requires a separate barrel shifter. And those are about as bulky as an integer multiply.
EDIT: It would be interesting to know if the ALU's shifter circuits could be directed at this, and therefore GETQZ wouldn't need a separate shifter. It would take longer to fetch for sure. And then all 64 bits are input data, which is not the case for the ALU shifter ops.
Ada's suggestions would have smallest circuit footprint.
The shifter could be part of the coprocessor and use the hitherto unused SETQ input to QMUL, which could also enable signed mode with bit 5. That seems like the simplest to implement.
    ' normal, unsigned integer mul
    setq    #0
    qmul    a,b
    getqx   result
    ' unsigned, discard 16 LSBs
    setq    #16
    qmul    a,b
    getqx   result
    ' signed multiply (QX identical to unsigned...)
    setq    #32
    qmul    a,b
    getqx   result
    ' signed, discard 16 LSBs -> 16.16 multiply
    setq    #48
    qmul    a,b
    getqx   result
    ' signed, discard 24 LSBs -> 8.24 multiply
    setq    #56
    qmul    a,b
    getqx   result
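To make the proposed encoding concrete, here is a behavioural sketch in Python. It models the suggestion only, not the existing silicon: the function name `qmul_with_setq` is invented for the example, and it assumes bits 4..0 of the SETQ value give the number of LSBs to discard while bit 5 selects signed mode, as in the listing above.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def qmul_with_setq(a, b, q=0):
    """Value GETQX would return under the proposed SETQ-controlled QMUL.

    Assumed layout (hypothetical): q[4:0] = LSBs to discard, q[5] = signed.
    """
    shift = q & 0x1F
    signed = bool(q & 0x20)
    a &= MASK32
    b &= MASK32
    if signed:
        a, b = to_signed32(a), to_signed32(b)
    product = (a * b) & MASK64          # full 64-bit result, two's complement
    return (product >> shift) & MASK32  # 32-bit window starting at bit `shift`

print(hex(qmul_with_setq(3, 5, 0)))                     # plain integer multiply
print(hex(qmul_with_setq(0x00018000, 0x00024000, 16)))  # unsigned 16.16: 1.5 * 2.25
print(hex(qmul_with_setq(0xFFFF8000, 0x00038000, 48)))  # signed 16.16: -0.5 * 3.5
print(hex(qmul_with_setq(0x00800000, 0xFE000000, 56)))  # signed 8.24: 0.5 * -2.0
```

Keeping the shift inside the multiply path means the cog still issues a single GETQX, which seems to be the attraction of the scheme.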
Nice. It is a multiply specific requirement. Presumably it's still a barrel shifter though. But at least it is cleanly tucked in the Cordic's single pipeline then.
Ada,
I understand what you're saying now about 16.16 fixed point multiplies. It had never sunk in what, if anything, is different about fixed-point vs integer before. But it is exactly what you've requested of QMUL: finding the best way to retain word size. Which comes down to where the fixed point position is. Duh!
With integers, 32b x 32b = 64b and the fixed point position stays at bit 0 as the word size increases. Therefore, to retain a 32-bit word size, the least significant 32 bits are kept, [63:0] -> [31:0]. So integers truly are a subset of fixed point. This detail hadn't been obvious to me. I'd seen the similarity before but wasn't sure.
That obviously leads to 16.16 x 16.16 = 32.32 [63:0]. Then picking 32 bits, for the result word, we want fixed point at bit 16, [63:0] -> [47:16]. This had seemed somewhat arbitrary to me previously.
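As a worked example of that [47:16] selection, here is a Python sketch of how the window can be stitched together from the two halves the existing unsigned QMUL provides. `qmul` here is a plain software stand-in for the instruction plus GETQX/GETQY, and `fixmul_16_16` is a name made up for the illustration; unsigned operands are assumed.

```python
MASK32 = 0xFFFFFFFF

def qmul(a, b):
    """Software stand-in for unsigned QMUL: returns (QX, QY) = (low 32, high 32)."""
    product = (a & MASK32) * (b & MASK32)
    return product & MASK32, product >> 32

def fixmul_16_16(a, b):
    """Unsigned 16.16 multiply: keep bits [47:16] of the 64-bit product."""
    qx, qy = qmul(a, b)
    # Low 16 bits of QY become the integer part, high 16 bits of QX the fraction.
    return ((qy << 16) | (qx >> 16)) & MASK32

x = int(1.5 * 65536)    # 0x00018000
y = int(2.25 * 65536)   # 0x00024000
qx, qy = qmul(x, y)
z = fixmul_16_16(x, y)

print(f"QY={qy:08x} QX={qx:08x} -> 16.16 result {z:08x} = {z / 65536}")
```

GETQY on its own gives only the integer part of the product here, and GETQX on its own gives only fraction bits, which is the complaint that started this sub-thread. The final mask silently drops any integer bits above bit 47, so overflow detection is left out of the sketch.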
A good way to intuit about fixed point is to treat it a bit like physical units. Imagine you have an integer representing some length in meters. We need more precision, so we go down to centimeters. There are 100 cm in 1 m, so we can simply add two zeroes to our meter value. But what happens if we multiply some lengths? We get an area in cm². And there are 10000 cm² in 1 m², because the unit constant got multiplied alongside the actual values.
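The same analogy in a few lines of Python, with made-up lengths. The scale factor of 100 plays the role that 2^16 plays in 16.16, and the final division plays the role of the 16-bit right shift.

```python
SCALE = 100            # centimetres per metre, the "fixed point" scale factor

width_cm = 250         # 2.50 m
height_cm = 120        # 1.20 m

area_cm2 = width_cm * height_cm        # scale is now SCALE * SCALE (cm^2 per m^2)
area_hundredths = area_cm2 // SCALE    # rescale once to get hundredths of a m^2

print(area_cm2, "cm^2")
print(area_hundredths / SCALE, "m^2")
```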
The realisation for me was pure integers fitting the same formula. I'd not fully associated the two means.
So retaining two's complement encoding is desirable then. It should allow direct use of integer summing instructions on fixed-point numbers. No shuffling of bits. C/Z flags still function the same.
EDIT: In the case of fetching QMUL results, I think that's one extra instruction - A SUB #1 when negative.
Working out the sign might be a few instructions though. As well as the signs of the multipliers, bit15 of QY is integral to unsigned circular wrapping, aka carry/borrow.
EDIT2: Okay, circular msb (bit15 of QY) shouldn't be included in sign logic. Had to mull that one over. And it'll naturally appear set on odd rotations of any overflowing multiplication. That's suitable for unsigned.
The trick is to not overwrite it with the sign bit when an overflow does happen. Then the msb can serve the dual functions desired .... or not, the C/Z flags aren't going to match with summing this way though ...
EDIT3: Man, couldn't remember how the Prop2's flags are treated with signed summing. It's not really in the docs. Had to dig up the old testing here - https://forums.parallax.com/discussion/comment/1421452/#Comment_1421452
For signed instructions, C always holds the sign of the result, even during overflow:
    7fffffff(reg1) - ffffffff(reg2) = 80000000(result)
    Collected flags = 000000a0
    CMP  reg1,reg2:  C = 1, Z = 0
    SUB  reg1,reg2:  C = 1, Z = 0
    CMPS reg1,reg2:  C = 0, Z = 0
    SUBS reg1,reg2:  C = 0, Z = 0
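Those four results can be reproduced with a small Python model. It is a sketch written to match the test output quoted above, assuming C is the unsigned borrow for CMP/SUB and the sign of the true (unwrapped) difference for CMPS/SUBS; the function names are invented for the example.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def sub_flags(d, s):
    """Unsigned SUB/CMP model: returns (result, C, Z) with C = borrow."""
    result = (d - s) & MASK32
    return result, int((d & MASK32) < (s & MASK32)), int(result == 0)

def subs_flags(d, s):
    """Signed SUBS/CMPS model: C = sign of the true difference, even on overflow."""
    true_diff = to_signed32(d) - to_signed32(s)
    result = true_diff & MASK32
    return result, int(true_diff < 0), int(result == 0)

reg1, reg2 = 0x7FFFFFFF, 0xFFFFFFFF
for name, fn in (("SUB/CMP", sub_flags), ("SUBS/CMPS", subs_flags)):
    result, c, z = fn(reg1, reg2)
    print(f"{name:9s} result={result:08x} C={c} Z={z}")
```

In the signed case the stored result wraps to 80000000, which looks negative, yet C stays 0 because the true difference is positive.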
EDIT4: Hmm, there's probably no circular use for the C flag. Signed and unsigned use of C can be the same: Straight overflow. The difference then is unsigned is 16.16 and C is overflow, whereas signed is 1.15.16 and C is overflow.
So C flag use is different to summing. But maybe that isn't important. As long as the result is encoded as two's complement.
EDIT5: Oh, in that case, when both C and Z are set, it can indicate an underflow. Which means a zero check would be
IF_NC_AND_Z
EDIT6: Or, maybe the C flag should just be exception occurred. And have exception code in the result. Codes for overflow, underflow, infinity, negatives of each, and of course the beloved NaN. Z flag can be simple zero result again.
Chip,
I've wanted for a while to clarify a particular detail of the cog execution pipeline. In your ASCII art diagram of the pipeline it isn't completely clear at which points cogRAM addressing occurs. Knowing this would clear things up for me.
Here's the original:
                   |                   |                   |                   |                   |                   |                   |
           rdRAM Ib|------+            |           rdRAM Ic|------+            |           rdRAM Id|------+            |           rdRAM Ie|
                   |      |            |                   |      |            |                   |      |            |                   |
           latch Da|--+   +--> rdRAM Db|---------> latch Db|--+   +--> rdRAM Dc|---------> latch Dc|--+   +--> rdRAM Dd|---------> latch Dd|
           latch Sa|--+   +--> rdRAM Sb|---------> latch Sb|--+   +--> rdRAM Sc|---------> latch Sc|--+   +--> rdRAM Sd|---------> latch Sd|
           latch Ia|--+   +--> latch Ib|---------> latch Ib|--+   +--> latch Ic|---------> latch Ic|--+   +--> latch Id|---------> latch Id|
                   |  |                |                   |  |                |                   |  |                |                   |
                   |  +---------------ALU--------> wrRAM Ra|  +---------------ALU--------> wrRAM Rb|  +---------------ALU--------> wrRAM Rc|
                   |                   |                   |                   |                   |                   |                   |
                   |stall/done = 'gox' |                   |stall/done = 'gox' |                   |stall/done = 'gox' |                   |
       'get'       |    done = 'go'    |       'get'       |    done = 'go'    |       'get'       |    done = 'go'    |
So, the question is: is rdRAM the data from cogRAM ... or is rdRAM the address generation for the next cycle - where data then comes from cogRAM and is subsequently latched?
EDIT: Another way to ask: Where does PC address gen fit into the diagram?
I suspect it's the latter, ie:
                   |                   |                   |                   |                   |                   |                   |
            addr Ib|------+            |            addr Ic|------+            |            addr Id|------+            |            addr Ie|
                   |      |            |                   |      |            |                   |      |            |                   |
           stage Da|--+   +-->  addr Db|---------> stage Db|--+   +-->  addr Dc|---------> stage Dc|--+   +-->  addr Dd|---------> stage Dd|
           stage Sa|--+   +-->  addr Sb|---------> stage Sb|--+   +-->  addr Sc|---------> stage Sc|--+   +-->  addr Sd|---------> stage Sd|
           stage Ia|--+   +--> stage Ib|---------> stage Ib|--+   +--> stage Ic|---------> stage Ic|--+   +--> stage Id|---------> stage Id|
                   |  |                |                   |  |                |                   |  |                |                   |
                   |  +--> execute --->|mux --+-> forward  |  +--> execute --->|mux --+-> forward  |  +--> execute --->|mux --+-> forward  |
                   |                   |      |            |                   |      |            |                   |      |            |
                   |                   |      +---> addr Ra|----> write Ra     |      +---> addr Rb|----> write Rb     |      +---> addr Rc|
                   |                   |                   |                   |                   |                   |                   |
                   |stall/done = 'gox' |                   |stall/done = 'gox' |                   |stall/done = 'gox' |                   |
       'get'       |    done = 'go'    |       'get'       |    done = 'go'    |       'get'       |    done = 'go'    |
And matching schedule chart:
--------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
Operands|                   |                   |      Result       |                   |                   |                   |
Fetch(a)|    Execute (a)    |    Execute (a)    |   Writeback (a)   |                   |                   |                   |
--------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
        |    Instruction    |     Operands      |                   |                   |      Result       |                   |
        |     Fetch (b)     |     Fetch (b)     |    Execute (b)    |    Execute (b)    |   Writeback (b)   |                   |
--------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
        |                   |                   |    Instruction    |     Operands      |                   |                   |
        |                   |                   |     Fetch (c)     |     Fetch (c)     |    Execute (c)    |    Execute (c)    |
--------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
        |                   |                   |                   |                   |    Instruction    |     Operands      |
        |                   |                   |                   |                   |     Fetch (d)     |     Fetch (d)     |
--------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
Chip,
I feel your attention is required. On behalf of ManAtWork, I wrote a program to test hubRAM efficiency with the streamer reading and a cog doing a block write at the same time. Things went well until I discovered that some streamer rates cause the FIFO to badly hog the hubRAM bus. Notably, it hasn't afflicted byte-wise streamer ops. It's quite inconsistent too. An example is here - https://forums.parallax.com/discussion/comment/1535641/#Comment_1535641
The program in the post after that uses Flexspin with #include, so you can't just drop it into Pnut.
Here's an example run. The second column is the "divider", third column is the NCO value for that divider, fourth column onwards are twelve repeated tests of that config - without restarting the streamer. They all consist of the measured number of sysclock ticks, from GETCT, required to complete a block write of 64 k longwords into hubRAM using a SETQ + WRLONG. Done while the streamer is running as configured by the divider and mode as per implied RFBYTE then RFWORD then RFLONG.
[NOTE: Updated data on next page]
So far I've noted serious excursions with dividers of 3, 5, 6, 7, 9, 10, 11, 17, 25, 33 and 41. Again, only occurs with shortwords and longwords.
Total smartpins = 64   1111111111111111111111111111111111111111111111111111111111111111
Rev B silicon.  Sysclock 120.0000 MHz
Block Length is 65536
BYTE   2 80000000  87384  87384  87384  87384  87384  87384  87384  87384  87384  87384  87384  87384**
SHORT  2 80000000 131072 131072 131072 131072 131072 131072 131072 131072 131072 131072 131072 131072**
BYTE   3 55555556  98312  98304  98304  98304  98304  98304  98304  98304  98304  98304  98304  98304**
SHORT  3 55555556 589824 589784 589784 589784 589784 589784 589784 589784 589784 589784 589784 589784**
LONG   3 55555556 589760 589784 589784 589784 589784 589784 589784 589784 589784 589784 589784 589784**
BYTE   4 40000000  76720  76728  76720  76728  76720  76728  76720  76728  76720  76728  76720  76728
SHORT  4 40000000  87384  87384  87384  87384  87384  87384  87384  87384  87384  87384  87384  87384**
LONG   4 40000000  87384  87384  87384  87384  87384  87384  87384  87384  87384  87384  87384  87384**
BYTE   5 33333334  81920  81920  81920  81920  81928  81920  81920  81912  81920  81928  81920  81920**
SHORT  5 33333334  81920  81920  81920  81920  81920  81920  81920  81920  81920  81920  81920  81920**
LONG   5 33333334 126024 126024 126024 126024 126024 126024 126024 126024 126024 126024 126024 126024**
BYTE   6 2aaaaaaa  74904  74904  74904  74896  74896  74896  74896  74896  74912  74904  74904  74904
SHORT  6 2aaaaaaa  78640  78648  78640  78640  78648  78640  78648  78640  78640  78648  78640  78648
LONG   6 2aaaaaaa  78640  78648  78640  78648  78640  78648  78640  78648  78640  78648  78640  78648
BYTE   7 24924924  72432  72432  72440  72440  72432  72432  72432  72440  72432  72432  72432  72440
SHORT  7 24924924  91760  91744  91760  91752  91744  91752  91744  91752  91744  91752  91744  91752**
LONG   7 24924924  76464  76456  76464  76456  76464  76456  76464  76456  76464  76456  76464  76456
BYTE   8 20000000  71488  71488  71496  71496  71488  71488  71496  71496  71496  71488  71496  71496
SHORT  8 20000000  76728  76728  76720  76728  76720  76728  76720  76728  76720  76728  76720  76728
LONG   8 20000000  76728  76720  76728  76720  76728  76720  76728  76720  76728  76720  76728  76720
BYTE   9 1c71c71c  70776  70776  70776  70776  70776  70784  70784  70776  70776  70776  70776  70776
SHORT  9 1c71c71c  76928  76936  76936  76936  76928  76936  76936  76936  76928  76936  76936  76936
LONG   9 1c71c71c 589800 589784 589784 589784 589784 589784 589784 589784 589784 589784 589784 589784**
BYTE  10 1999999a  70216  70216  70224  70224  70216  70216  70216  70224  70216  70216  70216  70224
SHORT 10 1999999a  81928  81920  81920  81920  81920  81912  81920  81920  81920  81920  81912  81920**
LONG  10 1999999a  81920  81920  81912  81920  81920  81920  81920  81912  81920  81920  81920  81920**
BYTE  11 1745d174  69768  69768  69760  69760  69760  69768  69768  69768  69760  69760  69760  69768
SHORT 11 1745d174  90104  90120  90112  90120  90112  90120  90112  90120  90112  90120  90112  90120**
LONG  11 1745d174  90112  90120  90112  90120  90112  90120  90112  90120  90112  90120  90112  90120**
BYTE  12 15555556  69392  69392  69392  69392  69392  69392  69392  69384  69392  69392  69392  69392
SHORT 12 15555556  74896  74904  74904  74904  74904  74904  74904  74896  74896  74896  74896  74896
LONG  12 15555556  74896  74896  74904  74904  74896  74904  74904  74904  74896  74896  74896  74896
BYTE  13 13b13b14  69080  69080  69072  69080  69072  69080  69080  69080  69080  69072  69080  69080
SHORT 13 13b13b14  73032  73024  73032  73024  73032  73024  73032  73024  73024  73024  73024  73024
LONG  13 13b13b14  73024  73024  73032  73024  73032  73024  73032  73024  73032  73024  73032  73024
BYTE  14 12492492  68808  68808  68816  68808  68808  68816  68816  68808  68816  68816  68808  68816
SHORT 14 12492492  72432  72432  72432  72440  72432  72432  72440  72440  72432  72432  72432  72440
LONG  14 12492492  72440  72440  72432  72432  72440  72440  72432  72432  72432  72440  72432  72432
BYTE  15 11111112  68584  68584  68584  68584  68584  68584  68584  68584  68584  68584  68584  68584
SHORT 15 11111112  75616  75624  75624  75616  75616  75616  75624  75616  75616  75616  75616  75624
LONG  15 11111112  75616  75624  75616  75624  75616  75616  75616  75624  75624  75616  75616  75616
BYTE  16 10000000  68384  68384  68392  68384  68384  68384  68384  68384  68384  68392  68384  68384
SHORT 16 10000000  71496  71488  71488  71496  71496  71496  71488  71496  71496  71496  71488  71496
LONG  16 10000000  71496  71488  71488  71496  71496  71496  71488  71496  71496  71496  71488  71496
BYTE  17 0f0f0f10  68208  68208  68216  68208  68208  68216  68208  68208  68216  68216  68208  68208
SHORT 17 0f0f0f10 123784 123792 123784 123776 123776 123776 123776 123776 123776 123776 123776 123776**
LONG  17 0f0f0f10  71112  71112  71112  71112  71112  71112  71112  71120  71120  71112  71112  71112
BYTE  18 0e38e38e  68056  68056  68056  68056  68056  68056  68056  68056  68056  68056  68056  68056
SHORT 18 0e38e38e  70776  70784  70784  70776  70776  70776  70776  70776  70784  70784  70776  70776
LONG  18 0e38e38e  70776  70776  70776  70776  70784  70784  70776  70776  70776  70776  70776  70784
BYTE  19 0d79435e  67920  67920  67920  67920  67912  67920  67920  67920  67920  67920  67920  67920
SHORT 19 0d79435e  77824  77824  77824  77824  77824  77816  77824  77824  77824  77824  77824  77816
LONG  19 0d79435e  77824  77824  77824  77824  77824  77824  77824  77832  77824  77816  77824  77824
BYTE  20 0ccccccc  67800  67800  67792  67792  67800  67792  67792  67800  67800  67792  67792  67800
SHORT 20 0ccccccc  70224  70216  70216  70216  70216  70216  70216  70216  70224  70216  70216  70216
LONG  20 0ccccccc  70224  70216  70216  70216  70224  70216  70216  70216  70224  70216  70216  70216
BYTE  21 0c30c30c  67688  67688  67688  67688  67688  67688  67688  67688  67680  67680  67680  67680
SHORT 21 0c30c30c  76472  76456  76456  76456  76456  76456  76472  76456  76464  76456  76456  76456
LONG  21 0c30c30c  69976  69984  69976  69976  69984  69976  69976  69984  69976  69976  69984  69976
BYTE  22 0ba2e8ba  67584  67584  67584  67584  67584  67584  67592  67584  67584  67584  67584  67584
SHORT 22 0ba2e8ba  69760  69768  69768  69768  69760  69760  69760  69768  69768  69768  69760  69760
LONG  22 0ba2e8ba  69760  69768  69768  69768  69760  69760  69760  69760  69768  69768  69768  69760
BYTE  23 0b21642c  67488  67488  67496  67496  67488  67488  67496  67496  67488  67496  67496  67488
SHORT 23 0b21642c  71776  71784  71776  71776  71776  71784  71784  71776  71784  71776  71776  71776
LONG  23 0b21642c  71776  71776  71776  71776  71776  71784  71784  71776  71776  71776  71776  71776
BYTE  24 0aaaaaaa  67408  67416  67408  67408  67408  67408  67408  67408  67408  67408  67416  67408
SHORT 24 0aaaaaaa  69384  69392  69392  69392  69392  69392  69392  69392  69392  69384  69392  69392
LONG  24 0aaaaaaa  69392  69392  69392  69392  69384  69392  69392  69392  69392  69392  69392  69392
BYTE  25 0a3d70a4  67328  67328  67328  67328  67336  67328  67336  67328  67336  67328  67336  67328
SHORT 25 0a3d70a4  96368  96376  96376  96376  96384  96376  96376  96376  96376  96376  96376  96376**
LONG  25 0a3d70a4  96376  96368  96376  96368  96376  96368  96376  96376  96376  96376  96376  96376**
[removed out-of-date info]
What is a 'serious excursion' - does it drop data, or just sometimes take more cycles than hoped ?
Running a NCO and a streamer could be tricky, for non-binary numbers. Not sure why bytes would be unaffected, but maybe the 32:8 fifo effect is enough to smooth things ?
What does 'badly hog' amount to ?
With hub being MOD 8, and a NCO not being mod 8, I can see some varying waits could be needed, but things should never 100% choke ?
A couple of bad bus hogging examples for two different streamer frequencies: a fast block write of N longs sometimes takes 1.5N and sometimes 9N, or sometimes 1.17N and sometimes 9N.
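To put numbers on those multiples, here is a short Python sketch that divides a few tick counts from the example run earlier in the thread by the 65536-long block length. The sample picks are arbitrary, the row labels are copied as printed in that run, and that run is flagged as superseded by updated data, so treat the output as illustrative only.

```python
BLOCK_LONGS = 65536  # block write length from the test run, in longwords

# (mode, divider) -> measured sysclock ticks, copied from the run as printed
samples = {
    ("BYTE", 3): 98304,
    ("SHORT", 3): 589784,
    ("BYTE", 4): 76720,
    ("LONG", 9): 589784,
}

# Ticks per longword written; 1.0 would mean the cog never waited on the FIFO.
ratios = {key: ticks / BLOCK_LONGS for key, ticks in samples.items()}

for (mode, divider), ratio in ratios.items():
    print(f"{mode:5s} divider {divider:2d}: {ratio:.2f} ticks per long")
```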