If we can ever make a P3, might as well make it 64 bits and have 3 larger register fields: two sources and one destination, with several fast indirect registers with auto inc/dec. Lots of things could be done at 22nm, or so.
@cgracey said:
If we can ever make a P3, might as well make it 64 bits and have 3 larger register fields: two sources and one destination, with several fast indirect registers with auto inc/dec. Lots of things could be done at 22nm, or so.
Lol, 3 x 16-bit. 64 kWords (512 kBytes) of cogRAM.
Oh, so the reduced number of addressable special registers in the Prop2 cogs helps make that faster compared to the Prop1 then?
EDIT: And that would mean a round power-of-two will be ideal number of addressable special registers. Maybe minus one to allow SRAM to be one of the mux sources.
The way things work with timing paths, powers-of-two muxes are not necessarily best, because of the varying distances signals must cross. The wiring/buffer delays can easily overshadow a 2-to-1 mux propagation delay.
@evanh said:
All the special registers can be packed in tight. They're dedicated to serving the ALU.
It's a huge tug-of-war. The placer/router spends many hours trying to meet the timing requirement. There are many paths competing for the shortest routes. Everything can't be in the same place, but must get spread out across the die, of course.
cogRAM is certainly a long way out, and dimensionally large, in comparison. You're not going to get further away than that. Each ALU is going to be on one side of each SRAM block likely.
And the associated lutRAM block has to be nearby too. I can see why that needed an extra stage. Gives higher priority to placement around cogRAM.
@evanh said:
cogRAM is certainly a long way out, and dimensionally large, in comparison. You're not going to get further away than that. Each ALU is going to be on one side of each SRAM block likely.
And the associated lutRAM block has to be nearby too. I can see why that needed an extra stage. Gives higher priority to placement around cogRAM.
Yes, the logic of each cog grouped right around its own cogRAM and lutRAM. All the RAMs were spread around the perimeter of the cell area, creating two axes of symmetry. This allowed the placer/router a big open area in the center to assemble and wire the ~600k cells.
I feel like just adding a signed version of QMUL and a way to get the middle 32 bits of the result in one instruction would significantly speed up fixed-point vector / DSP tasks
I just realised that if QX/QY were mapped to cogRAM addresses then pushing these results over the hub-op bus is complicated by the fact that there are two results at once. They'd then have to arrive via time-staggered updates from the Cordic registers to the Cog registers, which makes the whole scheme all the more complicated. Same applies to writing to lutRAM.
Fixed point QMUL should work just by fetching the high 32 bits with GETQY.
@evanh said:
Fixed point QMUL should work just by fetching the high 32 bits with GETQY
No, it doesn't. That gets you the integer portion of the result only (assuming 16.16 inputs). And if the operands were signed, it's wrong to boot (unsigned and signed multiplication differ only in the upper half of the result)
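To make the disagreement concrete, here is a minimal Python model of a 32 x 32 -> 64 bit multiply. It is only a sketch: `mul64` and `middle32` are made-up helper names, with the low and high halves standing in for what GETQX and GETQY return after a QMUL. It shows that the high word alone is just the integer part of a 16.16 product, and that signed and unsigned products share the same low word.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mul64(a, b, signed=False):
    """Return (low32, high32) of the 64-bit product of two 32-bit patterns."""
    if signed:
        a, b = to_signed32(a), to_signed32(b)
    else:
        a, b = a & MASK32, b & MASK32
    p = (a * b) & MASK64
    return p & MASK32, p >> 32

def middle32(lo, hi):
    """Bits [47:16] of the 64-bit product: the 16.16 result word."""
    return ((hi << 16) | (lo >> 16)) & MASK32

# 1.5 * 2.25 = 3.375 in unsigned 16.16
lo, hi = mul64(0x00018000, 0x00024000)
mid = middle32(lo, hi)          # 0x00036000, i.e. 3.375
# hi on its own is 3: the integer part only

# -1.5 * 2.25: same low word either way, different high word
ulo, uhi = mul64(0xFFFE8000, 0x00024000, signed=False)
slo, shi = mul64(0xFFFE8000, 0x00024000, signed=True)
smid = middle32(slo, shi)       # 0xFFFCA000, i.e. -3.375
```

With only the unsigned result to hand, the middle word of a signed multiply comes out wrong, because bits [47:32] come from the high half.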
16.16 squared won't all fit in anything other than 64 bits. If you want all the high bits then GETQY is it.
Like floats, signed is extra logic handling. SHL fpn,#1 WC to extract the sign and prep the unsigned value. And RCR fpn,#1 to reconstruct. Of course this isn't two's complement any longer. There are two zeros, for starters.
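A rough Python model of that SHL/RCR pair, assuming SHL #1 WC puts the old bit 31 into C and RCR #1 rotates C back into bit 31. The function names are invented for the sketch. It treats the word as sign-magnitude, which is where the two zeros come from.

```python
MASK32 = 0xFFFFFFFF

def shl1_wc(x):
    """Model of SHL x,#1 WC: returns (shifted value, carry = old bit 31)."""
    x &= MASK32
    return (x << 1) & MASK32, (x >> 31) & 1

def rcr1(x, c):
    """Model of RCR x,#1: carry rotates into bit 31, the rest shifts right."""
    return (((c & 1) << 31) | ((x & MASK32) >> 1)) & MASK32

# Split a sign-magnitude word into sign and left-aligned magnitude, then rebuild it
fpn = 0x80018000                 # sign bit set, magnitude 0x00018000
mag, sign = shl1_wc(fpn)         # mag = 0x00030000, sign = 1
rebuilt = rcr1(mag, sign)        # back to 0x80018000

# Both of these patterns have a magnitude of zero
pos_zero = shl1_wc(0x00000000)   # (0, 0)
neg_zero = shl1_wc(0x80000000)   # (0, 1)
```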
@ersmith said:
In my ideal P3 you'd be able to fetch any 32 bits of the full 64 bit multiply result, something like:
setq2 #16 ' shift result right by 16
getqx ans ' now ans = correct value for 16.16 multiplication
Or make it just one double operand instruction.
getqz hi32,#32
getqz mid32,#16
getqz lo32,#0
Problem is it likely requires a separate barrel shifter. And those are about as bulky as an integer multiply.
EDIT: It would be interesting to know if the ALU's shifter circuits could be directed at this, and therefore GETQZ wouldn't need a separate shifter. It would take longer to fetch for sure. And then all 64 bits are input data, which is not the case for the ALU shifter ops.
The shifter could be part of the coprocessor and use the hitherto unused SETQ input to QMUL, which could also enable signed mode with bit 5. That seems like the simplest to implement.
' normal, unsigned integer mul
setq #0
qmul a,b
getqx result
' unsigned, discard 16 LSBs
setq #16
qmul a,b
getqx result
' signed multiply (QX identical to unsigned...)
setq #32
qmul a,b
getqx result
' signed, discard 16 LSBs -> 16.16 multiply
setq #48
qmul a,b
getqx result
' signed, discard 24 LSBs -> 8.24 multiply
setq #56
qmul a,b
getqx result
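For anyone wanting to check the expected numbers, here is a small Python model of that proposal. It is purely hypothetical, since no such QMUL mode exists on the P2: it assumes bits 4..0 of the SETQ value give the number of LSBs to discard and bit 5 selects signed mode, matching the listing above. `qmul_proposed` is a made-up name.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def qmul_proposed(a, b, q=0):
    """Hypothetical QMUL + GETQX with a SETQ mode word.

    q[4:0] = number of low bits to discard, q[5] = signed multiply.
    Returns bits [shift+31 : shift] of the 64-bit product.
    """
    shift = q & 0x1F
    if q & 0x20:
        a, b = to_signed32(a), to_signed32(b)
    else:
        a, b = a & MASK32, b & MASK32
    p = (a * b) & MASK64
    return (p >> shift) & MASK32

plain   = qmul_proposed(3, 5, 0)                      # ordinary integer multiply
u16_16  = qmul_proposed(0x00018000, 0x00024000, 16)   # 1.5 * 2.25, unsigned 16.16
s16_16  = qmul_proposed(0xFFFE8000, 0x00024000, 48)   # -1.5 * 2.25, signed 16.16
s8_24   = qmul_proposed(0xFE800000, 0x02000000, 56)   # -1.5 * 2.0, signed 8.24
```

With only five shift bits the top 32 bits (a shift of 32) can't be selected this way, so in this sketch GETQY would still be needed for that.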
Nice. It is a multiply specific requirement. Presumably it's still a barrel shifter though. But at least it is cleanly tucked in the Cordic's single pipeline then.
Ada,
I understand what you're saying now about 16.16 fixed point multiplies. It had never sunk in what, if anything, is different about fixed-point vs integer before. But it is exactly what you've requested of QMUL: finding the best way to retain word size. Which comes down to where the fixed point position is. Duh!
With integers, 32b x 32b = 64b and the fixed point position stays at bit 0 as the word size increases. Therefore, to retain a 32-bit word size, the least significant 32 bits are kept, [63:0] -> [31:0]. So integers truly are a subset of fixed point. This detail hadn't been obvious to me. I'd seen the similarity before but wasn't sure.
That obviously leads to 16.16 x 16.16 = 32.32 [63:0]. Then picking 32 bits, for the result word, we want fixed point at bit 16, [63:0] -> [47:16]. This had seemed somewhat arbitrary to me previously.
A good way to intuit about fixed point is to treat it a bit like physical units. Imagine you have an integer representing some length in meters. We need more precision, so we go down to centimeters. There are 100 cm in 1 m, so we can simply add two zeroes to our meter value. But what happens if we multiply some lengths? We get an area in cm². And there are 10000 cm² in 1 m², because the unit constant got multiplied alongside the actual values.
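The units analogy translates directly into code. This is just an illustrative Python sketch with made-up helper names (`to_fix`, `fix_mul`): the scale factor gets squared by the multiply, so one factor has to be divided back out, and with zero fraction bits it collapses to a plain integer multiply.

```python
def to_fix(x, frac_bits=16):
    """Convert a real number to fixed point with the given fraction bits."""
    return int(round(x * (1 << frac_bits)))

def fix_mul(a, b, frac_bits=16):
    """Multiply two fixed-point values and drop the extra scale factor."""
    return (a * b) >> frac_bits

# Centimetres: 2.5 m x 4.0 m = 250 cm x 400 cm = 100000 cm^2
area_cm2 = 250 * 400
area_m2 = area_cm2 // (100 * 100)      # the unit constant got squared too

# Same thing in 16.16: the raw product carries 2^32 as its scale
a = to_fix(2.5)
b = to_fix(4.0)
raw = a * b                            # 32.32 product
area = fix_mul(a, b)                   # back to 16.16

# Integers are the frac_bits = 0 case: nothing to shift out
int_product = fix_mul(6, 7, frac_bits=0)
```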
So retaining two's complement encoding is desirable then. It should allow direct use of integer summing instructions on fixed-point numbers. No shuffling of bits. C/Z flags still function the same.
EDIT: In the case of fetching QMUL results, I think that's one extra instruction - A SUB #1 when negative.
Working out the sign might be a few instructions though. As well as the signs of the multipliers, bit15 of QY is integral to unsigned circular wrapping, aka carry/borrow.
EDIT2: Okay, circular msb (bit15 of QY) shouldn't be included in sign logic. Had to mull that one over. And it'll naturally appear set on odd rotations of any overflowing multiplication. That's suitable for unsigned.
The trick is to not overwrite it with the sign bit when an overflow does happen. Then the msb can serve the dual functions desired .... or not, the C/Z flags aren't going to match with summing this way though ...
C always holds the sign of the result, even during overflow.
7fffffff(reg1) - ffffffff(reg2) = 80000000(result)
Collected flags = 000000a0
CMP reg1,reg2: C = 1, Z = 0
SUB reg1,reg2: C = 1, Z = 0
CMPS reg1,reg2: C = 0, Z = 0
SUBS reg1,reg2: C = 0, Z = 0
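A small Python model that reproduces the flag results listed above. It is a sketch fitted to this one test output, not a statement of how the hardware does it: it assumes unsigned SUB/CMP set C on borrow and signed SUBS/CMPS set C to the true sign of the difference. The function names are invented.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def sub_flags(d, s):
    """Unsigned subtract model: C = borrow, Z = result is zero."""
    result = (d - s) & MASK32
    c = int((d & MASK32) < (s & MASK32))
    z = int(result == 0)
    return result, c, z

def subs_flags(d, s):
    """Signed subtract model: C = true sign of d - s, even on overflow."""
    result = (d - s) & MASK32
    c = int(to_signed32(d) < to_signed32(s))
    z = int(result == 0)
    return result, c, z

unsigned_case = sub_flags(0x7FFFFFFF, 0xFFFFFFFF)   # C = 1, Z = 0
signed_case = subs_flags(0x7FFFFFFF, 0xFFFFFFFF)    # C = 0, Z = 0
```

Note the signed case overflows: the result word 80000000 looks negative, but C = 0 says the true difference (7fffffff minus -1) is positive.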
EDIT4: Hmm, there's probably no circular use for the C flag. Signed and unsigned use of C can be the same: Straight overflow. The difference then is unsigned is 16.16 and C is overflow, whereas signed is 1.15.16 and C is overflow.
So C flag use is different to summing. But maybe that isn't important. As long as the result is encoded as two's complement.
EDIT5: Oh, in that case, when both C and Z are set, it can indicate an underflow. Which means a zero check would be IF_NC_AND_Z
EDIT6: Or, maybe the C flag should just be exception occurred. And have exception code in the result. Codes for overflow, underflow, infinity, negatives of each, and of course the beloved NaN. Z flag can be simple zero result again.
Chip,
I've wanted for a while to clarify a particular detail of the cog execution pipeline. In your ASCII art diagram of the pipeline it isn't completely clear at which points cogRAM addressing occurs. Knowing this would clear things up for me.
So, the question is: is rdRAM the data from cogRAM ... or is rdRAM the address generation for the next cycle, where data then comes from cogRAM and is subsequently latched?
EDIT: Another way to ask: Where does PC address gen fit into the diagram?
Chip,
I feel your attention is required. On behalf of ManAtWork, I wrote a program to test hubRAM efficiency with the streamer reading and a cog doing a block write at the same time. Things went well until the discovery that some streamer rates cause the FIFO to badly hog the hubRAM bus. Notably, it hasn't afflicted byte-wise streamer ops. It's quite inconsistent too. An example is here - https://forums.parallax.com/discussion/comment/1535641/#Comment_1535641
The program in the post after that uses Flexspin with #include, so you can't just drop it into Pnut.
Here's an example run. The second column is the "divider", third column is the NCO value for that divider, fourth column onwards are twelve repeated tests of that config - without restarting the streamer. They all consist of the measured number of sysclock ticks, from GETCT, required to complete a block write of 64 k longwords into hubRAM using a SETQ + WRLONG. Done while the streamer is running as configured by the divider and mode as per implied RFBYTE then RFWORD then RFLONG.
[NOTE: Updated data on next page]
So far I've noted serious excursions with dividers of 3, 5, 6, 7, 9, 10, 11, 17, 25, 33 and 41. Again, only occurs with shortwords and longwords.
@evanh said:
Here's an example run. The second column is the "divider", third column is the NCO value for that divider, fourth column onwards are twelve repeated tests of that config - without restarting the streamer. They all consist of the measured number of sysclock ticks, from GETCT, required to complete a block write of 64 k longwords into hubRAM using a SETQ + WRLONG. Done while the streamer is running as configured by the divider and mode as per implied RFBYTE then RFWORD then RFLONG.
So far I've noted serious excursions with dividers of 3, 5, 6, 7, 9, 10, 11, 17, 25, 33 and 41. Again, only occurs with shortwords and longwords.
What is a 'serious excursion' - does it drop data, or just sometimes take more cycles than hoped ?
Running a NCO and a streamer could be tricky, for non-binary numbers. Not sure why bytes would be unaffected, but maybe the 32:8 fifo effect is enough to smooth things ?
Things went well until discovery that some streamer rates causes the FIFO to badly hog the hubRAM bus.
What does 'badly hog' amount to ?
With hub being MOD 8, and a NCO not being mod 8, I can see some varying waits could be needed, but things should never 100% choke ?
A couple of bad bus hogging examples for two different streamer frequencies: a fast block write of N longs sometimes takes 1.5N and sometimes 9N, or sometimes 1.17N and sometimes 9N.
Comments
If we can ever make a P3, might as well make it 64 bits and have 3 larger register fields: two sources and one destination, with several fast indirect registers with auto inc/dec. Lots of things could be done at 22nm, or so.
ALU critical paths right?
Lol, 3 x 16-bit. 64 kWords (512 kBytes) of cogRAM.
That feed the ALU, yes.
Oh, so the reduced number of addressable special registers in the Prop2 cogs helps make that faster compared to the Prop1 then?
EDIT: And that would mean a round power-of-two will be ideal number of addressable special registers. Maybe minus one to allow SRAM to be one of the mux sources.
The way things work with timing paths, powers-of-two muxes are not necessarily best, because of the varying distances signals must cross. The wiring/buffer delays can easily overshadow a 2-to-1 mux propagation delay.
All the special registers can be packed in tight. They're dedicated to serving the ALU.
It's a huge tug-of-war. The placer/router spends many hours trying to meet the timing requirement. There are many paths competing for the shortest routes. Everything can't be in the same place, but must get spread out across the die, of course.
cogRAM is certainly a long way out, and dimensionally large, in comparison. You're not going to get further away than that. Each ALU is going to be on one side of each SRAM block likely.
And the associated lutRAM block has to be nearby too. I can see why that needed an extra stage. Gives higher priority to placement around cogRAM.
Yes, the logic of each cog grouped right around its own cogRAM and lutRAM. All the RAMs were spread around the perimeter of the cell area, creating two axes of symmetry. This allowed the placer/router a big open area in the center to assemble and wire the ~600k cells.
I feel like just adding a signed version of QMUL and a way to get the middle 32 bits of the result in one instruction would significantly speed up fixed-point vector / DSP tasks
I just realised that if QX/QY were mapped to cogRAM addresses then pushing these results over the hub-op bus is complicated by the fact that there are two results at once. They'd then have to arrive via time-staggered updates from the Cordic registers to the Cog registers, which makes the whole scheme all the more complicated. Same applies to writing to lutRAM.
Fixed point QMUL should work just by fetching the high 32 bits with GETQY.
No, it doesn't. That gets you the integer portion of the result only (assuming 16.16 inputs). And if the operands were signed, it's wrong to boot (unsigned and signed multiplication differ only in the upper half of the result)
In my ideal P3 you'd be able to fetch any 32 bits of the full 64 bit multiply result, something like:
and of course there would be both signed and unsigned multiply.
Of course if the P3 is a 64 bit processor then perhaps this becomes moot (unless it has a 64x64 -> 128 bit multiply!)
16.16 squared won't all fit in anything other than 64 bits. If you want all the high bits then GETQY is it.
Like floats, signed is extra logic handling. SHL fpn,#1 WC to extract and prep unsigned value. And RCR fpn,#1 to reconstruct. Of course this isn't two's complement any longer. There's two zeros for starters.
Or make it just one double operand instruction.
Problem is it likely requires a separate barrel shifter. And those are about as bulky as an integer multiply.
EDIT: It would be interesting to know if the ALU's shifter circuits could be directed at this, and therefore GETQZ wouldn't need a separate shifter. It would take longer to fetch for sure. And then all 64 bits are input data, which is not the case for the ALU shifter ops.
Ada's suggestions would have smallest circuit footprint.
The shifter could be part of the coprocessor and use the hitherto unused SETQ input to QMUL, which could also enable signed mode with bit 5. That seems like the simplest to implement.
Nice. It is a multiply specific requirement. Presumably it's still a barrel shifter though. But at least it is cleanly tucked in the Cordic's single pipeline then.
Ada,
I understand what you're saying now about 16.16 fixed point multiplies. It had never sunk in what, if anything, is different about fixed-point vs integer before. But it is exactly what you've requested of QMUL: finding the best way to retain word size. Which comes down to where the fixed point position is. Duh!
With integers, 32b x 32b = 64b and the fixed point position stays at bit 0 as the word size increases. Therefore, to retain a 32-bit word size, the least significant 32 bits are kept, [63:0] -> [31:0]. So integers truly are a subset of fixed point. This detail hadn't been obvious to me. I'd seen the similarity before but wasn't sure.
That obviously leads to 16.16 x 16.16 = 32.32 [63:0]. Then picking 32 bits, for the result word, we want fixed point at bit 16, [63:0] -> [47:16]. This had seemed somewhat arbitrary to me previously.
A good way to intuit about fixed point is to treat it a bit like physical units. Imagine you have an integer representing some length in meters. We need more precision, so we go down to centimeters. There are 100 cm in 1 m, so we can simply add two zeroes to our meter value. But what happens if we multiply some lengths? We get an area in cm². And there are 10000 cm² in 1 m², because the unit constant got multiplied alongside the actual values.
The realisation for me was pure integers fitting the same formula. I'd not fully associated the two means.
So retaining two's complement encoding is desirable then. It should allow direct use of integer summing instructions on fixed-point numbers. No shuffling of bits. C/Z flags still function the same.
EDIT: In the case of fetching QMUL results, I think that's one extra instruction - A SUB #1 when negative.
Working out the sign might be a few instructions though. As well as the signs of the multipliers, bit15 of QY is integral to unsigned circular wrapping, aka carry/borrow.
EDIT2: Okay, circular msb (bit15 of QY) shouldn't be included in sign logic. Had to mull that one over. And it'll naturally appear set on odd rotations of any overflowing multiplication. That's suitable for unsigned.
The trick is to not overwrite it with the sign bit when an overflow does happen. Then the msb can serve the dual functions desired .... or not, the C/Z flags aren't going to match with summing this way though ...
EDIT3: Man, couldn't remember how the Prop2's flags are treated with signed summing. It's not really in the docs. Had to dig up the old testing here - https://forums.parallax.com/discussion/comment/1421452/#Comment_1421452
For signed instructions
EDIT4: Hmm, there's probably no circular use for the C flag. Signed and unsigned use of C can be the same: Straight overflow. The difference then is unsigned is 16.16 and C is overflow, whereas signed is 1.15.16 and C is overflow.
So C flag use is different to summing. But maybe that isn't important. As long as the result is encoded as two's complement.
EDIT5: Oh, in that case, when both C and Z are set, it can indicate an underflow. Which means a zero check would be IF_NC_AND_Z
EDIT6: Or, maybe the C flag should just be exception occurred. And have exception code in the result. Codes for overflow, underflow, infinity, negatives of each, and of course the beloved NaN. Z flag can be simple zero result again.
Chip,
I've wanted for a while to clarify a particular detail of the cog execution pipeline. In your ASCII art diagram of the pipeline it isn't completely clear at which points cogRAM addressing occurs. Knowing this would clear things up for me.
Here's the original:
So, the question is: is rdRAM the data from cogRAM ... or is rdRAM the address generation for the next cycle, where data then comes from cogRAM and is subsequently latched?
EDIT: Another way to ask: Where does PC address gen fit into the diagram?
I suspect it's the latter, ie:
And matching schedule chart: