I just realized that putting extra stages into the pipelined CORDIC to compensate for K is probably a waste. If compensation is needed, it can be just as quickly done with MULS in the cog, without needing extra hardware. That would drop the stages from 20 to 16 for 16-bit results.
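To put a number on the K compensation: after n rotation stages, a CORDIC result is scaled by the product of sqrt(1 + 2^-2i). The Python sketch below computes that gain for 16 stages and removes it with one signed multiply and a shift, roughly the job MULS would do in the cog. The 0.16 fixed-point constant and the function names are assumptions for illustration, not anything from the actual design.

```python
import math

def cordic_gain(stages):
    """Product of sqrt(1 + 2^-2i) for i = 0 .. stages-1."""
    k = 1.0
    for i in range(stages):
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k

STAGES = 16
K = cordic_gain(STAGES)                      # about 1.64676
INV_K_Q16 = round((1.0 / K) * (1 << 16))     # 1/K as a 0.16 fixed-point constant

def compensate(raw):
    """Scale a raw (uncompensated) signed CORDIC output by 1/K.

    One signed multiply plus a shift; Python's >> is arithmetic for negatives.
    """
    return (raw * INV_K_Q16) >> 16
```

A raw value of 16468 (that is, 10000 scaled up by K) comes back as 10000 after compensation.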
One area where 16 bits might not be enough, and where 24 would certainly be sufficient, would be in positional control for CNC systems.
I was trying to think of anything in the world (including the world itself) whose angle you'd need to resolve, and couldn't come up with a case needing more than 24 bits of angle information.
Same thing with the angle of a shaft on something human-scale (a robotic arm or something).
Likewise anything frequency generation related in the hundreds of megahertz range.
Not that I'm saying I need 24 bits or have any specific application. Just that this is a sane upper limit.
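As a rough check of the "sane upper limit" argument, this Python snippet works out how much arc one angle step covers for a few bit widths. The Earth circumference is an approximate equatorial value and the 1 m arm radius is a made-up example, so treat the output as order-of-magnitude only.

```python
import math

EARTH_CIRCUMFERENCE_M = 40_075_000.0   # approximate equatorial value (assumption)
ARM_RADIUS_M = 1.0                     # hypothetical human-scale arm

def arc_per_step(circumference_m, bits):
    """Arc length covered by one step when a full turn is split into 2**bits steps."""
    return circumference_m / (1 << bits)

for bits in (16, 24, 32):
    earth = arc_per_step(EARTH_CIRCUMFERENCE_M, bits)
    arm = arc_per_step(2 * math.pi * ARM_RADIUS_M, bits)
    print(f"{bits} bits: {earth:10.3f} m on Earth, {arm * 1e6:10.3f} um at 1 m radius")
```

With these inputs, 16 bits resolves about 611 m on the Earth's surface and 24 bits about 2.4 m; on the 1 m arm, 24 bits is well under a micron of arc.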
For those cases, are there higher-precision but slower ways to calculate? I.e., what precision and granularity do the sine and cosine mathops give, or would this need a floating-point library to get more precision?
You could always perform a CORDIC in software at whatever resolution you needed, at a cost of about 6 instructions per iteration.
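For anyone who hasn't written one, a software CORDIC at arbitrary resolution looks roughly like the Python sketch below (rotation mode, integer arithmetic only inside the loop). It's a minimal sketch, not the P1+ implementation: the function name, the `bits` parameter, and the choice to pre-scale by 1/K are all assumptions, and a real version would usually carry a few guard bits.

```python
import math

def cordic_sincos(angle_rad, bits=24):
    """Rotation-mode CORDIC in integer arithmetic; valid for |angle| <= pi/2.

    Returns (cos, sin) as floats. `bits` sets both the fraction width and
    the iteration count.
    """
    one = 1 << bits
    atan_table = [round(math.atan(2.0 ** -i) * one) for i in range(bits)]
    gain = 1.0
    for i in range(bits):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x = round(one / gain)      # pre-scale so the result needs no K correction
    y = 0
    z = round(angle_rad * one)
    for i in range(bits):
        # Per iteration: a sign test, two shift-and-add/subtracts, one table add/subtract.
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - atan_table[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + atan_table[i]
    return x / one, y / one
```

Calling `cordic_sincos(0.5, bits=24)` agrees with `math.cos(0.5)` and `math.sin(0.5)` to within a few parts in 10^6; dropping to `bits=16` costs proportionally more error.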
IEEE 32-bit floating point gives 24 bits of precision. On that basis, I reckon I agree with those who say that 24 bits of precision would be sufficient.
Ross.
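The 24-bit figure is easy to check from Python by round-tripping values through a 32-bit float with the standard `struct` module (the helper name here is just for illustration):

```python
import struct

def to_float32(value):
    """Round a Python float to the nearest IEEE 754 single and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

# 24-bit significand: every integer up to 2**24 is exact, 2**24 + 1 is not.
assert to_float32(float(2**24 - 1)) == 2**24 - 1
assert to_float32(float(2**24)) == 2**24
assert to_float32(float(2**24 + 1)) != 2**24 + 1
```

The significand is 23 stored bits plus the implicit leading one, which is where the 24 comes from.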
Yes, but I think Chip was saying 16 bits is easy and 24 bits is in the harder basket, so it would help to get some indication of the speeds of the 16-bit (HW?) and 24-bit (HW and some SW?) versions of the conversions.
There could also be a combination: higher precision for way-point style decisions, and interpolation between those points using faster but lower-precision calcs.
I thought Chip said 32 bits would be too expensive (4 times as much as 16 bit). But 24 bit (presumably 2 times as much) might be doable?
Another approach is to allow users to specify the precision, with higher precision taking longer?
I've also seen some microcontroller designs talk about 24-bit types as being more compact than 32-bit to store, having a granularity that is compatible with 32-bit floats, and also saving silicon on things like multipliers.
If it was me, I would say go with 16 bit, for ease, speed and power usage. As you say, if you need more precision it can be done in software. Besides, having 16-bit results makes it easy to then do fast 16x16 multiplications on them.
I'm going to give OnSemi a few files to synthesize so we can check area and speed for 16x16, 20x20, and 24x24 multipliers. We can use the larger ones with a SCL instruction, which, in the case of a 20x20 multiplier, could perform a 2.18 x 2.18 multiply and return a 2.30 result. Inputs and outputs could be MSB-justified.
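One way to read the 2.18 x 2.18 -> 2.30 idea, sketched in Python. The thread doesn't spell out the SCL semantics, so everything here is an assumption: signed operands, both MSB-justified in a 32-bit register, and a result that simply keeps 30 fraction bits of the 40-bit product.

```python
def to_2_18_msb(value):
    """Encode a real number in [-2, 2) as signed 2.18, MSB-justified in 32 bits."""
    return round(value * (1 << 18)) << 12

def scl_2_18(a_msb, b_msb):
    """Hypothetical SCL: 2.18 x 2.18 -> 2.30, using only a 20x20 multiplier."""
    a = a_msb >> 12            # recover the 20-bit signed operands
    b = b_msb >> 12
    product = a * b            # up to 40 bits, 36 of them fractional
    return product >> 6        # keep 30 fraction bits; overflows if |a*b| >= 2

def from_2_30(value):
    """Decode a signed 2.30 result back to a float."""
    return value / (1 << 30)
```

For example, `from_2_30(scl_2_18(to_2_18_msb(0.5), to_2_18_msb(1.5)))` gives 0.75. Note that two 2.18 operands can multiply to almost 4, which a 2.30 result can't hold, so the caller would have to keep magnitudes in range.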
With CNC, I like to calculate and set my step sizes at up to 10X the claimed resolution. This would handle a 6.5" axis with 16 bits.
Most rapid feeds require only one calculation (Per Axis), and actual calculation-intensive applications such as cutting a circle happen at a slower, measured feed rate. Software CORDIC should be just fine.
Might get a little hairy with a large 6-axis arm robot, though, but there ARE 16 cogs after all...
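The 6.5" figure can be reproduced with a couple of lines of Python. The 0.001" claimed resolution is an assumption, picked because a 10X finer step (0.0001") times 65536 positions lands on the number quoted; the post itself doesn't state it.

```python
STEPS_16 = 1 << 16
STEPS_24 = 1 << 24

def axis_travel_inches(step_inches, steps):
    """Total travel addressable with the given step size and number of positions."""
    return step_inches * steps

claimed_resolution = 0.001               # inches; hypothetical machine spec
step = claimed_resolution / 10           # "10X the claimed resolution"
print(axis_travel_inches(step, STEPS_16))   # about 6.55 inches
print(axis_travel_inches(step, STEPS_24))   # about 1678 inches
```

So 16 bits covers a small axis at that step size, while 24 bits covers well over a hundred feet.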
Is there some relationship between the cog multiplier size and the hub cordic size? I don't get it...
Even if the final result of the calculation is 16-bit, you probably want to do the calculation at higher resolution because you lose accuracy with each operation...
On the other hand, I'll take what I can get.
It might be nice if the CORDIC system could generate the math tables in the P1 ROM to the same accuracy...
Actually, I guess this isn't really necessary with all the RAM space, there's plenty of room for math tables...
After 16 stages, if we have access to all of the final values, Xn, Yn, Zn, then it should be possible to iterate the result to higher precision. Draw the next applicable An for stages 17, ... from an auxiliary table. The values hard-coded in the table for all of the steps out to stage 16 would have to be maximum precision, say 32 bit, otherwise extending precision would be futile.
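A quick Python sketch of that idea: run stages 0 to 15, hand back the full (x, y, z) state, then let software continue from stage 16 using more table entries. It assumes the state and the arctangent table are both kept at 32 fraction bits, as the post says they would need to be; the names and the 0.5 rad test angle are made up.

```python
import math

FRAC = 32                                   # table and state precision (assumption)
ONE = 1 << FRAC
ATAN = [round(math.atan(2.0 ** -i) * ONE) for i in range(32)]

def run_stages(x, y, z, first, last):
    """Apply rotation-mode CORDIC stages first .. last-1 to the state (x, y, z)."""
    for i in range(first, last):
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN[i]
    return x, y, z

start = (ONE, 0, round(0.5 * ONE))          # unit vector, target angle 0.5 rad
after_16 = run_stages(*start, 0, 16)        # what the hardware would hand back
resumed = run_stages(*after_16, 16, 24)     # software continues from stage 16
direct = run_stages(*start, 0, 24)
assert resumed == direct
```

Resuming gives exactly the same state as running 24 stages straight through, but only because nothing was truncated to 16 bits at the hand-off. The gain K barely changes between 16 and 24 stages, so the same compensation constant still applies.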
Would it be possible to have the CORDIC machinery point to an auxiliary data table, or is the table hard-coded in the pipeline?
How does this "CORDIC in hub" thing work? There's just one copy of the circuitry envisioned, I believe (correct me if I'm mistaken). And if I recall correctly, while fast, it takes quite a few clocks to complete an operation (more than a memory hub-op, I believe). So, does that preclude more than one cog from using it at a time? I suppose so, but I'm just trying to clarify my understanding (or lack thereof, lol). Even if so, I'm guessing that one copy of the circuitry would suffice in many/most cases. But what would happen if two cogs tried to compete/contend for this resource at overlapping times? Would there be a "busy" flag to check or value returned, or would such an attempt cause that circuitry to reset with the latest op?
UPDATE: I struck through the text to help avoid misinformation, as Chip plans to pipeline the CORDIC for simultaneous usage by all cogs (in pipelined fashion). COOL BEANS!!! See Post 1373 below and/or the excerpt that follows:
"The CORDIC in the hub will be pipelined, so all cogs can use it at once. On any hub turn, a cog can start a CORDIC computation and get results back some fixed number of clocks later."
UPDATE 2: Chip actually mentioned the CORDIC being pipelined back in Post #159:
"CORDIC, MUL32X32, DIV64/32, SQRT (maybe) will be in the hub, but pipelined, so nobody has to wait for anybody else, only their turn at the hub."
Apologies for not reviewing the relevant CORDIC posts before posting here. But at least now it's doubly clear.
Anyway, as to the desirable number of bits of accuracy, I don't have anything useful to contribute, but I sure do like the idea of keeping things relatively simple, and the 16-bit version (or the version that matches the number of bits in the multiplier) meets the KISS principle. But given that P16X32 should be suitable for a standalone system, I'd let the input of the CNC people, such as rabaggett, weigh heavily. The thing he said about fancy curve "cuts" basically needing to be executed slower anyway from a tool-movement perspective makes a lot of sense. Of course, there will be other applications for CORDIC, such as video applications and maybe signal processing. But in the case of video, it would seem to be more "decorative" (for lack of a better term) than functional, and if the new chip can't readily be applied to a CNC machine, that would seem to be a big missed opportunity (even if CNCs often connect with PCs). I'm not a CNC guy myself, unfortunately, but the answer to Chip's question would seem to lie at least partially in that particular application.
I'm going to give OnSemi a few files to synthesize so we can check area and speed for 16x16, 20x20, and 24x24 multipliers. We can use the larger ones with a SCL instruction, which, in the case of a 20x20 multiplier, could perform a 2.18 x 2.18 multiply and return a 2.30 result. Inputs and outputs could be MSB-justified.
Sounds like a good idea. (Also, do you have area and speed indicators for ~200MHz target counters in their process?)
In the meantime, you could implement it as 18x18 (which I think is native in the FPGA), so it will be fast and small, and that would give something to test the idea with - it would use the SCL instruction.
Is there any chance that the upcoming DE2-115 configuration could take advantage of the DE2's on board peripherals?
Perhaps the I/O they use could be 'mapped' onto the P1+ I/O by using some of the 18 slide switches and corresponding LEDs, to just switch them in and out as required?
I realise this isn't at all useful for the DE0 guys but I think it would help the development process along somewhat especially if we had access to the PS2 port (which is a dual port btw) and the SD card slot.
Baggers and I have several plans which would make use of these features, and it would be even better if we could get the RS232 port, USB Host and Ethernet PHY too.
Shouldn't take much silicon to implement and would be a huge benefit to the developers.
Regards,
Coley
I don't think I will ever need more than 16bits of cordic resolution.
If a user really needs better resolution with reasonable performance, SDRAM could be used to store hard-coded 32+ bit values (retrieved from elsewhere :) and then do the rest in software.
re Coley@1369
I would very much like to see the SDRAM supported with hardware connections... Verilog be damned, we have PASM!!!
Rich
I like 16 bit on multipliers and CORDIC. It keeps the transistor count down. Sure a CNC application might need more, but software is more than fast enough for that application.
If CORDIC is a HUBOP, does that mean that you could initiate a call on one hub access and get the result on the next one?
The CORDIC in the hub will be pipelined, so all cogs can use it at once. On any hub turn, a cog can start a CORDIC computation and get results back some fixed number of clocks later. So, it is deterministic to the cog, with respect to its hub cycle. Same thing for big multiply and divide, and square root, if we include it, too.
Perhaps, or maybe a few clocks later. It all depends on how many stages the thing has. I'm thinking it would be good to make one pipelined solver that could do CORDIC, 32x32 multiply, 64/32 divide and 64-bit square root. That would make for one set of data conduits between the solver and the cogs. It would need to have a fixed number of stages. That might dictate 32 plus 2 or 3 more for negation and result trimming.
... That might dictate 32 plus 2 or 3 more for negation and result trimming.
Retrieving the result would have to be a multiple of the cog's hub access cycle, right? So, 32+2 clocks would end up being three times around the hub, which is 48 clocks from starting to gathering the result. I think that's what Serith was getting at too.
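The rounding being described is just a ceiling to the next hub window. A small Python sketch, assuming a cog gets a hub slot every 16 clocks and can only collect a result on one of its own slots (the function name is made up):

```python
HUB_PERIOD = 16   # clocks between one cog's hub windows, assuming 16 cogs

def clocks_until_result(pipeline_stages, hub_period=HUB_PERIOD):
    """Round the pipeline depth up to the cog's next hub window."""
    turns = -(-pipeline_stages // hub_period)   # ceiling division
    return turns * hub_period
```

`clocks_until_result(34)` and `clocks_until_result(35)` both give 48, matching the three-times-around figure, whereas a 16-stage pipeline would be ready on the very next turn.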
a) If this allows 'invisible' access to any COG, then another COG cannot really 'steal' another COG's answer, because the queue handling is COG-specific. Result queues mean they cannot contaminate, but cannot cross-read either.
b) On the other hand, if it blocks, so that a result has to be read before another MATHOP can be launched, then allowing an any-cog free-for-all means you could risk someone else reading your answer, which starts to need complex semaphores, and everything becomes less deterministic.
My understanding, from what Chip has posted over the last couple of weeks, is that it works like a), not like b).
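Case a) can be pictured as a shift register where each entry carries the number of the cog that issued it. The Python toy model below is only that, a toy: the 34-stage depth, the round-robin slot rule, and the doubling that stands in for the real math are all assumptions.

```python
from collections import deque

COGS = 16
STAGES = 34          # hypothetical depth ("32 plus 2 or 3")

def simulate(issues, clocks):
    """issues maps clock -> operand; the owning cog is clock % COGS.

    Returns {cog: [(clock_done, result), ...]}.
    """
    pipe = deque([None] * STAGES)
    results = {}
    for clock in range(clocks):
        done = pipe.popleft()                 # entry leaving the last stage
        if done is not None:
            cog, value = done
            results.setdefault(cog, []).append((clock, value))
        operand = issues.get(clock)
        # operand * 2 is a placeholder for the CORDIC/multiply/divide work
        pipe.append(None if operand is None else (clock % COGS, operand * 2))
    return results
```

Issuing from cogs 0, 1 and 5 on clocks 0, 1 and 5 delivers each result 34 clocks later to the cog that started it, with nobody waiting on anybody else.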
The question is about when the result can be grabbed. Presumption is grabbing with the same Cog that started it.
I read #1372 as asking about the next COG, but reading it again, he may have meant the next (or subsequent) HUB accesses from the same COG.
That would seem to be 48 cycles for 32 + 2 or 3 stages (assuming no Slot-ReMapping feature).
I'm going to give OnSemi a few files to synthesize so we can check area and speed for 16x16, 20x20, and 24x24 multipliers. We can use the larger ones with a SCL instruction, which, in the case of a 20x20 multiplier, could perform a 2.18 x 2.18 multiply and return a 2.30 result. Inputs and outputs could be MSB-justified.
20 or 24 bit multiplication will open the DSP market segment for the P1+. For professional audio we will need a 24-bit multiply (24-bit, 96kHz is standard).
For DSP applications the SCL instruction is much more in use than a MUL(S), so a SCL instruction should also be implemented if the multiplier is only 16x16.
MSB-aligned is not a good idea for DSP: either the input and output values are not in the same format, or, if the result is also MSB-aligned, you have no headroom for accumulation.
Andy
MSB-aligned is good when more than one multiply and/or divide is in use. Accumulation limits are easy to manage with accompanying bit-shifts, or just have the scaling set to the right number of leading zeros. Or use an oversized accumulator.
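Both sides of the headroom argument show up in a few lines of Python. This sketch assumes 2.30 products summed in a 32-bit register; `wrap32` is a made-up helper that imitates register wrap-around.

```python
def wrap32(value):
    """Wrap an integer to signed 32 bits, as a 32-bit register would."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

ONE_2_30 = 1 << 30
products = [ONE_2_30] * 4        # four products of 1.0 each, in 2.30 format

acc_no_headroom = 0
for p in products:
    acc_no_headroom = wrap32(acc_no_headroom + p)     # a sum of 4.0 does not fit in 2.30

acc_prescaled = 0
for p in products:
    acc_prescaled = wrap32(acc_prescaled + (p >> 2))  # 4.28: two bits of headroom
```

The MSB-aligned sum wraps around to 0, while shifting each product right by two bits before adding yields 4.0 in 4.28 format, at the cost of two low-order bits.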
Comments
Same. I feel it's enough for the moment though. Stay efficient and revisit a bulkier variant with the Prop3.
Does making a single pipelined CORDIC (etc.) save quite a bit of logic over having 16 (one per cog) without the pipelining?