The question is about when the result can be grabbed. The presumption is grabbing it with the same cog that started it.
That's correct. The implicit question was "how long will it take to get the results after the operation completes?". I was assuming that the MATHOP was round-robin like HUBOP, so there'd be up to 14 clocks of delay in starting the operation and, depending on when it completed (hence the original question), potentially several clock cycles of delay when retrieving the results. It seems from Chip's reply that the MATHOP will not be part of HUBOP, but I would still expect each cog to have access only once every 16 clock cycles, in which case all MATHOPs would implicitly run for some multiple of 16 cycles. Assuming Chip has a data conduit that allows setup/execute and fetch/results to be done in a single instruction (and therefore a single MATHOP cycle), it looks like we are realistically looking at 48-62 clocks per MATHOP, at least for CORDIC.
And if I have that correct, it seems like the number of stages employed should take advantage of those timing requirements (i.e. don't leave a lot of unused cycles). Get just enough stages in to maximize results without blowing past the next access window. And by "next", I don't necessarily mean "16 clock cycles later", rather "some reasonable multiple of 16 clock cycles later."
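If that timing model is right, the arithmetic can be sketched in a few lines (a speculative model only, assuming a 16-clock slot period and a pipeline of roughly 34 stages; none of these numbers are confirmed, and `mathop_clocks` is my own name):

```python
def mathop_clocks(depth, start_wait, slot=16):
    """Total clocks for one MATHOP, assuming results can only be
    retrieved on the issuing cog's slot every `slot` clocks:
    the wait for our issue slot, plus the pipeline time rounded
    up to the next whole slot period (speculative model)."""
    compute = -(-depth // slot) * slot   # ceil(depth / slot) * slot
    return start_wait + compute

# With a ~34-stage pipeline: 48 clocks best case (no wait to start),
# 62 worst case (14 clocks waiting for the issue slot).
best, worst = mathop_clocks(34, 0), mathop_clocks(34, 14)
```

This is where the 48-62 range above comes from: 34 stages rounds up to three 16-clock slots (48), plus up to 14 clocks waiting to start.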
It all depends on how many stages the thing has. I'm thinking it would be good to make one pipelined solver that could do CORDIC, 32x32 multiply, 64/32 divide and 64-bit square root. That would make one set of data conduit between the solver and the cogs. It would need to have a fixed number of stages. That might dictate 32 plus 2 or 3 more for negation and result trimming.
Taking that with his earlier comments, I think the MathBlock is 'data isolated', but probably not 'time isolated'.
i.e. because the MathBlock needs every one of its 32+2 or 3 cycles, that sets a minimum answer delay, but it does not set a maximum.
(The HW multiplier, of tbf size, is one-per-COG)
Q: if two or more COGs want to use the MathBlock, what are the best/worst-case delays for each COG's answer? Do queued COGs simply block until each answer is completed?
As I have understood it, since the mathblock/cordic is pipelined, while it executes e.g. stage 1 for cog 3 it contemporaneously executes stage 2 for cog 2, stage 3 for cog 1, and so on; thus there is no difference if one or more cogs use it at the same time.
That's correct. Every stage can be doing an operation for a different cog.
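That behavior can be shown with a toy pipeline model (the depth and the `simulate` helper are my own, purely for illustration):

```python
from collections import deque

def simulate(issues, depth=8):
    """Toy model of a depth-stage pipeline. `issues` maps
    clock -> cog issuing an op on that clock; returns
    {cog: completion clock}. One op enters per clock, so
    concurrent users never slow each other down."""
    done = {}
    pipe = deque([None] * depth)             # one payload per stage
    for clock in range(max(issues) + depth + 1):
        finished = pipe.pop()                # op leaving the final stage
        if finished is not None:
            done[finished] = clock
        pipe.appendleft(issues.get(clock))   # op entering stage 1
    return done

# Cogs issuing on consecutive clocks each still see the full
# pipeline latency, with no extra queuing delay:
done = simulate({0: "cog1", 1: "cog2", 2: "cog3"})
```

Each cog's result arrives exactly `depth` clocks after its own issue, regardless of how many other cogs are in flight.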
Cool, that's impressive.
I had assumed some HW (like multiply) sharing across pipelines, but if each pipeline stage is unique/isolated then no COG cares where any other COG might be, and it behaves like totally separate MathBlocks.
If slot mapping is introduced (and that has merits for hub BW and power control), I wonder how it would interact with the MathBlock pipelines?
Wasn't slot mapping one of the more gate- and power-hungry functions that needed to be dropped?
You may be thinking of HW tasking, which I think is just paused until the opcodes and MathBlock can be released in FPGA form, and some OnSemi sims done to check MHz and watts.
That gives a fixed software reference point, and HW tasking does not change that, it augments it.
Slot mapping is simpler: it is a single rotating table of which COG gets the hub slot. It can default to a 1:16 scan, be set to 1:8 to emulate P1, or give one COG 50% of the hub BW (etc.), on an N/M basis.
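That rotating-table idea can be sketched as follows (purely illustrative; the table contents and the `make_slot_table` helper are my invention, not a defined mechanism):

```python
def make_slot_table(entries):
    """Sketch of the rotating slot table: entry k says which COG
    owns hub slot k; the hub just scans the table round-robin."""
    def owner(clock):
        return entries[clock % len(entries)]
    return owner

default = make_slot_table(list(range(16)))           # 1:16 scan
p1_like = make_slot_table(list(range(8)))            # 1:8, to emulate P1
biased  = make_slot_table([0, 1, 0, 2, 0, 3, 0, 4,
                           0, 5, 0, 6, 0, 7, 0, 8])  # cog 0 gets 50% of hub BW
```

With the `biased` table, cog 0 owns every other slot while cogs 1-8 share the rest on an N/M basis.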
If access is going to be every 16 clock cycles, it seems like it would make sense to have the MATHOP cycle shifted by 8 clocks relative to the HUBOP cycle (i.e. cog 0's HUB cycle starts on cycle 0 and MATH cycle starts on cycle 8). That would allow relatively fast turnaround for HUBOP-to-MATHOP operations, and vice versa.
If Chip gets this math processor in the hub thing working, I wonder if it doesn't cost too much to add things there...
At least, there is only one of these and not 8 or 16...
Maybe things like floating point support or CRC operations could be added?
So wouldn't it take the same amount of logic to dedicate a CORDIC engine to each cog? That way a cog wouldn't have to wait for its hub slot.
It would take a lot more logic, because there'd be 32 additional barrel shifters (two per cog). When a CORDIC is hardwired in a pipelined fashion, those shifters turn into hardwired connections, reducing the logic and doubling the speed.
It's true that we could add floating point, as well. Since all this is now separate from the cogs, all kinds of things are possible.
I see. So it takes fewer gates to unroll a single CORDIC engine into a pipelined implementation than to have 16 CORDIC engines that feed back through a barrel shifter.
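The reason the shifters disappear shows up clearly in a software model of CORDIC rotation: stage i always shifts by exactly i bits, so an unrolled pipeline can hard-wire each stage's shift as plain routing (a floating-point sketch of the principle only, not Chip's fixed-point hardware):

```python
import math

ANGLES = [math.atan(2.0 ** -i) for i in range(16)]
GAIN = 1.0
for a in ANGLES:
    GAIN *= math.cos(a)   # accumulated CORDIC gain compensation, ~0.60725

def cordic_rotate(x, y, theta, stages=16):
    """Rotate (x, y) by theta. Stage i uses a FIXED shift of 2**-i,
    which is why an unrolled pipeline needs no barrel shifters:
    each stage's shift is just wiring."""
    z = theta
    for i in range(stages):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x * GAIN, y * GAIN   # undo the accumulated gain

cx, cy = cordic_rotate(1.0, 0.0, math.pi / 6)   # ~ (cos 30deg, sin 30deg)
```

Only the looping/iterative form needs a variable shifter, because the shift distance changes every iteration through the same adder.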
By using quads, all input data/params could be written to the hub cordic/maths unit in a single hub clock, and all results could be read back as a quad in one later hub read.
Even if full floating point proves a little bulky, there are float libraries now for P1, so those could be ported to this, and then some key 'helper operations' could be included in the MathBlock at lower cost?
Chip, does the data exchange between cog and cordic happen during the cog's hub access window, or phase-shifted from it? Or is that still being worked out?
I do like the whole concept, it's efficient use of transistors.
I wonder if this HUB math thing will have an accumulator (or two)...
If you are doing a string of operations, isn't it more efficient to leave the result in the hub, operate on it, and then get it when done?
And/Or, if you could send it a Quad input... Seems you could ask for a chain of 3 or 4 calculations and then get the result when done...
I had great fun writing a version of F32 that had a built-in interpreter to run long strings of calculations in the background. It worked out very well, and it would be handy if there were something similar in the P1+. It worked by running the Spin source through a "compiler" that parsed the code, looking for blocks of math operations. It then broke down the operations into <result*>=<operand*><operator><operand*> and stored that as four longs in memory. At run time the code would just call F32 with the start of the instruction sequence, and the result would appear later. It didn't block the main thread (like the current F32 does).
We were able to do somewhere around 200 lines of floating point math at some reasonable speed (50Hz?)
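The four-long command format described above might look something like this (the operator codes and field order here are invented for illustration; they are not F32_CMD's actual layout):

```python
# Hypothetical operator codes (not F32_CMD's real encoding)
FADD, FSUB, FMUL, FDIV = range(4)

def compile_step(dst, a, op, b):
    """Pack one <result>=<operand><operator><operand> step into
    four longs: [operator, operand A addr, operand B addr, result addr]."""
    return [op, a, b, dst]

# A background interpreter would walk a list of such quads:
program = [
    compile_step(0x100, 0x200, FMUL, 0x204),  # r = a * b
    compile_step(0x104, 0x100, FADD, 0x208),  # s = r + c
]
```

The key property is that the main thread only hands over a pointer to the sequence; the math runs asynchronously and the results appear at the listed addresses later.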
It probably would be best to offset it from the hub, so that you could do a hub instruction just before and/or just after the math operation.
I've been thinking that it might be best to have the cog wait while the computation is being done. It would make messy code to always separately start a math operation and then wait for the result. It would be cleanest, I think, to have the cog count off so many clocks before grabbing the result off a common bus. That way, there's no need to handshake at the end.
Oh oh, now you're making me nervous. Nice to have added features, as long as it does not lead to more delays and power-consumption problems.
The only point is that such extensions are possible without having to touch the cog. It's way easier to develop another pipelined math function than integrate one into the cog. I love the idea of wrapping up the cog and adding new things via the hub and pins, in a simple, deterministic way. I'm getting tired of monkeying with cogs.
Neat ideas!
The cog will have fast multiply instructions (mul/muls) and 64-bit accumulation would take just two more instructions (add, addx). That's six clocks vs maybe 34 for a big math operation like CORDIC.
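What the mul + add/addx sequence accomplishes can be modeled like this (a behavioral sketch of the instruction pair's effect, not PASM; the function name is mine):

```python
MASK32 = 0xFFFF_FFFF

def mac64(acc_lo, acc_hi, a, b):
    """32x32 -> 64 multiply, then a 64-bit accumulate done as a
    low-long add (generating a carry) plus a high-long add-with-carry,
    mirroring the mul / add / addx sequence described above."""
    prod = (a & MASK32) * (b & MASK32)             # mul: full 64-bit product
    lo = acc_lo + (prod & MASK32)                  # add: low long, sets carry
    carry = lo >> 32
    hi = (acc_hi + (prod >> 32) + carry) & MASK32  # addx: high long + carry in
    return lo & MASK32, hi

lo, hi = mac64(0xFFFF_FFFF, 0, 2, 3)   # 0xFFFFFFFF + 6 carries into the high long
```

Keeping this in the cog is what makes it six clocks, versus a round trip through the shared pipeline for the big operations.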
Probably makes sense, as this is not actual HUB memory, is it? (It is separately mapped hardware.)
I guess that phase offset would be fixed, so programmers would need to know how to best align hub/math opcodes?
The tools could easily check and report on alignment issues?
Do these mathops use quads? That could make sense, given these have significant slot delays: users could define a quad with operands + mathop, send that in one opcode, and then do other things (if they want). The result read can be launched at any time, and if it launches early, the read auto-waits. (A later read just waits for the next MathBlock slot.)
A 64/32 divide would fit neatly in a quad when you include the divide operator field, and the DIV and REM would both return on the read.
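A 64/32 divide packed into a quad might look like this (the field layout is a guess at the idea above, not a defined P2 format):

```python
import struct

def pack_div_quad(dividend64, divisor32, DIVOP=1):
    """Pack a 64/32 divide request into four longs: an operator
    field, dividend low, dividend high, divisor (speculative layout)."""
    return struct.pack("<4I", DIVOP,
                       dividend64 & 0xFFFF_FFFF,
                       dividend64 >> 32,
                       divisor32)

def solve_div_quad(quad):
    """What the MathBlock would hand back: quotient and remainder,
    both available together on the later result read."""
    _, lo, hi, d = struct.unpack("<4I", quad)
    n = (hi << 32) | lo
    return n // d, n % d

q, r = solve_div_quad(pack_div_quad(1_000_000_007, 10))   # q=100000000, r=7
```

One write carries the whole request, and one read returns both DIV and REM, matching the quad-per-hub-clock idea.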
Chip,
By using a quad of long registers in the cog, setting them up appropriately, then writing the quad to the hub maths unit on a hub cycle, and reading the result(s) back as a quad into the cog (same or different cog address) on a subsequent hub cycle (when the data is ready), the cog could continue to execute cog code in the meantime.
This way, the maths operation(s) would be in parallel. We would just need to know that the result took 1/2/3/... hub cycles before being ready.
Part of the quad setup could be a set of maths instructions to perform, and the rest a set of input operands.
In fact, the maths unit could be a "quad window" into the hub ROM that would not get used.
Or it could have MathBlock specific opcodes for the RDQUAD/WRQUAD ? (eg RDQMATH, WRQMATH)
Yes, but I was thinking more of sharing the hub address/data bus with the ROM/RAM and MathBlock, instead of perhaps having a separate bus to the MathBlock.
However, the address bus is pretty much redundant when only a single quad goes to the MathBlock, and if the HUB is phase-shifted from the MathBlock slot, it will need a data-path mux at some stage anyway.
Maybe the most basic floating point support would be pack and unpack instructions to separate/merge the exponent and mantissa...
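Those pack/unpack helpers are straightforward to model for IEEE-754 singles (a sketch of the suggested primitive, not a proposed instruction encoding):

```python
import struct

def unpack_float(f):
    """Split an IEEE-754 single into (sign, exponent, mantissa) --
    the kind of pack/unpack helper suggested above."""
    bits = struct.unpack("<I", struct.pack("<f", f))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7F_FFFF

def pack_float(sign, exponent, mantissa):
    """Merge the fields back into a float."""
    bits = (sign << 31) | (exponent << 23) | mantissa
    return struct.unpack("<f", struct.pack("<I", bits))[0]

s, e, m = unpack_float(-1.5)   # sign=1, biased exponent=127, mantissa=0x400000
```

With pack/unpack in hardware, the rest of a float library (normalize, round, add exponents) stays in ordinary integer cog code.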
Anyway, the instructions for that F32 interpreter looked just like you might expect:
https://code.google.com/p/anzhelka/source/browse/software/spin/src/main.spin
F32_CMD:
https://code.google.com/p/anzhelka/source/browse/software/spin/lib/F32_CMD.spin
Command "compiler":
https://code.google.com/p/anzhelka/source/browse/software/spin/tool/math_processor.py