The last couple of pages have been about software implementation, not the hardware, which may or may not support math ops such as floating point.
Can that discussion take place in another thread, and keep this one for hardware implementation, please?
Some really good hw points seem to have been lost in the haystack.
Mathops in Hub...
Floating point can be done in sw now, so it should only be included in hw if it can be done simply. Maybe some support operations would make sense.
MAC etc. make more sense to me. Some of the things Phil has done, where DSP operational support helps, seem to make more sense for a wider P2 audience.
How about we just get this FPGA image out first so we can start testing? Other MATHOPS could be added (or not) after Chip understands how it goes together.
I don't think Spin needs floating point... I think most wanting floating point would be heavily attracted to the C/C++ compilers...
Well, except if you wanted to do something really impressive, then you'd want to do it in assembly and then you might hope there was some hardware support for it...
Personally, I like how PureBasic addressed it, which while adding a bit of verbosity, also makes it very clear what you are working with.
http://www.purebasic.com/documentation/reference/variables.html

Name          Extension    Memory consumption             Range
Byte          .b           1 byte                         -128 to +127
Ascii         .a           1 byte                         0 to +255
Character     .c           1 byte (in ascii mode)         0 to +255
Character     .c           2 bytes (in unicode mode)      0 to +65535
Word          .w           2 bytes                        -32768 to +32767
Unicode       .u           2 bytes                        0 to +65535
Long          .l           4 bytes                        -2147483648 to +2147483647
Integer       .i           4 bytes (32-bit compiler)      -2147483648 to +2147483647
Integer       .i           8 bytes (64-bit compiler)      -9223372036854775808 to +9223372036854775807
Float         .f           4 bytes                        unlimited (see below)
Quad          .q           8 bytes                        -9223372036854775808 to +9223372036854775807
Double        .d           8 bytes                        unlimited (see below)
String        .s           string length + 1              unlimited
Fixed String  .s{Length}   string length                  unlimited
I've proposed a symbol suffix; you are proposing a dot-char suffix. The result is identical, and I am fine with any solution. I took mine from Visual Basic, which allows "abc%" in place of "abc as integer".
I usually declare variables using prefixes, e.g. strMyString as string, nMyNumber as integer, fMyFlag as boolean, while I use the short form during quick tests and proofs of concept to avoid typing a lot.
I did not forget. The point is that when up-casting from a 32-bit int to a 32-bit float, there is a continuous range of values that can be represented exactly. The other correct conversions sparsely populate the rest of the space. That makes working outside of 24-bit ints error prone.
You might want to revisit your estimates. There are about 4.2 billion possible values in a 32-bit int, of which only about 150 million can be represented as a float correctly. That's about 3.5 percent. As I said, "mostly wrong". If this were a communication protocol it'd be rejected, but somehow such silent failures in the heart of our programming language are acceptable!
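To put numbers on that claim, here is a quick back-of-envelope in C (my sketch, assuming IEEE-754 single precision with its 24-bit significand):

    #include <stdio.h>
    #include <stdint.h>

    /* Count the 32-bit two's-complement integers that an IEEE-754
       float (24-bit significand) can represent exactly. */
    int main(void)
    {
        uint64_t exact = 1;                      /* zero */
        exact += 2 * ((1ull << 24) - 1);         /* +/-1 .. +/-(2^24 - 1): all exact */
        for (int e = 24; e <= 30; e++)           /* binades [2^e, 2^(e+1)) */
            exact += 2 * (1ull << 23);           /* 2^23 integral floats each, both signs */
        exact += 1;                              /* -2^31 is exact; +2^31 is out of range */
        printf("%llu of 4294967296 int values are exact (%.2f%%)\n",
               (unsigned long long)exact,
               100.0 * (double)exact / 4294967296.0);
        return 0;
    }

That prints 150994944 (3.52%), which lines up with the 150 million / 3.5 percent figures above.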
As for the "0.1 + 0.2 == 0.3" problem, it's not the equality operator or even the plus that is the issue.
I don't want to reject the idea of floating point in Spin. I just hope it can be done in a nice clean way. Chip must have had something in mind when he allowed the use of floating point literals in Spin.
As for hardware floating point support, my feeling is that it's better not to.
1) I imagine a floating point unit per COG bloats the gate count back up to the unmanageable proportions we had before.
2) A shared floating point unit might be slow enough that it's not worth the bother.
3) Most importantly it violates my "Get me it, NOW" criteria. I want this chip done.
Right: Heater's "Get me it, NOW" criteria should be a top-level project driver. I wonder what development time-limit Ken would put on non-critical features (though, in the end, only Chip can gauge/guess what is critical or not). Would a week (or two) be tolerable? I really feel that anything even approaching a month is waaaaaaaay out of line at this point, because the window-of-opportunity (WoO?) for the chip could slip (or is slipping day-by-day). Sales would be lost, and other chips could step in. Maybe a week should be the limit (perhaps not including documentation and so on). Time marches on, and competing products are released continually.

But I also think that Chip is "on task" to get the new chip into FPGA form, then to fab, then out the door. So, if a feature is perceived to need too much development time, I think it will be nixed. I hope Parallax is committed to silicon this year, and I think that Parallax should strive to get the chip done early (in case something delays things, which something will). I love how Chip has compartmentalized things (cogs, pins, etc.) such that changes/development can be focused on specific areas (and recompiling is maybe faster). He seems to be making progress at lightning speed now; things are really starting to fuse/gel.
... Would a week (or two) be tolerable? I really feel that anything even approaching a month is waaaaaaaay out of line at this point because the window-of-opportunity (WoO?) for the chip could slip (or is slipping day-by-day). Sales would be lost, other chips could step in. Maybe a week should be the limit (perhaps not including documentation and so on).
For what exactly? A regular FPGA release cycle maybe? I'm not sure exactly how this has much to do with windows of opportunity though.
From the peanut gallery: Before I'd subscribe to a floating point capability, I'd be sure I was competitive with functionality like brand X's DSP engine (e.g. 40-bit multiply-and-accumulate of 17-bit arguments in minimal clock cycles, and barrel shifting). I also think it's crucial, with the doubling of COGs, to move to a time-frame model rather than the round-robin model (I forget the nomenclature used here). The HUB should be able to move to any COG in the next time frame in one clock (with maybe a two-clock pipeline to look ahead for the next COG), all under program control and remaining totally deterministic.
... I also think it's crucial, with the doubling of COGs, to move to a time-frame model rather than the round-robin model (I forget the nomenclature used here).
The instruction ratio hasn't changed from the Prop1. It's still eight instruction time intervals per hub access.
The HUB should be able to move to any COG in the next time frame in one clock (with maybe a two-clock pipeline to look ahead for the next COG), all under program control and remaining totally deterministic.
Most ideas around faster hub access are for non-deterministic exceptions, i.e. those that just want speed can forgo determinism on certain Cogs. You might want to explain a little more.
Roy,
JavaScript pushes this even further by making all numbers 64-bit floats. Then we can handle 53-bit integers accurately as well. Big enough for anyone, right? But still people complain because "0.1 + 0.2 == 0.3" is false. You just can't win.
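Both halves of that trade-off are easy to demonstrate in C, since a C double is the same 64-bit format JavaScript uses:

    #include <stdio.h>

    int main(void)
    {
        /* 64-bit doubles carry integers exactly up to 2^53... */
        double big = 9007199254740992.0;              /* 2^53 */
        printf("%.0f %.0f\n", big - 1.0, big + 1.0);  /* the +1 is silently lost */

        /* ...but simple decimal fractions still have no exact
           binary representation. */
        printf("%d\n", 0.1 + 0.2 == 0.3);             /* prints 0 */
        printf("%.17g\n", 0.1 + 0.2);                 /* 0.30000000000000004 */
        return 0;
    }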
APL dealt with this issue with something they called "fuzz" ... the number of bits to use in a compare. It was a "state" attribute of the interpreter and could be changed under program control (which was frowned-on practice). It also had a state for index origin (0 best for matrix work; 1 best for vector (list) work).
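Something like fuzz is easy to sketch in C; fuzz_bits below is an invented stand-in for what APL kept as interpreter state:

    #include <math.h>
    #include <stdio.h>

    /* Fuzzy equality: compare only the top (52 - fuzz_bits) bits of
       precision, scaled to the larger operand's magnitude. */
    static int fuzz_bits = 10;

    static int fuzzy_eq(double a, double b)
    {
        double tol = ldexp(fmax(fabs(a), fabs(b)), fuzz_bits - 52);
        return fabs(a - b) <= tol;
    }

    int main(void)
    {
        printf("%d\n", 0.1 + 0.2 == 0.3);          /* 0: exact compare fails */
        printf("%d\n", fuzzy_eq(0.1 + 0.2, 0.3));  /* 1: fuzzy compare passes */
        return 0;
    }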
I sincerely hope that I'm not forced by some new variation of Spin to add suffixes or prefixes to variables to show others what the variable type is. Those that want to indicate type via naming can do it now, as has been pointed out; the compiler knows what kind of variable it's dealing with after the declaration; everything else is for humans.
I sincerely hope that I'm not forced by some new variation of Spin to add suffixes or prefixes to variables to show others what the variable type is. Those that want to indicate type via naming can do it now, as has been pointed out; the compiler knows what kind of variable it's dealing with after the declaration; everything else is for humans.
I'd prefer that new versions of Spin don't add a suffix or prefix to variable names to indicate type. However, I have to admit that after spending the last 2-3 years mostly programming in Perl, I've grown to like having the prefix on the variables. It makes it easy for the programmer to keep things straight. I know it's only for humans, but it does make it easy to visually differentiate arrays, scalars and hashes. :-)
I sincerely hope that I'm not forced by some new variation of Spin to add suffixes or prefixes to variables to show others what the variable type is. Those that want to indicate type via naming can do it now, as has been pointed out; the compiler knows what kind of variable it's dealing with after the declaration; everything else is for humans.
Completely concur. I've hated that everywhere I've seen it done.
The instruction ratio hasn't changed from the Prop1. It's still eight instruction time intervals per hub access.
Most ideas around faster hub access are for non-deterministic exceptions, i.e. those that just want speed can forgo determinism on certain Cogs. You might want to explain a little more.
Leaves me wondering what happens on the 9th clock with 9 COGs running. Seems COG 0 and 8 would have access to the HUB simultaneously.
Explaining a little more: Consider a circular queue of length N (user controlled, 1 .. maxN) where each element of the queue is the COG number getting HUB access on that clock. A COG could be represented multiple times in the queue. An element in the queue could be empty (negative?), meaning no COG access on that clock. The pointer into the queue would be clock modulo N.
As usual, the devil is in the details. Note: substitute PC for clock as appropriate.
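As a software model of the proposal (all names below are invented; nothing like this exists in the current design):

    #include <stdint.h>

    #define MAX_N      32
    #define EMPTY_SLOT -1

    /* slot_table[i] holds the COG number granted HUB access on that
       clock, or EMPTY_SLOT for no access. N is the user-set wrap. */
    static int8_t  slot_table[MAX_N] = { 0, 1, 0, 2, 0, 3, 0, EMPTY_SLOT };
    static uint8_t N = 8;

    /* Which COG (if any) owns the HUB on a given clock? */
    static int hub_owner(uint32_t clock)
    {
        return slot_table[clock % N];
    }

With the example table above, COG 0 gets half of all slots, COGs 1-3 share most of the rest, and one slot per wrap stays idle, giving both the bandwidth control and the power control discussed below.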
Leaves me wondering what happens on the 9th clock with 9 COGs running. Seems COG 0 and 8 would have access to the HUB simultaneously.
The hub access repeats every 16 clock cycles (or, as evanh stated, every 8 instruction cycles). Also, hub timeslicing is the same regardless of the number of cogs running.
Leaves me wondering what happens on the 9th clock with 9 COGs running. Seems COG 0 and 8 would have access to the HUB simultaneously.
I believe Chip is planning on running the COG opcodes at 100MHz, but the HUB memory is targeting 200MHz.
That means there will be an interleaved action, and it also means COGs will run slightly phase-shifted depending on which phase they are granted for HUB access (which may matter for critical timing across COGs).
Explaining a little more: Consider a circular queue of length N (user controlled, 1 .. maxN) where each element of the queue is the COG number getting HUB access on that clock. A COG could be represented multiple times in the queue. An element in the queue could be empty (negative?), meaning no COG access on that clock. The pointer into the queue would be clock modulo N.
Yes, that table mapping has been discussed, pretty much exactly as you state, as a means for both power control and bandwidth control. I don't think Chip has implemented it yet, as it can be a late-change.
A single common table, somewhat larger than 16 to give granularity, is filled with COG mappings. Default would be equal-spread, and user code can re-map to give (for example) one COG 50% of bandwidth.
User control of Wrap-length allows matching with actual COG numbers used in a system design, for jitter free mapping.
If those map slots have 200MHz granularity, you would need to map even slots to one half of the COGs and odd slots to the other half.
- i.e. whilst adjacent 200MHz slots could be mapped to the same cog, one would be ignored/useless, as the opcode can only run at 100MHz.
( These figures assume Chip can achieve Timing and Power Envelope details for the 200MHz memory speeds )
I believe Chip is planning on running the COG opcodes at 100MHz, but the HUB memory is targeting 200MHz.
That means there will be an interleaved action, and it also means COGs will run slightly phase-shifted depending on which phase they are granted for HUB access (which may matter for critical timing across COGs).
To clarify, instructions still take 4 clock cycles, but are semi-pipelined such that they overlap by 2 clock cycles. This gives a potential of 100MIPS at a 200MHz clock speed. As jmg points out, there will be phasing between cogs (specifically between 4 adjacent cogs) such that cog 1 will be one clock cycle out of sync with cog 0, cog 2 will be 2 cycles out of sync, cog 3 will be 3 cycles out, cog 4 will be in sync, etc. But with the overlapped instruction, potentially only every other cog will be out of sync.
Of course, you could always use a WAITCNT to bring one cog's instruction cycle in sync with another cog, but it would take a few additional instructions to set up the WAITCNT. It makes me wonder if it would make sense to add a WAITxxx instruction that will sync to the 2 LSBs of CNT.
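A toy C model of what such a WAITxxx would do (invented names, plain code standing in for hardware):

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t CNT = 7;                 /* stand-in for the free-running counter */
    static uint32_t tick(void) { return CNT++; }

    /* Spin until the 2 LSBs of CNT match the target phase, aligning
       this cog's 4-clock instruction cycle with another cog's. */
    static uint32_t wait_phase(uint32_t phase)
    {
        uint32_t now;
        while (((now = tick()) & 3u) != (phase & 3u))
            ;                                /* burn clocks until aligned */
        return now;
    }

    int main(void)
    {
        printf("aligned at CNT = %u\n", wait_phase(2));  /* prints 10 */
        return 0;
    }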
Yes, that table mapping has been discussed, pretty much exactly as you state, as a means for both power control and bandwidth control. I don't think Chip has implemented it yet, as it can be a late-change.
I would opt for that "way" before floating point. Seems to me a COG could get 100% of the available cycles, or be granted such a burst dynamically, just by mapping it into the first position and setting N (the number of elements in the circular queue) to 1. The remaining elements in the queue would be untouched, and setting N to some larger number of elements would bring other COGs back into play, kind of like generating a hard loop, which is so popular with micro-controllers. Then I'd want a hardware (and/or software) interrupt to bust out of the loop ... and here I go with feature creep.
I'd agree, but I think Chip asked about Floating point just because he was working on the MathBlock, and was looking at what precision trade offs to make in Maths and that led to FP questions.
Certainly there will be Floating Point support - the question is how much in silicon, and how much in SW?
Not sure how the OnSemi sims worked out in size/speed on the different sized MULT choices Chip was trying?
Seems to me a COG could get 100% of the available cycles, or be granted such a burst dynamically, just by mapping it into the first position and setting N (the number of elements in the circular queue) to 1. The remaining elements in the queue would be untouched, and setting N to some larger number of elements would bring other COGs back into play ...
Yes, that is another natural use of Table Modulus/wrap I had not considered.
If you break the fixed round-robin hub-access order, what you'll end up with is utter chaos. Different objects, written by different programmers, will have different "requirements", each competing with the others for hub slots. I cannot see any way that order and determinism could ever prevail in such a scenario.
-Phil
There might be a way to achieve this. I'm not really satisfied we've thought it through properly. We did find a solution to the task slotting after a while, but tasks being inside a cog means it's not the same problem. Right at this point, and for this processor, I think keeping it very simple is the way to go.
The simplest I've come up with so far would be to modify cognew, so that on request it looks for and loads two diametric cogs at the same time. The second cog would be 8 above the first cog, and would permanently donate its hub slot to speed up RDLONGs etc. in the first cog. I think this would then operate similarly to multi-cog video on P1. You can 'fail' right at cogrun if a free cog pair can't be found.
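A rough software model of that allocation rule (my sketch; cognew_pair and cogs_free are invented names):

    #include <stdint.h>

    static uint16_t cogs_free = 0xFFFF;      /* bit set = cog available */

    /* Claim a diametric pair (n, n+8); cog n+8 donates its hub slot
       to cog n. Returns the lower cog id, or -1 on failure, mirroring
       the 'fail right at cogrun' behavior described above. */
    static int cognew_pair(void)
    {
        for (int n = 0; n < 8; n++) {
            uint16_t pair = (uint16_t)((1u << n) | (1u << (n + 8)));
            if ((cogs_free & pair) == pair) {
                cogs_free &= ~pair;
                return n;
            }
        }
        return -1;
    }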
Of course, you could always use a WAITCNT to bring one cog's instruction cycle in sync with another cog, but it would take a few additional instructions to set up the WAITCNT. It makes me wonder if it would make sense to add a WAITxxx instruction that will sync to the 2 LSBs of CNT.
Not sure how the OnSemi sims worked out in size/speed on the different sized MULT choices Chip was trying?
Going from a 16x16 multiplier to a 24x24 increased the total ALU area by only 10%.
With extra space provided for 68% area utilization (to accommodate buffering and clock tree insertion downstream), the total ALU size w/24-bit multiplier is now 0.25 square mm per cog. That was for 160MHz operation (a 320MHz clock), which is way faster than the RAMs could run; the RAMs are only good for 250MHz.
If you break the fixed round-robin hub-access order, what you'll end up with is utter chaos. Different objects, written by different programmers, will have different "requirements", each competing with the others for hub slots. I cannot see any way that order and determinism could ever prevail in such a scenario.
If we are willing to give up determinism while still maintaining order, then the approach I detailed in my blog would be relatively straightforward.
You could even vary the approach slightly, where the assertions are always active for only the active cogs. This would have the effect of having the access window be modulo the number of active cogs.
This could be the default behavior of the hub. Then, if a particular module requires deterministic timing without having to know about any of the other cogs' activities, it could use a HUBOP to disable this mode (effectively having every cog assert, whether running or not). At which point, you are now running modulo 16, and you have traded off performance for determinism.
And before anyone comments that this would allow a module requiring determinism to cause issues for a module requiring high hub throughput, those are two conflicting goals and would require a much, much more complicated access scheme. I think the approach I offer provides a fairly simple solution that also allows a fallback to the current approach when determinism is absolutely required.
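My reading of that scheme, as a sketch (hub_owner_masked and active_mask are invented names):

    #include <stdint.h>

    #define NUM_COGS 16

    static uint16_t active_mask = 0x0007;    /* e.g. cogs 0..2 asserting */

    /* Round-robin over asserting cogs only: the access window becomes
       modulo the number of active cogs instead of modulo 16. Setting
       active_mask = 0xFFFF restores the deterministic modulo-16 wheel,
       i.e. the HUBOP fallback described above. */
    static int hub_owner_masked(uint32_t slot)
    {
        int active[NUM_COGS], n = 0;
        for (int c = 0; c < NUM_COGS; c++)
            if (active_mask & (1u << c))
                active[n++] = c;
        return n ? active[slot % n] : -1;
    }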
If you break the fixed round-robin hub-access order, what you'll end up with is utter chaos. Different objects, written by different programmers, will have different "requirements", each competing with the others for hub slots. I cannot see any way that order and determinism could ever prevail in such a scenario.
I think the only "safe" way would be if Chip can make the hub-to-cog window wider or occur more often (for all of them equally).
Great idea!
This discussion comes up now and then and I really think for this design we need to set it off the table.
Agreed.