The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Page 50 — Parallax Forums

The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Comments

  • Cluso99Cluso99 Posts: 18,069
    edited 2014-04-29 17:32
    The last couple of pages have been on software implementation, not the hardware, which may or may not support mathops such as float etc.
    Can that discussion take place on another thread, and keep this one for hardware implementation, please?
    Some really good hw points seem to have been lost in the haystack.

    Mathops in Hub...
    Floating point can be done in sw now. So only if it can be done simply should it be included in hw. Maybe some support operations may make sense.
    MAC etc. make more sense to me. DSP operation support, of the kind that would help with some of the things Phil has done, seems to make more sense for a wider audience of P2.

    How about we just get this FPGA image out first so we can start testing? Other MATHOPS could be added (or not) after Chip understands how it goes together.
  • RaymanRayman Posts: 14,762
    edited 2014-04-29 20:09
    I don't think Spin needs floating point... I think most wanting floating point would be heavily attracted to the C/C++ compilers...

    Well, except if you wanted to do something really impressive, then you'd want to do it in assembly and then you might hope there was some hardware support for it...
  • kwinnkwinn Posts: 8,697
    edited 2014-04-29 20:55
    +1 Thank you Heater.
    Heater. wrote: »
    dMajo,

    Because decorating variable names like that is horribly ugly.
  • koehlerkoehler Posts: 598
    edited 2014-04-30 01:48
    I agree that type of decoration is ugly and a PITA.

    Personally, I like how PureBasic addressed it, which while adding a bit of verbosity, also makes it very clear what you are working with.

    http://www.purebasic.com/documentation/reference/variables.html

    Name          Extension   Memory consumption               Range
    Byte          .b          1 byte                           -128 to +127
    Ascii         .a          1 byte                           0 to +255
    Character     .c          1 byte (in ascii mode)           0 to +255
    Character     .c          2 bytes (in unicode mode)        0 to +65535
    Word          .w          2 bytes                          -32768 to +32767
    Unicode       .u          2 bytes                          0 to +65535
    Long          .l          4 bytes                          -2147483648 to +2147483647
    Integer       .i          4 bytes (using 32-bit compiler)  -2147483648 to +2147483647
    Integer       .i          8 bytes (using 64-bit compiler)  -9223372036854775808 to +9223372036854775807
    Float         .f          4 bytes                          unlimited (see below)
    Quad          .q          8 bytes                          -9223372036854775808 to +9223372036854775807
    Double        .d          8 bytes                          unlimited (see below)
    String        .s          string length + 1                unlimited
    Fixed String  .s{Length}  string length                    unlimited
    Heater. wrote: »
    dMajo,

    Because decorating variable names like that is horribly ugly.
  • dMajodMajo Posts: 855
    edited 2014-04-30 02:59
    koehler wrote: »
    I agree that type of decoration is ugly and a PITA.

    Personally, I like how PureBasic addressed it, which while adding a bit of verbosity, also makes it very clear what you are working with.

    http://www.purebasic.com/documentation/reference/variables.html

    I've proposed a symbol suffix; you are proposing a dotChar suffix. The result is identical, and I am fine with either solution. I took mine from Visual Basic, which allows "abc%" in place of "abc as integer".
    I usually declare variables using prefixes, e.g. strMyString as string, nMyNumber as integer, fMyFlag as boolean ... while I use the short form during quick tests and proofs of concept, to avoid typing a lot.
  • Heater.Heater. Posts: 21,230
    edited 2014-04-30 03:00
    Roy,

    I did not forget. Point is that when up casting from 32 bit int to 32 bit float there is a continuous range of values that can be represented exactly. The other correct conversions sparsely populate the rest of the space. That makes working outside of 24 bit ints error prone.

    You might want to revisit your estimates. There are about 4.3 billion possible values in the int, of which only about 150 million can be represented as a float correctly. About 3.5 percent. As I said, "mostly wrong". If this were a communication protocol it would be rejected, but somehow such silent failures in the heart of our programming language are acceptable!
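The 150 million and 3.5 percent figures can be checked with a short sketch. It assumes "float" means IEEE 754 single precision (24-bit significand) and "int" means a signed 32-bit integer. The function names are illustrative only and do not come from any Propeller tool.

```python
import struct

def survives_float32(n: int) -> bool:
    """Return True if n is unchanged by a round trip through IEEE 754 single precision."""
    as_float32 = struct.unpack("<f", struct.pack("<f", float(n)))[0]
    return int(as_float32) == n

def count_exact_int32() -> int:
    """Count the signed 32-bit integers that single precision can hold exactly."""
    contiguous = 2 * 2**24 - 1      # every n with |n| < 2**24 is exact
    per_binade = 2**23              # float32 values in each range [2**e, 2**(e+1))
    sparse = 2 * 7 * per_binade     # e = 24..30, positive and negative
    return contiguous + sparse + 1  # plus -2**31, which is also exact

total = count_exact_int32()
print(total, f"{total / 2**32:.1%}")
print(survives_float32(2**24), survives_float32(2**24 + 1))
```

Under those assumptions the count comes to 150,994,944 of 4,294,967,296 values, and 2**24 + 1 is the first positive integer that does not survive the conversion.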

    As for the "0.1 + 0.2 == 0.3" problem it's not the equality operator or even the plus that is the issue.

    I don't want to reject the idea of floating point in Spin. I just hope it can be done in a nice clean way. Chip must have had something in mind when he allowed the use of floating point literals in Spin.

    As for hardware floating point support my feeling is that it's better not.

    1) I imagine a floating point unit per COG bloats the gate count back up to the unmanageable proportions we had before.

    2) A shared floating point unit might be slow enough that it's not worth the bother.

    3) Most importantly it violates my "Get me it, NOW" criteria. I want this chip done.
  • JRetSapDoogJRetSapDoog Posts: 954
    edited 2014-04-30 04:03
    Right: Heater's "Get me it, NOW" criteria should be a top-level project driver. I wonder what development time-limit Ken would put on non-critical features (though, in the end, only Chip can gauge/guess what is critical or not). Would a week (or two) be tolerable? I really feel that anything even approaching a month is waaaaaaaay out of line at this point because the window-of-opportunity (WoO?) for the chip could slip (or is slipping day-by-day). Sales would be lost, other chips could step in. Maybe a week should be the limit (perhaps not including documentation and so on). Time marches on and competing products are released continually. But I also think that Chip is "on task" to get the new chip into FPGA form, then to fab, then out the door. So, if a feature is perceived to need too much development time, I think it will be nixed. I hope Parallax is committed to silicon this year, and I think that Parallax should strive to get the chip done early (in case something delays things, which something will). I love how Chip has compartmentalized things (cogs, pins, etc.) such that changes/development can be focused on specific areas (and recompiling is maybe faster). He seems to be making progress at lightning speed now; things are really starting to fuse/gel.
  • evanhevanh Posts: 16,039
    edited 2014-04-30 05:16
    ... Would a week (or two) be tolerable? I really feel that anything even approaching a month is waaaaaaaay out of line at this point because the window-of-opportunity (WoO?) for the chip could slip (or is slipping day-by-day). Sales would be lost, other chips could step in. Maybe a week should be the limit (perhaps not including documentation and so on).

    For what exactly? A regular FPGA release cycle maybe? I'm not sure exactly how this has much to do with windows of opportunity though.
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-04-30 05:38
    From the peanut gallery: Before I'd subscribe to a floating point capability, I'd be sure I was competitive with functionality like brand X's DSP engine (e.g. 40bit Multiply and accumulate of 17bit arguments in minimal clock cycles and barrel shifting). I also think it's crucial, with the doubling of COGS, to move to a time-frame model rather than the round-robin model (I forget the nomenclature used here). The HUB should be able to move to any COG in the next time frame in one clock (with maybe a two clock pipeline to look ahead for the next COG), all under program control and remaining totally deterministic.
  • evanhevanh Posts: 16,039
    edited 2014-04-30 05:58
    ... I also think it's crucial, with the doubling of COGS, to move to a time-frame model rather than the round-robin model (I forget the nomenclature used here).

    The instruction ratio hasn't changed from the Prop1. It's still eight instruction time intervals per hub access.

    The HUB should be able to move to any COG in the next time frame in one clock (with maybe a two clock pipeline to look ahead for the next COG), all under program control and remaining totally deterministic.

    Most ideas around faster hub access are for non-deterministic exceptions, ie: Those that just want speed can forgo determinism on certain Cogs. You might want to explain a little more.
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-04-30 06:22
    Heater. wrote: »
    Roy,
    JavaScript pushes this even further by making all numbers 64 bit floats. Then we can handle 53 bit integers accurately as well. Big enough for any one right? But still people complain because "0.1 + 0.2 == 0.3" is false. You just can't win.
    APL dealt with this issue with something they called "fuzz" ... the number of bits to use in a compare. It was a "state" attribute of the interpreter and could be changed under program control (which was frowned-on practice). It also had a state for index origin (0 best for matrix work; 1 best for vector (list) work).
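A rough sketch of the "fuzz" idea, as described above, is a comparison that only looks at a chosen number of leading bits. The `fuzzy_equal` helper and its `bits` parameter are hypothetical names for illustration and are not APL's actual mechanism. `math.isclose` is the Python standard library's relative-tolerance comparison.

```python
import math

def fuzzy_equal(a: float, b: float, bits: int) -> bool:
    """Compare a and b using only about `bits` leading significant bits."""
    tolerance = max(abs(a), abs(b)) * 2.0 ** -bits
    return abs(a - b) <= tolerance

exact = (0.1 + 0.2 == 0.3)                  # strict binary comparison
fuzzy = fuzzy_equal(0.1 + 0.2, 0.3, 40)     # compare on roughly 40 bits
stdlib = math.isclose(0.1 + 0.2, 0.3)       # standard-library tolerance check
past_53 = float(2**53) == float(2**53 + 1)  # 64-bit floats lose integers past 2**53

print(exact, fuzzy, stdlib, past_53)
```

The last line illustrates the 53-bit integer limit mentioned in the quoted post: two different integers collapse to the same 64-bit float.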
  • JonnyMacJonnyMac Posts: 9,159
    edited 2014-04-30 11:29
    I sincerely hope that I'm not forced by some new variation of Spin to add suffixes or prefixes to variables to show others what the variable type is. Those that want to indicate type via naming can do it now as has been pointed out; the compiler knows what kind of variable it's dealing with after the declaration; everything else is for humans.
  • pik33pik33 Posts: 2,388
    edited 2014-04-30 11:34
    There were Atari 8-bit computers. Their Basic worked with BCD 6-byte floats - 6502 has native support for BCD and 0.1+0.2 was always 0.3 :)
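The same effect can be seen with any decimal number format. The sketch below uses Python's `decimal` module as a stand-in. It is decimal arithmetic, not the Atari 6-byte BCD format itself, so it only illustrates why a base-10 representation makes 0.1 + 0.2 come out as exactly 0.3.

```python
from decimal import Decimal

binary_sum = 0.1 + 0.2                         # binary floating point
decimal_sum = Decimal("0.1") + Decimal("0.2")  # base-10 arithmetic

print(binary_sum == 0.3)
print(decimal_sum == Decimal("0.3"))
```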
  • 4x5n4x5n Posts: 745
    edited 2014-04-30 12:57
    JonnyMac wrote: »
    I sincerely hope that I'm not forced by some new variation of Spin to add suffixes or prefixes to variables to show others what the variable type is. Those that want to indicate type via naming can do it now as has been pointed out; the compiler knows what kind of variable it's dealing with after the declaration; everything else is for humans.

    I'd prefer that new versions of Spin don't add a suffix or prefix to variable names to indicate type. However, I have to admit that after spending the last 2-3 years mostly programming in Perl, I've grown to like having the prefix on the variables. It makes it easy for the programmer to keep things straight. I know it's only for humans, but it does make it easy to visually differentiate arrays, scalars and hashes. :-)
  • photomankcphotomankc Posts: 943
    edited 2014-04-30 13:43
    JonnyMac wrote: »
    I sincerely hope that I'm not forced by some new variation of Spin to add suffixes or prefixes to variables to show others what the variable type is. Those that want to indicate type via naming can do it now as has been pointed out; the compiler knows what kind of variable it's dealing with after the declaration; everything else is for humans.


    Completely concur. I've hated that everywhere I've seen it done.
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-04-30 14:47
    evanh wrote: »
    The instruction ratio hasn't changed from the Prop1. It's still eight instruction time intervals per hub access.
    Most ideas around faster hub access are for non-deterministic exceptions, ie: Those that just want speed can forgo determinism on certain Cogs. You might want to explain a little more.
    Leaves me wondering what happens on the 9th clock with 9 COGs running. Seems COG 0 and 8 would have access to the HUB simultaneously.

    Explaining a little more: Consider a circular queue of length N (user controlled 1 .. maxN) where the element of the queue is the COG number getting HUB access on that clock. A COG could be represented multiple times in the queue. An element in the queue could be empty (negative?) meaning no COG access on that clock. Pointer into the queue would be clock modulo N.

    As usual, the devil is in the details. Note: substitute PC for clock as appropriate.
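The circular queue described in this post can be sketched in a few lines. This is only a software model of the proposal, not anything implemented in the chip. The queue contents are hypothetical, and `None` stands in for the "empty" element.

```python
from typing import List, Optional

def slot_owner(queue: List[Optional[int]], n: int, clock: int) -> Optional[int]:
    """Cog granted hub access on this clock, or None for an empty slot."""
    return queue[clock % n]  # pointer into the queue is clock modulo N

# Hypothetical queue: cog 0 appears four times, one slot is left empty.
queue = [0, 1, 0, 2, 0, None, 0, 3]
n = len(queue)

schedule = [slot_owner(queue, n, clock) for clock in range(10)]
print(schedule)
```

Because the pointer depends only on the clock, the schedule repeats every N clocks and stays predictable for any fixed queue.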
  • SeairthSeairth Posts: 2,474
    edited 2014-04-30 15:27
    Leaves me wondering what happens on the 9th clock with 9 COGs running. Seems COG 0 and 8 would have access to the HUB simultaneously.

    The hub access repeats every 16 clock cycles (or, as evanh stated, every 8 instruction cycles). Also, hub timeslicing is the same regardless of the number of cogs running.
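As a minimal model of that answer, assume one hub slot per clock and that cog k owns the slot where the clock count modulo 16 equals k. The exact slot assignment is an assumption made for illustration. Only the 16-clock repeat and its independence from the number of running cogs come from the post.

```python
NUM_COGS = 16  # slots per rotation, fixed regardless of how many cogs run

def hub_owner(clock: int) -> int:
    """Cog that owns the hub slot on a given clock under a fixed round-robin."""
    return clock % NUM_COGS

running = set(range(9))  # nine cogs running, as in the question

for clock in (0, 8, 9, 16):
    owner = hub_owner(clock)
    state = "used" if owner in running else "idle"
    print(clock, owner, state)
```

In this model cog 0 and cog 8 never collide: cog 8 has its own slot, and slots belonging to stopped cogs simply go unused.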
  • jmgjmg Posts: 15,175
    edited 2014-04-30 15:39
    Leaves me wondering what happens on the 9th clock with 9 COGs running. Seems COG 0 and 8 would have access to the HUB simultaneously.

    I believe Chip is planning on running the COG opcodes at 100MHz but the HUB memory is targeting 200MHz.
    That means there will be an interleaved action, and it also means COGS will run slightly phase shifted depending on which Phase they are granted for HUB access. (which may matter for critical timing across COGs)

    Explaining a little more: Consider a circular queue of length N (user controlled 1 .. maxN) where the element of the queue is the COG number getting HUB access on that clock. A COG could be represented multiple times in the queue. An element in the queue could be empty (negative?) meaning no COG access on that clock. Pointer into the queue would be clock modulo N.

    Yes, that table mapping has been discussed, pretty much exactly as you state, as a means for both power control and bandwidth control. I don't think Chip has implemented it yet, as it can be a late-change.


    A single common table, somewhat larger than 16 to give granularity, is filled with COG mappings. Default would be equal-spread, and user code can re-map to give (for example) one COG 50% of bandwidth.
    User control of Wrap-length allows matching with actual COG numbers used in a system design, for jitter free mapping.

    If those Map slots have 200MHz granularity, you would need to map even slots to one half of COGS and odd slots to the other.
    - ie whilst adjacent 200MHz slots could be mapped same-cog, one would be ignored/useless as the opcode can only run at 100MHz.
    ( These figures assume Chip can achieve Timing and Power Envelope details for the 200MHz memory speeds )
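The table idea in this post can be modelled as follows. The table length of 32, the particular remapping, and the helper names are all assumptions chosen for illustration. The sketch measures each cog's share of slots and checks the constraint that no cog owns two adjacent 200MHz slots.

```python
from collections import Counter
from typing import Dict, List

def bandwidth_share(table: List[int]) -> Dict[int, float]:
    """Fraction of hub slots owned by each cog that appears in the table."""
    counts = Counter(table)
    return {cog: count / len(table) for cog, count in counts.items()}

def has_adjacent_repeat(table: List[int]) -> bool:
    """True if any cog owns two neighbouring slots (the table wraps around)."""
    size = len(table)
    return any(table[i] == table[(i + 1) % size] for i in range(size))

# Default: equal spread of 16 cogs over a 32-entry table.
default_table = [slot % 16 for slot in range(32)]

# Hypothetical remap: cog 0 takes every even slot, odd cogs share the odd slots.
odd_cogs = [1, 3, 5, 7, 9, 11, 13, 15]
remapped = [0 if slot % 2 == 0 else odd_cogs[(slot // 2) % 8] for slot in range(32)]

print(bandwidth_share(default_table)[0], bandwidth_share(remapped)[0])
print(has_adjacent_repeat(remapped))
```

In this extreme remap, cog 0 gets half of all slots while the other even-numbered cogs get none, which shows how quickly one cog's gain becomes another cog's loss.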
  • SeairthSeairth Posts: 2,474
    edited 2014-04-30 15:59
    jmg wrote: »
    I believe Chip is planning on running the COG opcodes at 100MHz but the HUB memory is targeting 200MHz.
    That means there will be an interleaved action, and it also means COGS will run slightly phase shifted depending on which Phase they are granted for HUB access. (which may matter for critical timing across COGs)

    To clarify, instructions still take 4 clock cycles, but are semi-pipelined such that they overlap by 2 clock cycles. This gives a potential of 100MIPS at a 200MHz clock speed. As jmg points out, there will be phasing between cogs (specifically between 4 adjacent cogs) such that cog 1 will be one clock cycle out of sync with cog 0, cog 2 will be 2 cycles out of sync, cog 3 will be 3 cycles out, cog 4 will be in sync, etc. But with the overlapped instruction, potentially only every other cog will be out of sync.
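The phasing described here reduces to a modulo operation. The sketch assumes that each cog's instruction timing is aligned to its own hub slot and that cog k's slot falls k clocks after cog 0's. The function name is hypothetical.

```python
def phase_offset(cog: int, clocks_between_starts: int) -> int:
    """Clock offset of a cog's instruction start relative to cog 0."""
    return cog % clocks_between_starts

# Without overlap an instruction starts every 4 clocks: four distinct phases.
four_clock = [phase_offset(cog, 4) for cog in range(8)]

# With 2-clock overlap an instruction starts every 2 clocks: two phases.
two_clock = [phase_offset(cog, 2) for cog in range(8)]

print(four_clock)
print(two_clock)
```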

    Of course, you could always use a WAITCNT to bring one cog's instruction cycle in sync with another cog, but it would take a few additional instructions to set up the WAITCNT. It makes me wonder if it would make sense to add a WAITxxx instruction that will sync to the 2 LSBs of CNT.
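What such an instruction would have to compute is small. The sketch below models a hypothetical wait that stalls until the two least significant bits of CNT match a target phase. No such instruction is confirmed in the thread, and the function name is made up for illustration.

```python
def clocks_to_wait(cnt: int, target_phase: int) -> int:
    """Clocks to stall so that the 2 LSBs of CNT equal target_phase (0..3)."""
    return (target_phase - cnt) & 3

for cnt, target in ((5, 0), (8, 0), (6, 1)):
    wait = clocks_to_wait(cnt, target)
    print(cnt, target, wait, (cnt + wait) & 3)
```

The wait is never longer than three clocks, which is why a dedicated instruction could be cheaper than the few instructions needed to set up a WAITCNT.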
  • Todd MarshallTodd Marshall Posts: 89
    edited 2014-04-30 16:13
    jmg wrote: »
    Yes, that table mapping has been discussed, pretty much exactly as you state, as a means for both power control and bandwidth control. I don't think Chip has implemented it yet, as it can be a late-change.
    I would opt for that "way" before floating point. Seems to me a COG could get 100% of the available cycles ... or be granted such a burst dynamically just by mapping it into the first position and setting N (number of elements in the circular queue) to 1. Remaining elements in the queue would be untouched and setting N to some larger number of elements would bring other COGs back into play ... kind of like generating a hard loop which is so popular with micro-controllers. Then I'd want a hardware (and/or software) interrupt to bust out of the loop ... and here I go with feature creep.
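The burst idea can be modelled by making the wrap length a separate setting from the table contents. The `HubTable` class and the slot values are hypothetical. The point is only that shrinking the wrap to 1 gives the first entry every slot, and restoring it brings the others back untouched.

```python
from typing import List

class HubTable:
    """Software model of a slot table with a user-controlled wrap length."""

    def __init__(self, slots: List[int]) -> None:
        self.slots = slots
        self.wrap = len(slots)  # N, the number of entries actually scanned

    def owner(self, clock: int) -> int:
        return self.slots[clock % self.wrap]

table = HubTable([5, 0, 1, 2, 3, 4, 6, 7])
normal = [table.owner(clock) for clock in range(8)]

table.wrap = 1                   # burst: only the first entry is scanned
burst = [table.owner(clock) for clock in range(8)]

table.wrap = len(table.slots)    # restore: remaining entries were never changed
restored = [table.owner(clock) for clock in range(8)]

print(normal, burst, restored)
```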
  • mindrobotsmindrobots Posts: 6,506
    edited 2014-04-30 16:30
    We don't take kindly to interrupts around these parts. ;)
  • jmgjmg Posts: 15,175
    edited 2014-04-30 16:32
    I would opt for that "way" before floating point.

    I'd agree, but I think Chip asked about Floating point just because he was working on the MathBlock, and was looking at what precision trade offs to make in Maths and that led to FP questions.

    Certainly there will be Floating Point support - the question is how much in silicon, and how much in SW ?

    Not sure how the OnSemi sim's worked out in Size/speed on the different sized MULT choices Chip was trying ?
    Seems to me a COG could get 100% of the available cycles ... or be granted such a burst dynamically just by mapping it into the first position and setting N (number of elements in the circular queue) to 1. Remaining elements in the queue would be untouched and setting N to some larger number of elements would bring other COGs back into play .

    Yes, that is another natural use of Table Modulus/wrap I had not considered.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-04-30 16:58
    If you break the fixed round-robin hub-access order, what you'll end up with is utter chaos. Different objects, written by different programmers, will have different "requirements", each competing with the others for hub slots. I cannot see any way that order and determinism could ever prevail in such a scenario.

    -Phil
  • TubularTubular Posts: 4,705
    edited 2014-04-30 17:51
    If you break the fixed round-robin hub-access order, what you'll end up with is utter chaos. Different objects, written by different programmers, will have different "requirements", each competing with the others for hub slots. I cannot see any way that order and determinism could ever prevail in such a scenario.

    -Phil

    There might be a way to achieve this. I'm not really satisfied we've thought it through properly. We did find a solution to the task slotting after a while, but tasks being inside a cog means it's not the same problem. Right at this point, and for this processor, I think keeping it very simple is the way to go.

    The simplest I've come up with so far would be to modify cognew, so that on request it looks and loads two diametric cogs at the same time. The second cog would be 8 above the first cog, and would permanently donate its hub slot to speed up RDLONGs etc in the first cog. I think this would then operate similarly to multi-cog video on P1. You can 'fail' right at cogrun if a free cog pair can't be found.
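The pairing search described above might look like the following sketch. It models only the allocation step. `find_free_pair` is a hypothetical helper, not an existing cognew behaviour, and the slot model (cog k owns the clock where the count modulo 16 equals k) is an assumption.

```python
from typing import Optional, Set, Tuple

NUM_COGS = 16

def find_free_pair(busy: Set[int]) -> Optional[Tuple[int, int]]:
    """Return (cog, cog + 8) for the first pair with both cogs free, else None."""
    half = NUM_COGS // 2
    for cog in range(half):
        partner = cog + half
        if cog not in busy and partner not in busy:
            return cog, partner
    return None  # the caller can fail the launch right away

pair = find_free_pair({0, 9})
print(pair)

# With the donated slot, the first cog is served every 8 clocks instead of 16.
if pair is not None:
    print([clock for clock in range(32) if clock % NUM_COGS in pair])
```

Because the two cogs sit diametrically opposite in the rotation, the donated slot arrives exactly halfway between the first cog's own slots.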
  • Roy ElthamRoy Eltham Posts: 3,000
    edited 2014-04-30 17:58
    I have to agree with Phil on this one. Any form of sharing or remapping or whatever is just going to be a mess.

    I think the only "safe" way would be if Chip can make the hub to cog window wider or occur more often (for all of them equally).
  • cgraceycgracey Posts: 14,208
    edited 2014-04-30 18:32
    Seairth wrote: »
    Of course, you could always use a WAITCNT to bring one cog's instruction cycle in sync with another cog, but it would take a few additional instructions to set up the WAITCNT. It makes me wonder if it would make sense to add a WAITxxx instruction that will sync to the 2 LSBs of CNT.


    Great idea!
  • cgraceycgracey Posts: 14,208
    edited 2014-04-30 18:38
    jmg wrote: »
    Not sure how the OnSemi sim's worked out in Size/speed on the different sized MULT choices Chip was trying ?


    Going from a 16x16 multiplier to a 24x24 increased the total ALU area by only 10%.

    With extra space provided for 68% area utilization (to accommodate buffering and clock tree insertion downstream), the total ALU size w/24-bit multiplier is now 0.25 square mm per cog. That was for 160MHz, which is way faster than the RAMs could run (320 MHz). The RAMs are only good for 250MHz.
  • SeairthSeairth Posts: 2,474
    edited 2014-04-30 18:52
    If you break the fixed round-robin hub-access order, what you'll end up with is utter chaos. Different objects, written by different programmers, will have different "requirements", each competing with the others for hub slots. I cannot see any way that order and determinism could ever prevail in such a scenario.

    If we are willing to give up determinism while still maintaining order, then the approach I detailed in my blog would be relatively straightforward.

    You could even vary the approach slightly, where the assertions are always active for only the active cogs. This would have the effect of having the access window be modulo the number of active cogs.

    This could be the default behavior of the hub. Then, if a particular module requires deterministic timing without having to know about any of the other cogs' activities, it could use a HUBOP to disable this mode (effectively have every cog assert, whether running or not). At which point, you are now running modulo 16, and you have traded off performance for determinism.

    And before anyone comments that this would allow a module requiring determinism to cause issues for a module requiring high hub throughput, those are two conflicting goals and would require a much, much more complicated access scheme. I think the approach I offer provides a fairly simple solution that also allows a fallback to the current approach when determinism is absolutely required.
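The effect of the two modes can be sketched as follows. The details of the blog approach are not given in this thread, so the model covers only the stated outcome: a window modulo the number of active cogs by default, and modulo 16 when determinism is requested. It assumes at least one active cog, and the names are hypothetical.

```python
from typing import Optional, Set

NUM_COGS = 16

def hub_owner(clock: int, active: Set[int], deterministic: bool) -> Optional[int]:
    """Cog served on this clock, or None if the slot goes unused."""
    if deterministic:
        slot = clock % NUM_COGS          # every cog keeps its fixed slot
        return slot if slot in active else None
    ordered = sorted(active)             # only active cogs take turns
    return ordered[clock % len(ordered)]

active = {0, 3, 7}
fast = [hub_owner(clock, active, deterministic=False) for clock in range(6)]
fixed = [hub_owner(clock, active, deterministic=True) for clock in range(8)]

print(fast)
print(fixed)
```

In the default mode each of the three active cogs is served every 3 clocks, but that interval changes whenever a cog starts or stops, which is the determinism being traded away.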
  • potatoheadpotatohead Posts: 10,261
    edited 2014-04-30 19:19
    What makes the whole thing work is that determinism just is. We don't specify it. Known behavior that everybody codes for.

    This discussion comes up now and then and I really think for this design we need to set it off the table.
  • RossHRossH Posts: 5,477
    edited 2014-04-30 19:22
    If you break the fixed round-robin hub-access order, what you'll end up with is utter chaos. Different objects, written by different programmers, will have different "requirements", each competing with the others for hub slots. I cannot see any way that order and determinism could ever prevail in such a scenario.

    -Phil

    Agreed.