
The New 16-Cog, 512KB, 64 analog I/O Propeller Chip


Comments

  • potatoheadpotatohead Posts: 10,254
    edited 2014-04-14 00:07
    BTW: The assembler is simple and expressive enough that people can code directly in hex, mix mnemonic coding, labels, and whatever else they want in one nice file, specifying data in a lot of extremely easy and readable ways.

    That's fantastic! People can use whichever form makes the most sense to them at the time, with no worries.
  • koehlerkoehler Posts: 598
    edited 2014-04-14 00:16
    I'll admit to having not done much assembler in a long time; however, it's nice to see D/S clearly incrementing.

    EDIT- nonsense removed.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-04-14 00:21
    potatohead,
    I think you are missing something...
               MOV     [ptr.D++],[ptr.S++]  
    becomes 2 instructions...
               ALTDS   D,#rrr_ddd_sss  
               MOV     0-0,0-0  
    so a REP instruction is likely to introduce a bug if the programmer forgets to count the inserted instruction.
    You cannot perform MOVI/MOVD/MOVS to a hub instruction, only a cog instruction.
    
    But
    :_LoopS    REP     #count,#_LoopE-_LoopS  
               MOV     [ptr.D++],[ptr.S++]  
               .. other code
    :_LoopE
    overcomes this problem. 
    
    BTW no compiler mod is required for this (the REP instruction, I mean). It should handle the #_LoopE-_LoopS AFAIK.
    So jmg and potatohead can have it whichever way they like.
  • jmgjmg Posts: 15,148
    edited 2014-04-14 00:25
    potatohead wrote: »
    Now instead of having three dead simple to understand instructions MOV, MOVI, MOVS, which by the way have about a decade of common and well understood use, you want to load it all up under MOV, which then becomes one harder to understand thing.

    - only I did not say anywhere that those opcodes would be removed, so your whole point is moot.
  • jmgjmg Posts: 15,148
    edited 2014-04-14 00:31
    Cluso99 wrote: »
    BTW No compiler mod is required for this (the REP instruction I mean). It should handle the #_LoopE-LoopS AFAIK.
    So jmg and potatohead can have it whichever way they like.

    In P2 there was a strict dictate of a preamble/pipeline delay (unclear if that is still the case in the P1+?).

    Which is why the dual-label form, with labels at the actual REP start and REP end, allows the assembler to check that the programmer is meeting the 'fine print' without having to remember it all
    (i.e. the trivial stuff the PC should be doing).

    Importantly, the simplest of edits (inserting, removing, or commenting out lines) remain safe to do.
  • BaggersBaggers Posts: 3,019
    edited 2014-04-14 02:15
    :_LoopS    REP     #count,#_LoopE - _LoopS  
               MOV     [ptr.D++],[ptr.S++]  
              add more code
    :_LoopE
    

    This surely gives #2 ( + size of add more code ) for the rep instruction count.

    I would have done...
        REP #count,#_LoopE - _LoopS
    :_LoopS
        MOV [ptr.D++],[ptr.S++]
        add more code
    :_LoopE
    

    This would give #1 ( + size of add more code ) for the rep instruction count
  • LoopyBytelooseLoopyByteloose Posts: 12,537
    edited 2014-04-14 02:23
    Dave Hein wrote: »
    Loopy, I think most of us need the OBEX. If you really want to kill P1+ just tell everybody they have to program it in Forth. I do plan on porting pfth or Fast to the P1+, but I view that as more of an academic exercise. The real work will be done in Spin, C and PASM.

    Noted, but I have all along thought that Forth would provide a lot of users with useful understanding of the new Propeller.

    I really can't add much constructively to this 16-cog, 512KB, 64 ADC/DAC chip but my enthusiasm. It is exciting news, as it will allow a lot more to get done with just one chip. SoC chips are never going to be anything but an accessory to the Propeller, certainly not a direct competitor -- but other comparable chips have long offered more memory for larger programs.

    Those chips can also load Forth, but on only one CPU. Forth with 8 CPUs is much faster; with 16 CPUs, even better. I guess I am just excited to have this coming soon. It will be a good thing for GCC as well.
  • ctwardellctwardell Posts: 1,716
    edited 2014-04-14 02:33
    How will ALTDS be handled with respect to tasks?

    Will each task have its own state for the associated data?

    Chris Wardell
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-04-14 02:37
    Baggers wrote: »
    :_LoopS    REP     #count,#_LoopE - _LoopS  
               MOV     [ptr.D++],[ptr.S++]  
              add more code
    :_LoopE
    

    This surely gives #2 ( + size of add more code ) for the rep instruction count.

    I would have done...
        REP #count,#_LoopE - _LoopS
    :_LoopS
        MOV [ptr.D++],[ptr.S++]
        add more code
    :_LoopE
    

    This would give #1 ( + size of add more code ) for the rep instruction count
    Of course you are correct! #0 will perform the loop once IIRC. Else #_LoopE - _LoopS - 1
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 06:35
    Looks great - both of your messages are below.

    I'll chew on it a bit; I think I may have an interesting variation on your posts, but I want to try it on paper before I post.
    cgracey wrote: »
    I need to read a bunch of preceding posts to know what has been floated, but I think we only need ONE instruction to handle the whole indirect mechanism:

    ALTDS D,S/# - Selectively alter D and S fields in next instruction by using D as dual 9-bit pointers with S/# specifying the mode

    S/# = %ddd_sss

    ddd = 000: don't alter D field of next instruction
    ddd = 001: alter D field of next instruction by substituting current D[17:9]
    ddd = 010: <some mode we could define>
    ddd = 011: <some mode we could define>
    ddd = 100: alter D field of next instruction by substituting current D[17:9], increment current D[17:9]
    ddd = 101: alter D field of next instruction by substituting current D[17:9], decrement current D[17:9]
    ddd = 110: alter D field of next instruction by substituting current D[17:9]+1, increment current D[17:9]
    ddd = 111: alter D field of next instruction by substituting current D[17:9]-1, decrement current D[17:9]

    sss = 000: don't alter S field of next instruction
    sss = 001: alter S field of next instruction by substituting current D[8:0]
    sss = 010: <some mode we could define>
    sss = 011: <some mode we could define>
    sss = 100: alter S field of next instruction by substituting current D[8:0], increment current D[8:0]
    sss = 101: alter S field of next instruction by substituting current D[8:0], decrement current D[8:0]
    sss = 110: alter S field of next instruction by substituting current D[8:0]+1, increment current D[8:0]
    sss = 111: alter S field of next instruction by substituting current D[8:0]-1, decrement current D[8:0]


    Usage could be made simple by the assembler:
    MOVS    ptr,#from
    MOVD    ptr,#to
    REP     #count,#2
    MOV     [ptr++],[ptr++]           'move using pointers in ptr, this is actually two instructions: ALTDS and MOV
    

    cgracey wrote: »
    Great idea!!!

    We've only got six bits specified for S/# in ALTDS, so we can use the three bits above to specify write-register alteration, with D[31:23] serving as the pointer for write redirection:

    ALTDS D,S/#

    S/# = %rrr_ddd_sss

    rrr = same as ddd/sss, but uses D[31:23] as a write redirection pointer.

    MOVI D,S/# can be used to set D[31:23]

    Now we've got it all in one instruction!!!
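
    A minimal C sketch of the substitution described in the two quotes above, for anyone following along. This is a software model only, with made-up function names, and the rrr write-redirection extension is omitted:

    #include <stdint.h>

    /* Model one 3-bit ddd/sss selector from the proposed ALTDS encoding.
       'ptr' is the ALTDS D register holding two packed 9-bit pointers
       (D pointer in bits 17:9, S pointer in bits 8:0); 'shift' is 9 for the
       D pointer and 0 for the S pointer; 'field' is the 9-bit field of the
       next instruction.  Returns the field value to use and updates the
       pointer in place. */
    static uint32_t alt_field(uint32_t *ptr, int shift, unsigned mode, uint32_t field)
    {
        uint32_t p    = (*ptr >> shift) & 0x1FF;   /* current 9-bit pointer   */
        uint32_t out  = field;                     /* %000: leave field alone */
        uint32_t next = p;

        switch (mode & 7) {
        case 1: out = p;                   break;  /* substitute              */
        case 4: out = p;     next = p + 1; break;  /* substitute, increment   */
        case 5: out = p;     next = p - 1; break;  /* substitute, decrement   */
        case 6: out = p + 1; next = p + 1; break;  /* substitute+1, increment */
        case 7: out = p - 1; next = p - 1; break;  /* substitute-1, decrement */
        default: break;                            /* %010/%011 left undefined */
        }
        *ptr = (*ptr & ~(0x1FFu << shift)) | ((next & 0x1FF) << shift);
        return out & 0x1FF;
    }

    /* Apply ALTDS ptr,#%ddd_sss to the next instruction's D and S fields. */
    static void altds(uint32_t *ptr, unsigned ds_mode, uint32_t *d_field, uint32_t *s_field)
    {
        *d_field = alt_field(ptr, 9, (ds_mode >> 3) & 7, *d_field);
        *s_field = alt_field(ptr, 0,  ds_mode       & 7, *s_field);
    }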
  • cgraceycgracey Posts: 14,133
    edited 2014-04-14 06:43
    ctwardell wrote: »
    How will ALTDS be handled with respect to tasks?

    Will each task have its own state for the associated data?

    Chris Wardell


    Either every task must have its own set of state data for ALTDS, or else we have just one set of state data that remembers what task is using it. I think the latter might be fine.

    Another thing: RDxxxxC aren't going to work anymore because there's no time to interpret the opcode and substitute the DCACHE address into the D field. This is the exact same dilemma that INDA/INDB suffered from, and that ALTDS gets us around. We have 4-register transfers now via RDQUAD and WRQUAD, but we've lost the convenience of RDxxxxC.
  • cgraceycgracey Posts: 14,133
    edited 2014-04-14 06:47
    Regarding REPS and the code-size matter, remember that there's the $ for origin:
    	REPS	#count,#:end-$
    	inst
    	inst
    :end	inst
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 06:59
    To work around this (and not lose byte code interpreter / word code interpreter speedup due to cache) how about:

    movbf dest wc
    movwf dest wc

    that walk the quad d-cache, setting C if the index goes past the end, at which time it resets to point at the start of the dcache?

    inner loop becomes:
    init:  rdquad pcode
    
    ' rest of init code
    
    next: movbf opcode wc ' basically round-robin read of bytes in quad, wraps, sets C when wrapping
    
     if_c  rdlong pcode++   ' only executed 1/32 of the time
    
    .. decode instructions
            jmp #next
    



    cgracey wrote: »
    Either every task must have its own set of state data for ALTDS, or else we have just one set of state data that remembers what task is using it. I think the latter might be fine.

    Another thing: RDxxxxC aren't going to work anymore because there's no time to interpret the opcode and substitute the DCACHE address into the D field. This is the exact same dilemma that INDA/INDB suffered from, and that ALTDS gets us around. We have 4-register transfers now via RDQUAD and WRQUAD, but we've lost the convenience of RDxxxxC.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 07:00
    Or:
    	REPS	#count    ' assembler hides end-$ computation
      	   inst
    	   inst
               inst
            ENDR
    
    cgracey wrote: »
    Regarding REPS and the code-size matter, remember that there's the $ for origin:
    	REPS	#count,#:end-$
    	inst
    	inst
    :end	inst
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 07:02
    Alternate 'ALT' proposal:

    Prop style:

    ALT D,S/#
    INST D,S/#

    Three operand style:

    ALT A,B/#B
    INST C,#MODE

    - C is the destination, A & B are sources
    - A&B are specified earlier, so more time to do the op

    MODE could be 10 bits, because it would always be immediate - so the I bit can be re-purposed.

    Proposed MODE encoding:

    %AABBCC_NNNN

    Where:

    AA/BB/CC:

    00=use AA/BB/CC directly, no offset, no updating
    01=add NNNN to AA/BB/CC before using pointer, DO NOT update register
    10=add NNNN to AA/BB/CC before using pointer, update register
    11=add NNNN to AA/BB/CC after using pointer, update register

    Alternate encoding for NNN, leaves 8 more possible modes:

    SNNN = 0, +1, +2, +4, +16, -1, -2, -4, -16

    For two-op instructions, i.e.

    MOV C, A
    RDxxx C,A

    perhaps B could be used as an optional index?
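
    To make the %AABBCC_NNNN modes above concrete, here is a small C sketch of how one 2-bit selector plus the shared NNNN offset might resolve an operand. Illustrative only; the function name and 9-bit masking are assumptions, not part of the proposal:

    #include <stdint.h>

    /* One operand's addressing per the proposed %AABBCC_NNNN encoding.
       'sel' is the 2-bit AA/BB/CC selector, 'nnnn' the signed offset.
       Returns the effective 9-bit register address and updates *reg when
       the mode calls for it. */
    static uint32_t alt_effective(uint32_t *reg, unsigned sel, int nnnn)
    {
        uint32_t ea = *reg;

        switch (sel & 3) {
        case 0:                   /* %00: use register directly, no update     */
            break;
        case 1:                   /* %01: add NNNN before use, do NOT update   */
            ea = *reg + nnnn;
            break;
        case 2:                   /* %10: add NNNN before use, update register */
            ea = *reg + nnnn;
            *reg = ea;
            break;
        case 3:                   /* %11: add NNNN after use, update register  */
            *reg = *reg + nnnn;
            break;
        }
        return ea & 0x1FF;
    }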
  • ElectrodudeElectrodude Posts: 1,621
    edited 2014-04-14 07:07
    cgracey wrote: »
    Another thing: RDxxxxC aren't going to work anymore because there's no time to interpret the opcode and substitute the DCACHE address into the D field. This is the exact same dilemma that INDA/INDB suffered from, and that ALTDS gets us around. We have 4-register transfers now via RDQUAD and WRQUAD, but we've lost the convenience of RDxxxxC.

    Do we even need indirect access for RDxxxxC? The S field isn't immediate. If you need indirect addressing for a RDxxxxC, just use
    mov temp, [inda]
    rdbytec x, temp
    

    If someone tries rdbytec x, [inda], just say it's undefined behavior (or make the indirection just not happen).

    electrodude
  • cgraceycgracey Posts: 14,133
    edited 2014-04-14 07:23
    To work around this (and not lose byte code interpreter / word code interpreter speedup due to cache) how about:

    movbf dest wc
    movwf dest wc

    that walk the quad d-cache, setting C if the index goes past the end, at which time it resets to point at the start of the dcache?

    inner loop becomes:
    init:  rdquad pcode
    
    ' rest of init code
    
    next: movbf opcode wc ' basically round-robin read of bytes in quad, wraps, sets C when wrapping
    
     if_c  rdlong pcode++   ' only executed 1/32 of the time
    
    .. decode instructions
            jmp #next
    


    Good idea. The quad address could be expressed in D and then an index that is reset by RD/WRQUAD would pick the byte/word/long, setting C when rollover occurs.

    I like the +1/-1 without affecting the D field, for ALTDS.
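
    A toy C model of that index-and-rollover idea, purely to illustrate it (the movbf name and quad_cache_t type are invented here; the real mechanism would live in cog hardware, not code):

    #include <stdint.h>
    #include <stdbool.h>

    /* A quad is four longs (16 bytes).  Each MOVBF-style fetch returns the
       next byte and reports a wrap (the C flag) when the index rolls over,
       which is the cue to issue the next RDQUAD. */
    typedef struct {
        uint32_t quad[4];   /* last RDQUAD result                */
        unsigned index;     /* byte index 0..15, reset by RDQUAD */
    } quad_cache_t;

    static uint8_t movbf(quad_cache_t *qc, bool *wrapped)
    {
        uint8_t b = (qc->quad[qc->index >> 2] >> ((qc->index & 3) * 8)) & 0xFF;
        qc->index = (qc->index + 1) & 15;
        *wrapped  = (qc->index == 0);   /* C set: time to fetch the next quad */
        return b;
    }

    An interpreter inner loop would call this once per bytecode and only touch the hub when the wrap comes back set, which is the speedup being discussed.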
  • potatoheadpotatohead Posts: 10,254
    edited 2014-04-14 07:39
    @Bill: +1, prefer this to anything else said so far.


    @JMG: My apologies BTW. Let's say some outside factors were affecting discussion. :) I deleted the crappy post first chance I got.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 07:39
    Thanks Chip.

    I was thinking about everything with P16X512, and took a mental step back.

    1) As far as I understand it, the issues with INDA/INDB/RDxxxC are simply not having enough time in the two cycle instructions to do all the work in time (without lots of complexity, or big drop in clock speed).

    2) I've caught myself making calculation mistakes due to the 200MHz/100MIPS dichotomy, with the 2+ cycle instructions

    I think there may be a solution that does not involve prefix instructions and simplifies things... but I don't know how much it affects Verilog or gate count, so please let me know :)

    Here is what I think may work better:

    We always talk in terms of clock cycles. No more cycle count confusion. We already have hub instructions etc that take more than 2 cycles - and as long as we know the cycle count, it is still deterministic

    INDA/INDB:

    Add a clock cycle. It should be easy to detect if an instruction refers to the INDA/INDB registers, so add an index-compute cycle. Heck, add two if needed, because we save a LOT of memory using one op instead of two.

    RDxxxxC:

    Add a clock cycle. Still cheaper than adding more instructions to do the same thing, still faster. Hopefully that cycle is enough in case INDx is also used, but if not, add another cycle. Still saves a lot of memory.

    Much faster than losing cached data reads, and takes less memory than my movf-reads-quad suggestion.

    The point is that even if some instruction cycle counts increase (if using INDx or RDxxxxC), it will still be faster than not having them, or adding instructions... and use a lot less memory, so more fits in a P16X512.
    cgracey wrote: »
    Good idea. The quad address could be expressed in D and then an index that is reset by RD/WRQUAD would pick the byte/word/long, setting C when rollover occurs.

    I like the +1/-1 without affecting the D field, for ALTDS.
  • Brian FairchildBrian Fairchild Posts: 549
    edited 2014-04-14 08:05
    Q1) Am I right in thinking that, as things stand, cores running out of hub ram will run at 50 MIPS and that all 16 cores will be able to run at that rate simultaneously?

    Q2) And that when running from registers they will run at 100MIPS?


    Finding the current spec is getting harder by the day :)
  • Dave HeinDave Hein Posts: 6,347
    edited 2014-04-14 08:24
    I believe this is correct assuming the clock rate is 200 MHz. The 50 MIPS for hubex is for straight-line code, where all 4 longs in a quad are executed.
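
    (Rough arithmetic behind those numbers, assuming a 200 MHz clock: 2 clocks per register instruction gives 200/2 = 100 MIPS per cog, and 50 MIPS for straight-line hub execution works out to 4 clocks per instruction on average, i.e. one quad of 4 instructions per 16-clock hub rotation if each of the 16 cogs gets one slot per rotation.)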
  • BaggersBaggers Posts: 3,019
    edited 2014-04-14 08:52
    One question I have about HUBEXEC, apologies in advance if it's been explained and I've missed it in the masses of the various threads.

    Say your first of 4 instructions in the quad is a RDQUAD: does it have to wait for the next slot, then read the desired quad, then re-read the initial instruction's quad to continue, or is it still in cache somewhere?

    I guess that's two questions in one lol

    So to clarify... the two questions are :-

    1. Assuming the HUBEXEC read takes that cog's HUBRAM slot, if any of those 4 instructions uses a HUB-OP, does it have to wait for the next free slot (i.e. if it's the first instruction, does it have to wait 3 instructions to do the HUB-OP, thus delaying execution, albeit still deterministically)?
    2. If you have a HUBOP in one of the four instructions, does it need to re-read the quad of instructions, or are they cached, even if one of them reads a quad in the meantime?
  • ctwardellctwardell Posts: 1,716
    edited 2014-04-14 08:59
    Multi-phase instructions?

    Thought of this as a possible way to implement Bill's suggestion in #830.

    If an opcode has sufficient space to contain all the data needed for an operation, but timing prevents doing the operation in 2 clocks, what about having instructions that execute by taking two trips through the 'pipeline'?

    The first trip through does part of the processing, the second completes the processing.

    More detail to follow...

    C.W.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 09:15
    1: yes

    2: no
    Baggers wrote: »
    One question I have about HUBEXEC, apologies in advance if it's been explained and I've missed it in the masses of the various threads.

    Say your first of 4 instructions in the quad is a RDQUAD: does it have to wait for the next slot, then read the desired quad, then re-read the initial instruction's quad to continue, or is it still in cache somewhere?

    I guess that's two questions in one lol

    So to clarify... the two questions are :-

    1. Assuming the HUBEXEC read takes that cog's HUBRAM slot, if any of those 4 instructions uses a HUB-OP, does it have to wait for the next free slot (i.e. if it's the first instruction, does it have to wait 3 instructions to do the HUB-OP, thus delaying execution, albeit still deterministically)?
    2. If you have a HUBOP in one of the four instructions, does it need to re-read the quad of instructions, or are they cached, even if one of them reads a quad in the meantime?
  • cgraceycgracey Posts: 14,133
    edited 2014-04-14 09:50
    The problem with adding a 3rd clock to indirect instructions is that it takes us right back to the INDA/INDB situation where we have to analyze the instruction data currently being read, in order to issue an optional change-of-D-register before the next clock. That just tacks time onto the clock cycle.

    We can design all the logic so that it is faster than the RAMs, the RAMs being things we cannot make go any faster - they can define the critical path, while we stay out of the way. These RAMs can actually clock at 250MHz+ and if we can keep logic out of their paths, we can easily go there.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 10:01
    Thank you, makes sense now.

    If feasible it would be nice to not require prefix instructions, and not lose RDxxxxC, INDA/INDB for performance and code density.

    If not feasible, then it is not feasible :)

    Don't the D/S addresses already have to be checked for the "special register range" for multiplexing special registers instead of the shadow registers (currently used for cache if I correctly recall)? Does that not give you what you would need for an optional change-of-D?

    The reason I'd hate to lose INDx is that it makes for much faster table lookup code, cog-based stacks, etc., than not having it. If it can't be done in 2 cycles, 3 or 4 is still far preferable to self-modifying code for 99% of possible cases.

    Same for RDxxxC - if two clocks is not feasible, even if it had to go to 4 clocks, it is still much better than 16 clocks for the next hub cycle.

    Mind you, resurrecting your movef{b/w/l} on the quad can take care of the RDxxxC cases, at the expense of more complicated code and slightly lower code density, and an ALT variation can substitute for INDx at the expense of memory and speed.

    Only you (and deep diving your Verilog) can figure out the best option :)

    p.s.

    The modified ALT would be a very handy addition to INDA/INDB, as it would effectively provide MANY additional, slower IND registers.
    cgracey wrote: »
    The problem with adding a 3rd clock to indirect instructions is that it takes us right back to the INDA/INDB situation where we have to analyze the instruction data currently being read, in order to issue an optional change-of-D-register before the next clock. That just tacks time onto the clock cycle.

    We can design all the logic so that it is faster than the RAMs, the RAMs being things we cannot make go any faster - they can define the critical path, while we stay out of the way. These RAMs can actually clock at 250MHz+ and if we can keep logic out of their paths, we can easily go there.
  • cgraceycgracey Posts: 14,133
    edited 2014-04-14 10:22
    ...Don't the D/S addresses already have to be checked for the "special register range" for multiplexing special registers instead of the shadow registers (currently used for cache if I correctly recall)? Does that not give you what you would need for an optional change-of-D?


    The trouble is, we get only one early shot at reading D and S registers. To make everything go as fast as the RAMs, we need to feed the instruction data bits coming out of the RAM straight back into the address inputs. There is a mux there, of course, to accommodate the two phases of memory access, but its selector is ready long before the data passes through. To do some logic based on the instruction bits, then drive a mux (also needs buffering, takes time) with the result, would be very slow. The special registers, on the other hand, are mux'd after D and S are read.
  • David BetzDavid Betz Posts: 14,511
    edited 2014-04-14 10:27
    Chip,

    At one point you, or maybe it was Ken, suggested that you might make the RTL for P1 available after P2 shipped. Now it seems that the RTL for P1+ is going to be an extension of the RTL for P1. Do you still plan to release any RTL either before or after you ship the next chip? Did you by any chance archive the RTL for P1 before you started morphing it into P1+ or P2 or whatever the chip being described in this thread will be called?

    Thanks,
    David
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 10:36
    Thanks Chip, I am learning a lot from you about the guts of P16x512 !!

    Well, ALT is a nice instruction, and movf{b,w,l} with the mod addressing of the quad will allow pretty good performance, and we will have 512KB to play with :)
    cgracey wrote: »
    The trouble is, we get only one early shot at reading D and S registers. To make everything go as fast as the RAMs, we need to feed the instruction data bits coming out of the RAM straight back into the address inputs. There is a mux there, of course, to accommodate the two phases of memory access, but its selector is ready long before the data passes through. To do some logic based on the instruction bits, then drive a mux (also needs buffering, takes time) with the result, would be very slow. The special registers, on the other hand, are mux'd after D and S are read.
  • ctwardellctwardell Posts: 1,716
    edited 2014-04-14 10:59
    More detail on the multi-phase instructions I mentioned in post #834.

    This is intended as a possibly simple way to implement instructions that would have issues due to the timing problem mentioned by Chip:
    cgracey wrote: »
    The trouble is, we get only one early shot at reading D and S registers. To make everything go as fast as the RAMs, we need to feed the instruction data bits coming out of the RAM straight back into the address inputs. There is a mux there, of course, to accommodate the two phases of memory access, but its selector is ready long before the data passes through. To do some logic based on the instruction bits, then drive a mux (also needs buffering, takes time) with the result, would be very slow. The special registers, on the other hand, are mux'd after D and S are read.

    The instruction would basically loop to itself one time, doing a different operation depending on whether it is the first or second execution.

    So for an INDA/INDB situation, the first pass would compute the actual D/S values based on the indirect values.
    The second pass would then use the computed D/S values.

    Without knowing the implementation details of the P1+ it is hard to suggest an actual implementation, but conceptually I see something like this.

    - An instruction using INDA/INDB is executed
    - Since this uses INDA/INDB it is treated as a two-phase instruction.
    - Phase one computes the absolute values and places an absolute version of the instruction in an alternate instruction register and does not increment the PC.
    - The instruction in the alternate instruction register is fetched instead of from memory and is executed like normal.

    C.W.
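
    For illustration only, a rough C model of that two-pass idea, with made-up register numbers and helper names (nothing here reflects the actual P1+ pipeline):

    #include <stdint.h>
    #include <stdbool.h>

    #define IND_A 0x1F6u    /* placeholder cog addresses for INDA/INDB; */
    #define IND_B 0x1F7u    /* purely illustrative, not the real ones   */

    typedef struct {
        uint32_t pc;
        uint32_t inda, indb;    /* indirect pointers                         */
        uint32_t alt_ir;        /* alternate instruction register            */
        bool     alt_pending;   /* next fetch comes from alt_ir, not cog RAM */
    } cog_model_t;

    /* Rewrite a 9-bit field (at 'shift') with the resolved pointer when it
       names INDA or INDB. */
    static uint32_t resolve_field(uint32_t ir, int shift, uint32_t inda, uint32_t indb)
    {
        uint32_t f = (ir >> shift) & 0x1FF;
        if (f == IND_A)      f = inda & 0x1FF;
        else if (f == IND_B) f = indb & 0x1FF;
        return (ir & ~(0x1FFu << shift)) | (f << shift);
    }

    /* One fetch of the two-phase scheme: pass one rewrites D/S into alt_ir
       and holds the PC; pass two hands back the rewritten word to execute
       as an ordinary, fully absolute instruction. */
    static uint32_t fetch(cog_model_t *c, const uint32_t *cogram)
    {
        if (c->alt_pending) {                   /* phase two                        */
            c->alt_pending = false;
            c->pc++;
            return c->alt_ir;
        }
        uint32_t ir = cogram[c->pc];
        uint32_t rw = resolve_field(resolve_field(ir, 9, c->inda, c->indb),
                                    0, c->inda, c->indb);
        if (rw != ir) {                         /* phase one: uses INDA/INDB        */
            c->alt_ir = rw;
            c->alt_pending = true;              /* PC stays put for the second pass */
            return 0;                           /* stands in for "nothing executes yet" */
        }
        c->pc++;
        return ir;                              /* ordinary single-phase instruction */
    }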