The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Page 29 — Parallax Forums

The New 16-Cog, 512KB, 64 analog I/O Propeller Chip


Comments

  • cgraceycgracey Posts: 14,212
    edited 2014-04-14 13:18
    David Betz wrote: »
    Chip,

    At one point you, or maybe it was Ken, suggested that you might make the RTL for P1 available after P2 shipped. Now it seems that the RTL for P1+ is going to be an extension of the RTL for P1. Do you still plan to release any RTL either before or after you ship the next chip? Did you by any chance archive the RTL for P1 before you started morphing it into P1+ or P2 or whatever the chip being described in this thread will be called?

    Thanks,
    David


    We plan to release Prop1 code, at first.
  • David BetzDavid Betz Posts: 14,516
    edited 2014-04-14 13:38
    cgracey wrote: »
    We plan to release Prop1 code, at first.
    Great! I'm glad that's still part of the plan.
  • SapiehaSapieha Posts: 2,964
    edited 2014-04-14 13:42
    Hi Chip.

    Good to hear
    That would be good for my serial-comms experiments.
    cgracey wrote: »
    We plan to release Prop1 code, at first.
  • cgraceycgracey Posts: 14,212
    edited 2014-04-14 13:51
    I made a mistake in thinking about and explaining the indirect issues. We CAN do an extra clock to achieve indirection, without any clock speed penalty. We would let the first clock go through, reading D and S according to the D and S fields in the instruction. Then, on the next clock, we could issue the reads for indirect D and S using some implied hidden registers. This would help speed and code density, but would give us only maybe two indirect registers. I kind of like the universal approach using two instructions, but it is wasteful, compared to what could be done with implied indirection via special INDA/INDB register locations.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 14:08
    EXCELLENT NEWS!!!

    I knew you would figure out a way.

    Ummm... could we have our cake, and eat it too?

    Adding the universal ALT instruction gives us MANY additional pointers, and we can use INDA/INDB for the highest speed uses.
    cgracey wrote: »
    I made a mistake in thinking about and explaining the indirect issues. We CAN do an extra clock to achieve indirection, without any clock speed penalty. We would let the first clock go through, reading D and S according to the D and S fields in the instruction. Then, on the next clock, we could issue the reads for indirect D and S using some implied hidden registers. This would help speed and code density, but would give us only maybe two indirect registers. I kind of like the universal approach using two instructions, but it is wasteful, compared to what could be done with implied indirection via special INDA/INDB register locations.
  • tonyp12tonyp12 Posts: 1,951
    edited 2014-04-14 14:48
    > but it is wasteful, compared to what could be done with implied indirection via special INDA/INDB register
    But does INDA/INDB have optional post-inc/pre-dec flags? Any space saved is lost if you have to put a SUB before or an ADD after the INDx instruction.
  • jmgjmg Posts: 15,175
    edited 2014-04-14 15:14
    cgracey wrote: »
    Regarding REPS and the code-size matter, remember that there's the $ for origin:
    	REPS	#count,#:end-$
    	inst
    	inst
    :end	inst
    

    The problem with this simpler, vanilla form is that the code fails if the last inst opcode is a double-size one.
    Is there still a delay following REPS before the looping block, or has that gone ?

    If REPS now starts immediately, then a single label form is ok, a finite delay needs two labels.
    	REPS	#count,EndLoop
    	inst
    	inst
    	inst  ' can be one or two sized inst
    :EndLoop
    

    Addition:
    Or:
    	REPS	#count    ' assembler hides end-$ computation
      	   inst
    	   inst
               inst
            ENDR
    

    yes, that also works well, if there is no lead-in delay on REPS
    I think REPS cannot be nested, so this form is fine, and if anyone does REPS..REPS ENDR ENDR it can spit an error
  • jmgjmg Posts: 15,175
    edited 2014-04-14 15:25
    EXCELLENT NEWS!!!

    I knew you would figure out a way.

    Ummm... could we have our cake, and eat it too?

    Adding the universal ALT instruction gives us MANY additional pointers, and we can use INDA/INDB for the highest speed uses.

    I was wondering the same thing.

    I'm unclear if Chip is meaning an extra Opcode clock (4 SysClk Opcode) or an extra SysClk (3 CycClk opcode) in #645?
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 15:37
    I think 3 cycclk

    fyi,

    I think everything should be documented in sysclocks, I've been confused a number of times, which suggests to me that those new to the prop will generally go HUH??? at the two clocks.
    jmg wrote: »
    I was wondering the same thing.

    I'm unclear if Chip is meaning an extra Opcode clock (4 SysClk Opcode) or an extra SysClk (3 CycClk opcode) in #645 ?
  • jmgjmg Posts: 15,175
    edited 2014-04-14 15:52
    I think everything should be documented in sysclocks, I've been confused a number of times, which suggests to me that those new to the prop will generally go HUH??? at the two clocks.

    Yes, and if there are going to be 3 SysClk opcodes, that gives little choice, as you cannot really spec 1.5 OpCodeClks ?
    That would make mnemonics 2/3/4 SysClks in speed.
  • potatoheadpotatohead Posts: 10,261
    edited 2014-04-14 15:54
    I agree with this. It is confusing. Sysclocks would be unambiguous.

    Is there a delay now on REP? I thought I read we don't have one sans the complex pipeline. If that is true, I like the form Bill posted last the best. (unable to quote on this device in any sane way)
  • cgraceycgracey Posts: 14,212
    edited 2014-04-14 16:53
    potatohead wrote: »
    I agree with this. It is confusing. Sysclocks would be unambiguous.

    Is there a delay now on REP? I thought I read we don't have one sans the complex pipeline. If that is true, I like the form Bill posted last the best. (unable to quote on this device in any sane way)


    REP has no delay slots in this design, since there's no pipeline.

    The way we can get INDA/INDB to have all the different pre/post-inc/dec possibilities is to make 8 registers worth of INDx registers:

    $1F8 = INDA
    $1F9 = INDA++
    $1FA = INDA--
    $1FB = ++INDA
    $1FC = INDB
    $1FD = INDB++
    $1FE = INDB--
    $1FF = ++INDB

    MOV INDB++,INDA++ ...same as... MOV $1FD,$1F9
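
To make the alias idea concrete, here is a minimal Python sketch of how the eight INDx addresses in the table above might behave. It is only a behavioural model, not the silicon: the `Cog` class, the `resolve` helper, the choice to resolve S before D, and the 9-bit pointer wrap are all assumptions made for illustration.

```python
# Hypothetical model of the eight INDx alias addresses from the table above.
# Each alias maps to (pointer name, pre-adjust, post-adjust).
INDX_ALIASES = {
    0x1F8: ("A", 0, 0),    # INDA
    0x1F9: ("A", 0, +1),   # INDA++
    0x1FA: ("A", 0, -1),   # INDA--
    0x1FB: ("A", +1, 0),   # ++INDA
    0x1FC: ("B", 0, 0),    # INDB
    0x1FD: ("B", 0, +1),   # INDB++
    0x1FE: ("B", 0, -1),   # INDB--
    0x1FF: ("B", +1, 0),   # ++INDB
}


class Cog:
    def __init__(self):
        self.ram = [0] * 512            # 512 longs of cog RAM
        self.ptr = {"A": 0, "B": 0}     # hidden INDA/INDB pointers

    def resolve(self, addr):
        """Return the effective register address, updating a pointer if addr is an alias."""
        if addr not in INDX_ALIASES:
            return addr
        name, pre, post = INDX_ALIASES[addr]
        self.ptr[name] = (self.ptr[name] + pre) & 0x1FF   # assumed 9-bit wrap
        effective = self.ptr[name]
        self.ptr[name] = (effective + post) & 0x1FF
        return effective

    def mov(self, d, s):
        """MOV D,S with both fields passed through the alias decoder (S first, an assumption)."""
        value = self.ram[self.resolve(s)]
        self.ram[self.resolve(d)] = value
```

With INDA pointing at a source block and INDB at a destination block, repeating `cog.mov(0x1FD, 0x1F9)` copies one long per call and advances both pointers, which is the block-copy use the post-increment aliases are aimed at.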
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 17:08
    How about SETINDMOD #bbbaaa

    aaa

    0xx = use INDA value directly
    100 = INDA++
    101 = INDA--
    110 = ++INDA
    111 = --INDA

    bbb

    0xx = use INDB value directly
    100 = INDB++
    101 = INDB--
    110 = ++INDB
    111 = --INDB

    That way there is no need to have so many registers...
    cgracey wrote: »
    REP has no delay slots in this design, since there's no pipeline.

    The way we can get INDA/INDB to have all the different pre/post-inc/dec possibilities is to make 8 registers worth of INDx registers:

    $1F8 = INDA
    $1F9 = INDA++
    $1FA = INDA--
    $1FB = ++INDA
    $1FC = INDB
    $1FD = INDB++
    $1FE = INDB--
    $1FF = ++INDB

    MOV INDB++,INDA++ ...same as... MOV $1FD,$1F9
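
A small Python sketch of how the 6-bit SETINDMOD operand proposed in this post could be decoded. The function names are hypothetical, and it assumes that --INDx takes code 111, so that each of the four update modes gets its own code.

```python
# Hypothetical decoder for the proposed SETINDMOD #bbbaaa operand.
MODES = {
    0b100: "post-inc",   # INDx++
    0b101: "post-dec",   # INDx--
    0b110: "pre-inc",    # ++INDx
    0b111: "pre-dec",    # --INDx (assumed encoding)
}


def decode_field(bits):
    """Decode one 3-bit mode field; 0xx means the pointer is used unchanged."""
    if bits & 0b100 == 0:
        return "direct"
    return MODES[bits]


def decode_setindmod(imm6):
    """Split #bbbaaa into the INDA (low 3 bits) and INDB (high 3 bits) modes."""
    aaa = imm6 & 0b111
    bbb = (imm6 >> 3) & 0b111
    return {"INDA": decode_field(aaa), "INDB": decode_field(bbb)}
```

For example, `decode_setindmod(0b100110)` selects pre-increment for INDA and post-increment for INDB.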
  • cgraceycgracey Posts: 14,212
    edited 2014-04-14 20:17
    How about SETINDMOD #bbbaaa

    aaa

    0xx = use INDA value directly
    100 = INDA++
    101 = INDA--
    110 = ++INDA
    111 = --INDA

    bbb

    0xx = use INDB value directly
    100 = INDB++
    101 = INDB--
    110 = ++INDB
    111 = --INDB

    That way there is no need to have so many registers...


    But we need to be able to do stuff like JMPSW INDA,++INDA.

    Hey, I just realized that we had cog RAM stacks in the Prop2, all along:

    JMPSW INDA--,ADR = CALL ADR
    JMP ++INDA = RET

    This takes fewer transistors than a hardware LIFO.
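
The CALL/RET pairing above can be sketched in Python as follows. This assumes JMPSW D,S stores the return address (PC+1) in D and jumps to S, and it ignores the flags that a real return word would likely carry; the `CogStack` class and the starting stack address are made up for illustration.

```python
class CogStack:
    """Hypothetical model of a call stack kept in cog RAM via INDA."""

    def __init__(self, top=0x1EF):
        self.ram = [0] * 512
        self.inda = top      # stack grows downward from here (arbitrary choice)
        self.pc = 0

    def call(self, target):
        # JMPSW INDA--,target : write return address at [INDA], then decrement INDA
        self.ram[self.inda] = self.pc + 1
        self.inda = (self.inda - 1) & 0x1FF
        self.pc = target

    def ret(self):
        # JMP ++INDA : increment INDA first, then jump to the address stored there
        self.inda = (self.inda + 1) & 0x1FF
        self.pc = self.ram[self.inda]
```

Nested calls unwind in reverse order because the post-decrement on CALL and the pre-increment on RET always address the same slot.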
  • jmgjmg Posts: 15,175
    edited 2014-04-14 20:29
    That way there is no need to have so many registers...

    Is there a spare opcode bit, that can re-map these to give the decode choices, but not to consume (valuable) register address space with fixed-use-locations ?
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 20:29
    True!

    I think you can safely deep-six the four level stack.

    Good point about task switching. So how about
    How about SETINDMOD #ddd,#sss
    
    ddd - applies to whichever index register is used as the destination
    
    0xx = use INDd value directly, d=A/B
    100 = INDd++
    101 = INDd--
    110 = ++INDd
    111 = --INDd
    
    sss - applies to whichever index register is used as the source
    
    0xx = use INDs value directly, s=A/B
    100 = INDs++
    101 = INDs--
    110 = ++INDs
    111 = --INDs
    
    That way there is no need to have so many registers... and the INDMOD stays in effect until next SETINDMOD
    
    cgracey wrote: »
    But we need to be able to do stuff like JMPSW INDA,++INDA.

    Hey, I just realized that we had cog RAM stacks in the Prop2, all along:

    JMPSW INDA--,ADR = CALL ADR
    JMP ++INDA = RET

    This takes fewer transistors than a hardware LIFO.
  • cgraceycgracey Posts: 14,212
    edited 2014-04-14 21:08
    jmg wrote: »
    Is there a spare opcode bit, that can re-map these to give the decode choices, but not to consume (valuable) register address space with fixed-use-locations ?


    The only way we could get a spare bit, or two, would be to reduce cog RAM to 256 longs. Then, we could get indirect bits for both D and S. And cog size would be cut in half. It might be viable with hub exec, you know. 32 cogs!
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 21:12
    Umm.. noooooooooooooooooooooooooooooo

    With 512 cog locations (minus shadow regs) it is still barely possible to do a 256 entry lookup table for vm's using cog memory, without taking a hub cycle hit.

    16 cogs with 512 registers, and tasks, is FAR more useful than 32 cogs with 256 registers.
    cgracey wrote: »
    The only way we could get a spare bit, or two, would be to reduce cog RAM to 256 longs. Then, we could get indirect bits for both D and S. And cog size would be cut in half. It might be viable with hub exec, you know. 32 cogs!
  • cgraceycgracey Posts: 14,212
    edited 2014-04-14 21:14
    True!

    I think you can safely deep-six the four level stack.

    Good point about task switching. So how about
    How about SETINDMOD #ddd,#sss
    
    ddd - applies to whichever index register is used as the destination
    
    0xx = use INDd value directly, d=A/B
    100 = INDd++
    101 = INDd--
    110 = ++INDd
    111 = --INDd
    
    sss - applies to whichever index register is used as the source
    
    0xx = use INDs value directly, s=A/B
    100 = INDs++
    101 = INDs--
    110 = ++INDs
    111 = --INDs
    
    That way there is no need to have so many registers... and the INDMOD stays in effect until next SETINDMOD
    


    Good idea on the mode bits. If we gave two registers per INDx, we could pick between two of the preset modes. That would enable stacks.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-04-14 21:24
    Sounds good!
    cgracey wrote: »
    Good idea on the mode bits. If we gave two registers per INDx, we could pick between two of the preset modes. That would enable stacks.
  • jmgjmg Posts: 15,175
    edited 2014-04-14 21:30
    cgracey wrote: »
    The only way we could get a spare bit, or two, would be to reduce cog RAM to 256 longs. Then, we could get indirect bits for both D and S. And cog size would be cut in half. It might be viable with hub exec, you know. 32 cogs!

    That sounds too costly, I was thinking more along the lines of 'blurring' the 9b + 9b opcode fields to something like 10b+ 8b in some sparse cases, where that split allows things like dedicated pointers/sfr (special function registers) 'above' usual 512 limit, and now the #Immediate value is smaller, & the source register form is limited to one of the lower 256.
    -but it has avoided eating-into general purpose RAM.
  • AribaAriba Posts: 2,690
    edited 2014-04-14 21:31
    cgracey wrote: »
    But we need to be able to do stuff like JMPSW INDA,++INDA.
    ....

    If this is for task switching, do we then not also need the automatic wrapping inside a table (FIXINDx instructions) ?
    I have the feeling this needs a lot of logic gates (comparators, muxes, registers) ?

    I think we can always do it with hardcoded task switches instead of a task-list. For example with 3 tasks:
    jmpsw task1,task2
       ...
       jmpsw task2,task3
       ...
       jmpsw task3,task1
       ...
    

    Andy
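
A rough Python sketch of the hard-coded three-task rotation described above. It assumes JMPSW D,S saves the resume address (PC+1) in register D and jumps to the address held in register S; the tiny interpreter, the tuple instruction format, and the addresses are all hypothetical.

```python
def run(program, regs, pc, steps):
    """Execute `steps` instructions and return the list of work items performed."""
    trace = []
    for _ in range(steps):
        op = program[pc]
        if op[0] == "work":
            trace.append(op[1])
            pc += 1
        elif op[0] == "jmpsw":
            _, save_reg, next_reg = op
            regs[save_reg] = pc + 1      # remember where this task resumes
            pc = regs[next_reg]          # continue the next task where it left off
        elif op[0] == "jmp":
            pc = op[1]
    return trace


# Three tasks, each doing one unit of work and then yielding to a fixed successor.
program = {
    0:  ("work", "t1"), 1:  ("jmpsw", "task1", "task2"), 2:  ("jmp", 0),
    10: ("work", "t2"), 11: ("jmpsw", "task2", "task3"), 12: ("jmp", 10),
    20: ("work", "t3"), 21: ("jmpsw", "task3", "task1"), 22: ("jmp", 20),
}
regs = {"task1": 0, "task2": 10, "task3": 20}   # initial entry points
```

Because every JMPSW names its successor directly, no task table or pointer wrapping is needed; the cost is that the rotation order is fixed in the code.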
  • cgraceycgracey Posts: 14,212
    edited 2014-04-14 21:43
    Ariba wrote: »
    If this is for task switching, do we then not also need the automatic wrapping inside a table (FIXINDx instructions) ?
    I have the feeling this needs a lot of logic gates (comparators, muxes, registers) ?

    I think we can always do it with hardcoded task switches instead of a task-list. For example with 3 tasks:
    jmpsw task1,task2
       ...
       jmpsw task2,task3
       ...
       jmpsw task3,task1
       ...
    


    Andy


    I'd like to avoid that wrapping issue, as it is costly. I hope to have all these issues settled tonight. There are many other things that need attention.
  • cgraceycgracey Posts: 14,212
    edited 2014-04-14 21:46
    jmg wrote: »
    That sounds too costly, I was thinking more along the lines of 'blurring' the 9b + 9b opcode fields to something like 10b+ 8b in some sparse cases, where that split allows things like dedicated pointers/sfr (special function registers) 'above' usual 512 limit, and now the #Immediate value is smaller, & the source register form is limited to one of the lower 256.
    -but it has avoided eating-into general purpose RAM.


    I could see that, but how would you know if you were in a 10b + 8b situation?
  • AribaAriba Posts: 2,690
    edited 2014-04-14 22:01
    cgracey wrote: »
    I'd like to avoid that wrapping issue, as it is costly. I hope to have all these issues settled tonight. There are many other things that need attention.

    One of these other things:

    I think we need a word-move instruction for SDRAM drivers. We have Nibble and Byte instructions (ROLNIB, GETNIB, SETNB etc.) but nothing like that with word size.
    If we want to access 16bit wide SDRAM with half the sysclock rate we need to do the move and shift in one single cycle instruction.

    My proposal is a MOVWORD instruction which moves the lower or higher 16bits from S to the lower or higher 16bits of D. The other 16bits in D must not be affected:
    movword D,S,#%ds
    
       movword D,S,#%00   ' D.word0 <- S.word0
       movword D,S,#%01   ' D.word0 <- S.word1
       movword D,S,#%10   ' D.word1 <- S.word0
       movword D,S,#%11   ' D.word1 <- S.word1
    

    Andy
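
The proposed MOVWORD behaviour can be written out as a short Python function. This is only a sketch of the semantics in the table above, with the `#%ds` selector taken as d = bit 1 (destination word) and s = bit 0 (source word); the function name and the 32-bit masking are illustrative.

```python
MASK32 = 0xFFFFFFFF


def movword(d, s, sel):
    """Return the new D after moving one 16-bit word of S into one word of D."""
    d_hi = (sel >> 1) & 1                 # 1 = write D.word1, 0 = write D.word0
    s_hi = sel & 1                        # 1 = read S.word1, 0 = read S.word0
    word = (s >> (16 * s_hi)) & 0xFFFF    # pick the source word
    shift = 16 * d_hi
    # Clear the destination word, leave the other word untouched, insert the new word.
    return ((d & ~(0xFFFF << shift)) | (word << shift)) & MASK32
```

For instance, `movword(0xAAAABBBB, 0x11112222, 0b10)` gives `0x2222BBBB`: S.word0 lands in D.word1 and D.word0 is preserved.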
  • jmgjmg Posts: 15,175
    edited 2014-04-14 22:21
    Ariba wrote: »
    One of these other things:

    I think we need a word-move instruction for SDRAM drivers. We have Nibble and Byte instructions (ROLNIB, GETNIB, SETNB etc.) but nothing like that with word size.
    If we want to access 16bit wide SDRAM with half the sysclock rate we need to do the move and shift in one single cycle instruction.

    My proposal is a MOVWORD instruction which moves the lower or higher 16bits from S to the lower or higher 16bits of D. The other 16bits in D must not be affected:
    movword D,S,#%ds
    
       movword D,S,#%00   ' D.word0 <- S.word0
       movword D,S,#%01   ' D.word0 <- S.word1
       movword D,S,#%10   ' D.word1 <- S.word0
       movword D,S,#%11   ' D.word1 <- S.word1
    

    Andy

    In common with SDRAM which would want 16 bit and a CS# strobe, LCD parallel interfaces are similar.
    Some need 24b, which may be best via the video; others need 16b i8080 bus models.

    A useful opcode here could be a double-move, that does 2 x16b moves on a 32 bit register.
    With 2 SysClks available, it may be possible to get close to 200MHz bursts ?
  • cgraceycgracey Posts: 14,212
    edited 2014-04-14 22:31
    Ariba wrote: »
    One of these other things:

    I think we need a word-move instruction for SDRAM drivers. We have Nibble and Byte instructions (ROLNIB, GETNIB, SETNB etc.) but nothing like that with word size.
    If we want to access 16bit wide SDRAM with half the sysclock rate we need to do the move and shift in one single cycle instruction.

    My proposal is a MOVWORD instruction which moves the lower or higher 16bits from S to the lower or higher 16bits of D. The other 16bits in D must not be affected:
    movword D,S,#%ds
    
       movword D,S,#%00   ' D.word0 <- S.word0
       movword D,S,#%01   ' D.word0 <- S.word1
       movword D,S,#%10   ' D.word1 <- S.word0
       movword D,S,#%11   ' D.word1 <- S.word1
    

    Andy


    Thanks for pointing this out. Perhaps GETNIB/BYTE/WORD should perform a ROL function, too.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-04-14 23:21
    re INDA/INDB

    Why do we need more than one mode?

    When used as a stack, you are either PUSHing (CALLing) or POPping (RETing). So you choose PUSH INDA++, which pushes first and then increments, and POP --INDA, which decrements first and then pops. This only requires 2 register spaces for INDA. On the occasional times you just need a copy of the top of the stack, you have to POP then PUSH.

    If we want 2 stacks, the probability is we will go from each end. Therefore make INDB work the opposite - PUSH --INDB and POP INDB++, and 2 more register spaces.

    Now, you will notice that INDA & INDB both have pre-decrement and post-increment. They just get reversed for CALL to RET and also for PUSH to POP. Silicon should be simplified here.

    This also gives us the freedom to use INDx++ and --INDx in standard op codes too.

    Might INDA & INDB be better called STACKA and STACKB ?

    BTW I agree, the 4 level LIFO (which is only 17 address bits + flags) can go.
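
The two-stack arrangement in this post can be modelled in a few lines of Python. The sketch assumes one shared region of cog RAM with the INDA stack growing up from the bottom and the INDB stack growing down from the top; the `DualStack` class and its `free_slots` helper are hypothetical.

```python
class DualStack:
    """Hypothetical model of two stacks sharing one cog RAM region from opposite ends."""

    def __init__(self, base, limit):
        self.ram = [0] * 512
        self.inda = base     # next free slot for stack A (grows upward)
        self.indb = limit    # one past the region; stack B grows downward

    def push_a(self, value):          # PUSH INDA++ : store, then increment
        self.ram[self.inda] = value
        self.inda += 1

    def pop_a(self):                  # POP --INDA : decrement, then load
        self.inda -= 1
        return self.ram[self.inda]

    def push_b(self, value):          # PUSH --INDB : decrement, then store
        self.indb -= 1
        self.ram[self.indb] = value

    def pop_b(self):                  # POP INDB++ : load, then increment
        value = self.ram[self.indb]
        self.indb += 1
        return value

    def free_slots(self):
        """Slots left before the two stacks would collide."""
        return self.indb - self.inda
```

Only post-increment and pre-decrement appear in all four operations, which is the simplification the post is pointing at.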
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-04-14 23:37
    NIB/BYTE/WORD

    May I suggest...

    MOVNIB D,S/#,#0..7
    MOVBYTE D,S/#,#0..3
    MOVWORD D,S/#,#0..1
    where the NIB/BYTE/WORD (rightmost bits) of S/# replace the bits in D as indexed by #0..n
    and
    GETNIB D,S,#0..7
    GETBYTE D,S,#0..3
    GETWORD D,S,#0..1
    where D is left-zero filled and #0..n is an index into S.
    These 6 instructions could map nicely to one set of opcode sets.

    Can you remind me what ROLNIB/BYTE/WORD does?

    My preference when performing a MOVe to cog memory is to use "MOVxxx". Similarly I prefer MOVD, MOVS, MOVI or MOVINST, and MOVCOND.
    My preference to use "SETxxx" is for buried registers or setting modes.
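
As a sketch of the suggested MOVxxx/GETxxx semantics, the two Python helpers below cover all six instructions by taking the field width as a parameter (4 for NIB, 8 for BYTE, 16 for WORD). The helper names are hypothetical and registers are treated as 32-bit values.

```python
MASK32 = 0xFFFFFFFF


def mov_field(d, s, index, width):
    """MOVNIB/MOVBYTE/MOVWORD: the low `width` bits of S replace field `index` of D."""
    mask = (1 << width) - 1
    shift = index * width
    return ((d & ~(mask << shift)) | ((s & mask) << shift)) & MASK32


def get_field(s, index, width):
    """GETNIB/GETBYTE/GETWORD: field `index` of S, zero-filled on the left."""
    return (s >> (index * width)) & ((1 << width) - 1)
```

For example, `mov_field(0x12345678, 0xAB, 2, 8)` behaves like MOVBYTE with index 2 and returns `0x12AB5678`, while `get_field(0x12345678, 1, 16)` behaves like GETWORD with index 1 and returns `0x1234`.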
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-04-15 00:17
    RD CACHE

    Thinking about what has been discussed above, I wonder if an alternative could be...

    RDQUADC D/#,S/PTRA++/PTRB++ 'reads a quad long into cog ram at a quad boundary (no need for buried DCACHE) and resets the internal OFFSET 4-bit counter.

    RDBYTEC D,S/# WC 'reads a byte from the cog where S/# specifies the location D/# used in the RDQUADC, and increments the OFFSET counter +1. "C" set if OFFSET wraps = last byte.

    RDWORDC D,S/# WC 'reads a word from the cog where S/# specifies the location D/# used in the RDQUADC, and increments the OFFSET counter +2. "C" set if OFFSET wraps = last word.

    Note: mixing RDBYTEC and RDWORDC is not supported.
    Note if the user gives a different S/# in RDBYTEC or RDWORDC than the D/# used in the RDQUADC then results will be unpredictable (reads from the cog location specified)
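
A Python sketch of the RDQUADC/RDBYTEC/RDWORDC idea, to show how the 4-bit OFFSET counter and the C flag would interact. Little-endian byte order within each long, the forced quad alignment of the cog address, and the `QuadReader` class itself are assumptions made for illustration.

```python
class QuadReader:
    """Hypothetical model of a quad buffer in cog RAM with a 4-bit OFFSET counter."""

    def __init__(self, hub):
        self.hub = hub           # hub memory as a bytes object
        self.cog = [0] * 512     # cog RAM, 32-bit longs
        self.offset = 0          # byte offset 0..15 within the quad

    def rdquadc(self, d, hub_addr):
        """Load 4 longs (16 bytes) from hub into cog RAM at a quad boundary; reset OFFSET."""
        d &= ~3                  # force quad alignment (assumption)
        for i in range(4):
            chunk = self.hub[hub_addr + 4 * i: hub_addr + 4 * i + 4]
            self.cog[d + i] = int.from_bytes(chunk, "little")
        self.offset = 0

    def _read(self, s, size):
        long_value = self.cog[(s & ~3) + self.offset // 4]
        value = (long_value >> (8 * (self.offset % 4))) & ((1 << (8 * size)) - 1)
        self.offset = (self.offset + size) & 0xF
        carry = self.offset == 0     # C set when OFFSET wraps, i.e. last item was read
        return value, carry

    def rdbytec(self, s):
        return self._read(s, 1)      # OFFSET += 1

    def rdwordc(self, s):
        return self._read(s, 2)      # OFFSET += 2
```

A driver loop would call `rdbytec` until the returned carry is set and then issue the next `rdquadc`, which is the streaming pattern the post describes.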