The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Page 25 — Parallax Forums

Comments

  • potatohead Posts: 10,254
    edited 2014-04-13 02:26
    Does that mean : mov (D), (S)?
  • Roy Eltham Posts: 2,996
    edited 2014-04-13 02:34
    Chip,
    Awesome! That is an excellent way to handle it! Glad I said that even though it was worded odd and wasn't the actual way. It made you think of one! :D

    It's nice to have at least some form of indirection available!
  • cgracey Posts: 14,133
    edited 2014-04-13 02:35
    potatohead wrote: »
    Does that mean : mov (D), (S)?


    I'm not sure what (D) and (S) mean.
  • Cluso99 Posts: 18,069
    edited 2014-04-13 02:37
    Chip,
    I thought you were implementing hubexec as an extremely simple model. This is what I thought was happening...

    The JMP/CALL/RET (4 level lifo, PTRA, PTRB) instructions can mix cog and hub execution. PUSH & POP versions supported.
    The instruction cache is 4 longs (hence 128 bit wide).
    There would be an address tag holding what quad hub address is stored in the Icache.
    When a new instruction fetch was required, we'd wait for the hub and read the quad into the Icache.
    Then up to 4 instructions could execute from this Icache. Unless there was a loop within this quad, a new hub fetch would be required.
    Self-modifying hubexec instructions would not be possible. We have the same problem with LMM anyway.
    The LOCxxxx instructions let us load embedded constants (17 bits) into registers. BTW I have suggested you make it 18 bits for future 1MB expansion.

    So, presuming sequential hubexec instructions, the speed would be 1/2 that of cog mode (4 instructions per hub access). Non-sequential or hub data access will slow this down.

    If you can do this then this would be fantastic.

    We can get up to other tricks to improve speed, but this would permit the user to run from hub simply, and so the benefit is we no longer are tied to 2KB cog.
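
A rough way to see the effect of the one-line Icache described above is to count quad fetches for a given sequence of instruction addresses. The Python sketch below is only an illustration under stated assumptions: addresses are long (32-bit word) indices, a quad is four aligned longs, and the helper name `count_hub_fetches` and the sample traces are made up here. It models fetch counts only and says nothing about actual hub timing.

```python
LONGS_PER_QUAD = 4  # the Icache line is 4 longs (128 bits), as described above


def count_hub_fetches(pc_trace):
    """Count quad fetches for a trace of long addresses with a one-line cache.

    Hypothetical helper: `pc_trace` is the sequence of hub long addresses
    executed, and the cache holds exactly one quad identified by its tag.
    """
    tag = None      # which quad is currently held in the Icache
    fetches = 0
    for pc in pc_trace:
        quad = pc // LONGS_PER_QUAD
        if quad != tag:     # miss: wait for the hub and reload the line
            tag = quad
            fetches += 1
    return fetches


# Illustrative traces (addresses are arbitrary examples).
sequential = list(range(0x400, 0x400 + 16))   # 16 straight-line instructions
tight_loop = [0x400, 0x401, 0x402] * 10       # loop inside one quad
straddling_loop = [0x403, 0x404] * 10         # loop across a quad boundary

print(count_hub_fetches(sequential))        # 4
print(count_hub_fetches(tight_loop))        # 1
print(count_hub_fetches(straddling_loop))   # 20
```

In this model straight-line code costs one fetch per four instructions, a loop that fits inside one quad costs a single fetch, and a two-instruction loop that straddles a quad boundary refetches on every instruction, which is one form of the non-sequential slowdown mentioned above.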
  • Roy Eltham Posts: 2,996
    edited 2014-04-13 02:37
    Chip,
    he means like your [ptr] example. using the ALTDS instruction you could have mov [ptr1], [ptr2]
  • cgracey Posts: 14,133
    edited 2014-04-13 02:39
    Roy Eltham wrote: »
    Chip,
    Awesome! That is an excellent way to handle it! Glad I said that even though it was worded odd and wasn't the actual way. It made you think of one! :D


    I'm really glad we've got a solution now. What a relief! In some ways, this is better than INDA/INDB.
  • cgracey Posts: 14,133
    edited 2014-04-13 02:40
    Roy Eltham wrote: »
    Chip,
    he means like your [ptr] example. using the ALTDS instruction you could have mov [ptr1], [ptr2]


    Sure!

    ALTDS ptr1,ptr2
    MOV 0,0

    same as:

    MOV [ptr1],[ptr2]
  • cgracey Posts: 14,133
    edited 2014-04-13 02:42
    Cluso99 wrote: »
    Chip,
    I thought you were implementing hubexec as an extremely simple model. This is what I thought was happening...

    The JMP/CALL/RET (4 level lifo, PTRA, PTRB) instructions can mix cog and hub execution. PUSH & POP versions supported.
    The instruction cache is 4 longs (hence 128 bit wide).
    There would be an address tag holding what quad hub address is stored in the Icache.
    When a new instruction fetch was required, we'd wait for the hub and read the quad into the Icache.
    Then up to 4 instructions could execute from this Icache. Unless there was a loop within this quad, a new hub fetch would be required.
    Self-modifying hubexec instructions would not be possible. We have the same problem with LMM anyway.
    The LOCxxxx instructions let us load embedded constants (17 bits) into registers. BTW I have suggested you make it 18 bits for future 1MB expansion.

    So, presuming sequential hubexec instructions, the speed would be 1/2 that of cog mode (4 instructions per hub access). Non-sequential or hub data access will slow this down.

    If you can do this then this would be fantastic.

    We can get up to other tricks to improve speed, but this would permit the user to run from hub simply, and so the benefit is we no longer are tied to 2KB cog.



    That all still stands. It just seemed to me that without INDA/INDB, it wouldn't be that practical. Kind of myopic on my part.

    Anyway, now we've got a solution to indirection: ALTD/ALTS/ALTDS
  • Cluso99 Posts: 18,069
    edited 2014-04-13 02:51
    Just so everyone realises, MOV [ptrd],[ptrs] would still be limited to registers (cog not hub).
  • Cluso99 Posts: 18,069
    edited 2014-04-13 02:56
    Just thinking...

    Why would you do

    ALTDS ptr1, ptr2
    MOV 0,0

    It is compiled code, so couldn't you just do

    MOV ptr1, ptr2 ???

    Or does

    ALTDS ptr1, ptr2
    read the values of ptr1 and ptr2 (let's say they store $20 and $30)
    and then when
    MOV 0,0
    executes it performs
    MOV $20, $30

    Just realised this is what it is doing, hence both D & S can be via indirection. Nice!
  • cgracey Posts: 14,133
    edited 2014-04-13 03:01
    Cluso99 wrote: »
    Just thinking...

    Why would you do

    ALTDS ptr1, ptr2
    MOV 0,0

    It is compiled code, so couldn't you just do

    MOV ptr1, ptr2 ???

    Or does

    ALTDS ptr1, ptr2
    read the values of ptr1 and ptr2 (let's say they store $20 and $30)
    and then when
    MOV 0,0
    executes it performs
    MOV $20, $30

    ???


    The latter is true.
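
The behaviour confirmed above (ALTDS reads the contents of ptr1 and ptr2, and the following MOV 0,0 then acts as MOV $20,$30) can be mimicked with a small Python model. This is a sketch under assumptions, not the actual hardware design: instructions are plain `(op, d, s)` tuples kept separate from the register list, the D and S fields are assumed to be 9 bits wide, and the stored MOV is assumed to be left unmodified. The names `run`, `PTR1` and `PTR2` are invented for illustration.

```python
# Hypothetical model of ALTD/ALTS/ALTDS as described in this thread.
# Simplification: the program is kept separate from the cog registers.

FIELD_MASK = 0x1FF  # assumed 9-bit D and S fields (512 cog registers)


def run(program, cog):
    """Run a list of (op, d, s) tuples against the register list `cog`."""
    alt_d = alt_s = None  # one-shot overrides for the next instruction
    for op, d, s in program:
        if op in ("ALTD", "ALTS", "ALTDS"):
            # Read the *contents* of the pointer registers; they replace
            # the D and/or S field of the following instruction only.
            if op in ("ALTD", "ALTDS"):
                alt_d = cog[d] & FIELD_MASK
            if op in ("ALTS", "ALTDS"):
                alt_s = cog[s] & FIELD_MASK
            continue
        if alt_d is not None:
            d = alt_d
        if alt_s is not None:
            s = alt_s
        alt_d = alt_s = None  # overrides apply once, then expire
        if op == "MOV":
            cog[d] = cog[s]
        else:
            raise ValueError(f"unsupported op in this sketch: {op}")
    return cog


PTR1, PTR2 = 0x100, 0x101   # arbitrary register addresses for the pointers
cog = [0] * 512
cog[PTR1] = 0x20            # ptr1 holds $20
cog[PTR2] = 0x30            # ptr2 holds $30
cog[0x30] = 1234            # value to be moved

program = [("ALTDS", PTR1, PTR2), ("MOV", 0, 0)]
run(program, cog)

print(cog[0x20])    # 1234: the MOV acted as MOV $20,$30
print(program[1])   # ('MOV', 0, 0): the stored instruction is untouched
```

Register 0 is never written in this run, because the 0,0 fields of the MOV are only placeholders that the preceding ALTDS overrides.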
  • Cluso99 Posts: 18,069
    edited 2014-04-13 03:06
    Do we require 2 hub stacks (PTRA & PTRB) ?
    Would one cog stack be a nice alternative ?

    Or could PTRA & PTRB use the address to determine cog or hub mode ?
  • potatohead Posts: 10,254
    edited 2014-04-13 03:08
    Yeah, sorry I should have used brackets.

    Nice!

    I think two stacks is a very nice to have.
  • Roy Eltham Posts: 2,996
    edited 2014-04-13 03:11
    We already essentially have indirection for reading and writing hub memory with rdxxxx/wrxxxx since they use a register as the address to read/write.
    Now we have a solution for indirection in cog memory accesses.
  • Heater. Posts: 21,230
    edited 2014-04-13 03:13
    Chip,
    cgracey wrote: »
    The problems arise from hub exec needing INDA/INDB-type functionality to overcome the impracticality of self-modifying code in hub memory.
    Wait up. The Prop I does LMM. It has no such demands. The LMM is just a loop to fetch and execute instructions, with some other kernel support functionality.


    What we need is for the PII to support doing what that 4 instruction LMM does, as fast as possible.


    A hardware LMM loop with software support would do the trick.
    cgracey wrote: »
    Is hub exec worth slowing the cog down for?
    My gut tells me "no way". When you need the speed you need the speed.
    For example, the Fast Fourier Transform fits in and runs from a single COG. Such things as FFT are never fast enough.
    Or think about the fact that even when executing from HUB, PropGCC can cache tight code loops into COG for maximum speed. In fact it does this so well that the C version of the FFT runs nearly as fast as the hand-made PASM version.
    Taking 30% off of all of this sounds really bad.
  • Roy Eltham Posts: 2,996
    edited 2014-04-13 03:16
    Heater,
    You're a bit behind, we have a solution now that doesn't slow the cogs down and keeps hubexec. We won't have inda/indb.
  • Cluso99 Posts: 18,069
    edited 2014-04-13 03:20
    potatohead wrote: »
    I think two stacks is a very nice to have.
    But do they both have to be in hub? That is my question - could one be in cog?
    Remember, every CALLA/CALLB/RETA/RETB will write or read a hub long, which costs a hub access on top of the instruction fetching, and so will slow things down a lot.
  • Heater. Posts: 21,230
    edited 2014-04-13 03:21
    Roy,

    So I see! Great work guys. I'll go back to sleep.
  • jmg Posts: 15,148
    edited 2014-04-13 03:21
    cgracey wrote: »
    What you said made me realize that we could do something like AUGS/AUGD, but instead of augmenting the next S/D constant, we could alter the S/D field in the next instruction.

    So this is 'in-line self modification', that can also be hidden from the programmer?
    (Self-modifying code is not easy to read months later, or by someone else.)
    cgracey wrote: »
    ALTS ptr
    MOV OUTA,0

    Could also be coded as:

    MOV OUTA,[ptr]

    The second form is much preferable, even if it means that 'opcode' is 4 cycles/2 words in how it is managed.
  • Roy Eltham Posts: 2,996
    edited 2014-04-13 03:27
    jmg,
    My understanding is that it doesn't actually modify the memory of the mov instruction. It just changes its behavior. The same way the AUGS instruction does.

    It accomplishes the same end result as self modifying code, without actually doing the modify of the code in memory, and it's actually easier to use since it doesn't require the extra instruction(s) between the modify and the execution of the modified instruction.
  • jmg Posts: 15,148
    edited 2014-04-13 03:35
    Roy Eltham wrote: »
    jmg,
    My understanding is that it doesn't actually modify the memory of the mov instruction. It just changes its behavior. The same way the AUGS instruction does.

    It accomplishes the same end result as self modifying code, without actually doing the modify of the code in memory,

    Yes, that is what I meant by 'in-line self modification' - it occurs in-line, in the execution unit, not in the memory.

    The P1+ really does need an 'indirect opcode' group, and this seems a good way to achieve that. I also prefer the 2nd handling form Chip gives, that 'hides' the 'in-line self modification', so it looks more normal to anyone reading code.

    A number of micros are slower/larger for indirect access opcodes, so that is ok.
  • David Betz Posts: 14,511
    edited 2014-04-13 04:12
    cgracey wrote: »
    I realized tonight that it is hard to support hub exec and not get mired back in the complexities of the Prop2.

    The problems arise from hub exec needing INDA/INDB-type functionality to overcome the impracticality of self-modifying code in hub memory. INDA/INDB require an effective pipeline stage, unto themselves, in order to take the instruction, recognize INDA/INDB usage, and then substitute the INDA/INDB pointers into the S and D fields of the instruction before reading S and D. The Prop1's simpler architecture just feeds the S and D fields of the instruction straight into the address inputs of the cog RAM to read S and D on the next clock, requiring no extra stage. This pipeline stage needed to support INDA/INDB increases the number of instructions trailing a branch (which will need cancellation) from one to two. That, in turn, has the effect of requiring Prop2-type INDA/INDB backtracking circuits to accommodate various cancellation scenarios. This pipeline stage is also required to make the RDxxxxC cached reads work. To implement hub exec, we're going to be complicating the cog quite a bit. Delayed branches will become more necessary to regain looping performance. I think this is the wrong road, in light of what became of the P2 at 180nm.

    There is a way around this, and that is to resolve all the INDA/INDB stuff on the same cycle as the instruction read, tacking the INDA/INDB logic time onto the cog RAM access time. This would have the effect of slowing the clock down by ~30%, I estimate. It would require less flops and not introduce a new pipeline stage, but would instantly create what will remain the critical path in the cog.

    In summary, if we don't pursue hub exec and INDA/INDB, we're much simpler, which means smaller and faster. There is little cost and no extra pipelining required to implement PTRA/PTRB, though, so that is viable. Same with a hardware LIFO stack for CALL/RET. And 128-bit hub transfers are no problem, either. Same with keeping the four tasks - those make a lot possible and are almost free.

    Is hub exec worth slowing the cog down for?
    I guess I haven't been following this closely enough. What do INDA and INDB have to do with executing code from hub memory? I thought all you were doing was extending the PC to 17 bits and fetching from the i-cache if the high 8 bits were non-zero. That and adding the JMP and CALL instructions that take 17 bit addresses is all there is to hub execution, isn't it?
  • David Betz Posts: 14,511
    edited 2014-04-13 04:16
    cgracey wrote: »
    What you said made me realize that we could do something like AUGS/AUGD, but instead of augmenting the next S/D constant, we could alter the S/D field in the next instruction. This is the way to achieve indirection for S and D! This is REALLY simple.

    Along with augmenting D and S constants, we could alter D and S registers:

    ALTD D/#
    ALTS S/#
    ALTDS D/#,S/#


    This:

    ALTS ptr
    MOV OUTA,0

    Could also be coded as:

    MOV OUTA,[ptr]
    By the way, if you get rid of the ability to execute code from the hub you probably don't need AUGS and AUGD anymore either since the original "BIG" instruction was mostly to replace the LMM macro to fetch 32 bit constants when running from the hub directly.
  • David Betz Posts: 14,511
    edited 2014-04-13 04:26
    I don't understand this proposed hubexec model. Can someone please explain it in a single message rather than scattered across the entire thread? I don't believe we need anything other than the following:

    1) PC extended to 17 bits to address 512K of hub memory
    2) When the high 8 bits of the PC are zero, it points to COG memory
    3) When the high 8 bits of the PC are non-zero, it points to hub memory
    4) CALL/JMP instructions with 17 bit address fields
    5) A CALL instruction that stores its full 17 bit return address in a register
    6) A way to load a 32 bit constant (AUGS)
    7) A RDLONGC-like facility that allows 128 bits to be fetched at once and used as a one-line i-cache

    What else is needed? In fact, number 7 could be left out in a really simple implementation but I think Bill determined that would only be 25% faster than LMM.

    What is all of this about INDA, INDB, ALTD, ALTS, etc? What am I missing here?

    Edit: Okay, I think I'm beginning to understand. Chip said INDA-like functionality. I guess the problem is with my points 2 and 3 where the value in PC is sometimes treated as a hub address and sometimes as a COG address. Is that the issue?
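
Points 1 to 3 of the list above amount to a simple address decode on the PC. The Python sketch below illustrates that idea only; it assumes the PC counts longs (so that 17 bits cover 512KB) and that cog memory is 512 longs (9 address bits, which leaves the 8 high bits from point 2). The names `decode_pc`, `PC_BITS` and `COG_BITS` are invented for illustration.

```python
PC_BITS = 17    # point 1: 17-bit PC
COG_BITS = 9    # assumed: 512 cog longs, so the "high 8 bits" are bits 9..16


def decode_pc(pc):
    """Return ("cog", addr) or ("hub", addr) for a long-addressed PC."""
    pc &= (1 << PC_BITS) - 1
    if pc >> COG_BITS == 0:
        return ("cog", pc)      # point 2: high 8 bits are zero
    return ("hub", pc)          # point 3: high 8 bits are non-zero


# 2**17 longs of 4 bytes each gives the 512KB mentioned in point 1.
HUB_BYTES = (1 << PC_BITS) * 4

print(decode_pc(0x001FF))   # ('cog', 511)
print(decode_pc(0x00200))   # ('hub', 512)
print(HUB_BYTES // 1024)    # 512
```

One consequence of decoding this way is that the lowest 512 long addresses always select cog memory, so the hub longs at those addresses could not be executed directly in this model.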
  • David Betz Posts: 14,511
    edited 2014-04-13 04:43
    cgracey wrote: »
    That all still stands. It just seemed to me that without INDA/INDB, it wouldn't be that practical. Kind of myopic on my part.

    Anyway, now we've got a solution to indirection: ALTD/ALTS/ALTDS
    I don't understand the need for indirect COG addressing. Can someone post an example of how this would be needed in conjunction with hubexec? It seems like an orthogonal feature to me.
  • Baggers Posts: 3,019
    edited 2014-04-13 04:50
    Heater. wrote: »
    Chip,
    What we need is for the PII to support doing what that 4 instruction LMM does, as fast as possible.
    A hardware LMM loop with software support would do the trick.

    Do you mean like the REPS instruction we had in P2?
  • David Betz Posts: 14,511
    edited 2014-04-13 04:52
    Baggers wrote: »
    Do you mean like the REPS instruction we had in P2?
    REPS is too hard to work with in compiler code generators. It works really well for straight line code but very badly with branching code. Unfortunately, I think most code tends to branch a lot.
  • ozpropdev Posts: 2,791
    edited 2014-04-13 05:03
    David Betz wrote: »
    I don't understand the need for indirect COG addressing. Can someone post an example of how this would be needed in conjunction with hubexec? It seems like an orthogonal feature to me.
    David
    If INDA/B are removed from P2 and you can't self modify code in hub-exec, it makes it difficult to index through COG ram.
    The only way would be to call a COG mode program that self modifies and return back to hub-exec mode.
    Here's a P3 example.
    read_cogram		setd	myinda,val2
    			nop
    			nop
    			nop
    myinda			pushb	0-0
    			reta
    
    
    Adding ALTD/ALTS/ALTDS allows indexing of COG ram from within Hub-exec.
    Brian
  • David Betz Posts: 14,511
    edited 2014-04-13 05:07
    ozpropdev wrote: »
    David
    If INDA/B are removed from P2 and you can't self modify code in hub-exec, it makes it difficult to index through COG ram.
    The only way would be to call a COG mode program that self modifies and return back to hub-exec mode.
    Here's a P3 example.
    read_cogram		setd	myinda,val2
    			nop
    			nop
    			nop
    myinda			pushb	0-0
    			reta
    
    
    Adding ALTD/ALTS/ALTDS allows indexing of COG ram from within Hub-exec.
    Brian
    Okay, I understand. That sounds like a cool facility but I don't think it will be needed by PropGCC. As I said, it seems like it's orthogonal to hub execution. Without INDA and INDB we are in no worse shape than we were in with P1 which also had no indirect COG memory instructions.
  • Heater. Posts: 21,230
    edited 2014-04-13 05:14
    ozpropdev,
    If INDA/B are removed from P2 and you can't self modify code in hub-exec, it makes it difficult to index through COG ram.
    How is indexing through COG RAM whilst executing from HUB a useful feature? Who would miss it if we could not do it?