Hub Execution Model Thread (split from blog) - Page 3 — Parallax Forums



Comments

  • David BetzDavid Betz Posts: 14,516
    edited 2013-12-03 13:54
    jmg wrote: »
    Cautious is good, but adding opcodes is relatively safe, older LMM designs still work fine. This does not close any doors.
    The P2 code base already in FPGA testing, would quickly catch any unexpected issues, should any occur on existing opcodes.
    I have a kind of ugly suggestion that might address one of Eric's concerns. Could we add yet another instruction that sets which register is used as the return address register? It would be executed once during initialization, and for PropGCC it could select the already existing LR register. Then any HCALL would store its return address in that register and HRET would return to the address in that register. This would eliminate using AUX memory just to temporarily store the return address and would require minimal changes in the PropGCC LMM code generator. Of course, other LMM engines could choose a different register, so nothing would lock in the current PropGCC register model.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 13:57
    Odds are the pop off AUX would be hidden in the wait for the next hub cycle to do the WRLONG.

    (FOR P3)

    Allowing WRLONG to write the top of the AUX stack would certainly get rid of that delay.

    Call tree analysis would allow a lot more than leaf functions to use the AUX stack. Without any arguments, 256 levels would be available; with one argument, 128.

    I am of course talking about microcontroller-type applications, not recursive descent parsers or recursive computation.
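The depth figures follow from AUX holding 256 longs, with each call level using one long for the return address plus one per argument. A quick check, as a sketch only (the function name is made up for illustration, and locals kept on the stack are ignored):

```python
AUX_LONGS = 256  # size of AUX RAM in longs, as discussed in this thread

def max_call_depth(args_per_call, aux_longs=AUX_LONGS):
    """Each call level uses 1 long for the return address plus its arguments."""
    return aux_longs // (1 + args_per_call)

for n in range(4):
    print(n, "args ->", max_call_depth(n), "levels")
```

Any locals stored on the AUX stack would reduce these numbers further.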
    David Betz wrote: »
    Popping it off the stack right away is what I figured would be necessary for everything other than leaf functions. This isn't so bad I guess because for non-leaf functions we have to save LR anyway and popping the return address off the stack is a single instruction just like moving LR to another register. I guess it might be a bit slower if the current code generator can do a WRLONG LR, -SP rather than moving it into a register first though.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 14:01
    Simplest is to pop it off AUX - especially since there is now no need for specific LMM registers.

    They can of course be kept initially, but my reading of the GCC guts tells me that it has a rather exceptional capability to use large numbers of registers... a potential for another great performance boost for a later effort.

    Don't forget, with the embedded 18 bit address there is no room to specify an LR register. Popping it off AUX lets you place it wherever is convenient.
    David Betz wrote: »
    I have a kind of ugly suggestion that might address one of Eric's concerns. Could we add yet another instruction that sets which register is used as the return address register? It would be executed once during initialization, and for PropGCC it could select the already existing LR register. Then any HCALL would store its return address in that register and HRET would return to the address in that register. This would eliminate using AUX memory just to temporarily store the return address and would require minimal changes in the PropGCC LMM code generator. Of course, other LMM engines could choose a different register, so nothing would lock in the current PropGCC register model.
  • Cluso99Cluso99 Posts: 18,069
    edited 2013-12-03 14:03
    I was going to keep this until a little later, but the pace has been frenetic overnight, and the hub execute-in-place discussion, together with Bill's comments, makes this more relevant now.

    I put forward a proposal that the AUX & QUAD-CACHE be redone as AUX 32*8*longs, enabling a RD/WRWIDE to read/write 8*Longs straight into any of the 32 AUX 8*Long blocks in 1 hub clock. Assuming this comes to pass...

    The RDWIDE[C] [#]D/PTR instruction becomes...

    RD/WRWIDE [#]D/PTR,#bbb,#nnnnn [WC]
    where:
    #bbb: is the starting block# (0-31) of the 8*Long Aux Ram block (it will wrap if necessary)
    #nnnnn: is the count of 8*Long hub transfers to perform
    WC: is set to stall the cog until this/these transfer(s) complete


    RD/WRWIDE is run by a tiny state machine so that the transfer can be done in the background and the cog can continue processing. WC controls whether the cog can continue or not.
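To make the proposed block addressing concrete, here is a rough Python model of the read direction. It is only a sketch of the proposal as described: the function name and list-based memories are illustrative, hub addresses are counted in longs, and one hub slot per 8-long block is assumed.

```python
BLOCKS = 32          # AUX viewed as 32 blocks...
LONGS_PER_BLOCK = 8  # ...of 8 longs each (256 longs total)

def rdwide(hub, aux, hub_long_addr, start_block, count):
    """Copy `count` 8-long blocks from hub into AUX, wrapping past block 31.

    Returns the number of hub slots used (one per block in this model).
    """
    for i in range(count):
        block = (start_block + i) % BLOCKS
        src = hub_long_addr + i * LONGS_PER_BLOCK
        dst = block * LONGS_PER_BLOCK
        aux[dst:dst + LONGS_PER_BLOCK] = hub[src:src + LONGS_PER_BLOCK]
    return count

hub = list(range(1024))                 # stand-in for hub RAM, indexed in longs
aux = [0] * (BLOCKS * LONGS_PER_BLOCK)  # stand-in for AUX RAM
slots = rdwide(hub, aux, hub_long_addr=64, start_block=30, count=4)
print(slots, aux[240:248], aux[0:8])
```

Starting at block 30 with a count of 4 fills blocks 30, 31, 0 and 1, which is the wrap behaviour the #bbb description mentions. The background (non-WC) case is not modelled here.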

    We will now have two methods to increase slot bandwidth (provided Chip still implements it): one using donor cog(s), and another using any free slots. If you presume just one extra donor slot, making the ratio 1:4, then at 200MHz you can move 2 x 8 longs (64 bytes) every 8 clocks, i.e. per hub cycle, so a transfer of up to 1KB runs at 1600MB/s.
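The 1600MB/s figure can be checked with a few lines of arithmetic. This sketch assumes an 8-clock hub cycle (implied by the 1:4 ratio for two slots) and takes MB as 10^6 bytes; the constant and function names are illustrative.

```python
CLOCK_HZ = 200_000_000   # 200 MHz system clock, from the post
HUB_CYCLE_CLOCKS = 8     # assumed: one hub rotation takes 8 clocks
BYTES_PER_SLOT = 8 * 4   # one wide transfer moves 8 longs = 32 bytes

def bandwidth_bytes_per_s(slots_per_hub_cycle):
    """Sustained hub<->AUX bandwidth for a given number of slots per rotation."""
    hub_cycles_per_s = CLOCK_HZ // HUB_CYCLE_CLOCKS
    return hub_cycles_per_s * slots_per_hub_cycle * BYTES_PER_SLOT

def clocks_for(nbytes, slots_per_hub_cycle):
    """Clocks needed to move `nbytes`, rounded up to whole hub cycles."""
    slots = -(-nbytes // BYTES_PER_SLOT)             # ceiling division
    hub_cycles = -(-slots // slots_per_hub_cycle)
    return hub_cycles * HUB_CYCLE_CLOCKS

print(bandwidth_bytes_per_s(1) / 1e6, "MB/s with the cog's own slot")
print(bandwidth_bytes_per_s(2) / 1e6, "MB/s with one donor slot")
print(clocks_for(1024, 2), "clocks for 1KB with one donor slot")
```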

    VIDEO GEN:
    Being able to transfer 1KB with a non-blocking instruction lets the cog execute a single RDWIDE to move a whole 1KB, using only 1 clock to start the transfer. So together with the waitvid and a few frame instructions, the whole generation is done (per 1KB of display data) in a few clocks, leaving the majority for some really serious other processing.
    In fact, for P3 with a few more registers, the complete video frame could be generated with a state machine, leaving the cog completely free (without AUX) to process anything else.

    LMM:
    By being able to transfer large blocks up to 1KB in the background, LMM would be significantly enhanced. If "n" blocks of AUX could be windowed into cog, simulated execute in place (with the appropriate caveats and/or tweaks) could become a reality.
    LMM would probably use only a couple of 8*Long blocks, leaving the remaining AUX to be used as stack space. The remaining cog space would be used for variables.

    In the old LMM method, FJMP & FCALL instructions were followed by a NOP a..a holding the hub address. It could be beneficial to change the sequence to NOP a..a followed by FJMP & FCALL instructions. The NOP, FJMP and FCALL instructions might now become dedicated LMM instructions to save yet more instruction execution space/time.
    BTW I only took a cursory look at Bill and others info about the execution model.

    COG PASM PROGRAMS:
    Being able to transfer large blocks of data between Hub and Aux using the non-blocking RD/WRWIDE instruction would provide an enormous speed and space improvement. Video update, gui and game cogs would benefit greatly from these methods.

    There are a few possibilities that could help significantly...

    (1) The use of AUXA/AUXB pointers located at $1F0..$1F1. This would work precisely as does INDA/INDB, but would point to AUX Ram instead of Cog Ram.
    All existing instructions would benefit from this.

    (2) Windowing Aux Ram blocks into Cog Ram blocks would let instructions work on the Aux directly for data, removing the requirement to transfer back and forth between Aux Ram and Cog Ram.

    (3) A much better solution would be to allow a special instruction to "switch" the use of the top bit of the resultant D & S address bits ("x" = bit 8 of x_aaaa_aaaa) such that addresses $000..$0FF would use Aux Ram $00..$FF and addresses $100..$1FF would use Cog Ram.

    This would permit all instructions to use Aux Ram $00..$FF as variable space, while still retaining the full Cog Ram as instruction space (shared with cog variable space anywhere above $100). The caveat here is that self-modifying instructions in the lower cog space of $000..$0FF would not be possible.

    This would function similar to the way the shadow registers function on the P1 where it is possible to run sw from the shadow registers.
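Option (3) amounts to a small address decode on D and S operand accesses. A sketch of that decode follows; the function name and the boolean enable flag are illustrative, and instruction fetch (which would still come from Cog Ram) is not modelled.

```python
def resolve(addr9, aux_window_enabled):
    """Map a 9-bit D/S operand address to (memory, index) under option (3)."""
    if not 0 <= addr9 <= 0x1FF:
        raise ValueError("address must fit in 9 bits")
    if aux_window_enabled and (addr9 & 0x100) == 0:
        return ("AUX", addr9 & 0xFF)   # $000..$0FF -> Aux Ram $00..$FF
    return ("COG", addr9)              # $100..$1FF, or window switched off

print(resolve(0x042, True))    # operand lands in Aux Ram
print(resolve(0x142, True))    # operand lands in Cog Ram
print(resolve(0x042, False))   # window off: plain Cog Ram access
```

Because writes to $000..$0FF go to Aux Ram while the window is on, code stored there cannot modify itself, which is the caveat noted above.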

    A different mapping scheme might also be possible by using a more complex enable instruction.

    BTW I am going to dual post this in both Propeller II update and Hub Execution Model threads.
  • David BetzDavid Betz Posts: 14,516
    edited 2013-12-03 14:04
    Bill Henning wrote: »
    Simplest is to pop it off AUX - especially since there is now no need for specific LMM registers.

    They can of course be kept initially, but my reading of the GCC guts tells me that it has a rather exceptional capability to use large numbers of registers... a potential for another great performance boost for a later effort.

    Don't forget, with the embedded 18 bit address there is no room to specify an LR register. Popping it off AUX lets you place it wherever is convenient.
    Using an instruction to set a global 9 bit register would allow any register to be used as well but would avoid having to use the AUX memory which might be needed for some other purpose. If a program wants to use the AUX memory as a stack it can certainly push the LR register onto an AUX stack.
  • jmgjmg Posts: 15,173
    edited 2013-12-03 14:08
    Cluso99 wrote: »
    I put forward a proposal that the AUX & QUAD-CACHE be redone as AUX 32*8*longs, enabling a RD/WRWIDE to read/write 8*Longs straight into any of the 32 AUX 8*Long blocks in 1 hub clock. Assuming this comes to pass...

    I think Chip said the AUX memory was hand-laid-out, and is its own block, so changes there are probably off the timeline?
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 14:13
    I think I have a solution that not only follows the rest of the P2 philosophy, but accommodates us all.

    HCALL addr - saves return address in designated link register, corresponds to pasm CALL
    SETLR reg - designate the link register to use for HCALL, perhaps default to $1F0 (currently unused)

    HCALLAR addr - push return address using SPA, corresponds to pasm CALLAR
    HCALLBR addr - push return address using SPB, corresponds to pasm CALLBR

    HRET would be a macro for HJMP LR

    HRETAR would be a return using address popped using SPA
    HRETBR would be a return using address popped using SPB

    The above revision would give GCC the LR you and Eric want, and would still have the fast AUX stack capability, at the expense of a few simple opcodes.

    Not only that, it totally fits the P2 instruction set symmetry with cog jmpret and stack style subroutines. A win-win I call it.
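The proposed instructions can be modelled in a few lines to show how the two call styles coexist. This is a behavioural sketch only, not how the hardware would do it: the class and method names are invented, addresses are counted in longs, the return address is taken as pc + 1, and the stack direction and initial SPA/SPB values are arbitrary assumptions.

```python
class Cog:
    def __init__(self):
        self.reg = [0] * 512           # cog registers
        self.aux = [0] * 256           # AUX RAM, 256 longs
        self.sp = {"A": 0, "B": 128}   # SPA/SPB; start values are arbitrary
        self.lr = 0x1F0                # default link register suggested above
        self.pc = 0                    # hub address of current instruction

    def setlr(self, reg):              # SETLR reg
        self.lr = reg

    def hcall(self, addr):             # HCALL addr
        self.reg[self.lr] = self.pc + 1
        self.pc = addr

    def hret(self):                    # HRET, i.e. HJMP LR
        self.pc = self.reg[self.lr]

    def hcall_stack(self, which, addr):    # HCALLAR / HCALLBR
        self.aux[self.sp[which]] = self.pc + 1
        self.sp[which] = (self.sp[which] + 1) & 0xFF
        self.pc = addr

    def hret_stack(self, which):           # HRETAR / HRETBR
        self.sp[which] = (self.sp[which] - 1) & 0xFF
        self.pc = self.aux[self.sp[which]]

cog = Cog()
cog.pc = 100
cog.hcall(500)             # link-register call...
cog.hret()                 # ...returns to 101
cog.hcall_stack("A", 600)  # AUX stack call, nested once more below
cog.hcall_stack("A", 700)
cog.hret_stack("A")
cog.hret_stack("A")
print(cog.pc)
```

Nested link-register calls would overwrite the register unless the callee saves it first, while the AUX stack variants nest without extra code.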
    David Betz wrote: »
    Using an instruction to set a global 9 bit register would allow any register to be used as well but would avoid having to use the AUX memory which might be needed for some other purpose. If a program wants to use the AUX memory as a stack it can certainly push the LR register onto an AUX stack.
  • Cluso99Cluso99 Posts: 18,069
    edited 2013-12-03 14:16
    jmg wrote: »
    I think Chip said the AUX memory was hand-laid-out, and is its own block, so changes there are probably off the timeline?
    I believe it is, and it's a small 256*Long block. This would basically be a 32*Long block repeated 8 times (for 8* -> 256 bit wide bus). And it is used 8 times (i.e. once for each cog).
    So perhaps this may not be such a big job???

    The P2 has taken such a huge leap forward, perhaps it is also worth doing???

    BTW Perhaps this would be best posted in the Propeller II update thread. Do you want me to copy there?
  • David BetzDavid Betz Posts: 14,516
    edited 2013-12-03 14:18
    Bill Henning wrote: »
    I think I have a solution that not only follows the rest of the P2 philosophy, but accommodates us all.

    HCALL addr - saves return address in designated link register, corresponds to pasm CALL
    SETLR reg - designate the link register to use for HCALL, perhaps default to $1F0 (currently unused)

    HCALLAR addr - push return address using SPA, corresponds to pasm CALLAR
    HCALLBR addr - push return address using SPB, corresponds to pasm CALLBR

    HRET would be a macro for HJMP LR

    HRETAR would be a return using address popped using SPA
    HRETBR would be a return using address popped using SPB

    The above revision would give GCC the LR you and Eric want, and would still have the fast AUX stack capability, at the expense of a few simple opcodes.

    Not only that, it totally fits the P2 instruction set symmetry with cog jmpret and stack style subroutines. A win-win I call it.
    Sounds okay to me if Chip is willing to add that many new instructions.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 14:20
    Let's hope so.

    I strongly prefer HUBEXEC as it will be significantly faster than LMM - and easier to use, but your suggestion - a sort of RDOCTLAUX / WROCTLAUX is incredibly useful for feeding the video engine.

    Only Chip knows what (& how much) will fit.
    Cluso99 wrote: »
    I believe it is, and it's a small 256*Long block. This would basically be a 32*Long block repeated 8 times (for 8* -> 256 bit wide bus). And it is used 8 times (i.e. once for each cog).
    So perhaps this may not be such a big job???

    The P2 has taken such a huge leap forward, perhaps it is also worth doing???
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 14:22
    I believe he said he has space for a lot of ops now that NR is gone, and each of those should be very easy and share a lot of transistors.

    Frankly, I strongly prefer the AUX stack model, but if the extra instructions are helpful to you and Eric, and they fit, we all come out ahead.
    David Betz wrote: »
    Sounds okay to me if Chip is willing to add that many new instructions.
  • Cluso99Cluso99 Posts: 18,069
    edited 2013-12-03 14:26
    Bill Henning wrote: »
    I think I have a solution that not only follows the rest of the P2 philosophy, but accommodates us all.

    HCALL addr - saves return address in designated link register, corresponds to pasm CALL
    SETLR reg - designate the link register to use for HCALL, perhaps default to $1F0 (currently unused)

    HCALLAR addr - push return address using SPA, corresponds to pasm CALLAR
    HCALLBR addr - push return address using SPB, corresponds to pasm CALLBR

    HRET would be a macro for HJMP LR

    HRETAR would be a return using address popped using SPA
    HRETBR would be a return using address popped using SPB

    The above revision would give GCC the LR you and Eric want, and would still have the fast AUX stack capability, at the expense of a few simple opcodes.

    Not only that, it totally fits the P2 instruction set symmetry with cog jmpret and stack style subroutines. A win-win I call it.
    Looks like it's possible you are only using single operands, but just in case...
    There is currently no dual operand instruction (D,[#]S) space available. I am not sure if DECOD3/DECOD4/DECOD5 could be folded into 1 instruction by utilising the WZ/WC bits?
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 14:31
    For P2 with 256KB hub, only a single dual operand space would be needed:

    Consider:

    TTTTTTT ZC I CCCC jjAAAAAAAAA AAAAAAA

    TTTTTTT ZC I CCCC 00AAAAAAAAA AAAAAAA HJMP
    TTTTTTT ZC I CCCC 01AAAAAAAAA AAAAAAA HCALL
    TTTTTTT ZC I CCCC 10AAAAAAAAA AAAAAAA HCALLAR
    TTTTTTT ZC I CCCC 11AAAAAAAAA AAAAAAA HCALLBR

    Actually, HJMP/HCALL could be the same op, if LR was always written. For a JMP, it would just be ignored. (thanks David)

    Condition codes would still work

    ZC could be used to set flags for stack wrap / stack collision

    I is still available to add other variations

    Your suggestion of folding the decode instructions by using the not needed WZ WC bits sounds great to me!
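For anyone who wants to play with the layout, here is a sketch that packs and unpacks the fields shown above (7 opcode bits, ZC, I, 4 condition bits, the 2-bit jj selector and 16 address bits). Several things here are assumptions, not anything from the thread: the opcode value is a placeholder, the 16 A bits are taken to be a long address (byte address / 4, which covers 256KB), and the default condition is %1111.

```python
KIND = {"HJMP": 0b00, "HCALL": 0b01, "HCALLAR": 0b10, "HCALLBR": 0b11}
NAME = {v: k for k, v in KIND.items()}
HUB_OPCODE = 0b1111111   # placeholder: the real opcode value is not specified

def encode(kind, byte_addr, cond=0b1111, zc=0, i=0, opcode=HUB_OPCODE):
    """Pack TTTTTTT ZC I CCCC jj A(16) into one 32-bit word."""
    if byte_addr % 4 or not 0 <= byte_addr < 256 * 1024:
        raise ValueError("need a long-aligned address inside the 256KB hub")
    return ((opcode << 25) | (zc << 23) | (i << 22) | (cond << 18)
            | (KIND[kind] << 16) | (byte_addr >> 2))

def decode(word):
    """Return (instruction name, byte address) from a packed word."""
    return NAME[(word >> 16) & 0b11], (word & 0xFFFF) << 2

word = encode("HCALLAR", 0x3FFFC)
print(f"{word:032b}", decode(word))
```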
    Cluso99 wrote: »
    Looks like it's possible you are only using single operands, but just in case...
    There is currently no dual operand instruction (D,[#]S) space available. I am not sure if DECOD3/DECOD4/DECOD5 could be folded into 1 instruction by utilising the WZ/WC bits?
  • David BetzDavid Betz Posts: 14,516
    edited 2013-12-03 14:33
    Bill Henning wrote: »
    For P2 with 256KB hub, only a single dual operand space would be needed:

    Consider:

    TTTTTTT ZC I CCCC jjAAAAAAAAA AAAAAAA

    TTTTTTT ZC I CCCC 00AAAAAAAAA AAAAAAA HJMP
    TTTTTTT ZC I CCCC 01AAAAAAAAA AAAAAAA HCALL
    TTTTTTT ZC I CCCC 10AAAAAAAAA AAAAAAA HCALLAR
    TTTTTTT ZC I CCCC 11AAAAAAAAA AAAAAAA HCALLBR

    Actually, HJMP/HCALL could be the same op, if LR was always written. For a JMP, it would just be ignored.

    Condition codes would still work

    ZC could be used to set flags for stack wrap / stack collision

    I is still available to add other variations

    Your suggestion of folding the decode instructions by using the not needed WZ WC bits sounds great to me!
    I don't think it would be a good idea to always write LR since it means that leaf functions can't use any HJMP instructions without clobbering their return address.
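A tiny trace shows the problem. This is only an illustration with made-up addresses; the flag selects whether HJMP also writes LR, as a merged HJMP/HCALL op would.

```python
def leaf_return_address(hjmp_writes_lr):
    lr = 0x1000            # return address left in LR by the caller's HCALL
    pc = 0x2000            # now executing inside a leaf function
    # The leaf branches internally (say, for a loop) using HJMP.
    if hjmp_writes_lr:
        lr = pc + 4        # a merged HJMP/HCALL op would overwrite LR here
    pc = 0x2040            # branch target inside the same leaf
    # The leaf finally returns with HRET, i.e. HJMP LR.
    return lr

print(hex(leaf_return_address(False)))  # caller's address survives
print(hex(leaf_return_address(True)))   # return address has been clobbered
```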
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 14:34
    Excellent point.
    David Betz wrote: »
    I don't think it would be a good idea to always write LR since it means that leaf functions can't use any HJMP instructions without clobbering their return address.
  • cgraceycgracey Posts: 14,151
    edited 2013-12-03 15:43
    Sorry I haven't been contributing. After two hours of sleep, I had to get up to tend to some things and I haven't been back to sleep. My brain won't be sharp until I sleep again. I'm contemplating how this will all work, as well as I can. I'm kind of fuzzy on a few things. I need to draw some diagrams of how things work.
  • ersmithersmith Posts: 6,052
    edited 2013-12-03 15:55
    (about using the AUX stack)
    Here we disagree to some extent. For microcontroller type of programs it would suffice, and be faster.
    With 256K of hub memory available, people will be writing much more complicated programs. 1K of stack will not be enough for some (many?) of these programs. I've already run into situations in P1 where >1K is needed.

    Even if the stack size were not a problem, there are still the issues of:
    (1) multiple threads on one COG (e.g. pthreads) need separate stacks
    (2) AUX is used for other things
    High resolution graphics drivers will still need to be written in cog pasm. Low/medium resolution drivers would be possible in hubex mode.
    C programs should be able to do anything that pasm programs do. For example, in P1 we had a PONG game including VGA driver written in C and running in a single COG.


    (about calling COG from HUB and vice-versa)
    We do have to watch out for switching, but I do not see it as a large problem because:

    Most common usage cases will be:

    - cog-only, native cog code, high performance drivers

    - hubexec larger code, using FCACHE like loading of tight loops, where we need not support returning to hubexec until the tight cog code is done

    - cog-only code, like the Spin interpreter, which may HCALL small routines, where we do not need to support calling cog routines, or support only a single level.
    True, those will be the most common cases, but the hardware will have to handle any instructions we give it. "handle" may mean "failing in a defined way", of course... but all of this does need to be defined!
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 15:58
    Chip,

    Go back to sleep, and re-read this thread when you are clear headed.

    You have to rest so we can have the best P2 :)
    cgracey wrote: »
    Sorry I haven't been contributing. After two hours of sleep, I had to get up to tend to some things and I haven't been back to sleep. My brain won't be sharp until I sleep again. I'm contemplating how this will all work, as well as I can. I'm kind of fuzzy on a few things. I need to draw some diagrams of how things work.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 16:08
    ersmith wrote: »
    (about using the AUX stack)
    With 256K of hub memory available, people will be writing much more complicated programs. 1K of stack will not be enough for some (many?) of these programs. I've already run into situations in P1 where >1K is needed.

    Even if the stack size were not a problem, there are still the issues of:
    (1) multiple threads on one COG (e.g. pthreads) need separate stacks
    (2) AUX is used for other things

    I agree that we have to support a hub based stack for larger programs; my point was that we need to support the AUX stack for the smaller programs / drivers, where the extra speed is most important.

    With the P2, I can realistically see multiple larger than cog memory programs running on different cogs at the same time, and they would benefit from the AUX stack.

    Personally, I think the overall HMI app will probably be running out of XMM, with some variation of XLMM, but that is beyond the scope of our current discussion.
    ersmith wrote: »
    (about using the AUX stack)
    C programs should be able to do anything that pasm programs do. For example, in P1 we had a PONG game including VGA driver written in C and running in a single COG.

    (about calling COG from HUB and vice-versa)

    True, those will be the most common cases, but the hardware will have to handle any instructions we give it. "handle" may mean "failing in a defined way", of course... but all of this does need to be defined!

    Basically, on the P1 we have:

    cog pasm code
    C cog code - close to pasm code
    LMM code - roughly 1/4th the speed of cog pasm code (if not using FCACHE)
    CLMM code - about 1/3rd the speed of LMM code (if not using fcache) but half the size
    XLMM code - roughly 1/2 the speed of LMM code

    On the P2, with the last test propgcc we have:

    cog pasm code
    C cog code
    CLMM code - did not try it
    LMM code - roughly 1/4 the speed of cog pasm code (not using FCACHE)
    XLMM code - I don't think it was ready yet, but I could be wrong.

    HUBEXEC with AUX stack would be about 4/5th-7/8th the speed of cog pasm code without using FCACHE.

    Fits very nicely, performance/capability wise.

    Forcing the use of a hub stack would take it down to about 1/2 the speed of cog pasm code.

    FCACHE would improve them all, but the nice thing about HUBEXEC with the stack in AUX is that it will perform very well even without FCACHE, and allow a HUGE FCACHE.
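Putting the estimates above side by side, relative to plain LMM. The numbers are just the rough fractions quoted in this post, not measurements; the dictionary and function names are illustrative.

```python
from fractions import Fraction as F

# Speed relative to cog PASM (= 1), as (low, high) estimates from this post.
RELATIVE_SPEED = {
    "LMM (no FCACHE)": (F(1, 4), F(1, 4)),
    "HUBEXEC, hub stack": (F(1, 2), F(1, 2)),
    "HUBEXEC, AUX stack": (F(4, 5), F(7, 8)),
}

def speedup_over_lmm(model):
    """Return the (low, high) speedup of `model` compared with plain LMM."""
    base = RELATIVE_SPEED["LMM (no FCACHE)"][0]
    low, high = RELATIVE_SPEED[model]
    return low / base, high / base

for name in RELATIVE_SPEED:
    low, high = speedup_over_lmm(name)
    print(f"{name}: {float(low):.1f}x to {float(high):.1f}x LMM")
```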

    After having dug into the gcc code generator to check out what it is like, I don't think supporting the AUX stack model and HUBEXEC would be that difficult - it would certainly be MUCH easier than trying to support a VLIW style RDQUAD based LMM model.

    Regarding HUB calling cog calling hub calling cog nesting... we simply define it as single level only; anyone nesting such calls is unsupported. This gets 99% of the benefit, and avoids that can of worms. In fact, with some care, even such bizarre nesting would work.
  • David BetzDavid Betz Posts: 14,516
    edited 2013-12-03 16:16
    Bill Henning wrote: »
    Forcing the use of a hub stack would take it down to about 1/2 the speed of cog pasm code.
    Nothing about the LR scheme prevents using an AUX memory stack. You just push the LR register in the prologue of the function body if the function calls other functions. If not, you just leave the return address in LR. Since we're not short of single operand instructions, you could also have a POPJ instruction that pops a long from the AUX memory stack and jumps to that location in a single operation.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 16:24
    David,

    We know you prefer LR.

    Others prefer AUX for cases where appropriate.

    LR only slows us down, due to the need for an extra push.

    I suggested the compromise of LR + AUX, which is actually doing things the "P2" way.

    The only possible benefit to using a link register, with a separate push, is for leaf functions as it saves a hub stack operation. So does using AUX for leaf functions.

    In all other cases, it requires an extra operation.
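Counting the call/return bookkeeping instructions for each scheme makes the comparison explicit. This sketch counts instructions only, not cycles; "push LR" stands for whatever instruction saves LR on the AUX stack, and POPJ is the combined pop-and-jump David suggested, so both are hypothetical here.

```python
def call_overhead(scheme, is_leaf):
    """Instructions spent on call/return bookkeeping for one function call."""
    if scheme == "aux":
        return ["HCALLAR", "HRETAR"]          # same for leaf and non-leaf
    if scheme == "lr":
        if is_leaf:
            return ["HCALL", "HRET"]          # return address stays in LR
        return ["HCALL", "push LR", "POPJ"]   # save LR in the prologue
    raise ValueError(f"unknown scheme: {scheme}")

for leaf in (True, False):
    for scheme in ("aux", "lr"):
        ops = call_overhead(scheme, leaf)
        print(f"{scheme:3} leaf={leaf!s:5} {len(ops)} ops: {', '.join(ops)}")
```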

    For assembly code, and compiler generated code that does not need a large stack, using the AUX stack is significantly faster than a hub stack - and takes less code.

    I can understand why you would like an ARM style LR, it is better than a hub stack for leaf functions, and it is what propgcc currently uses.

    I don't understand why you are against AUX, it would not be difficult to support an AUX stack model in GCC.

    Again, all of HUBEXEC with support for AUX stack and hub stack is a LOT less work than any LMM model for GCC other than the classic single RDLONG four cycle loop.
    David Betz wrote: »
    Nothing about the LR scheme prevents using an AUX memory stack. You just push the LR register in the prologue of the function body if the function calls other functions. If not, you just leave the return address in LR. Since we're not short of single operand instructions, you could also have a POPJ instruction that pops a long from the AUX memory stack and jumps to that location in a single operation.
  • ersmithersmith Posts: 6,052
    edited 2013-12-03 16:41
    Bill Henning wrote: »
    I don't understand why you are against AUX, it would not be difficult to support an AUX stack model in GCC.
    Supporting an AUX stack model actually isn't that easy -- it would require using different instructions to access variables on the stack from accessing variables in HUB memory. Trying to dereference pointers then becomes a nightmare (which memory space does the pointer use?)

    Having a "COG-only" model where all variables were in AUX might be feasible, but then it would be difficult to access HUB memory.

    Eric
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 16:49
    With respect, I disagree.

    I agree that it is not an hour's work, but it would take far less time than CMM took for example.

    If the stack model limited arguments + variables, the new indexed stack addressing mode can be used. It would be somewhat wasteful, as everything would have to be longs, but it would work.

    If my reading of the GCC guts was accurate, it boils down to something like:

    if (AUXSTACK) {
        emit_instruction(rdaux reg,spa[+-x])
    } else if (HUBSTACK) {
        emit_instruction(mov temp,sp)
        emit_instruction(sub temp,#x)
        emit_instruction(rdxxxx reg,temp)
    }

    It is still infinitely easier than making a VLIW-style RDQUAD based LMM (or RDOCTL) - and faster.
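The pseudocode above can be turned into a small runnable sketch that returns the instruction text for each stack model. Assumptions: the hub path is taken to subtract the slot's byte offset from a copy of sp, slots are one long each, rdlong stands in for "rdxxxx", and the helper name is invented. This is not how the real PropGCC back end is structured.

```python
def emit_stack_load(reg, slot, model):
    """Return assembly text that loads stack slot `slot` (in longs) into `reg`."""
    if model == "aux":
        # One instruction: SPA-indexed read from the AUX stack.
        return [f"rdaux {reg},spa[-{slot}]"]
    if model == "hub":
        # Three instructions: form the hub address, then read it.
        return ["mov temp,sp", f"sub temp,#{slot * 4}", f"rdlong {reg},temp"]
    raise ValueError(f"unknown stack model: {model}")

for model in ("aux", "hub"):
    print(model, emit_stack_load("r0", 2, model))
```

The one-versus-three instruction difference per stack access is where the size and speed advantage claimed for the AUX model comes from.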

    With my suggested compromise, that is supporting both LR and AUX stack, propgcc could initially support hub stack, and later add aux stack model.

    The reason I care is that for drivers, and medium size programs that fit in the AUX stack, the generated code will be significantly smaller - and significantly faster.

    Dereferencing pointers is a valid point, but it can be addressed by storing those variables whose address is taken in the hub - or simply not allowing taking the address of local variables. I am considering microcontroller style code, not complex code like a conventional operating system - not even Emacs.

    If it is too much work to support an aux model with GCC (at least initially) implementing the AUX stack for hubexec is still a big win for assembly language, and for other compilers that may be able to more easily incorporate it, so if Chip can fit hubex, it is still very much worth adding the AUX stack calls. It will also be a big win for Spin, and other VM's. I'd definitely use it in my projects.
    ersmith wrote: »
    Supporting an AUX stack model actually isn't that easy -- it would require using different instructions to access variables on the stack from accessing variables in HUB memory. Trying to dereference pointers then becomes a nightmare (which memory space does the pointer use?)

    Having a "COG-only" model where all variables were in AUX might be feasible, but then it would be difficult to access HUB memory.

    Eric
  • ozpropdevozpropdev Posts: 2,792
    edited 2013-12-03 19:04
    Bill Henning wrote: »
    Let's hope so.

    I strongly prefer HUBEXEC as it will be significantly faster than LMM - and easier to use, but your suggestion - a sort of RDOCTLAUX / WROCTLAUX is incredibly useful for feeding the video engine.

    Only Chip knows what (& how much) will fit.

    Some fabulous ideas here Bill!

    You guys seem to have a handle on things.
    It's taken me all morning to absorb it all.

    The RDOCTLAUX instruction put a huge smile on my face :)
    This would be great if it can be done. Fingers crossed!

    All this talk of bandwidth is lost when there is pipeline stall and double handling of video data to deal with.
    Having to read chunks of video data into COG ram and then pass it to the AUX ram seems a waste of process time.

    Nice work guys

    Ozpropdev :)
  • David BetzDavid Betz Posts: 14,516
    edited 2013-12-03 19:05
    Bill Henning wrote: »
    I don't understand why you are against AUX, it would not be difficult to support an AUX stack model in GCC.
    I'm not sure I believe it wouldn't be difficult to use the AUX stack in GCC. As I mentioned before, it makes it impossible to take the address of a local variable. You say we could just restrict ourselves to code that doesn't do that but I'm not sure it would be easy to get GCC to flag as errors code that violates those rules. Anyway, having both ways is fine if you can find enough opcodes to do that. If not, I'd prefer that the LR method be chosen over the AUX stack method.
  • David BetzDavid Betz Posts: 14,516
    edited 2013-12-03 19:15
    Bill Henning wrote: »
    With respect, I disagree.

    I agree that it is not an hour's work, but it would take far less time than CMM took for example.

    If the stack model limited arguments + variables, the new indexed stack addressing mode can be used. It would be somewhat wasteful, as everything would have to be longs, but it would work.

    If my reading of the GCC guts was accurate, it boils down to something like:

    if (AUXSTACK) {
        emit_instruction(rdaux reg,spa[+-x])
    } else if (HUBSTACK) {
        emit_instruction(mov temp,sp)
        emit_instruction(sub temp,#x)
        emit_instruction(rdxxxx reg,temp)
    }

    It is still infinitely easier than making a VLIW-style RDQUAD based LMM (or RDOCTL) - and faster.

    With my suggested compromise, that is supporting both LR and AUX stack, propgcc could initially support hub stack, and later add aux stack model.

    The reason I care is that for drivers, and medium size programs that fit in the AUX stack, the generated code will be significantly smaller - and significantly faster.

    Dereferencing pointers is a valid point, but it can be addressed by storing those variables whose address is taken in the hub - or simply not allowing taking the address of local variables. I am considering microcontroller style code, not complex code like a conventional operating system - not even Emacs.

    If it is too much work to support an aux model with GCC (at least initially) implementing the AUX stack for hubexec is still a big win for assembly language, and for other compilers that may be able to more easily incorporate it, so if Chip can fit hubex, it is still very much worth adding the AUX stack calls. It will also be a big win for Spin, and other VM's. I'd definitely use it in my projects.
    The code for PropGCC is available in Google Code. You're an experienced compiler writer. Could you make a branch and modify the current compiler to illustrate how you would go about using AUX memory as a C stack?
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 19:19
    If we can't have both, I'd prefer AUX & pop into LR.... but having both is best :)

    If Cluso's suggestion of combining DECOD3/4/5 is ok with Chip, then we have the opcode space for both. All we really need is one dual operand instruction slot, see post #74 in this thread.

    Taking the address of a local variable or argument is a valid point, I can see a few ways of doing it, but they are not trivial.

    There is no reason why PropGCC has to support an AUX stack model for the first implementation, there is plenty of time to add it later if it is desired.

    Spin, and some work I am doing can use it immediately to good effect.
    David Betz wrote: »
    I'm not sure I believe it wouldn't be difficult to use the AUX stack in GCC. As I mentioned before, it makes it impossible to take the address of a local variable. You say we could just restrict ourselves to code that doesn't do that but I'm not sure it would be easy to get GCC to flag as errors code that violates those rules. Anyway, having both ways is fine if you can find enough opcodes to do that. If not, I'd prefer that the LR method be chosen over the AUX stack method.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 19:27
    If I had the time, or I was paid to do it, yes, I could.

    I don't have time to do it on a volunteer basis. Heck, wifey would slow roast me if she knew how much time I spent on the forum in the last couple of days.

    I am quite sure Chip will use it for Spin, and I have a couple of projects in the works that will use it.

    To help anyone who wants to try, here is how I would do it:

    - limiting the total number of arguments and local variables to the index range from SPA is important
    - use longs even for byte/word local variables
    - do not store arrays on the AUX stack, use space on a hub stack
    - either forbid taking the address of local variables, OR
    - any local variables that have their address taken should live on a hub stack (easier fix of two), OR
    - (non-trivial gcc change) add an 'aux' attribute to arguments and local variables that the code generator can use to emit SPA indexed references
    - for debugging runs, use the WC/WZ flags I described to check for stack over/under flow
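The placement rules in the list above could be prototyped like this before touching GCC. It is a sketch under assumptions: the index range of 32 longs is a made-up limit, the Local record and function name are hypothetical, and sending anything wider than a long to the hub stack is my own simplification.

```python
from dataclasses import dataclass

SPA_INDEX_RANGE = 32   # hypothetical limit of SPA-indexed addressing, in longs

@dataclass
class Local:
    name: str
    size_bytes: int
    is_array: bool = False
    address_taken: bool = False

def place_locals(locals_):
    """Split locals into (aux_names, hub_names) following the rules above."""
    aux, hub = [], []
    for v in locals_:
        if v.is_array or v.address_taken or v.size_bytes > 4:
            hub.append(v.name)        # arrays and address-taken locals
        elif len(aux) < SPA_INDEX_RANGE:
            aux.append(v.name)        # bytes/words are widened to one long
        else:
            hub.append(v.name)        # out of SPA index range
    return aux, hub

example = [
    Local("i", 4),
    Local("c", 1),
    Local("buf", 16, is_array=True),
    Local("p", 4, address_taken=True),
]
print(place_locals(example))
```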

    Note, in an absolute sense, it would not be "easy", however it would be far easier than a VLIW-style RDQUAD LMM would be.
    David Betz wrote: »
    The code for PropGCC is available in Google Code. You're an experienced compiler writer. Could you make a branch and modify the current compiler to illustrate how you would go about using AUX memory as a C stack?
  • David BetzDavid Betz Posts: 14,516
    edited 2013-12-03 19:29
    Bill Henning wrote: »
    If I had the time, or I was paid to do it, yes, I could.
    For what it's worth, I don't think anyone has been paid for the P2 work on PropGCC as of yet. It's all been on a volunteer basis.
    Bill Henning wrote: »
    Note, in an absolute sense, it would not be "easy", however it would be far easier than a VLIW-style RDQUAD LMM would be.
    I can certainly agree with that!
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-03 19:33
    David Betz wrote: »
    For what it's worth, I don't think anyone has been paid for the P2 work on PropGCC as of yet. It's all been on a volunteer basis.

    I know, and I think I speak for all the forum members when I say we appreciate the volunteer effort.
    David Betz wrote: »
    I can certainly agree with that!

    I just knew we would agree on that one!