Propeller II update - BLOG - Page 123 — Parallax Forums

Comments

  • Bill Henning Posts: 6,445
    edited 2013-12-05 19:17
    No, that is not what I said.

    If you are in hubexec mode and you call a cog routine, returning from it will return you into the hubexec window, which restarts hub mode.
    David Betz wrote: »
    Okay, I guess you're saying that once you enter hub mode you can no longer call any functions that are COG-resident. That means there is no way to use helper functions in COG memory like what you typically call FCACHE.
  • David Betz Posts: 14,511
    edited 2013-12-05 19:21
    Bill Henning wrote: »
    No, that is not what I said.

    if you are in hubexec mode, and you call a cog routine, returning from it will return you into the hubexec window, which restarts hub mode.
    If you execute a CALL instruction from hub exec mode, what PC gets stored in the corresponding RET instruction?
  • Bill Henning Posts: 6,445
    edited 2013-12-05 19:27
    Cluso99 wrote: »
    I still don't like it. But Chip is the one to implement it and I am not sure how much extra work that is. The ALU would have to be able to swap where the 9bit immediate bits go (bottom or top).

    No need to involve the ALU, just a decision about where the 9-bit immediate is OR'd in, then use the result. Separate OR gates avoid an extra ALU op.

    Saving the hub memory is worth it; when optimized for it, I expect a 2%+ savings in hub usage.

    Not sure why you don't like it; cycle and memory savings trump perceived niceness.
    Cluso99 wrote: »
    Yes, would be nice. But, if I understand correctly, the data bus to the cache is shared between cog and hub (not dual ported), so the cog would stall while it waits for the next cache line to be filled. It's a slowdown but still way better than LMM.

    I think it would only stall if there were hub ops in the window, and even then a full 8-cycle stall is unlikely.

    Definitely better than current LMM, which is a minimum of 4 cycles per single-cycle instruction; even quad-based LMM is at least two cycles per single-cycle op. And this needs FAR fewer changes to propgcc.
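
The cycle counts above translate into throughput as clock rate divided by cycles per instruction. A minimal sketch, assuming a 160 MHz cog clock (an illustrative figure, not stated in this post) and treating hub execution as one cycle per instruction in the best case; since the LMM figures are minimum cycle counts, the results are upper bounds:

```python
def mips(clock_mhz: float, cycles_per_instruction: float) -> float:
    """Millions of instructions per second for a given clock and cycle cost."""
    return clock_mhz / cycles_per_instruction


# Assumption for illustration only: the post does not state a clock rate.
CLOCK_MHZ = 160.0

# Cycle costs taken from the post; "hubexec, best case" assumes no stalls.
MODES = [
    ("current LMM", 4),
    ("quad-based LMM", 2),
    ("hubexec, best case", 1),
]

for name, cycles in MODES:
    print(f"{name}: up to {mips(CLOCK_MHZ, cycles):.0f} MIPS")
```

Any stall for a cache-line refill raises the effective cycles per instruction and lowers the hubexec figure accordingly.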
    Cluso99 wrote: »
    It depends on how Chip implements it. Currently postfix doesn't require any additional registers which was what was so nice.

    True. But the exposed register is a big win for optimization.
    Cluso99 wrote: »
    OK, I understand this now. It is premised on Chip doing the HUBEXEC mode to cater for this.

    Yep. Based on his initial musings about auto re-load, but discarding the use of PTRA as the cog's PC.

    I tried to keep the changes minimal, while trying to make it easy for hub-pasm, gcc, Spin, and other VMs and compilers.
  • Bill Henning Posts: 6,445
    edited 2013-12-05 19:28
    The lower 9 bits of the next address in the window - which is the correct place to return to.

    A tiny amount of logic will detect the jump into that block, and can auto-resume hub mode.

    I tried to keep everything simple and elegant, and add as little as possible for good optimization possibilities.
    David Betz wrote: »
    If you execute a CALL instruction from hub exec mode, what PC gets stored in the corresponding RET instruction?
  • David Betz Posts: 14,511
    edited 2013-12-05 19:35
    Bill Henning wrote: »
    The lower 9 bits of the next address in the window - which is the correct place to return to.

    A tiny amount of logic will detect the jump into that block, and can auto-resume hub mode.

    I tried to keep everything simple and elegant, and add as little as possible for good optimization possibilities.
    I'm beginning to wonder if executing code from hub memory is really a good idea. There are too many loose ends and the implementation that uses PTRA as the PC but still requires COG address PC to handle CALL/RET is pretty ugly. Maybe this is better left for P3 after all.

    Chip: I appreciate your willingness to look at this but I would hate for a half-baked solution to be cast in concrete. Maybe it's best to take some more time to consider how this should really be done.
  • Bill Henning Posts: 6,445
    edited 2013-12-05 19:58
    I tried to go away from PTRA to save it for gcc, but Chip wants to use it for hubexec.

    So "There are too many loose ends and the implementation that uses PTRA as the PC but still requires COG address PC to handle CALL/RET is pretty ugly." is not true, and is just a misunderstanding on your part.

    If ptra is the hubexec pc, then that does greatly simplify calling cog code from hubexec, and no need to expand the cog's pc.
    David Betz wrote: »
    I'm beginning to wonder if executing code from hub memory is really a good idea. There are too many loose ends and the implementation that uses PTRA as the PC but still requires COG address PC to handle CALL/RET is pretty ugly. Maybe this is better left for P3 after all.

    Chip: I appreciate your willingness to look at this but I would hate for a half-baked solution to be cast in concrete. Maybe it's best to take some more time to consider how this should really be done.
  • dr hydra Posts: 212
    edited 2013-12-05 20:05
    David

    Get some good sleep...executing from the hub is too big to stop...It is a game changer.

    It looks like you are starting to go in circles. Once Chip works through the details a clearer picture will probably appear.
  • David Betz Posts: 14,511
    edited 2013-12-05 20:25
    Bill Henning wrote: »
    I tried to go away from PTRA to save it for gcc, but Chip wants to use it for hubexec.

    so "There are too many loose ends and the implementation that uses PTRA as the PC but still requires COG address PC to handle CALL/RET is pretty ugly." is not true, and is just a misunderstanding on your part.

    If ptra is the hubexec pc, then that does greatly simplify calling cog code from hubexec, and no need to expand the cog's pc.
    You still have to deal with what happens if the CALL instruction is in the last long of the hub window. I really don't think a good solution would have two PCs either. While I would love to have the ability to execute code directly from hub memory, I'm not particularly happy with the proposed solution. I think it would be better to spend the time to work out a cleaner approach.
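
The edge case raised here can be sketched as follows. This is only an illustration under stated assumptions: the window's cog address (`WINDOW_BASE`) is hypothetical, the 8-long window size comes from the discussion in this thread, and the return address is modelled as the next cog address masked to 9 bits, as described earlier in the thread.

```python
COG_ADDR_MASK = 0x1FF  # cog addresses are 9 bits wide
WINDOW_BASE = 0x1E0    # hypothetical cog address where the hub window is mapped
WINDOW_LONGS = 8       # window size in longs, as discussed in the thread


def return_address(call_addr: int) -> int:
    """Return address stored by a CALL: the next cog address, kept to 9 bits."""
    return (call_addr + 1) & COG_ADDR_MASK


def in_window(addr: int) -> bool:
    """True when a cog address falls inside the mapped hub window."""
    return WINDOW_BASE <= addr < WINDOW_BASE + WINDOW_LONGS


first_long = WINDOW_BASE
last_long = WINDOW_BASE + WINDOW_LONGS - 1

# A CALL in the first long returns to an address that is still in the window.
print(in_window(return_address(first_long)))

# A CALL in the last long returns to the address just past the window, so
# logic that only detects "jump into the window" would not catch that return.
print(in_window(return_address(last_long)))
```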
  • David Betz Posts: 14,511
    edited 2013-12-05 20:26
    dr hydra wrote: »
    David

    Get some good sleep...executing from the hub is too big to stop...It is a game changer.

    It looks like you are starting to go in circles. Once Chip works through the details a clearer picture will probably appear.
    That is possible. I asked Chip for this a long time ago, so I hate to be the one to discourage it now, but I'd like a clean solution, not something tacked on that mostly works.
  • potatohead Posts: 10,260
    edited 2013-12-05 20:36
    I do too.

    Either it's going to make sense, or it isn't. Lots of ideas here, now it's time to consider the entirety of things in the context of changes to be made. If it's a no-go, it's a no-go.
  • rogloh Posts: 5,277
    edited 2013-12-05 21:54
    dr hydra wrote: »
    ...executing from the hub is too big to stop...It is a game changer.

    It certainly would be a game changer. It brings simple LMM code running C up from the range of a 20-25 MIPS grade device (albeit 32 bits) into something closer to a 160-200 MIPS grade device (think STM32F4 minus the FPU) when at full speed. And there can be up to 8 COGs doing this at once! Even the STM32F4 only has 128k of SRAM that it can execute from at full speed; the P2 will hopefully have about double this, plus all the hard real-time and video ability. Wow!!

    That would be so totally awesome so I am really hoping for this to be achievable in the available timeframe. All my fingers and toes are crossed here that Chip can make it so. :lol:

    Even if ultimately we have to resort to having the stack stored in the hub and take performance hits for any branches out of the current 8-long window and during any additional data transfers from hub memory, the extra performance benefits should still be readily apparent. Remember we always have lots of free COG registers for effectively holding register variables to reduce hub data traffic. If Chip also retains the cache registers for normal data read operations (RDOCT), we can read or write back up to eight 32-bit local/argument registers to the stack per hub window, so the function call overhead may not be so bad for most functions with a reasonably small number and size of arguments. Plenty of code could still benefit. Branches are probably far more common than call operations in a lot of code anyway. But if the hubexec calls can be made to work with an AUX stack somehow (in some modified model) that would be great too.
  • David Betz Posts: 14,511
    edited 2013-12-06 04:15
    dr hydra wrote: »
    David

    Get some good sleep...executing from the hub is too big to stop...It is a game changer.

    It looks like you are starting to go in circles. Once Chip works through the details a clearer picture will probably appear.
    Okay, I slept on it! :-)

    I still want this to happen but I want it to be clean. I guess I need to trust Chip because he always comes up with clean designs. As you say, if this can happen for P2 it will be a game changer. Let's hope that Chip has better thought out ideas than what we've come up with in this forum. I'm looking forward to hearing about his solution.
  • David Betz Posts: 14,511
    edited 2013-12-06 08:32
    I was thinking that if we have the BIG instruction we could use that to extend the range of the existing CALLA/CALLB destination. Any destination with bits set in 31:9 could be considered a hub address and cause an automatic switch to hub execute mode using a bit-extended version of the PC register as the program counter rather than PTRA. A CALLA/CALLB instruction that didn't use BIG or that had zeros in bits 31:9 would be considered a COG location. As long as we always call functions with CALLA/CALLB, we don't have to worry about the PC being wider than 9 bits. We could just say you can't use CALL/RET when executing from hub memory. Also, the RETA/RETB instructions could also check bits 31:9 of the value they pop off the stack and switch back to hub mode if those bits were non-zero. Actually, you *could* still use CALL/RET but you couldn't use them from hub mode code. You'd have to use them from COG mode code. So the call from hub mode to COG mode would be using CALLA/CALLB but then the COG mode function can use CALL/RET to call any functions it wants to call that are also in COG mode.

    If we were to go with this approach I'd also like a new instruction like CALLREG that would store its return address in the D register and branch to the S address possibly extended by a BIG instruction. This would allow GCC to avoid the use of the AUX stack. The return could just be the normal JMP D instruction with the appropriate checking of bits 31:9 to determine whether to switch to hub mode or COG mode. If this instruction is added, it could also be used to call from hub mode into COG mode.

    Does this make sense?

    Edit: I guess if it's necessary to use PTRA as the PC in hub mode the real PC could be copied to PTRA on the CALL instruction that switches to hub mode and copied back to the real PC when transitioning from hub mode back to COG mode.
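
A minimal sketch of the address test this proposal relies on, assuming destinations are treated as 32-bit values and that the cog address space is the low 9 bits. The function names are hypothetical and only illustrate the bits 31:9 check; nothing here reflects actual P2 hardware.

```python
COG_ADDR_BITS = 9  # cog addresses occupy bits 8:0


def is_hub_address(dest: int) -> bool:
    """True when any of bits 31:9 of a 32-bit destination are set."""
    return ((dest & 0xFFFFFFFF) >> COG_ADDR_BITS) != 0


def mode_after_branch(dest: int) -> str:
    """Execution mode selected by a call, jump, or return to dest."""
    return "hub" if is_hub_address(dest) else "cog"


# The same check would apply to call targets and to popped return addresses.
for dest in (0x000, 0x1FF, 0x200, 0x12345):
    print(f"${dest:05X} -> {mode_after_branch(dest)} mode")
```

Because the check depends only on the destination value, the same rule can cover calls, jumps, and returns without separate hub-specific return instructions.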
  • Bill Henning Posts: 6,445
    edited 2013-12-06 09:04
    David,

    It makes some sense in that it could work, but it would waste a LOT of memory in the hub. Every single hubexec jump / call would take TWO longs instead of one, and two processor cycles instead of one.

    I am guessing this would bloat compiled code by a good 12%-15%+, and cause a 5%-6% performance hit (based on observed % of instructions being a jump or call)

    Opcode space has already been found for HJMP/HCALL/HCALLA/HCALLB, and there is tons of space for HRET/HRETA/HRETB, so there is no reason to waste precious hub longs or processor cycles.

    Sorry, it does not make any sense.

    I am strongly against this proposal, due to the waste of hub memory, and the extra clock cycle every jump and call would incur.
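
The size argument above can be put in rough numbers. A minimal sketch, assuming each jump or call costs exactly one extra long for the prefix; the instruction counts below are hypothetical and chosen only to illustrate the arithmetic.

```python
def prefixed_size_longs(total_instructions: int, jumps_and_calls: int) -> int:
    """Code size in longs when every jump/call needs one extra prefix long."""
    return total_instructions + jumps_and_calls


def bloat_percent(total_instructions: int, jumps_and_calls: int) -> float:
    """Size increase, as a percentage of the unprefixed code size."""
    return 100.0 * jumps_and_calls / total_instructions


# Hypothetical program: 1000 instructions, 120 of them jumps or calls.
total, branches = 1000, 120
print(prefixed_size_longs(total, branches), "longs")
print(bloat_percent(total, branches), "% larger")
```

Under this model the size increase equals the fraction of instructions that are jumps or calls, which is why the estimate depends on the observed instruction mix.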
    David Betz wrote: »
    I was thinking that if we have the BIG instruction we could use that to extend the range of the existing CALLA/CALLB destination. Any destination with bits set in 31:9 could be considered a hub address and cause an automatic switch to hub execute mode using a bit-extended version of the PC register as the program counter rather than PTRA. A CALLA/CALLB instruction that didn't use BIG or that had zeros in bits 31:9 would be considered a COG location. As long as we always call functions with CALLA/CALLB, we don't have to worry about the PC being wider than 9 bits. We could just say you can't use CALL/RET when executing from hub memory. Also, the RETA/RETB instructions could also check bits 31:9 of the value they pop off the stack and switch back to hub mode if those bits were non-zero. Actually, you *could* still use CALL/RET but you couldn't use them from hub mode code. You'd have to use them from COG mode code. So the call from hub mode to COG mode would be using CALLA/CALLB but then the COG mode function can use CALL/RET to call any functions it wants to call that are also in COG mode.

    If we were to go with this approach I'd also like a new instruction like CALLREG that would store its return address in the D register and branch to the S address possibly extended by a BIG instruction. This would allow GCC to avoid the use of the AUX stack. The return could just be the normal JMP D instruction with the appropriate checking of bits 31:9 to determine whether to switch to hub mode or COG mode. If this instruction is added, it could also be used to call from hub mode into COG mode.

    Does this make sense?

    Edit: I guess if it's necessary to use PTRA as the PC in hub mode the real PC could be copied to PTRA on the CALL instruction that switches to hub mode and copied back to the real PC when transitioning from hub mode back to COG mode.
  • David Betz Posts: 14,511
    edited 2013-12-06 09:08
    Bill Henning wrote: »
    David,

    It makes some sense in that it could work, but it would waste a LOT of memory in the hub. Every single hubexec jump / call would take TWO longs instead of one, and two processor cycles instead of one.

    I am guessing this would bloat compiled code by a good 20%-30%+, and cause a 5%-6% performance hit (based on observed % of instructions being a jump or call)

    Opcode space has already been found for HJMP/HCALL/HCALLA/HCALLB, and there is tons of space for HRET/HRETA/HRETB, so there is no reason to waste precious hub longs.

    Sorry, I am strongly against this proposal, due to the waste of hub memory, and the extra clock cycle every jump and call would incurr.
    Actually, we could add new instructions in the places already freed up for HCALL/HCALLA/HCALLB and use them for a single LCALL instruction (for "long call") that has a larger embedded immediate argument but the same semantics that I described in my original message. In other words, the LCALL instruction could call either COG code or HUB code and the distinction would be made based on bits 31:9 just as in my proposal. There would be no need for HRET/HRETA/HRETB because those would be handled as in my proposal. Note also that you probably want LCALLR, LCALLD, and LCALLRD for completeness.
  • potatohead Posts: 10,260
    edited 2013-12-06 09:10
    While we compare the merits of how HUBEXEC could get done, it's worth comparing it to the best LMM solution we have centered on. We know what that will do, and it's in the can already.

    Personally, I prefer clean and robust functionality over peak performance and/or size. If it's possible to knock it out of the park, great! We all are up for that. However, if something has to give, my preference would be performance first, then size.
  • Bill Henning Posts: 6,445
    edited 2013-12-06 09:13
    David,

    Sorry my friend, but you are grasping at straws, and your proposed changes would significantly decrease performance, and significantly bloat code. Not a good thing.

    I cannot think of a single good technical reason for what you propose, and you are not addressing the memory bloat and performance loss I pointed out.

    It makes good sense to keep the cog CALL and HCALL instructions separate, as it makes the code clearer - if you see an HCALLx, you KNOW it is calling hubexec code, if you see a CALLx, you know it is calling cog code.

    FYI, the reason I used an H prefix for the instructions is that they are hub execution mode instructions, and I would use an X prefix for a hypothetical P2.1 that had a hypothetical DDR2 direct execution mode.

    For the P3, I have fond hopes of having a merged address space, where the highest two bits would distinguish between address spaces :-)
    David Betz wrote: »
    Actually, we could add new instructions in the places already freed up for HCALL/HCALLA/HCALLB and use them for a single LCALL instruction (for "long call") that has a larger embedded immediate argument but the same semantics that I described in my original message. In other words, the LCALL instruction could call either COG code or HUB code and the distinction would be made based on bits 31:9 just as in my proposal. There would be no need for HRET/HRETA/HRETB because those would be handled as in my proposal. Note also that you probably want LCALLR, LCALLD, and LCALLRD for completeness.
  • Bill Henning Posts: 6,445
    edited 2013-12-06 09:15
    Potatohead,

    Unique instructions for HCALL/HJMP/HRET are more reasonable, faster, and save memory.

    Using a prefix for them is a deliberate waste of memory and processor speed, and should not even be considered.

    David has plenty of great ideas, but this is a terrible idea.
    potatohead wrote: »
    While we compare the merits of how HUBEXEC could get done, it's worth comparing it to the best LMM solution we have centered on. We know what that will do, and it's in the can already.

    Personally, I prefer clean and robust functionality over peak performance and or size. If it's possible to knock it out of the park, great! We all are up for that. However, if something has to give, my preference would be performance first, then size.
  • David Betz Posts: 14,511
    edited 2013-12-06 09:16
    Bill Henning wrote: »
    David,

    Sorry my friend, but you are grasping at straws, and your proposed changes would significantly decrease performance, and significantly bloat code. Not a good thing.

    I cannot think of a single good technical reason for what you propose, and you are not addressing the memory bloat and performance loss I pointed out.
    Huh? If I use LCALL to call a hub function then it takes 32 bits just like in your proposal. Where is the code bloat?

    Another advantage of this approach is that it concentrates the handling of mode changes between COG and hub modes in one place, the examination of the destination of a CALL or JMP. That and the possible need to copy PC to PTRA on transitions to hub mode is pretty much all that is necessary and the instruction cache doesn't need to be mapped into COG register space.
  • Bill Henning Posts: 6,445
    edited 2013-12-06 09:20
    You proposed the BIG prefix, then a CALL, being an LCALL. That is where the bloat is.

    Now if you just mean that you want to rename HCALLx to LCALLx, that is not really an issue. I prefer the 'H' prefix due to it running code from the 'H'ub, and I was intending to suggest an 'X' prefix for a future XMM variant for P2.1 ... both in the interest of easy readability.

    If you have backed down from two-long HJMP/HCALLx then we are not in conflict, and Chip can (and will) call the new instructions whatever he wants... the names I suggested were merely a logical proposal following the cog instruction names.
    David Betz wrote: »
    Huh? If I use LCALL to call a hub function then it takes 32 bits just like in your proposal. Where is the code bloat?

    Another advantage of this approach is that it concentrates the handling of mode changes between COG and hub modes in one place, the examination of the destination of a CALL or JMP. That and the possible need to copy PC to PTRA on transitions to hub mode is pretty much all that is necessary and the instruction cache doesn't need to be mapped into COG register space.
  • David Betz Posts: 14,511
    edited 2013-12-06 09:24
    Bill Henning wrote: »
    You proposed the BIG prefix, then a CALL, being an LCALL. That is where the bloat is.

    Now if you are just meaning that you want to rename HCALLx to LCALLx, that is not really an issue. I prefer the 'H' prefix due to it running code from the 'H'ub, and I was intending on suggesting 'X' prefix for future XMM variant for P2.1 ... both in the interest of easy readebility.

    If you have backed down from two-long HJMP/HCALLx then we are not in conflict, and Chip can (and will) call the new instructions whatever he wants... the names I suggested were merely a logical proposal following the cog instruction names.
    The reason that I renamed HCALL to LCALL is that it isn't restricted to calling hub functions. If bits 31:9 are zero it will call a COG function. You could still use BIG along with CALLA/CALLB to call a hub function as well but that won't really be useful until we get more hub memory taking it out of range of the LCALL instruction. However, that could happen and this proposal would be able to handle that where yours wouldn't.
  • potatohead Posts: 10,260
    edited 2013-12-06 09:26
    @Bill Yes, I see that, and I'm only expressing a general preference here.

    I think the feature is worth a very robust discussion. That is happening, and it's good to see. Should resolution require trade-offs, I posted what my preference would be. And that's not aimed at anybody, except maybe Chip, who needs to make the call on this at some point.

    :) Got snow here guys. Maybe I can sneak out a bit early and have some Prop time on a Friday!
  • Cluso99 Posts: 18,069
    edited 2013-12-06 09:28
    If HUBEXEC gives almost 1 clock cycle per instruction due to background fetching, then why add complexities of using COG at all for running instructions?

    Seems like a no-brainer to just keep it clean and always run from Hub. Am I missing something here?
  • Bill Henning Posts: 6,445
    edited 2013-12-06 09:29
    Post edited to remove upset words due to misunderstanding David's post. Sorry David.
    David Betz wrote: »
    The reason that I renamed HCALL to LCALL is that it isn't restricted to calling hub functions. If bits 31:9 are zero it will call a COG function. You could still use BIG along with CALLA/CALLB to call a hub function as well but that won't really be useful until we get more hub memory taking it out of range of the LCALL instruction. However, that could happen and this proposal would be able to handle that where yours wouldn't.
  • potatohead Posts: 10,260
    edited 2013-12-06 09:36
    @Cluso
    Seems like a no-brainer to just keep it clean and always run from Hub.

    Always running from the HUB would mean a basic conflict between executing code and moving data. It also would mean loss of tasking mode, or a significantly more complicated one. Real time response would be impacted too. We need COG code at this time in the chip history, IMHO.

    I think we should have this discussion for P3, or some variant of it, like there is no HUB in the chip. Make it external memory, or add a MMU, etc... Those kinds of ideas will be interesting then, but seem way out of scope right now. The jump from P1 to P2.x is going to be significant now as it is.

    My .02
  • Cluso99 Posts: 18,069
    edited 2013-12-06 09:42
    Chip,
    The P2 has exploded in both functionality and performance over the past week. This followed on from your mammoth work in reorganising the instruction set, and adding new ones.
    It would now be such a shame not to take the little time required to add some additional significant features, for what is most likely, minimal delays.

    IMHO these are, in no particular order...
    (1) SETSLOT
    (2) HUBEXEC mode
    (3) Auto HUB-AUX transfer (mainly for video)

    SETSLOT
    SETSLOT is, I understand, quite simple. It uses otherwise unused bandwidth, so really it’s a no-brainer. If you don’t want to tell everyone, that’s fine. But don’t miss the opportunity.
    This is a simple implementation...
    (a) Each COG can YIELD (other cog takes priority) or GIFT (this cog has priority) its slot to another COG
    SETSLOT #0_0_y_g_ccc
    (b) Each COG can accept other COG(s) YIELD/GIFT slot(s), and/or accept any AVAILABLE slots
    SETSLOT #p_a_0_0_000
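
One possible reading of the two operand layouts above, as a sketch. It assumes the seven fields are packed in the order written (`p`, `a`, `y`, `g`, then a 3-bit cog number `ccc` in the low bits); the helper names are hypothetical, and `p` and `a` are kept as opaque flags because the post does not define them beyond accepting yielded/gifted or available slots.

```python
def setslot_donate(target_cog: int, yield_slot: bool = False,
                   gift_slot: bool = False) -> int:
    """Build the #0_0_y_g_ccc operand: hand this cog's slot to target_cog."""
    if not 0 <= target_cog <= 7:
        raise ValueError("target_cog must fit in 3 bits (0..7)")
    return (int(yield_slot) << 4) | (int(gift_slot) << 3) | target_cog


def setslot_accept(p: bool = False, a: bool = False) -> int:
    """Build the #p_a_0_0_000 operand from the two accept flags."""
    return (int(p) << 6) | (int(a) << 5)


# Yield this cog's slot to cog 3, and separately set both accept flags.
print(f"{setslot_donate(3, yield_slot=True):07b}")
print(f"{setslot_accept(p=True, a=True):07b}")
```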

    HUBEXEC & HUB-AUX transfers
    These seem to be related, in that both the AUX Ram and the INSTRUCTION cache could/should have 2 ports, an 8*Long port for hub transfers, and a 1*Long read port for Instruction or Video DAC.
    If I understand correctly, the RD/WRWIDE Cache would be separate from the proposed INSTRUCTION cache.

    I wonder whether a specialised, hand-laid 8*Long block could be replicated 32 times for the AUX Ram, 2 times for the Instruction Cache, and maybe 1 time for the WIDE cache?
    I do understand this would be on a critical path for Beau, but is this really a big job for the benefits that it would likely give, provided of course it does not delay the shuttle run?

    Also, might this mean that the AUX could be increased to 512*longs if there is enough die space available?

    Both the HUBEXEC mode and the HUB-AUX auto-transfers would be yet another leap forward. Again, surely any additional work would be well worth it.

    I do know there will be a lot of objections to any changes. If you had listened a week ago, we would not have 256KB of Hub and WIDE accesses now.
    In fact, if the current naysayers had their way, there would be no LMM (not even on the P1) because it’s way too complex for the majority!
    Just be careful of vocal minorities and silent majorities. Only you and Ken can really decide.
  • Bill Henning Posts: 6,445
    edited 2013-12-06 09:42
    Hi Potatohead,

    Limitations on using other tasks in the same cog, and some data moving slowdown would only occur if the cog was in hubexec mode.

    Any cog not in hubexec mode would not be affected at all.

    Having one or two hubexec cogs running at 160-200MIPS vs. 25-50MIPS (lmm) is a HUGE difference, and will get many more design ins, making a lot more money for Parallax and helping fund future P3's.

    Saying that "because a hubexec cog can't efficiently use tasks, therefore let's not do hubexec, and deliberately limit ourselves to 1/4 or less speed for large code" does not make sense to me.

    Tasking in hubexec would work, it would just slow down hubexec to approximately twice LMM speed.
    potatohead wrote: »
    @Cluso

    Always running from the HUB would mean a basic conflict between executing code and moving data. It also would mean loss of, or significantly more complicated tasking mode. Real time response would be impacted too. We need COG code at this time in the chip history, IMHO.

    I think we should have this discussion for P3, or some variant of it, like there is no HUB in the chip. Make it external memory, or add a MMU, etc... Those kinds of ideas will be interesting then, but seem way out of scope right now. The jump from P1 to P2.x is going to be significant now as it is.

    My .02
  • potatohead Posts: 10,260
    edited 2013-12-06 09:49
    If I'm not mistaken, Cluso wrote about "always execute from HUB"

    I took that to mean *always* as in we simply don't fetch the data for a COG to run with like we do now, which seriously changes things.
    Seems like a no-brainer to just keep it clean and always run from Hub.
  • David Betz Posts: 14,511
    edited 2013-12-06 09:51
    Bill Henning wrote: »
    David,

    Let me put it this way.

    GIVE UP ON USING A PREFIX FOR CALL/JMP AND WASTING SO MUCH HUB SPACE AND CYCLES

    This one I will fight, and I am upset.

    You are a smart guy, and there is no way you do not see the increased memory usage and loss of cycles.

    Are you deliberately trying to slow down the P2 and waste memory?

    It's one thing to object to the visibility of BIG and exposing the hub execution registers, but deliberately wasting memory and cycles just to use your BIG prefix?

    I will fall out of my chair in surprise if Chip supports this deliberate bloat, inefficiency.

    As a matter of fact, this would cause P2 to lose design-wins where the memory wasteage and uselessly lost processor cycles will matter.
    You don't seem to be listening to what I'm saying at all. Because of the BIG instruction, a 64 bit CALLA/CALLB will be available without any extra work. It can be used in a future chip if we get more hub memory than can be addressed by LCALL but until then there is no need to use it. I only pointed it out because it falls out of the existence of BIG and if there was no space for the LCALL instruction then it would work. For P2 with 256k of hub memory only LCALLA/LCALLB/LCALLREG would be used. So in that sense it isn't any different from your proposal. The place where it differs is that the same call instructions can be used to call hub mode code or COG mode code and the determination is made based on the upper bits of the destination address. This allows the same RETA/RETB/REGREG/etc instructions to be used so that special HRETx instructions are not needed. It also regularizes the addressing so there is no conflict between hub addresses and COG addresses. They are all in the same address space. This seems like a big win to me and solves the problem of calling COG code from hub code without having to resort to a visible window into hub memory wasting space in the COG register memory map.
  • Cluso99 Posts: 18,069
    edited 2013-12-06 09:53
    potatohead wrote: »
    @Cluso

    Always running from the HUB would mean a basic conflict between executing code and moving data. It also would mean loss of, or significantly more complicated tasking mode. Real time response would be impacted too. We need COG code at this time in the chip history, IMHO.
    Why, if the alternative is not having HUBEXEC at all, which seems to be the opinion of a few here?
    If we can get HUBEXEC operating at max efficiency (hub loading in the background, giving full speed to the hub execution mode) then why is switching to cog mode necessary at all?
    We still have LMM (if you really want it) for running if you want to switch back and forth. Isn't it way cleaner to just run Hubexec mode from hub always? This mode is for highly efficient use and simplicity in design.
    BTW It does not mean you cannot switch back to running cog mode for large block transfers, if that is what is needed.
    I think we should have this discussion for P3, or some variant of it, like there is no HUB in the chip. Make it external memory, or add a MMU, etc... Those kinds of ideas will be interesting then, but seem way out of scope right now. The jump from P1 to P2.x is going to be significant now as it is.

    My .02
    I strongly disagree. If it can be done simply, and it seems to be possible provided it's not overcomplicated by silly requests, then shouldn't we try to flesh out a framework so it can be added simply now?