Hub Execution Model Thread (split from blog) - Page 13 — Parallax Forums

Hub Execution Model Thread (split from blog)


Comments

  • Bill Henning Posts: 6,445
    edited 2013-12-07 09:06
    Excellent!

    Good performance for whole-cog, low risk for testing. If time allows, vacuuming unused slots in hubexec mode is a big performance boost.

    Note to self... get ready to load DE2-115 as soon as config file is posted...

    Plenty of time for larger caches in future version.

    Pleasant Dreams Chip :-)
    cgracey wrote: »
    I'm going to put a 4-line (x8 long) instruction cache into each cog. Running one hub task would work well. Running more than one task would thrash the cache. There will be a 1-line (x8 long) data cache in each cog for RDxxxxC. Along with Z/C/PC for each task, there will be a bit signifying whether hub mode is active. In hub mode, the conditional branches (DJNZ, JP, TJZ) will probably become bit8-extended relative branches.

    I'm going to sleep. When I come back, I'll increase the program counters to 16 bits and make sure things still run. Then, I'll add the instruction cache.
  • Cluso99 Posts: 18,069
    edited 2013-12-07 09:30
    Very interesting to see where you guys have reached. It's 4am here and I have been thinking...
    (I hope I can explain the basics of what I am thinking)

    Instruction Cache is basically identical to Aux Ram - it has a 256 bit hub interface on one side (that can also go to cog), and a 32 bit video or instruction on the other side.
    I would think the same could be said for the RDxxxxC Cache.

    Now, what if the Aux Ram was in contiguous blocks that could be divided into four identical sections, depending on the program requirements. This way, we don't waste valuable ram when not required. I am not going to suggest increasing its size, although that may be possible - just hear me out first.

    So, we have 256 longs of aux built as 4x 64*Longs. Each block is dual port with one port being read only and 32 bits wide. That port can be muxed to either video out or to the COG ALU/PIPELINE (as instruction cache in the case of hubexec, or D or S inputs in the case of the RDxxxxC Cache). The second port is 8*Long bus to HUB for filling the AUX (as in Instruction Cache, and the RDWIDE Cache, and the Video Cache). I am unsure if this also needs to be multiplexed to the COG too???

    Now, the allocation of the 4 blocks of Aux (4x 64*Longs) would be as follows:
    0: AUX RAM
    1: AUX RAM
    2: AUX RAM or INSTRUCTION CACHE (HUBEXEC)
    3: AUX RAM or RDWIDE CACHE
    The idea is that the Instruction Cache is only required if HUBEXEC mode is going to be used. This would be enabled by setting a mode bit when/after the cog starts up. The same applies to the RDWIDE Cache. The default for both is NO Instruction Cache and NO RDWIDE cache. This would then place the whole 4x 64*Longs into AUX RAM and hence be usable as Video Cache or Aux Ram.

    In other words, if we don't need an Instruction Cache, then we can make our video buffer/aux ram bigger. Why waste a precious piece of silicon?
  • cgracey Posts: 14,133
    edited 2013-12-07 09:37
    Cluso99 wrote: »
    Very interesting to see where you guys have reached. It's 4am here and I have been thinking...
    (I hope I can explain the basics of what I am thinking)

    Instruction Cache is basically identical to Aux Ram - it has a 256 bit hub interface on one side (that can also go to cog), and a 32 bit video or instruction on the other side.
    I would think the same could be said for the RDxxxxC Cache.

    Now, what if the Aux Ram was in contiguous blocks that could be divided into four identical sections, depending on the program requirements. This way, we don't waste valuable ram when not required. I am not going to suggest increasing its size, although that may be possible - just hear me out first.

    So, we have 256 longs of aux built as 4x 64*Longs. Each block is dual port with one port being read only and 32 bits wide. That port can be muxed to either video out or to the COG ALU/PIPELINE (as instruction cache in the case of hubexec, or D or S inputs in the case of the RDxxxxC Cache). The second port is 8*Long bus to HUB for filling the AUX (as in Instruction Cache, and the RDWIDE Cache, and the Video Cache). I am unsure if this also needs to be multiplexed to the COG too???

    Now, the allocation of the 4 blocks of Aux (4x 64*Longs) would be as follows:
    0: AUX RAM
    1: AUX RAM
    2: AUX RAM or INSTRUCTION CACHE (HUBEXEC)
    3: AUX RAM or RDWIDE CACHE
    The idea is that the Instruction Cache is only required if HUBEXEC mode is going to be used. This would be enabled by setting a mode bit when/after the cog starts up. The same applies to the RDWIDE Cache. The default for both is NO Instruction Cache and NO RDWIDE cache. This would then place the whole 4x 64*Longs into AUX RAM and hence be usable as Video Cache or Aux Ram.

    In other words, if we don't need an Instruction Cache, then we can make our video buffer/aux ram bigger. Why waste a precious piece of silicon?

    I like that idea a lot. The hard thing is the 256-bit data port. That memory would need to grow 4x as wide, and it barely fits now. I hate using so many flops to build cache lines (4 lines of 256 bits is 1,024 flops). Those take a lot more area than SRAM would. I need to find that image Beau and I made the other day and see if anything is obviously possible. If we expanded AUX 4x, I wonder how big it would be.
  • Bill Henning Posts: 6,445
    edited 2013-12-07 09:44
    I think we should discuss this - a LOT - for a future P2.1+ where we can have a much bigger AUX ram. Right now, we need to use the AUX ram as a stack (for languages that can use it).
    cgracey wrote: »
    I like that idea a lot. The hard thing is the 256-bit data port. That memory would need to grow 4x as wide, and it barely fits now. I hate using so many flops to build cache lines (4 lines of 256 bits is 1,024 flops). Those take a lot more area than SRAM would. I need to find that image Beau and I made the other day and see if anything is obviously possible. If we expanded AUX 4x, I wonder how big it would be.
  • Cluso99 Posts: 18,069
    edited 2013-12-07 09:55
    Regarding LRU, single HUBEXEC task per cog, etc...

    Firstly, let me point out some caveats...

    When in HUBEXEC mode, it is not going to be possible to modify instructions like we do in Cog. Why? Because we have an instruction cache, and this would mean a lot of extra logic to ensure cache coherency - too much silicon and too much risk. Other processors (not PCs) are typically running from ROM so they cannot modify instructions anyway.

    This means that you cannot write to HUB to the "instruction section" as this may already be in the Cache.

    HUBEXEC...

    Any COG(S) can run in HUBEXEC mode.

    I don't believe that multitasking in HUBEXEC mode within that cog should be permitted. By this, I mean only a single HUBEXEC task per cog. I am also happy to preclude multitasking within this cog also, but if Chip can safely make multitasking work (1xHUBEXEC and up to 3xCOG mode) then this is a luxury/bonus. This makes the Instruction Cache so much cleaner to implement, and easier to describe.

    If, as in my model above, there were to be a 64*Long block (or 32*Long block as suggested by Chip - the concept is the same), the blocks would represent a block of Hub of that same size and boundary. The Cache controller would autofill the next WIDE, if not loaded, as soon as execution began on the prior WIDE (8*Long). So mapping would be by the lower address bits. This would permit up to a 64 instruction loop in my model (or 32*Long in Chip's model). Either would be fine. Since it is only one task this simplifies the WIDE fetching algorithm. No LRU algorithm is required this way.

    I also think that switching in and out of HUBEXEC mode (as suggested by David many posts ago) would be fantastic too.

    Just to reinforce my above model, if the particular cog does not use HUBEXEC mode, then its Instruction Cache would have defaulted to normal AUX Ram usage.
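The "mapping by the lower address bits" idea above can be sketched as a direct-mapped cache. This is only an illustrative model, not the actual hardware design: the class and function names are made up for the example, addresses are counted in longs, and the autofill of the next WIDE is left out so that only the mapping is shown.

```python
LONGS_PER_LINE = 8   # one WIDE = 8 longs
NUM_LINES = 4        # 4 lines = 32 longs (Chip's size); 8 lines would give 64 longs


class DirectMappedICache:
    """Hypothetical direct-mapped instruction cache indexed by low address bits."""

    def __init__(self, num_lines=NUM_LINES):
        self.num_lines = num_lines
        self.tags = [None] * num_lines  # hub line address currently held in each slot
        self.hits = 0
        self.misses = 0

    def fetch(self, long_addr):
        """Look up one instruction fetch; return True on a hit, False on a miss."""
        line_addr = long_addr // LONGS_PER_LINE   # which WIDE in hub
        slot = line_addr % self.num_lines         # low bits of the line address pick the slot
        if self.tags[slot] == line_addr:
            self.hits += 1
            return True
        self.tags[slot] = line_addr               # refill the slot from hub
        self.misses += 1
        return False


def run_loop(cache, start, length, passes):
    """Execute a straight-line loop of `length` longs `passes` times."""
    for _ in range(passes):
        for addr in range(start, start + length):
            cache.fetch(addr)
    return cache.hits, cache.misses
```

In this model a WIDE-aligned 32-long loop run 10 times misses only 4 times (once per line, on the first pass). A 40-long loop spans 5 lines, so its first and last lines share a slot and evict each other on every pass.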
  • Cluso99 Posts: 18,069
    edited 2013-12-07 09:58
    Next thing...

    I don't understand why GCC requires a single memory model. Surely GCC is used on other micros, where the instruction memory is flash and the data and stack are ram??? Surely these are not single address models???
  • Bill Henning Posts: 6,445
    edited 2013-12-07 10:00
    Cluso99 wrote: »
    Regarding LRU, single HUBEXEC task per cog, etc...

    Firstly, let me point out some caveats...

    When in HUBEXEC mode, it is not going to be possible to modify instructions like we do in Cog. Why? Because we have an instruction cache, and this would mean a lot of extra logic to ensure cache coherency - too much silicon and too much risk. Other processors (not PCs) are typically running from ROM so they cannot modify instructions anyway.

    This means that you cannot write to HUB to the "instruction section" as this may already be in the Cache.

    Totally agreed. I already assumed this, as we don't have set{s/d/i/x} for hub longs, and it is too big a change. Shudder.
    Cluso99 wrote: »
    HUBEXEC...

    Any COG(S) can run in HUBEXEC mode.

    I don't believe that multitasking in HUBEXEC mode within that cog should be permitted. By this, I mean only a single HUBEXEC task per cog. I am also happy to preclude multitasking within this cog also, but if Chip can safely make multitasking work (1xHUBEXEC and up to 3xCOG mode) then this is a luxury/bonus. This makes the Instruction Cache so much cleaner to implement, and easier to describe.

    I think it may be useful to have multiple hubexec tasks - in a future prop, with bigger cache - as otherwise cache thrashing city. For P2? Not needed.

    One hubexec + 3 cog tasks? Could be handy, if easy to implement.

    pthreads will work fine for the hubexec task.
    Cluso99 wrote: »
    If, as in my model above, there were to be a 64*Long block (or 32*Long block as suggested by Chip - the concept is the same), the blocks would represent a block of Hub of that same size and boundary. The Cache controller would autofill the next WIDE, if not loaded, as soon as execution began on the prior WIDE (8*Long). So mapping would be by the lower address bits. This would permit up to a 64 instruction loop in my model (or 32*Long in Chip's model). Either would be fine. Since it is only one task this simplifies the WIDE fetching algorithm. No LRU algorithm is required this way.

    I also think that switching in and out of HUBEXEC mode (as suggested by David many posts ago) would be fantastic too.

    Just to reinforce my above model, if the particular cog does not use HUBEXEC mode, then its Instruction Cache would have defaulted to normal AUX Ram usage.

    With the limited size of AUX, and its limited number of ports, I do not want it used as the instruction cache. We don't want to add ports to P2.

    If a cog does not use hubexec, the instruction cache could augment the RDxxxxC cache, perhaps by providing a separate line for each task.

    Performance wise, multiple separate busses are a BIG win.
  • Cluso99 Posts: 18,069
    edited 2013-12-07 10:04
    cgracey wrote: »
    I like that idea a lot. The hard thing is the 256-bit data port. That memory would need to grow 4x as wide, and it barely fits now. I hate using so many flops to build cache lines (4 lines of 256 bits is 1,024 flops). Those take a lot more area than SRAM would. I need to find that image Beau and I made the other day and see if anything is obviously possible. If we expanded AUX 4x, I wonder how big it would be.
    Chip,
    Could these Aux ram blocks be redone as simple 32*Long wide blocks and duplicated, just like Beau and yourself did for the hub? Would they be much smaller? Is there a great deal of work especially in view of the performance gains, usage, etc?
    This way, the Video Cache could be directly loaded in WIDE blocks using a specific instruction. The same instruction could be used by the cog to load Aux Ram (it's the same anyway). This way, you get instant 8*Long loading into the Aux Ram section without any moving of data between the RDWIDE Cache and Aux Ram.
    Just a thought. I know nothing of the time/logistics/etc. Only you and Beau know this.

    Postedit:
    A total Aux size of 512 Longs would be wonderful. Thinking about it, 32*Longs would be great for the HUBEXEC block, and just allocate the same for the RDWIDE cache although I think it could be smaller. If only a total of say 256+32+32 that would even be fine too, or even just the 256 total. It gains us so much power.
  • Cluso99 Posts: 18,069
    edited 2013-12-07 10:24
    Totally agreed. I already assumed this, as we don't have set{s/d/i/x} for hub longs, and it is too big a change. Shudder.
    Yes, me too. Just wanted everyone to know so they don't get the wrong idea.
    I think it may be useful to have multiple hubexec tasks - in a future prop, with bigger cache - as otherwise cache thrashing city. For P2? Not needed.

    One hubexec + 3 cog tasks? Could be handy, if easy to implement.

    pthreads will work fine for the hubexec task.
    Agreed.
    With the limited size of AUX, and its limited number of ports, I do not want it used as the instruction cache. We don't want to add ports to P2.

    If a cog does not use hubexec, the instruction cache could augment the RDxxxxC cache, perhaps by providing a separate line for each task.

    Performance wise, multiple separate busses are a BIG win.
    My thinking is that AUX would hopefully increase by the Instruction Cache and by the RDxxxxC Cache amount. And, if not needed, could be used as a larger AUX.

    The big win here is the direct WIDE loading of the AUX, and it's simple and clean. And if I am not mistaken, only muxes onto the various existing buses would be required. Obviously the buses will be widened to 256 bits.

    It may just be worth doing to use the smaller SRAM rather than flip-flops. We have gained so much already, it would be nice to have this too. This would be the icing on the cake!!! I sense a long phone call between Chip and Beau ;)

    WRWIDE
    To get the performance in the cog that does the video processing, the WRWIDE instruction will be necessary. This means that the AUX needs to have the WIDE bus be both Read/Write to Hub, and since this would be shared with the cog, the cog will be stalled while the WRWIDE takes place. Perhaps the WRWIDE can be executed in the background and cog execution can proceed.
  • David Betz Posts: 14,511
    edited 2013-12-07 11:34
    LRU performs much better than direct mapped, MUCH faster.

    Having said that, you could use the task id bits (2 bits) as the high bits, and two more bits for cache line.

    Suggestion:

    Direct mapped for first verilog test, to get things working

    while everyone plays, you see if you can easily add LRU
    Yes, LRU will probably perform better but it also takes more logic to implement doesn't it? First, you need to update the LRU bits on every access. Then you also have to do an associative lookup to figure out if an address is already in one of the cache lines. If that additional logic is acceptable then LRU would probably work better than direct-mapped.
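The extra work described above (an associative lookup on every fetch, plus an LRU update on every access) can be sketched in software. This is a hedged model only: the class name and sizes are assumptions for illustration, addresses are counted in longs, and real hardware would use comparators and a few LRU state bits rather than an ordered dictionary.

```python
from collections import OrderedDict

LONGS_PER_LINE = 8  # one cache line = 8 longs


class LruICache:
    """Hypothetical fully associative instruction cache with LRU replacement."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # line address -> True, least recently used first
        self.hits = 0
        self.misses = 0

    def fetch(self, long_addr):
        """Look up one instruction fetch; return True on a hit, False on a miss."""
        line_addr = long_addr // LONGS_PER_LINE
        if line_addr in self.lines:              # associative lookup across all lines
            self.lines.move_to_end(line_addr)    # LRU state is updated on every access
            self.hits += 1
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)       # evict the least recently used line
        self.lines[line_addr] = True
        self.misses += 1
        return False
```

In this model, a loop body at long address 256 that keeps calling a routine at 288 misses only twice in total, whereas a 4-line direct-mapped cache would put both lines in the same slot and refill on every switch. LRU is not always the winner, though: a straight-line loop spanning 5 lines misses on every line of every pass with 4 LRU lines.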
  • David Betz Posts: 14,511
    edited 2013-12-07 11:36
    I agree that would be very nice, but in order to not kill performance, requires multiple lines of data cache and code cache per task --> probably too big a change for P2, and significantly more transistors.
    I wasn't proposing multiple hub mode tasks. I was proposing a single hub mode task and up to 3 COG mode tasks that wouldn't use the instruction cache at all.
  • David Betz Posts: 14,511
    edited 2013-12-07 11:44
    Cluso99 wrote: »
    Next thing...

    I don't understand why GCC requires a single memory model. Surely GCC is used on other micros, where the instruction memory is flash and the data and stack are ram??? Surely these are not single address models???
    Yes that is true but that split is between instruction space and data space. Here we're talking about splitting data space in two. How would you represent pointers? I'd like to write a function that takes a pointer and can dereference it without knowing if that pointer points to hub memory or AUX memory. How would I do that?
  • David Betz Posts: 14,511
    edited 2013-12-07 11:48
    cgracey wrote: »
    I'm going to put a 4-line (x8 long) instruction cache into each cog. Running one hub task would work well. Running more than one task would thrash the cache. There will be a 1-line (x8 long) data cache in each cog for RDxxxxC. Along with Z/C/PC for each task, there will be a bit signifying whether hub mode is active. In hub mode, the conditional branches (DJNZ, JP, TJZ) will probably become bit8-extended relative branches.

    I'm going to sleep. When I come back, I'll increase the program counters to 16 bits and make sure things still run. Then, I'll add the instruction cache.
    I'd recommend leaving DJNZ and friends alone and letting them address the larger hub memory by using the BIG instruction. Narrow range relative branches may be difficult to use by the GCC code generator.
  • ctwardell Posts: 1,716
    edited 2013-12-07 12:10
    David Betz wrote: »
    I'd recommend leaving DJNZ and friends alone and letting them address the larger hub memory by using the BIG instruction. Narrow range relative branches may be difficult to use by the GCC code generator.

    Would the BIG value represent an absolute or relative address?

    I assume we will want the GCC code to be position independent so it is easy to load multiple GCC apps in hub memory for various COGS.

    C.W.
  • David Betz Posts: 14,511
    edited 2013-12-07 12:54
    ctwardell wrote: »
    Would the BIG value represent an absolute or relative address?

    I assume we will want the GCC code to be position independent so it is easy to load multiple GCC apps in hub memory for various COGS.

    C.W.
    It would be absolute. All of the CALL and JMP instructions are absolute anyway so this doesn't really make it worse. We'd have to have a different instruction set if we want position independent code.
  • ctwardell Posts: 1,716
    edited 2013-12-07 13:18
    David Betz wrote: »
    It would be absolute. All of the CALL and JMP instructions are absolute anyway so this doesn't really make it worse. We'd have to have a different instruction set if we want position independent code.

    How much of the instruction set would really need changed? CALL and JMP obviously. I assume hub mode in general will be somewhat 'LMM like' in that some portion of the COG will be treated as registers and those will usually be the source and destination for most commands.

    I haven't read all of the concepts on the hub execution and won't pretend to fully understand them yet, but what if there was BIGA for absolute and BIGR for relative. Whichever one was used directly before an instruction would determine if the value was treated as absolute or relative.

    C.W.
  • jmg Posts: 15,155
    edited 2013-12-07 13:35
    David Betz wrote: »
    I'd recommend leaving DJNZ and friends alone and letting them address the larger hub memory by using the BIG instruction. Narrow range relative branches may be difficult to use by the GCC code generator.

    This can get complex quickly

    Q: Is BIG available on all opcodes ?

    Relative seems a good idea, because smaller, faster code is possible, but some small micros avoid relative, because it does cost extra logic.

    The smaller faster code of Relative helps especially in libraries, and it allows some DLL approaches to libraries.

    I can understand the Compiler will (initially) find longer-reach relatives harder to manage, but it can auto-generate a conditional relative jump around a BIG Jump (which is how little micros do this now) without needing any reach-checking
    cgracey wrote:
    Along with Z/C/PC for each task, there will be a bit signifying whether hub mode is active. In hub mode, the conditional branches (DJNZ, JP, TJZ) will probably become bit8-extended relative branches.

    Is there opcode (and Logic) room to have these relative in all modes ?
    It seems a little risky making an Opcode flip how it behaves, based on a RAM bit ?

    Having it able to reach over all COG memory in relative mode, I think needs one more opcode bit
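To make the "bit8-extended relative branch" idea concrete, here is a small sketch of how a 9-bit S field could be sign-extended from bit 8 and added to the program counter. The details are assumptions, not confirmed by Chip: the offset is taken in longs, relative to the instruction after the branch, and the PC is assumed to be 16 bits wide. The function names are made up for the example.

```python
def sign_extend_9(field):
    """Interpret a 9-bit field as a signed value by extending bit 8."""
    field &= 0x1FF
    return field - 0x200 if field & 0x100 else field


def relative_target(pc, s_field, pc_bits=16):
    """Branch target for a relative branch at `pc` (assumed: offset from pc + 1, in longs)."""
    return (pc + 1 + sign_extend_9(s_field)) & ((1 << pc_bits) - 1)
```

Under these assumptions the reach is -256 to +255 longs from the following instruction, so `relative_target(0x1000, 0x1FF)` branches back to `0x1000` itself and `relative_target(0x1000, 0x003)` lands on `0x1004`.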
  • jmg Posts: 15,155
    edited 2013-12-07 13:35
    Strange duplicate post trimmed, as delete fails to work
  • David Betz Posts: 14,511
    edited 2013-12-07 14:19
    jmg wrote: »
    This can get complex quickly

    Q: Is BIG available on all opcodes ?
    I hope so. I guess Chip would have to answer to be sure.
    Relative seems a good idea, because smaller, faster code is possible, but some small micros avoid relative, because it does cost extra logic.

    The smaller faster code of Relative helps especially in libraries, and it allows some DLL approaches to libraries.

    I can understand the Compiler will (initially) find longer-reach relatives harder to manage, but it can auto-generate a conditional relative jump around a BIG Jump (which is how little micros do this now) without needing any reach-checking
    A relative branch around a BIG jmp will take three longs. The BIG DJNZ will only take two longs.
  • David Betz Posts: 14,511
    edited 2013-12-07 14:23
    jmg wrote: »
    It seems a little risky making an Opcode flip how it behaves, based on a RAM bit ?
    I agree completely! I guess the COG mode DJNZ could be relative as well. It would make it harder to use with MOVS though since you'd have to compute the relative address rather than just stuffing in the value of a label.
  • jmg Posts: 15,155
    edited 2013-12-07 15:13
    David Betz wrote: »
    A relative branch around a BIG jmp will take three longs. The BIG DJNZ will only take two longs.

    Yup, and a Short.Rel DJNZ needs just one long. I like the idea of smallest, fastest loops, so a single long has appeal.

    BIG DJNZ value assumes that exists as a valid combination.
  • David Betz Posts: 14,511
    edited 2013-12-07 15:24
    jmg wrote: »
    Yup, and a Short.Rel DJNZ needs just one long. I like the idea of smallest, fastest loops, so a single long has appeal.

    BIG DJNZ value assumes that exists as a valid combination.
    My understanding is that Chip intends to add a BIG style instruction to extend both the S field and the D field of any instruction. At least he mentioned that at one point. I'm not sure if that is still his plan. I guess if you make DJNZ relative then I can reach anywhere in memory using a relative address by extending your relative DJNZ using the BIG instruction just as easily as I can an absolute one. The question remains though whether it's really a good idea to have DJNZ be relative in hub mode and absolute in COG mode.
  • Cluso99 Posts: 18,069
    edited 2013-12-07 15:36
    It will be interesting to hear what Chip thinks overnight.

    At least we are all agreed that there shall be only 1x HUBEXEC hw task per COG. This makes total sense. By utilising an unused slot to fetch the next (wide) 8*Longs, the cog would effectively be using an earlier slot, if possible, and then giving up its own slot in return, because most likely it would then not be required. So the slot would just be traded, but performance would be boosted. It's a no-brainer win-win performance boost, but SETSLOT "Use Any" would need to be set on.

    As for the BIG (AUGS & AUGD), AFAIK Chip intends this to be able to modify any following instruction's immediate #S or #D .

    While I appreciate the benefits of DJNZ etc being able to do relative, I just don't know if this is a risk or not, and, if the extra benefits are worth it. Seems that jumping of any kind within the current 32 Longs for multi-looping in hubexec mode would find those instructions within the cache - what a boost that will give!!!
    Yes, the DJNZ/etc will need to be prefixed by a BIG (AUGS) instruction, but hey, we cannot have everything.

    I think Chip is thinking of seeing if 4x AUX will fit to get the 8*longs working. I don't expect this to be possible to fit. Just hoping that maybe the Aux could be laid out again with small repeating 8*long blocks, and that may result in a total set of 512*longs with 2+ blocks being used as the Instruction Cache and possibly 1 as the RD/WRWIDE cache. Then, if they are not used, maybe they could be reused as AUX, but this is not important in the overall picture.

    This would then map as...
    $000-1F6: AUX
    $1F7: RD/WD WIDE Cache
    $1F8-1FF: Instruction Cache

    For a larger 64*Long Instruction Cache, it could map as...
    $000-1EE: AUX
    $1EF: RD/WD WIDE Cache
    $1F0-1FF: Instruction Cache

    David, I am not sure I follow all your pointer stuff in GCC, but I don't have the time atm to wrap my head around it.
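The BIG (AUGS/AUGD) prefix mentioned in this post, which extends the immediate #S or #D of the following instruction, can be sketched as simple bit packing. The split used here is an assumption for illustration: the prefix is taken to carry the upper 23 bits and the instruction's own 9-bit field the lower 9 bits, giving a 32-bit immediate. The function names are hypothetical.

```python
def augmented_immediate(aug_bits, s_field):
    """Combine an assumed 23-bit prefix value with a 9-bit S field into 32 bits."""
    return ((aug_bits & 0x7FFFFF) << 9) | (s_field & 0x1FF)


def split_immediate(value):
    """Split a 32-bit immediate into (prefix bits, 9-bit field) for an assembler."""
    value &= 0xFFFFFFFF
    return value >> 9, value & 0x1FF
```

With this layout a prefixed DJNZ would occupy two longs (the prefix plus the DJNZ itself), which matches the two-long figure quoted earlier in the thread.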
  • Cluso99 Posts: 18,069
    edited 2013-12-07 15:39
    David Betz wrote: »
    jmg wrote: »
    It seems a little risky making an Opcode flip how it behaves, based on a RAM bit ?
    I agree completely! I guess the COG mode DJNZ could be relative as well. It would make it harder to use with MOVS though since you'd have to compute the relative address rather than just stuffing in the value of a label.
    David: BEWARE - MOVS/MOVD/MOVI will not work in HUBEXEC mode (see my caveat above)!!! This is because the instructions may already be in the Instruction Cache, and cache coherency is most likely too complex to entertain.
  • David Betz Posts: 14,511
    edited 2013-12-07 15:51
    Cluso99 wrote: »
    David: BEWARE - MOVS/MOVD/MOVI will not work in HUBEXEC mode (see my caveat above)!!! This is because the instructions may already be in the Instruction Cache, and cache coherency is most likely too complex to entertain.
    Yes, I know that MOVS/MOVD/MOVI will not work in hub mode but they will work in COG mode and if you want DJNZ to be consistent between hub and COG mode then you'll probably want to change it to relative in COG mode as well. This will make MOVS/MOVD less useful since they will require relative addresses. I guess if you don't think it's a problem to have DJNZ be relative in one mode and absolute in the other then my comment is irrelevant. I just think that will be confusing.
  • jazzed Posts: 11,803
    edited 2013-12-07 15:52
    Cluso99 wrote: »
    David: BEWARE - MOVS/MOVD/MOVI will not work in HUBEXEC mode (see my caveat above)!!! This is because the instructions may already be in the Instruction Cache, and cache coherency is most likely too complex to entertain.
    That's right. It should not be necessary to do self modifying code to fetch and execute from HUB. For example, LMM is Harvard Architecture where code and data are separate with the interpreter running in a COG.
  • Cluso99 Posts: 18,069
    edited 2013-12-07 16:14
    David Betz wrote: »
    Yes, I know that MOVS/MOVD/MOVI will not work in hub mode but they will work in COG mode and if you want DJNZ to be consistent between hub and COG mode then you'll probably want to change it to relative in COG mode as well. This will make MOVS/MOVD less useful since they will require relative addresses. I guess if you don't think it's a problem to have DJNZ be relative in one mode and absolute in the other then my comment is irrelevant. I just think that will be confusing.
    Yes, I agree it will be confusing. Perhaps my explanation was wanting - I would rather DJNZ not be relative in Hubexec mode either (keep it consistent).
  • rogloh Posts: 5,236
    edited 2013-12-07 16:22
    Just wanted to mention something from a bit earlier, when there were discussions going on about the number of tasks that could run in hub exec mode on a single COG. Even if you have 4 different PCs (one per task) you would still need 4 stack pointers as well, so that when the task context switched you would get your correct SP. That is beginning to get trickier to manage, especially if some software uses PTRA, some PTRB, SPA, SPB etc for their stacks. Limiting to a single hub exec task per COG seems okay to me, however it would be nice to be able to mix it with normal COG tasks. You could even then write your own fine grained scheduler as one special COG task that runs, say, every 16 cycles and can choose to switch out the hub exec PC and SP when some elapsed time has passed. That approach could also effectively allow multiple hub exec tasks per COG.

    As to the self modifying code aspect in hub exec mode with MOVS, MOVI and MOVD etc, I would suspect it is unlikely that high level language compilers like GCC, for example, would ever try to do anything like that on instruction code they generated. I am talking about modifying code, not data. But David may know more about that. However, a user might try it manually, so it is certainly important to recognize/document this limitation.
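The fine grained scheduler idea in this post (one COG task that wakes periodically and swaps the hub exec PC and SP once a time slice has elapsed) might look roughly like the following model. Everything here is hypothetical: the class names, the round-robin policy, and the way the live PC and SP are passed in are assumptions for illustration, and a real version would presumably also need to save the Z and C flags.

```python
from dataclasses import dataclass


@dataclass
class HubContext:
    """Saved state of one software hub exec task (flags omitted for brevity)."""
    name: str
    pc: int
    sp: int


class HubExecScheduler:
    """Round-robin swapper for the single hardware hub exec PC/SP pair."""

    def __init__(self, contexts, time_slice):
        self.ready = list(contexts)
        self.current = self.ready.pop(0)   # first context starts out running
        self.time_slice = time_slice
        self.slice_start = 0

    def tick(self, now, live_pc, live_sp):
        """Called periodically (e.g. every 16 cycles); returns the (pc, sp) to run next."""
        if now - self.slice_start < self.time_slice or not self.ready:
            return live_pc, live_sp        # slice not used up, or nothing else to run
        # Save the outgoing task's live registers and queue it at the back.
        self.current.pc, self.current.sp = live_pc, live_sp
        self.ready.append(self.current)
        # Restore the next task's saved registers.
        self.current = self.ready.pop(0)
        self.slice_start = now
        return self.current.pc, self.current.sp
```

Because each context carries its own SP, this sidesteps the need for four hardware stack pointers, at the cost of the swap work being done in software.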
  • cgracey Posts: 14,133
    edited 2013-12-07 16:28
    Cluso99 wrote: »
    David: BEWARE - MOVS/MOVD/MOVI will not work in HUBEXEC mode (see my caveat above)!!! This is because the instructions may already be in the Instruction Cache, and cache coherency is most likely too complex to entertain.

    ...And the instruction cache is not mapped to cog RAM.
  • cgracey Posts: 14,133
    edited 2013-12-07 16:37
    rogloh wrote: »
    ...Even if you have 4 different PCs (one per task) you would still need 4 stack pointers as well so when the task context switched you would get your correct SP. That is beginning to get trickier to manage, especially if some software use PTRA, some PTRB, SPA, SPB etc for their stacks...

    Oh, boy! That didn't occur to me. Maybe I should leave the PC's at 9 bits and have a special one-off 16-bit PC that can be assigned to whatever task wants to execute from the hub. That way, the 4 instruction cache lines will always work well, and not get spread too thin. I think that's what I'll do.

    Thanks for thinking about these things, Guys.