Hub Execution Model Thread (split from blog) - Page 7 — Parallax Forums

  • Cluso99Cluso99 Posts: 18,069
    edited 2013-12-05 03:01
    ozpropdev wrote: »
    A small price to pay for a HUGE feature! :)
    Absolutely!
  • David BetzDavid Betz Posts: 14,511
    edited 2013-12-05 03:42
    Cluso99 wrote: »
    Great news. That is clever just reversing the order of the two instructions. And working for #D as well. WTG Chip!
    It doesn't work for multi-tasking? (That's fine I think)

    Looks like we will get that "HUBEXEC" (execute in place) model working! What a performance boost over LMM!
    It seems like it would work with multi-tasking if you put the BIG or AUGI instruction before the instruction that it augments and added a 23 bit register for each thread. Is that too much added logic?
  • ozpropdevozpropdev Posts: 2,792
    edited 2013-12-05 03:50
    David Betz wrote: »
    It seems like it would work with multi-tasking if you put the BIG or AUGI instruction before the instruction that it augments and added a 23 bit register for each thread. Is that too much added logic?

    Interesting idea David!

    Edit: We only have 1 x SETWIDE facility, like the old SETQUAD, difficult to share amongst tasks?
  • David BetzDavid Betz Posts: 14,511
    edited 2013-12-05 05:10
    ozpropdev wrote: »
    Interesting idea David!

    Edit: We only have 1 x SETWIDE facility, like the old SETQUAD, difficult to share amongst tasks?
    That's true. The "execute from hub" feature will probably only work on a single thread. At the very least, the 8-long window into hub memory will get thrashed if instructions get fetched from a different area of hub memory on every cycle. Maybe this also suggests that the BIG/AUGI instruction doesn't need to support multiple threads since it is mostly useful in "execute from hub" mode.
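The thrashing concern above can be illustrated with a toy model. This is only a sketch: it assumes a single 8-long window that is refilled whenever a fetch falls outside it, and four tasks interleaved one instruction at a time. The function names and hub addresses are hypothetical, and real hardware timing is not modelled.

```python
LINE_LONGS = 8  # the thread describes an 8-long window into hub memory


def count_line_fills(fetch_addresses):
    """Count how often a single 8-long window must be reloaded.

    fetch_addresses is a sequence of hub long addresses in fetch order.
    """
    fills = 0
    current_line = None
    for addr in fetch_addresses:
        line = addr // LINE_LONGS
        if line != current_line:  # fetch falls outside the window
            fills += 1
            current_line = line
    return fills


def straight_line(start, count):
    """Addresses of `count` consecutive instructions starting at `start`."""
    return [start + i for i in range(count)]


def interleave(*streams):
    """Round-robin the streams, one fetch per task per turn."""
    return [addr for group in zip(*streams) for addr in group]


# One task running 64 consecutive instructions.
single = straight_line(0, 64)

# Four tasks (hypothetical base addresses), 16 instructions each.
four_tasks = interleave(*(straight_line(base, 16) for base in (0, 1000, 2000, 3000)))

single_fills = count_line_fills(single)
multi_fills = count_line_fills(four_tasks)
print(single_fills, multi_fills)
```

Under these assumptions the single task needs 8 window fills for 64 instructions, while the four interleaved tasks need a fill on every one of their 64 fetches.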
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 06:38
    Warning: BIG wrapup answer :)

    Hi Ray,
    Cluso99 wrote: »
    I have again re-read this thread.
    I am still not sure of the requirement regarding HJMP, HCALL and HRET, and how they get used.

    They are equivalents of FJMP, FCALL and FRET that do not need to be interpreted; HJMP/HCALL embed a hub address capable of reaching the full 256KB.
    Cluso99 wrote: »
    I presume you do not need to save/restore the Z/C flags with these instructions?

    Correct!
    Cluso99 wrote: »
    Could we simplify this whole thing a bit, and disregard multi-tasking for this mode of operation? Might simplify it quite a bit for Chip, etc.

    Sorry, I did not provide enough detail. I never thought it would be available for cog tasks as it would need four cache lines (one per task) which would actually be better used for a single task; also pthreads would work nicely in this mode.
    Cluso99 wrote: »
    Does the mapping/windowing of AUX into COG help if you could map larger blocks into COG?

    Does not apply.

    What would help would be to have it in a fixed region. I strongly recommend $1E0-$1E7, as that would allow clever compiler tricks like re-use of BIG values, since the code generator could access the eight-long window as regular cog registers when needed.
    Cluso99 wrote: »
    Extending the above HUBEXEC (named by Bill) model (replaces LMM model)...?

    No need for an LMM loop, Chip already expressed that the 8-long window would auto-increment to the next 8-long block (unless there was an explicit HJMP/HCALL somewhere else)

    No need for REPS loop etc, Chip will put the fetch phase into the hardware. Automagic!

    It really turns this into directly executing from the hub - the "Holy Grail".
    Cluso99 wrote: »
    I have asked Chip if it were possible to
    (1) Make the RDWIDE instruction capable of delivering up to a count of 32 x 8*Long reads into AUX in the background with a tiny state machine

    (2) If it would be possible to map up to the whole 32 x 8*Long AUX registers into COG ram

    In the other thread, Chip liked the 8-long to/from AUX idea; but not for hubexec mode.

    Chip intended on replacing mapping the quads into the cog with mapping the octal window.

    After a lot of thought, I believe the 8-long cache should always be at $1E0, as that will allow assembly language and compiler tricks to reduce code size.
    Cluso99 wrote: »
    By mapping a large Aux block into Cog, a good set of hub instructions could be executed inline at a time, and possibly small loops could be contained
    within those blocks read, giving an enormous boost to performance.

    Small REPx loops fit in 8 longs (see post #3), but must stay within the block; it is too difficult for P2 to span multiple 8-long blocks (that would need multiple cache lines, too big a change for P2).

    For bigger loops, use a small reps loop (that fits) to load a big block of cog pasm code in the space $000-$1DF, execute it, and have it return to hubexec mode.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 06:45
    Hi Chip!
    cgracey wrote: »
    I got rid of the SETPIX0/1/2/3 instructions and made a new SETPIXW instruction that loads all eight PIX terms from the WIDE registers, all at once. So, there are four 'D/#,S/#' opcodes available now.

    I've loosely read this thread and I understand that you are looking for some opcode space for HJMP/HCALL/...

    EXCELLENT news!

    For a 256KB hub, I can encode HJMP/HCALL/HCALLA/HCALLB into a single D/#,S/# opcode, and HRET fits in any available single argument op space. I'll update post#1 after breakfast.

    Since I don't know the opcode bit pattern, I'll just use TTTTTTT, which can be filled with the exact freed pattern later :)
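The claim that a 256KB hub address fits in one D/#,S/# opcode can be checked with simple arithmetic. The sketch below assumes long-aligned targets and 9-bit D and S fields; the packing it uses (two spare bits selecting one of the four instructions) is purely an illustrative assumption, not the encoding proposed in post #1, and `pack`/`unpack` are hypothetical helper names.

```python
HUB_BYTES = 256 * 1024  # hub size discussed in the thread
LONG_BYTES = 4
FIELD_BITS = 9          # D and S fields are 9 bits each

HUB_LONGS = HUB_BYTES // LONG_BYTES          # number of long addresses
ADDRESS_BITS = (HUB_LONGS - 1).bit_length()  # bits needed for a long address
SPARE_BITS = 2 * FIELD_BITS - ADDRESS_BITS   # bits left over in D+S

# Assumption: the spare bits select the variant. Not confirmed by the thread.
VARIANTS = ("HJMP", "HCALL", "HCALLA", "HCALLB")


def pack(variant, byte_address):
    """Pack a variant and a long-aligned hub byte address into 18 bits."""
    if byte_address % LONG_BYTES or not 0 <= byte_address < HUB_BYTES:
        raise ValueError("address must be long-aligned and inside the hub")
    return (VARIANTS.index(variant) << ADDRESS_BITS) | (byte_address // LONG_BYTES)


def unpack(field18):
    """Recover (variant, byte_address) from the packed 18-bit field."""
    variant = VARIANTS[field18 >> ADDRESS_BITS]
    return variant, (field18 & (HUB_LONGS - 1)) * LONG_BYTES


print(ADDRESS_BITS, SPARE_BITS)
print(unpack(pack("HCALL", 0x1234 * 4)))
```

With these assumptions a long address needs 16 bits, which leaves 2 of the 18 D+S bits free, enough to distinguish four instructions.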
    cgracey wrote: »
    I also see there is talk about how to have a 32-bit constant in-line. About that: I think the idea has already been posited, but we could have a dummy instruction that doesn't do anything, though its 23 LSBs are free for data payload. Any instruction that has an immediate D or S, with priority going to S, can look for this dummy instruction in the next-lower stage of the pipeline. If it sees it, and it hasn't been cancelled as trailing branch code, it will use its 23 LSBs to augment the 9-bit immediate value it already has, giving it a full 32-bit immediate for D or S. This would solve the problem, would it not?

    Yes, it would solve the problem.

    I like it better than the prefix! More like other conventional multi-word instructions.

    If possible, allow the case of

    BIG val1
    BIG val2
    ..
    BIG valN

    as I have a somewhat vague idea for how to reduce memory requirements yet have fairly high performance jump tables for case statements; basically it involves letting code fall harmlessly through some addresses in order to save code space (trading a single wasted cycle per label fallen through).
  • Heater.Heater. Posts: 21,230
    edited 2013-12-05 06:48
    Bill,
    It really turns this into directly executing from the hub - the "Holy Grail".
    I have barely been able to keep up with this debate. Is this really true? If so, it's fantastic! One of those "impossibilities" that happens in Parallaxia.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 06:48
    I don't care what it is called, as long as it is available :)

    It was a great suggestion by David, it will be incredibly useful in hubexec.
    cgracey wrote: »
    Super!

    It would be used like this:

    ADD reg,#bigconstant & $1FF
    BIG #bigconstant >> 9

    That would add bigconstant to reg.

    Instead of BIG, we should probably give it a name like AUGI for 'augment immediate'.

    Any instruction having an immediate S or D would look for AUGI behind it. If it sees it and it's not cancelled, it extends the immediate value right in the pipeline, before it gets to stage 4. This was a really clever idea you guys came up with, and it turns out that it can be done by using the registers already in the pipeline, so it's almost free!
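The split that Chip describes (low 9 bits in the instruction, upper 23 bits in the BIG/AUGI payload) can be sketched in a few lines. This models only the arithmetic, not the pipeline; `split_constant` and `augment` are hypothetical names.

```python
IMM_BITS = 9                     # immediate field of the augmented instruction
AUG_BITS = 23                    # payload carried by BIG/AUGI
IMM_MASK = (1 << IMM_BITS) - 1   # $1FF


def split_constant(value):
    """Split a 32-bit constant as in 'ADD reg,#c & $1FF' / 'BIG #c >> 9'."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("constant must fit in 32 bits")
    return value & IMM_MASK, value >> IMM_BITS


def augment(imm9, aug23):
    """Rebuild the 32-bit immediate from the 9-bit field and 23-bit payload."""
    return ((aug23 << IMM_BITS) | imm9) & 0xFFFFFFFF


low, high = split_constant(0x12345678)
print(hex(low), hex(high), hex(augment(low, high)))
```

Since 9 + 23 = 32, every 32-bit constant round-trips through the pair of instructions without loss.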
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 06:55
    Cluso99 wrote: »
    Great news. That is clever just reversing the order of the two instructions. And working for #D as well. WTG Chip!
    It doesn't work for multi-tasking? (That's fine I think)

    Looks like we will get that "HUBEXEC" (execute in place) model working! What a performance boost over LMM!

    It will feel better following the instruction; also that way it gets rid of the need for a state flip-flop.

    As the constant itself will be in the lowest 23 bits, I can still use it for some tricky optimizations :-)

    Yes - HUBEXEC will be much faster than LMM, and reduce code size. A win-win.

    For multi-tasking hubexec code, pthreads would work just fine (or similar cooperative time slicing); a "YIELD" cog subroutine could switch threads... heck, it would be possible to write some code that can be HCALLed to swap threads; it does not even have to be a cog subroutine.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 06:58
    Agreed.

    For multiple hubexec threads, we just use pthreads from C; for assembler, it should be fairly easy to write a "YIELD" hub-subroutine for cooperative multi-tasking.

    Hubexec for P3 could work for each task, but I really, really think it is too much for P2, as it would require four better caches to work well. Getting this hubexec will let us test it out, and make an even better version for P3.
    David Betz wrote: »
    That's true. The "execute from hub" feature will probably only work on a single thread. At the very least, the 8-long window into hub memory will get thrashed if instructions get fetched from a different area of hub memory on every cycle. Maybe this also suggests that the BIG/AUGI instruction doesn't need to support multiple threads since it is mostly useful in "execute from hub" mode.
  • potatoheadpotatohead Posts: 10,259
    edited 2013-12-05 07:47
    Excellent work. This feature is worth the time. Now we've got hardware LMM. And that takes care of the "super cog" idea we always center on from time to time. Any COG can be a super COG, running a larger program. Or, a whole pile of them. Sheesh.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 08:05
    Yes, it's true.

    Bye-Bye LMM (with 4:1 or 5:1 slow down without fcache)

    Hello HubExec, running at (prediction) ~90% of cog-only pasm! (closer to 99.9% with FCACHE/FLIB)
    Heater. wrote: »
    Bill,

    I have barely been able to keep up with this debate. Is this really true? If so, it's fantastic! One of those "impossibilities" that happens in Parallaxia.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 08:09
    Agreed on all counts! This was fun!

    Better yet, it also gets rid of the issues with running multiple LMM programs at once. All we need is a small relocating loader (or, on P3, segment registers).

    Also, see post#1 - this is clearly extendable to XMM on P3 with DDR2... it will be slower than hubexec, but transparent. Think XJMP / XCALL / XRET ... only the op code changes... maybe not even that, if the address is used to distinguish between HUB/XMM mode. But that discussion is for the future.
    potatohead wrote: »
    Excellent work. This feature is worth the time. Now we've got hardware LMM. And that takes care of the "super cog" idea we always center on from time to time. Any COG can be a super COG, running a larger program. Or, a whole pile of them. Sheesh.
  • jazzedjazzed Posts: 11,803
    edited 2013-12-05 08:11
    No need for an LMM loop, Chip already expressed that the 8-long window would auto-increment to the next 8-long block (unless there was an explicit HJMP/HCALL somewhere else)

    No need for REPS loop etc, Chip will put the fetch phase into the hardware. Automagic!

    It really turns this into directly executing from the hub - the "Holy Grail".
    Thanks for clearing that up!

    We're not looking for a repeat of LMM. Hopefully Chip can get rid of the need for it.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 08:17
    My pleasure.

    Think of all the things we can accomplish with large, almost cog speed, pasm, C, etc. code!

    LMM was great for the P1 - we did not have any other choice.

    Executing "directly" from the hub without the need for a fetch/execute loop is far better.
    jazzed wrote: »
    Thanks for clearing that up!

    We're not looking for a repeat of LMM. Hopefully Chip can get rid of the need for it.
  • jazzedjazzed Posts: 11,803
    edited 2013-12-05 08:43
    LMM was great for the P1 - we did not have any other choice.
    Indeed. It certainly has been useful.

    Executing "directly" from the hub without the need for a fetch/execute loop is far better.
    I'm afraid that we need a prototype to prove it all out. That's going to be tough for GCC though, since Chip's instruction overhaul makes a new back-end necessary.

    Can we make a smaller prototype to prove it works when it's available? I suppose an extension of PNut would be necessary ... seems like everything depends on Chip's availability.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 08:50
    jazzed wrote: »
    I'm afraid that we need a prototype to prove it all out. That's going to be tough for GCC though since the Chip's instruction overhaul makes a new back-end necessary.

    Can we make a smaller prototype to prove it works when it's available. I suppose an extension of PNut would be necessary ... seems like everything depends on Chip's availability.

    I suspect Chip will add it to the Verilog fairly soon, and we will get DE0-Nano and DE2-115 configuration files to try it out with quickly.

    The other instruction set changes make a back-end overhaul necessary anyway, and changing FJMP->HJMP, etc. should be very easy; different op-code and embedding the hub address right into the instruction (instead of the long following it).

    Basically (loose example)

    emit("FCALL")
    emit("long label")

    changes to

    emit("HCALL label")

    The rest is adding the few additional new instructions to gas, and fixing the address in ld.

    The good news is that as soon as PNut supports the new instructions, it will be easy to verify them on the FPGA before synthesis, and iron out any issues that may arise with them.
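The back-end change Bill describes can be sketched as a small emitter. This is a loose illustration only: `emit_call` and its `hubexec` flag are hypothetical, and the emitted strings simply mirror the "FCALL" / "long label" / "HCALL label" lines from the post rather than real PropGCC output.

```python
def emit_lmm_call(label):
    """LMM style: the call primitive, then a long holding the target."""
    return ["FCALL", f"long {label}"]


def emit_hubexec_call(label):
    """Hubexec style: the hub address is embedded in the instruction."""
    return [f"HCALL {label}"]


def emit_call(label, hubexec):
    """Pick the call sequence for the selected execution model."""
    return emit_hubexec_call(label) if hubexec else emit_lmm_call(label)


for mode in (False, True):
    lines = emit_call("_main", hubexec=mode)
    print(len(lines), lines)
```

In this sketch each call shrinks from two emitted lines to one, which is the code-size saving the post points to.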
  • David BetzDavid Betz Posts: 14,511
    edited 2013-12-05 08:52
    jazzed wrote: »
    I'm afraid that we need a prototype to prove it all out. That's going to be tough for GCC though since the Chip's instruction overhaul makes a new back-end necessary.
    Eric made some quick changes to the GCC backend to remove the use of WR/NR in the code generated by PropGCC so there might be hope of getting PropGCC working pretty soon after we get a final instruction set and FPGA configurations from Chip.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 08:54
    Excellent news!
    David Betz wrote: »
    Eric made some quick changes to the GCC backend to remove the use of WR/NR in the code generated by PropGCC so there might be hope of getting PropGCC working pretty soon after we get a final instruction set and FPGA configurations from Chip.
  • jazzedjazzed Posts: 11,803
    edited 2013-12-05 08:55
    David Betz wrote: »
    Eric made some quick changes to the GCC backend to remove the use of WR/NR in the code generated by PropGCC so there might be hope of getting PropGCC working pretty soon after we get a final instruction set and FPGA configurations from Chip.
    Great news! Thanks.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 10:09
    I changed post#2 to the latest proposed encodings, listing all hubexec instructions and BIG.

    http://forums.parallax.com/showthread.php/152079-Hub-Execution-Model-Thread-%28split-from-blog%29?p=1223971&viewfull=1#post1223971
  • David BetzDavid Betz Posts: 14,511
    edited 2013-12-05 10:26
    Instructions with embedded 23 bit constant

    Opcode encoding to be assigned by Chip.

    BIG #const23

    Suggested by David, as per Chip's or David's usage, allows extending 9 bit immediate constants to instructions to a full 32 bits.

    It may be useful to allocate $1F1 as the "BIG" value register, and store the created 32 bit constant in it, so subsequent instructions can use it.

    Example:

    RDLONG reg,#const32 ' assembler replaces with RDLONG / BIG pair as per David's suggestion

    mul reg, #5

    add reg,3

    WRLONG reg, $1F1 ' saves one long as address already computed in 'big' register

    Such code is VERY common, so the potential for savings is significant.
    I don't think the WRLONG will work since the BIG register will have been cleared by then. It has to be cleared immediately after use or it will mess up all following instructions that have an S field.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 10:50
    David Betz wrote: »
    I don't think the WRLONG will work since the BIG register will have been cleared by then. It has to be cleared immediately after use or it will mess up all following instructions that have an S field.

    Good point, however if Chip puts in a flip-flop that is cleared after use... or it follows the instruction referencing it like Chip suggested... then it would work, and save some memory.

    It is in Chip's hands.
  • David BetzDavid Betz Posts: 14,511
    edited 2013-12-05 10:52
    Good point, however if Chip puts in a flip-flop that is cleared after use... or it follows the instruction referencing it like Chip suggested... then it would work, and save some memory.

    It is in Chip's hands.
    How would it work? The bit would be cleared right after the RDLONG instruction so the BIG register would contain zero by the time you reached the WRLONG. If it didn't get cleared then the MUL and ADD instructions would have their S fields modified and wouldn't work as expected. I don't see how this will work no matter how Chip implements BIG.
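David's objection can be made concrete with a toy model of the prefix style. The sketch assumes one register that both supplies the upper 23 bits to the next immediate instruction and is what a later read of the "BIG register" would see; that single-register behaviour, the function name and the sample values are assumptions for illustration.

```python
def run_prefix_style(program, clear_after_use):
    """Run (op, operand) pairs; return (effective immediates, final BIG value).

    'BIG' loads the pending upper bits; every other op is treated as an
    instruction with a 9-bit immediate S that picks up the pending bits.
    """
    pending = 0
    seen = []
    for op, operand in program:
        if op == "BIG":
            pending = operand
        else:
            seen.append((op, (pending << 9) | operand))
            if clear_after_use:
                pending = 0  # consumed, so later immediates are untouched
    return seen, pending


program = [("BIG", 0x123), ("RDLONG", 0x044), ("MUL", 5)]

cleared, big_after_clear = run_prefix_style(program, clear_after_use=True)
sticky, big_if_sticky = run_prefix_style(program, clear_after_use=False)
print(cleared, big_after_clear)
print(sticky, big_if_sticky)
```

With clearing, MUL sees 5 but the register is zero by the time a later WRLONG would read it; without clearing, the register survives but MUL's immediate is silently widened. Under this model neither policy gives both behaviours at once.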
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 10:57
    With the flip-flop, or Chip's "follow the instruction" method, there is no need to clear $1F1 (big copy) any more.

    I am eating, will post example shortly
    David Betz wrote: »
    How would it work? The bit would be cleared right after the RDLONG instruction so the BIG register would contain zero by the time you reached the WRLONG. If it didn't get cleared then the MUL and ADD instructions would have their S fields modified and wouldn't work as expected. I don't see how this will work no matter how Chip implements BIG.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 11:13
    David Betz wrote: »
    How would it work? The bit would be cleared right after the RDLONG instruction so the BIG register would contain zero by the time you reached the WRLONG. If it didn't get cleared then the MUL and ADD instructions would have their S fields modified and wouldn't work as expected. I don't see how this will work no matter how Chip implements BIG.

    Ok, brunch eaten :)

    Your prefix style, need flipflop cleared after BIG is consumed:
    BIG #highbits ' sets high bits
    MOV  reg,#lowbits ' constructed 32 bit value is visible at $1F1
    
    BIG #highaddressbits ' sets high bits, no need to clear as low bits moved into low 9 bits of BIG like MOVS/SETS
    RDLONG reg,#lowbits '  constructed 32 bit value is visible at $1F1
    
    mul reg, #5
    
    add reg,3
    
    WRLONG reg, $1F1 ' saves one long as address already fully assembled in 'big' register
    

    Chip's suffix style, no need for flipflop, finds 23 bit value in next pipeline stage
    MOV  reg,#lowbits 
    BIG #highbits ' MOV notices following 23 bits, incorporates into move
    
    RDLONG reg,#lowbits ' picks up high bits from next pipe slot
    BIG #highaddressbits ' no need to clear, visible at $1F1
    
    mul reg, #5
    
    add reg,3
    
    WRLONG reg, $1F1 ' saves one long as address already fully assembled in 'big' register
    

    My preference is for BIG to supply the low 23 bits, and the direct argument the high 9 bits - for direct address compatibility.
  • David BetzDavid Betz Posts: 14,511
    edited 2013-12-05 11:16
    Ok, brunch eaten :)

    Your prefix style, need flipflop cleared after BIG is consumed:
    BIG #highbits ' sets high bits
    MOV  reg,#lowbits ' constructed 32 bit value is visible at $1F1
    
    BIG #highaddressbits ' sets high bits, no need to clear as low bits moved into low 9 bits of BIG like MOVS/SETS
    RDLONG reg,#lowbits '  constructed 32 bit value is visible at $1F1
    
    mul reg, #5
    
    add reg,3
    
    WRLONG reg, $1F1 ' saves one long as address already fully assembled in 'big' register
    

    Chip's suffix style, no need for flipflop, finds 23 bit value in next pipeline stage
    MOV  reg,#lowbits 
    BIG #highbits ' MOV notices following 23 bits, incorporates into move
    
    RDLONG reg,#lowbits ' picks up high bits from next pipe slot
    BIG #highaddressbits ' no need to clear, visible at $1F1
    
    mul reg, #5
    
    add reg,3
    
    WRLONG reg, $1F1 ' saves one long as address already fully assembled in 'big' register
    

    My preference is for BIG to supply the low 23 bits, and the direct argument the high 9 bits - for direct address compatibility.
    I see what you're doing but are you sure you want to waste another COG location with a visible BIG register? Also, I've already said why I think the low bits should be the 9 bits from the modified instruction. You haven't yet provided an example showing how having the BIG instruction supply the low bits would be useful and I think it will be more complicated to implement in hardware.
  • jmgjmg Posts: 15,155
    edited 2013-12-05 11:17
    cgracey wrote: »
    It would be used like this:

    ADD reg,#bigconstant & $1FF
    BIG #bigconstant >> 9

    That would add bigconstant to reg.

    Instead of BIG, we should probably give it a name like AUGI for 'augment immediate'.

    Any instruction having an immediate S or D would look for AUGI behind it. If it sees it and it's not cancelled, it extends the immediate value right in the pipeline, before it gets to stage 4.

    I would make a larger leap on the basis of Assembler Clarity.
    ( no change to the binary action, just to what the user 'sees' )

    ie If the above opcodes work

    ADD reg,#bigconstant & $1FF
    BIG #bigconstant >> 9

    but what that finally does is 'add bigconstant to reg', then it makes more sense to be able to write this in one ASM line

    ADDI32 reg,#big32constant // does what it says

    Now the assembler creates two 32 bit values, so you have a 2 word opcode.

    if you do want to also support the more obtuse dual opcode in ASM then I'd use EXTend Immediate 32

    ADD reg,#bigconstant & $1FF
    EXTI32 #bigconstant >> 9

    If that second opcode is context dependent on Any instruction having an immediate S or D, then the Assembler should check that, and give an error. ( another reason for the simpler, clearer one line syntax )

    A smart assembler could even support this as well

    ADD reg,#AnyConstant

    and spawn one of two opcode sets (just like many ASMs now do automatically with JMP/CALL)
    The LIST file should make it clear when 32 bit promotion occurred.
  • Bill HenningBill Henning Posts: 6,445
    edited 2013-12-05 11:20
    jmg, excellent, I totally agree.

    Macros could be written so that if #AnyConstant <= 511 they use a single instruction, and otherwise add the following EXTI ...
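The automatic promotion jmg and Bill describe might look like the sketch below. It assumes Chip's suffix ordering (the extension follows the instruction) and uses jmg's EXTI32 name as the default; `assemble_immediate` is a hypothetical helper, not a PNut feature. A 9-bit immediate holds 0 through 511, so 511 itself still fits in one instruction.

```python
IMM_MAX = 0x1FF  # largest value a 9-bit immediate can hold (511)


def assemble_immediate(mnemonic, reg, constant, ext_name="EXTI32"):
    """Emit one line for small constants, two when promotion is needed."""
    if not 0 <= constant <= 0xFFFFFFFF:
        raise ValueError("constant must fit in 32 bits")
    if constant <= IMM_MAX:
        return [f"{mnemonic} {reg},#${constant:X}"]
    return [
        f"{mnemonic} {reg},#${constant & IMM_MAX:X}",  # low 9 bits
        f"{ext_name} #${constant >> 9:X}",             # upper 23 bits
    ]


for value in (5, 0x1FF, 0x200, 0x12345678):
    print(assemble_immediate("ADD", "reg", value))
```

A listing file could flag the two-line case so the user can see where 32-bit promotion occurred, as jmg suggests.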
  • jmgjmg Posts: 15,155
    edited 2013-12-05 11:22
    My preference is for BIG to supply the low 23 bits, and the direct argument the high 9 bits - for direct address compatibility.

    See my example above.
    Once a binary path exists, this now really moves into how the Assembler manages what the user wants.

    Clarity, and freedom from context errors should become important in how the Assembler supports this new opcode set.

    Edit: Hehe snap - you can read and comprehend faster than I can type.. :)