Hub Execution Model Thread (split from blog)
Bill Henning
HUB EXECUTION MODEL:
Note - these instructions have normal conditional execution bits :-)
Any cog mode jump or cog mode call outside of the OCTL window exits hub execution mode, with PTRA pointing to next hub instruction
This would make the P2 very competitive with (actually, thanks to its 8 cores, it would totally outclass) ARM chips without hardware floating point that run at up to 160MHz
It would also save Parallax the development cost of a quad-long based VLIW style GCC port (at a guess, about $250K)
It would be useful to standardize on the 8-long cache being mapped to $1E0, as then instructions within the cache could refer to constants embedded in the code at known locations.
HJMP D/#addr
TTTTTTT ZC I CCCC AAAAAAAAA AAAAAAAAA
Enters hub-exec mode if in cog mode
If immediate address, jumps to AAAAAAAAAAAAAAAAAA00 (sets ptra to scaled address, fetches OCTL, jumps to first instruction in octl)
If not immediate address, jumps to address in D (sets ptra to D, fetches OCTL, jumps to first instruction in octl)
Setting C and Z does not make sense for HJMP, so those bits could be used as additional address bits, or to select relative jumps
C could indicate add AAAAAAAAAAAAAAAAAA00 to PTRA (forward relative jump)
Z could indicate subtract AAAAAAAAAAAAAAAAAA00 from PTRA (backward relative jump)
Relative jumps would be helpful for position independent code.
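As a rough illustration of the address scaling and the proposed C/Z relative-jump idea, here is a minimal Python sketch. The function name `hjmp_target`, the 20-bit wrap-around mask, and the treatment of "both C and Z set" as an error are assumptions made for this example only; the proposal above does not define them.

```python
# Sketch of how an immediate HJMP might compute the new PTRA.
# Assumption: the 18-bit field is scaled by 4, giving a 20-bit byte address,
# and relative results wrap within that 20-bit space.
HUB_ADDR_MASK = (1 << 20) - 1

def hjmp_target(ptra, field18, c=False, z=False):
    """Return the new PTRA for an immediate HJMP (hypothetical helper)."""
    if not 0 <= field18 < (1 << 18):
        raise ValueError("address field must fit in 18 bits")
    if c and z:
        # The post does not say what C and Z together would mean.
        raise ValueError("C and Z together are not defined by the proposal")
    offset = field18 << 2          # AAAAAAAAAAAAAAAAAA00
    if c:                          # forward relative jump
        return (ptra + offset) & HUB_ADDR_MASK
    if z:                          # backward relative jump
        return (ptra - offset) & HUB_ADDR_MASK
    return offset                  # absolute jump
```

With PTRA at $1000 and a field value of $10, the absolute form lands on $40, the C form on $1040 and the Z form on $FC0.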
HCALL D/#addr
TTTTTTT ZC I CCCC AAAAAAAAA AAAAAAAAA
AUX = ++PTRA
Saves next hub instruction address value onto the AUX stack using --SPA, then jumps to the specified address
It would also be very desirable to be able to enter hub-exec mode with HCALL, as then cog-only mode would be able to execute library code from the hub
This would largely eliminate the cog memory limitation; note all cogs could share hub subroutine libraries.
If immediate address, jumps to AAAAAAAAAAAAAAAAAA00 (sets ptra to scaled address, fetches OCTL, jumps to first instruction in octl)
If not immediate address, jumps to address in D (sets ptra to D, fetches OCTL, jumps to first instruction in octl)
WC could be applied to set a flag in case stack wraps around
WZ could be applied to set a flag if there is a stack collision with SPB
HRET {#offset}
TTTTTTT ZC I CCCC AAAAAAAAA AAAAAAAAA
PTRA = AUX[SPA++] + offset
execute instruction in cog memory right after the HJMP that entered hub-exec mode
It would be highly desirable that if hub code was invoked with HCALL, that the HRET would go back to cog execution mode - see explanation in HCALL
Offset is scaled by 4, normally 0, but could be used to pop up several levels - think exceptions; of course SUBSPA #offset would do the same
WC could be applied to set a flag in case stack wraps around
WZ could be applied to set a flag if there is a stack collision with SPB
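The HCALL / HRET pairing on the AUX stack can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the class name `AuxStack`, the 256-long stack size, the use of byte addresses (so "next instruction" is PTRA + 4), and the literal reading of `PTRA = AUX[SPA++] + offset` with the offset scaled by 4 are all choices made for this example.

```python
class AuxStack:
    """Toy model of the AUX return stack addressed through SPA (hypothetical)."""

    def __init__(self, size=256):      # assumption: 256 longs of AUX memory
        self.size = size
        self.mem = [0] * size
        self.spa = 0

    def hcall(self, ptra, target):
        """Push the address of the next hub instruction using --SPA, then jump."""
        return_addr = ptra + 4         # assumption: PTRA is a byte address
        self.spa = (self.spa - 1) % self.size
        self.mem[self.spa] = return_addr
        return target                  # new PTRA

    def hret(self, offset=0):
        """Pop with SPA++ and return AUX[SPA] + offset*4 as the new PTRA."""
        return_addr = self.mem[self.spa]
        self.spa = (self.spa + 1) % self.size
        return return_addr + offset * 4
```

A call from $100 to $400 followed by a plain HRET resumes at $104 and leaves SPA where it started. The wrap-around in `--SPA` is the case the suggested WC flag would report.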
Loading Constants
RDLONG reg,ptra++
long constant
Loading Variables
Assuming the cache is visible at $1E0
1e0: rdlong reg,$1E1 ' any two consecutive longs of the eight in the cache visible at $1E0, could be word, byte
1e1: long 23 bit address
Saving Variables
1e0: wrlong reg,$1E1 ' any two consecutive longs of the eight in the cache visible at $1E0, could be word, byte
1e1: long 23 bit address
Limitations
- REPxxx loops must fit in the 8 long OCTL cache
- DJNZ and friends must fit in the 8 long window
- any type of jump/loop or call that is not HJMP / HCALL / HRET exits hub execution mode
- RDxxxxC and WRxxxxC instructions must not be used in hub execute model
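The first three limitations amount to a simple range check on the 8-long window. The sketch below assumes the cache is mapped at $1E0 as suggested earlier; the helper names `in_window` and `loop_fits` are made up for this illustration.

```python
CACHE_BASE = 0x1E0    # assumption: 8-long cache mapped at $1E0
CACHE_LONGS = 8

def in_window(cog_addr):
    """True if a cog-mode jump target stays inside the OCTL window."""
    return CACHE_BASE <= cog_addr < CACHE_BASE + CACHE_LONGS

def loop_fits(first_slot, length):
    """True if a REPxxx/DJNZ loop of `length` longs starting at window
    slot `first_slot` (0..7) lies entirely inside the window."""
    return first_slot >= 0 and length > 0 and first_slot + length <= CACHE_LONGS
```

A jump to $1E7 stays in hub execution mode under this model, while a jump to $1E8 or $1DF would exit it; a six-long loop fits if it starts at slot 2 but not at slot 3.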
Possible Improvements
- it should be possible to support calling cog subroutines from hub execution mode using JMPRET, as long as they can return to hub execution mode
- adding a CSEG register that is added to all HJMP/HCALL addresses would eliminate the need for relative jumps
- adding a DSEG register for non-HJMP/HCALL/HRET hub references would also allow relocatable data
- it would be relatively easy to write two cog subroutines for HCALLH and HRETH that would use a hub based stack via PTRB for code that needed a large stack
- in hub stack mode, stack variables can be referenced with indexes of PTRB
- by writing a small relocating loader, it would be possible to support multiple HUBEXEC C programs at the same time, running in different cogs
Folks, with this the P2 is no longer just a microcontroller - it is also a full fledged microprocessor!
HISTORY:
- with the new process, the DAC bus would no longer fit
- removing the DAC bus allowed Chip to increase the hub to 256KB
- increasing the hub to 256KB made Chip think of RDOCTL/WROCTL
- jazzed suggested trying to run directly out of the hub
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223354&viewfull=1#post1223354
- Chip tried not to think of executing from the hub
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223807&viewfull=1#post1223807
- I could not help thinking about it, as LMM came about from wanting to run code from the hub
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG/page175
http://forums.parallax.com/showthread.php/89640-ANNOUNCING-Large-memory-model-for-Propeller-assembly-language-programs!
- Chip started thinking about it... including auto-loading sequential 8-long thunks
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223818&viewfull=1#post1223818
- initially, I considered 8-long thunks, VLIW style
- Chip suggested supporting relative jumps
- David asked if Chip successfully avoided thinking about executing from the hub, but Chip thought some more about it, and more
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG/page175
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223922&viewfull=1#post1223922
- the 8-long grain for hub jumps and calls bothered me, so I finally proposed HUBEXEC, HJMP, HCALL, HRET
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG/page179
The rest, shall we say, is history - read the thread starting at Chip's post above to see the great discussion that ensued!
The Bright Future
For the P3, using DDR2/3/4/+, this model could be extended to XEXEC - bringing in a small cache of longs, and executing in the same manner as the HUB EXEC model.
By adding CSEG and DSEG registers, and ideally an SSEG (for external stack) with limit registers on each... we could have essentially unlimited memory, and port Linux, as the segment/limit register pairs will effectively act like a per-cog MMU.
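To make the segment/limit idea concrete, here is a small Python sketch of base-plus-offset translation with a limit check. The class name `Segment` and the choice to raise an error on a limit violation are assumptions for illustration; the post does not specify what the hardware would do on an out-of-range access.

```python
class Segment:
    """Toy model of one segment/limit register pair (e.g. CSEG, DSEG or SSEG)."""

    def __init__(self, base, limit):
        self.base = base      # start of the segment in physical memory
        self.limit = limit    # size of the segment in bytes

    def translate(self, offset):
        """Relocate a program-relative offset, rejecting out-of-range accesses."""
        if not 0 <= offset < self.limit:
            # Assumption: a violation is reported as an error in this model.
            raise MemoryError("offset outside segment limit")
        return self.base + offset
```

Loading the same program at a different place then only requires changing `base`; the code keeps using the same offsets, which is what makes relocation trivial in this scheme.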
Comments
Link register versions for easy GCC support
AUX stack versions for Spin, other virtual machines, and other compilers for greater performance
Instructions with embedded 18 bit hub address (lowest 2 bits zero due to long alignment, and implied)
TTTTTTT ZC I CCCC jjAAAAAAAAA AAAAAAA
where jj select between HJMP / HCALL / HCALLA / HCALLB
TTTTTTT ZC I CCCC 00AAAAAAAAA AAAAAAA HJMP
TTTTTTT ZC I CCCC 01AAAAAAAAA AAAAAAA HCALL
TTTTTTT ZC I CCCC 10AAAAAAAAA AAAAAAA HCALLA
TTTTTTT ZC I CCCC 11AAAAAAAAA AAAAAAA HCALLB
TTTTTTT is the seven bit op code to be assigned by Chip
ZC,I,CCCC as normal P1/P2 usage
jj selects between the four hub-address instructions
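The field layout above can be sketched as a pair of Python encode/decode helpers. Everything about bit positions here is an assumption read off the left-to-right pattern (opcode in bits 31..25, ZC in 24..23, I in bit 22, CCCC in 21..18, jj in 17..16, address in 15..0). The opcode value is a parameter because TTTTTTT has not been assigned.

```python
# jj values as listed in the table above
JJ = {"HJMP": 0b00, "HCALL": 0b01, "HCALLA": 0b10, "HCALLB": 0b11}

def encode(mnemonic, hub_addr, opcode, zc=0, imm=1, cond=0b1111):
    """Pack a hub-address instruction into a 32-bit word (hypothetical layout)."""
    if hub_addr & 3:
        raise ValueError("hub address must be long aligned")
    if not 0 <= hub_addr < (1 << 18):
        raise ValueError("hub address must fit in 18 bits")
    if not 0 <= opcode < (1 << 7):
        raise ValueError("opcode must fit in 7 bits")
    a16 = hub_addr >> 2            # the two low zero bits are implied
    return ((opcode << 25) | (zc << 23) | (imm << 22) |
            (cond << 18) | (JJ[mnemonic] << 16) | a16)

def decode(word):
    """Return (mnemonic, hub_addr) from a word packed by encode()."""
    names = {value: name for name, value in JJ.items()}
    return names[(word >> 16) & 0b11], (word & 0xFFFF) << 2
```

Sixteen stored address bits plus the two implied zeros cover the full 256KB hub, which is why two bits are left over for jj.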
HJMP D/#addr
TTTTTTT ZC I CCCC 00AAAAAAAAA AAAAAAA HJMP
Enters hub-exec mode if in cog mode
Exits hubexec mode and jumps to cog address if address < $1E0, or assign a unique op-code to HRET below instead of aliasing to HJMP
If in HUBEXEC mode
If immediate address, jumps to AAAAAAAAAAAAAAAAAA00 (sets ptra to scaled address, fetches OCTL, jumps to first instruction in octl)
If not immediate address, jumps to address in D (sets ptra to D, fetches OCTL, jumps to first instruction in octl)
Setting C and Z does not make sense for HJMP, so could be used as additional address bits, or relative jumps
C could indicate add address to the program counter (forward relative jump)
Z could indicate subtract address from the program counter (backward relative jump)
Relative jumps would be helpful for position independent code.
HCALL D/#addr
TTTTTTT ZC I CCCC 01AAAAAAAAA AAAAAAA HCALL
LR = ++PC
Saves next hub instruction address value into a link register, then PC = specified address
I strongly recommend using location $1F0 as the link register, that way compiled code can rely on its known location, and there would be no need for a 'SETLR reg' instruction or linker magic. KISS principle
If immediate address, jumps to AAAAAAAAAAAAAAAAAA00 (sets program counter to scaled address, fetches OCTL, jumps to first instruction in octl)
If not immediate address, jumps to address in D (sets program counter to D, fetches OCTL, jumps to first instruction in octl)
WC could be applied to set a flag in case stack wraps around
WZ could be applied to set a flag if there is a stack collision with SPB
It would also be very desirable to be able to enter hub-exec mode with HCALL, as then cog-only mode would be able to execute library code from the hub
This would largely eliminate the cog memory limitation; note all cogs could share hub subroutine libraries.
HCALLA D/#addr
TTTTTTT ZC I CCCC 10AAAAAAAAA AAAAAAA HCALLA
AUX = ++PC
Saves next hub instruction address value onto the AUX stack using --SPA, then PC = address
Only difference from HCALL is using a hardware stack instead of a link register
HCALLB D/#addr
TTTTTTT ZC I CCCC 11AAAAAAAAA AAAAAAA HCALLB
AUX = ++PC
Saves next hub instruction address value onto the AUX stack using --SPB, then PC = address
Only difference from HCALLA is using the SPB pointer
Zero or Single operand instructions (bit pattern to be assigned by Chip, a lot are available)
HRET
No opcode needed; exactly equivalent to
HJMP $1F0 ' proposed fixed Link register, can be used as regular general purpose register when not in hubexec mode
If desired, assign unique opcode that jumps to hub address in $1F0
HRETA
Op code to be assigned
PC = [SPA++]
HRETB
Op code to be assigned
PC = [SPB++]
NOTE:
allowing a direct 9 bit constant in the source field would allow cleaning up the stack by removing local variables and arguments.
Example:
HRETA #-9
SPA -= 9 before popping the return address
Instructions with embedded 23 bit constant
Opcode encoding to be assigned by Chip.
BIG #const23
Suggested by David, as per Chip's or David's usage; allows extending the 9 bit immediate constants in instructions to a full 32 bits.
It may be useful to allocate $1F1 as the "BIG" value register, and store the created 32 bit constant in it, so subsequent instructions can use it.
Example:
RDLONG reg,#const32 ' assembler replaces with RDLONG / BIG pair as per David's suggestion
mul reg, #5
add reg,3
WRLONG reg, $1F1 ' saves one long as address already computed in 'big' register
Such code is VERY common, so the potential for savings is significant.
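The arithmetic behind BIG is just a 23 + 9 bit split of a 32-bit constant. The Python sketch below assumes BIG carries the upper 23 bits and the following instruction's immediate field carries the lower 9; the post does not say which half goes where, and the helper names are invented for this example.

```python
def split_const32(value):
    """Split a 32-bit constant into (big23, imm9) as an assembler might."""
    value &= 0xFFFFFFFF
    return value >> 9, value & 0x1FF   # assumption: BIG holds the high 23 bits

def join_big(big23, imm9):
    """Rebuild the 32-bit constant from the BIG prefix and the 9-bit immediate."""
    return ((big23 & 0x7FFFFF) << 9) | (imm9 & 0x1FF)
```

Constants that already fit in 9 bits split into a zero prefix, so an assembler following this scheme would only need to emit BIG for larger values.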
Snippet #1: Load & Execute a big block of cog code (think FCACHE)
Snippet #2: Dump a big block of cog code (think debugger)
EDIT: Get rid of dedicated hub access windows, and use a smarter hub bus arbiter.
pthreads will work nicely!
http://forums.parallax.com/showthread.php/152070-To-cache-or-not-to-cache...-musings-on-improving-Spin-LMM-performance
The non-C RD/WR's don't clobber the cache.
For the sake of determinate timing (needed for hard real time) we need to keep hub windows.
Edit: Maybe that can be handled by saying that icache fill always wins the arbitration and stalls the pipeline until the cache line is read. That will also stall the RDxxxx/WRxxxx instruction.
Basically, it's hardware LMM with some helpers.
Seems to me, the conflict between HUB data read write and execute is best solved by simply returning to the COG to write the data, which is nicely provided for in the current instruction proposal. Or, employ another COG, or task in a COG to get this done in parallel with the HUB PASM program execution. It would look sort of like the math operations do.
So then, one needs to organize the COG for this to really be maximized.
One approach is COG library code, or MCU model.
I really like the idea of shared library code frankly.
In that model, one blasts through snippets much like a snippet in SPIN 2. The parallels here are beautiful. Call it, work gets done, return to COG for business as usual. The conflict here isn't that big of a deal.
The other one is the CPU strategy. The conflict here is a big deal as the majority of the time will be spent in HUBEXEC mode.
Put the WIDE register block somewhere. I like top of COG personally.
Setup a few PASM routines that can blast data out to the HUB, then jump to the HUB PASM program. All of the rest of COG LONGS are basically CPU working registers! So do the work, and when it makes sense to get data to the HUB, or to another COG via PORT D, return to the COG to do that.
For larger chunks of data, the penalty is less. Worst case, read, modify, write a single byte.
Additionally, a second COG could be watching PORT D for data. It writes to the HUB, leaving the COG running HUB PASM to carry on with some timing assumptions!
I just caught the other posts. If the non C ops can run, perhaps queued or with a stall, that's yet another option... This is unreal.
Whether Chip actually wants to implement it for P2 is an open question. That needs to be answered quickly BTW so we know what to expect. Personally I don't think this should be even considered as a P2 feature.
Still it's good to flush out what is actually needed for a day when it is seriously considered.
When I mentioned the idea here in paragraph 3 I wanted to dump the requirement for an LMM interpreter and all that it involves. The biggest win is speed, other wins are not needing to waste 2KB on an interpreter among others. Needing any kind of an interpreter for such a feature is a non-starter - that seems to be clear.
But for serious consideration one should look at the implemented LMM interpreters and all aspects of what they do. And anyone who has written one should speak up!
This means going from startup, to multi-cog execution, using ADD/SUB PC, all the FCALL services, and to any effect of ignoring attributes like fcache.
The other question is how would this work with external memory? Should it? I'm guessing that would still require an interpreter unless Chip can volunteer a way to make fetch-exec from external memory possible. I suspect it would be a big friggin headache.
for a large blast, a lot fits in 8 longs with REPS being usable within the 8-long block; so cog-hub, hub-cog, hub-hub copies should fit
otherwise we call a cog subroutine. FCACHE is not dead yet :-)
For single byte, save ptra, and use ptra++/ptrb++ in a REPS loop to move the data.
THIS WILL BE FUN!
Seriously though, how would these segment registers help with XMM? You'd need to also have some sort of TLB and with it traps to handle TLB misses. I think traps are dangerously close to interrupts and might not be tolerated well by the Propeller community!
Wanting to execute out of the hub leads to LMM, FCACHE. Those lead to RDxxxxC and RDQUAD. Getting rid of the DAC bus and bumping the hub to 256KB gives Chip the idea for RDOCTL; he valiantly tries not to think of executing out of the hub. I play with ideas
No idea if he wants to put it into the P2, but as you say, still a worthwhile separate discussion for followup chips.
FYI, I talk about XMM version at the end of post#1.
I was thinking of treating XMM as a large linear space for the P3.
Segments/Limits was how Unix was first run, and provide for memory protection and relocation; heck they have even been used for swapping to virtual store.
You really only need TLBs and traps if you are swapping... I am thinking of running right out of the external memory, with one to four small cache lines per cog. The prop has a microcontroller heritage, and I don't think it needs full virtual memory / MMU capability; and with per-cog segment/limit registers, virtual memory is still possible, if silly on a microcontroller.
Actually segment registers are not evil - ugly 16 bit segment registers, used to provide only offsets to 64KB segments are evil. Using full 32 bit segment pointers / limits / pointers they are relatively painless, and allow trivial relocation.
Since you have a history section which I just noticed, you should add a link to my post
Edit: Or are you suggesting that this all happens in hardware?
I'm on the fence about it being a P2 feature.
Execute in place changes things. A lot.
Some here believe working on the design to bridge the time gap based on what we learned on what I'll call the dry run makes sense. If we get execute in place at some high speed, say 90 percent? 80 percent? I think that's worth adding, because it is a very serious differentiator and it doesn't break the basics of what a Propeller is and it doesn't mean we can't still do LMM.
Here is what it comes down to for me:
I really don't want the basic dynamics of a Propeller broken on this chip. I think it's important to have those be solid so that current users can adopt this thing and carry the whole community forward. Anything that threatens that really should be considered very carefully.
I don't know whether or not execute in place does that. Maybe it does. I'm already thinking about most COGS now able to spill over into the HUB, and so where do they do it, how is that managed, etc...? Looks to be a real mess from that POV.
Which is why I'm on the fence.
Oh, and I hate to say this, but I feel the same way about execute in place (hardware LMM assist) as others do about opening up the HUB timing slots.
So, if I'm shouted down in the same way others were, no worries. That may be the right thing to do.
It's up to Chip, and I would say Parallax, because they will take the risk, and they will need to add the value needed to carry most of us forward. If they think they can do that, I'm going to trust it, because I believe in them.
Not much else to say.
For hub exec:
the OCTL buffer is a single line cache
For P3 xmm exec:
Totally bypasses the hub, DDR2/+ loads one (or small number of) cache lines, in hardware. This will not be as fast, or as deterministic as hub exec, but the more cache lines, the closer it will be.
We can also expect video to compete for bandwidth.
In both cases we say goodbye to LMM loops, and run MUCH faster.
FCACHE will still be useful in both cases.
Highest performance: cog only (fully deterministic)
Second: hub exec (can be deterministic if coded directly in assembler)
Third: xmm exec (not deterministic, but good for overall application)
Well, big segment registers are just a different scale of evil. Sort of a "don't worry about it today" kind of evil that somebody somewhere will worry about eventually.
That said, I've no opinion on them. If this feature goes in, it goes in. Given how this has gone, I suspect blowing it all out again for P3 would see them eliminated or another "worry tomorrow" evil will replace them.
- Chip goes to RDOCTL anyway
- AUX only hardware stack
- only the three simple HMM instructions are added (described above)
- Chip, please do not even consider the xmm exec for P2! THAT would be considerable work
The cog already has a mechanism to handle back to back reads, so this just requires a tiny bit of arbitration.
Frankly, it's probably less work than hub slot management
... but only Chip knows how much work it will be.
Even this simple version can support a hub based stack, with two helper cog routines (HCALLH, HRETH)
Knowing Chip, he is busily adding this as I type this...
I was not even considering hardware xmm support for P2!
Dang it, I think I need to add more disclaimers in discussions....
Just to give you a bigger evil to think on...
64 bit P3, with 64 bit segment/limit registers
It was also under "Possible Improvements", but I meant segment registers for P3, which is why it also showed up in the P3 section... in my excitement, I did not clarify sufficiently!