Great news. That is clever just reversing the order of the two instructions. And working for #D as well. WTG Chip!
It doesn't work for multi-tasking? (That's fine I think)
Looks like we will get that "HUBEXEC" (execute in place) model working! What a performance boost over LMM!
It seems like it would work with multi-tasking if you put the BIG or AUGI instruction before the instruction that it augments and added a 23 bit register for each thread. Is that too much added logic?
Interesting idea David!
Edit: We only have 1 x SETWIDE facility, like the old SETQUAD, difficult to share amongst tasks?
That's true. The "execute from hub" feature will probably only work on a single thread. At the very least, the 8-long window into hub memory will get thrashed if instructions get fetched from a different area of hub memory on every cycle. Maybe this also suggests that the BIG/AUGI instruction doesn't need to support multiple threads since it is mostly useful in "execute from hub" mode.
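To make the thrashing argument concrete, here is a rough Python sketch (a toy model, not anything from Chip's design). It models a single 8-long window and counts how often it must be reloaded. The function names, the start addresses, and the strict round-robin fetch order are assumptions for illustration; addresses are in longs.

```python
WINDOW_LONGS = 8  # size of the hub window described in the thread


def count_refills(fetch_addresses):
    """Count how often a single 8-long window has to be reloaded."""
    current_block = None
    refills = 0
    for addr in fetch_addresses:
        block = addr // WINDOW_LONGS
        if block != current_block:
            refills += 1
            current_block = block
    return refills


def single_thread(start, n):
    """Straight-line code: n consecutive long addresses."""
    return [start + i for i in range(n)]


def interleaved(starts, n_per_thread):
    """Hypothetical round-robin tasks: one fetch per task per cycle."""
    order = []
    for i in range(n_per_thread):
        for s in starts:
            order.append(s + i)
    return order


one_task = count_refills(single_thread(0, 64))
four_tasks = count_refills(interleaved([0, 1000, 2000, 3000], 16))
print(one_task, four_tasks)
```

Under these assumptions, 64 fetches from one thread reload the window 8 times, while the same 64 fetches spread across four tasks in different hub areas reload it on every fetch.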
Could we simplify this whole thing a bit, and disregard multi-tasking for this mode of operation? Might simplify it quite a bit for Chip, etc.
Sorry, I did not provide enough detail. I never thought it would be available for cog tasks as it would need four cache lines (one per task) which would actually be better used for a single task; also pthreads would work nicely in this mode.
Does the mapping/windowing of AUX into COG help if you could map larger blocks into COG?
Does not apply.
What would help would be to have it in a fixed region, I strongly recommend $1E0-$1E7 as that would allow clever compiler tricks like re-use of BIG values due to the code generator being able to access the eight long window as regular cog registers when needed.
Extending the above HUBEXEC (named by Bill) model (replaces LMM model)...?
No need for an LMM loop, Chip already expressed that the 8-long window would auto-increment to the next 8-long block (unless there was an explicit HJMP/HCALL somewhere else)
No need for REPS loop etc, Chip will put the fetch phase into the hardware. Automagic!
It really turns this into directly executing from the hub - the "Holy Grail".
I have asked Chip if it were possible to
(1) Make the RDWIDE instruction capable of delivering up to a count of 32 x 8*Long reads into AUX in the background with a tiny state machine
(2) If it would be possible to map up to the whole 32 x 8*Long AUX registers into COG ram
In the other thread, Chip liked the 8-long to/from AUX idea; but not for hubexec mode.
Chip intended on replacing mapping the quads into the cog with mapping the octal window.
After a lot of thought, I believe the 8-long cache should always be at $1E0, as that will allow assembly language and compiler tricks to reduce code size.
By mapping a large AUX block into COG, a good set of hub instructions could be executed inline at a time, and possibly small loops could be contained within those blocks read, giving an enormous boost to performance.
Small REPx loops fit in 8 longs (see post #3), but they must lie within the block. It is too difficult for P2 to span multiple 8-long blocks (that would need multiple cache lines, too big a change for P2).
For bigger loops, use a small reps loop (that fits) to load a big block of cog pasm code in the space $000-$1DF, execute it, and have it return to hubexec mode.
I got rid of the SETPIX0/1/2/3 instructions and made a new SETPIXW instruction that loads all eight PIX terms from the WIDE registers, all at once. So, there are four 'D/#,S/#' opcodes available now.
I've loosely read this thread and I understand that you are looking for some opcode space for HJMP/HCALL/...
EXCELLENT news!
For a 256KB hub, I can encode HJMP/HCALL/HCALLA/HCALLB into a single D/#,S/# opcode, and HRET fits in any available single argument op space. I'll update post#1 after breakfast.
Since I don't know the opcode bit pattern, I'll just use TTTTTTT, which can be filled in with the exact freed pattern later.
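One way the arithmetic could work out, as a hedged Python sketch: a 256KB hub holds 65536 longs, so a long address needs 16 bits, and the two 9-bit D and S fields together give 18 bits, leaving 2 bits to select one of four operations. The thread does not say this is the packing Bill has in mind; the field layout, the order of the operations, and the function names below are assumptions.

```python
HUB_BYTES = 256 * 1024
LONG_BYTES = 4
ADDR_BITS = (HUB_BYTES // LONG_BYTES - 1).bit_length()  # bits for a long address
FIELD_BITS = 9 + 9                                      # D field plus S field
SELECT_BITS = FIELD_BITS - ADDR_BITS                    # bits left over

# Hypothetical ordering; the thread does not define one.
OPS = ["HJMP", "HCALL", "HCALLA", "HCALLB"]


def encode_ds(op, byte_addr):
    """Pack an operation selector and a long-aligned hub address into 18 bits."""
    if byte_addr % LONG_BYTES or not 0 <= byte_addr < HUB_BYTES:
        raise ValueError("address must be long-aligned and inside the hub")
    return (OPS.index(op) << ADDR_BITS) | (byte_addr // LONG_BYTES)


def decode_ds(field):
    """Inverse of encode_ds: return (operation, byte address)."""
    op = OPS[field >> ADDR_BITS]
    long_addr = field & ((1 << ADDR_BITS) - 1)
    return op, long_addr * LONG_BYTES


print(ADDR_BITS, SELECT_BITS, decode_ds(encode_ds("HCALL", 0x1000)))
```

The leftover 2 bits are exactly enough to tell HJMP/HCALL/HCALLA/HCALLB apart, which is consistent with all four sharing one D/#,S/# opcode.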
I also see there is talk about how to have a 32-bit constant in-line. About that: I think the idea has already been posited, but we could have a dummy instruction that doesn't do anything, though its 23 LSBs are free for data payload. Any instruction that has an immediate D or S, with priority going to S, can look for this dummy instruction in the next-lower stage of the pipeline. If it sees it, and it hasn't been cancelled as trailing branch code, it will use its 23 LSBs to augment the 9-bit immediate value it already has, giving it a full 32-bit immediate for D or S. This would solve the problem, would it not?
Yes, it would solve the problem.
I like it better than the prefix! More like other conventional multi-word instructions.
If possible, allow the case of
BIG val1
BIG val2
..
BIG valN
as I have a somewhat vague idea for how to reduce memory requirements yet have fairly high performance jump tables for case statements; basically it involves letting code fall harmlessly through some addresses in order to save code space (trading the single cycle wasted per label fallen through).
I have barely been able to keep up with this debate. Is this really true? If so, it's fantastic! One of those "impossibilities" that happens in Parallaxia.
Instead of BIG, we should probably give it a name like AUGI for 'augment immediate'.
Any instruction having an immediate S or D would look for AUGI behind it. If it sees it and it's not cancelled, it extends the immediate value right in the pipeline, before it gets to stage 4. This was a really clever idea you guys came up with, and it turns out that it can be done by using the registers already in the pipeline, so it's almost free!
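A minimal Python model of that lookahead, offered as a sketch only: each instruction with an immediate checks the instruction behind it, and if that is an uncancelled AUGI, its 23-bit payload is merged with the 9-bit immediate. The `Instr` class and its field names are hypothetical, and placing the 9-bit immediate in the low bits is an assumption (the thread is still debating which part supplies the low bits).

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    imm9: int = 0            # 9-bit immediate field (S or D)
    has_imm: bool = False    # True if the instruction uses an immediate
    payload23: int = 0       # 23-bit payload, only meaningful for AUGI
    cancelled: bool = False  # e.g. trailing branch code that was cancelled


def effective_immediates(stream):
    """Return (name, immediate) for every instruction that has an immediate."""
    out = []
    for i, ins in enumerate(stream):
        if ins.name == "AUGI" or not ins.has_imm:
            continue
        value = ins.imm9 & 0x1FF
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        if nxt is not None and nxt.name == "AUGI" and not nxt.cancelled:
            # Assumed layout: AUGI supplies the upper 23 bits.
            value |= (nxt.payload23 & 0x7FFFFF) << 9
        out.append((ins.name, value))
    return out


program = [
    Instr("MOV", imm9=0x1EF, has_imm=True),
    Instr("AUGI", payload23=0x123456),
    Instr("ADD", imm9=5, has_imm=True),
]
for name, value in effective_immediates(program):
    print(name, hex(value))
```

Because the decision only depends on the next instruction in the stream, no separate state bit has to be carried from one instruction to the next in this model.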
It will feel better following the instruction; also that way it gets rid of the need for a state flip-flop.
As the constant itself will be in the lowest 23 bits, I can still use it for some tricky optimizations :-)
Yes - HUBEXEC will be much faster than LMM, and reduce code size. A win-win.
For multi-tasking hubexec code, pthreads would work just fine (or similar cooperative time slicing). A "YIELD" cog subroutine could switch threads... heck, it would be possible to write some code that can be HCALLed to swap threads; it does not even have to be a cog subroutine.
For multiple hubexec threads, we just use pthreads from C; for assembler, it should be fairly easy to write a "YIELD" hub-subroutine for cooperative multi-tasking.
Hubexec for P3 could work for each task, but I really, really think it is too much for P2, as it would require four better caches to work well. Getting this hubexec will let us test it out, and make an even better version for P3.
Excellent work. This feature is worth the time. Now we've got hardware LMM. And that takes care of the "super cog" idea we always center on from time to time. Any COG can be a super COG, running a larger program. Or, a whole pile of them. Sheesh.
Better yet, also gets rid of the issues with running multiple LMM programs running at once. All we need is a small relocating loader. (or on P3, segment registers)
Also, see post#1 - this is clearly extendable to XMM on P3 with DDR2... it will be slower than hubexec, but transparent. Think XJMP / XCALL / XRET ... only the op code changes... maybe not even that, if the address is used to distinguish between HUB/XMM mode. But that discussion is for the future.
Thanks for clearing that up!
We're not looking for a repeat of LMM. Hopefully Chip can get rid of the need for it.
Executing "directly" from the hub without the need for a fetch/execute loop is far better.
I'm afraid that we need a prototype to prove it all out. That's going to be tough for GCC though, since Chip's instruction overhaul makes a new back-end necessary.
Can we make a smaller prototype to prove it works when it's available? I suppose an extension of PNut would be necessary ... seems like everything depends on Chip's availability.
I suspect Chip will add it to the Verilog fairly soon, and we will get DE0-Nano and DE2-115 configuration files to try it out with quickly.
The other instruction set changes make a back-end overhaul necessary anyway, and changing FJMP->HJMP, etc. should be very easy; different op-code and embedding the hub address right into the instruction (instead of the long following it).
Basically (loose example)
emit("FCALL")
emit("long label")
changes to
emit("HCALL label")
The rest is adding the few additional new instructions to gas, and fixing the address in ld.
The good news is that as soon as PNut supports the new instructions, it will be easy to verify them on the FPGA before synthesis, and iron out any issues that may arise with them.
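The loose emit() example above can be sketched in Python to show the size difference per call site. This is only an illustration of the idea, not PropGCC's actual back-end code; the function names are made up, and it assumes each emitted line occupies one long.

```python
def emit_call_lmm(label):
    """LMM style: the FCALL primitive followed by a long holding the target."""
    return ["FCALL", f"long {label}"]


def emit_call_hubexec(label):
    """Hubexec style: the hub address is embedded in the instruction itself."""
    return [f"HCALL {label}"]


def longs_saved(call_sites):
    """Longs saved across a program, assuming one long per emitted line."""
    per_call = len(emit_call_lmm("x")) - len(emit_call_hubexec("x"))
    return call_sites * per_call


print(emit_call_lmm("_main"))
print(emit_call_hubexec("_main"))
print(longs_saved(100))
```

Under that assumption every call site shrinks from two longs to one.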
Eric made some quick changes to the GCC backend to remove the use of WR/NR in the code generated by PropGCC so there might be hope of getting PropGCC working pretty soon after we get a final instruction set and FPGA configurations from Chip.
BIG, suggested by David: in either Chip's or David's usage, it allows extending the 9-bit immediate constants of instructions to a full 32 bits.
It may be useful to allocate $1F1 as the "BIG" value register, and store the created 32 bit constant in it, so subsequent instructions can use it.
Example:
RDLONG reg,#const32 ' assembler replaces with RDLONG / BIG pair as per David's suggestion
mul reg, #5
add reg,3
WRLONG reg, $1F1 ' saves one long as address already computed in 'big' register
Such code is VERY common, so the potential for savings is significant.
I don't think the WRLONG will work since the BIG register will have been cleared by then. It has to be cleared immediately after use or it will mess up all following instructions that have an S field.
Good point, however if Chip puts in a flip-flop that is cleared after use... or it follows the instruction referencing it like Chip suggested... then it would work, and save some memory.
It is in Chip's hands.
How would it work? The bit would be cleared right after the RDLONG instruction so the BIG register would contain zero by the time you reached the WRLONG. If it didn't get cleared then the MUL and ADD instructions would have their S fields modified and wouldn't work as expected. I don't see how this will work no matter how Chip implements BIG.
Ok, brunch eaten
Your prefix style, need flipflop cleared after BIG is consumed:
BIG #highbits ' sets high bits
MOV reg,#lowbits ' constructed 32 bit value is visible at $1F1
BIG #highaddressbits ' sets high bits, no need to clear as low bits moved into low 9 bits of BIG like MOVS/SETS
RDLONG reg,#lowbits ' constructed 32 bit value is visible at $1F1
mul reg, #5
add reg,3
WRLONG reg, $1F1 ' saves one long as address already fully assembled in 'big' register
Chip's suffix style, no need for flipflop, finds 23 bit value in next pipeline stage
MOV reg,#lowbits
BIG #highbits ' MOV notices following 23 bits, incorporates into move
RDLONG reg,#lowbits ' picks up high bits from next pipe slot
BIG #highaddressbits ' no need to clear, visible at $1F1
mul reg, #5
add reg,3
WRLONG reg, $1F1 ' saves one long as address already fully assembled in 'big' register
My preference is for BIG to supply the low 23 bits, and the direct argument the high 9 bits - for direct address compatibility.
I see what you're doing but are you sure you want to waste another COG location with a visible BIG register? Also, I've already said why I think the low bits should be the 9 bits from the modified instruction. You haven't yet provided an example showing how having the BIG instruction supply the low bits would be useful and I think it will be more complicated to implement in hardware.
Comments
Hi Ray,
They are equivalents of FJMP, FCALL, FRET, that do not need to be interpreted; and FJMP/FCALL embed a hub address capable of reaching the full 256KB.
Correct!
It was a great suggestion by David, it will be incredibly useful in hubexec.
Bye-Bye LMM (with 4:1 or 5:1 slow down without fcache)
Hello HubExec, running at (prediction) ~90% of cog-only pasm! (closer to 99.9% with FCACHE/FLIB)
Think of all the things we can accomplish with large, almost cog speed, pasm, C, etc. code!
LMM was great for the P1 - we did not have any other choice.
http://forums.parallax.com/showthread.php/152079-Hub-Execution-Model-Thread-%28split-from-blog%29?p=1223971&viewfull=1#post1223971
I am eating, will post example shortly
I would make a larger leap on the basis of Assembler Clarity.
( no change to the binary action, just to what the user 'sees' )
i.e., if the above opcodes work
ADD reg,#bigconstant & $1FF
BIG #bigconstant >> 9
but what that finally does is 'add bigconstant to reg', then it makes more sense to be able to write this in one ASM line
ADDI32 reg,#big32constant // does what it says
Now the assembler creates two 32 bit values, so you have a 2 word opcode.
If you also want to support the more obtuse dual-opcode form in ASM, then I'd use EXTI32 (EXTend Immediate 32):
ADD reg,#bigconstant & $1FF
EXTI32 #bigconstant >> 9
If that second opcode is context dependent on Any instruction having an immediate S or D, then the Assembler should check that, and give an error. ( another reason for the simpler, clearer one line syntax )
A smart assembler could even support this as well
ADD reg,#AnyConstant
and spawn one of two opcode sets (just like many ASMs now do automatically with JMP/CALL)
The LIST file should make it clear when 32 bit promotion occurred.
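A rough Python sketch of how such an assembler pass might pick between the two forms. It uses the low-9 / high-23 split from the ADD/EXTI32 example above; the function name and the output format are made up for illustration, and no real assembler is being described.

```python
IMM9_MAX = 0x1FF  # largest value that fits in a 9-bit immediate


def assemble_imm(mnemonic, reg, constant):
    """Return one source line for small constants, two for 32-bit ones."""
    constant &= 0xFFFFFFFF
    if constant <= IMM9_MAX:
        return [f"{mnemonic} {reg},#${constant:X}"]
    return [
        f"{mnemonic} {reg},#${constant & IMM9_MAX:X}",  # low 9 bits
        f"EXTI32 #${constant >> 9:X}",                   # upper 23 bits
    ]


for line in assemble_imm("ADD", "reg", 5):
    print(line)
for line in assemble_imm("ADD", "reg", 0x12345678):
    print(line)
```

A listing generator could flag the two-line case so the user can see where 32-bit promotion happened.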
Macros could be written so that if #AnyConstant <= 511 a single instruction is used; otherwise the following EXTI is added ...
See my example above.
Once a binary path exists, this now really moves into how the Assembler manages what the user wants.
Clarity, and freedom from context errors should become important in how the Assembler supports this new opcode set.
Edit: Hehe snap - you can read and comprehend faster than I can type..