A quick query about test-and-jump instructions in XBYTE. The addressing is relative, so if the branch occurs, is the next instruction executed after TJx flag,routine at routine+x, where x = 0..7?
Yes, I think.
For those 9-bit immediate branches, addressing is relative. If you use a register to hold the address, it will be absolute.
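The relative/absolute distinction can be sketched in a few lines of Python. This is only an illustrative model, not the silicon's logic: it assumes cog/LUT execution, a signed 9-bit immediate, and an offset counted in instructions from the instruction after the branch. The names `sign_extend_9` and `branch_target` are made up for this example.

```python
def sign_extend_9(field: int) -> int:
    """Interpret a 9-bit field as a two's-complement offset (-256..255)."""
    field &= 0x1FF
    return field - 0x200 if field & 0x100 else field


def branch_target(pc: int, operand: int, immediate: bool) -> int:
    """Return the cog/LUT address executed next if the branch is taken.

    pc        -- address of the branch instruction itself
    operand   -- 9-bit immediate field, or the value held in the register
    immediate -- True for the '#' (immediate) form, False for the register form
    """
    if immediate:
        # Assumption: the offset is relative to the instruction after the branch.
        return pc + 1 + sign_extend_9(operand)
    # Register form: the register holds the absolute address.
    return operand
```

With this model, an immediate field of 3 at address $040 lands three instructions past the one following the branch, while a register operand is used as the address unchanged.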
Another XBYTE-related query. If more than 7 instructions are being skipped at once, then the 8th subsequent one will be cancelled and act as a NOP. Presumably this resets the 'skipped-in-a-row' counter and 7 more can be skipped without incurring another penalty. Therefore, from the original instruction, up to 16 instructions can be skipped with only a two-cycle penalty. Does that sound correct?
Yes. And any non-skipped instruction resets the 7 counter.
So, if you want to mix the instructions being executed, you might be able to tune the sequence so that no clocks are wasted on NOPs. Probably not as readable, though.
Calculations I performed earlier in this thread suggest 16 instructions skipped will incur two penalties (4 cycles).
If the first instruction in a pattern is being skipped you get an automatic NOP, which will allow 7 more for free (8 total). All other circumstances provide up to 7 skipped for free, with 8 costing a NOP.
What I'm not clear on is whether the rapid advancement of the PC begins or ends with the NOP. Based on the first-instruction case, it seems like it begins with the NOP, but all other cases suggest it ends with a NOP.
Ultimately, I don't think it matters all that much, as long as you remember that for SKIPF:
First instruction in skip pattern skipped + up to 7 more = NOP
1 to 7 consecutive instructions skipped = free
8 to 15 consecutive instructions skipped = NOP
16 to 23 consecutive instructions skipped = 2 NOPs
24 to 31 consecutive instructions skipped = 3 NOPs
EXECF (and by extension XBYTE) only has 22 skip bits, but I believe the same rules apply.
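The rules of thumb above can be turned into a small calculator. The Python sketch below is just a model of those rules, not a description of the hardware. It assumes skip bits are consumed LSB-first with 1 meaning "skip", and it extrapolates the first-instruction case beyond 8 skipped instructions, which the post does not cover. The names `run_penalty` and `pattern_penalty` are invented for this example.

```python
def run_penalty(run_length: int, starts_pattern: bool = False) -> int:
    """NOPs charged for one run of consecutive skipped instructions."""
    if run_length <= 0:
        return 0
    if starts_pattern:
        # Stated rule: first instruction skipped + up to 7 more = 1 NOP.
        # Assumption: longer runs keep costing one NOP per further 8 skips.
        return (run_length + 7) // 8
    # Stated rule: 1-7 free, 8-15 = 1 NOP, 16-23 = 2 NOPs, 24-31 = 3 NOPs.
    return run_length // 8


def pattern_penalty(pattern: int, bits: int = 32) -> int:
    """Total NOPs for a skip pattern, scanning bits LSB-first (1 = skip)."""
    total = 0
    run = 0
    run_starts_pattern = False
    for i in range(bits):
        if (pattern >> i) & 1:
            if run == 0:
                run_starts_pattern = (i == 0)
            run += 1
        else:
            total += run_penalty(run, run_starts_pattern)
            run = 0
    # Account for a run that reaches the last bit of the pattern.
    total += run_penalty(run, run_starts_pattern)
    return total
```

Since each NOP takes 2 clocks, the clock penalty is `2 * pattern_penalty(pattern)`. For EXECF/XBYTE patterns, `bits=22` would be the natural argument if the same rules do apply there.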
Are there speed advantages to code written using SKIP or is it really just a code compression mechanism?
SKIP is just code compression. From the Parallax Propeller 2 documentation on skipping:
SKIP: "Skipping is accomplished by either cancelling instructions as they come through the pipeline from hub or cog/LUT memory (effectively turning them into 2-clock NOP instructions) ..."
SKIPF or EXECF: "... or by leaping over them in cog/LUT memory (no clock penalty)."
If using the FIFO for block reads and WRLONG for the write-back, be aware that the FIFO will snatch eight hubRAM clock cycles after every eight RFLONGs. If a WRLONG coincides then the WRLONG has to wait.
An option for the most demanding scenarios is cog pairing with shared LUT RAM. The paired cog can then handle buffer management with block copying, or both cogs could share the processing, and maybe both FIFOs get used.
I'm interested in knowing when the FIFO is reloaded during XBYTE. Is it immediately after the hidden RFBYTE that reads the final 32nd byte in a block starting on a 32-byte/8-long boundary in hub RAM? If so, reloading the FIFO probably won't cause any delays for most XBYTE apps.
There is enough depth in the FIFO to allow time for refilling without any stalling. Not even the streamer going full-tilt can beat it. That the FIFO has priority over the cog is one reason why.
I think it has two hub rotations of depth. In an 8-cog Prop2, that would be 16 longwords. And it will be 100% filled on the initial RDFAST.
Chip,
I presume there is only one internal skip register?
If I call a subroutine that suspends the skip, then I cannot load a new one in the subroutine as it would overwrite the currently suspended skip? What happens if I return after I loaded a new skip?
If the skip is suspended on a call, can I make another call, and will the suspended skip be reinvoked correctly once I return from all the lower calls?
I guess I am asking: are there any extra caveats?
BTW, having great fun with these and the Spin interpreter.
If I call a subroutine that suspends the skip, then I cannot load a new one in the subroutine as it would overwrite the currently suspended skip?
Correct. A new SKIP/SKIPF/EXECF inside a subroutine called from a skip sequence overwrites the original skip bits and resets the skip call counter (see below).
If the skip is suspended on a call, can I make another call, and will the suspended skip be reinvoked correctly once I return from all the lower calls?
Yes, there is a call counter that is incremented by a call and decremented by a return. Skipping resumes when the counter is zero and carries on until all the skip bits are used up.
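The suspend/resume behaviour described in these answers can be modelled with a toy state machine. The Python sketch below only restates the replies above (one skip register, a call counter, a new skip overwriting the bits and resetting the counter). The class name `SkipState` and its methods are hypothetical, and details such as whether the CALL itself consumes a skip bit are deliberately left out.

```python
class SkipState:
    """Toy model of a single skip register plus a call counter."""

    def __init__(self) -> None:
        self.bits = 0        # pending skip bits, consumed LSB-first (assumption)
        self.call_depth = 0  # calls made since the skip pattern was loaded

    def load(self, pattern: int) -> None:
        # A new SKIP/SKIPF/EXECF overwrites the bits and resets the call counter.
        self.bits = pattern
        self.call_depth = 0

    def call(self) -> None:
        # Each call increments the counter, which suspends skipping.
        self.call_depth += 1

    def ret(self) -> None:
        # Each return decrements the counter (floored at zero in this model).
        if self.call_depth > 0:
            self.call_depth -= 1

    def skipping_active(self) -> bool:
        # Skipping runs only at call depth zero with skip bits still pending.
        return self.bits != 0 and self.call_depth == 0

    def step(self) -> bool:
        """Advance one instruction; return True if it is skipped."""
        if self.call_depth > 0:
            return False  # suspended: the pending bits are left untouched
        skipped = bool(self.bits & 1)
        self.bits >>= 1
        return skipped
```

In this model, nested calls simply push the counter higher, and the original pattern resumes only once every call has returned, unless a `load` inside the subroutine has replaced it.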
BTW Having great fun with these and the spin interpreter
SKIPF, EXECF and XBYTE are superb and many thanks to Chip for inventing them. I'm completely biased because I suggested it, but a CALL during skipping is an excellent way to execute more instructions than there are bits. Also, inside a called routine one is 'safe' from the skip bits outside and therefore a CALL + SKIPF/EXECF provides a good escape from the original skip sequence.
A way to generate skip bit patterns semi-automatically would be very handy, though.
@TonyB_ wrote:
SKIPF, EXECF and XBYTE are superb and many thanks to Chip for inventing them. I'm completely biased because I suggested it, but a CALL during skipping is an excellent way to execute more instructions than there are bits. Also, inside a called routine one is 'safe' from the skip bits outside and therefore a CALL + SKIPF/EXECF provides a good escape from the original skip sequence.
A way to generate skip bit patterns semi-automatically would be very handy, though.
I'm making great use of them. Yes, suspending the skip while in a called subroutine is fantastic too.
I've found the 22 bits a bit limiting (pardon the pun).
Once you use a feature, you find just how useful it can be, and you always want more.
Would have been great to push the skip bits into a wider fifo and then RET them, but something has to ultimately give until it doesn't fit/run.
BTW I commented about my use on the P1 Spin Interpreter for P2 thread last night. These instructions are an absolutely brilliant idea!!!
I've been finding that, with the additional space, I can inline some subroutines, which of course speeds up execution.
The penalty is the tables but they are easily placed in LUT.
The XBYTE/EXECF tables in LUT are not always sufficient on their own. If extra bytecode decoding is needed for special cases then SKIPF/EXECF {#}D can be used but the skip patterns for these must be in cog RAM.
Yes, I currently have two tables in LUT, but I am fetching the long into cog and then doing an EXECF. Neither uses XBYTE yet, as I am outputting debug info between the fetch and the EXECF.
I think there are a couple more tricks that can be used to extend the length too.
Comments
Not sure where you are getting x = 0..7 from?
Provided the code is in cog/LUT RAM, there are no issues with TJxx, and the SKIPF continues into the jumped-to code.
Yes, and I should have written TJx flag,#routine.
Yes, there is only one internal skip register. I suggested nested skip sequences, but that was too difficult to implement.
If you return after loading a new skip, the new skip sequence continues.