Just discovered (or maybe rediscovered) that the SD card reading code, FSRW, was twice as fast when I unrolled a REP loop in inline assembly.
The P2 documentation says this: "REP works in hub memory, as well, but executes a hidden jump to get back to the top of the repeated instructions."
What it doesn't say is that this makes it slow...
It's not only REP: any jump, call or ret in hubexec causes a FIFO reload and a pause to sync with the hub egg-beater, and that is what makes hubexec slower than cog or LUT exec.
REP was only meant for cog exec, but Chip made it execute a jmp if you use it in hubexec, simply for source code compatibility. I find hubexec is almost, or just, as fast as cog exec if you have a large linear section of code to execute. A single jump or loop will slow it all down again.
However, if you need loops and you still want it fast but can't dedicate the memory for it in cog/lut, you use a SETQ + RDLONG at the start of your hubexec code and copy the code into cog or lut (SETQ2) and jump to that.
This is something I certainly do for my upscaler from 320x240 to 640x480 in my video player.
You must make sure that the rdlong address uses a HUB address and not a LUT address.
Note the positioning of the labels _hub_lut_begin and _USER_LUT_BEGIN with respect to ORGH and ORG $200, or the use of @.
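The listing that went with this post is not preserved here. As a rough sketch of the idea, assuming a PASM-only program (so that @label resolves to an absolute hub address) and using made-up labels (hub_main, lut_code, lut_end) and a made-up loop body, it could look something like this:

```spin2
' Sketch only: label names and the loop body are hypothetical.
DAT
                org     0
entry           jmp     #\hub_main              ' leave cog RAM and continue in hubexec

                orgh    $400
hub_main        setq2   #(lut_end - lut_code) - 1 ' block length in longs, minus 1
                rdlong  0, ##@lut_code          ' SETQ2 + RDLONG: hub -> LUT offset 0; @ gives the HUB address
                call    #\lut_code              ' run the loop from LUT instead of hubexec
.halt           jmp     #.halt

                org     $200                    ' cog address $200 = first LUT long
lut_code        mov     pa, #0
                rep     @.done, #100            ' REP without the hidden jump, since this runs from LUT
                add     pa, #1
.done           ret
lut_end
```

Without the @, ##lut_code would assemble as $200 (the cog/LUT address) and RDLONG would fetch from hub address $200 rather than from where the code actually sits in hub RAM. Inside a Spin2 object the meaning of @ in a DAT section may differ, so treat the address expression as an assumption to check against your toolchain.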
Just noticed this really neat piece of code from JonnyMac in his jm_fullduplexserial.spin2.
It shows how to easily pass a set of parameters from a program (spin2 here, but doesn't need to be) to a pasm driver. These parameters can be picked up with two pasm instructions, setq and rdlong.
Also neat is the reading/writing of the head and tail parameters using the ptra[n] offset.
Lastly, the use of incmod to increment the head and tail parameters.
All these are examples of efficient use of the new P2 instruction set.
However, I would change one thing. I think the "org" should be an "org 0". Mostly it will not matter, but I think there are some possibilities where it may give incorrect results. So I have taken the liberty of modifying this line.
Here is an extract of the relevant sections
var
  long  cog                                     ' cog flag/id
  long  rxp                                     ' rx smart pin
  long  txp                                     ' tx smart pin
  long  rxhub                                   ' hub address of rxbuf
  long  txhub                                   ' hub address of txbuf
  long  rxhead                                  ' rx head index
  long  rxtail                                  ' rx tail index
  long  txhead                                  ' tx head index
  long  txtail                                  ' tx tail index
  long  txdelay                                 ' ticks to transmit one byte
  byte  rxbuf[BUF_SIZE]                         ' buffers
  byte  txbuf[BUF_SIZE]
.....
pub start(rxpin, txpin, mode, baud) : result | baudcfg, spmode
.....
  cog := coginit(COGEXEC_NEW, @uart_mgr, @rxp) + 1      ' start uart manager cog
  return cog

dat { smart pin uart/buffer manager }
                        org     0
uart_mgr                setq    #4-1                    ' get 4 parameters from hub
                        rdlong  rxd, ptra
uart_main               testb   rxd, #31        wc      ' rx in use?
        if_nc           call    #rx_serial
                        testb   txd, #31        wc      ' tx in use?
        if_nc           call    #tx_serial
                        jmp     #uart_main

rx_serial               testp   rxd             wc      ' anything waiting?
        if_nc           ret
                        rdpin   t3, rxd                 ' read new byte
                        shr     t3, #24                 ' align lsb
                        mov     t1, p_rxbuf             ' t1 := @rxbuf
                        rdlong  t2, ptra[4]             ' t2 := rxhead
                        add     t1, t2
                        wrbyte  t3, t1                  ' rxbuf[rxhead] := t3
                        incmod  t2, #(BUF_SIZE-1)       ' update head index
        _ret_           wrlong  t2, ptra[4]             ' write head index back to hub

tx_serial               rdpin   t1, txd         wc      ' check busy flag
        if_c            ret                             ' abort if busy
                        rdlong  t1, ptra[6]             ' t1 = txhead
                        rdlong  t2, ptra[7]             ' t2 = txtail
                        cmp     t1, t2          wz      ' byte(s) to tx?
        if_e            ret
                        mov     t1, p_txbuf             ' start of tx buffer
                        add     t1, t2                  ' add tail index
                        rdbyte  t3, t1                  ' t3 := txbuf[txtail]
                        wypin   t3, txd                 ' load into sp uart
                        incmod  t2, #(BUF_SIZE-1)       ' update tail index
        _ret_           wrlong  t2, ptra[7]             ' write tail index back to hub
' --------------------------------------------------------------------------------------------------
rxd                     res     1                       ' receive pin
txd                     res     1                       ' transmit pin
p_rxbuf                 res     1                       ' pointer to rxbuf
p_txbuf                 res     1                       ' pointer to txbuf
t1                      res     1                       ' work vars
t2                      res     1
t3                      res     1
                        fit     472
I've started working on implementing a FIR filter to complement the Sinc filters in the smartpins. First part of constructing the table of taps is done and tested. In the process I've learnt a couple of things about using the ALTI prefixing instruction.
First: I hadn't quite got it that this instruction can manipulate multiple pointers at once. This is achieved like a SIMD operation. Because cogRAM is addressed in just 9 bits, there is room in a single 32-bit register to hold multiple cogRAM pointers, and so that's exactly what can be done.
Second: A more minor detail: The %RRR bits in the control word have two modes: one for manipulating the subsequent instruction's result register address, and the other a special case for re-encoding the subsequent opcode.
Anyway, here's a compact use of ALTI with %RRR to fill the FIR tap table in cogRAM:
' Build FIR table
'--------------------------------------------------------------
.step                   qfrac   #1, #firsize            'calculate step angle
                        getqx   .step                   'retrieve the calculated incremental angle of each step
                        sub     s_angle, .step          'one step back from 180 deg to move off the zero value tap
                        qrotate s_mag, s_angle          'first cordic op for filling FIR table
.tabloop
                        sub     s_angle, .step  wc      'angle step, assumes stepping backward from 180 deg to 0 deg
                        getqx   pa                      'retrieve the calculated cosine
        if_nc           qrotate s_mag, s_angle          'begin next cordic op
                        alti    .firp1, #%111_000_000   'first half of table, post-incrementing index, start to middle
                        add     pa, s_mag               'offset cosine to sit on top the zero
                        alti    .firp2, #%110_000_000   'mirrored second half of table, post-decrementing index, end to middle
                        add     pa, s_mag               'ditto
        if_nc           jmp     #.tabloop
'--------------------------------------------------------------
...
...
.firp1                  long    firtab<<19
.firp2                  long    (firsize-1+firtab)<<19
s_angle                 long    $8000_0000              'start angle ($8000_0000 == 180 deg)
s_mag                   long    $7fff_ffff / firsize    'FIR table magnitude
firtab                  res     firsize
firbuf                  res     firsize
EDIT: Quality improvement by making QROTATE conditional execution to eliminate extraneous case
EDIT2: Automate setting of table magnitude (s_mag)
A neat attribute of the above use of ALTI is it not only provides the needed register indirection, but the nature of manipulating with %RRR is that it gives the ADD instruction a true third operand. The PA and S_MAG operands are still the ALU inputs as specified in the ADD instruction, unaffected by the prefixing ALTI.
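The FIR code above only uses the result-register pointer. To illustrate the "multiple pointers at once" point, here is a minimal sketch that packs three 9-bit cogRAM pointers into one long and lets a single ALTI step all of them. The names vec_add, ptrs, vec_a, vec_b and vec_sum are invented for the example, and it assumes the %DDD and %SSS fields of the control word behave like %RRR (%111 = substitute the field, then post-increment), which is my reading of the documentation rather than something tested here:

```spin2
' Sketch only: all names are hypothetical.
DAT
                org     0
vec_add         mov     pa, #4                  ' four elements to process
.loop           alti    ptrs, #%111_111_111     ' substitute R, D and S fields, then post-increment all three
                add     0-0, 0-0                ' executes as vec_sum[i] := vec_a[i] + vec_b[i]
                djnz    pa, #.loop
.halt           jmp     #.halt

ptrs            long    (vec_sum << 19) | (vec_a << 9) | vec_b  ' R pointer, D pointer, S pointer
vec_a           long    1, 2, 3, 4
vec_b           long    10, 20, 30, 40
vec_sum         res     4
```

Because the result goes to the substituted result register, vec_a should be left unmodified, which is the "true third operand" effect described above. Note that ptrs is advanced by the loop, so it would need reloading before a second pass.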
Here's something that makes sense when you think about it, but isn't explicitly said anywhere in the documentation: When skipping (with SKIPF/EXECF/XBYTE) more than 7 instructions after an ALTx instruction, it won't work.
Hehe, it won't be quite that. I found this in the hardware doc:
"like SKIP, but fast due to PC steps of 1..8"
The way that text is laid out in the document suggests it came from the instruction sheet. I suspect it got trimmed out of the instruction sheet at some stage.
The incremental limit of 8 does seem an unneeded burden. There must have been a reason why. I didn't follow the skipping conversations when it was developed so I don't know why myself.
EDIT: Oh, it's just the ALTx that fails. Yeah, makes complete sense because the ALTx would then be prefixing a cancelled instruction.
There is a limit of 7/8 instructions for free skips. After, IIRC, 7 skips, a clock (or 2?) needs to be inserted as the skip continues.
I am unsure of the impact on ALT instructions. Whether they are treated any differently from other instructions is something Chip will need to answer.
To continue discussion, please use the discussion thread (link in top post).
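Here's another funny one: When porting P1 code, make sure that when translating a MOVS that is being used to modify a jump instruction, you turn that jump into an absolute one (jmp #\whatever instead of jmp #whatever). Otherwise the jump may be assembled as a relative one, and the address patched into its S field ends up being treated as part of an offset.
(This of course only works if the address is still a cog address.)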
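A minimal sketch of what the ported code might look like, using SETS as the P2 counterpart of MOVS. The labels dispatch, handler_a and handler_b are hypothetical, and the two NOPs simply stand in for whatever code sits between the patch and the patched jump:

```spin2
' Sketch only: label names are hypothetical.
DAT
                org     0
                sets    dispatch, #handler_b    ' P1 source had: movs dispatch, #handler_b
                nop                             ' keep a gap so the patched long is fetched after the write
                nop
dispatch        jmp     #\handler_a             ' "\" forces absolute, so the S field holds a plain cog address
handler_a       jmp     #handler_a
handler_b       jmp     #handler_b
```

With the absolute form, the low 9 bits of the jump hold the cog address directly, so overwriting them with #handler_b redirects the jump as it did on the P1.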
Here's another thing one may not have realized: Bit 10 of an EXECF pattern (first SKIP bit) is always zero if you set the address field to the first instruction of interest
This means you can use that bit to store something interesting. Just make sure to use BITL whatever,#10 WCZ to test it, since that will clear the bit in the process.
In Spin we did ... whereas in Spin2 ... Note the reversed status return, i.e. the NOT requirement.
Postedit: Fixed the second line (remove -1 and rename)
Isn't there a REPEAT UNTIL that is the same as REPEAT WHILE NOT (except potentially faster?)
Warning:
Please be aware that the documentation "Parallax Propeller 2 Documentation 2019-09-13 v33 (Rev B silicon)" appears to be incorrect.
At 200MHz, an OUTx instruction followed by a TESTP instruction appears to require a minimum of 7 clocks between them (waitx #5).
At 200MHz, an OUTx instruction followed by a TEST instruction appears to require a minimum of 8 clocks between them (waitx #6).
Further clarification has been requested.
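As a small sketch of the sequence being described (PIN is a hypothetical pin number, and the delay is the observed value reported above, not a documented figure):

```spin2
' Sketch only: PIN is a hypothetical pin number.
CON
  PIN = 0

DAT
                org     0
                dirh    #PIN                    ' make the pin an output
                outh    #PIN                    ' the OUTx instruction
                waitx   #5                      ' WAITX takes 2 + 5 = 7 clocks
                testp   #PIN            wc      ' C = state read back from the pin
done            jmp     #done
```

For the TEST case the same pattern would apply with waitx #6 (8 clocks).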
Please, if you want to comment, use this thread:
forums.parallax.com/discussion/167812/p2-tricks-traps-differences-between-p1-discussion/p1?new=1
and keep this thread for the actual tricks and traps. Thanks.
This works because the unmapped hub RAM area, $80000-$FBFFF, does indeed read as zeroes.
Note that the 16KB at $7C000-$7FFFF is dual-mapped to $FC000-$FFFFF.
Thanks @Wuerfel_21
Of course that won't work if we ever get a 1MB P2
With more people now using the P2 I thought it was worth bumping this thread.
For discussions, please use the discussion version of this thread linked in the first post. Thanks.
ALTB clashing with ADDPINS, workaround - https://forums.parallax.com/discussion/174788/altb-instruction-gotcha-trap/p1
SETSEn instructions glitch when an active trigger is still present. The prior event clearing seems to be done too early ... allowing retriggering to occur before the new mode and source are selected. Further reading - https://forums.parallax.com/discussion/comment/1558763/#Comment_1558763
NOTE: This is not an issue with regular retriggering. It's specifically with how SETSEn instructions initially set up and arm the event hardware.
I've attached a demonstration program.
The NOT instruction doesn't set the Z flag according to the instruction's result but rather according to its S operand.
That produces an output of %1 instead of the documented %10.