Just discovered (or maybe rediscovered) that the SD card reading code, FSRW, was twice as fast when I unrolled a REP loop in inline assembly.
The P2 documentation says this: "REP works in hub memory, as well, but executes a hidden jump to get back to the top of the repeated instructions."
What it doesn't say is that this makes it slow...
It's not just REP. Any jump, call or ret in hubexec causes a FIFO reload and a pause to sync with the hub egg-beater, and that is what makes hubexec slower than cog or LUT exec.
REP was only meant for cog exec but Chip made it execute a jmp if you tried to use it in hubexec, simply for source code compatibility. I find hubexec is almost or just as fast as cog exec if you have a large linear section of code to execute. Just one jump or looping will slow it all down again.
Just noticed this really neat piece of code from jonny mac in his jm_fullduplexserial.spin2
It shows how to easily pass a set of parameters from a program (spin2 here, but doesn't need to be) to a pasm driver. These parameters can be picked up with two pasm instructions, setq and rdlong.
Also neat is the reading/writing of the head and tail parameters using the ptra[n] offset.
Lastly, the use of incmod to increment the head and tail parameters.
All these are examples of efficient use of the new P2 instruction set.
However, I would change one thing. I think the "org" should be an "org 0". Mostly it will not matter, but I think there are some possibilities where it may give incorrect results. So I have taken the liberty of modifying this line.
Here is an extract of the relevant sections
```
var

  long  cog                                     ' cog flag/id

  long  rxp                                     ' rx smart pin
  long  txp                                     ' tx smart pin
  long  rxhub                                   ' hub address of rxbuf
  long  txhub                                   ' hub address of txbuf

  long  rxhead                                  ' rx head index
  long  rxtail                                  ' rx tail index
  long  txhead                                  ' tx head index
  long  txtail                                  ' tx tail index

  long  txdelay                                 ' ticks to transmit one byte

  byte  rxbuf[BUF_SIZE]                         ' buffers
  byte  txbuf[BUF_SIZE]

.....

pub start(rxpin, txpin, mode, baud) : result | baudcfg, spmode

.....

  cog := coginit(COGEXEC_NEW, @uart_mgr, @rxp) + 1      ' start uart manager cog
  return cog


dat { smart pin uart/buffer manager }

                org     0

uart_mgr        setq    #4-1                    ' get 4 parameters from hub
                rdlong  rxd, ptra

uart_main       testb   rxd, #31        wc      ' rx in use?
    if_nc       call    #rx_serial

                testb   txd, #31        wc      ' tx in use?
    if_nc       call    #tx_serial

                jmp     #uart_main

rx_serial       testp   rxd             wc      ' anything waiting?
    if_nc       ret

                rdpin   t3, rxd                 ' read new byte
                shr     t3, #24                 ' align lsb
                mov     t1, p_rxbuf             ' t1 := @rxbuf
                rdlong  t2, ptra[4]             ' t2 := rxhead
                add     t1, t2
                wrbyte  t3, t1                  ' rxbuf[rxhead] := t3
                incmod  t2, #(BUF_SIZE-1)       ' update head index
    _ret_       wrlong  t2, ptra[4]             ' write head index back to hub

tx_serial       rdpin   t1, txd         wc      ' check busy flag
    if_c        ret                             ' abort if busy

                rdlong  t1, ptra[6]             ' t1 = txhead
                rdlong  t2, ptra[7]             ' t2 = txtail
                cmp     t1, t2          wz      ' byte(s) to tx?
    if_e        ret

                mov     t1, p_txbuf             ' start of tx buffer
                add     t1, t2                  ' add tail index
                rdbyte  t3, t1                  ' t3 := txbuf[txtail]
                wypin   t3, txd                 ' load into sp uart
                incmod  t2, #(BUF_SIZE-1)       ' update tail index
    _ret_       wrlong  t2, ptra[7]             ' write tail index back to hub

' --------------------------------------------------------------------------------------------------

rxd             res     1                       ' receive pin
txd             res     1                       ' transmit pin
p_rxbuf         res     1                       ' pointer to rxbuf
p_txbuf         res     1                       ' pointer to txbuf

t1              res     1                       ' work vars
t2              res     1
t3              res     1

                fit     472
```
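For anyone who wants to see what the incmod lines in that driver are doing, here is a rough Python model of the head-index update. It is only a sketch: BUF_SIZE = 64 is an assumed value (the extract doesn't show the real one), and incmod() and rx_store() are names made up for this illustration.

```python
BUF_SIZE = 64  # assumed buffer size; the extract does not show the real value


def incmod(d: int, s: int) -> int:
    """Model of INCMOD D,S: wrap to 0 when D equals S, otherwise add 1."""
    return 0 if d == s else (d + 1) & 0xFFFF_FFFF


def rx_store(rxbuf: bytearray, rxhead: int, byte: int) -> int:
    """Mirror of rx_serial: store at rxbuf[rxhead], return the updated head."""
    rxbuf[rxhead] = byte
    return incmod(rxhead, BUF_SIZE - 1)


rxbuf = bytearray(BUF_SIZE)
head = BUF_SIZE - 2
head = rx_store(rxbuf, head, 0x41)  # head moves to the last slot
head = rx_store(rxbuf, head, 0x42)  # head wraps back to 0
print(head)
```

Because the wrap is a compare against the limit rather than an AND mask, the buffer size does not have to be a power of two.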
I've started working on implementing a FIR filter to complement the Sinc filters in the smartpins. First part of constructing the table of taps is done and tested. In the process I've learnt a couple of things about using the ALTI prefixing instruction.
First: I hadn't quite got it that this instruction can manipulate multiple pointers at once. This is achieved like a SIMD operation. Because cogRAM is addressed in just 9 bits, there is room in a single 32-bit register to hold multiple cogRAM pointers, and so that's exactly what can be done.
Second, a more minor detail: the %RRR bits in the control word have two modes. One manipulates the subsequent instruction's result register address; the other is a special case for re-encoding the subsequent opcode.
Anyway, here's a compact use of ALTI with %RRR to fill the FIR tap table in cogRAM:
```
' Build FIR table
'--------------------------------------------------------------
.step           qfrac   #1, #firsize            'calculate step angle
                getqx   .step                   'retrieve the calculated incremental angle of each step
                sub     s_angle, .step          'one step back from 180 deg to move off the zero value tap
                qrotate s_mag, s_angle          'first cordic op for filling FIR table
.tabloop
                sub     s_angle, .step  wc      'angle step, assumes stepping backward from 180 deg to 0 deg
                getqx   pa                      'retrieve the calculated cosine
    if_nc       qrotate s_mag, s_angle          'begin next cordic op
                alti    .firp1, #%111_000_000   'first half of table, post-incrementing index, start to middle
                add     pa, s_mag               'offset cosine to sit on top the zero
                alti    .firp2, #%110_000_000   'mirrored second half of table, post-decrementing index, end to middle
                add     pa, s_mag               'ditto
    if_nc       jmp     #.tabloop
'--------------------------------------------------------------
...
...
.firp1          long    firtab<<19
.firp2          long    (firsize-1+firtab)<<19
s_angle         long    $8000_0000              'start angle ($8000_0000 == 180 deg)
s_mag           long    $7fff_ffff / firsize    'FIR table magnitude

firtab          res     firsize
firbuf          res     firsize
```
EDIT: Quality improvement by making QROTATE conditional execution to eliminate extraneous case
EDIT2: Automate setting of table magnitude (s_mag)
A neat attribute of the above use of ALTI is it not only provides the needed register indirection, but the nature of manipulating with %RRR is that it gives the ADD instruction a true third operand. The PA and S_MAG operands are still the ALU inputs as specified in the ADD instruction, unaffected by the prefixing ALTI.
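To make the pointer handling a bit more concrete, here is a small Python model of just the index arithmetic: a 9-bit result pointer held at bit 19 of a register (as in firtab<<19), post-incremented for the first half of the table and post-decremented for the mirrored half. It is a sketch, not a model of the silicon: step_r_field() and fill_table() are invented names, the table address $100 and size 8 are made-up values, and wrapping the field at 9 bits is my assumption.

```python
R_SHIFT = 19        # position of the result-pointer field, as in "firtab<<19"
FIELD_MASK = 0x1FF  # cogRAM addresses are 9 bits wide


def r_field(ptr: int) -> int:
    """Extract the 9-bit result pointer from the packed register."""
    return (ptr >> R_SHIFT) & FIELD_MASK


def step_r_field(ptr: int, delta: int) -> int:
    """Step only the 9-bit field at bit 19 (assumed to wrap at 9 bits)."""
    new = (r_field(ptr) + delta) & FIELD_MASK
    return (ptr & ~(FIELD_MASK << R_SHIFT)) | (new << R_SHIFT)


def fill_table(firtab: int, firsize: int, values) -> dict:
    """Write each value twice: ascending from the start, descending from the end."""
    cog = {}
    firp1 = firtab << R_SHIFT
    firp2 = (firsize - 1 + firtab) << R_SHIFT
    for v in values:
        cog[r_field(firp1)] = v          # result redirected via first pointer
        firp1 = step_r_field(firp1, +1)  # post-increment
        cog[r_field(firp2)] = v          # same value, mirrored position
        firp2 = step_r_field(firp2, -1)  # post-decrement
    return cog


cog = fill_table(firtab=0x100, firsize=8, values=[10, 20, 30, 40])
table = [cog[0x100 + i] for i in range(8)]
print(table)
```

Half as many loop passes as table entries are needed, since every pass writes one entry from each end.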
Here's something that makes sense when you think about it, but isn't explicitly said anywhere in the documentation: When skipping (with SKIPF/EXECF/XBYTE) more than 7 instructions after an ALTx instruction, it won't work.
Hehe, it won't be quite that. I found this in the hardware doc:
"like SKIP, but fast due to PC steps of 1..8"
The way that text is laid out in the document suggests it came from the instruction sheet. I suspect it got trimmed out of the instruction sheet at some stage.
The incremental limit of 8 does seem an unneeded burden. There must have been a reason why. I didn't follow the skipping conversations when it was developed so I don't know why myself.
EDIT: Oh, it's just the ALTx that fails. Yeah, makes complete sense because the ALTx would then be prefixing a cancelled instruction.
There is a limit of free skips to 7/8 instructions. After IIRC 7 skips a clock (or 2?) needs to be inserted as the skip continues.
I am unsure of the impact of ALT instructions. Whether they are treated any differently to other instructions Chip will need to answer.
To continue discussion, please use the discussion thread (link in top post).
Here's another funny one: When porting P1 code, make sure that when translating a MOVS that is being used to modify a jump instruction, you turn that jump into an absolute one (jmp #\whatever instead of jmp #whatever)
(This of course only works if the address is still a cog address)
Here's another thing one may not have realized: Bit 10 of an EXECF pattern (first SKIP bit) is always zero if you set the address field to the first instruction of interest
This means you can use that bit to store something interesting. Just make sure to use BITL whatever,#10 WCZ to test it, since that will clear the bit in the process.
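Here is a rough Python model of that bit-10 trick, assuming the usual EXECF layout of a 10-bit address in bits 9..0 with the skip pattern above it. execf_word() and bitl_wcz() are names invented for this sketch, and the address and pattern values are arbitrary.

```python
def execf_word(address: int, skip_pattern: int) -> int:
    """Pack a 10-bit cog/LUT address and a 22-bit skip pattern into one long."""
    return ((skip_pattern & 0x3F_FFFF) << 10) | (address & 0x3FF)


def bitl_wcz(value: int, bit: int):
    """Model of BITL D,#bit WCZ: return (value with bit cleared, original bit)."""
    flag = (value >> bit) & 1
    return value & ~(1 << bit) & 0xFFFF_FFFF, flag


# Pattern bit 0 is 0 because the first instruction of interest must execute,
# so bit 10 of the packed word is free.
word = execf_word(address=0x123, skip_pattern=0b0110)
word |= 1 << 10                  # stash a user flag in the spare bit
clean, flag = bitl_wcz(word, 10)  # test the flag and restore a valid pattern
print(hex(clean), flag)
```

Testing with a plain bit test would leave the flag set, and EXECF would then skip the first instruction, which is why the clearing variant is the one to use.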
-- waitms() is limited to 2^31 / clkfreq milliseconds (e.g., 10737 ms at 200MHz)
it should be
-- waitms() is limited to 2^31 / (clkfreq/1_000) milliseconds (e.g., 10737 ms at 200MHz)
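A quick Python check of the two formulas, for anyone who wants to plug in their own clock frequency. The function name waitms_limit_ms() is made up for this sketch; only the arithmetic comes from the lines above.

```python
def waitms_limit_ms(clkfreq: int) -> int:
    """Largest delay in milliseconds whose tick count stays below 2^31."""
    ticks_per_ms = clkfreq // 1_000
    return (1 << 31) // ticks_per_ms


clkfreq = 200_000_000
corrected = waitms_limit_ms(clkfreq)  # 2^31 / (clkfreq/1_000)
uncorrected = (1 << 31) // clkfreq    # 2^31 / clkfreq, which is really seconds
print(corrected, uncorrected)
```

The uncorrected formula comes out at about 10, which is the limit in seconds rather than milliseconds.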
In Spin we did

```
repeat while (lockset(LockID))
```

whereas in Spin2

```
repeat while not (locktry(LockID))
```

Note the reversed status return, i.e. the NOT requirement.
Postedit: Fixed the second line (remove -1 and rename)
Isn't there a REPEAT UNTIL that is the same as REPEAT WHILE NOT (except potentially faster?)
Warning:
Please be aware that the documentation "Parallax Propeller 2 Documentation 2019-09-13 v33 (Rev B silicon)" appears to be incorrect.
At 200MHz, an OUTx instruction followed by a TESTP instruction appears to require a minimum of 7 clocks between them (waitx #5).
At 200MHz, an OUTx instruction followed by a TEST instruction appears to require a minimum of 8 clocks between them (waitx #6).
Further clarification has been requested.
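For reference, the clock counts quoted with the waitx values follow from WAITX taking 2 clocks plus its operand. A small Python sketch of that arithmetic, with the time at 200MHz added (helper names are made up for illustration):

```python
def waitx_clocks(d: int) -> int:
    """WAITX takes 2 clocks plus the value of its operand."""
    return 2 + d


def clocks_to_ns(clocks: int, clkfreq: int) -> float:
    """Convert a clock count to nanoseconds at the given system clock."""
    return clocks * 1e9 / clkfreq


for d in (5, 6):
    clocks = waitx_clocks(d)
    print(f"waitx #{d}: {clocks} clocks, {clocks_to_ns(clocks, 200_000_000):.0f} ns")
```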
If you need loops in hubexec code and still want them fast, but can't dedicate the memory for them in cog/LUT, put a SETQ + RDLONG at the start of your hubexec code to copy the loop into cog RAM (or into LUT RAM with SETQ2), then jump to that copy.
This is something I certainly do for my upscaler from 320x240 to 640x480 in my video player.
You must make sure that the RDLONG address is a HUB address and not a LUT address.
Note the positioning of the labels _hub_lut_begin and _USER_LUT_BEGIN with respect to ORGH and ORG $200, or the use of @.
```
'+-------[ Load LUT code ??? ]-------------------------------------------------+
                setq2   ##_USER_LUT_END-_USER_LUT_BEGIN-1   '\ load LUT
'               rdlong  0, ##_USER_LUT_BEGIN                '/ <-- uses a LUT address (wrong)
                rdlong  0, ##_hub_LUT_BEGIN                 '/ <-- uses a hub address (correct)
'               rdlong  0, ##@_USER_LUT_BEGIN               '/ <-- uses a hub address (correct)
'+-----------------------------------------------------------------------------+
.....
                orgh
_hub_lut_begin
                org     $200                                ' LUT
_USER_LUT_BEGIN
go_lut          drvl    #57
                mov     lmm_x, #"L"
                call    #\_hubTx
                ret
_USER_LUT_END
```
Postedit: added version using @ per the following post
Please, if you want to comment, use this thread: forums.parallax.com/discussion/167812/p2-tricks-traps-differences-between-p1-discussion/p1?new=1
and keep this thread for the actual tricks and traps. Thanks.
```
                SETQ    #length-1               ' use SETQ for COG and SETQ2 for LUT
                RDLONG  where,##$80000          ' clear cog or lut
```

This works because the unmapped HUB RAM area, $80000-$FBFFF, reads as zeroes.
Note the 16KB from $7C000-$7FFFF is dual mapped to $FC000-$FFFFF.
Thanks @Wuerfel_21
Of course that won't work if we ever get a 1MB P2
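A little Python sketch of the address ranges mentioned above, for a 512KB part. The function names are made up for this illustration, and the ranges are simply the ones quoted in this post, not taken from the silicon documentation.

```python
def hub_region(addr: int) -> str:
    """Classify a 20-bit hub address using the ranges quoted above."""
    addr &= 0xF_FFFF
    if addr <= 0x7_FFFF:
        return "ram"       # real hub RAM, $00000-$7FFFF
    if addr <= 0xF_BFFF:
        return "unmapped"  # reads as zeroes, $80000-$FBFFF
    return "mirror"        # $FC000-$FFFFF, same RAM as $7C000-$7FFFF


def mirror_of(addr: int) -> int:
    """Map an address in the top 16KB window back to the RAM it aliases."""
    return addr - 0x8_0000 if hub_region(addr) == "mirror" else addr


def reads_all_zero(start: int, longs: int) -> bool:
    """True if a block read of `longs` longs stays inside the unmapped area."""
    last_byte = start + longs * 4 - 1
    return hub_region(start) == "unmapped" and hub_region(last_byte) == "unmapped"


print(reads_all_zero(0x8_0000, 512))  # a full 512-long cog or LUT clear
```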
With more people now using the P2 I thought it was worth bumping this thread.
For discussions, please use the discussion version of this thread linked in the first post. Thanks.
ALTB clashing with ADDPINS, workaround - https://forums.parallax.com/discussion/174788/altb-instruction-gotcha-trap/p1
SETSEn instructions glitch when an active trigger is still present. The prior event clearing seems to be done too early ... allowing retriggering to occur before the new mode and source are selected. Further reading - https://forums.parallax.com/discussion/comment/1558763/#Comment_1558763
NOTE: This is not an issue with regular retriggering. It's specifically with how SETSEn instructions initially set up and arm the event hardware.
I've attached a demonstration program.
The NOT instruction doesn't set the Z flag according to the instruction's result but rather according to its S operand.
```
PUB tester() | flags

  org
                mov     flags, #0
                mov     pb, #0                  ' 0
                not     pa, pb          wz      ' -1
    if_z        or      flags, #1
                sub     pb, #1                  ' -1
                not     pa, pb          wz      ' 0
                shl     flags, #1
    if_z        or      flags, #1
  end

  debug(ubin(flags))
```
That produces an output of %1 instead of the documented %10.