P2 Tricks, Traps & Differences between P1 (Reference Material Only)

Dave Hein · 2020-03-13 02:50

Jon, I think Roy is just saying that your statement about the maximum limit for milliseconds has a typo. Instead of

-- waitms() is limited to 2^31 / clkfreq milliseconds (e.g., 10737 ms at 200MHz)

it should be

-- waitms() is limited to 2^31 / (clkfreq/1_000) milliseconds (e.g., 10737 ms at 200MHz)
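
To make the arithmetic concrete, here is a minimal Spin2 sketch, assuming a 200 MHz system clock. The names max_wait_ms and wait_long_ms are hypothetical helpers, not part of Spin2, and $7FFF_FFFF stands in for 2^31 because Spin2 integer math is signed.

```spin2
CON
  _clkfreq = 200_000_000                          ' assumed system clock

PUB max_wait_ms() : ms
  ' Ticks per millisecond = clkfreq / 1_000, so the limit is about
  ' 2^31 / (clkfreq / 1_000). $7FFF_FFFF keeps the signed math positive.
  ms := $7FFF_FFFF / (clkfreq / 1_000)            ' 10737 at 200 MHz

PUB wait_long_ms(ms)
  ' Hypothetical workaround: split a long delay into 1-second chunks,
  ' each of which stays well below the limit.
  repeat while ms > 1_000
    waitms(1_000)
    ms -= 1_000
  waitms(ms)
```

Dividing clkfreq by 1_000 first gives ticks per millisecond (200_000 here), which is why the corrected formula yields milliseconds rather than seconds.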

Cluso99 · 2020-05-17 04:48

Spin v Spin2: LOCKSET v LOCKTRY

In spin we did

repeat while(lockset(LockID))

whereas in spin2

repeat while not (locktry(LockID))

Note the reversed status return, i.e. the need for NOT.

Postedit: Fixed the second line (remove -1 and rename)

Wuerfel_21 · 2020-05-17 08:12

Cluso99 wrote: »
Spin v Spin2: LOCKSET v LOCKTRY

In spin we did
repeat while(lockset(LockID))
whereas in spin2
repeat while not (locktry(cardLockID - 1))
Note the reversed status return ie the NOT requirement

Isn't there a REPEAT UNTIL that is the same as REPEAT WHILE NOT (except potentially faster?)
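
REPEAT UNTIL is available in Spin2, so the loop can be written without NOT. A minimal sketch, assuming LockID came from an earlier locknew() call; with_lock is a hypothetical method name, and no claim about relative speed is implied.

```spin2
PUB with_lock(LockID)
  repeat until locktry(LockID)                    ' locktry() returns true once the lock is ours
  ' ... critical section ...
  lockrel(LockID)                                 ' release the lock for other cogs
```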

Rayman · 2020-05-21 17:42

Don't use REP in HUBEXEC if you care about speed.

Just discovered (or maybe rediscovered) that the SD card reading code, FSRW, was twice as fast when I unrolled a REP loop in inline assembly.
The P2 documentation says this: "REP works in hub memory, as well, but executes a hidden jump to get back to the top of the repeated instructions."
What it doesn't say is that this makes it slow...

Cluso99 · 2020-05-21 22:49

Rayman wrote: »

Don't use REP in HUBEXEC if you care about speed.

Just discovered (or maybe rediscovered) that the SD card reading code, FSRW, was twice as fast when I unrolled a REP loop in inline assembly.
The P2 documentation says this: "REP works in hub memory, as well, but executes a hidden jump to get back to the top of the repeated instructions."
What it doesn't say is that this makes it slow...

It’s not just REP: any jump, call or return in hubexec forces a FIFO reload and a pause to sync with the hub egg-beater, and that is what makes hubexec slower than cog or LUT execution.

Cluso99 · 2020-05-25 03:05

I/O PIN TIMING

Warning:
Please be aware that the documentation Parallax Propeller 2 Documentation 2019-09-13 v33 (Rev B silicon) appears to be incorrect.

At 200MHz an OUTx instruction followed by a TESTP instruction appears to require a minimum of 7 clocks between them (waitx #5).

At 200MHz an OUTx instruction followed by a TEST instruction appears to require a minimum of 8 clocks between them (waitx #6).

Further clarification has been requested.
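
A sketch of the spacing described above, assuming pin 16 is a free test pin (a hypothetical choice) and that the observed minimums hold on your silicon. WAITX #n takes 2 + n clocks, which is where the 7-clock figure comes from.

```spin2
                dirh      #16                     ' make the (assumed) test pin an output
                outh      #16                     ' OUTx instruction: drive the pin high
                waitx     #5                      ' 2 + 5 = 7 clocks before sampling
                testp     #16             wc      ' C = pin state read back
' For a TEST on INA/INB instead of TESTP, the observation above suggests waitx #6.
```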

Peter Jakacki · 2020-05-25 03:21

REP was only meant for cog exec, but Chip made it execute a jmp if you try to use it in hubexec, simply for source code compatibility. I find hubexec is almost or just as fast as cog exec if you have a large linear section of code to execute. A single jump or loop will slow it all down again.
However, if you need loops and still want speed but can't dedicate the memory for it in cog/lut, you can use a SETQ + RDLONG at the start of your hubexec code to copy the code into cog (or into lut with SETQ2) and jump to that.
This is something I certainly do for my upscaler from 320x240 to 640x480 in my video player.

Cluso99 · 2020-06-15 04:44

Just fell into this trap for loading LUT.

You must make sure that the rdlong address uses a HUB address and not a LUT address.
Note the positioning of the labels _hub_lut_begin and _USER_LUT_BEGIN with respect to ORGH and ORG $200,
or the use of @

'+-------[ Load LUT code ??? ]-------------------------------------------------+
              setq2     ##_USER_LUT_END-_USER_LUT_BEGIN-1 '\ load LUT
'             rdlong    0, ##_USER_LUT_BEGIN              '/   <-- uses a LUT address (wrong)
              rdlong    0, ##_hub_LUT_BEGIN               '/   <-- uses a hub address (correct)
'             rdlong    0, ##@_USER_LUT_BEGIN             '/   <-- uses a hub address (correct)
'+-----------------------------------------------------------------------------+
.....
                        orgh
_hub_lut_begin
                        org     $200                            ' LUT
_USER_LUT_BEGIN

go_lut        drvl      #57
              mov       lmm_x,      #"L"
              call      #\_hubTx
              ret

_USER_LUT_END

Postedit: added version using @ per the following post

ersmith · 2020-06-15 12:48

You can always use "@_USER_LUT_BEGIN" if you want the hub address of "_USER_LUT_BEGIN", rather than creating a new label for this.

Cluso99 · 2020-11-17 01:25

Just noticed this really neat piece of code from JonnyMac in his jm_fullduplexserial.spin2

It shows how to easily pass a set of parameters from a program (spin2 here, but doesn't need to be) to a pasm driver. These parameters can be picked up with two pasm instructions, setq and rdlong.
Also neat is reading/writing the head and tail parameters using the ptra[n] offset.
Lastly, the use of incmod to increment the head and tail parameters.

All these are examples of efficient use of the new P2 instruction set.

However, I would change one thing. I think the "org" should be an "org 0". Mostly it will not matter, but I think there are some possibilities where it may give incorrect results. So I have taken the liberty of modifying this line.

Here is an extract of the relevant sections

var

  long  cog                                                     ' cog flag/id

  long  rxp                                                     ' rx smart pin
  long  txp                                                     ' tx smart pin
  long  rxhub                                                   ' hub address of rxbuf
  long  txhub                                                   ' hub address of txbuf

  long  rxhead                                                  ' rx head index
  long  rxtail                                                  ' rx tail index
  long  txhead                                                  ' tx head index
  long  txtail                                                  ' tx tail index

  long  txdelay                                                 ' ticks to transmit one byte

  byte  rxbuf[BUF_SIZE]                                         ' buffers
  byte  txbuf[BUF_SIZE]

.....

pub start(rxpin, txpin, mode, baud) : result | baudcfg, spmode
  .....
  cog := coginit(COGEXEC_NEW, @uart_mgr, @rxp) + 1              ' start uart manager cog

  return cog


dat { smart pin uart/buffer manager }

                org       0

uart_mgr        setq      #4-1                                  ' get 4 parameters from hub
                rdlong    rxd, ptra


uart_main       testb     rxd, #31                      wc      ' rx in use?
    if_nc       call      #rx_serial

                testb     txd, #31                      wc      ' tx in use?
    if_nc       call      #tx_serial

                jmp       #uart_main


rx_serial       testp     rxd                           wc      ' anything waiting?
    if_nc       ret

                rdpin     t3, rxd                               ' read new byte
                shr       t3, #24                               ' align lsb
                mov       t1, p_rxbuf                           ' t1 := @rxbuf
                rdlong    t2, ptra[4]                           ' t2 := rxhead
                add       t1, t2
                wrbyte    t3, t1                                ' rxbuf[rxhead] := t3
                incmod    t2, #(BUF_SIZE-1)                     ' update head index
    _ret_       wrlong    t2, ptra[4]                           ' write head index back to hub


tx_serial       rdpin     t1, txd                       wc      ' check busy flag
    if_c        ret                                             '  abort if busy

                rdlong    t1, ptra[6]                           ' t1 = txhead
                rdlong    t2, ptra[7]                           ' t2 = txtail
                cmp       t1, t2                        wz      ' byte(s) to tx?
    if_e        ret

                mov       t1, p_txbuf                           ' start of tx buffer
                add       t1, t2                                ' add tail index
                rdbyte    t3, t1                                ' t3 := txbuf[txtail]
                wypin     t3, txd                               ' load into sp uart
                incmod    t2, #(BUF_SIZE-1)                     ' update tail index
    _ret_       wrlong    t2, ptra[7]                           ' write tail index back to hub


' --------------------------------------------------------------------------------------------------

rxd             res       1                                     ' receive pin
txd             res       1                                     ' transmit pin
p_rxbuf         res       1                                     ' pointer to rxbuf
p_txbuf         res       1                                     ' pointer to txbuf

t1              res       1                                     ' work vars
t2              res       1
t3              res       1

                fit       472
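
The hub side shares the same head/tail longs with the cog. Here is a minimal sketch of a receive method that could sit in the object above, assuming the VAR names and BUF_SIZE from the extract. rx_check is a hypothetical name, and this is not claimed to be the code in jm_fullduplexserial.spin2.

```spin2
pub rx_check() : b
  ' The PASM2 cog advances rxhead; this side only ever advances rxtail.
  if rxtail == rxhead
    return -1                                     ' buffer empty
  b := rxbuf[rxtail]                              ' oldest unread byte
  if ++rxtail == BUF_SIZE                         ' wrap like INCMOD with a limit of BUF_SIZE-1
    rxtail := 0
```

Because each index has exactly one writer, this simple ring buffer needs no lock.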

Please, if you want to comment, use this thread
forums.parallax.com/discussion/167812/p2-tricks-traps-differences-between-p1-discussion/p1?new=1
and keep this thread for the actual tricks and traps. Thanks.

JonnyMac · 2020-11-17 03:20

However, I would change one thing. I think the "org" should be an "org 0". Mostly it will not matter, but I think there are some possibilities where it may give incorrect results. So I have taken the liberty of modifying this line.

Thanks, Ray. I will update my version.

evanh · 2020-12-14 12:22

I've started working on implementing a FIR filter to complement the Sinc filters in the smartpins. First part of constructing the table of taps is done and tested. In the process I've learnt a couple of things about using the ALTI prefixing instruction.

First: I hadn't quite got it that this instruction can manipulate multiple pointers at once. This is achieved like a SIMD operation. Because cogRAM is addressed in just 9 bits, there is room in a single 32-bit register to hold multiple cogRAM pointers, and so that's exactly what can be done.

Second: A more minor detail: The %RRR bits in the control word have two modes: One for manipulating the subsequent result register address and the other is a special case for re-encoding the subsequent opcode.

Anyway, here's a compact use of ALTI with %RRR to fill the FIR tap table in cogRAM:

' Build FIR table
'--------------------------------------------------------------
.step		qfrac	#1, #firsize			'calculate step angle
		getqx	.step				'retrieve the calculated incremental angle of each step
		sub	s_angle, .step			'one step back from 180 deg to move off the zero value tap
		qrotate	s_mag, s_angle			'first cordic op for filling FIR table

.tabloop
		sub	s_angle, .step		wc	'angle step, assumes stepping backward from 180 deg to 0 deg
		getqx	pa				'retrieve the calculated cosine
	if_nc	qrotate	s_mag, s_angle			'begin next cordic op

		alti	.firp1, #%111_000_000		'first half of table, post-incrementing index, start to middle
		add	pa, s_mag			'offset cosine to sit on top of the zero
		alti	.firp2, #%110_000_000		'mirrored second half of table, post-decrementing index, end to middle
		add	pa, s_mag			'ditto
	if_nc	jmp	#.tabloop
'--------------------------------------------------------------
		...
		...

.firp1		long	firtab<<19
.firp2		long	(firsize-1+firtab)<<19

s_angle		long	$8000_0000			'start angle ($8000_0000 == 180 deg)
s_mag		long	$7fff_ffff / firsize		'FIR table magnitude

firtab		res	firsize
firbuf		res	firsize

EDIT: Quality improvement: QROTATE now executes conditionally, which eliminates an extraneous case
EDIT2: Automate setting of table magnitude (s_mag)

evanh · 2020-12-14 12:35

A neat attribute of the above use of ALTI is that it not only provides the needed register indirection: manipulating with %RRR also gives the ADD instruction a true third operand. The PA and S_MAG operands are still the ALU inputs as specified in the ADD instruction, unaffected by the prefixing ALTI.

Cluso99 · 2020-12-30 22:53

Fastest way to clear COG or LUT RAM

  SETQ    #length-1       ' use SETQ for COG and SETQ2 for LUT
  RDLONG  where,##$80000  ' clear cog or lut

This works because HUB $80000-$FBFFF is an unmapped hub RAM area that reads as zeroes.
Note the 16KB from $7C000-$7FFFF is dual mapped to $FC000-$FFFFF.

Thanks @Wuerfel_21

ersmith · 2020-12-31 00:20

Cluso99 wrote: »
Fastest way to clear COG or LUT RAM
  SETQ    #length-1       ' use SETQ for COG and SETQ2 for LUT
  RDLONG  where,##$80000  ' clear cog or lut
This works because HUB $80000-$FBFFF is an unmapped hub RAM area that reads as zeroes.
Note the 16KB from $7C000-$7FFFF is dual mapped to $FC000-$FFFFF.

Thanks @Wuerfel_21

Of course that won't work if we ever get a 1MB P2

Wuerfel_21 · 2020-12-31 01:25

Even then, loading the old software wouldn't change that high RAM, so it'd still work. I guess ##$FB800 is slightly more future-proof
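
Putting the two posts together, here is a sketch that clears all of LUT RAM from the suggested address, assuming (per the posts above) that $80000-$FBFFF reads as zeroes. 512 longs span $FB800-$FBFFF, so the read stays just below the dual-mapped block at $FC000.

```spin2
                setq2     #512-1                  ' 512 longs = the whole LUT
                rdlong    0, ##$FB800             ' LUT offset 0 onward is filled with zeroes
```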

Cluso99 · 2021-03-06 03:11

With more people now using the P2 I thought it was worth bumping this thread.

For discussions, please use the discussion version of this thread linked in the first post. Thanks.

Wuerfel_21 · 2021-03-13 18:40

Here's something that makes sense when you think about it, but isn't explicitly said anywhere in the documentation: When skipping (with SKIPF/EXECF/XBYTE) more than 7 instructions after an ALTx instruction, it won't work.

evanh · 2021-03-14 05:55

Hehe, it won't be quite that. I found this in the hardware doc:

like SKIP, but fast due to PC steps of 1..8

The way that text is laid out in the document suggests it came from the instruction sheet. I suspect it got trimmed out of the instruction sheet at some stage.

The incremental limit of 8 does seem an unneeded burden. There must have been a reason why. I didn't follow the skipping conversations when it was developed so I don't know why myself.

EDIT: Oh, it's just the ALTx that fails. Yeah, makes complete sense because the ALTx would then be prefixing a cancelled instruction.

Cluso99 · 2021-03-14 08:02

Free skips are limited to 7 or 8 instructions. IIRC, after 7 skips a clock (or 2?) has to be inserted for the skip to continue.
I am unsure of the impact of ALT instructions; Chip will need to answer whether they are treated any differently from other instructions.

To continue discussion, please use the discussion thread (link in top post).

Wuerfel_21 · 2021-03-16 20:29

Here's another funny one:
When porting P1 code, make sure that when translating a MOVS that is being used to modify a jump instruction, you turn that jump into an absolute one (jmp #\whatever instead of jmp #whatever)
(This of course only works if the address is still a cog address)
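
A sketch of the translation, assuming handler is a cog-RAM label (hypothetical name) and that two intervening instructions are enough for the patched long to be fetched. The spacing is an assumption to check against the silicon documentation. On the P2, MOVS is spelled SETS.

```spin2
' P1 original, for comparison:
'               movs      jump, #handler
'               nop
' jump          jmp       #0-0

' P2 port:
                sets      jump, #handler          ' patch the low 9 bits (S field) of the jump
                nop                               ' spacing before the patched instruction
                nop                               '   (assumed sufficient)
jump            jmp       #\0-0                   ' "\" forces an absolute address, so the patched bits are the target
```

Without the backslash the assembler may encode a relative jump, in which case the patched field would be treated as an offset rather than a destination.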

Wuerfel_21 · 2022-02-27 03:15

Here's another thing one may not have realized:
Bit 10 of an EXECF pattern (first SKIP bit) is always zero if you set the address field to the first instruction of interest

This means you can use that bit to store something interesting. Just make sure to use BITL whatever,#10 WCZ to test it, since that will clear the bit in the process.
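
A minimal sketch of that trick, assuming pattern is a hypothetical register holding the EXECF word with an extra flag stored in bit 10. PA is used as scratch so the stored word is left intact.

```spin2
                mov       pa, pattern             ' work on a copy so the stored flag survives
                bitl      pa, #10         wcz     ' C and Z = original bit 10; bit 10 is now 0
                execf     pa                      ' jump to pa[9:0] with skip bits from pa[31:10]
```

The flag is still in C when the EXECF target starts running, so the routine can branch on it.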

evanh · 2022-06-14 00:48

ALTB clashing with ADDPINS, workaround - https://forums.parallax.com/discussion/174788/altb-instruction-gotcha-trap/p1

evanh · 2024-04-17 10:44

SETSEn instructions glitch when an active trigger is still present. The prior event clearing seems to be done too early ... allowing retriggering to occur before the new mode and source are selected. Further reading - https://forums.parallax.com/discussion/comment/1558763/#Comment_1558763

NOTE: This is not an issue with regular retriggering. It's specifically with how SETSEn instructions initially set up and arm the event hardware.

I've attached a demonstration program.

evanh · 2024-05-11 00:49

The NOT instruction doesn't set the Z flag according to the instruction's result but rather according to its S operand.

PUB  tester() | flags
    org
                mov     flags, #0

                mov     pb, #0                  '  0
                not     pa, pb          wz      ' -1
        if_z    or      flags, #1

                sub     pb, #1                  ' -1
                not     pa, pb          wz      '  0
                shl     flags, #1
        if_z    or      flags, #1
    end
    debug(ubin(flags))

That produces an output of %10 instead of the documented %1.
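
If the behaviour described above holds, one way to sidestep it is to leave WZ off the NOT and set Z from the result with a separate compare. This is a sketch of a possible workaround, not a documented fix.

```spin2
                not       pa, pb                  ' no WZ here
                cmp       pa, #0          wz      ' Z = 1 only when the result in pa is zero
```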
