Array of Longs in PASM2

JonnyMac · 2026-01-12 21:03

While making a suggestion in another post about putting an array into inline PASM (which works), I started some personal experiments and ran into a snag. In short, I'd like to access an array in PASM2 with a variable index, but code like I used to use in the P1 isn't working. Please -- gently -- show me where I've gone wrong. Thanks.

pub method(p_in, p_out)

  org

                        setq #(32-1)
                        rdlong array, p_in

                        mov       i, #0

loop                    mov       pntr, #array
                        add       pntr, i
                        setd      update, pntr
                        nop             
update                  shr       0-0, #1                       ' NOT WORKING       
                        incmod    i, #31                wc
        if_c            jmp       #loop

                        setq #(32-1)   
                        wrlong array, p_out

done                    ret                     

array                   res       32
pntr                    res       1
i                       res       1

  end

TonyB_ · 2026-01-12 21:44

Add another nop, or use altd, which doesn't need any nop:

loop                    altd      i, #array
update                  shr       0-0, #1
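
For comparison, the first option (a second nop) might look like the untested sketch below. It reuses the labels from the original listing; the method name shift_with_setd is made up for illustration. The assumption is that the P2 pipeline needs two instructions between setd and the instruction it patches. The sketch also uses if_nc after incmod, as in Jon's later listing, since incmod sets C only when the index wraps.

```spin2
pub shift_with_setd(p_in, p_out)                                ' hypothetical method name

  org

                        setq      #(32-1)
                        rdlong    array, p_in                   ' load 32 longs from hub into cog RAM

                        mov       i, #0

loop                    mov       pntr, #array
                        add       pntr, i
                        setd      update, pntr                  ' patch the D field of "update" in cog RAM
                        nop                                     ' spacer already in the original
                        nop                                     ' the added spacer
update                  shr       0-0, #1                       ' now runs with D = array + i
                        incmod    i, #31                wc      ' C is set when i wraps from 31 to 0
        if_nc           jmp       #loop

                        setq      #(32-1)
                        wrlong    array, p_out                  ' store the 32 longs back to hub

                        ret

array                   res       32
pntr                    res       1
i                       res       1

  end
```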

JonnyMac · 2026-01-12 21:58

That did it. Thank you, Tony.

TonyB_ · 2026-01-12 22:13

@JonnyMac said:
That did it. Thank you, Tony.

setd is an instruction I've never used but altd and alts I use a lot.

Rayman · 2026-01-12 22:29

RES in inline assembly? Surprised that works...

One thing that trips me up a lot is only adding 1 to long pointers when need to add 4...
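
A minimal, untested sketch of the difference: hub RAM is byte-addressed, so consecutive longs are 4 apart, while cog registers are long-addressed, so consecutive longs are 1 apart. The names sum_longs, table, value, idx and total are hypothetical, and the sketch assumes the method's parameters and locals are usable as registers inside the inline block.

```spin2
pub sum_longs(p_hub) : total | value, idx                       ' hypothetical names throughout

  org

                        setq      #(4-1)
                        rdlong    table, p_hub                  ' block-copy 4 longs from hub to cog RAM

                        mov       total, #0
                        rep       #3, #4                        ' walk the hub copy
                        rdlong    value, p_hub                  ' p_hub is a byte address
                        add       total, value
                        add       p_hub, #4                     ' next long is 4 BYTES further on

                        mov       idx, #0
                        rep       #3, #4                        ' walk the cog copy
                        alts      idx, #table                   ' S field = table + idx
                        add       total, 0-0
                        add       idx, #1                       ' next long is 1 REGISTER further on

                        ret

table                   res       4

  end
```

Both loops visit the same four values, so total ends up as twice their sum.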

JonnyMac · 2026-01-12 22:36

@Rayman said:
RES in inline assembly? Surprised that works...

I was surprised, too. Remember, we have about $120 (288) instructions available for inline assembly, so we do need to be mindful of that, and that variables defined with RES are going to end up with whatever is in the cog RAM at those locations.

One thing that trips me up a lot is only adding 1 to long pointers when need to add 4...

Yeah, that sometimes gets me, too.

JonnyMac · 2026-01-12 22:40

@TonyB_ said:
setd is an instruction I've never used but altd and alts I use a lot.

I was so excited about fixing with another nop that I jumped right to that. Then I popped back and saw the altd -- much nicer and saves a bit of code. This is the adjustment.

pub method(p_in, p_out)

  org

                        setq #(32-1)
                        rdlong array, p_in

                        mov       i, #0

loop                    altd      i, #array         
update                  shl       0-0, #2
                        incmod    i, #31                wc
        if_nc           jmp       #loop

                        setq #(32-1)   
                        wrlong array, p_out

done                    ret                     

array                   res       32
i                       res       1

  end

Thanks, again!

TonyB_ · 2026-01-12 23:32

@JonnyMac said:

@TonyB_ said:
setd is an instruction I've never used but altd and alts I use a lot.

I was so excited about fixing with another nop that I jumped right to that. Then I popped back and saw the altd -- much nicer and saves a bit of code.

Jon, I think you are interested in PASM2 tips and the (untested) code below is optimal for speed. Two improvements: altd/alts/altr adds sign-extended S[17:9] to D (which leaves D unchanged for 9-bit #S) and rep repeats an instruction block by jumping back to the start in zero cycles.

pub method(p_in, p_out)

  org

                        setq    #(32-1)
                        rdlong  array, p_in

                        mov     i, #0
                        rep     #2,#32
                        altd    i, s         
                        shl     0-0, #2

                        setq    #(32-1)   
                        wrlong  array, p_out

done                    ret                     

s                       long    1 << 9 + array  'add +1 to i in altd
array                   res     32
i                       res     1

  end

JonnyMac · 2026-01-13 01:47

That's pretty nifty, Tony. I'll keep both in a recipe book as the first allows random access of array elements.

I'm still learning how to decode Chip's notes in the PASM instruction list, although Ada's docs make mention of an auto-incrementing value. By using a traditional loop and debug I was able to see this working. Based on your comment, I did change the increment value. One thing... I was a little confused that # is not required with array in the line that declares s.

pub method(p_in, p_out)

  org

                        setq      #(32-1)
                        rdlong    array, p_in

                        mov       i, #0
                        mov       j, #32

loop                    debug(uhex_long(i))   
                        altd      i, s         
                        shl       0-0, #2
                        djnz      j, #loop

                        setq      #(32-1)   
                        wrlong    array, p_out

done                    ret                       

s                       long      1 << 9 + array                ' add +1 to i in altd

array                   res       32
i                       res       1
j                       res       1

  end

Again, thanks!

bob_g4bby · 2026-01-13 06:59

One thing that trips me up a lot is only adding 1 to long pointers when need to add 4...
@Rayman, thank you - me too - I was wondering why the whole buffer wasn't being processed!

Kaio · 2026-01-16 00:20

@JonnyMac said:
I'm still learning how to decode Chip's notes in the PASM instruction list, although Ada's docs make mention of an auto-incrementing value.

Jon, here's an example with the powerful ALTI instruction using auto-increment for the D field. For simplicity one needs a copy of the instruction in a register which will be modified by the ALTI instruction and then the D field is copied into the instruction in the pipeline. The original instruction (after the ALTI) is not changed in the RAM, only in the pipeline like the other ALTx instructions.

                         'mov        s, .instr           ' necessary to initialize if code is not used as inline

                         rep        @.end, #32
                         alti       s,#%111_000          ' increment D, leave S unchanged
.instr                   shl        array, #2
.end
...                                    
s                        shl        array, #2            ' instruction which is modified by ALTI
...

With the ALTI instruction it is possible to increment or decrement D and/or S and/or R field at the same time. E.g. one could also increment the D field and decrement the S field with only one ALTI instruction.

                         mov        _alti, .instr        ' necessary to initialize, otherwise the instruction will continue on next call with last values of D and S field in _alti
                         rep        @.end, #8
                         alti       _alti,#%111_110      ' increment D and decrement S 
.instr                   mov        array, buf+7         ' copy from buf[7] .. buf[0] to array[0] ... array[7]
.end
...
_alti                    res        1
array                    res        8
buf                      res        8

Unfortunately the examples in the documentation are incomplete, so it is difficult to see how the ALTI instruction works.
There are some special modes available to control the size of the increment/decrement, e.g. useful for ring buffers. But the S value for these modes exceeds the 9-bit immediate range on ALTI, so a register is necessary or an AUGS will be used.

mwroberts · 2026-01-16 04:50

Here is a PASM circular buffer I recently wrote using AltS and AltD in cog ram.
The squiggly bracket comments in the PASM line below the AltX explain clearly
what the AltX is doing...
The incmod is great for auto rollover of the head and tail index.

NewHead       mov pHeadLo,sAcLoX   'putting new data in buffer at head index
              mov pHeadHi,sAcHiX   'called when new blocktime
              'setup new tail trigger for running average
              altd pHeadIdx,#pTailTime
              mov 0{pTailTime[pHeadIdx]},pSwClk 
              altd pHeadIdx,#pTailTime
              add 0{pTailTime[pHeadIdx]},pAccTC 
              altd pHeadIdx,#pTailLoNext
              mov 0{pTailLoNext[pHeadIdx]},sAcLoX 
              altd pHeadIdx,#pTailHiNext
              mov 0{pTailHiNext[pHeadIdx]},sAcHiX 
              incmod pHeadIdx,#3
              ret

NewTail       alts pTailIdx,#pTailLoNext            'taking tail index data out of buffer
              mov pTailLo,0{pTailLoNext[pTailIdx]} 'called when pTailTime == pSwClk
              alts pTailIdx,#pTailHiNext
              mov pTailHi,0{pTailHiNext[pTailIdx]}
              incmod pTailIdx,#3
              ret

DAT
pTailIdx       long 0
pHeadIdx       long 0
pHeadLo       long 0
pHeadHi       long 0
pTailLo       long 0
pTailHi       long 0
pTailTime     long 0,0,0,0
pTailLoNext   long 0,0,0,0
pTailHiNext   long 0,0,0,0

Kaio · 2026-01-16 09:10

@JonnyMac said:
That's pretty nifty, Tony. I'll keep both in a recipe book as the first allows random access of array elements.

One thing... I was a little confused that # is not required with array in the line that declares s.

Usually you don't need the s register as the ALTD instruction is already calculating the correct address for i, it's like array[i].
See the example from @mwroberts above.

                        altd      i, #array

After clarification with @TonyB_ some messages below: there is a second use case for the ALTD/ALTS/ALTR instructions, where the index register (provided via D) is also updated. In this case S has to be a register (for performance reasons), as the value exceeds the 9-bit immediate range.
The register holds the base address of the array plus a value for the index update. The expression "1 << 9" is a 1 placed just above the 9-bit S value; it gets added to register i, which is given as D on the ALTD instruction in Jon's code above.

                        altd      i, s
...                                   
s                       long      1 << 9 + array                ' add +1 to i in altd

ManAtWork · 2026-01-16 09:56

@Kaio said:
Jon, here's an example with the powerful ALTI instruction using auto-increment for the D field.
...
With the ALTI instruction it is possible to increment or decrement D and/or S and/or R field at the same time. E.g. one could also increment the D field and decrement the S field with only one ALTI instruction.
Unfortunately the examples in the documentation are incomplete, so it is difficult to see how the ALTI instruction works.
...

Wow, I've learned something new, again. I've never used ALTI but only ALTS and ALTD because the original silicon doc was very vague about how ALTI actually works. The new Assembler manual is much better and at least explains the fields and options. But it would be even better if we knew not only HOW it works but also WHY. I mean, @cgracey for sure had some very specific intentions when he implemented those commands. An AI is very good at doing the "diligent work" of generating and cleaning up the manuals but it can't guess someone's intentions.

TonyB_ · 2026-01-16 12:41

@Kaio said:

@JonnyMac said:
That's pretty nifty, Tony. I'll keep both in a recipe book as the first allows random access of array elements.

One thing... I was a little confused that # is not required with array in the line that declares s.

You don't need the s register as the ALTD instruction is already calculating the correct address for i, it's like array[i].
See the example from @mwroberts above.
                        altd      i, #array         
...                                   
' not necessary for ALTD and ALTS ' s                       long      1 << 9 + array                ' add +1 to i in altd

The s register is needed in this case to increment D and S[17:9] holds the sign-extended increment. D cannot be changed with 9-bit #S. The inferior alternative is to use ##S but that is slower and takes just as many longs. long and word and byte values are numeric so do not require a # prefix.
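
A hedged sketch of the ##S alternative mentioned above, as a fragment meant to replace the altd/shl pair in Jon's listing (i and array are the registers declared there). It is untested and only illustrates the trade-off: the ## form makes the assembler emit an AUGS prefix, so the long saved by dropping s is spent on the prefix, and the prefix costs extra clocks on every pass.

```spin2
                        altd      i, ##((1 << 9) + array)       ' assembles as AUGS + ALTD; no separate s long
                        shl       0-0, #2                       ' D field = array + i, then i := i + 1
```

If this goes inside a rep block, the block length has to account for the extra AUGS long; the rep @label form used elsewhere in this thread sidesteps the counting.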

Kaio · 2026-01-16 15:59

@TonyB_ said:
The s register is needed in this case to increment D and S[17:9] holds the sign-extended increment. D cannot be changed with 9-bit #S. The inferior alternative is to use ##S but that is slower and takes just as many longs. long and word and byte values are numeric so do not require a # prefix.

After clarification with @TonyB_ two messages below I correct this message.
You are thinking in the days of P1. But with ALTD you don't need to know on which bit position the D is located in the instruction. The ALTD instruction will do it for you in the right way.

You need only provide the base address of your array and the index and ALTD will do the magic change of the D field for you in the pipeline.

Have you seen the example from @mwroberts above? He is doing the same.

Added on edit:
There is a second use case for the ALTD/ALTS/ALTR instructions, where the index register (provided via D) is also updated.

Kaio · 2026-01-16 16:36

@ManAtWork said:
Wow, I've learned something new, again. I've never used ALTI but only ALTS and ALTD because the original silicon doc was very vague about how ALTI actually works.

You're welcome, Nicolas. It looks like the ALTI instruction has not been used much in the past. I'm glad that I could help clarify some of the usage.

TonyB_ · 2026-01-16 18:31

@Kaio said:

@TonyB_ said:
The s register is needed in this case to increment D and S[17:9] holds the sign-extended increment. D cannot be changed with 9-bit #S. The inferior alternative is to use ##S but that is slower and takes just as many longs. long and word and byte values are numeric so do not require a # prefix.

You are thinking in the days of P1. But with ALTD you don't need to know on which bit position the D is located in the instruction. The ALTD instruction will do it for you in the right way.
You need only provide the base address of your array and the index and ALTD will do the magic change of the D field for you in the pipeline.

ALTD/ALTS/ALTR do two separate operations. They add a signed value to their D register and replace the D field, S field, or result register address of the next instruction in the pipeline. This is mentioned in the document and here is the relevant section:

REGISTER INDIRECTION

Cog registers can be accessed indirectly most easily by using the ALTS/ALTD/ALTR instructions. These instructions sum their D[8:0] and S/#[8:0] values to compute an address that is directly substituted into the next instruction's S field, D field, or result register address (normally, this is the same as the D field). This all happens within the pipeline and does not affect the actual program code. The idea is that S/# can serve as a register base address and D can be used as an index.

Additionally, S[17:9] is always sign-extended and added to the D register for index updating. Normally, a nine-bit #address will be used for S, causing S[17:9] to be zero, so that D is unaffected.
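
To illustrate the sign extension described in that second paragraph, here is an untested sketch that walks the array backwards. It assumes the substituted address is formed from D before the index update, which is what the forward-stepping code earlier in this thread relies on. The names shift_descending and s_dn are hypothetical.

```spin2
pub shift_descending(p_in, p_out)                               ' hypothetical method name

  org

                        setq      #(32-1)
                        rdlong    array, p_in

                        mov       i, #31                        ' start at the last element
                        rep       #2, #32
                        altd      i, s_dn                       ' D field = array + i, then i := i - 1
                        shl       0-0, #2

                        setq      #(32-1)
                        wrlong    array, p_out

                        ret

s_dn                    long      ($1FF << 9) | array           ' S[17:9] = $1FF, i.e. -1 once sign-extended
array                   res       32
i                       res       1

  end
```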

Kaio · 2026-01-16 19:26

@TonyB_ said:
ALTD/ALTS/ALTR do two separate operations. They add a signed value to their D register and replace the D field, S field, or result register address of the next instruction in the pipeline. This is mentioned in the document and here is the relevant section:

Sometimes it takes a third and fourth look at the code to see what's going on. Since it works that way and the documentation says so too, you are right.
But it is difficult to see in the code how the increment of register i works.

I would prefer the ALTI instruction for incrementing D in this case, as no index register is needed in the code. This is not easy to understand either, as it all happens inside the P2.
But if the index register is also needed elsewhere in the code, the ALTD solution would be better.

evanh · 2026-01-17 05:46

@Kaio said:

@ManAtWork said:
Wow, I've learned something new, again. I've never used ALTI but only ALTS and ALTD because the original silicon doc was very vague about how ALTI actually works.

You're welcome, Nicolas. It looks like the ALTI instruction has not been used much in the past. I'm glad that I could help clarify some of the usage.

I've used ALTI a number of times over the years. It was very effective for the 528-byte compute of the SD card CRC where, in addition to auto-incrementing the D register, I was able to redirect the instruction result to eliminate a subsequent load instruction, bringing the inner loop time down to match the 4-bit data rate of sysclock/3.

Here's the SD read data block CRC compute function. There are source comments, for my sanity, about the altireg fields and what ALTI is doing.

crc_check
// CRC-16 check of prior read data block
// Z is clear upon first data block of transaction, returns quickly with Z set
        mov crc3, #0    // CRC result for DAT3 pin
        mov crc2, #0    // CRC result for DAT2 pin
        mov crc1, #0    // CRC result for DAT1 pin
        mov crc0, #0    // CRC result for DAT0 pin
        mov pb, altireg
    if_z    rep @.rend, #512/4+16/8    // one SD data block + CRC

    if_z    alti    pb, #0b100_111_000    // next D-field substitution, then increment PB, result goes to PA
    if_z    movbyts pa, #0b00_01_10_11    // byte swap within longword
    if_z    splitb  pa    // every 4th bit makes a byte, 8 x 4-bit parallel to 4 x 8-bit serial
    if_z    setq    pa    // 4 bytes, one per pin
    if_z    crcnib  crc3, poly    // 16-bit polynomial
    if_z    crcnib  crc3, poly
    if_z    crcnib  crc2, poly
    if_z    crcnib  crc2, poly
    if_z    crcnib  crc1, poly
    if_z    crcnib  crc1, poly
    if_z    crcnib  crc0, poly
    if_z    crcnib  crc0, poly
.rend
// pass/fail, pass == zero result, even parity, includes received CRC from SD card
        or  crc3, crc2
        or  crc1, crc0
    _ret_   or  crc3, crc1   wz    // Z set for pass, clear for fail

poly        long 0x8408    // reversed CRC-16-CCITT (x16 + x12 + x5 + x0: 0x1021, even parity)
altireg     long pa<<19 | cogdatbuf<<9    // register PA for ALTI result substitution, and CRC buffer address

crc3        res 1
crc2        res 1
crc1        res 1
crc0        res 1

cogdatbuf   res 512/4    // longwords for SD data block
cogcrcbuf   res 16/8    // longwords for CRC nibbles, must be contiguous with the SD data block

evanh · 2026-01-17 05:52

You'll have to excuse the code alignment. The forum is using size 4 for hard tabs. I do think that's unsuitable, as most use of hard tabs is for assembly like this, which is always size 8.

Note: The movbyts pa, #0b00_01_10_11 could be written as movbyts 0-0, #0b00_01_10_11. I put PA there because that's where ALTI sends MOVBYTS's result to.

Kaio · 2026-01-17 09:12

@evanh said:
I've used ALTI a number of times over the years. It was very effective for the 528-byte compute of the SD card CRC where, in addition to auto-incrementing the D register, I was able to redirect the instruction result to eliminate a subsequent load instruction, bringing the inner loop time down to match the 4-bit data rate of sysclock/3.

Thank you, @evanh, for sharing this code snippet. This is a very interesting use of the ALTI instruction, which enables the use of 3 operands for a P2 instruction.
Spelled out, this is what ALTI makes the next instruction do:

movbyts pa, cogdatbuf++, #0b00_01_10_11  ' R, D, S

'  this is the equivalent 
mov     pa, cogdatbuf++
movbyts pa, #0b00_01_10_11

This is possible by using the upper three bits of S on the ALTI instruction.

alti    pb, #0b100_111_000

results in
D=altireg        ' long pa<<19 | cogdatbuf<<9

S=#0b100_111_000
100 controls to use another register for the result (PA from altireg)
111 auto-increment D field from altireg (cogdatbuf)
000 no change on S field

evanh · 2026-01-17 10:34

Your dissection looks good. Yes, ALTI provides effective 3-operand instructions. ARM CPUs got that one very right.
I'd indicate a pseudo-indirection syntax like this: mov pa, [cogdatbuf++]

Amusingly, register-indirect-register I doubt has ever existed in any CPU architecture. CPUs just don't have that many general registers normally. Maybe GPU has something like it.
