FlexSpin: is mixed HUB/LUT exec possible?

TonyB_ · 2023-09-11 17:52

@ManAtWork said:

@evanh said:
Tony used a technical term there - "Direct Addressing" is the way an instruction is encoded for fetching operand data. Any of the 512 cogRAM addresses can be directly specified in both the encoded S and D fields of an instruction. But, for lutRAM and hubRAM, because of the extra indexing options in the RDLUT/WRLUT instructions, only the first 256 addresses can be directly specified this way and only in the S field.

I'm pretty sure that this does not affect my code if I place the instructions in LUT ram and the data in COG ram. Jumps and branches, REP etc. are PC relative so they still work in LUT ram.

Executing instructions in LUT RAM that access data in cog RAM is fine. Note that some branch instructions, indicated by Call/Jump to S** in the spreadsheet, have signed-relative addressing in the #S form, which means you cannot branch from everywhere in cog RAM to anywhere in LUT RAM or vice-versa, unless you use S. Usually, though, you can move code by hand so that the branch address is within the PC +/- 256 range for #S.
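
The reach of a #S relative branch can be checked with a few lines of Python. This is only a sketch under stated assumptions: the offset is taken to be a 9-bit signed value (-256..+255) counted in instructions from the address following the branch, with cog RAM at $000..$1FF and LUT RAM at $200..$3FF. The function names are made up for illustration.

```python
# Sketch of the reach of a 9-bit signed-relative "#S" branch (the "S**" forms).
# Assumption: the offset is counted in instructions from the address after
# the branch and must fit in -256..+255.

COG_START = 0x000   # first cog RAM register
LUT_START = 0x200   # first LUT RAM location when executing from LUT


def rel9_offset(branch_addr: int, target_addr: int) -> int:
    """Offset that would have to be encoded for a relative #S branch."""
    return target_addr - (branch_addr + 1)


def reachable_rel9(branch_addr: int, target_addr: int) -> bool:
    """True if the target fits into a 9-bit signed relative offset."""
    return -256 <= rel9_offset(branch_addr, target_addr) <= 255


if __name__ == "__main__":
    # Near the cog/LUT boundary a relative branch can cross over ...
    print(reachable_rel9(0x1F0, 0x210))
    # ... but from low cog RAM the same LUT target is out of range.
    print(reachable_rel9(0x010, 0x210))
```

If the check fails, the options are the ones already mentioned: move the code closer, or put the target address in a register and use the S form.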

evanh · 2023-09-11 22:32

@ManAtWork said:
I'm pretty sure that this does not affect my code if I place the instructions in LUT ram and the data in COG ram. Jumps and branches, REP etc. are PC relative so they still work in LUT ram.

Correct. Addressing modes are about data accesses, not code fetching.

ManAtWork · 2023-09-12 07:20

@pik33 said:
Instead of rdword/add inx/rdword to get 2 samples to interpolate you can use rdlong and getword. rdlong does not have to be long aligned in P2. That saves one hub read, and the cost is one getword and one clock penalty for unaligned long read.
mul/add/shr maybe can be sped up using scas: scas returns an already shifted result in Q

Good point. Reading the two table entries at once also has the advantage that I can swap the two MULs which eliminates one of the SUBRs. The first two MULs can't be substituted by SCA because of the odd shifting of #9 but the third can.

              mov    inx,theta
              testb  inx,#31 wz                 ' bit 31 -> Z
              shl    inx,#2 wc                  ' bit 30 -> C
        if_c  not    inx                        ' mirror 2nd and 4th quadrant
              getbyte ipol,inx,#2               ' fraction -> interpolate
              shr    inx,#24                    ' MSB -> table index
              shl    inx,#1
              add    inx,adrTable
              rdlong sin1,inx                   ' fetch two consecutive table entries
              getword sin2,sin1,#1              ' 2nd table entry
              mul    sin2,ipol
              subr   ipol,#$100
              mul    sin1,ipol
              add    sin1,sin2
              shr    sin1,#9                    ' 8 bits + /2 for average of sin1+sin2
              sca    sin1,ampl
              negz   sin1                       ' flip 3rd and 4th quadrant
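
For checking the arithmetic off-chip, the routine above can be modelled step by step in Python. This is a sketch, not the original table or scaling: it assumes a hypothetical 257-entry table of unsigned 16-bit first-quadrant sine samples (one extra entry so that index 255 can still fetch a neighbour), and it models the unaligned RDLONG simply as two adjacent table lookups.

```python
import math

MASK32 = 0xFFFFFFFF

# Hypothetical table contents: 257 unsigned 16-bit samples of the first
# quadrant. The real table in hub RAM may be scaled differently.
TABLE = [round(65535 * math.sin(math.pi / 2 * i / 256)) for i in range(257)]


def sine_interp(theta: int, ampl: int) -> int:
    """Model of the PASM sequence; theta is a 32-bit angle (2**32 = 360 deg)."""
    theta &= MASK32
    z = (theta >> 31) & 1                   # testb inx,#31 wz : bit 31 -> Z
    c = (theta >> 30) & 1                   # shl inx,#2 wc    : bit 30 -> C
    inx = (theta << 2) & MASK32
    if c:
        inx ^= MASK32                       # if_c not inx     : mirror quadrant
    ipol = (inx >> 16) & 0xFF               # getbyte ipol,inx,#2
    idx = inx >> 24                         # shr inx,#24
    sin1 = TABLE[idx]                       # rdlong: low word of the long
    sin2 = TABLE[idx + 1]                   # getword sin2,sin1,#1
    sin2 *= ipol                            # mul sin2,ipol
    ipol = 0x100 - ipol                     # subr ipol,#$100
    sin1 *= ipol                            # mul sin1,ipol (uses low word only)
    sin1 = (sin1 + sin2) >> 9               # add / shr #9
    sin1 = (sin1 * (ampl & 0xFFFF)) >> 16   # sca sin1,ampl feeds the next S
    return -sin1 if z else sin1             # negz sin1


if __name__ == "__main__":
    for deg in (0, 30, 90, 135, 210, 270):
        theta = int(deg / 360 * 2**32) & MASK32
        print(deg, sine_interp(theta, 0x8000))
```

With this table the output comes out at roughly ampl/2 times the sine of the angle, because of the extra halving in the SHR #9.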

rogloh · 2023-09-12 08:57

@ManAtWork said:

@evanh said:
Tony used a technical term there - "Direct Addressing" is the way an instruction is encoded for fetching operand data. Any of the 512 cogRAM addresses can be directly specified in both the encoded S and D fields of an instruction. But, for lutRAM and hubRAM, because of the extra indexing options in the RDLUT/WRLUT instructions, only the first 256 addresses can be directly specified this way and only in the S field.

I'm pretty sure that this does not affect my code if I place the instructions in LUT ram and the data in COG ram. Jumps and branches, REP etc. are PC relative so they still work in LUT ram.

I recall in some cases you might have to use absolute addresses when calling LUTRAM from LUTRAM. I had something like that in my memory driver which had to call code in LUT RAM and I think I had to put in the #\ to make it work. Not sure if this was an assembler limitation at the time or something else.

Code in LUT RAM:

r_resume_burst              getnib  b, request, #0          ' a b c d e    get bank parameter LUT address
                            rdlut   b, b                    ' a b c d e    get bank limit/mask
                            bmask   mask, b                 ' | | | d e    build mask for addr
      <snip>
w_burst                     mov     orighubsize, count      '  a b c | e        save original hub size
w_locked_fill               cmp     count, #1 wz            '  a b c d |   g    optimization for single transfers
                            shl     count, a                '  a b c | |   |    scale into bytes
                            tjz     count, #nowrite_lut     '  a b c d e   |    check for any bytes to write
w_resume_burst              mov     c, count                '  a b c d e f g h  get the number of bytes to write
                            call    #\r_resume_burst        '  a b c d e f g h  get per bank limit and read delay info  <----- NOTE: #\ syntax

TonyB_ · 2023-09-12 09:19

@rogloh said:
I recall in some cases you might have to use absolute addresses when calling LUTRAM from LUTRAM. I had something like that in my memory driver which had to call code in LUT RAM and I think I had to put in the #\ to make it work. Not sure if this was an assembler limitation at the time or something else.

Code in LUT RAM:

r_resume_burst              getnib  b, request, #0          ' a b c d e    get bank parameter LUT address
                            rdlut   b, b                    ' a b c d e    get bank limit/mask
                            bmask   mask, b                 ' | | | d e    build mask for addr
      <snip>
w_burst                     mov     orighubsize, count      '  a b c | e        save original hub size
w_locked_fill               cmp     count, #1 wz            '  a b c d |   g    optimization for single transfers
                            shl     count, a                '  a b c | |   |    scale into bytes
                            tjz     count, #nowrite_lut     '  a b c d e   |    check for any bytes to write
w_resume_burst              mov     c, count                '  a b c d e f g h  get the number of bytes to write
                            call    #\r_resume_burst        '  a b c d e f g h  get per bank limit and read delay info  <----- NOTE: #\ syntax

This is skipping related and applies to cog RAM exec and LUT RAM exec. From the doc:

Special SKIPF Branching Rules

Within SKIPF sequences where CALL/CALLPA/CALLPB are used to execute subroutines in which skipping will be suspended until after RET, all CALL/CALLPA/CALLPB immediate (#) branch addresses must be absolute in cases where the instruction after the CALL/CALLPA/CALLPB might be skipped. This is not possible for CALLPA/CALLPB but CALL can use '#\address' syntax to achieve absolute immediate addressing. CALL/CALLPA/CALLPB can all use registers as branch addresses, since they are absolute.

ManAtWork · 2023-09-12 09:55

IMHO, SKIPF is evil. I've used it only once. In some cases it can really save a lot of instructions, but it is so error-prone, especially when jumps, calls or branches are involved, that it has to be used with great care.

TonyB_ · 2023-09-12 11:59

I think SKIPF is the best feature of P2. When used more than trivially, it really helps if skip patterns can be created automatically. For complicated code that's essential. https://forums.parallax.com/discussion/171125/skip-patterns-generated-automatically

Example:

' ROTATES & SHIFTS (8086 emulator)

'                               previous sign bit
'ROLb           '''a            C
'RORb           ''b                     6
'RCLb           ''c             C
'RCRb           ''d                     6
'SHLb           ''e             C
'SHRb           ''f                     6
'SETMOb         ''g             C                       not "SALb"
'SARb           ''h                     6

'ROLw           ''A             C
'RORw           ''B                     14
'RCLw           ''C             C
'RCRw           ''D                     14
'SHLw           ''E             C
'SHRw           ''F                     14
'SETMOw         ''G             C                       not "SALw"
'SARw           ''H                     14

'combined byte/word version
'count = rotate/shift bit count, > 0
'width_msb = msb = 7 or 15
'nc,nz

                movbyts dest,#%%1010                    '|||||||| ABCDE|G|
                movbyts dest,#%%0000                    'abcde|g| ||||||||
                signx   dest,width_msb                  '|||||||h |||||||H
                fle     count,#16                       '||||efgh ||||EFGH      shift 16 max
                testb   F,#C_bit                wc      '||cd|||| ||CD||||      c = old CF
                modz    _set                    wz      '||cd|||| ||||||||      z/nz if b/w
'operation
                rol     dest,count              wc      'a||||||| A|||||||      c = CF
                ror     dest,count              wc      '|b|||||| |B||||||      c = CF
                call    #\RCL_dest_count                '||c||||| ||C|||||      c = CF
                call    #\RCR_dest_count                '|||d|||| |||D||||      c = CF
                shl     dest,count              wc      '||||e|g| ||||E|G|      c = CF
                shr     dest,count              wc      '|||||f|h |||||F|H      c = CF
                not     dest,#0                         '||||||g| ||||||G|      = -1
'flags
                andn    F,#AF_CF_mask                   'abcdefgh ABCDEFGH      write AF = 0; CF = 0 if gG
                bitc    F,#C_bit                        'abcdef|h ABCDEF|H      write CF
                testb   dest,#6                 wc      '|b|d|f|h ||||||||
                testb   dest,#14                wc      '|||||||| |B|D|F|H
                testb   dest,width_msb          xorc    'abcdefgh ABCDEFGH      c = OF
                bitc    F,#V_bit                        'abcdefgh ABCDEFGH      write VF
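
The column annotations in the listing above carry enough information to derive the skip masks mechanically. The following Python sketch shows one way to do it; it is not the tool from the linked thread. It assumes that SKIPF consumes its pattern LSB first, that a 1 bit means "skip this instruction", and that bit 0 belongs to the first instruction of the sequence. The function name and the row list are made up for illustration; the rows are copied from the comments in the listing.

```python
def skip_patterns(rows):
    """Build one skip mask per variant column.

    rows: one string per instruction, one character per variant.
    '|' means the instruction is skipped for that variant, any other
    character is the variant's letter and means it is executed.
    """
    patterns = {}
    for col in range(len(rows[0])):
        # The variant is named after the letter that appears in its column.
        name = next(row[col] for row in rows if row[col] != "|")
        mask = 0
        for bit, row in enumerate(rows):
            if row[col] == "|":
                mask |= 1 << bit        # 1 = skip (assumed SKIPF convention)
        patterns[name] = mask
    return patterns


# Column strings from the rotate/shift listing, in instruction order.
ROWS = [s.replace(" ", "") for s in [
    "|||||||| ABCDE|G|",   # movbyts dest,#%%1010
    "abcde|g| ||||||||",   # movbyts dest,#%%0000
    "|||||||h |||||||H",   # signx   dest,width_msb
    "||||efgh ||||EFGH",   # fle     count,#16
    "||cd|||| ||CD||||",   # testb   F,#C_bit wc
    "||cd|||| ||||||||",   # modz    _set wz
    "a||||||| A|||||||",   # rol
    "|b|||||| |B||||||",   # ror
    "||c||||| ||C|||||",   # call    #\RCL_dest_count
    "|||d|||| |||D||||",   # call    #\RCR_dest_count
    "||||e|g| ||||E|G|",   # shl
    "|||||f|h |||||F|H",   # shr
    "||||||g| ||||||G|",   # not     dest,#0
    "abcdefgh ABCDEFGH",   # andn    F,#AF_CF_mask
    "abcdef|h ABCDEF|H",   # bitc    F,#C_bit
    "|b|d|f|h ||||||||",   # testb   dest,#6 wc
    "|||||||| |B|D|F|H",   # testb   dest,#14 wc
    "abcdefgh ABCDEFGH",   # testb   dest,width_msb xorc
    "abcdefgh ABCDEFGH",   # bitc    F,#V_bit
]]

PATTERNS = skip_patterns(ROWS)

if __name__ == "__main__":
    for name, mask in PATTERNS.items():
        print(f"{name}: %{mask:019b}")
```

Generating the masks from the annotations keeps them in step with the code whenever an instruction is inserted or moved, which is where hand-written patterns tend to go wrong.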

ManAtWork · 2023-09-13 07:32

Yes, SKIPF can be helpful for interpreter and emulator kind of applications where the data flow is always similar but there are many different cases of what to do with the data. But for normal linear and procedural programming I just don't need it very often.

Unfortunately, for my VFD application, things are much more complicated than I first thought. The signals are noisy so I have to apply filtering. This is no problem because the main control loop runs at least 20 times slower than the interrupts processing the ADC samples. But I have to apply the filtering after the Park transformation because applying filtering to a rotating vector causes phase lag and jitter if the timing is not synchronized. After the Park transformation the phase relation is quasi-static or at least changes at a much slower rate.
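
Why filtering is easier after the Park transformation can be seen in a small floating-point sketch. It uses the textbook form of the transform with one common sign convention, which is an assumption and not taken from the actual VFD code (on the P2 the rotation would be done by the CORDIC). A vector that rotates in the stationary alpha/beta frame turns into a constant d/q pair once it is rotated back by the rotor angle.

```python
import math


def park(alpha: float, beta: float, theta: float) -> tuple[float, float]:
    """Rotate a stationary-frame vector (alpha, beta) by -theta into d/q."""
    c, s = math.cos(theta), math.sin(theta)
    d = alpha * c + beta * s
    q = -alpha * s + beta * c
    return d, q


def rotating_vector(amplitude: float, theta: float, phase: float) -> tuple[float, float]:
    """Hypothetical test signal: a vector rotating with the angle theta."""
    return (amplitude * math.cos(theta + phase),
            amplitude * math.sin(theta + phase))


if __name__ == "__main__":
    phase = math.radians(20.0)
    for step in range(4):
        theta = step * math.pi / 3
        alpha, beta = rotating_vector(1.0, theta, phase)
        d, q = park(alpha, beta, theta)
        # alpha/beta change with every step, d/q stay put
        print(f"{alpha:+.3f} {beta:+.3f} -> d={d:+.3f} q={q:+.3f}")
```

Because d and q are nearly constant, a low-pass filter applied to them removes noise without adding the angle-dependent lag it would cause on alpha and beta.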

So I have to do all the vector math in the interrupt service routine and I definitely need the CORDIC. The main control loop then becomes rather trivial. It should be just a few lines of code which can be called from the motion control loop which also runs once per millisecond and hopefully has some CPU time left over. So I can use a dedicated cog running PASM for the VFD without the need for interrupts combined with compiled Spin code.
