flexspin compiler for P2: Assembly, Spin, BASIC, and C in one compiler

__deets__ · 2026-02-08 19:20

Ok so I found the culprit. Years ago I wrote a program that opens the serial port and just prints out what it reads. I call it stail, for serial tail. It detects when the propeller is being programmed and shuts down the serial port so loadp2 can do its work, then waits a second and tries to re-open the port.

And it appears that it has changed its behavior, possibly because of my new underlying Linux or something. After fiddling with it, I don't seem to suffer from this stray reset. And guess what, the not-working.spin2 happens to work.

DIEZ 2026-02-01: sx1268_init:0
sx1268-send-test: 0
Lora1: sx1268: start send test.

Lora1: sx1268: chip is busy.

Lora1: sx1268: set standby failed.

Lora1: sx1268: chip is busy.

sx1268-send-test:failed

So more investigation needed.

ManAtWork · 2026-03-04 18:16

I just tried this:

VAR
  long  position
  long  timeStamp

PUB GetPosition (): pos | p, t, ct, dt, v, h
  p:= @position
  ORG
        setq    #1              ' atomic read of p+t
        rdlong  p,p
  END
  ct:= getct ()
...

position and timeStamp are written by another cog. I assumed that p and t are consecutive registers which are later used in another inline assembler section. I think Chip used tricks like this in P1 Spin a lot, relying on the order of stack parameters and local variables in memory. But in P1 Spin those were in hub RAM, not cog RAM. The problem is that FlexSpin assigns registers in the order the variables appear in the code of the function, not in the list of local variables in the header. So _var01 is allocated to p and _var02 to ct instead of t.

I know this is not a bug. It was never promised to work, just my optimistic assumption. But is there a way to make it work? If I do

PUB GetPosition (): pos | p, t, ct, dt, v, h
  p:= position
  t:= timeStamp
...

instead my code works but, of course, the access to p and t is not atomic. SETQ + RDLONG is a nice way to avoid locks and delays.

Wuerfel_21 · 2026-03-04 18:34

@ManAtWork said:
But is there a way to make it work?

Easiest would be to add {++opt(!fast-inline-asm)} between PUB/PRI and the function name. That'll force it to do the whole interpreter-style variable copy (slow and nasty) for ORG/END sections.

But I think if the variable you read into is a quad, array or structure, it will always place that in consecutive registers, so try that first.

evanh · 2026-03-04 23:10

If those locals are used exclusively by Pasm then move them into a group of labelled RES's at the end of the ORG section.

VAR
  long  position
  long  timeStamp

PUB GetPosition (): pos | ct, dt, v, h
  ct:= @position
  ORG
        setq    #1              ' atomic read of p+t
        rdlong  p,ct
        ...
        ret

p       res 1
t       res 1
  END
  ct:= getct ()

evanh · 2026-03-05 07:43

Eric,
Looks like a regression in the includes at some point. The newer _sdmm_open() base function, used by _vfs_open_sdcardx(), for starting up Flexspin's built-in SD driver is not declared anywhere now. I've previously been directly using this function due to its compatibility with the external driver plug-in mechanism. To get it in scope now, I place the following line in my top source file:

vfs_file_t *_sdmm_open(int pclk, int pss, int pdi, int pdo) _IMPL("filesys/block/sdmm_vfs.c");

ManAtWork · 2026-03-05 10:34

@Wuerfel_21 said:
But I think if the variable you read into is a quad, array or structure, it will always place that in consecutive registers, so try that first.

I didn't know that I can declare local array variables, so far. But I tried and it seems to work.

PUB GetPosition (): pos | p[2], t, ct, dt, v, h
  p:= @position
  ORG
        setq    #1              ' atomic read of p+t
        rdlong  p,p
  END
  t:= p[1]
  ct:= getct ()

@evanh Your idea is also good but unfortunately doesn't work in my case. I have a second inline assembler section after some calculations in Spin2 and I couldn't access the p and t res 1 variables from the second section. Ok, it would work with some additional copy instructions but that is way more complicated than the solution above. But thanks, anyway.

evanh · 2026-03-05 10:59

I'd never have two Pasm sections split. That's double dipping on the Fcache overheads. I sure hope there's a damn good reason not to just merge them.

ManAtWork · 2026-03-05 11:44

@evanh said:
I'd never have two Pasm sections split. That's double dipping on the Fcache overheads. I sure hope there's a damn good reason not to just merge them.

Yeah, you're absolutely right. I could translate the Spin code between the two assembler sections to assembler, too. I kept it because it was easier to debug and also for historical reasons. This was a P1 program, once. I've ported it to the P2 and added the 64 bit arithmetic later.

PUB GetPosition (): pos | p[2], t, ct, dt, v, h
  p:= @position
  ORG
        setq    #1              ' atomic read of p+t
        rdlong  p,p
  END
  t:= p[1]
  ct:= getct ()
  dt:= ct - lastTime
  if p <> lastPos       ' position changed?
    last2Pos:= lastPos
    last2Time:= lastTime
    lastPos:= p
    lastTime:= t
  elseif dt > maxPeriod ' < min. speed
    last2Pos:= p                ' force v=0
    lastTime:= ct - maxPeriod
    last2Time:= lastTime - maxPeriod    ' avoid timer overflow
  else
    lastTime:= t ' update in the case dithering changed t
  ' v:= (lastPos - last2Pos) * cycleTime / (lastTime - last2Time)
  ' pos:= lastPos + v*dt / cycleTime
  p:= lastPos - last2pos
  t:= lastTime - last2Time
  ct:= cycleTime
  ORG ' use 64 bit arithmetic because v*dt or p*cycleTime could be >2^31
        abs     p wc
        qmul    p,ct ' (lastPos - last2Pos) * cycleTime
        getqx   p
        getqy   h
        setq    h
        qdiv    p,t             ' / (lastTime - last2Time)
        getqx   v
        qmul    v,dt            ' * dt
        getqx   v
        getqy   h
        setq    h
        qdiv    v,ct   ' / cycleTime
        getqx   p
        negc    p
  END
  pos:= lastPos + p
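
For checking the arithmetic off-target, the second ORG block above can be modelled in portable C. This is only a sketch under assumptions: the function name extrapolate_delta and its parameter names are made up for illustration, the time difference and cycleTime are taken to be nonzero, and the quotients are assumed to fit in 32 bits, as the GETQX results do. It is not the original code, just a host-side reference for the same sequence of steps.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of the second ORG block.
 * dp        = lastPos - last2Pos   (signed)
 * dtPrev    = lastTime - last2Time (assumed nonzero)
 * cycleTime = nominal cycle time   (assumed nonzero)
 * dt        = ct - lastTime
 * Returns the signed offset that is added to lastPos. */
static int32_t extrapolate_delta(int32_t dp, uint32_t dtPrev,
                                 uint32_t cycleTime, uint32_t dt)
{
    assert(dtPrev != 0 && cycleTime != 0);

    int negative = dp < 0;                               /* abs p wc      */
    uint32_t mag = negative ? 0u - (uint32_t)dp : (uint32_t)dp;

    uint64_t prod = (uint64_t)mag * cycleTime;           /* qmul p,ct     */
    uint32_t v = (uint32_t)(prod / dtPrev);              /* setq h, qdiv  */

    prod = (uint64_t)v * dt;                             /* qmul v,dt     */
    uint32_t d = (uint32_t)(prod / cycleTime);           /* setq h, qdiv  */

    uint32_t r = negative ? 0u - d : d;                  /* negc p        */
    return (int32_t)r;
}
```

With dp = 100000 and cycleTime = 200000000 the first product is 2*10^13, which does not fit in 32 bits. That is the case the 64 bit QMUL/QDIV path is there for.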

ersmith · 2026-03-05 12:34

@evanh said:
Eric,
Looks like a regression in the includes at some point. The newer _sdmm_open() base function, used by _vfs_open_sdcardx(), for starting up Flexspin's built-in SD driver is not declared anywhere now. I've previously been directly using this function due to its compatibility with the external driver plug-in mechanism. To get it in scope now, I place the following line in my top source file:
vfs_file_t *_sdmm_open(int pclk, int pss, int pdi, int pdo) _IMPL("filesys/block/sdmm_vfs.c");

I don't think that function was ever declared in header files, it was probably getting dragged in as a side effect of something else. Anyway, I've added an explicit declaration now. Thanks for catching this.

ersmith · 2026-03-05 12:38

@ManAtWork said:

@Wuerfel_21 said:
But I think if the variable you read into is a quad, array or structure, it will always place that in consecutive registers, so try that first.

I didn't know that I can declare local array variables, so far. But I tried and it seems to work.

Yes, that should work. So should structures, but if you're directly manipulating elements in assembly language, using an array might be clearer.

Wuerfel_21 · 2026-03-05 12:58

@ersmith said:
I don't think that function was ever declared in header files, it was probably getting dragged in as a side effect of something else. Anyway, I've added an explicit declaration now. Thanks for catching this.

Remember, I separated out some of the SDMM stuff because it was getting compiled in and bloating binaries even when using an external block driver (SDSD). I think dead code elimination doesn't work well for functions that are referred to by function pointers somewhere.
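
The function pointer point can be illustrated with a small, generic C sketch. Everything here is hypothetical (block_driver_t, builtin_read, external_read and read_with are invented names, not flexspin library code). It only shows why a simple reachability check has to keep a function once its address sits in a table, even if that entry is never selected at run time.

```c
#include <stddef.h>

/* Hypothetical driver interface: one hook, reached only through a pointer. */
typedef struct {
    int (*read_block)(unsigned lba);
} block_driver_t;

/* Stand-in for a built-in driver routine. */
static int builtin_read(unsigned lba)  { return (int)lba + 1; }

/* Stand-in for an external (plug-in) driver routine. */
static int external_read(unsigned lba) { return (int)lba + 2; }

/* Both addresses are stored as data. A dead code pass that only follows
 * direct calls cannot tell which entries will be used, so it has to keep
 * builtin_read even when the program only ever picks entry 1. */
static const block_driver_t drivers[] = {
    { builtin_read },
    { external_read },
};

/* Indirect call: the target is only known at run time. */
static int read_with(size_t which, unsigned lba)
{
    return drivers[which].read_block(lba);
}
```

Moving the built-in routines into a separate file that is only pulled in on request sidesteps the problem, which matches the split described above.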

ersmith · 2026-03-05 14:26

@Wuerfel_21 said:

@ersmith said:
I don't think that function was ever declared in header files, it was probably getting dragged in as a side effect of something else. Anyway, I've added an explicit declaration now. Thanks for catching this.

Remember, I separated out some of the SDMM stuff because it was getting compiled in and bloating binaries even when using an external block driver (SDSD). I think dead code elimination doesn't work well for functions that are referred to by function pointers somewhere.

Right, but I don't think just declaring the function should cause any issues. I looked at your patch that did the re-arrangement, and I don't think sdmm_open was declared before that either (unless I missed something).

evanh · 2026-03-06 00:00

.../include/filesys/block/sdmm_vfs.c:60: warning: Redefining function or subroutine _sdmm_open with an incompatible type

I don't think struct vfs was a good choice.

ersmith · 2026-03-06 12:21

@evanh said:
.../include/filesys/block/sdmm_vfs.c:60: warning: Redefining function or subroutine _sdmm_open with an incompatible type
I don't think struct vfs was a good choice.

Oops yes, thanks. I've pushed an update.

ManAtWork · 2026-03-11 17:46

I've just tried to recompile some code I've written in 2022. IIRC it worked back then and didn't throw any error messages. But now I get

B2I_Axis.c:290: warning: Deleting apparently unused cordic instruction qmul

void BackpropAsm (AxisBuffEntry* buffer, int wri, int rdi)
{
    uint32_t dlo = 0;
    uint32_t dhi = 0;
    uint32_t v2lo = halfAcc2;
    uint32_t v2hi = 0;
    fixp30_t v1, vMax;
    uint32_t* ptr;
    uint32_t hA = halfAcc;
    uint32_t nA = nomAcc;
    uint32_t mlo, mhi;
    __asm{
.loop
        decmod wri,#AXIS_BUFFER_SIZE-1
        mov    ptr,wri
        shl    ptr,#6
        add    ptr,buffer
        add    ptr,#40   // v1= buffer[i].vNom
        rdlong v1,ptr
        mov    vMax,hA
        add    vMax,v1
        qmul   vMax,vMax // vMax² // <- here, line 290
        add    dlo,v1 wc
        addx   dhi,#0
        qmul   v1,nA     // v1 * nomAcc
        add    ptr,#8
        setq   #2-1
        wrlong dlo,ptr   // buffer[i].dDec= dist
        getqx  mlo
        getqy  mhi
        shr    mlo,#30
        mov    v1,mhi
        shl    v1,#2
        shr    mhi,#30
        or     mlo,v1
        cmp    v2lo,mlo wcz // if (v2 > vMax²) break
        cmpx   v2hi,mhi wcz
        getqx  mlo
        getqy  mhi
  if_a  jmp    #.break
        mov    v1,mlo
        shl    mhi,#3
        shl    mlo,#3
        shr    v1,#32-3
        or     mhi,v1
        add    v2lo,mlo wc // v2+= (v1 * nomAcc) * 8
        addx   v2hi,mhi
        cmp    wri,rdi wz
  if_nz jmp    #.loop
.break
    }
}

Ok, I can use __asm const. But "improving" the optimization leads to breaking code that has worked, before.

Wuerfel_21 · 2026-03-11 18:34

@ManAtWork said:
I've just tried to recompile some code I've written in 2022. IIRC it worked back then and didn't throw any error messages. But now I get

B2I_Axis.c:290: warning: Deleting apparently unused cordic instruction qmul

Ok, I can use __asm const. But "improving" the optimization leads to breaking code that has worked, before.

Pipelined CORDIC operations were never guaranteed to work in a non-const/volatile context. The compiler is free to delete, add, or move instructions there, and the semantics of pipelined operations depend on exact timing (i.e. if the result window is missed, an instruction is skipped). This is too difficult to deal with, so CORDIC-related IR transformations are written with the assumption of a single operation in flight at a time. Removing a QMUL when there's no associated GETQX (e.g. has been removed by dead code pass) is necessary for correctness (this was actually causing bugs at some point).
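
As a minimal sketch of what "more than one operation in flight" looks like, here is a hypothetical helper (mul_pair is an invented name) that issues two QMULs back to back and then collects both results, following the same qmul, qmul, getqx/getqy, getqx/getqy pattern as the code quoted above. The P2 branch uses __asm const so the optimizer leaves the sequence alone. It is untested on hardware and assumes locals can be named directly inside the block, as in the quoted code. The #else branch is a plain C fallback so the function can be checked on a host.

```c
#include <stdint.h>

/* Hypothetical helper: two unsigned 32x32 -> 64 bit multiplies.
 * On P2 both CORDIC commands are in flight before the first result is read. */
static void mul_pair(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                     uint64_t *ab, uint64_t *cd)
{
    uint32_t lo1, hi1, lo2, hi2;
#ifdef __propeller2__
    /* const: instruction order and count must stay exactly as written */
    __asm const {
        qmul   a, b
        qmul   c, d
        getqx  lo1
        getqy  hi1
        getqx  lo2
        getqy  hi2
    }
#else
    /* portable fallback for host builds */
    uint64_t p1 = (uint64_t)a * b;
    uint64_t p2 = (uint64_t)c * d;
    lo1 = (uint32_t)p1;
    hi1 = (uint32_t)(p1 >> 32);
    lo2 = (uint32_t)p2;
    hi2 = (uint32_t)(p2 >> 32);
#endif
    *ab = ((uint64_t)hi1 << 32) | lo1;
    *cd = ((uint64_t)hi2 << 32) | lo2;
}
```

In a plain __asm block the compiler may treat the first QMUL as having no matching GETQX and delete it, which is the warning reported above.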

ManAtWork · 2026-03-12 09:05

@Wuerfel_21 said:
Removing a QMUL when there's no associated GETQX (e.g. has been removed by dead code pass) is necessary for correctness (this was actually causing bugs at some point).

Ok, this is true for compiler generated code where optimization and dead code removal makes sense. But if I'm pretty sure that my assembler code is already highly optimized I think it's safer to just use __asm const all the time, right? Or could you imagine a case where this causes problems?

evanh · 2026-03-12 09:27

const and volatile are generally a good thing to use when you're crafting it to the hardware. Leaving them out is you saying you're happy to have the optimiser rearrange things.

Wuerfel_21 · 2026-03-12 12:27

@ManAtWork said:

@Wuerfel_21 said:
Removing a QMUL when there's no associated GETQX (e.g. has been removed by dead code pass) is necessary for correctness (this was actually causing bugs at some point).

Ok, this is true for compiler generated code where optimization and dead code removal makes sense. But if I'm pretty sure that my assembler code is already highly optimized I think it's safer to just use __asm const all the time, right? Or could you imagine a case where this causes problems?

Basically, for the code you showed, use __asm const or __asm volatile (if you expect it to loop often).

Using __asm const prevents the compiler from modifying the insides of the block at all (except for register allocation) and can cause missed optimizations around the block, too. If you're just writing a quick couple of instructions that would be hard to express in C in the middle of a function, the plain __asm block is better.
random example pulled from a project:

                #ifdef __propeller2__
                    __asm {
                        sub bc,#1
                        testb d,bc wc
                        bmask bc
                        if_nc sub d,bc
                    }
                #else
                    bc = 1 << (bc - 1);             /* MSB position */
                    if (!(d & bc)) d -= (bc << 1) - 1;  /* Restore negative value if needed */
                #endif

A lot of the compiler's builtin functions are also implemented using this. For example the 64 bit compare:

' compare signed alo, ahi, return -1, 0, or +1
pri _int64_cmps(alo, ahi, blo, bhi) : r
  asm
      cmp  alo, blo wc,wz
      cmpsx ahi, bhi wc,wz
if_z  mov  r, #0
if_nz negc r, #1
  endasm

If the input values are constants, they can be propagated through the code.
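
For reference, the result of the CMP / CMPSX / NEGC sequence can be written out in plain C. This is a hedged host-side model, not the library source, and int64_cmps_model is a made-up name. The high words are compared as signed values, the low words as unsigned, and the return value is -1, 0 or +1.

```c
#include <stdint.h>

/* Hypothetical C model of the 64 bit signed compare shown above.
 * Returns -1 if a < b, 0 if a == b, +1 if a > b,
 * where a = ahi:alo and b = bhi:blo. */
static int int64_cmps_model(uint32_t alo, int32_t ahi,
                            uint32_t blo, int32_t bhi)
{
    if (ahi != bhi)
        return (ahi < bhi) ? -1 : 1;   /* high words decide, signed   */
    if (alo != blo)
        return (alo < blo) ? -1 : 1;   /* low words decide, unsigned  */
    return 0;                          /* all 64 bits equal           */
}
```

Called with constant arguments, a function like this reduces to a constant, which is the same effect the plain asm block allows in the Spin version.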

To recap:

Spin syntax	C syntax	does what
ASM / ENDASM	__asm {}	Treated the same as compiler-generated instructions
ORGH/END	__asm const {}	Instructions can not be modified by optimizer (thus typically assemble as-is into hub RAM)
ORG/END	__asm volatile {}	Block is copied into cog RAM before executing, can not be modified by optimizer

avsa242 · 2026-03-21 16:33

I'm not sure if this was added a long time ago and I just never noticed, or if it was really recently added (I don't look at the P(2)ASM output for most things), but this:

' pub main() | v, smp
_main
'   
'   smp := 0
'   repeat MAX_V with v
    mov _var01, #0
    callpa  #(@LR__0002-@LR__0001)>>2,fcache_load_ptr_
LR__0001
'       smp += vts[v].sample
    add _var01, #1
    cmps    _var01, #64 wc

annotating the PASM output with the original Spin code is great, so thanks heaps for adding it! It really helps identify at a glance "what" is being compiled to "what." I'd considered asking a while back if it'd be possible to name the regs used more closely to the original Spin locals, but this makes it much easier to trace.

Cheers

evanh · 2026-03-21 17:30

@avsa242 said:
I'm not sure if this was added a long time ago and I just never noticed, or if it was really recently added (I don't look at the P(2)ASM output for most things), but this:

It has always been there I think. I can see old compiles I still have, sitting in dark corners, from 2021, producing it.

EDIT: Ah, found a 2020 dated compile that doesn't have it.
EDIT2: Sept 2020 compile has it. 2019 compiles didn't produce a .p2asm by default.

avsa242 · 2026-03-21 18:12

Wow...I don't know how I never noticed it...again I don't use the pasm output much, but still...
EDIT: Ah ok I see why - I don't have -l in my editor's build task. I was using FlexProp in this particular case for the DEBUG output and it has that flag enabled by default.

evanh · 2026-03-21 20:34

I couldn't immediately remember the compiled source comments either, to be honest. But when I looked up one of my efforts I realised I'd been using them to double check where I looked in the assembly on those occasions I did check things.
