flexspin compiler for P2: Assembly, Spin, BASIC, and C in one compiler

ManAtWork · 2025-10-07 14:47

Next problem: How do I properly declare shared memory used by multiple cogs as volatile so that the compiler does not optimize away consecutive reads?

Example:

struct __using("Sano_trafo.spin2") asm;
static volatile TrafoCtrlDescr ctrl;
static int ccog;
...
   ccog= _coginit (ANY_COG, asm.GetStartAdr (), &ctrl);
    _waitms (1);
    _wrpin (pinTrI, P_ADC | P_ADC_GIO);
    _waitx (TRAFO_PERIOD * 100);
    int32_t lo= ctrl.currI;
    _wrpin (pinTrI, P_ADC | P_ADC_VIO);
    _waitx (TRAFO_PERIOD * 100);
    _waitms (1);
    int32_t hi= ctrl.currI;
    _cogstop (ccog);
    _wrpin (pinTrI, P_ADC | P_ADC_1X);
    int32_t gain= round (0x4000 * TRAFO_SCALE_I / (hi - lo));
    asm.SetOffset (lo);
    asm.SetGain (gain);

This works as it's supposed to. But as soon as I omit the second _waitms(), the compiler thinks it's being smart and assumes that the two accesses to ctrl.currI must yield the same value, so that hi and lo are the same and can be merged into a single variable. The compiled code is:

    mov arg01, #1
    call    #__system___waitms ' first _waitms()
    wrpin   ##1048624, #8
    waitx   ##400000 ' first _waitx()
    add ptr__dat__, ##74320
    rdlong  local01, ptr__dat__ ' first read of ctrl.currI
    wrpin   ##1081392, #8
    waitx   ##400000 ' second _waitx()
    add ptr__dat__, #40
    rdlong  arg01, ptr__dat__ ' read of ccog
    sub ptr__dat__, ##74360 ' second read of ctrl.currI is optimized away!
    cogstop arg01
    mov arg02, ##1146928
    wrpin   ##1146928, #8
    mov arg01, local01 ' should be hi-lo
    sub arg01, local01 ' but compiles to lo-lo !

The call to _waitms() seems to act as some sort of "clear optimization cache" command. Same for printf() and the like. This makes debugging very unpredictable. As soon as I remove the debug output the behaviour of the program changes. I'm almost sure this is my fault. But it would be really helpful if I knew how to avoid that.
(@ersmith I can send you the complete project if you need it)

Wuerfel_21 · 2025-10-07 15:50

This read elimination is blocked by most function calls, other memory writes and certain other instructions.

We should add an explicit memory fence intrinsic though.

ManAtWork · 2025-10-07 16:00

I thought volatile is exactly meant for that purpose. But currently, it has no effect at all. I wonder how all my other programs work at all. Things like mailboxes used by at least two cogs are very common with the Propeller. There should be a safe way to implement them without guessing about the optimisations.

Wuerfel_21 · 2025-10-07 16:34

Yeah volatile is ignored by the frontend because ???

Spin doesn't have an equivalent annotation, so ??? there.

Branch instructions stop the optimization even if it could otherwise happen, so things like wait loops always act as a fence.

There's also a flag to disable these memory optimizations outright, but idk what it's called and am phoneposting from a train rn.
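
A minimal sketch of the wait-loop pattern mentioned above, in plain C. The type name `MailboxFlag`, the function name and the poll limit are illustrative assumptions, not part of any flexspin library. The claim that the loop's branch keeps each read from being merged comes from the post above; the `volatile` qualifier is what ISO C would rely on, and per this thread flexspin did not yet honor it.

```c
#include <stdint.h>

/* Hypothetical mailbox word: 0 means "empty", nonzero means "data ready". */
typedef volatile int32_t MailboxFlag;

/* Polls the flag up to max_polls times.
   Returns 1 as soon as the flag reads nonzero, 0 if it never does.
   Every iteration ends in a conditional branch, so each pass performs
   its own read of *flag. */
int wait_for_flag(MailboxFlag *flag, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (*flag != 0) {
            return 1;
        }
    }
    return 0;
}
```

On a P2 the flag would be written by the other cog; the poll limit only exists so the sketch cannot hang.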

evanh · 2025-10-07 20:45

@Wuerfel_21 said:
There's also a flag to disable these memory optimizations outright ...

The optimising options are in the "general.md" file. Read the section on "Per-function control." I only just now realised Eric had given me help via this mechanism to force using the full assembler on a per-function basis in my 4-bit SD card driver code.

ManAtWork · 2025-10-08 08:15

Disclaimer: It's not my job to develop a compiler so I might be a little too naive. In reality it's probably a lot harder than I think so please take the following as suggestions from a somewhat ignorant outsider...

Optimisations are a good thing in general but the compiler shouldn't try to be smarter than the programmer. IMHO, it shouldn't try to optimize away whole lines of code. If the programmer wrote a specific statement then it's most likely there for a reason. Example

a= pointer->myStruct.a + (c+d);
b= pointer->myStruct.b + (c+d);

It's totally OK to not fetch the pointer again and keep it in a register. It's also OK to extract common subexpressions like (c+d). If there is no write in between then they most likely don't change. But the final read of myStruct.a and .b should always happen even if the compiler thinks that they have the same value. If I write two lines of code then I want it that way. And if it's not fully optimized it's my fault and/or intention.

I know... Changing the optimization strategies is probably not a good idea because it might introduce new bugs and cause a lot of work. Maybe it's easier to implement the volatile tag.

For now, I solved the problem by extracting the _waitx() and the ctrl.currI read access into a separate function and switching off optimization locally.

int32_t GetTrafoAdc () __attribute__(opt(!regs))
{
    _waitx (TRAFO_PERIOD * 2);
    return ctrl.currI;
}

This works, but interestingly it also switches off small-method inlining. Without the attribute, the two calls to the function get inlined as two waitx and rdlong instructions (of which the second rdlong gets optimized away).

Wuerfel_21 · 2025-10-08 12:45

A lot of the problem here is that there's no distinction between code you explicitly wrote a certain way and code that comes to be through macros/inlining or as a result of other optimizations. Going aggressive on deleting RDLONG instructions is really valuable because RDLONG is extremely slow (even more so in hubexec).

If you want to control code exactly, just use ASM, it's better for that.

ManAtWork · 2025-10-08 16:22

Come on! I know, ASM is best when you want exact timing, exact memory and register layout, and optimum performance. I enjoy programming in assembler as long as it's closely connected to the hardware and the code size remains manageable. But it is also very time-consuming and error-prone. The whole purpose of a compiler is to not be forced to use assembler for everything. Especially for larger projects I prefer high-level languages because they make everything much more readable. OK, C is probably not the best choice...

I don't want to argue, you can't please everybody. If there's a workaround I'm happy. But it's one of those reasons why the Propeller is not so commonly used. It's very special. Everybody else uses C and "volatile".

ersmith · 2025-10-08 17:34

@ManAtWork said:

@ersmith said:
I would use mostly temporary labels (like .x and .y) in the __asm. Another good solution would be the one I think you've already found, to enclose the assembly code in a module.

Well, I could declare ALL labels locally except for the entry point which has to be visible to the C code.

That's what I meant.

I've found that there are DAT section namespaces in Spin2 ("%namesp x"). Are they available in Spin style assembler sections?

Unfortunately those aren't hooked up to C or BASIC yet. Probably they could be made to work without too much effort, I'll look at it when I get the chance.

But I think a separate Spin2 file is the better solution because of the syntax highlighting. Spin style comments and constants ($ABCD) always look ugly inside C files. The only drawback is I can't use C style #defines in the Spin2 modules. So I have to use two separate include files for my global constants like pin numbers.

I agree with you that a separate Spin2 file is probably the best solution. I'm not sure what problem you're having with C style #define in Spin2; simple #define (enough to define constants like pin numbers) should work, although macros with parameters won't.

Wuerfel_21 · 2025-10-08 19:00

@ManAtWork said:
I don't want to argue, you can't please everybody. If there's a workaround I'm happy. But it's one of those reasons why the Propeller is not so commonly used. It's very special. Everybody else uses C and "volatile".

I don't want to argue either, I'm just not in a position right now to implement correct volatile handling or the explicit fence intrinsic I suggested earlier.

ersmith · 2025-10-09 00:00

Implementing volatile is slightly tricky, because we'll have to somehow propagate the variable information down into the registers that hold the addresses and keep track of it everywhere. In principle this is do-able, but getting all of the cases right is likely to be tricky, we'll probably miss some.

A simpler stop-gap solution is to suppress the read combinations if a waitx or similar instruction (like waitct or one of the interrupt waiting instructions) is found between the reads. This is a good idea, because if we're waiting then another COG may well change the memory. Without a wait (and without a loop) then there's a race condition anyway, so combining the reads is relatively safe.
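
The stop-gap described above can be illustrated with a toy model. This is not flexspin's optimizer or its internal representation; the `Insn` type, the `OpKind` names and the fixed-size table are assumptions made purely for illustration. The pass marks a read as redundant when the same address was already read and nothing that acts as a fence came in between, and a flag decides whether wait instructions count as fences.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy instruction kinds (hypothetical, not flexspin's IR). */
typedef enum { OP_READ, OP_WRITE, OP_CALL, OP_BRANCH, OP_WAIT, OP_OTHER } OpKind;

typedef struct {
    OpKind kind;
    int    addr;      /* memory address for OP_READ / OP_WRITE, unused otherwise */
    bool   redundant; /* set by the pass: this read may reuse an earlier value */
} Insn;

#define MAX_KNOWN 8   /* how many already-read addresses the toy pass remembers */

/* True if the instruction forces later reads to go back to memory. */
static bool is_fence(OpKind k, bool waits_are_fences)
{
    if (k == OP_WRITE || k == OP_CALL || k == OP_BRANCH) {
        return true;
    }
    return waits_are_fences && k == OP_WAIT;
}

/* Marks redundant reads in code[0..n-1] and returns how many were marked. */
size_t mark_redundant_reads(Insn *code, size_t n, bool waits_are_fences)
{
    int    known[MAX_KNOWN];
    size_t nknown = 0;
    size_t marked = 0;

    for (size_t i = 0; i < n; i++) {
        code[i].redundant = false;
        if (code[i].kind == OP_READ) {
            bool seen = false;
            for (size_t k = 0; k < nknown; k++) {
                if (known[k] == code[i].addr) {
                    seen = true;
                    break;
                }
            }
            if (seen) {
                code[i].redundant = true;
                marked++;
            } else if (nknown < MAX_KNOWN) {
                known[nknown++] = code[i].addr;
            }
        } else if (is_fence(code[i].kind, waits_are_fences)) {
            nknown = 0; /* forget everything read so far */
        }
    }
    return marked;
}
```

Fed a sequence shaped like the listing earlier in the thread (read, pin write, wait, unrelated read, second read of the same address), the model marks the second read redundant when waits are not fences and keeps it when they are.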

ManAtWork · 2025-10-09 10:28

@ersmith said:
Implementing volatile is slightly tricky, because we'll have to somehow propagate the variable information down into the registers that hold the addresses and keep track of it everywhere. In principle this is do-able, but getting all of the cases right is likely to be tricky, we'll probably miss some.

I fully understand. But if volatile is ignored it might be a good idea to throw a warning.

A simpler stop-gap solution is to suppress the read combinations if a waitx or similar instruction (like waitct or one of the interrupt waiting instructions) is found between the reads. This is a good idea, because if we're waiting then another COG may well change the memory. Without a wait (and without a loop) then there's a race condition anyway, so combining the reads is relatively safe.

That would be great.

Rayman · 2025-10-11 19:00

@ersmith This way of using an sd driver doesn't work in Flexprop but does in Spin Tools:

    'Start uSD driver
    if \FS.FATEngineStart(uSD_MISO, uSD_CLK, uSD_MOSI, uSD_CS, 0)
        debug("Missing SD Card?")
        'repeat
    if \FS.mountPartition(0)
        debug("Can't mount FS")
        debug("Missing SD Card?")
        'repeat
    if \FS.openfile(string("back1b.bmp"),"R")
        debug("Missing file?")
        'repeat

If one comments out the "repeat" as shown above it works. Maybe the "\" thing works differently here?

ersmith · 2025-10-12 12:11

@Rayman said:
@ersmith This way of using an sd driver doesn't work in Flexprop but does in Spin Tools:
    'Start uSD driver
    if \FS.FATEngineStart(uSD_MISO, uSD_CLK, uSD_MOSI, uSD_CS, 0)
        debug("Missing SD Card?")
        'repeat
    if \FS.mountPartition(0)
        debug("Can't mount FS")
        debug("Missing SD Card?")
        'repeat
    if \FS.openfile(string("back1b.bmp"),"R")
        debug("Missing file?")
        'repeat
If one comments out the "repeat" as shown above it works. Maybe the "\" thing works differently here?

What exactly do you mean by "doesn't work" and "works"? Both examples compile fine for me, and I get the "Missing SD Card?" message as expected.

Rayman · 2025-10-12 12:42

Ok have to test some more…
Thanks.

Could be that I need to try mounting twice with FlexProp for some reason.

Rayman · 2025-10-12 15:37

Well, it's very strange... I don't use abort personally, but this FAT code, ported from P1 does.
It works with Spin Tools, but not with FlexProp, and I don't see why at all...
But, code does work when abort catcher is removed like this:

    if FS.FATEngineStart(uSD_MISO, uSD_CLK, uSD_MOSI, uSD_CS, 0)==-1
        debug("Missing SD Card?")
        repeat
    FS.mountPartition(0)
    FS.openfile(string("back1b.bmp"),"R")

Guess that is fine for now.

ersmith · 2025-10-12 18:40

@Rayman said:
Well, it's very strange... I don't use abort personally, but this FAT code, ported from P1 does.
It works with Spin Tools, but not with FlexProp, and I don't see why at all...
But, code does work when abort catcher is removed like this:

Again, could you explain what you mean by "works" and "doesn't work"? Does the code crash? Does the debug statement not get triggered? Does it not compile, or have some warning? I have no idea what you're talking about, so I really can't help without some more details...

evanh · 2025-10-12 19:17

I didn't have a clue either until Roger posted this - https://forums.parallax.com/discussion/comment/1569631/#Comment_1569631

It looks like there is probably a timing problem somewhere between the SD cards, the SPI driver and FATEngine. A race condition or something. Roger looks keen to solve it. At this stage I'm not thinking Flexspin has any problems.

Rayman · 2025-10-12 20:52

Here's a minimal example.
The uSD mount fails with FlexProp. Works every time with Spin Tools and Prop Tool.

ersmith · 2025-10-12 23:56

@Rayman Ah, I think I see (one of) the issues: Chip changed how \foo() works in Spin2. In Spin1 this has the same value as foo() if there is no abort. In Spin2 it appears it always results in 0 (ignores the return value of the method if there is no abort). This is kind of a strange decision (why even return something if it's going to be ignored?) but I'll change flexspin to match this. For now you can work around it by returning 0 on success, instead of returning -1.
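
The difference described above can be modelled in plain C with setjmp/longjmp. This is only a sketch of the two behaviours as stated in the post, not how either compiler implements abort; every name below is hypothetical, and the single global catch frame means nested catches are not supported.

```c
#include <setjmp.h>

/* One global catch frame is enough for this sketch (no nesting). */
static jmp_buf catch_frame;
static int     pending_abort_code;

typedef int (*Method)(void);

/* Stand-in for Spin's "abort code": unwinds to the active catch. */
void spin_abort(int code)
{
    pending_abort_code = code;
    longjmp(catch_frame, 1);
}

/* Spin1-style \m(): the abort code if m aborts, otherwise m's return value. */
int catch_spin1_style(Method m)
{
    if (setjmp(catch_frame) != 0) {
        return pending_abort_code;
    }
    return m();
}

/* Spin2-style \m() as described in the post: the abort code if m aborts,
   otherwise 0, with m's return value discarded. */
int catch_spin2_style(Method m)
{
    if (setjmp(catch_frame) != 0) {
        return pending_abort_code;
    }
    (void)m();
    return 0;
}

/* Hypothetical driver methods: one returns -1 on success, one aborts. */
int start_ok_returns_minus_one(void) { return -1; }
int start_aborts_with_5(void)        { spin_abort(5); return 0; }
```

With a method that returns -1 on success, the Spin1-style catch yields a nonzero (true) value and an `if \method()` body runs, while the Spin2-style catch yields 0 and the body is skipped. That matches the suggested workaround of returning 0 on success.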

Wuerfel_21 · 2025-10-13 00:29

IIRC the problem was related to multi-return - can't do the old thing because there's exactly one abort code but there could be 0..15 result values. Not that the old behaviour ever made that much sense to begin with.

__deets__ · 2025-10-19 10:39

@ersmith said:
@deets I'm not able to reproduce your problem, and in fact can't even reproduce your build: the .p2asm I get from building with -O0 is considerably smaller than the one you sent. Could you send me the exact command line you used to build it, please? For me, I used:
flexspin -DPROVOKE_ERROR -I./rq2 -I./spin2lib -I./sx1268/src -2 -O0 sx1268-test.cpp sx1268/src/driver_sx1268.c

This is the command I used:

/opt/flexspin/bin/flexcc -DPROVOKE_ERROR -O0 -2 -Wall -g -Lspin2lib -Lrq2  -Lsx1268/src -o sx1268-test.binary sx1268-test.cpp sx1268/src/driver_sx1268.c && /opt/flexspin/bin/loadp2 -p /dev/serial/by-id/usb-FTDI_FT232R_USB_UART_00000000-if00-port0 @0=P2ES_flashloader.bin,@8000+./sx1268-test.binary

I adapted this to not set -Wall -g and used flexspin - still... no dice.

Really a mystery. What else can we try?

ersmith · 2025-10-19 21:49

@deets : It's worth trying with the most recent github of flexspin/flexcc (7.6.0 beta). There was an uninitialized variable being used to determine structure packing which could cause different behavior on different machines, which lines up with what we see here.

ManAtWork · 2025-10-23 09:17

Just a curiosity... This code

void CalcCRC (int32_t d)
{
    int p= MODBUS_CRCPOLY;
    int c= crc;
    __asm {
      rev    d
      setq   d
      crcnib c,p
      crcnib c,p
    }
    crc= c;
}

Is compiled to that:

_CalcCRC
    mov _var01, ##40961
    add ptr__dat__, ##1836
    rdlong  _var02, ptr__dat__
    rev arg01
    setq    arg01
    crcnib  _var02, _var01
    crcnib  _var02, ##40961 ' <-- immediate instead of _var01
    wrlong  _var02, ptr__dat__
    sub ptr__dat__, ##1836
_CalcCRC_ret
    ret

Funny un-optimization. Should be "var01" in both lines, I think.

TonyB_ · 2025-10-24 15:37

@ManAtWork said:
Just a curiosity... This code

void CalcCRC (int32_t d)
{
    int p= MODBUS_CRCPOLY;
    int c= crc;
    __asm {
      rev    d
      setq   d
      crcnib c,p
      crcnib c,p
    }
    crc= c;
}

Is compiled to that:

_CalcCRC
  mov _var01, ##40961
  add ptr__dat__, ##1836
  rdlong  _var02, ptr__dat__
  rev arg01
  setq    arg01
  crcnib  _var02, _var01
  crcnib  _var02, ##40961 ' <-- immediate instead of _var01
  wrlong  _var02, ptr__dat__
  sub ptr__dat__, ##1836
_CalcCRC_ret
  ret

Funny un-optimization. Should be "var01" in both lines, I think.

The compiled code wastes a long for the implicit AUGS but it's not any slower. WRLONG here takes 8 cycles. If the second CRCNIB were the same as the first, REV + SETQ + CRCNIB + CRCNIB would take 8 cycles (exactly one hub RAM revolution) instead of 10, but WRLONG would then take 10 cycles (the worst case) instead of 8.

evanh · 2025-10-24 19:37

@ManAtWork said:
Funny un-optimization. Should be "var01" in both lines, I think.

Agreed.

evanh · 2025-10-25 21:34

Okay, answers needed ... Why does sdmm.cc end up in the final binary, linked immediately after sdsd.cc, when I'm using sdsd.cc?
In fact it shows up during compile too:

Propeller Spin/PASM Compiler 'FlexSpin' (c) 2011-2025 Total Spectrum Software Inc. and contributors
Version 7.5.1-beta-v7.4.4-15-g89b6ef11 Compiled on: Oct  6 2025
sdfat-speedtest.c
|-sdsd.cc
fatfs_vfs.c
|-sdmm.cc
|-fatfs.cc
mount.c
fmt.c
fputs.c
fopen.c
fwrite.c
fseek.c
...

evanh · 2025-10-28 05:42

PS: I've found a way to make sdmm.cc go away - Delete the sdmm_open() function from include/filesys/fatfs/fatfs_vfs.c

evanh · 2025-10-29 21:55

Eric,
I'm not understanding pointers in FlexC. In the block driver plug-in init function, _sdsd_open(), it fills out the vfs_file_t structure that _get_vfs_file_handle() allocates. It does this one entry at a time across eleven lines of C code. I was thinking I could copy a static table of pointers or similar to reduce code size.

But when printing the pointers I get addresses that are very sparse and evenly spaced, eg: v_read=10c4a8 v_write=20c4a8 v_close=30c4a8. What's the story?

EDIT: Hmm, I think I see now. In the general.md doc, I note C++/Spin objects are special, the pointer is not an address. And these block drivers are packaged as objects, hence the .cc source file suffix.

I guess the short answer is not to do what I was thinking of.

ersmith · 2025-10-31 20:21

@evanh said:
Eric,
I'm not understanding pointers in FlexC. In the block driver plug-in init function, _sdsd_open(), it fills out the vfs_file_t structure that _get_vfs_file_handle() allocates. It does this one entry at a time across eleven lines of C code. I was thinking I could copy a static table of pointers or similar to reduce code size.

Yeah, as you've discovered, pointers in FlexC have 2 pieces of data, the function address and the object data address. These would need 40 bits if not compressed, so we put the functions in a table and use an index into that table as the upper 12 bits. The advantage of using objects for block drivers is that it makes it easy to have multiple instances, so at least in theory someone could hook up multiple SD cards to a single P2.
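
A small sketch of how the printed values can be taken apart under that description. It assumes the pointer is a 32-bit value, so an index in the upper 12 bits leaves 20 bits for the object data address; the type and function names are made up for illustration and are not part of FlexC.

```c
#include <stdint.h>

/* Assumption: 32-bit pointer, upper 12 bits = function table index,
   lower 20 bits = object data address. */
#define DATA_ADDR_BITS 20u
#define DATA_ADDR_MASK ((1u << DATA_ADDR_BITS) - 1u)

typedef struct {
    uint32_t func_index; /* index into the function table */
    uint32_t data_addr;  /* address of the object's data */
} MethodPtrParts;

/* Splits a packed method pointer into its two fields. */
MethodPtrParts split_method_ptr(uint32_t raw)
{
    MethodPtrParts p;
    p.func_index = raw >> DATA_ADDR_BITS;
    p.data_addr  = raw & DATA_ADDR_MASK;
    return p;
}

/* Packs the two fields back into one 32-bit value. */
uint32_t join_method_ptr(uint32_t func_index, uint32_t data_addr)
{
    return (func_index << DATA_ADDR_BITS) | (data_addr & DATA_ADDR_MASK);
}
```

Applied to the values quoted earlier, 10c4a8, 20c4a8 and 30c4a8 split into table indices 1, 2 and 3 with the same data address c4a8, which would explain why the printed pointers look evenly spaced.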
