LLVM Backend for Propeller 2

iseries · 2022-04-14 11:30

Is long long (64-bit) supported?

I have this code that generates two different answers and don't know if this is correct:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    long long t;
    int d1;
    unsigned int c5;
    int x;

    printf("Starting\n");

    x = rand() % 10;
    printf("Rand: %d\n", x);

    d1 = 8052495 + x;
    c5 = 31559;

    t = d1 - (c5 << 8);
    x = d1 - (c5 << 8);

    printf("t: %lld, x: %d\n", t, x);

    printf("Done\n");

    while (1)
    {
        wait(500);
    }
}

The long long value is 4294941851, whereas the int value is -25445, which is the answer I was looking for.

Mike

n_ermosh · 2022-04-14 16:39

64 bit ints and doubles should work, but I haven't tested them thoroughly, so if you find bugs, let me know. I've been using them and haven't found any issues yet. Many floating point operations aren't supported yet (the code is there, I just need to actually enable compiling it into the library; I've been doing it piece by piece to avoid adding a ton of code at once and then hunting down bugs later).
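For the program above, the two results look consistent with the standard C arithmetic conversions rather than with a 64-bit code generation bug. Because c5 is unsigned int, d1 is converted to unsigned int before the subtraction, so the difference wraps modulo 2^32 before it is widened to long long. A minimal sketch, assuming a 32-bit int as on the P2 (the helper names are made up for illustration):

```c
#include <stdint.h>

/* Mirrors "t = d1 - (c5 << 8)": the subtraction happens in unsigned
   32-bit arithmetic, and only the wrapped result is widened. */
static int64_t widen_after_unsigned_subtract(int32_t d1, uint32_t c5)
{
    uint32_t diff = (uint32_t)d1 - (c5 << 8);  /* wraps modulo 2^32 */
    return (int64_t)diff;                      /* value-preserving, stays positive */
}

/* Converting the operands to a signed 64-bit type first keeps the sign. */
static int64_t widen_before_subtract(int32_t d1, uint32_t c5)
{
    return (int64_t)d1 - (int64_t)(c5 << 8);
}
```

Any inputs whose true difference is -25445 give 4294941851 from the first helper and -25445 from the second, matching the two values printed. Casting one operand, as in t = (long long)d1 - (c5 << 8), should give the signed answer in the original program.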

n_ermosh · 2022-04-17 18:02

@iseries I just merged in a rewrite of stdio and your SD card code. I haven't tested the SD stuff, and the changes to stdio shouldn't be breaking anything, EXCEPT I removed simple_printf. After the rewrite of printf, it didn't really make sense to keep simple_printf since it was only slightly smaller and faster, not enough to keep both implementations. Let me know if you run into any issues.

iseries · 2022-04-18 01:08

Yes, the SD card functions have a bug in them and I didn't know how to send an updated version to an already submitted pull request.

Apparently all I had to do was update it.

Anyway I need to send you an update to the SD card functions. Also hopefully the memory functions that I updated and you merged in work for you.

Mike

iseries · 2022-04-18 11:00

Submitted updated SD driver code.

Speed test program that was used:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#include <propeller.h>
#include <sys/sdcard.h>

uint32_t randfill(uint32_t *, size_t);
int  compare(uint32_t *, uint32_t *, size_t);

#define PIN_SS   23
#define PIN_MISO 20
#define PIN_CLK  21
#define PIN_MOSI 22

uint32_t  data1[25000];
uint32_t  data2[25000];
struct tm tv;


int main(int argc, char** argv)
{
    FILE  *fh;
    uint32_t  ticks;
    time_t t;
    struct timeval x;

    tv.tm_year = 2022 - 1900;
    tv.tm_mon = 3;
    tv.tm_mday = 7;
    tv.tm_hour = 6;
    tv.tm_min = 0;
    tv.tm_sec = 0;
    t = mktime(&tv);
    x.tv_sec = t;
    x.tv_usec = 0;

    settimeofday(&x, 0);

    printf( " clkfreq = %d   clkmode = 0x%x\n", _clkfreq, _clkmode);
    printf( " Randfill ticks = %d\n", randfill( data1, sizeof(data1) ) );

    printf( " Mounting: " );
    sd_mount(0, PIN_SS, PIN_CLK, PIN_MOSI, PIN_MISO);

    if( (fh = fopen( "SD0:/speed2.bin", "w" )) != NULL )
    {
        ticks = getms();
        fwrite( data1, 1, sizeof(data1), fh );
        fclose( fh );
        ticks = getms() - ticks;
        printf( " Writing %u bytes at %u kB/s\n", (unsigned)sizeof(data1), (unsigned)((sizeof(data1) * 1000 / ticks + 512) >> 10) );
    } else  printf( " SD card write error!\n" );

    if( (fh = fopen( "SD0:/speed2.bin", "r" )) != NULL )
    {
        ticks = getms();
        fread( data2, 1, sizeof(data2), fh );
        fclose( fh );
        ticks = getms() - ticks;
        printf( " Reading %u bytes at %u kB/s\n", (unsigned)sizeof(data2), (unsigned)((sizeof(data2) * 1000 / ticks + 512) >> 10) );
        if( compare( data1, data2, sizeof(data2) ) )  printf( " Matches!  :)\n" );
        else    printf( " Mis-matches!  :(\n" );
    } else  printf( " SD card read error!\n" );

    while (1)
    {
        waitms(500);
    }
}

uint32_t  randfill( uint32_t *addr, size_t size )
{
    uint32_t  ticks;

    size >>= 2;
    ticks = _cnt();
    do {
        *(addr++) = rand();
    } while( --size );

    return( _cnt() - ticks );
}


int  compare( uint32_t *addr1, uint32_t *addr2, size_t size )
{
    uint32_t  pass = 1;

    size >>= 2;
    do {
        if( *(addr1++) != *(addr2++) )  pass = 0;
    } while( --size );

    return( pass );
}
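
The kB/s figure printed by the program is bytes per second rounded to the nearest 1024-byte unit: multiply the byte count by 1000, divide by the elapsed milliseconds, add 512, and shift right by 10. A minimal sketch of the same arithmetic as a helper; the function name and the zero-time guard are additions for illustration and are not part of the test program:

```c
#include <stdint.h>

/* Rounded throughput in units of 1024 bytes per second. */
static uint32_t kib_per_second(uint32_t bytes, uint32_t elapsed_ms)
{
    if (elapsed_ms == 0)
        return 0;  /* guard added in this sketch; avoids dividing by zero */

    /* 64-bit intermediate so large transfers cannot overflow. */
    uint64_t bytes_per_second = (uint64_t)bytes * 1000u / elapsed_ms;

    /* +512 rounds to nearest before dividing by 1024. */
    return (uint32_t)((bytes_per_second + 512u) >> 10);
}
```

For the 100000-byte buffers used here, a transfer that takes 250 ms works out to 391 kB/s.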

Mike

rogloh · 2026-03-17 03:28

Hi @n_ermosh, I finally got around to downloading LLVM on my M2 Pro based Mac to have a brief look. Not sure if you are still actively working on it or not. I downloaded the latest from your GitHub source at ne75/p2llvm.

Hit a few problems along the way so I thought I'd mention them in case you wanted to figure it out. It seems to be related to the use of the modulus % operator in C, which is being converted to a QUREM pseudo instruction, presumably for later access via the CORDIC.

I've found a handful of failing files crashing LLVM Clang with an assert condition and so far the common denominator is always the use of the % operator.

Assertion failed: ((I.atEnd() || std::next(I) == def_instr_end()) && "getVRegDef assumes a single definition or no definition"), function getVRegDef, file MachineRegisterInfo.cpp, line 404.

This simple code below is enough to trigger it if you build with LLVM Clang using the P2 as the target. I was using the build of LLVM included in your latest tree as a submodule which I also built locally on my Mac M2 for acting as the P2 cross compiler.

#include <stdint.h>
uint32_t rand_seed = 2223;

uint32_t rand(uint32_t min, uint32_t max)
{
    return (rand_seed % max)+min;
}

In my digging into this problem I tried to decipher a few things using -emit-llvm on the clang command line and then passing the output file to "llc --print-after-isel" which shows this:

# After Instruction Selection:
# Machine code for function rand: IsSSA, TracksLiveness
Frame Objects:
  fi#-1: size=4, align=1, fixed, at location [SP]
Function Live Ins: $r0 in %0, $r1 in %1

bb.0.entry:
  liveins: $r0, $r1
  %1:p2gpr = COPY $r1
  %0:p2gpr = COPY $r0
  %2:p2gpr = MOVri @rand_seed, 15, 0
  %3:p2gpr = RDLONGrr killed %2:p2gpr, 15, 0 :: (dereferenceable load (s32) from @rand_seed, !tbaa !3)
  %4:p2gpr = QUREM killed %3:p2gpr, %1:p2gpr
  %5:p2gpr = ADDrr %4:p2gpr(tied-def 0), %0:p2gpr, 15, 0
  $r31 = COPY %5:p2gpr
  RETA 15, 0, implicit $r31

# End machine code for function rand.

Assertion failed: ((I.atEnd() || std::next(I) == def_instr_end()) && "getVRegDef assumes a single definition or no definition"), function getVRegDef, file MachineRegisterInfo.cpp, line 404.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: /Users/roger/Applications/p2llvm/bin/llc --print-after-isel math.bc
1.      Running pass 'Function Pass Manager' on module 'math.bc'.
2.      Running pass 'Live Variable Analysis' on function '@rand'
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  llc                      0x0000000101ed7ffc llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) + 80
1  llc                      0x0000000101ed85a0 PrintStackTraceSignalHandler(void*) + 28
2  llc                      0x0000000101ed6578 llvm::sys::RunSignalHandlers() + 140
3  llc                      0x0000000101ed9a94 SignalHandler(int) + 276
4  libsystem_platform.dylib 0x000000018797ea24 _sigtramp + 56
5  libsystem_pthread.dylib  0x000000018794fc28 pthread_kill + 288
6  libsystem_c.dylib        0x000000018785dae8 abort + 180
7  libsystem_c.dylib        0x000000018785ce44 err + 0
8  llc                      0x0000000100b733b4 llvm::MachineRegisterInfo::getVRegDef(llvm::Register) const + 200
9  llc                      0x000000010098a888 llvm::LiveVariables::HandleVirtRegUse(llvm::Register, llvm::MachineBasicBlock*, llvm::MachineInstr&) + 64
10 llc                      0x000000010098d54c llvm::LiveVariables::runOnInstr(llvm::MachineInstr&, llvm::SmallVectorImpl<unsigned int>&) + 856
11 llc                      0x000000010098d9bc llvm::LiveVariables::runOnBlock(llvm::MachineBasicBlock*, unsigned int) + 488
12 llc                      0x000000010098e0b4 llvm::LiveVariables::runOnMachineFunction(llvm::MachineFunction&) + 428
13 llc                      0x0000000100a7f7ac llvm::MachineFunctionPass::runOnFunction(llvm::Function&) + 456
14 llc                      0x00000001011eecec llvm::FPPassManager::runOnFunction(llvm::Function&) + 536
15 llc                      0x00000001011f609c llvm::FPPassManager::runOnModule(llvm::Module&) + 116
16 llc                      0x00000001011ef5ac (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) + 672
17 llc                      0x00000001011ef134 llvm::legacy::PassManagerImpl::run(llvm::Module&) + 288
18 llc                      0x00000001011f64ac llvm::legacy::PassManager::run(llvm::Module&) + 36
19 llc                      0x00000001000bf494 compileModule(char**, llvm::LLVMContext&) + 4696
20 llc                      0x00000001000bda50 main + 1156
21 dyld                     0x00000001875f7fd8 start + 2412

So I'm guessing there is some problem with the QUREM "instruction" not doing the QDIV and taking the result from the GETQY or something like that before returning.

This same problem seems to also prevent me from completing the build of your P2 C library as the file(s) it fails on include the % operator in the C code. It fails on some time functions - I identified strftime.c and localtim.c failing so far, could be more although it got to 96% so it was close! EDIT: no it's just those two files with the % operator, and when I commented out the lines with the modulus, it let the build complete and install the (now broken) libc.a file to the LLVM installation folder.

 python3 build.py --skip_llvm --skip_libp2 --install /Users/roger/Applications/p2llvm
[  1%] Building C object time/CMakeFiles/time.dir/strftime.c.obj
[ 15%] Built target misc
[ 33%] Built target string
[ 38%] Built target math
[ 60%] Built target wchar
[ 90%] Built target stdlib
[ 91%] Built target stdio
[ 91%] Building C object time/CMakeFiles/time.dir/localtim.c.obj
[ 96%] Built target drivers
Assertion failed: ((I.atEnd() || std::next(I) == def_instr_end()) && "getVRegDef assumes a single definition or no definition"), function getVRegDef, file MachineRegisterInfo.cpp, line 404.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.  Program arguments: /Users/roger/Applications/p2llvm/bin/clang -I/Users/roger/Documents/Code/p2llvm/libc/include -Wall -Werror -ffunction-sections -fdata-sections -Oz -fno-exceptions --target=p2 -MD -MT time/CMakeFiles/time.dir/strftime.c.obj -MF CMakeFiles/time.dir/strftime.c.obj.d -o CMakeFiles/time.dir/strftime.c.obj -c /Users/roger/Documents/Code/p2llvm/libc/time/strftime.c
1.  <eof> parser at end of file
2.  Code generation
3.  Running pass 'Function Pass Manager' on module '/Users/roger/Documents/Code/p2llvm/libc/time/strftime.c'.
4.  Running pass 'Live Variable Analysis' on function '@strftime'

I also had a minor issue with the install scripts not working outright: they could not find the propeller.h and propeller2.h include files, and files that included stdio.h complained that it was missing. I copied those headers over to the include folder manually and commented out the (unnecessary?) #include <stdio.h> lines, and was then able to complete the build of libp2. Unfortunately it doesn't work out of the box from a clean slate, so to speak.

rogloh · 2026-03-17 03:53

With respect to the prior post, I am wondering if this assert is related to the P2-specific code below in P2ExpandPseudos.cpp, which expands the pseudo instructions. In the comment you mention a need to call GETQX first to flush it, and I noticed that this step does not happen for QUDIV, which presumably works okay (although I should really double check that too):

void P2ExpandPseudos::expand_QUREM(MachineFunction &MF, MachineBasicBlock::iterator SII) {
    MachineInstr &SI = *SII;

    LLVM_DEBUG(errs()<<"== lower pseudo unsigned remainder\n");
    LLVM_DEBUG(SI.dump());

    BuildMI(*SI.getParent(), SI, SI.getDebugLoc(), TII->get(P2::QDIVrr))
            .addReg(SI.getOperand(1).getReg())
            .addReg(SI.getOperand(2).getReg())
            .addImm(P2::ALWAYS);

    // first call getqx so that we flush it out of the cordic. This is in case another cordic operation
    // after this calls get qx before it's done. 
    BuildMI(*SI.getParent(), SI, SI.getDebugLoc(), TII->get(P2::GETQX), SI.getOperand(0).getReg())
            .addReg(P2::QX)
            .addImm(P2::ALWAYS)
            .addImm(P2::NOEFF);
    BuildMI(*SI.getParent(), SI, SI.getDebugLoc(), TII->get(P2::GETQY), SI.getOperand(0).getReg())
            .addReg(P2::QY)
            .addImm(P2::ALWAYS)
            .addImm(P2::NOEFF);

    SI.eraseFromParent();
}

Here's what you do for QUDIV which is one instruction less.

void P2ExpandPseudos::expand_QUDIV(MachineFunction &MF, MachineBasicBlock::iterator SII) {
    MachineInstr &SI = *SII;

    LLVM_DEBUG(errs()<<"== lower pseudo unsigned division\n");
    LLVM_DEBUG(SI.dump());

    BuildMI(*SI.getParent(), SI, SI.getDebugLoc(), TII->get(P2::QDIVrr))
            .addReg(SI.getOperand(1).getReg())
            .addReg(SI.getOperand(2).getReg())
            .addImm(P2::ALWAYS);
    BuildMI(*SI.getParent(), SI, SI.getDebugLoc(), TII->get(P2::GETQX), SI.getOperand(0).getReg())
            .addReg(P2::QX)
            .addImm(P2::ALWAYS)
            .addImm(P2::NOEFF);

    SI.eraseFromParent();
}

UPDATE: So when I removed the extra GETQX from QUREM and rebuilt LLVM it didn't crash anymore with that assert if I use the % operator in C code. So it would appear that this possible QX issue needs to be resolved in a different way somehow.

+#if 0
     BuildMI(*SI.getParent(), SI, SI.getDebugLoc(), TII->get(P2::GETQX), SI.getOperand(0).getReg())
             .addReg(P2::QX)
             .addImm(P2::ALWAYS)
             .addImm(P2::NOEFF);
+#endif
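
One possible reading of the assert, going only by the code quoted above: in expand_QUREM both the GETQX and the GETQY are built with SI.getOperand(0).getReg() as their destination, so that virtual register would end up with two defining instructions, which is what the getVRegDef message complains about. The plain C model below (not LLVM code; the struct and function names are made up) shows what the two expansions are meant to compute: a single QDIV yields both results, QUDIV reads back the quotient and QUREM only needs the remainder.

```c
#include <stdint.h>

/* Model of the CORDIC result pair after an unsigned QDIV. */
typedef struct {
    uint32_t qx;  /* quotient, read back with GETQX */
    uint32_t qy;  /* remainder, read back with GETQY */
} cordic_result;

/* One divide produces both values; the caller must ensure d != 0. */
static cordic_result model_qdiv(uint32_t n, uint32_t d)
{
    cordic_result r = { n / d, n % d };
    return r;
}

/* QUDIV: QDIV followed by GETQX. */
static uint32_t model_qudiv(uint32_t n, uint32_t d)
{
    return model_qdiv(n, d).qx;
}

/* QUREM: QDIV followed by GETQY. */
static uint32_t model_qurem(uint32_t n, uint32_t d)
{
    return model_qdiv(n, d).qy;
}
```

If the flush of QX is still wanted, one option might be to give that GETQX a separate scratch destination instead of the result register, though that has not been tried here.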

rogloh · 2026-04-08 02:31

Ugh, in my attempts to get MicroPython to compile with LLVM I'm running into a serious problem with stack frame generation and computation of parameter offset addresses in function calls with P2LLVM. This seems to be buggy and will crash MP printing code in some situations.

E.g. here's my C calling code snippet (with some puts debug stuff added):

                puts("calling mp_print_int");
                if (fill==' ')
                    puts("fill is a space");
                chrs += mp_print_int(print, val, 0, base, fmt_c, flags, fill, width);
                puts("done printing int");

And here's the start of the called code which checks the "fill" argument after it is passed a space character. I found this value was not getting through correctly to the lower layers of code which is why I checked it and then printed the result here.

STATIC int mp_print_int(const mp_print_t *print, mp_uint_t x, int sgn, int base, int base_char, int flags, char fill, int width) {
    char sign = 0;
    if (fill==' ')
        puts("fill still a space");
    else
        puts("not a space");

Executing the code results in:

calling mp_print_int
fill is a space
not a space
... and then the code never returns due to infinite loop lockup later due to invalid data being passed from not addressing the arguments correctly

I disassembled the relevant code and it sure looks like the argument addressing is wrong.

Caller:

    44bc: ed a1 03 f6            mov r0, r29
    44c0: ff a0 07 f5            and r0, #255 
    44c4: 20 a0 5f f2            cmps r0, #32   wcz
    44c8: 08 00 90 5d       if_nz    jmp #8
    44cc: df a1 03 f6            mov r0, r15 
    44d0: 14 bd c3 fd            calla #\puts
    44d4: db a1 03 f6            mov r0, r11    
    44d8: 01 a0 07 f1            add r0, #1 
    44dc: 06 a0 07 f5            and r0, #6 
    44e0: dd c9 03 f6            mov r20, r13
    44e4: d0 c9 83 f1            sub r20, r0
    44e8: ff ff 7f ff            augs #8388607
    44ec: f0 b7 07 f5            and r11, #496
    44f0: 0f b6 87 f1            sub r11, #15   
    44f4: 07 da 67 f7            signx r29, #7
    44f8: f8 a1 03 f6            mov r0, ptra 
    44fc: 04 a0 07 f1            add r0, #4 
    4500: d0 db 63 fc            wrlong r29, r0  <---- write "fill" parameter to SP+4
    4504: f8 a1 03 f6            mov r0, ptra  
    4508: 08 a0 07 f1            add r0, #8 
    450c: d0 a7 63 fc            wrlong r3, r0  <--- write "flags" parameter to SP+8
    4510: f8 ab 63 fc            wrlong r5, ptra  <--- write "width" parameter to SP+0
    4514: 07 b6 67 f7            signx r11, #7  
    4518: f8 a1 03 f6            mov r0, ptra
    451c: 0c a0 07 f1            add r0, #12  
    4520: d0 b7 63 fc            wrlong r11, r0  <---- write "base_char" parameter to SP+12
    4524: d4 a1 03 f6            mov r0, r4  <--- setup "*print" parameter in r0
    4528: ec a5 03 f6            mov r2, r28 <--- setup "sgn" parameter in r2
    452c: e4 a7 03 f6            mov r3, r20 <--- setup "base" parameter in r3
    4530: 10 f0 07 f1            add ptra, #16  <---- advance SP by 4 long parameters passed on the stack (16 bytes)
    4534: 38 46 c0 fd            calla #\mp_print_int <---- invoke mp_print_int function
    4538: 10 f0 87 f1            sub ptra, #16
    453c: ef a3 03 f6            mov r1, r31
    4540: e0 a1 03 f6            mov r0, r16    
    4544: 14 bd c3 fd            calla #\puts

The beginning code from the called function is this:

00004638 <mp_print_int>:         
    4638: 28 18 64 fd            setq #12
    463c: 61 a1 67 fc            wrlong r0, ptra++  <----- saves 13 registers from r0-r12
    4640: 11 f0 07 f1            add ptra, #17   <---- advances SP (for local variables?)
    4644: d3 af 03 f6            mov r7, r3   <-- preserve arg r3
    4648: d0 a9 03 f6            mov r4, r0  <---preserve arg r0
    464c: 78 02 00 ff            augs #632
    4650: f7 a1 07 f6            mov r0, #503  <--- compute first string address
    4654: 72 02 00 ff            augs #626
    4658: 77 a7 07 f6            mov r3, #375 <---- compute alternative string address
    465c: 2b b1 07 fb            rdlong r8, ptra[-21]  <---- read "fill argument" by subtracting 21 longs worth (84 bytes)  
    4660: 20 b0 5f f2            cmps r8, #32   wcz  <--- check against space character
    4664: d3 a1 03 a6       if_z     mov r0, r3   <--- matches space
    4668: d0 a1 03 56       if_nz    mov r0, r0  <--- doesn't match space
    466c: f8 ed 03 f6            mov pa, ptra   
    4670: 11 ec 87 f1            sub pa, #17  
    4674: f6 01 48 fc            wrbyte #0, pa  <--- clear local?
    4678: 14 bd c3 fd            calla #\puts  <--- print result
    467c: 00 b2 07 f6            mov r9, #0  
    4680: 2c a7 07 fb            rdlong r3, ptra[-20]
    4684: 0f a4 97 fb            tjz r2, #15 
    4688: ff ff 7f ff            augs #8388607
    468c: ff a3 5f f2            cmps r1, #511  wcz
    4690: 04 00 90 1d   if_nc_and_nz     jmp #4

You can see that the SP (ptra) is advanced by 17 for local variable use, which is not divisible by 4. Yet when the "fill" argument is read into r8 as a long from ptra[-21] to be checked against a space character (32), the offset has not been compensated for correctly, which is why it no longer matches a space. The stack read is not aligned to the "fill" data on the stack.

This is not good and makes the entire function call system unreliable. Not sure if/how easy it is to fix or where to start looking. Maybe some rounding up to the next long is needed when any local variables are assigned stack space.
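
The numbers in the two listings can be checked by hand. Taking the caller's PTRA before the call as the reference, "fill" is written at +4, the caller then adds 16, the callee's setq #12 / wrlong saves 13 longs (52 bytes), and the callee adds 17 more for locals. A small C sketch of that bookkeeping, assuming ptra[n] on a long access scales n by 4 as described above (the function names are made up):

```c
#include <stdint.h>

/* Signed byte distance from the callee's PTRA to the "fill" slot,
   for a given number of bytes reserved for locals. */
static int32_t fill_byte_offset(int32_t locals_bytes)
{
    const int32_t fill_slot   = 4;       /* caller: wrlong r29 at SP+4     */
    const int32_t args_bytes  = 16;      /* caller: add ptra, #16          */
    const int32_t saved_bytes = 13 * 4;  /* callee: setq #12 + wrlong ++   */

    return fill_slot - (args_bytes + saved_bytes + locals_bytes);
}

/* A long-indexed access can only reach offsets that are multiples of 4. */
static int reachable_by_long_index(int32_t byte_offset)
{
    return byte_offset % 4 == 0;
}

/* Round a frame size up to the next multiple of 4. */
static int32_t round_up_to_long(int32_t bytes)
{
    return (bytes + 3) & ~3;
}
```

With 17 bytes of locals the slot is 81 bytes below PTRA, which no long index can reach exactly; ptra[-21] lands 84 bytes down, 3 bytes below the slot. Rounding the locals up to 20 bytes puts the slot at exactly 84 bytes, which is ptra[-21].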

EDIT: I found with some tweaks to the stack space by adding 3 more bytes to a local buffer it would align to a multiple of 4 and start to get further, so it's definitely some alignment issue. It still crashes though as there are deeper layers to also fix.

char buf[INT_BUF_SIZE+3]; <--- I added 3 to align the stack

Results:

calling mp_print_int
fill is a space
fill still a space

EDIT2: Digging a bit further I think this might well be related to the "eliminateFrameIndex" method in the target specific P2RegisterInfo.cpp file. If you google "llvm eliminateFrameIndex" the AI overview response indicates this is probably the area of code needing to be investigated for this problem I am seeing above. That method seems to be where the RDXXXX instructions that use PTRA indexing are being selected/configured and something may be going awry there with the stack frame offsets so I need to try to figure out what it does and if/how to fix it.

rogloh · 2026-04-09 04:51

@rogloh said:
EDIT2: Digging a bit further I think this might well be related to the "eliminateFrameIndex" method in the target specific P2RegisterInfo.cpp file. If you google "llvm eliminateFrameIndex" the AI overview response indicates this is probably the area of code needing to be investigated for this problem I am seeing above. That method seems to be where the RDXXXX instructions that use PTRA indexing are being selected/configured and something may be going awry there with the stack frame offsets so I need to try to figure out what it does and if/how to fix it.

Actually I found something else which seemed like it might fix the problem. When you instantiate the P2 specific derived class from its TargetFrameLowering parent class you can also configure the stack alignment. If I set this up to keep the stack aligned using Align(4) it now seems to not crash in MicroPython like it used to and the stack is rounded up to a multiple of 4 when you allocate local variables of arbitrary size.

In P2FrameLowering.h it was using this before, which results in the crash:

       explicit P2FrameLowering(const P2TargetMachine &TM)
             : TargetFrameLowering(StackGrowsUp, Align(1), 0), tm(TM) {

I just changed it to this and I don't get the crash any more:

        explicit P2FrameLowering(const P2TargetMachine &TM)
            : TargetFrameLowering(StackGrowsUp, Align(4), 0, Align(4)), tm(TM) {

The parent class is from the LLVM codebase, and I also set the TransientStackAlignment to 4, which would otherwise default to 1 if not passed into the initializer.

The description of the TransientStackAlignment member is provided below, which is what we likely need to maintain on a 4 byte boundary to ensure that the computation of the indexes relative to PTRA will be correctly addressing the data items on the stack whenever they are accessed in the function.

  /// getTransientStackAlignment - This method returns the number of bytes to
  /// which the stack pointer must be aligned at all times, even between
  /// calls.
  Align getTransientStackAlign() const { return TransientStackAlignment; }

I imagine this fix is probably the way to resolve the issue even though it might have to pad the stack between every single small local item allocated to ensure 32 and 16 bit data can be accessed via indexed reads/writes. Regardless it's good if this is a workable fix in all cases as we do need one if P2LLVM is to be reliable when generating P2 machine code from C. Code can just crash otherwise.

Another idea I was thinking about was using the currently unused PTRB as some sort of stack frame base register that is loaded from the SP (PTRA) in the prolog and is used for accessing stack frame variables via indexed reads/writes. This lets PTRA grow by arbitrary byte sizes during function execution without affecting the indexing offsets as they'll remain static. It may also help to allow performing "alloca" operations on the stack as well, which I'm not sure is currently supported by P2LLVM. Coding all that is beyond my level of capability at this point however.

Rayman · 2026-04-09 05:14

@rogloh sounds like good progress. Think I remember building this in Windows a while ago. Guess I’ll try again when you think it’s ready…

Rayman · 2026-04-09 10:41

Guess it'd be interesting to try to compile RISC-V P2 on Windows again with this. Couldn't get anywhere before, but maybe worth a shot... Then, maybe could try that version of Micropython too... Would love to have this in Windows. Not really at home in Linux...

Rayman · 2026-04-20 21:39

Wonder what the chances are of Spin2Cpp producing C or C++ code that will work with Clang...
Guess it would be nice if it did, but guessing there's no way that embedded PASM2 code would work.

Also wondering about XMM. Does P2 GCC have this?
That was long before there was the PSRAM driver.
This seems like a lot of work to implement...

Wuerfel_21 · 2026-04-20 21:58

I think spin2cpp for P2 code is somewhat uncharted waters. If it doesn't work it should be fixable though.
Inline ASM, might actually be workable. Should be possible to convert flexspin IR for inline ASM into GCC-style asm blocks. Though with some limitations of course.

Rayman · 2026-04-20 22:12

@rogloh Believe have applied your fixes from:
https://forums.parallax.com/discussion/169862/micropython-for-p2/p23
post# 687

Does one of the test programs prove that might have done it right?

Rayman · 2026-04-21 02:00

Just occurred that micropython might be the best test…

But don’t have that for Windows version …

rogloh · 2026-04-21 02:22

@Rayman said:
@rogloh Believe have applied your fixes from:
https://forums.parallax.com/discussion/169862/micropython-for-p2/p23
post# 687

Does one of the test programs prove that might have done it right?

To prove the changes were applied to your toolchain you can try compiling a test program using a simple modulus operation in C. Before the fix this would fail to even build and would crash the toolchain.

#include <stdint.h>

uint32_t test(uint32_t x, uint32_t y)
{
    return x % y;
}
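
Since a remainder of zero can also come out of a broken build, it may be worth checking a few operand pairs with non-zero remainders once the file compiles. A minimal sketch, where mod_under_test stands in for the test function above and count_mod_failures is a made-up helper name; the inputs go through volatile locals so the compiler cannot fold the remainders at build time:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the function being checked. */
static uint32_t mod_under_test(uint32_t x, uint32_t y)
{
    return x % y;
}

/* Returns the number of cases whose remainder is wrong. */
static int count_mod_failures(void)
{
    /* Each row is: x, y, expected x % y. */
    static const uint32_t cases[][3] = {
        { 100u, 7u, 2u },
        { 2223u, 10u, 3u },
        { 0xFFFFFFFFu, 16u, 15u },
        { 5u, 9u, 5u },
    };
    int failures = 0;

    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        /* volatile forces the division to happen at run time */
        volatile uint32_t x = cases[i][0];
        volatile uint32_t y = cases[i][1];

        if (mod_under_test(x, y) != cases[i][2])
            failures++;
    }
    return failures;
}
```

Calling count_mod_failures() from main and printing the result should give 0 on a working toolchain.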

The other changes I put in are harder to test as it depends on optimized code generation etc, but I've already tested those.

Inline asm is actually working in Clang if there is a definition for the assembly instruction and you use the correct (IBM? AT&T) format for inline instruction syntax. Right now not all of the instructions have been defined, since the C compiler didn't need them all. I've been adding a few more to my local P2LLVM build as I go, but still need to test that they generate the correct machine instructions. The good thing is that once they are defined they will also show up in the disassembly listings when the opcode is decoded.

Here's the list I am making with the new P2 instructions I've added in to date, primarily because evanh's inline code uses some of these for the SD card driver in SD mode and I hope to port that to MicroPython:

defm JMPREL     : P2InstLDm<0b1101011, 0b000110000, "jmprel">;
defm MUXNITS    : P2InstIDSm<0b100111100, "muxnits">;
defm MUXNIBS    : P2InstIDSm<0b100111101, "muxnibs">;
defm MUXQ       : P2InstIDSm<0b100111110, "muxq">;
defm ALTR       : P2InstIDSm<0b100110000, "altr">;
defm ALTB       : P2InstIDSm<0b100110011, "altb">;
defm ALTI       : P2InstIDSm<0b100110100, "alti">;
defm XZERO      : P2InstLIDSm<0b11001011, "xzero">;
defm CRCBIT     : P2InstIDSm<0b100111010, "crcbit">;
defm CRCNIB     : P2InstIDSm<0b100111011, "crcnib">;
def GETRND      : P2InstCZD<0b1101011, 0b000011011, (outs P2GPR:$d), (ins), "getrnd $d">;
defm COGATN     : P2InstLDm<0b1101011, 0b000111111, "cogatn">;
def WAITPAT     : P2InstCZ<0b11010110, 0b000011000, 0b000100100, (outs), (ins), "waitpat">;
def WAITFBW     : P2InstCZ<0b11010110, 0b000011001, 0b000100100, (outs), (ins), "waitfbw">;
def WAITXMT     : P2InstCZ<0b11010110, 0b000011010, 0b000100100, (outs), (ins), "waitxmt">;
def WAITXFI     : P2InstCZ<0b11010110, 0b000011011, 0b000100100, (outs), (ins), "waitxfi">;
def WAITXRO     : P2InstCZ<0b11010110, 0b000011100, 0b000100100, (outs), (ins), "waitxro">;
def WAITXRL     : P2InstCZ<0b11010110, 0b000011101, 0b000100100, (outs), (ins), "waitxrl">;
def WAITATN     : P2InstCZ<0b11010110, 0b000011110, 0b000100100, (outs), (ins), "waitatn">;
def WAITQMT     : P2InstCZ<0b11010110, 0b000011111, 0b000100100, (outs), (ins), "waitqmt">; 
defm SKIP       : P2InstLDm<0b1101011, 0b000110001, "skip">;
defm SKIPF      : P2InstLDm<0b1101011, 0b000110010, "skipf">;
defm EXECF      : P2InstLDm<0b1101011, 0b000110011, "execf">;
defm SETR       : P2InstIDSm<0b100110101, "setr">;
defm SETD       : P2InstIDSm<0b100110110, "setd">;
defm SETS       : P2InstIDSm<0b100110111, "sets">;

I also need to add MODCZ, but it is probably not going to parse correctly yet, as that instruction doesn't map onto any of the existing defined formats and I'd need to add a new one. Or temporarily I could perhaps hack it to take an 8 bit argument instead of two 4 bit argument values.
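
As a sketch of the 8 bit workaround: the two 4 bit selectors could be packed into one byte by the caller and passed as a single immediate. The helper below assumes the C selector occupies the high nibble and the Z selector the low nibble, which should be checked against the P2 instruction encoding before relying on it; the function name is made up.

```c
#include <stdint.h>

/* Pack two 4-bit MODCZ selectors into one 8-bit immediate.
   Nibble order (C high, Z low) is an assumption of this sketch. */
static uint8_t pack_modcz_operand(uint8_t c_sel, uint8_t z_sel)
{
    return (uint8_t)(((c_sel & 0x0Fu) << 4) | (z_sel & 0x0Fu));
}
```

Bits above the low four of each selector are masked off, so out-of-range values cannot spill into the other field.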

Ideally in the long term all P2 instructions would be defined so the disassembler won't crash if it finds some opcode that is still missing.

@Wuerfel_21 said:
Inline ASM, might actually be workable. Should be possible to convert flexspin IR for inline ASM into GCC-style asm blocks. Though with some limitations of course.

Exactly, and I'm hoping to do this for that SD card code. The trick will be ensuring the correct register names are tracked and used, not COG labels. I also need to prepend some sort of Fcache loader so it can bring the inline PASM2 code into COGRAM before calling it. I think it's achievable. LLVM uses registers r0-r31 at $1d0-$1ef, and everything below is free. LUTRAM is already used for some "fcached" libc stuff, although it's questionable which libc builtins deserve to be there. EDIT: one problem I have is that the assembler won't know this Fcache'd code is targeting COGRAM, so the call addresses inside this block will be wrong and need adjusting somehow.
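
The register addresses can be read off the listings earlier in this thread. A minimal C sketch, assuming the usual P2 layout where the S field is bits 8..0 and the D field is bits 17..9 of the little-endian instruction long (the helper names are made up); it can be fed the bytes ed a1 03 f6 shown earlier for mov r0, r29:

```c
#include <stdint.h>

/* Assemble the four listing bytes (lowest address first) into a long. */
static uint32_t long_from_listing_bytes(const uint8_t b[4])
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Source operand field: bits 8..0. */
static uint32_t s_field(uint32_t insn)
{
    return insn & 0x1FFu;
}

/* Destination operand field: bits 17..9. */
static uint32_t d_field(uint32_t insn)
{
    return (insn >> 9) & 0x1FFu;
}
```

For those bytes this gives D = $1D0 for r0 and S = $1ED for r29, which puts r0-r31 at $1D0-$1EF.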

Rayman · 2026-04-21 07:57

Haven’t had coffee yet but why fcache inline assembly? Does Clang not convert everything to assembly like flexprop?

rogloh · 2026-04-21 10:39

@Rayman said:
Haven’t had coffee yet but why fcache inline assembly? Does Clang not convert everything to assembly like flexprop?

Yes, Clang does compile C applications to hubexec code by default. I needed a way to do Fcache inline only because some time critical SD card routines in evanh's code have to run in COGRAM as they use the streamer, which is incompatible with hubexec.

Rayman · 2026-04-21 10:54

Hmm PSRAM uses streamer too, but guess can be in own cog...

Maybe can combine PSRAM and 4-bit SD into one driver that uses its own cog?

Rayman · 2026-04-23 11:18

@rogloh Just tried your patches on Windows build...

Doesn't crash, but gives wrong answer. Or, maybe I'm doing it wrong?

Maybe missed something, guess I'll try building LLVM on Windows again...

rogloh · 2026-04-23 12:41

@Rayman said:
@rogloh Just tried your patches on Windows build...

Doesn't crash, but gives wrong answer. Or, maybe I'm doing it wrong?

Maybe missed something, guess I'll try building LLVM on Windows again...

Looks like it is working. 100 mod 50 is zero which is what is being printed.

By the way, just noticed that in the disassembly of your code, the arguments in r0 and r1 are being saved on the stack and read back unnecessarily. Maybe you didn't have the optimizer on. I should retry with the -Os enabled. This sort of stuff is really annoying to see ending up in the code. We shouldn't be trashing the registers so early given we have 32 of them now.

00000aa4 <test>:
     aa4: 28 02 64 fd                           setq    #1
     aa8: 61 a1 67 fc                           wrlong  r0, ptra++
     aac: 08 f0 07 f1                           add     ptra, #8 
     ab0: 3e a1 67 fc                           wrlong  r0, ptra[-2]
     ab4: 3f a3 67 fc                           wrlong  r1, ptra[-1]
     ab8: 3e a1 07 fb                           rdlong  r0, ptra[-2] 
     abc: 3f a3 07 fb                           rdlong  r1, ptra[-1] 
     ac0: d1 a1 13 fd                           qdiv    r0, r1
     ac4: 19 a0 63 fd                           getqy   r0 
     ac8: d0 df 03 f6                           mov     r31, r0 
     acc: 08 f0 87 f1                           sub     ptra, #8 
     ad0: 28 02 64 fd                           setq    #1
     ad4: 5f a1 07 fb                           rdlong  r0, --ptra 
     ad8: 2e 00 64 fd                           reta

In other news I've been adding a couple more instructions to P2LLVM: LOC and MODCZ. LOC was a PITA to get working and there is still a hard-coded register value in there that I don't understand where it comes from but need to use to make it work. Currently working on MODCZ, which doesn't have a class template yet and needs to be built up from scratch. LOC is assembling and disassembling now; it only accepts PA, PB, PTRA, PTRB as destination registers and generates the correct machine code. I also tidied up the disassembly output format with a tab between the mnemonic and operands (see above), which looks neater.
UPDATE: nope, all optimizing settings from -Os through -O3 don't change the listing and they all preserve r0, r1 on the stack and read them back from there even though the arguments are being passed through registers in the call to the test function. Not sure why LLVM is doing this when it shouldn't need to. It's not like QDIV or GETQY have to make use of r0, and r1 (any registers can be used) and given it's a leaf function it doesn't need to preserve its locals on the stack either. It should be using r31 in the GETQY and ideally not even save the registers on a stack frame with the block save/restore setq bursts.

     a58: 64 a0 07 f6                       mov r0, #100
     a5c: 32 a2 07 f6                       mov r1, #50
     a60: a4 0a c0 fd                       calla   #\test
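
For reference, a minimal C sketch of the kind of function that would plausibly produce the listing above. The actual source of test was not posted, so the signature and the unsigned types here are assumptions, based only on the QDIV/GETQY pair (GETQY returns the remainder) and the two register arguments.

```c
/* Assumed reconstruction of the function under test; the real source
   was not posted in the thread. */
__attribute__((noinline))
unsigned int test(unsigned int a, unsigned int b)
{
    /* On the P2 this maps to QDIV a, b followed by GETQY (remainder). */
    return a % b;
}
```

Called as test(100, 50), this returns 0, matching the value printed. Being a leaf function with both arguments already in r0 and r1, nothing in the C source itself requires a stack frame.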

Rayman · 2026-04-23 12:50

So it is! This is the danger of posting something before coffee...

rogloh · 2026-04-27 06:49

Further to this post above: https://forums.parallax.com/discussion/comment/1572936/#Comment_1572936
I have identified the remaining missing P2 assembly instructions that are not yet recognised using inline assembly in P2LLVM. The C language does not currently need to make use of them, although it could for TJS/TJNS if comparing and branching on a signed value. Can't see all that much use for the others with C/C++ functionality, unless we can identify instruction masking/shifting sequence candidates for using SETNIB/GETNIB/SETBYTE/GETBYTE/SETWORD/GETWORD, and maybe register bit setting/testing with BITH/BITL and TESTB instead of and/or masking with constants etc. We need some common patterns of instructions to identify in order to make use of a specific or custom instruction.
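
As a rough sketch of the kind of common C patterns meant here, the helpers below show shift/mask sequences that could be candidates for matching. The helper names are made up for illustration, and the instruction named in each comment is only the intended candidate, not something P2LLVM currently selects.

```c
#include <stdint.h>

/* Field extraction: candidates for GETNIB / GETBYTE / GETWORD. */
static inline uint32_t get_nib(uint32_t x, unsigned n)  { return (x >> (4u * (n & 7u))) & 0xFu; }
static inline uint32_t get_byte(uint32_t x, unsigned n) { return (x >> (8u * (n & 3u))) & 0xFFu; }
static inline uint32_t get_word(uint32_t x, unsigned n) { return (x >> (16u * (n & 1u))) & 0xFFFFu; }

/* Field insertion: candidate for SETBYTE (SETNIB/SETWORD are analogous). */
static inline uint32_t set_byte(uint32_t x, uint32_t v, unsigned n)
{
    unsigned sh = 8u * (n & 3u);
    return (x & ~(0xFFu << sh)) | ((v & 0xFFu) << sh);
}

/* Single-bit operations: candidates for BITH / BITL / TESTB. */
static inline uint32_t bit_high(uint32_t x, unsigned b) { return x | (1u << (b & 31u)); }
static inline uint32_t bit_low(uint32_t x, unsigned b)  { return x & ~(1u << (b & 31u)); }
static inline int      test_bit(uint32_t x, unsigned b) { return (int)((x >> (b & 31u)) & 1u); }

/* Sign test feeding a branch: candidate for TJS / TJNS. */
static inline int is_negative(int32_t x) { return x < 0; }
```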

It's a fair bit of work to include all these for inline assembly but these remaining instructions ideally should be added over time (there's about 142 or so in this list, although some are just aliases).

ADDPIX AKPIN ALLOWI ALTGB ALTGN ALTGW ALTSB ALTSN ALTSW BITC BITH BITL BITNC BITNOT BITNZ BITRND BITZ BLNPIX CALLB CALLD CALLPA CALLPB CMPM CMPSUB DECMOD DIRC DIRNC DIRNZ DIRRND DIRZ DRVC DRVNC DRVNZ DRVRND DRVZ FBLOCK FLTC FLTNC FLTNZ FLTRND FLTZ GETPTR GETSCP GETXACC INCMOD JATN JCT1 JCT2 JCT3 JFBW JINT JNATN JNCT1 JNCT2 JNCT3 JNFBW JNINT JNPAT JNQMT JNSE1 JNSE2 JNSE3 JNSE4 JNXFI JNXMT JNXRL JNXRO JPAT JQMT JSE1 JSE2 JSE3 JSE4 JXFI JXMT JXRL JXRO MIXPIX MODC MODZ MULPIX MUXC MUXNC MUXNZ MUXZ NEGC NEGNC NEGNZ NEGZ NIXINT1 NIXINT2 NIXINT3 OUTC OUTNC OUTNZ OUTRND OUTZ POPA POPB PUSHA PUSHB RCZL RCZR RESI0 RESI1 RESI2 RESI3 RETI0 RETI1 RETI2 RETI3 RFVAR RFVARS SCA SCAS SETCFRQ SETCI SETCMOD SETCQ SETCY SETDACS SETINT1 SETINT2 SETINT3 SETLUTS SETPAT SETPIV SETPIX SETSCP STALLI TESTN TJF TJNF TJNS TJS TJV TRGINT1 TRGINT2 TRGINT3 WAITINT WMLONG XSTOP

The aliases in this set above are these:
AKPIN, MODC, MODZ, PUSHA, PUSHB, POPA, POPB

and the remaining ones were mentioned in the linked post above. I also discovered there is no such "WAITQMT" instruction even though there appears to be an encoding available for it. I wonder what that will do if executed.

If it existed this would be its opcode definition
WAITQMT {WC/WZ/WCZ} Events - Wait EEEE 1101011 CZ0 000011111 000100100

Update: just thought of a use for MOVBYTS instruction (already coded but not linked to any pattern). LLVM has an endian swap intrinsic which may be able to make use of it.

2. Clang C/C++ Builtins 
If you are writing C or C++ code and want to ensure Clang emits the most efficient machine-level instruction (like BSWAP on x86 or REV on ARM), use the following Clang built-ins:
__builtin_bswap16(uint16_t x)
__builtin_bswap32(uint32_t x)
__builtin_bswap64(uint64_t x)
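
A small sketch of both ways an endian swap can show up in C: the builtin, and the hand-written shift/mask form. The function names are made up for illustration. Whether the manual form gets recognised as a byte swap depends on the optimizer, and mapping either one to MOVBYTS would still need a pattern in the backend.

```c
#include <stdint.h>

/* Hand-written 32-bit byte reversal using shifts and masks. */
uint32_t swap32_manual(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* Same operation through the Clang/GCC builtin. */
uint32_t swap32_builtin(uint32_t x)
{
    return __builtin_bswap32(x);
}
```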

Also there are possible uses of FGES/FGE/FLE/FLES for umax/umin/smax/smin intrinsics. I don't think they are connected up yet as a pattern.
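
For illustration, these are the C shapes that would normally end up as min/max operations. The function names are hypothetical, and the instruction in each comment is only the candidate mapping (FGE/FGES force D to be at least S, FLE/FLES force D to be at most S), not what the backend emits today.

```c
#include <stdint.h>

uint32_t umax32(uint32_t a, uint32_t b) { return a > b ? a : b; } /* FGE candidate  */
uint32_t umin32(uint32_t a, uint32_t b) { return a < b ? a : b; } /* FLE candidate  */
int32_t  smax32(int32_t a, int32_t b)   { return a > b ? a : b; } /* FGES candidate */
int32_t  smin32(int32_t a, int32_t b)   { return a < b ? a : b; } /* FLES candidate */

/* A signed clamp is just one smax followed by one smin. */
int32_t clamp32(int32_t x, int32_t lo, int32_t hi)
{
    return smin32(smax32(x, lo), hi);
}
```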

Rayman · 2026-04-27 09:45

Maybe the ones that operate on pins would be nice. Also smart pins like ackpin. Not seeing much else for inline assembly…

BTW: working building llvm on a few Ubuntu installs. Probably the only way to build micropython. Previous attempts to build upy on windows were failures…

Rayman · 2026-04-27 09:47

Push and pop maybe? Think that can be useful to save registers that one is not supposed to touch…

Rayman · 2026-04-27 09:51

Might attempt to compile tensor flow micro again with this…. Last attempt (using riscvp2) failed, probably because not so great with Linux…

rogloh · 2026-04-27 10:48

@Rayman said:
Maybe the ones that operate on pins would be nice. Also smart pins like ackpin. Not seeing much else for inline assembly…

We should try to get all these in eventually IMO. Pin control is definitely needed in inline ASM, probably as a priority, but there already are some in there like DRVH/DRVL/DIRH/DIRL/FLTH/FLTL, just not the variants that take the pin state from C or Z (DRVC/DRVZ etc.), or the RND ones.

BTW: working building llvm on a few Ubuntu installs. Probably the only way to build micropython. Previous attempts to build upy on windows were failures…

Lachlan talked to his buddies at his MicroPython meetup and I think some seemed to use Windows PCs but they might have needed Cygwin or other GNU like environment configured (MSYS?) so that make and other utilities it needs worked etc. But a real Linux machine or Mac is probably more ideal in order to build it without so much messing about first.

Push and pop maybe? Think that can be useful to save registers that one is not supposed to touch…

PUSHA and POPA (hub stack) were already in there. I also recently added PUSH and POP for use with the small internal COG stack along with MODCZ and LOC and the other ones listed above.

LOC could be used for loading immediates into PA/PB/PTRA/PTRB if sized to fit in 20 bits. But PTRA is already used for the stack, so loading immediates into it is not likely. The others aren't used a lot compared to the R0-R31 GP regs. In fact, I think it'd be nice to make use of PTRB as a base pointer (BP) so that a function's locals and arguments can be accessed with indexed RDLONG x, PTRB[nn] type operations instead of offsetting the stack pointer in a separate register every time. It would also enable proper alloca() to work, which dynamically adds stack-based memory to a function when needed and automatically frees it upon function return. Having the extra BP register would allow PTRA to float around while PTRB remains fixed. Achieving this requires an extra push to the stack to save it and a pop to restore it upon returning, so it's not cost free, but it would let the P2 run C code that requires alloca().

So basically for function calls it'd be more like this -

caller:
  optionally push any varargs, and any other args that don't fit into registers
  call the function - (return address is pushed automatically)

callee:
  push ptrb  ' save old BP to stack
  mov ptrb, ptra ' setup new BP for this new stack frame

  setq #xx  ' xx is number of registers used in function to preserve - 1
  wrlong r0, ptra++  ' this block is done only when needed

  add ptra, #sizeof(locals) 
  ...
  alloca increases ptra as needed and use old ptra as alloc'd pointer
  ...

  then when the function returns:

  add ptrb, #4*(xx+1)  - move BP to end of preserved regs
  setq #xx
  rdlong r0, --ptrb ' restore regs (this block only done if needed)

  mov ptra, ptrb ' restore stack pointer
  pop ptrb ' restore BP to prior stack frame
  reta ' return to caller

caller:
  sub ptra, #yy - account for argument space consumed

In the function:
PTRB[-x] would access varargs and other non-register arguments passed in.
PTRB[+x] would access a function's locals (when they fit within the first 32 indexed entries, or via AUGS use otherwise)
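
To make the alloca() case concrete, here is a small C sketch of a function whose stack usage is only known at run time. The function name is made up, and __builtin_alloca is used rather than a header so the example does not depend on any particular libc. While scratch is live, the stack pointer has moved by an amount the compiler cannot know, which is why fixed locals need a separate base pointer to be addressed from.

```c
#include <stddef.h>
#include <string.h>

/* Sums n bytes of src through a scratch copy that lives on the stack.
   The scratch size is a run-time value, so the stack pointer offset to
   the other locals is not a compile-time constant inside this function. */
unsigned int sum_via_scratch(const unsigned char *src, size_t n)
{
    unsigned char *scratch = __builtin_alloca(n ? n : 1);
    unsigned int total = 0;

    memcpy(scratch, src, n);
    for (size_t i = 0; i < n; i++)
        total += scratch[i];

    return total;   /* scratch is released automatically on return */
}
```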

Downside is extra register push and pop plus additional instructions in functions, although some of that space is reclaimable when code can be simplified if arguments are otherwise computing a stack offset manually like here (in this case with the SP but could also be done with the BP).

   33d7c: f8 a3 03 f6                       mov r1, ptra
   33d80: 04 a2 07 f1                       add r1, #4
   33d84: d1 01 00 ff                       augs    #$3a200 >> 9
   33d88: a8 a9 07 f6                       mov r4, #$1a8
   33d8c: d1 a9 63 fc                       wrlong  r4, r1

That code could become this:

  augs #$3a200>>9
  mov r4, #$1a8
  wrlong r4, ptra[-4]

or ideally for immediates, just this, which is only 2 instructions:

  wrlong ##$3a201a8, ptra[-4]

evanh · 2026-04-27 10:55

Shouldn't that be augs #$3a201a8 >> 9

rogloh · 2026-04-27 10:57

@evanh said:
Shouldn't that be augs #$3a201a8 >> 9

Oh yeah. You are right. Makes sense to print the whole value here so you don't have to concat in your head with the following #S constant.

evanh · 2026-04-27 11:02

The last one could then be:

  augd    #$3a201a8 >> 9
  wrlong  #$3a201a8 & $1ff, ptra[-4]
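
The split being discussed can be checked with a few lines of C. This is only a sketch with made-up helper names: AUGS/AUGD carry the upper 23 bits and the instruction's own immediate field carries the lower 9. One thing worth double-checking, if I'm reading the encoding in the earlier listing right: the bytes d1 01 00 ff give an AUGS field of $1d1, so the $3a200 printed there is already the shifted-up upper part, and OR-ing in $1a8 would give $3a3a8 rather than the digit-wise concatenation $3a201a8.

```c
#include <stdint.h>

/* Upper 23 bits, carried by AUGS/AUGD. */
static inline uint32_t aug_upper(uint32_t v) { return v >> 9; }

/* Lower 9 bits, carried in the instruction's immediate field. */
static inline uint32_t aug_lower(uint32_t v) { return v & 0x1FFu; }

/* Recombine the two fields into the full 32-bit constant. */
static inline uint32_t aug_join(uint32_t upper, uint32_t lower)
{
    return (upper << 9) | (lower & 0x1FFu);
}
```

With these, aug_join(0x1d1, 0x1a8) gives 0x3a3a8, while aug_upper(0x3a201a8) gives 0x1d100, which is a different AUGS field from the one in the listing.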
