FlexSpin: is mixed HUB/LUT exec possible?

ManAtWork · 2023-08-29 08:43

The next challenge for my CNC project is to implement a VFD, if possible in a single cog. My plan is to run two loops.

The faster loop polls the current sensor ADCs, calculates the next sine wave values from the frequency and voltage commands and feeds the PWM smart pins. This has to be coded in assembler and could be done in an interrupt triggered 20,000 times per second.

The slower loop calculates frequency and voltage commands from the speed command and motor load (current and phase angle). It needs to run about 1000 times per second; maybe even 100/s is enough. It would be a lot easier if I could code this in Spin or C using floating point math.

Is this a bad idea, or is it even possible? Of course the compiler uses most of the COG ram register space, so there will not be much COG ram left for my interrupt service code. But that code could also execute from LUT ram, I think. I also need the CORDIC for sine wave synthesis, and the compiler uses it for floating point operations as well. So I have to protect all float ops from interrupts with STALLI/ALLOWI. Some delay should be no problem as long as the next PWM Y-values are ready before the next PWM frame period (50µs).

How much LUT ram and which cog registers are available?
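
To put numbers on that budget: a 20 kHz interrupt leaves one 50 µs frame per pass. A minimal C sketch of the arithmetic follows. The 200 MHz system clock used in the usage note is only an assumption (the post does not state one), the function names are made up for illustration, and two clocks per instruction is the usual best case on the P2, so the real instruction count will be lower once hub and CORDIC waits are included.

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles available between two interrupts at the given rate. */
static uint32_t cycles_per_frame(uint32_t sysclk_hz, uint32_t irq_rate_hz)
{
    assert(irq_rate_hz != 0);
    return sysclk_hz / irq_rate_hz;
}

/* Rough upper bound on instructions per frame, assuming 2 clocks each. */
static uint32_t max_instructions_per_frame(uint32_t sysclk_hz,
                                           uint32_t irq_rate_hz)
{
    return cycles_per_frame(sysclk_hz, irq_rate_hz) / 2;
}
```

With an assumed 200 MHz clock, cycles_per_frame(200000000, 20000) gives 10000 clocks per frame, so the ISR and the compiled main loop together share at most about 5000 instruction slots per 50 µs.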

pik33 · 2023-08-29 11:24

No interrupts are available for a cog that executes Flexprop's compiled code. You need to use a dedicated cog there. I don't know how Pnut/Propeller Tool treats this.

evanh · 2023-08-29 11:45

Pnut has everything critical shielded so supports the use of interrupts with Spin2 code.

As an alternative that could be used with Flexspin, the JMPRET instruction in the Prop1 is used for cooperative switching. Fast way to task switch. Same as the LINK instruction from other CPUs.

Prop2 names it CALLD. The ISR call mechanism of the Prop2 is hardwired CALLDs using two shadow RAM registers for debug ISR and six plain cogRAM registers for the three regular ISRs.

An example is provided in the silicon doc ADC_to_VGA_millivolts.spin2 using RESI instructions. These again are aliases of specific CALLD use.

ManAtWork · 2023-08-29 12:58

@pik33 said:
No interrupts are available for a cog that executes Flexprop's compiled code.

Why? I'd never say never. But maybe the restrictions that apply make it so painful that it's easier to do it in a different way. I'll wait for Eric's opinion.

@evanh said:
As an alternative that could be used with Flexspin, the JMPRET instruction in the Prop1 is used for cooperative switching. Fast way to task switch. Same as the LINK instruction from other CPUs.
...
An example is provided in the silicon doc ADC_to_VGA_millivolts.spin2 using RESI instructions. These again are aliases of specific CALLD use.

Good idea. But as the code of the fast loop is executed exactly once per interrupt trigger and the slower loop resumes execution after return from interrupt, anyway, there is no need for RESIx. But that would mean I have to write both parts in PASM. I could try to find or strip down a floating point library that fits into cog ram and can be called from PASM code. Or I could bite the bullet and implement everything with fixed point math.

evanh · 2023-08-29 13:42

Ha, I was meaning CALLD can be interspersed, as inline pasm, within C code to do low-overhead cooperative switching. Not using actual interrupts at all. The RESI thingy was just an example I knew of.

evanh · 2023-08-29 13:45

@ManAtWork said:

@pik33 said:
No interrupts are available for a cog that executes Flexprop's compiled code.

Why?

The compiler and libraries would need to be reworked to protect all sensitive parts that need to be atomic. Chip has carefully done exactly this the whole time he's been developing Pnut. If you ever look at his sources you'll see plenty of cases of REP @label, #1 inserted ahead of an atomic operation.

ManAtWork · 2023-08-29 15:22

Ok, making Spin and C code generally compatible with interrupts would require a lot of work, I agree. But if I'm careful and don't call any libraries it should be possible if there are enough free registers so that interrupts and compiled code don't step on each other's feet.

The CORDIC operations (hopefully, AFAICS) are the only things that need special attention. I have to figure out which floating point operations use the CORDIC and protect them with inline STALLI/ALLOWI pairs.

Wuerfel_21 · 2023-08-29 15:27

REP is better for interrupt protection. Also, most cordic ops are 32 bit multiplies, which can be emulated in roughly the same time as a QMUL using MUL/MULS and some bit fumbling.
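
The "bit fumbling" can be sketched in C as follows. This is only an illustration of the arithmetic, not code from FlexSpin or Pnut: mul16 stands in for what a single 16 x 16 MUL computes, and all function names are hypothetical. Whether the result really matches QMUL timing on hardware is not something this sketch can show.

```c
#include <assert.h>
#include <stdint.h>

/* 16 x 16 -> 32 bit unsigned multiply, the operation one MUL performs. */
static uint32_t mul16(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* 32 x 32 -> 64 bit unsigned product from four 16-bit partial products. */
static uint64_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint32_t ll = mul16(a, b);
    uint32_t lh = mul16(a, b >> 16);
    uint32_t hl = mul16(a >> 16, b);
    uint32_t hh = mul16(a >> 16, b >> 16);

    /* The cross terms are weighted by 2^16, the high term by 2^32. */
    return (uint64_t)ll
         + ((uint64_t)lh << 16)
         + ((uint64_t)hl << 16)
         + ((uint64_t)hh << 32);
}

/* Low 32 bits only: three multiplies suffice because hh falls off the top. */
static uint32_t mul32_low(uint32_t a, uint32_t b)
{
    return mul16(a, b) + ((mul16(a, b >> 16) + mul16(a >> 16, b)) << 16);
}
```

If only the low 32 bits are needed, as is common for integer multiplies in compiled code, the three-multiply variant avoids both the CORDIC and the need to protect it from an interrupt.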

ManAtWork · 2023-08-29 15:33

On the other hand, although the CORDIC unit is cool and perfectly usable to generate three 120° phase shifted sine waves with minimum effort, I could also do it the good old way like the P1, that is with a lookup table. 16 bits resolution would be plenty for PWM so a 128 entry LUT only takes 256 bytes hub ram. Interpolating linearly between table entries should give nearly 16 bit precision. SCAS or MULS operations are fast. So there would be no need to protect the main code from interrupts.
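
The precision claim can be checked with a short C sketch. It assumes the 128 entries cover one quarter wave (plus one end point so the last interval can be interpolated), that entries are unsigned 16-bit values, and that the fraction is 8 bits wide; none of that is stated in the post, and all names are hypothetical. The sine is computed with a small Taylor series only so the example does not depend on a math library.

```c
#include <assert.h>
#include <stdint.h>

#define STEPS   128                      /* intervals per quarter wave (assumed) */
#define HALF_PI 1.57079632679489661923

/* Taylor series for sin(x), accurate to double precision on 0..pi/2. */
static double sin_taylor(double x)
{
    double term = x, sum = x;
    for (int k = 1; k <= 10; k++) {
        term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
        sum += term;
    }
    return sum;
}

static uint16_t table[STEPS + 1];

static void build_table(void)
{
    for (int i = 0; i <= STEPS; i++)
        table[i] = (uint16_t)(65535.0 * sin_taylor(HALF_PI * i / STEPS) + 0.5);
}

/* phase 0 .. STEPS*256-1 spans one quarter wave; low 8 bits are the fraction. */
static uint16_t sin_interp(uint32_t phase)
{
    uint32_t idx = phase >> 8, frac = phase & 0xFFu;
    assert(idx < STEPS);
    return (uint16_t)((table[idx] * (256u - frac) + table[idx + 1] * frac) >> 8);
}

/* Worst absolute error, in 16-bit LSBs, over every representable phase. */
static double max_error_lsb(void)
{
    double worst = 0.0;
    build_table();
    for (uint32_t p = 0; p < STEPS * 256u; p++) {
        double exact = 65535.0 * sin_taylor(HALF_PI * p / (STEPS * 256.0));
        double err = exact - (double)sin_interp(p);
        if (err < 0.0) err = -err;
        if (err > worst) worst = err;
    }
    return worst;
}
```

For linear interpolation with step h = (π/2)/128 the analytic error bound is h²/8 ≈ 1.9e-5 of full scale, a little over one LSB at 16 bits, so "nearly 16 bit" is a fair description; table rounding and the truncating shift add up to roughly another LSB and a half.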

Wuerfel_21 · 2023-08-29 15:57

I haven't really tried it, but I think you should be able to save CORDIC state in an interrupt.

Something like

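irq_vector
            pollqmt wc                  ' clear any stale QMT (CORDIC empty) event
            getqx x_save                ' fetch pending X result, if any
            getqy y_save                ' fetch pending Y result, if any
            pollqmt wc                  ' C=1 if the GETQx found the CORDIC empty
            wrnc cordic_save_flag       ' flag=1 only if a result was actually saved

            ' ... your ISR code ...

            cmp cordic_save_flag,#0 wz
    if_nz   setq y_save                 ' rotate (x,y) by angle 0 to put the
    if_nz   qrotate x_save,#0           ' saved result back into the CORDIC
            iret123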

ManAtWork · 2023-08-29 16:20

Interesting idea but I'm not sure if that works in all cases. What happens if the interrupt is triggered after a getqx but before the corresponding getqy? And there might be code that doesn't call getqy at all but only getqx. Should do no harm theoretically and you only restore qx/qy one time too many...

ersmith · 2023-08-29 18:31

@ManAtWork said:

@pik33 said:
No interrupts are available for a cog that executes Flexprop's compiled code.

Why? I'd never say never. But maybe the restrictions that apply make it so painful that it's easier to do it in a different way. I'll wait for Eric's opinion.

The compiler has no notion of interrupts at all, and does not even attempt to generate code that's interrupt safe. It'll happily convert loops to use REP (hence blocking interrupts for as long as the loop executes) and uses CORDIC operations wherever it can. So my advice is not to try to run interrupts in a COG running compiled code.

RossH · 2023-08-30 01:25

Catalina allows interrupts in C programs, and you have full access to the C library (including maths functions) in interrupt routines. While 20000 interrupts per second is not a problem, you can't do very much floating point maths in such an interrupt routine!

Here is an example:

 /***************************************************************************\
 *                                                                           *
 *    Demonstrates a C main program, with C interrupt service functions      *
 *                                                                           *
 * NOTE: This program will only work on the P2, since the P1 does not        *
 * support interrupts.                                                       *
 *                                                                           *
 * Compile it with a command like (in Catalina 5.9.4 or earlier):            *
 *                                                                           *
 *   catalina -p2 -lc -lmc -linterrupts interrupts.c -C P2_EVAL              *
 *                                                                           *
 * or (in Catalina 6.0 or later):                                            *
 *                                                                           *
 *   catalina -p2 -lc -lmc -lint interrupts.c -C P2_EDGE                     *
 *                                                                           *
 * Run it with a command like:                                               *
 *                                                                           *
 *   payload -i interrupts                                                   *
 *                                                                           *
 \***************************************************************************/

#include <math.h>
#include <stdio.h>
#include <stdlib.h>   // rand()
#include <propeller2.h>
#include <catalina_interrupts.h>

// define the stack size each interrupt needs 
#define INT_STACK_SIZE (MIN_INT_STACK_SIZE + 100)

// define the LED pin to use:
#if defined (__CATALINA_P2_EDGE)
#define PIN_SHIFT (38-32) // use pin 38 on the P2_EDGE
#define _dir _dirb
#define _out _outb
#elif defined (__CATALINA_P2_EVAL)
#define PIN_SHIFT (56-32) // use pin 56 on the P2_EVAL
#define _dir _dirb
#define _out _outb
#else
#define PIN_SHIFT 0 // use pin 0 on other platforms
#define _dir _dira
#define _out _outa
#endif

static unsigned long led_mask = 1<<(PIN_SHIFT);
static unsigned long on_off   = 1<<(PIN_SHIFT);
static unsigned long counter  = 0;

// return a random from 1 .. range
int random(int range) {
  return 1 + (int)(((float)range*(float)rand())/32768);
}

// interrupt_x : these functions can serve as interrupt service routines. 
//               Such functions must have no arguments, and must not return 
//               a value. They must be set as an interrupt using one of the 
//               _set_int_x functions. Note that we would not normally do 
//               anything as silly as calling 'printf' in an interrupt 
//               service routine - but we CAN!

// interrupt_1 - increment the counter 20000 times every second
void interrupt_1(void) {
    // we are a counter interrupt, so add to the counter for next time
    _add_CT1(_clockfreq()/20000);
    counter++;

}

// interrupt_2 - print the current value of the counter every second
void interrupt_2(void) {
    // we are a counter interrupt, so add to the counter for next time
    _add_CT2(_clockfreq());
    printf(" Counter = %lu \n", counter);
}

// interrupt 3 - throw a random spanner into the works!
void interrupt_3(void) {
    // we are a counter interrupt, so add to the counter for next time
    _add_CT3((random(10))*_clockfreq());
    printf(" << A SPANNER >> \n");
}

void main(void) {
    int i;

    // declare some space that will be used as stack by the interrupts
    long int_stack[INT_STACK_SIZE * 3];

    // set up the direction and initial value of the LED pin 
    _dir(led_mask, led_mask);
    _out(led_mask, on_off);

    printf("Press a key to start\n");
    k_wait();

    // these are counter interrupts so set up the initial counters 
    _set_CT1(_clockfreq()/20000);
    _set_CT2(_clockfreq());
    _set_CT3(1+random(10)*_clockfreq());

    // set up our interrupt service routines
    _set_int_1(CT1, &interrupt_1, &int_stack[INT_STACK_SIZE * 1]);
    _set_int_2(CT2, &interrupt_2, &int_stack[INT_STACK_SIZE * 2]);
    _set_int_3(CT3, &interrupt_3, &int_stack[INT_STACK_SIZE * 3]);

    // now just let the interrupts do their thing
    while(1) {
        // toggle the LED to show some activity. 
        _out(led_mask, (on_off ^= led_mask));
        for (i = 1; i < 2000000; i++) { }
    }
}

When executed, this prints:

Press a key to start
Counter = 20000
Counter = 40000
Counter = 60000
Counter = 80000
Counter = 100000
Counter = 120000
<< A SPANNER >>
Counter = 140000
Counter = 160000
<< A SPANNER >>
Counter = 180000
Counter = 200000
Counter = 220000
Counter = 240000
<< A SPANNER >>
Counter = 260000
Counter = 280000
Counter = 300000
Counter = 320000
Counter = 340000
Counter = 360000
<< A SPANNER >>
Counter = 380000
Counter = 400000
Counter = 420000
Counter = 440000
Counter = 460000
Counter = 480000
Counter = 500000

... etc ...

Ross.

ManAtWork · 2023-08-30 07:28

@ersmith said:
The compiler has no notion of interrupts at all, and does not even attempt to generate code that's interrupt safe. It'll happily convert loops to use REP (hence blocking interrupts for as long as the loop executes) and uses CORDIC operations wherever it can. So my advice is not to try to run interrupts in a COG running compiled code.

As I said, I know that it would be hard to generate interrupt safe code for any imaginable user and application. And if you said this was supported you'd have to continue to support it forever so I understand that you don't. But my code will be very simple. It's just one big loop doing calculations on a very limited number of variables on the stack and mailbox memory. No library functions will be called except for math functions.

All I need is at least one register where I can pass a stack pointer to save the state of some more registers and say 1/4 of the LUT ram for my code.

ManAtWork · 2023-08-30 07:37

@Wuerfel_21 said:
I haven't really tried it, but I think you should be able to save CORDIC state in an interrupt. Something like
irq_vector
            pollqmt wc
            getqx x_save
            getqy y_save
            pollqmt wc
            wrnc cordic_save_flag

If the interrupt is triggered just in the middle of a getqx and getqy pair then the pollqmt should set the C flag and clear the cordic_save_flag. So I think the save-flag is not of much use and we should try to just restore the CORDIC register in all cases.

ManAtWork · 2023-08-30 07:41

@RossH said:
Catalina allows interrupts in C programs, and you have full access to the C library (including maths functions) in interrupt routines. While 20000 interrupts per second is not a problem, you can't do very much floating point maths in such an interrupt routine!

I don't need floats in the ISR, just in the main loop running only some 100 times/s.

But unfortunately, Catalina is not an option. Many of the hardware drivers I use (OLED display and SPI bus, flash file system, for example) are written in Spin2. If I had to port all that code to C it would be more work than writing the VFD code all in assembler.

rogloh · 2023-08-30 07:42

@ManAtWork
What if you created some code that wrote out the full state of COGRAM & LUTRAM & CORDIC to HUB and then loaded in your own code to COG/LUT for execution, branch to it, and at the end it restores the COG&LUT RAM later including any CORDIC state? This would not be restarting the COG with new code, just taking it over temporarily. It could potentially work as long as you can preserve all state that flexspin VM uses. You probably need the Q register too, and not do this stuff inside an already executing REP block or something crazy. I think there should be ways it can be read/restored. You may only need to save a subset of memory if your code is small too.

rogloh · 2023-08-30 07:52

I think an ISR can already branch directly into HUB exec code right? If so, then the only resource you'd need to consume in the COG is whatever register(s) are needed for initially preserving the state. You can push the first one onto the HW stack and then use that free register to work from there, assuming we have at least one HW stack level free in the flexspin VM.
EDIT: One problem to consider with this might be preserving the fifo pointer...a getptr might not work if we have already jumped into HUB exec, unless that state is saved as part of the ISR process?

RossH · 2023-08-30 10:41

@ManAtWork said:
Many of the hardware drivers I use (OLED display and SPI bus, flash file system, for example) are written in Spin2.

Yes, I understand. The increased integration between Spin2 and PASM has become a barrier to entry for the P2.

ManAtWork · 2023-08-30 11:46

@RossH said:
Yes, I understand. The increased integration between Spin2 and PASM has become a barrier to entry for the P2.

No it's not. The problem is that every developer tends to plan his/her projects in a way that all available resources are just used up. If there are pins, memory space, board space or money left over then more features are added. This continues until no space, money, time or whatever else is left. And this is also a good explanation why nearly all programmers spend half of their time debugging no matter how clever they are.

In other words, I try to push the limits and see what might be possible. If it's not I can fall back to what was possible before. But I don't want to fall back more than necessary. Being able to combine C, Spin and PASM has already proven its worth and I don't want to miss it.

RossH · 2023-08-30 13:05

@ManAtWork said:
Being able to combine C, Spin and PASM has already proven its worth and I don't want to miss it.

Sure - I get that. But the downside is that you often have to have a detailed working knowledge of C, PASM and Spin in order to understand, adapt or debug existing code.

This is fine for those of us that have the necessary knowledge - but it is a very high barrier to entry for those that don't. Higher than is typically required for other processors.

ManAtWork · 2023-08-30 14:21

@RossH said:
This is fine for those of us that have the necessary knowledge - but it is a very high barrier to entry for those that don't. Higher than is typically required for other processors.

Yes, also true. If you don't like Spin you don't need to use it. But the problem is that most libraries and drivers are written in Spin. Parallax is a small company compared to Microchip or Intel. They can't afford supporting C programmers or compiler developers like you and Eric. But let's stop complaining and develop real products instead...

              mov    inx,theta
              testb  inx,#31 wz                 ' bit 31 -> Z
              shl    inx,#2 wc                  ' bit 30 -> C
        if_c  not    inx                        ' mirror 2nd and 4th quadrant
              getbyte ipol,inx,#2               ' fraction -> interpolate
              shr    inx,#24                    ' MSB -> table index
              shl    inx,#1
              add    inx,adrTable
              rdword sin1,inx
              subr   ipol,#$100
              mul    sin1,ipol
              add    inx,#2
              rdword sin2,inx
              subr   ipol,#$100
              mul    sin2,ipol
              add    sin1,sin2
              shr    sin1,#10
              mul    sin1,ampl
              negz   sin1                       ' flip 3rd and 4th quadrant
              sar    sin1,#15

This is a table based sine function with 32 bit theta and 16 bit unsigned amplitude input and 16 bit signed output. It's 20 instructions long and even a bit faster than an unpipelined CORDIC operation.
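
For checking the routine off-target, the same steps can be modelled in C. This is a sketch of one reading of the instructions above, not verified against hardware: it assumes a quarter-wave table of 257 unsigned 16-bit entries (the 8-bit index plus the index+1 access imply at least that many), and the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* C model of the PASM routine: 32-bit angle, 16-bit unsigned amplitude,
 * signed 16-bit result. tbl must hold 257 quarter-wave entries. */
static int32_t sine_model(uint32_t theta, uint16_t ampl, const uint16_t *tbl)
{
    int negate = (int)((theta >> 31) & 1u);   /* testb inx,#31 wz        */
    int mirror = (int)((theta >> 30) & 1u);   /* shl inx,#2 wc           */
    uint32_t inx = theta << 2;
    if (mirror)
        inx = ~inx;                           /* if_c not inx            */

    uint32_t ipol = (inx >> 16) & 0xFFu;      /* getbyte ipol,inx,#2     */
    inx >>= 24;                               /* MSB -> table index      */
    assert(inx <= 255u);

    /* Weighted sum of the two neighbouring entries (two MULs and an ADD). */
    uint32_t s = tbl[inx] * (0x100u - ipol) + tbl[inx + 1] * ipol;
    s >>= 10;                                 /* shr sin1,#10            */
    s = (s & 0xFFFFu) * ampl;                 /* mul sin1,ampl           */

    /* negz + sar #15: an arithmetic shift rounds toward minus infinity. */
    if (negate)
        return -(int32_t)((s + 32767u) >> 15);
    return (int32_t)(s >> 15);
}
```

Because the final shift rounds toward minus infinity, a negative half-wave sample can come out one count lower in magnitude terms than its positive mirror; for a PWM duty value that asymmetry is probably negligible, but it is worth knowing when comparing against a floating point reference.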

I'll implement the main loop in C and see whether I can put both together.

pik33 · 2023-08-30 14:56

Instead of rdword/add inx/rdword to get 2 samples to interpolate you can use rdlong and getword. rdlong does not have to be long aligned in P2. That saves one hub read, and the cost is one getword and one clock penalty for unaligned long read.
mul/add/shr maybe can be sped up using scas: scas returns an already shifted result in Q
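
The single-read idea can be illustrated in C. This is a host-side sketch with hypothetical names: load_le32 plays the role of an unaligned RDLONG on a little-endian memory image, and the two halves correspond to what GETWORD would extract.

```c
#include <assert.h>
#include <stdint.h>

/* Little-endian 32-bit load from any byte offset, like an unaligned RDLONG. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Fetch table[idx] and table[idx+1] of a 16-bit table with one 32-bit load. */
static void fetch_pair(const uint8_t *table_bytes, uint32_t idx,
                       uint16_t *lo, uint16_t *hi)
{
    uint32_t both = load_le32(table_bytes + 2u * idx);
    *lo = (uint16_t)(both & 0xFFFFu);   /* lower word: entry idx     */
    *hi = (uint16_t)(both >> 16);       /* upper word: entry idx + 1 */
}
```

Every odd index lands on an address that is word aligned but not long aligned, which is where the one-clock penalty mentioned above would apply.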

pik33 · 2023-08-30 14:59

@RossH said:

@ManAtWork said:
Being able to combine C, Spin and PASM has already proven its worth and I don't want to miss it.

Sure - I get that. But the downside is that you often have to have a detailed working knowledge of C, PASM and Spin in order to understand, adapt or debug existing code.

This is fine for those of us that have the necessary knowledge - but it is a very high barrier to entry for those that don't. Higher than is typically required for other processors.

FlexBasic lowers this barrier. Easy to learn, easy to combine with existing Spin drivers. We however need a real OBEX to find these drivers easier.

RossH · 2023-08-31 00:45

@pik33 said:
FlexBasic lowers this barrier. Easy to learn, easy to combine with existing Spin drivers. We however need a real OBEX to find these drivers easier.

I've never tried FlexBasic, so I'm not sure about its capabilities. But while I wouldn't expect Basic to support things like interrupts, it is reasonable to expect that an implementation of Spin and C should do so. If they don't, then it means you have to do all the tricky stuff in PASM, which is hard. Integrating languages is good, but you should not be required to use them all just to do what you want in any one. Also, there are other low-level languages used on the Propeller, such as Forth. Are there plans for a FlexForth? If not, then there is still a "barrier" for such languages to get over. A common well-defined PASM driver interface would address that. But we've been down that road before on the Propeller 1, and I remember where it leads!

To go back to the original topic, I am doing some interrupt work in Catalina at the moment, and I like the idea of being able to load the LUT with interrupt routines. I only currently use the LUT "behind the scenes" in Catalina, but I will look at adding some functions to make this easier to do in front of the curtain, so to speak.

However, I have a vague memory that LUT code differs from ordinary Hub or Cog code, but I can't quite remember the details. I just tried looking in the documentation and can't find anything relevant. Does anyone have a reference to any limitations in "LUT execution" mode?

Ross.

TonyB_ · 2023-08-31 10:09

@RossH said:
To go back to the original topic, I am doing some interrupt work in Catalina at the moment, and I like the idea of being able to load the LUT with interrupt routines. I only currently use the LUT "behind the scenes" in Catalina, but I will look at adding some functions to make this easier to do in front of the curtain, so to speak.

However, I have a vague memory that LUT code differs from ordinary Hub or Cog code, but I can't quite remember the details. I just tried looking in the documentation and can't find anything relevant. Does anyone have a reference to any limitations in "LUT execution" mode?

Reading/writing data in LUT RAM requires RDLUT/WRLUT and only bottom half is directly addressable, but non-self-modifying code execution in LUT RAM is same as in cog RAM.

ManAtWork · 2023-09-11 11:59

@RossH said: ... Does anyone have a reference to any limitations in "LUT execution" mode?
@TonyB_ said:
Reading/writing data in LUT RAM requires RDLUT/WRLUT and only bottom half is directly addressable, but non-self-modifying code execution in LUT RAM is same as in cog RAM.

AFAIK, there are no restrictions other than the LUT ram has to be loaded, first, before execution. This can be done with SETQ2+RDLONG to load a whole block at once. I haven't found anything about a limitation to the bottom half. Executing ISR code in LUT should be possible by simply setting the IRQ vector register IJMPx to a LUT address ($200..$3FF).

I'll try it out...

evanh · 2023-09-11 15:08

Tony used a technical term there - "Direct Addressing" is the way an instruction is encoded for fetching operand data. Any of the 512 cogRAM addresses can be directly specified in both the encoded S and D fields of an instruction. But, for lutRAM and hubRAM, because of the extra indexing options in the RDLUT/WRLUT instructions, only the first 256 addresses can be directly specified this way and only in the S field. In the case of hubRAM it's only the first 256 bytes! Accessing data beyond address 255 requires using a cogRAM register to hold the lutRAM address, AKA register-direct addressing.

Chip's docs erroneously call this an "immediate hub address" or "immediate LUT address", probably because a "#" is used in Pasm to signify such direct addressing with the load and store instructions - RDLUT/WRLUT/RDLONG/WRLONG/...

So, as a result, those first storage locations in main memory can be valuable for more compact referencing of regularly used metrics like timers. This has been called the zero-page in some systems in the past.

Wuerfel_21 · 2023-09-11 15:10

@evanh said:
Fetching data beyond address 255 requires using a cogRAM register to hold the lutRAM address, AKA register-direct addressing.

I'm pretty sure a ##-extended immediate also works (as it does for hub addresses)

evanh · 2023-09-11 15:23

okay, yes, "requires" was incorrect. But the compactness is shredded even more if not using a register.

ManAtWork · 2023-09-11 17:01

@evanh said:
Tony used a technical term there - "Direct Addressing" is the way an instruction is encoded for fetching operand data. Any of the 512 cogRAM addresses can be directly specified in both the encoded S and D fields of an instruction. But, for lutRAM and hubRAM, because of the extra indexing options in the RDLUT/WRLUT instructions, only the first 256 addresses can be directly specified this way and only in the S field.

I'm pretty sure that this does not affect my code if I place the instructions in LUT ram and the data in COG ram. Jumps and branches, REP etc. are PC relative so they still work in LUT ram.
