LUT execution (Fastspin) — Parallax Forums

I tried to put the code of some frequently called functions into LUT RAM to speed up execution. However, I'm obviously doing something wrong: the program crashes immediately unless I comment out the call ##0x200. I reduced everything to the absolute minimum to demonstrate it.
#include <stdio.h>
#include <propeller.h>

#define P2_TARGET_MHZ 180
#include "sys/p2es_clock.h"

#ifndef _BAUD
#define _BAUD 230400
#endif

__asm { // runs in LUT RAM
//ORG 0x200 
// compiler complains about ORG but doesn't matter as long as code is PC relative

Lut_code
	// put code here later, for now we just return...
	ret
}

int InitLut ()
{
  int32_t* s= &Lut_code;
  int32_t* d= 0x00;
  int i= 256;
  int tmp1;
  __asm {
.loop	rdlong 	tmp1,s
    	wrlut	tmp1,d
	add	s,#4
	add	d,#1
	djnz	i,#.loop
	rdlut	i,#0 // read back the LUT contents to verify
  }
  return i;
}

int CallLut (int in)
{
  __asm {
	call 	##0x200
	}
  return in;
}


void main()
{
    clkset(_SETFREQ, _CLOCKFREQ);
    _setbaud(_BAUD);

    int i= InitLut ();
    printf ("after Init i=%08x\n", i);

    i= CallLut (1);
    printf ("after Call i=%08x\n", i);
}
If I comment out the call, the output is "i=fd64002d" and "i=00000001". The first number is the ret instruction after the Lut_code label, so I think my code is copied correctly to LUT RAM. What else can be wrong? The docs say:
LOOKUP EXECUTION
When the PC is in the range of $00200 and $003FF, the cog is fetching instructions from cog lookup RAM. This is commonly referred to as "lut execution mode." There is no special consideration when taking branches to a cog lookup address,
I know that I can't use the LUT when compiling with -O2 because the compiler will use the LUT itself. But I've checked the .p2asm file and haven't found any conflicts. My djnz instruction gets optimized into a rep (!) but everything else is compiled as expected.
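The relationship between the copy loop and the call target can be sketched in plain host-side C. This is only a model under stated assumptions, not P2 code: it assumes LUT RAM holds 512 longs, that WRLUT/RDLUT index it from 0, and that index 0 is fetched at execution address $200 as the quoted docs describe. The names `copy_hub_to_lut` and `lut_index_to_exec_addr` are hypothetical helpers made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LUT_LONGS     512u    /* assumed LUT size in longs */
#define LUT_EXEC_BASE 0x200u  /* execution addresses $200..$3FF */

/* Model of the copy loop: hub memory is byte-addressed, so the source
   advances by 4 per long; the LUT is long-addressed, so the destination
   index advances by 1. */
static void copy_hub_to_lut(const uint8_t *hub, uint32_t hub_addr,
                            uint32_t lut[LUT_LONGS],
                            uint32_t lut_index, uint32_t count)
{
    assert(lut_index + count <= LUT_LONGS);
    for (uint32_t i = 0; i < count; i++) {
        uint32_t word;
        memcpy(&word, hub + hub_addr + 4u * i, sizeof word); /* like rdlong */
        lut[lut_index + i] = word;                           /* like wrlut  */
    }
}

/* Execution address that corresponds to a WRLUT/RDLUT index. */
static uint32_t lut_index_to_exec_addr(uint32_t lut_index)
{
    assert(lut_index < LUT_LONGS);
    return LUT_EXEC_BASE + lut_index;
}
```

In this model, code written to LUT index 0 is the code a call to $200 would reach, which matches the `d= 0x00` destination and the `0x200` call target in the program above.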

Comments

  • Replacing call ##0x200 with
    int i=0x0200;
      __asm {
    	call 	i
    	}
    
    seems to fix it. That's interesting. Wasn't there a bug that some instructions didn't like the AUGS/D? Or is it a compiler or assembler bug?
  • Wuerfel_21 Posts: 4,507
    edited 2020-05-23 14:50
    CALL, CALLA, CALLB, CALLD, JMP and LOC support full 20 bit addresses (either relative or absolute) without AUGS, so using AUGS probably messes it up somehow

    tl;dr: use
    CALL #\$200
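As a rough illustration of why no AUGS prefix is needed, here is a host-side C sketch that packs an absolute `CALL #\A` into one 32-bit word. The field layout (4-bit condition, 7-bit opcode `1011010`, one relative/absolute bit, 20-bit address) is an assumption taken from my reading of the P2 instruction table, and `encode_call_abs` is a hypothetical helper, not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

#define COND_ALWAYS  0xFu   /* EEEE = 1111: execute unconditionally */
#define OPC_CALL_IMM 0x5Au  /* 1011010, assumed opcode of CALL #{\}A */

/* Pack an absolute immediate CALL. The relative bit (bit 20) stays 0,
   which is what the "\" in the source requests. */
static uint32_t encode_call_abs(uint32_t addr)
{
    assert(addr <= 0xFFFFFu);  /* the whole 20-bit address fits in the word */
    return (COND_ALWAYS << 28) | (OPC_CALL_IMM << 21) | addr;
}
```

Under these assumptions `encode_call_abs(0x200)` yields `0xFB400200`: the full target address already sits inside the instruction word, so there is nothing left for a `##` prefix to extend.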
    
  • Here's a snippet from my reformatted color columns version of the P2 instruction set. You can see that calls are already 20-bits except calls that use wc/wz/wcz. Calls can be relative or absolute.
    [Image: excerpt from the color-coded P2 instruction set table]
  • Cluso99 Posts: 18,069
    If you can be sure that you can keep cog $1e0-$1ef free for the serial monitor then your program can call the monitor rom and you can examine cog, LUT and hub, and return to your code with Q<enter>.
    Btw it uses the internal stack.
  • Wuerfel_21 wrote: »
    use
    CALL #\$200
    

    Doh!
    I've needed absolute addresses so rarely that I have completely forgotten the use of "\".
  • Hmm, the compiler translates this
    CALL #\$200
    
    to that
    call	512
    
    :neutral:
  • evanh Posts: 15,192
    Just use a plain CALL #$200.
  • ManAtWork wrote: »
    Hmm, the compiler translates this
    CALL #\$200
    
    to that
    call	512
    
    :neutral:

    That's a bug in the inline assembly (and it will only happen for inline assembly). It's fixed in github now.

    If you cannot build fastspin from source, I suggest you wait a few days for the next release. It will have a way to copy inline assembly to LUT automatically (__asm volatile will do this).
  • ersmith wrote: »
    If you cannot build fastspin from source,

    BTW, what do I need to do this? Unlike FlexGUI it shouldn't need any special libraries. It's a console application that can be compiled with MinGW, isn't it?
    I suggest you wait a few days for the next release. It will have a way to copy inline assembly to LUT automatically (__asm volatile will do this).

    I guess this is handled like -O2 optimization. I mean the code is copied into LUT each time before execution. What I'm currently looking for is a feature to copy code to LUT only once and then call it multiple times. If the code is called often but contains no loops the LUT execution would otherwise not give much benefit.
  • ManAtWork wrote: »
    ersmith wrote: »
    If you cannot build fastspin from source,

    BTW, what do I need to do this? Unlike FlexGUI it shouldn't need any special libraries. It's a console application that can be compiled with MinGW, isn't it?
    Yes, mingw should work fine -- I've used msys on a Windows machine to build fastspin (although normally I cross-compile on Linux with mingw for Linux).
    I suggest you wait a few days for the next release. It will have a way to copy inline assembly to LUT automatically (__asm volatile will do this).

    I guess this is handled like -O2 optimization. I mean the code is copied into LUT each time before execution. What I'm currently looking for is a feature to copy code to LUT only once and then call it multiple times. If the code is called often but contains no loops the LUT execution would otherwise not give much benefit.

    Putting functions into COG or LUT memory is on my TODO list, but it's not as easy as hijacking the FCACHE mechanism to load inline assembly before executing it. Functions in internal memory will take a while.