LUT execution (Fastspin)

ManAtWork · 2020-05-23 13:56

I tried to put the code of some frequently called functions into LUT RAM to speed up execution. However, I'm obviously doing something wrong. The program crashes immediately unless I comment out the call ##0x200. I reduced everything to the absolute minimum to demonstrate it.

#include <stdio.h>
#include <propeller.h>

#define P2_TARGET_MHZ 180
#include "sys/p2es_clock.h"

#ifndef _BAUD
#define _BAUD 230400
#endif

__asm { // runs in LUT RAM
//ORG 0x200 
// compiler complains about ORG but doesn't matter as long as code is PC relative

Lut_code
	// put code here later, for now we just return...
	ret
}

int InitLut ()
{
  int32_t* s= &Lut_code;
  int32_t* d= 0x00;
  int i= 256;
  int tmp1;
  __asm {
.loop	rdlong 	tmp1,s
    	wrlut	tmp1,d
	add	s,#4
	add	d,#1
	djnz	i,#.loop
	rdlut	i,#0 // read back the LUT contents to verify
  }
  return i;
}

int CallLut (int in)
{
  __asm {
	call 	##0x200
	}
  return in;
}


void main()
{
    clkset(_SETFREQ, _CLOCKFREQ);
    _setbaud(_BAUD);

    int i= InitLut ();
    printf ("after Init i=%08x\n", i);

    i= CallLut (1);
    printf ("after Call i=%08x\n", i);
}

If I comment out the call the output is "i=fd64002d" and "i=00000001". The first number is the ret instruction after the Lut_code label. So I think my code is copied correctly to LUT RAM. What else can be wrong? The docs say

LOOKUP EXECUTION
When the PC is in the range of $00200 and $003FF, the cog is fetching instructions from cog lookup RAM. This is commonly referred to as "lut execution mode." There is no special consideration when taking branches to a cog lookup address,

I know that I can't use the LUT when compiling with -O2 because the compiler will use the LUT itself. But I've checked the .p2asm file and haven't found any conflicts. My djnz instruction gets optimized into a rep (!) but everything else is compiled as expected.
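The address mapping in the quoted documentation can be sketched in plain host-side C. This is only an illustration, not Fastspin code: the function names are hypothetical, and the cog ($000-$1FF) and hub ($400 and up) ranges are assumed from the usual P2 memory layout rather than stated in the thread. It shows why code written with WRLUT at index 0 is expected to execute at PC $200.

```c
#include <assert.h>
#include <stdint.h>

/* Execution regions selected by the program counter.
   LUT range is from the quoted docs; cog and hub ranges are assumed. */
typedef enum { EXEC_COG, EXEC_LUT, EXEC_HUB } exec_region;

/* Hypothetical helper: classify a PC value. */
exec_region region_of_pc(uint32_t pc)
{
    if (pc < 0x200u) return EXEC_COG;   /* $000..$1FF: cog registers   */
    if (pc < 0x400u) return EXEC_LUT;   /* $200..$3FF: cog lookup RAM  */
    return EXEC_HUB;                    /* $400 and up: hub execution  */
}

/* Hypothetical helper: WRLUT/RDLUT use an index 0..511, and the long
   stored at that index is fetched when the PC equals $200 + index. */
uint32_t lut_index_to_pc(uint32_t index)
{
    assert(index < 512u);
    return 0x200u + index;
}
```

With these helpers, lut_index_to_pc(0) gives $200, which is the address the call in CallLut is meant to reach.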

ManAtWork · 2020-05-23 14:32

Replacing call ##0x200 with

int i=0x0200;
  __asm {
	call 	i
	}

seems to fix it. That's interesting. Wasn't there a bug that some instructions didn't like the AUGS/D? Or is it a compiler or assembler bug?

Wuerfel_21 · 2020-05-23 14:48

CALL, CALLA, CALLB, CALLD, JMP and LOC support full 20-bit addresses (either relative or absolute) without AUGS, so using AUGS probably messes it up somehow.

tl;dr: use

CALL #\$200
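To illustrate why no AUGS prefix is needed, here is a minimal host-side C sketch that builds the instruction word for an absolute CALL. It assumes the encoding listed in the P2 instruction table (EEEE 1101101 R followed by a 20-bit address, with R = 0 for the absolute "\" form); the macro and function names are hypothetical and not part of Fastspin.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of CALL #{\}A: EEEE 1101101 R AAAAAAAAAAAAAAAAAAAA */
#define P2_COND_ALWAYS  0xF0000000u     /* EEEE = 1111: always execute       */
#define P2_OP_CALL_IMM  (0x6Du << 21)   /* opcode 1101101 in bits 27..21     */
#define P2_REL_BIT      (1u << 20)      /* R: 1 = relative, 0 = absolute     */

/* Hypothetical helper: encode CALL #\addr (absolute form, R = 0). */
uint32_t p2_encode_call_abs(uint32_t addr)
{
    assert(addr <= 0xFFFFFu);           /* the whole address fits in 20 bits */
    return P2_COND_ALWAYS | P2_OP_CALL_IMM | addr;
}

/* Hypothetical helper: report whether an encoded CALL is relative. */
int p2_call_is_relative(uint32_t insn)
{
    return (insn & P2_REL_BIT) != 0;
}
```

Under that assumed encoding, CALL #\$200 is a single long with the target address in its low 20 bits, so there is nothing left over for a prefix to extend.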

Peter Jakacki · 2020-05-23 14:58

Here's a snippet from my reformatted color columns version of the P2 instruction set. You can see that calls are already 20-bits except calls that use wc/wz/wcz. Calls can be relative or absolute.

Cluso99 · 2020-05-23 20:31

If you can be sure that you can keep cog $1e0-$1ef free for the serial monitor then your program can call the monitor rom and you can examine cog, LUT and hub, and return to your code with Q<enter>.
Btw it uses the internal stack.

ManAtWork · 2020-05-25 08:48

Wuerfel_21 wrote: »
use
CALL #\$200

Doh!

I've needed absolute addresses so rarely that I had completely forgotten the use of "\".

ManAtWork · 2020-05-25 12:24

Hmm, the compiler translates this

CALL #\$200

to that

call	512

evanh · 2020-05-25 12:26

Just use a plain CALL #$200.

ersmith · 2020-05-25 13:20

ManAtWork wrote: »
Hmm, the compiler translates this
CALL #\$200
to that
call	512

That's a bug in the inline assembly (and it will only happen for inline assembly). It's fixed in github now.

If you cannot build fastspin from source, I suggest you wait a few days for the next release. It will have a way to copy inline assembly to LUT automatically (__asm volatile will do this).

ManAtWork · 2020-05-25 17:21

ersmith wrote: »

If you cannot build fastspin from source,

BTW, what do I need to do this? Unlike FlexGUI it shouldn't need any special libraries. It's a console application that can be compiled with MinGW, isn't it?

I suggest you wait a few days for the next release. It will have a way to copy inline assembly to LUT automatically (__asm volatile will do this).

I guess this is handled like -O2 optimization. I mean the code is copied into LUT each time before execution. What I'm currently looking for is a feature to copy code to LUT only once and then call it multiple times. If the code is called often but contains no loops the LUT execution would otherwise not give much benefit.

ersmith · 2020-05-25 18:27

ManAtWork wrote: »

ersmith wrote: »

If you cannot build fastspin from source,

BTW, what do I need to do this? Unlike FlexGUI it shouldn't need any special libraries. It's a console application that can be compiled with MinGW, isn't it?

Yes, mingw should work fine. I've used msys on a Windows machine to build fastspin (although normally I cross-compile on Linux with mingw for Windows).

I suggest you wait a few days for the next release. It will have a way to copy inline assembly to LUT automatically (__asm volatile will do this).

I guess this is handled like -O2 optimization. I mean the code is copied into LUT each time before execution. What I'm currently looking for is a feature to copy code to LUT only once and then call it multiple times. If the code is called often but contains no loops the LUT execution would otherwise not give much benefit.

Putting functions into COG or LUT memory is on my TODO list, but it's not as easy as hijacking the FCACHE mechanism to load inline assembly before executing it. Functions in internal memory will take a while.
