flexspin compiler for P2: Assembly, Spin, BASIC, and C in one compiler

evanh · 2022-07-17 03:02

@ersmith said:
I checked in a change to bump the default heap size on P2; the new file system code does need a bit more RAM sometimes.

Isn't the heap usually just the spare RAM? I'm surprised it needs to be defined.

On that note, for buffer allocations, I've been using __builtin_alloca() instead of C's malloc(). Malloc() seems to be broken.

ersmith · 2022-07-17 10:22

@evanh said:
Isn't the heap usually just the spare RAM? I'm surprised it needs to be defined.

No, because in FlexSpin the stack grows up rather than down; also there's a garbage collection system. For both of these reasons there's a symbol (HEAPSIZE) that gives the size of the heap. It defaults to 6K now on P2, which is pretty small.

On that note, for buffer allocations, I've been using __builtin_alloca() instead of C's malloc(). Malloc() seems to be broken.

Could you be more specific? I use malloc() inside the libraries and it's worked fine. I'm guessing you may need a bigger heap (see above).

ersmith · 2022-07-17 10:34

@Wuerfel_21 said:
Also, regarding the last couple int64 related commits - do note that long is not a 64bit type on Windows and it can and will blow up if you try to printf("%ld\n",(int64_t)something);. You need long long

Aargh. I guess I'll have to switch over to using the PRId64 (and similar) macros from inttypes.h. Those are ugly, but should always work.

evanh · 2022-07-17 20:07

@ersmith said:

@evanh said:
Isn't the heap usually just the spare RAM? I'm surprised it needs to be defined.

No, because in FlexSpin the stack grows up rather than down; also there's a garbage collection system. For both of these reasons there's a symbol (HEAPSIZE) that gives the size of the heap. It defaults to 6K now on P2, which is pretty small.

I'd make the stack a defined size, rather than the heap. Or have the heap grow down maybe.

On that note, for buffer allocations, I've been using __builtin_alloca() instead of C's malloc(). Malloc() seems to be broken.

Could you be more specific? I use malloc() inside the libraries and it's worked fine. I'm guessing you may need a bigger heap (see above).

I'm repeatedly allocating two buffers of various sizes from 200 Bytes to 200 kBytes in that speed tester.

ersmith · 2022-07-17 22:53

@evanh said:

@ersmith said:

@evanh said:
Isn't the heap usually just the spare RAM? I'm surprised it needs to be defined.

No, because in FlexSpin the stack grows up rather than down; also there's a garbage collection system. For both of these reasons there's a symbol (HEAPSIZE) that gives the size of the heap. It defaults to 6K now on P2, which is pretty small.

I'd make the stack a defined size, rather than the heap. Or have the heap grow down maybe.

In order to do garbage collection we need to know the size of the heap. It was easiest to do it this way. I realize it's not the way other compilers do it, but it is what it is (and it's documented). I am always open to pull requests to change things.

On that note, for buffer allocations, I've been using __builtin_alloca() instead of C's malloc(). Malloc() seems to be broken.

Could you be more specific? I use malloc() inside the libraries and it's worked fine. I'm guessing you may need a bigger heap (see above).

I'm repeatedly allocating two buffers of various sizes from 200 Bytes to 200 kBytes in that speed tester.

Then you'll need a HEAPSIZE definition of at least 200 KB (more actually, since the internal buffers need allocating too) in order to do it. Or, you can use __builtin_alloca(), which you are doing and which works fine.

evanh · 2022-07-18 05:05

Hmm, my list of To-Do's is growing too.

evanh · 2022-07-23 05:30

There is a feature discrepancy, between pnut and flexspin, that I only just realised existed. Spin2 doc says there is a set of eight PR0..PR7 symbols that map to general cog registers. I could make great use of these as persistent parameters for inline assembly. I see you've got $1e0..$1ef spare for the same. Just need the symbols to map to those.

They'll end up being like systems variables of old. Where they all get allocated for regularly used drivers. But that's an issue for system integration.

rogloh · 2022-07-23 05:35

Also it would be nice to reserve the first 16 LUTRAMs before you use them for explicitly allocated functions being stored in LUT. Maybe via a command line switch if you don't want to make it the default. PropTool keeps the first 16 longs free for streamer LUT use, so I'm thinking it would be nice to have the same capability from Flex as well so we can sort of rely on this memory being available in both toolchains.

JRoark · 2022-07-23 13:21

@ersmith In FlexBASIC v5.9.14, it appears that the compiler is not automatically handling longint results correctly where the source variables are 32-bit.
This code:

    dim a as ulongint
    dim b as ulong
    b = &hFFFF_FFFF
    a = b * 128
    print  a
    print  b

produces this:

4294967168                                                                      
4294967295

In order to force the correct 64-bit answer, you have to do this:

    dim a as ulongint
    dim b as ulong
    b = &hFFFF_FFFF
    a = cast(ulongint, b) * 128
    print  a
    print  b

which yields the correct answer:

15565212719110946688                                                            
4294967295

It's not a show-stopper, but it seems to be a bug, and I know how you like bug reports.

pik33 · 2022-07-23 13:32

b and 128 are both 32bits so the result is 32 bit, truncated. The same thing happens if

let a!=1/16

The result is 0. while

let a!=1/16.0

gives 0.0625

Maybe we need something like 128L to tell the compiler that it is long one

JRoark · 2022-07-23 14:17

@pik33 said:
b and 128 are both 32bits so the result is 32 bit, truncated.

IMHO that should not be the case, and it hasn’t been the case in any of the other BASICs I’ve used. YMMV.

If the result is 64-bit, then the input types (byte/short/long/whatever) should not put a limit on the range of the result. I guess if what you are saying is correct, then if necessary, both of the inputs should be promoted to 64-bits before the calc is done. But this should be transparent to the user.

Maybe we need something like 128L to tell the compiler that it is long

That could be useful too.

evanh · 2022-07-23 14:24

Reported behaviour is typical of C code. It's one reason why C is fast.

Wuerfel_21 · 2022-07-23 14:25

@JRoark said:
If the result is 64-bit, then the input types (byte/short/long/whatever) should not put a limit on the range of the result. I guess if what you are saying is correct, then if necessary, both of the inputs should be promoted to 64-bits before the calc is done. But this should be transparent to the user.

That'd be a really terrible thing. An expression should evaluate the same in all contexts.

JRoark · 2022-07-23 14:29

@evanh said:
Reported behaviour is typical of C code. It's one reason why C is fast.

No argument on that point, Evan. But… it’s BASIC. 😜

This is not a show-stopper by any means. As long as a simple CAST fixes this (and is part of the doc), it’s probably not worth changing. But it has been a long time since I rattled ERSmith’s cage, and I didn’t want him to think I wasn’t paying attention, so… yeah. Lol

JRoark · 2022-07-23 17:51

Okay, a dim light is beginning to shine in my grey matter, but check my math:

Operand     Operand     Result
------------------------------------------------
ubyte       ubyte       ubyte
ubyte       byte        byte
ubyte       ushort      ushort
ubyte       short       short
ubyte       ulong       ulong
ubyte       long        long
ubyte       ulongint    ulongint
ubyte       longint     longint

byte        ushort      short
byte        short       short
byte        ulong       long
byte        long        long
byte        ulongint    longint
byte        longint     longint

ushort      ushort      ushort
ushort      short       short
ushort      ulong       ulong
ushort      long        long
ushort      ulongint    ulongint
ushort      longint     longint

short       ulong       long
short       long        long
short       ulongint    longint
short       longint     longint

ulong       ulong       ulong
ulong       long        long
ulong       ulongint    ulongint
ulong       longint     longint

ulongint    ulongint    ulongint
ulongint    longint     longint

So simply stated this comes down to three rules:
1). if any operand is signed, the result will be signed
2). the result will always be constrained to the bit-length of the largest operand.
3). if the result will not fit in the data type declared by the programmer (too many bits), the result is truncated to fit while preserving the sign.

How badly did I screw this up?

evanh · 2022-07-23 23:42

[err]

evanh · 2022-07-23 23:46

I got a problem. Spin ORG/END inline pasm, I added a second # to an immediate operand, for padding purposes, but it failed to assemble with the AUG prefixing without me also extending the number of bits of the literal.

This particular case is not a worry, but the driver coding I use pasm for, often heavily relies on instruction counting for correct timing. Sometimes the number of bits needed in a literal are not fixed (Calculated from compile time parameters) and I will use ## and rely on that AUG always existing.

Testing of Pnut obeys the ## with small value literal.

Wuerfel_21 · 2022-07-23 23:48

Hmm. The inline ASM passes through the IR, I think, so the distinction gets lost. Or does it?

rogloh · 2022-07-24 00:59

@evanh said:
I got a problem. Spin ORG/END inline pasm, I added a second # to an immediate operand, for padding purposes, but it failed to assemble with the AUG prefixing without me also extending the number of bits of the literal.

Hmm, I thought that could perhaps be worked around by prefixing the line with your own AUGS #0 but if you do that you get this:

error: internal error bad operand

JRoark · 2022-07-24 15:37

Is there any real difference between a LONG and a BOOLEAN in FlexBASIC? You can set a boolean variable to a non-boolean, so I'm wondering if the difference is just semantics:

dim flag as boolean
dim i as ulong
flag = false
for i = 1 to 10
   print flag
   flag += 1
next i

prints "0,1,2...9"

ersmith · 2022-07-25 00:32

@JRoark said:

You raised an interesting point about the 32 bit vs. 64 bit calculation. At the moment FlexBasic defaults all integers to 32 bits, and does all operations on 32 bit (and smaller) numbers in 32 bits. Expanding automatically to 64 bits would be a lot of work; but it would be straightforward to add some kind of suffix to literals to flag them as being 64 bits (and thus forcing operations with them to be 64 bit). C already does this.

```
Operand Operand Result
ubyte   ubyte   ubyte
...
```
Actually, if either or both operands are smaller than 32 bits then they are widened to 32 bits first, and then the operation is performed.

If one of the operands is 64 bits then both are widened to 64 bits before the operation, and the result is 64 bits.

If either operand is a float then both operands are converted to float, and the result is a float.

ersmith · 2022-07-25 00:32

@evanh said:
I got a problem. Spin ORG/END inline pasm, I added a second # to an immediate operand, for padding purposes, but it failed to assemble with the AUG prefixing without me also extending the number of bits of the literal.

This particular case is not a worry, but the driver coding I use pasm for, often heavily relies on instruction counting for correct timing. Sometimes the number of bits needed in a literal are not fixed (Calculated from compile time parameters) and I will use ## and rely on that AUG always existing.

Testing of Pnut obeys the ## with small value literal.

Ouch. Yeah, this one will be tricky to fix. For now it's probably best to stick a big literal in there to force the AUG to be present.

ersmith · 2022-07-25 00:34

@JRoark said:
Is there any real difference between a LONG and a BOOLEAN in FlexBASIC? You can set a boolean variable to a non-boolean, so I'm wondering if the difference is just semantics:

No, there really isn't any difference, it's just an indicator to the reader of how the variable is going to be used. As usual, any non-zero value is TRUE and 0 is FALSE.

ManAtWork · 2022-07-26 09:27

I've just tested the ILI9341 LCD driver and demo with FlexSpin. It works if I switch optimization off. With default optimization it doesn't, at least not fully. It doesn't crash but some of the commands don't work. For example clearScreen() does nothing and all fill block operations (filled rectangles and text) don't work. But line drawing and circles (single pixel operations) work.

The older versions used execution timing dependent SPI operations so it would have been no wonder if that didn't work when execution speeds up. But the newest version uses smart pins so I'm pretty sure the SPI timing should be independent of instruction execution.

The main program is plain Spin2 which passes commands via a mailbox to the PASM driver running in its own cog. Does FlexSpin optimize the assembler code? Any idea where I should look?

Needless to say that the exact same code works normally in Propeller tool. Flexprop is the latest (5.9.14).

evanh · 2022-07-26 09:33

@ersmith said:

@evanh said:
I got a problem. Spin ORG/END inline pasm, I added a second # to an immediate operand, for padding purposes, but it failed to assemble with the AUG prefixing without me also extending the number of bits of the literal.

This particular case is not a worry, but the driver coding I use pasm for, often heavily relies on instruction counting for correct timing. Sometimes the number of bits needed in a literal are not fixed (Calculated from compile time parameters) and I will use ## and rely on that AUG always existing.

Testing of Pnut obeys the ## with small value literal.

Ouch. Yeah, this one will be tricky to fix. For now it's probably best to stick a big literal in there to force the AUG to be present.

Grr, it's a little worse than I thought. I had in fact already used what I thought was a workaround a little earlier than I reported this. The workaround was to attempt to turn it into a local register variable and use register direct addressing mode. But Flexspin just turns that straight back into immediate addressing mode, not using any register.

I'm a little surprised the timing discrepancy had gone unnoticed until now. More testing ...

EDIT: And this new detail now also explains a more pronounced problem I've been having in the last day. I can probably work around it but then it won't be compatible in Pnut/Proptool.

EDIT2: Aargh! This frigin' optimiser! Even function parameters with constants passed into them are candidates for this issue. Lucky I've found a reliable solution that's sort of superior (Down side is the double naming) - https://forums.parallax.com/discussion/comment/1541412/#Comment_1541412 Nooo, that doesn't solve the function parameters variant ...

ersmith · 2022-07-26 14:49

@ManAtWork said:
I've just tested the ILI9341 LCD driver and demo with FlexSpin. It works if I switch optimization off. With default optimization it doesn't, at least not fully. It doesn't crash but some of the commands don't work. For example clearScreen() does nothing and all fill block operations (filled rectangles and text) don't work. But line drawing and circles (single pixel operations) work.

The main program is plain Spin2 which passes commands via a mailbox to the PASM driver running in its own cog. Does FlexSpin optimize the assembler code? Any idea where I should look?

FlexSpin does not optimize PASM code in DAT sections; it only ever touches inline PASM. So if the driver is mainly PASM code running in its own COG then that should be fine. The most likely issue is then the communication between the Spin COG and the driver COG. The things I'd look for are:

(1) Race conditions: the Spin2 code may make implicit assumptions about how long it takes to write to the mailbox variables. If the inter-COG communication isn't carefully protected, then the faster Spin2 code generated by flexspin may break this.
(2) Can you put DEBUG() statements into the PASM to verify that the arguments being passed between COGs are correct?
(3) If the code works w/o optimization but fails with optimization, it could be a flexspin optimizer bug. You could try turning off optimizations on individual functions, e.g. if clearScreen is failing you could change its definition to:

PUB {++opt(0)} clearScreen()

The comment with ++opt(0) after the PUB makes it so that function is compiled as if the command line said -O0, i.e. optimization is disabled for that function.

I've looked at the generated code and nothing seems obviously wrong, so I suspect it may be a race condition. But I don't have hardware to actually test with.

ersmith · 2022-07-26 14:51

@evanh said:

@ersmith said:

@evanh said:
I got a problem. Spin ORG/END inline pasm, I added a second # to an immediate operand, for padding purposes, but it failed to assemble with the AUG prefixing without me also extending the number of bits of the literal.

This particular case is not a worry, but the driver coding I use pasm for, often heavily relies on instruction counting for correct timing. Sometimes the number of bits needed in a literal are not fixed (Calculated from compile time parameters) and I will use ## and rely on that AUG always existing.

Testing of Pnut obeys the ## with small value literal.

Ouch. Yeah, this one will be tricky to fix. For now it's probably best to stick a big literal in there to force the AUG to be present.

Grr, it's a little worse than I thought. I had in fact already used what I thought was a workaround a little earlier than I reported this. The workaround was to attempt to turn it into a local register variable and use register direct addressing mode. But Flexspin just turns that straight back into immediate addressing mode, not using any register.

That actually seems like the right way to do it -- it's certainly easier to update a register rather than changing the bits in the AUGS instruction. But I don't understand why you're having problems. Are you using ORG/END, or ASM/ENDASM? If ORG/END then the optimizer shouldn't be run at all. Can you share your code, or at least this fragment of it?

evanh · 2022-07-26 16:14

Damn it! I've just got everything worked around, and now, trying to revert the changes, I can't reproduce any of the issues.

PS: So part of what I thought had to be a workaround wasn't actually needed. I've left it out now.

evanh · 2022-07-26 17:21

I've managed to save one case - full source code attached - of weird behaviour by making a backup of it at that time. It's a different problem I think but one that has cropped up in completely different code before ... In the DAT section below, if I swap the two lines io_delay long 1 and prblob long 0[4] with each other then I get a repeating-print crash. Otherwise it runs fine.

DAT
prblob          long    0[4]
io_delay        long    1
txdata          byte    0[BLOCKSIZE]
rxdata          byte    0[BLOCKSIZE]

PS: The crash occurs at the first block copying to/from the PSRAM. After it has reset and ID queried each chip.
PPS: And a little debug shows it is during the transmit routine. So it's the first attempt to block write from the buffers. But not the first use of either io_delay or prblob.

ManAtWork · 2022-07-26 19:07

@ersmith said:
The most likely issue is then the communication between the Spin COG and the driver COG. The things I'd look for are:

(1) Race conditions: the Spin2 code may make implicit assumptions about how long it takes to write to the mailbox variables. If the inter-COG communication isn't carefully protected, then the faster Spin2 code generated by flexspin may break this.

The code actually contained some race conditions. There was a possibility that parts of the mailbox were overwritten before the last command was completed. I think I have removed all those traps but it still doesn't work with optimization turned on.

(2) Can you put DEBUG() statements into the PASM to verify that the arguments being passed between COGs are correct?

The problem is that the driver does a lot of initialisation before the first actual drawing command can be executed. If the command fails I don't know if the initialisation was correct or not without debugging the whole sequence.

I'll try to find the shortest possible program which reproduces the problem, tomorrow. I'm almost sure that it's not a bug of FlexSpin. However, as I knew such cases were discussed here a lot I asked first before investing a lot of time searching for the wrong thing.

PUB {++opt(0)} clearScreen()

Thanks a lot, I'll try that. But I'd rather go the other way and start with -O0 and turn on optimisation of single functions one by one. If you have a faulty system and flip one switch and it's still faulty you have gained nothing. On the other hand, if you have a working system, flip one switch and it starts to fail you have identified a problem, possibly one out of multiples.

PS: Ok, I have narrowed it down. The whole demo works perfectly when compiled with -O0. Turning on optimisation of setWindow() with

PUB {++opt(1)} setWindow(xs, ys, xe, ye)

breaks all commands which act on larger blocks than 1x1 pixels. Interestingly, setWindow() is still called for single pixels but the timing problem or whatever it is seems to have no effect then. I'll have to examine the generated code, tomorrow...
