@evanh Yeah, the inline assembly code really isn't equipped to deal with _RET_ modifiers in any sensible way. Probably it should just throw an error on seeing them rather than trying to insert the RET instruction.
@evanh said:
... But I've hit a niggle that is probably something Chip needs to sort out ...
Huh, LOL, I tried to make an example for Chip to work with, only to find Pnut completely handles everything no problem.
Output from the Pnut v46 compiled example:
Errors, including debug's undefines, compiling wih Flexspin Version 7.0.0-beta2-v7.0.0-beta2-34-gcbe5f1a4 Compiled on: Jan 3 2025
reg_ref_test.spin2:36: error: Undefined symbol p_clk
reg_ref_test.spin2:36: error: Undefined symbol p_dat
reg_ref_test.spin2:36: error: Undefined symbol m_lead
reg_ref_test.spin2:36: error: Undefined symbol v_nco
reg_ref_test.spin2:36: error: Undefined symbol m_se2
reg_ref_test.spin2:36: error: Undefined symbol m_dat
reg_ref_test.spin2:36: error: Undefined symbol m_crc
reg_ref_test.spin2:45: error: Undefined symbol p_clk
reg_ref_test.spin2:45: error: Undefined symbol p_dat
reg_ref_test.spin2:45: error: Undefined symbol m_lead
reg_ref_test.spin2:45: error: Undefined symbol v_nco
reg_ref_test.spin2:45: error: Undefined symbol m_se2
reg_ref_test.spin2:45: error: Undefined symbol m_dat
reg_ref_test.spin2:45: error: Undefined symbol m_crc
reg_ref_test.spin2:49: error: data item is not constant
@evanh said:
... But I've hit a niggle that is probably something Chip needs to sort out ...
Huh, LOL, I tried to make an example for Chip to work with, only to find Pnut completely handles everything no problem.
Output from the Pnut v46 compiled example:
Unfortunately this is where my model of inline assembly (constructed prior to the existence of Spin2) breaks down. I had envisioned it as being used for short snippets of inline code, and wanted therefore to make it fast to call and tightly integrated with the rest of the compiler. This made sense because the compiled code is usually almost as fast as inline assembly would be anyway, so the asm mostly is needed only for things that are hard to express in the source language. But it meant having to convert it into the internal representation used by the compiler, which is somewhat restricted compared to general assembly.
In PNut, by contrast, the inline assembly blocks are essentially mini-DAT sections. Setting them up and running them is relatively expensive. The interpreter is slow enough that the trade-off is worth it, and people are using inline assembly more and more.
I already have something like the mini-DAT sections implemented for nucode (-2nu output) and I can probably port that over to the regular inline assembly. It'll be much slower to get into/out of these blocks, because we'll have to move the data in variables to and from temporary registers. On the plus side that will mean that the restriction on using variables in memory will be lifted. Is it worth it? I guess PNut compatibility is paramount, so I'll probably go ahead and do it, but the generated code will be ugly. For backwards compatibility (and performance) the current system will continue to be used for plain unadorned asm blocks.
But the DAT like "variables" aren't scoped outside of register space though. They can't be, and I don't need them, accessed as C locals. I'm struggling to see why a drastic change would be needed.
All I'm looking for here is the compiler to accept the symbol label as a register number to fill those DAT like LONGs. Specifically a fix for this error: reg_ref_test.spin2:49: error: data item is not constant
EDIT: I guess you're looking at the other errors with the debug() function. I didn't know they would be a headache. Forget them then. I don't need that to work.
PS: I'm looking at a C solution here too. The Spin example was done for testing Pnut is all.
In the below listing the ptr symbol is currently set with two bit-shifted constants, but I'd like to replace the 0x1f6 with PA register symbol, and, importantly, COGCRCBUFF will become a symbolic reference to one of the later RES symbols. Both are cogRAM addresses and get used by an ALTI instruction.
clocks long 1 + 512 * 2 // start-bit + nibble count
lnco long 0x8000_0000 // tells RDFAST/WRFAST not to wait for FIFO ready
endbit long 0xffff_ffff
poly long 0x8408 // CRC-16-CCITT reversed (x16 + x12 + x5 + x0: 0x1021, even parity)
ptr long 0x1f6<<19 | COGCRCBUFF<<9 // register PA for ALTI result substitution, and CRC buffer address
crc3 long 0
crc2 long 0
crc1 long 0
crc0 long 0
@evanh said:
But the DAT like "variables" aren't scoped outside of register space though. They can't be, and I don't need them, accessed as C locals. I'm struggling to see why a drastic change would be needed.
All I'm looking for here is the compiler to accept the symbol label as a register number to fill those DAT like LONGs. Specifically a fix for this error: reg_ref_test.spin2:49: error: data item is not constant
Yeah, that's the problem -- the current inline assembly parser does not know any symbol values, and in general cannot because it runs before the optimizer and "linker". Of course for the case of inline assembly that goes into COG memory we can in principle calculate them, but that's adding another patch onto an already creaky structure. The whole LONG/RES support is already something that's kind of bolted on to the compiler internal representation. That's also the reason that _RET_ prefixes aren't supported, the compiler doesn't generate them so there's no way to represent them in the IR.
I think the right solution is to bite the bullet and do a full assembly of the contents of ORG/END to binary, just as though it's inside the DAT section, rather than a partial assembly to IR that's then passed to later compiler passes. I've already got that for the P2 bytecode backend. The only tricky part I have left is accessing the local variables inside assembly; in bytecode the variables are always in HUB memory so we just copy from HUB to a designated part of COG before the inline asm runs, and back again after. We can do something similar for assembly output too. This will also apply to C __asm volatile (but not regular __asm, which will continue to be parsed with the existing code).
Removing the way it works currently is no good - I use ORG/END a lot to write loops with FIFO or overlapped CORDIC that only work in cogexec. Would be very annoying if those had extra startup overhead heaped on (compared to a regular fcache block).
Unrelatedly ORGH/END seems to have been broken at some point.
@Wuerfel_21 said:
Removing the way it works currently is no good - I use ORG/END a lot to write loops with FIFO or overlapped CORDIC that only work in cogexec. Would be very annoying if those had extra startup overhead heaped on (compared to a regular fcache block).
We can add an option to make it work the current way, but the default should be the new (more compatible) way. Making it an optimization flag would let us change it at function level granularity, which should be good enough I think?
Unrelatedly ORGH/END seems to have been broken at some point.
@Wuerfel_21 said:
Removing the way it works currently is no good - I use ORG/END a lot to write loops with FIFO or overlapped CORDIC that only work in cogexec. Would be very annoying if those had extra startup overhead heaped on (compared to a regular fcache block).
We can add an option to make it work the current way, but the default should be the new (more compatible) way. Making it an optimization flag would let us change it at function level granularity, which should be good enough I think?
I'd do it the opposite way - the current thing is good enough for most/all inline ASM that isn't super cracked and more efficient. You'd regress all the code that worked fine before to being slower. So make it an opt-in flag. The way attributes are given doesn't disturb PNut's parsing, so this is fine in all cases.
OK, for now the current method of inline assembly is the default for -O1 and above. It's in an optimization flag called -Ofast-inline-asm, so it can be turned off globally with -Ono-fast-inline-asm or in a specific function by placing {++opt(no-fast-inline-asm)} after the PUB or PRI (and similarly in C with an __attribute__(opt(no-fast-inline-asm)) on the function declaration).
It looks like something's going wrong with the calculation of the address of testset in loc pa, #testset. Which isn't surprising, since addresses are not known at that point, but I don't know why there isn't a warning. For now you can work around this by passing in the address in another local variable void *ptr = &testset and then use ptr in place of pa.
Oh, that figures, it is the same with Pnut. And same solution.
Thanks for looking for me.
EDIT: Yep, works now.
EDIT2: Now added to the SD card driver in two places and appears to be working flawless. The block read and block write low level routines required this to allow automatic packing of the CRC processing buffer tightly against the code sitting in Fcache space. And it also allows the compiler to know the existence of the processing buffer at all.
So, it should now be able to compile-time warn me if the whole lot doesn't fit ... hmm nope, it compiles without error. Then crashes after reading a block or two ... ah, maybe I need to add a FIT ... right, yeah, that worked.
Thanks for testing this @evanh. I've updated the code slightly to fix some bugs (e.g. it was copying things to local memory that it shouldn't have). I also tried to create relocations for unknown HUB addresses. That didn't work, but it did have the happy side effect (which I'd forgotten about) of providing comments in the binary blob showing the original assembly.
I should say, rather, Fcache size is not know when unspecified. Or is there a compiler generated value that could be compared?
@ersmith said:
... but it did have the happy side effect (which I'd forgotten about) of providing comments in the binary blob showing the original assembly.
Oh, so the hex dump in the .p2asm file is literally that - from an assembled binary file?
I see the assembly now
EDIT: Hang-on, there is an auto-generated FIT there. It ain't working though. I only get a compile error when I explicitly put my own FIT 128 in after the large RES 130 I have for the CRC processing buffer.
Example failure attached.
@evanh: Yeah, the auto generated fit is inserted after we've already compiled the ORG mini-dat, so it's only testing the size of the binary blob (the bytes from the code and data) and doesn't know about any RES inside. It's better than nothing, but it won't catch RES caused errors.
Back to wanting to know the Fcache size for me to add as a FIT directive then. And even better, an Fcache that can use whatever available cogRAM there is.
@evanh said:
Back to wanting to know the Fcache size for me to add as a FIT directive then. And even better, an Fcache that can use whatever available cogRAM there is.
It's documented to be 256 longs, but for some reason the code was only allocating 128. I've checked in an update to actually provide 256 (which is also what the P2 bytecode interpreter provides) but beware -- it's been 128 for a long time, so I would not be surprised to find that 256 is too big and we have to shrink it.
I have tried in the past to auto-allocate the amount of fcache, but that is extremely complicated due to all of the things that compete for COG memory, and it's never worked out very well.
Comments
@evanh Yeah, the inline assembly code really isn't equipped to deal with
_RET_
modifiers in any sensible way. Probably it should just throw an error on seeing them rather than trying to insert theRET
instruction.Huh, LOL, I tried to make an example for Chip to work with, only to find Pnut completely handles everything no problem.
Output from the Pnut v46 compiled example:
Errors, including debug's undefines, compiling wih Flexspin Version 7.0.0-beta2-v7.0.0-beta2-34-gcbe5f1a4 Compiled on: Jan 3 2025
Unfortunately this is where my model of inline assembly (constructed prior to the existence of Spin2) breaks down. I had envisioned it as being used for short snippets of inline code, and wanted therefore to make it fast to call and tightly integrated with the rest of the compiler. This made sense because the compiled code is usually almost as fast as inline assembly would be anyway, so the asm mostly is needed only for things that are hard to express in the source language. But it meant having to convert it into the internal representation used by the compiler, which is somewhat restricted compared to general assembly.
In PNut, by contrast, the inline assembly blocks are essentially mini-DAT sections. Setting them up and running them is relatively expensive. The interpreter is slow enough that the trade-off is worth it, and people are using inline assembly more and more.
I already have something like the mini-DAT sections implemented for nucode (
-2nu
output) and I can probably port that over to the regular inline assembly. It'll be much slower to get into/out of these blocks, because we'll have to move the data in variables to and from temporary registers. On the plus side that will mean that the restriction on using variables in memory will be lifted. Is it worth it? I guess PNut compatibility is paramount, so I'll probably go ahead and do it, but the generated code will be ugly. For backwards compatibility (and performance) the current system will continue to be used for plain unadornedasm
blocks.But the DAT like "variables" aren't scoped outside of register space though. They can't be, and I don't need them, accessed as C locals. I'm struggling to see why a drastic change would be needed.
All I'm looking for here is the compiler to accept the symbol label as a register number to fill those DAT like LONGs. Specifically a fix for this error:
reg_ref_test.spin2:49: error: data item is not constant
EDIT: I guess you're looking at the other errors with the debug() function. I didn't know they would be a headache. Forget them then. I don't need that to work.
PS: I'm looking at a C solution here too. The Spin example was done for testing Pnut is all.
In the below listing the
ptr
symbol is currently set with two bit-shifted constants, but I'd like to replace the0x1f6
withPA
register symbol, and, importantly,COGCRCBUFF
will become a symbolic reference to one of the later RES symbols. Both are cogRAM addresses and get used by an ALTI instruction.Yeah, that's the problem -- the current inline assembly parser does not know any symbol values, and in general cannot because it runs before the optimizer and "linker". Of course for the case of inline assembly that goes into COG memory we can in principle calculate them, but that's adding another patch onto an already creaky structure. The whole LONG/RES support is already something that's kind of bolted on to the compiler internal representation. That's also the reason that
_RET_
prefixes aren't supported, the compiler doesn't generate them so there's no way to represent them in the IR.I think the right solution is to bite the bullet and do a full assembly of the contents of ORG/END to binary, just as though it's inside the DAT section, rather than a partial assembly to IR that's then passed to later compiler passes. I've already got that for the P2 bytecode backend. The only tricky part I have left is accessing the local variables inside assembly; in bytecode the variables are always in HUB memory so we just copy from HUB to a designated part of COG before the inline asm runs, and back again after. We can do something similar for assembly output too. This will also apply to C
__asm volatile
(but not regular__asm
, which will continue to be parsed with the existing code).Removing the way it works currently is no good - I use ORG/END a lot to write loops with FIFO or overlapped CORDIC that only work in cogexec. Would be very annoying if those had extra startup overhead heaped on (compared to a regular fcache block).
Unrelatedly ORGH/END seems to have been broken at some point.
We can add an option to make it work the current way, but the default should be the new (more compatible) way. Making it an optimization flag would let us change it at function level granularity, which should be good enough I think?
Huh, that's funny. I'll look into it.
I'd do it the opposite way - the current thing is good enough for most/all inline ASM that isn't super cracked and more efficient. You'd regress all the code that worked fine before to being slower. So make it an opt-in flag. The way attributes are given doesn't disturb PNut's parsing, so this is fine in all cases.
A switch on the function definition would work fine for me. These big chunks of assembly definitely have their own wrapper each.
OK, for now the current method of inline assembly is the default for -O1 and above. It's in an optimization flag called
-Ofast-inline-asm
, so it can be turned off globally with-Ono-fast-inline-asm
or in a specific function by placing{++opt(no-fast-inline-asm)}
after thePUB
orPRI
(and similarly in C with an__attribute__(opt(no-fast-inline-asm))
on the function declaration).Oh noes ... where's the assembly gone?
PS: It doesn't work either. I should be getting
but instead get this
It looks like something's going wrong with the calculation of the address of testset in
loc pa, #testset
. Which isn't surprising, since addresses are not known at that point, but I don't know why there isn't a warning. For now you can work around this by passing in the address in another local variablevoid *ptr = &testset
and then useptr
in place ofpa
.Oh, that figures, it is the same with Pnut. And same solution.
Thanks for looking for me.
EDIT: Yep, works now.
EDIT2: Now added to the SD card driver in two places and appears to be working flawless. The block read and block write low level routines required this to allow automatic packing of the CRC processing buffer tightly against the code sitting in Fcache space. And it also allows the compiler to know the existence of the processing buffer at all.
So, it should now be able to compile-time warn me if the whole lot doesn't fit ... hmm nope, it compiles without error. Then crashes after reading a block or two ... ah, maybe I need to add a FIT ... right, yeah, that worked.
Next problem is Fcache size is not known.
Thanks for testing this @evanh. I've updated the code slightly to fix some bugs (e.g. it was copying things to local memory that it shouldn't have). I also tried to create relocations for unknown HUB addresses. That didn't work, but it did have the happy side effect (which I'd forgotten about) of providing comments in the binary blob showing the original assembly.
I should say, rather, Fcache size is not know when unspecified. Or is there a compiler generated value that could be compared?
Oh, so the hex dump in the .p2asm file is literally that - from an assembled binary file?
I see the assembly now
EDIT: Hang-on, there is an auto-generated FIT there. It ain't working though. I only get a compile error when I explicitly put my own FIT 128 in after the large RES 130 I have for the CRC processing buffer.
Example failure attached.
@evanh: Yeah, the auto generated fit is inserted after we've already compiled the ORG mini-dat, so it's only testing the size of the binary blob (the bytes from the code and data) and doesn't know about any RES inside. It's better than nothing, but it won't catch RES caused errors.
Back to wanting to know the Fcache size for me to add as a FIT directive then. And even better, an Fcache that can use whatever available cogRAM there is.
It's documented to be 256 longs, but for some reason the code was only allocating 128. I've checked in an update to actually provide 256 (which is also what the P2 bytecode interpreter provides) but beware -- it's been 128 for a long time, so I would not be surprised to find that 256 is too big and we have to shrink it.
I have tried in the past to auto-allocate the amount of fcache, but that is extremely complicated due to all of the things that compete for COG memory, and it's never worked out very well.
Thank you for raising it up again. I don't have to force it on the compile line now. That's also much closer to Pnut's level now too.
Fair enough about the auto-sizing.
A bonus of the change in assembler,
_RET_
is working now.Hmm, Pnut checks for RES allocation size and spits an error if beyond the max allowed. Not much of a need for using FIT there.
I've just added some code so flexspin should do this now too.
Champ! Thank you Eric.