@TonyB_ said:
How do I tell FlexSpin to assemble CALLD PA,#A with relative addressing, not CALLD PA,#S? Thanks in advance.
The "\" is an option for requesting absolute, so I would've thought it would be relative by default ...
There are two different instructions with the name CALLD. I want the one with 20-bit A relative (opcode $FE_xx_xx_xx with no prefix) not the other one that FlexSpin gives me.
@Wuerfel_21 said:
If there's a situation where PNut gives you the A encoding and flexspin gives you the S encoding, that's probably a bug.
I've never used PNut because I can't. I can manually code the CALLD instruction with a relative 20-bit #A, but then it's hard to subtract the skipping offset N (see here) due to the byte addressing that this CALLD uses, especially if the branch is negative. FlexSpin subtracts N from the address with no trouble for the similar instructions near the top of the opcode map.
Flexspin will use absolute addressing when crossing memory-type boundaries. If you place everything in hub RAM and make the address difference larger than 256, then Flexspin will build the instruction using a relative branch and the 20-bit immediate encoding, i.e. CALLD PA,#R
Snippet:
        orgh    $800
_main
        mov     pa, num
        call    #itod
        call    #putsp
        call    #call0
        calld   pa, #call1
        call    #call0
        mov     pa, num
        call    #itod
        call    #putnl
.end
        waitx   #500
        jmp     #.end

        orgh    $1000
call0
        add     num, #1
        ret
call1
        add     num, #1
        push    pa
        ret
@TonyB_ said:
There are two different instructions with the name CALLD. I want the one with 20-bit A relative (opcode $FE_xx_xx_xx with no prefix) not the other one that FlexSpin gives me.
It's extremely unfortunate that there are different instructions with the same name, but that's an architectural issue that you'd need to take up with @cgracey. My goal with FlexSpin is to always produce the same binary for input assembly as the official assembler (PNut), so in this case, if the relative offset fits in 9 bits, FlexSpin picks the general CALLD D,#S instruction just like PNut does.
If you need a particular binary encoding for an instruction, the only real way to force that is to use LONG to insert the binary into the stream.
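If it helps to see the selection rule in one place, here is a rough model of the behavior described in this thread (ordinary Python rather than Spin2; the ±256-long window for the 9-bit form is my assumption based on the posts above, not the assembler source):

```python
def calld_form(delta_longs, crosses_memory_region):
    """Sketch of which CALLD encoding the assembler picks, per this thread."""
    if crosses_memory_region:
        # Crossing cog/LUT/hub boundaries forces absolute addressing.
        return "CALLD PA,#\\A (20-bit absolute)"
    if -256 <= delta_longs <= 255:
        # A relative offset that fits in 9 bits gets the general form.
        return "CALLD D,#S (9-bit relative)"
    return "CALLD PA,#A (20-bit relative)"

print(calld_form(100, False))    # CALLD D,#S (9-bit relative)
print(calld_form(1000, False))   # CALLD PA,#A (20-bit relative)
print(calld_form(100, True))     # CALLD PA,#\A (20-bit absolute)
```

This matches evanh's observation above: everything in hub RAM plus an address difference larger than 256 yields the 20-bit relative form.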
Thanks for the info, Eric. This is my workaround:
'calld pa,#a 'a in cog RAM
long $FE_10_00_00 + (a-$-1)<<2 & $FFFFF
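To sanity-check that expression, here is the same arithmetic in plain Python (an illustration of TonyB_'s formula, assuming the CALLD PA,#A layout with the R bit at bit 20 and a byte-offset A field measured from the next instruction):

```python
def calld_pa_rel(here, a):
    """Encode CALLD PA,#A (relative) from cog long addresses, mirroring the
    workaround above: $FE_10_00_00 sets the opcode and the R (relative)
    bit; the 20-bit A field is a byte offset from the next instruction."""
    delta_bytes = (a - here - 1) << 2           # long delta -> byte delta
    return 0xFE100000 + (delta_bytes & 0xFFFFF) # mask keeps 20 bits, two's complement

print(hex(calld_pa_rel(0x100, 0x104)))   # forward 3 longs -> 0xfe10000c
print(hex(calld_pa_rel(0x100, 0x0fe)))   # backward 3 longs -> 0xfe1ffff4
```

The masking handles negative branches the same way the DAT expression does: a backward delta wraps into the top of the 20-bit field without disturbing the opcode bits.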
@ersmith
Hi Eric - is it expected that using function pointers carries a pretty sizable memory penalty? Building the example code from the flexspin docs (I added a main function and vars so it builds):
' --- indirect call --- 1872 bytes prog/1044var bc (4592 native)
{
VAR
LONG funcptr
pub main | a, y
funcptr := @twice
y := funcptr(a)
}
' --- direct call --- 24 bytes prog/0var bc (432 native)
'{
pub main | a, y
y := twice(a)
'}
PUB twice(x) : r
r := x + x
Calling twice() indirectly resulted in a huge increase in the binary size, whether bytecode or native. On the P2, Spin2 nucode seems to be a bit better, though native code there also shows a hefty increase.
A program that does nothing usually gets optimized away.
Eric's optimizer is just that good. In the second example that function call is completely eliminated, and then main is basically eliminated because it doesn't do anything.
But a function pointer throws all assumptions out the window. So no optimizing that out (you can call anything).
The problem is that Spin2 method pointers need to create a heap object because flexspin doesn't have the kind of RTTI that PNut does. Thus the heap allocator gets pulled in... Try reducing the heap size. Also, the heap code as-is obnoxiously depends on the system serial functionality for the sole purpose of a (rather poor) heap corruption check. That should probably be fixed, especially on BC (where the system serial I/O is ass and wastes a cog IIRC...)
@avsa242 said:
@ersmith
Hi Eric - is it expected that using function pointers carries a pretty sizable memory penalty? Building the example code from the flexspin docs (I added a main function and vars so it builds):
As Ada and Evan have already answered, method pointers are dynamically allocated on the heap, so all the garbage collection code as well as the heap itself get pulled in when one is used. I think there is an exception for compile time initialized C function pointers, but Spin doesn't have that.
Why did you choose to do it that way, rather than doing it like PNut? You'd only have to add a vtable to objects/classes that have method pointers taken, and it'd only have to include entries for methods that actually have their pointers taken.
Because it had to work for all languages, and in some cases (e.g. C) the vtable would frequently end up being bigger than the struct itself, and would have to be added on to each instance. There are probably ways to work around this, but they get complicated. Method pointers aren't used very often in Spin, but they are used a lot in BASIC and C, and those languages tend to pull in the memory allocator anyway.
What's the maximum amount/complexity of code that inline-small will optimize? (Or is this not trivial to quantify?) Should FlexSpin place arp_set_opcode()'s code inline when I call it? That's with -O1 specified on the command line, or with {++opt(inline-small)} added to the func definition. The opt flags don't seem to affect the binary size, whereas if I manually copy the code from arp_set_opcode() into arp_reply(), the binary shrinks by 12 bytes (I haven't checked execution time yet to see if there's a difference between the two, but I'm guessing there wouldn't be). It seems the same in Spin2 (Nu or PASM2).
Probably goes without saying, but this is a simplified excerpt from a much larger piece of code that I'd originally tried the optimizations on (same there - no effect).
Thanks!
@avsa242 said:
What's the maximum amount/complexity of code that inline-small will optimize? (or is this not trivial to quantify?)
It depends on the size of the generated code, not the input code, and generally amounts to about 4 instructions plus 1 instruction per parameter (so 5 instructions for your example). Doing single byte operations tends to cause a lot of instructions to be generated. If this is intended for a P2 where unaligned accesses are allowed then you could do:
PUB arp_set_opcode(op)
word[@_arp_data[ARP_OP_CODE]] := op REV 15
which would generate fewer instructions. (EDIT: whoops... it would also be wrong, as Ada pointed out below; use her code instead.)
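To see why REV is the wrong operator here: assuming Spin2's `x REV n` reverses the bottom n+1 bits of x, it bit-reverses the value rather than byte-swapping it, which is presumably what a big-endian 16-bit ARP field needs. A quick Python illustration (the `rev` model of the operator is my assumption, not flexspin output):

```python
def rev(x, n):
    # Model of Spin2's binary REV: reverse the bottom n+1 bits of x.
    bits = n + 1
    out = 0
    for i in range(bits):
        if x & (1 << i):
            out |= 1 << (bits - 1 - i)
    return out

def byteswap16(x):
    # Swap the two bytes of a 16-bit value (big-endian <-> little-endian).
    return ((x & 0xFF) << 8) | ((x >> 8) & 0xFF)

# ARP opcode 2 ("reply") stored big-endian should be $0200,
# but a 16-bit bit-reversal gives $4000 instead.
print(hex(byteswap16(2)))   # 0x200
print(hex(rev(2, 15)))      # 0x4000
```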
The inlining decision depends on the number of instructions in the function after regular compilation (threshold differs between P1 and P2) and the number of arguments that it takes. The BC/Nu backends do not perform inlining.
Also, assuming _arp_data is a byte array, you might want to try this instead:
I have a bug caused by how FlexSpin assembles JMP #A. I'm using an old version (5.5.2) and I don't know whether this has been changed since. The doc says:
Relative addressing is convenient for relocatable code, or code which can run from either cog RAM or hub RAM. Relative addressing is the default when cog code references cog labels or hub code references hub labels. On the other hand, absolute addressing is highly recommended, and forced by the assembler, when crossing between cog and hub domains.
It doesn't mention LUT RAM. If I use JMP #A to jump from cog to LUT RAM or vice-versa, FlexSpin uses absolute addressing and I don't get the advantage of relative jumps during skipping unless I hand-code the jumps.
@Wuerfel_21 said:
Inexplicably we have a never-inline flag, but no always-inline flag, IIRC.
Good point. I've added an "inline" attribute which will encourage the function to be inlined (raises the threshold from 4 instructions to 100 instructions).
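Putting the numbers from this thread together, the decision can be modeled roughly like this (a hypothetical sketch of the described heuristic, not flexspin's actual code; the exact thresholds differ between P1 and P2, and the BC/Nu backends don't inline at all):

```python
def will_inline(body_instructions, num_params, has_inline_attr=False):
    # Per the discussion above: 'inline-small' allows roughly 4
    # instructions plus 1 per parameter; the new 'inline' attribute
    # raises the base threshold to 100. (Hypothetical model only.)
    base = 100 if has_inline_attr else 4
    return body_instructions <= base + num_params

print(will_inline(5, 1))          # True  - fits the small threshold
print(will_inline(20, 1))         # False - too big for inline-small
print(will_inline(20, 1, True))   # True  - 'inline' attribute applied
```

Note the count is of generated instructions after regular compilation, not source lines, which is why byte-at-a-time source code can blow past the threshold unexpectedly.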
@TonyB_ said:
I have a bug caused by how FlexSpin assembles JMP #A. I'm using an old version (5.5.2) and I don't know whether this has been changed since. The doc says:
Relative addressing is convenient for relocatable code, or code which can run from either cog RAM or hub RAM. Relative addressing is the default when cog code references cog labels or hub code references hub labels. On the other hand, absolute addressing is highly recommended, and forced by the assembler, when crossing between cog and hub domains.
It doesn't mention LUT RAM. If I use JMP #A to jump from cog to LUT RAM or vice-versa, FlexSpin uses absolute addressing and I don't get the advantage of relative jumps during skipping unless I hand-code the jumps.
Crossing from cog to LUT is treated the same as any other memory transition (e.g. cog to HUB) and forces absolute addressing. This is the way PNut does it, or at least the way it did -- if it's changed and flexspin no longer matches PNut please let me know. I wanted to be conservative and match what Chip does in this case. In fact I suspect relative addressing would work for cog <-> LUT transitions, but Chip designed the hardware and if he specified that cog/LUT transitions should be absolute jumps, that's what flexspin will do.
Relative branches between cog and LUT definitely work. I don't know what PNut does as I am unable to run it. If somebody can confirm PNut uses absolute addressing in this case then I'll ask for that to be changed. Absolute can be specified whenever wanted but relative never can.
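For context on why relative branches can cross that boundary at all: cog registers occupy long addresses $000-$1FF and LUT RAM $200-$3FF, one contiguous long-addressed space, so a relative branch is just an ordinary signed delta in longs. A sketch of that reading (my illustration, not assembler output):

```python
def rel_branch_delta(src, dest):
    # src/dest are long addresses: cog $000-$1FF, LUT $200-$3FF.
    # A relative branch encodes dest - (src + 1), measured in longs,
    # so crossing the cog/LUT boundary needs nothing special.
    return dest - (src + 1)

print(rel_branch_delta(0x1F0, 0x210))   # cog -> LUT: 31
print(rel_branch_delta(0x210, 0x1F0))   # LUT -> cog: -33
```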
Comments
Printf is very stack memory hungry on its own. Because it is basically another runtime interpreted language.
I like my explanation better because it doesn't add items to a seemingly endless to-do list.
The heap space, oddly, is allocated as space in the binary file itself. So if the heap size is 64 kB it'll add 64 kB to the file size.
The way PNut does it only adds one long to each instance (pointer to its PBASE, which is only initialized when a method pointer is actually taken).
As mentioned above, the real killer for P1 BC is the heap corruption detector pulling in the serial print code.
I've changed that so it only happens now if debug is enabled (-g, for P1), since debug pulls in the tx code anyway.
EDIT: Lol, Eric sniped me, but REV is the wrong operator.
Ah! Okay, misunderstanding on my part... thanks to you both.
I seem to have a similar problem, but the other way around: I need absolute JMPs inside cog RAM, and FlexSpin seems to emit relative JMPs. Is there any way to force absolute JMPs (just JMP; CALL not needed) inside cog code in a normal DAT section, not inline?
Desperate,
Mike