I was thinking the same thing... It's essentially the same as UTF-8.
No, it isn't. In UTF-8 the first byte (if it's not 0-127) holds what is essentially a count of the remaining bytes in its upper bits, and each remaining byte stores 6 bits (its upper two bits must be 10). UTF-8 was specifically designed so that ASCII bytes (0-127) never appear in the representation of values above 127, and so that finding character boundaries is very simple: characters start with a byte whose top bits are 0x or 11 (x meaning either bit value), never 10, and bytes starting with 0x or 11 never appear anywhere other than as the first byte of a character. The RFVAR representation is much simpler (7 bits per byte, upper bit is 0 when finished).
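To make the boundary property concrete, here is a minimal C sketch of the UTF-8 byte classification described above. The function names are made up for illustration, and it only inspects the top bits; it does not validate overlong or out-of-range sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Continuation bytes always have the top two bits equal to 10. */
static int utf8_is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/* Total sequence length implied by a lead byte, or 0 if the byte
   cannot start a character (it is a continuation or invalid lead). */
static int utf8_sequence_length(uint8_t lead)
{
    if ((lead & 0x80) == 0x00) return 1;   /* 0xxxxxxx: plain ASCII   */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx + 1 more byte  */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx + 2 more bytes */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx + 3 more bytes */
    return 0;
}

/* From any index inside a buffer, step back to the first byte of the
   character containing that index by skipping continuation bytes. */
static size_t utf8_char_start(const uint8_t *s, size_t i)
{
    while (i > 0 && utf8_is_continuation(s[i]))
        i--;
    return i;
}
```

Nothing comparable works for the RFVAR format: there the final byte of a value has the same top bit (0) as a one-byte value, so you cannot resynchronise from the middle of a stream.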
Except this is limited to 24 bits? Why not allow 32-bits like UTF-8?
I think it's actually limited to 28 bits (4 bytes). I'm not sure why the option for 5 bytes wasn't provided -- it would have been nice to be able to cover the whole range of 32 bit values, but I presume something internally made it hard to go above 4 bytes per read.
The RFVAR/RFVARS limit is 29 bits, as the 4th byte can use all eight bits. I think the FIFO being long-wide probably prevented a 5th byte, and there is always RFLONG if 29 bits aren't enough. Although perhaps not obviously useful at first sight, RFVAR and RFVARS are actually a handy pair of space-saving instructions that take two cycles whether 1, 2, 3 or 4 bytes are read.
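As a rough illustration of where the 29 bits come from (7 + 7 + 7 + 8), here is a host-side C sketch of a decoder for this format. It models the encoding only, not the FIFO hardware, and it assumes the first byte carries the least-significant seven bits. The function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one zero-extended value (RFVAR-style) from p.
   Bytes 1..3 carry 7 data bits each, with bit 7 set when another byte
   follows. A 4th byte, if reached, contributes all 8 of its bits.
   Assumption: the first byte holds the least-significant bits.
   Returns the number of bytes consumed (1..4). */
static size_t rfvar_decode(const uint8_t *p, uint32_t *out)
{
    uint32_t v = 0;
    size_t i;

    for (i = 0; i < 3; i++) {
        v |= (uint32_t)(p[i] & 0x7F) << (7 * i);
        if ((p[i] & 0x80) == 0) {      /* top bit clear: value finished */
            *out = v;
            return i + 1;
        }
    }
    v |= (uint32_t)p[3] << 21;         /* 4th byte: all eight bits used */
    *out = v;
    return 4;
}

/* Sign-extended variant (RFVARS-style): the top bit of the decoded
   field (bit 6, 13, 20 or 28) is treated as the sign bit. */
static size_t rfvars_decode(const uint8_t *p, int32_t *out)
{
    uint32_t v;
    size_t n = rfvar_decode(p, &v);
    unsigned bits = (n == 4) ? 29u : 7u * (unsigned)n;
    uint32_t sign = (uint32_t)1 << (bits - 1);

    *out = (int32_t)((v ^ sign) - sign);   /* standard sign-extension trick */
    return n;
}
```

With this model, four 0xFF bytes decode to $1FFF_FFFF unsigned, the largest 29-bit value, or to -1 when sign-extended.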
Is there any way to prevent the flexspin optimizer from removing inline PASM register assignments on an individual method basis? Or do I have to resort to removing all optimizations for the whole project with the "-O0" option?
In my inline PASM code I'm writing to sequential local registers and then performing a setq burst transfer to write all these registers into HUB. I don't explicitly identify the consecutive register names after the first one written by the burst transfer. I'm finding the flexspin optimizer is then removing the instruction assignments to these registers because it believes the registers are no longer accessed so there is no need to perform the instructions, but this is bad as they are indirectly accessed in the setq wrlong transfer, and I get corrupted output. I've been trying to add some dummy register access instructions at the end of my loop that uses all these registers and sometimes it tricks the optimizer into believing the registers are used and so won't remove the code, but it's a little bit hit and miss. I'm thinking it's safer to just disable optimizations in these cases, but ideally do this on a per method basis. Is there a way?
Yeah I was the same and thought it would only work for other languages but SPIN is included as well now. I found it buried in the docs, seems to work okay.
Spin2-style ORG/END inline ASM is not optimized and always loads into cog RAM.
Flexspin-style ASM/ENDASM is treated just like compiler-generated code.
The 3rd inline ASM style (__asm volatile in C) isn't supported in Spin because uhhhhhhhhh.
Unrelatedly, do not assume that local variables defined consecutively actually end up in consecutive registers. Register renaming will assign them in order of use. Should work if you use an array though?
@Wuerfel_21 said:
Spin2-style ORG/END inline ASM is not optimized and always loads into cog RAM.
Flexspin-style ASM/ENDASM is treated just like compiler-generated code.
Ok
Unrelatedly, do not assume that local variables defined consecutively actually end up in consecutive registers. Register renaming will assign them in order of use. Should work if you use an array though?
Arggh, what is the syntax for the register array? Is it a local or put in the VAR block? I need to do setq burst transfers on it so it needs to be a local register accessible from inline ASM/ENDASM block.
@Wuerfel_21 said:
Unrelatedly, do not assume that local variables defined consecutively actually end up in consecutive registers. Register renaming will assign them in order of use.
Hmm, I've seen many examples of "official" code in the library from Parallax where the addresses of parameters on the stack and also local variables were assumed to be in the order of declaration. Since Spin does not support any composite data types such as structs in C, this makes sense, although it makes some optimisations difficult.
If the variables are forced onto the stack (as happens when you ask for the address of one), then yes, they are in order, but the whole function gets slow and bloated, so don't do that if writing for flexspin specifically.
Yeah, I just hit the problem with the ordering of registers. I had to trick the optimizer again by referencing them explicitly in sequential order. You can't define a pix[8] array here instead; the inline PASM code doesn't deal with indexes, so I need explicit names. I don't think this would be even possible with FlexProp - too many args+local variables...?
PRI {++opt(!regs)} gfxBlend32(src1, stride1, src2, stride2, destaddr, stride3, w, h, blendval, mixval) | pix1, pix2, pix3, pix4, pix5, pix6, pix7, pix8, len, srcptr, destptr, src, dest
    w <#= LINEBUFSIZE ' stay inside scratch buffer
    if w == 0 or h == 0
        return
    src := @srcbuf
    dest := @textbuf
    asm
            mov     pix1, #0 ' only put here to ensure optimizer assigns registers in the usable order
            mov     pix2, #0
            mov     pix3, #0
            mov     pix4, #0
            mov     pix5, #0
            mov     pix6, #0
            mov     pix7, #0
            mov     pix8, #0
    endasm
    repeat h
        mem.read(@srcbuf, src1, w<<2)
        src1 += stride1
        mem.read(@textbuf, src2, w<<2)
        src2 += stride2
        asm
            mov     len, w
            setpiv  blendval
            setpix  mixval
            mov     srcptr, src
            mov     destptr, dest
mixloop     setq    #3
            rdlong  pix1, srcptr
            add     srcptr, #16
            setq    #3
            rdlong  pix5, destptr
            mixpix  pix5, pix1
            mixpix  pix6, pix2
            mixpix  pix7, pix3
            mixpix  pix8, pix4
            cmp     len, #4 wc ' check if less than 4 pixels
    if_c    altd    len, #$1ff
            setq    #3
            wrlong  pix5, destptr
            add     destptr, #16
            sub     len, #4 wcz
    if_nc_and_nz jmp #mixloop
endofline
        endasm
        'write back the rendered buffer data
        mem.write(@textbuf, destaddr, w<<2)
        destaddr += stride3
@rogloh : I think if you use ORG/END instead of ASM/ENDASM you won't run into so many optimizer issues (and your code will be compatible with PropTool/PNut).
Yeah it can compile that way but I see in the listing that would drag in the fcached inline PASM code for every scan line, slowing down performance during rendering. It might have to be the way though for a PropTool variant if that's the way inline code works there and if this code ever wants to be executed on that platform.
UPDATE: Actually in this case, if the outer loop SPIN functions are reasonably slow, it may not be that much slower over all. I'll have to benchmark it at some point.
@ersmith, @TonyB_ Thanks for explaining rfvar/rfvars. I wonder now if it'd be better to just copy UTF-8... But I guess it's a done deal...
@ersmith,
but @@@data1 does what I think it does and puts in absolute addresses, doesn't it? I am quite annoyed with relative addressing.
mike
Yes, @@@data1 is always absolute; but that's only available in flexspin, I don't think pnut/proptool support it.
Here's an example of the problem...
Source is this:
and the bad code generated is this, which is missing several intermediate instructions:
It should be this (which is what happens when the optimizer is globally disabled):
Ok, found my answer I think. I might need to use this:
PUB {++opt(!regs)} methodName(args...)
My code seems to work now.
__asm volatile {
}
Oh, is that Spin? I thought Flexspin didn't optimise inline assembly there.
I think just a regular local array is sufficient? May not actually work like that, idk.
That makes sense now - I've always used ORG/END for Spin based inline assembly.