Does bit 31 make a difference, then? And is RDFAST any quicker than the hidden jump in REP?
It trades stalling for NOP'ing. The REP loop timing is affected but that is barely relevant when the execution path is all screwy. With bit 31 set, instructions go missing until the FIFO has reloaded. So, for example, if you time it to match the REP looping then the remainder of the REP block is not executed.
Can you get it to do anything useful if you omit the REP entirely and rely on just the RDFAST to do your branching?
I doubt it. The non-executing penalty is just like any other branch. If the FIFO wasn't being flushed then there'd be something to play with.
If the FIFO were not flushed immediately when bit 31 is set, there might be time for a couple of instructions after the RDFAST, in which case the RDFAST "jump" might be placed two instructions early.
This is a conjecture, though, without a full understanding of FIFO reloading.
Early jump could be useless, it's very late here.
RDFAST with bit 31 set is useful, though, for cog/LUT exec.
I would hope in some future version of the P2 that there will be separate instruction and data FIFOs. It would also be nice if the instruction FIFO could be implemented as a circular buffer, and the read pointer could be moved backwards a few instructions when jumping backwards. This would require that the instruction FIFO logic retain a certain number of longs -- maybe something like 16 should be sufficient. This would eliminate the need to reload the FIFO when executing small loops.
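To make the circular-buffer idea a bit more concrete, here is a minimal software model in C. It is only a sketch of the wish above, not a description of how the P2 FIFO works: the type `ififo_model`, the function names, and the depth of 16 longs are all assumptions for illustration, and it assumes longs are fetched from consecutive addresses.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed retention depth, taken from the "maybe 16" suggestion above. */
#define HISTORY_LONGS 16u

/* Hypothetical model of an instruction FIFO that keeps recent longs. */
typedef struct {
    uint32_t slots[HISTORY_LONGS]; /* most recently fetched longs        */
    uint32_t newest;               /* long address of the latest fetch   */
    uint32_t count;                /* valid entries, at most the depth   */
} ififo_model;

static void ififo_init(ififo_model *f)
{
    f->newest = 0;
    f->count = 0;
}

/* Record a fetched long. Assumes 'addr' follows the previous one by 1. */
static void ififo_push(ififo_model *f, uint32_t addr, uint32_t value)
{
    f->slots[addr % HISTORY_LONGS] = value;
    f->newest = addr;
    if (f->count < HISTORY_LONGS)
        f->count++;
}

/* True if a backward jump to 'target' can be served without a reload. */
static bool ififo_can_rewind(const ififo_model *f, uint32_t target)
{
    if (f->count == 0 || target > f->newest)
        return false;
    return (f->newest - target) < f->count;
}

/* Fetch a retained long; returns false if a reload would be needed. */
static bool ififo_read(const ififo_model *f, uint32_t target, uint32_t *out)
{
    if (!ififo_can_rewind(f, target))
        return false;
    *out = f->slots[target % HISTORY_LONGS];
    return true;
}
```

In this model a loop of up to 16 longs could jump back to its start and keep reading from the retained slots; anything further back falls out of the window and would need the usual reload.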
In the new code I just checked in to GitHub (source only for now) ORG/END will also flag its block of assembly to be copied to FCACHE, i.e. LUT memory, before execution. In C `__asm const` does the same, and in BASIC it's `ASM CPU`. FCACHE is turned on for P2 now by default. If you're able to build from source, please give it a try, I'd like to shake out the bugs.
Rayman's sdspi_bashed.spin2 code is crapping out badly with this. It seems to be restarting all the time. I'm guessing it's most likely because it needed the timing delays that came with hubexec ... My earlier optimisations aren't any better ...
Ok, I see now that fpix was already a float, so didn't need to do float() on it.
Surprised it was working before, now that I think about it... In both fastspin and PNut.
Somehow, doing the round() made it check for this in a different way?
> Rayman's sdspi_bashed.spin2 code is crapping out badly with this. It seems to be restarting all the time. I'm guessing it's most likely because it needed the timing delays that came with hubexec ... My earlier optimisations aren't any better ...
Thanks Evan. I think the optimizer was trying to fcache a block with inlined fcache code, which failed. That should be fixed now.
> Is there a way to force an FCACHE around a piece of inline pasm code, or something similar.
Not yet, but there's clearly a desire for it, so I'll see what I can do.
I'd also vote for something like that. What about a mechanism to load a small bunch of functions coded in assembly into LUT ram. Then it would be nice being able to declare those functions as LUT-resident. Maybe something like
int MyFunction (int input) __fromLUT(&myLabel)
That would speed up calls to regularly used functions a lot, I think.
> I think the optimizer was trying to fcache a block with inlined fcache code, which failed. That should be fixed now.
Tested and working beautifully! I've rebuilt that sdspi code for extra fast pure bit-bashing via inline assembly. Next step is give the streamer a kick for the same job. See how fast an SD card in SPI mode can go.
The use of constants without "#" causes compiler warnings suggesting to use "-0" after the operand. But if I do exactly that, the compiler erroneously inserts a false "#". (I added some #defines because I thought it had something to do with the preprocessor, but that's not the case.)
I can confirm this to be fixed, now.
However, Murphy says there's always one more bug (or two):
The rdlong is optimized away. I don't know if the reason is that d is the same as the source address of the RDLUT. However, RDLONG is supposed to copy 256 longs and RDLUT reads only one. It works correctly if I use __asm const.
And another one:
call #\0x200
is compiled to
call 512
The "#\" (intended to be a jump into the LUT) seems to be ignored.
Yes I too have noticed the optimizer forget to remove the preceding setq or setq2 (or really any prefix) when deleting an instruction. Then the prefix errantly acts on the next instruction.
int32_t* d= 0x00;
int i;
__asm {
setq2 #256 // LUT block transfer
rdlong d,s
rdlut i,#0
}
Note that this will read 256 longs into LUT starting at the address of d in COG, not at 0 (the value of d). I'm not entirely surprised the optimizer was confused by it although it probably should not have been optimized away; I'll try to add some more checks around SETQ.
I never said my code does something useful. But there might be cases where somebody wants to do this. Today, I tried out so many different things that it's no wonder I'm totally confused. But yes, you are right, the way it should be is
I just noticed that Fastspin doesn't initialize variables in PUB methods whereas PNut does.
Maybe I knew this already, don't know.
But, it may break compatibility of PNut written code with Fastspin...
"If it hurts when you do that, don't do that." In other words, if you need a variable set to 0, set it to 0 explicitly.
Maybe I can fix this (it sounds like another difference between Spin1 and Spin2) but honestly I don't think I'll be able to chase down every possible incompatibility between fastspin and PNut.
In Spin 1, the local variables are on the stack, and they are not initialized, except for the result variable. It seems like always initializing local variables would waste cycles when entering a method. In most cases, a local variable does not need to be initialized. What would you do with local arrays? Would they be initialized also? Does PNut really do that?
I agree with Eric. If a local variable needs to be initialized, it should be done explicitly.
Or just figure out if there is any chance of the variable being read while uninitialized and only then initialize it automatically (and potentially emit a warning? idk).
So what happens if I pass a local array to another method? Have the compiler generate code to zero the whole thing, even if it's not necessary? I think explicit initialization is better.
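As a small illustration of the "initialize it explicitly" position, here is a C sketch of a local array being zeroed by hand before it is passed to another function. The names `sum_longs` and `demo_explicit_init` are made up for this example; nothing here is fastspin-specific.

```c
#include <stdint.h>
#include <string.h>

#define BUF_LONGS 6

/* Hypothetical callee: reads every element of the buffer it is given. */
static int32_t sum_longs(const int32_t *buf, int n)
{
    int32_t total = 0;
    for (int i = 0; i < n; i++)
        total += buf[i];
    return total;
}

static int32_t demo_explicit_init(void)
{
    int32_t o[BUF_LONGS];      /* local array: contents start out indeterminate */

    memset(o, 0, sizeof o);    /* the caller pays for zeroing only when needed */
    o[0] = 0x40;

    return sum_longs(o, BUF_LONGS);
}
```

The cost of the `memset` is paid once, in the one function that actually needs a clean buffer, rather than on every method entry.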
> Thanks Evan. I think the optimizer was trying to fcache a block with inlined fcache code, which failed. That should be fixed now.
Uh-oh, not quite perfect after all. With -O2 the SD mount routine fails to return after certain compiles ... and it only takes a string length edit to trip it up. Basically the first timing sensitive code falls over.
-O1 optimising seems to work every time. Edits don't upset it.
EDIT: Ha! I think I may have found it. There is a one-liner inline assembly in the middle of the "start_explicit" mount routine within the sdspi file. All it does is fill a local variable from a GETCT instruction. I've replaced it with getct() instead.
EDIT2: Nope, cleaning out that one-liner didn't fix it. After some more edits the SD mount problem is back with -O2 again. I guess if it's an alignment issue then -O1 is just realigning things.
> "If it hurts when you do that, don't do that." In other words, if you need a variable set to 0, set it to 0 explicitly.
I totally agree. But we are all imperfect and forget it from time to time. Initializing everything to 0 is a great thing for debugging. It makes everything more deterministic. I hate nothing more than hunting a heisenbug for hours just to find out that I forgot to initialize a variable.
So I'd suggest this: Pre-initialize everything with zero when compiling with -O0 and don't do it for higher optimization levels. This way I could fall back to no optimization if I suspect that there's something wrong with my code. If it works with -O0 and doesn't with -O1 then I know where to search and it saves me lots of time.
There are good arguments both ways for conflicts between global and local variables.
On the one hand, allowing you a local with any name at all means you can write a function without worrying about what global variable names there might be (so for example you can cut and paste functions between objects and the function will still work properly in its new place).
On the other hand, if you forget and try to use the global in a function that already has a local with that name, you won't get what you want.
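The same trade-off exists in C, so a short sketch may help show what "you won't get what you want" looks like in practice. The variable name `r2` and both function names are just placeholders for illustration.

```c
#include <stdint.h>

/* File-scope ("global") variable; the name is only an example. */
static int32_t r2 = 5;

/* No local named r2 here, so this reads the global. */
static int32_t read_global(void)
{
    return r2;
}

/* The local r2 hides the global for the whole function body. */
static int32_t shadowed(void)
{
    int32_t r2 = 0;   /* same name, different variable */
    r2 += 1;          /* modifies the local only */
    return r2;
}
```

Calling `shadowed()` leaves the global untouched, which is convenient when pasting a function between files and surprising when the global was the one you meant. Many C compilers can warn about this (for example GCC and Clang with `-Wshadow`).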
Can you declare arrays as local variables like this?
PUB SendCMD(op,parm)|o[6],i,c,r1,r3
'Send command and sometimes get response
'first, need to calculate CRC7
o[0]:=$40+op
o[1]:=(parm>>24)&$FF
o[2]:=(parm>>16)&$FF
o[3]:=(parm>>8)&$FF
o[4]:=(parm>>0)&$FF
o[5]:=0
I thought you could, but it doesn't seem to work...
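For reference, the "calculate CRC7" step mentioned in the comments of the method above could look something like this in C. This is a sketch under assumptions, not Rayman's code: it uses the usual SD CRC-7 polynomial x^7 + x^3 + 1 with a zero initial value, masks `op` to 6 bits instead of adding it to $40, and the names `crc7` and `sd_build_command` are invented here.

```c
#include <stdint.h>
#include <stddef.h>

/* CRC-7, polynomial x^7 + x^3 + 1 (0x09), initial value 0, MSB first. */
static uint8_t crc7(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t byte = data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc <<= 1;
            /* Bit 7 of crc is the bit just shifted out of the 7-bit register. */
            if ((byte ^ crc) & 0x80)
                crc ^= 0x09;
            byte <<= 1;
        }
    }
    return crc & 0x7F;
}

/* Build a 6-byte SD command frame: start bits + op, 32-bit arg, CRC7 + stop bit. */
static void sd_build_command(uint8_t op, uint32_t parm, uint8_t frame[6])
{
    frame[0] = (uint8_t)(0x40u | (op & 0x3Fu));
    frame[1] = (uint8_t)(parm >> 24);
    frame[2] = (uint8_t)(parm >> 16);
    frame[3] = (uint8_t)(parm >> 8);
    frame[4] = (uint8_t)(parm);
    frame[5] = (uint8_t)((crc7(frame, 5) << 1) | 0x01u);
}
```

With `op = 0` and `parm = 0` the last byte comes out as 0x95, which matches the well-known CMD0 frame 40 00 00 00 00 95.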
Comments
I see that the VGA driver needs to do round() here to turn pixel rate into an integer:
What I tried to do is do this rounding in the CON section like this:
Gives this error:
In Spin 1, I think globals are initialized, but locals are not (except maybe the result one?).
So I'm happy to need to initialise. Needs to be in the gotchas or tricks and traps tho'.
May be nice to have a global flag to force initialising all to zero at some stage?
C has rules about uninitialized variables, which need to be followed to make it standards compliant, as well.
Turns out I had both a global and local variable named R2
Tried it in PNut and that showed me the error in my ways...