Can a well-placed conditional RDFAST, perhaps with D[31] set, while in hubexec mode, reliably act as an alternative to REP without the overhead? Perhaps only if the loop size is a multiple of 8 or 16 instructions?
In hubexec, any branch/jump is going to cause the fifo to refill from scratch starting at the new address, stalling execution until the fifo gets the first longs from that address. That's why it's slow.
Doubt using an RDFAST will do anything but make it worse.
Yeah, what happens when RDFAST is executed in hubexec mode, in general?
That seems like one of those real funky edge cases.
Anyways, the obvious thing is that any loop small enough where REP vs. normal jump would matter is likely also small enough to be copied to cog/lut and REP'd there (does P2 fastspin do this yet? I think the P1 codegen has FCACHE?)
For a small number of loops it's better to unroll the loop when running hub exec. If there are a large number of loops, it might be better to copy the code to cog or lut memory, and run it there.
It would be a nice feature if the assembler would automatically unroll REP loops when assembling code targeted for hub exec. The feature could be controlled by a command line option.
Doesn't that imply that any loop in hubexec will also be slow to execute, as there will always be a jump involved for the loop?
Absolutely yes. Every jump causes a new hub block load, with the accompanying delay waiting for hub alignment. This is the disadvantage of hubexec, but it’s a small price to pay. There are times when a hub loop would be better copied to cog and executed there, like an overlay.
But it’s way faster than lmm.
In fastspin, compiling with -O2 for the P2 turns on FCACHE, which copies loops to LUT before executing them. I was surprised at how little benefit this gave on many benchmarks; it helped, but was nowhere near as big a win as the corresponding P1 feature. hubexec really isn't too bad; the branch cost for the loop is typically amortized over quite a few instructions.
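To put rough numbers on the amortization point, here is a minimal Python sketch of the arithmetic. The cycle counts (`cycles_per_instruction`, `branch_penalty_cycles`) are placeholder assumptions for illustration, not measured P2 figures; plug in your own timings.

```python
def branch_overhead_fraction(body_instructions,
                             cycles_per_instruction=2,
                             branch_penalty_cycles=16):
    """Fraction of one loop iteration spent on the branch penalty.

    Both cycle parameters are illustrative assumptions, not P2 measurements.
    """
    body_cycles = body_instructions * cycles_per_instruction
    return branch_penalty_cycles / (body_cycles + branch_penalty_cycles)


if __name__ == "__main__":
    # The longer the loop body, the smaller the share taken by the branch.
    for n in (4, 16, 64):
        print(n, "instructions:", round(branch_overhead_fraction(n), 2))
```

With these assumed numbers the branch dominates a 4-instruction loop but is a small share of a 64-instruction one, which matches the observation that FCACHE helps less than expected on larger loops.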
@ersmith
Is there some description somewhere as to what sort of optimisations get enabled at which level in Fastspin? Is it something we will likely need to concern ourselves with?
Can a well-placed conditional RDFAST, perhaps with D[31] set, while in hubexec mode, reliably act as an alternative to REP without the overhead? Perhaps only if the loop size is a multiple of 8 or 16 instructions?
Instructions spreadsheet says RDFAST not available in hubexec mode ("FIFO IN USE").
Another question:
Should there be a minimum time gap of 19 cycles between a RDFAST with D[31] set and a random hub read or write? If the gap is less then the FIFO might not be completely filled and the random read/write could miss its egg beater slot, costing an extra 8 cycles?
Can a well-placed conditional RDFAST, perhaps with D[31] set, while in hubexec mode, reliably act as an alternative to REP without the overhead? Perhaps only if the loop size is a multiple of 8 or 16 instructions?
Instructions spreadsheet says RDFAST not available in hubexec mode ("FIFO IN USE").
Yes, that's the key to my trick. If you do RDFAST while in hubexec, you'll confuse hubexec. The question is, is it possible to issue a RDFAST while in hubexec to confuse hubexec in a particular way that will consistently give the same effect as a jump with no overhead?
Another question:
Should there be a minimum time gap of 19 cycles between a RDFAST with D[31] set and a random hub read or write? If the gap is less then the FIFO might not be completely filled and the random read/write could miss its egg beater slot, costing an extra 8 cycles?
FIFO always takes priority over random access (even block transfers!)
@ersmith
Is there some description somewhere as to what sort of optimisations get enabled at which level in Fastspin? Is it something we will likely need to concern ourselves with?
There's an "Optimizations.md" file in the doc folder, but it didn't describe the levels at which the optimizations were enabled and didn't describe all of them. I've updated it with the extra info. A copy of it is:
Some of fastspin's optimizations
================================
Below are discussed some of the optimizations performed by fastspin, and at what level they are enabled.
Multiplication conversion (always)
-------------------------
Multiplies by powers of two, or numbers near a power of two, are converted to shifts. For example
```
a := a*10
```
is converted to
```
a := (a<<3) + (a<<1)
```
A similar optimization is performed for divisions by powers of two.
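The `a*10` example can be checked with a short Python sketch. This is only an illustration of the idea (one shift per set bit of the constant), not fastspin's actual algorithm; the function names are made up for this example.

```python
def shift_terms(k):
    """Shift amounts whose powers of two sum to the positive constant k."""
    return [bit for bit in range(k.bit_length()) if (k >> bit) & 1]


def multiply_by_shifts(a, k):
    """Compute a*k using only shifts and adds."""
    return sum(a << s for s in shift_terms(k))


if __name__ == "__main__":
    print(shift_terms(10))            # shifts used for a*10
    print(multiply_by_shifts(7, 10))  # same result as 7*10
```

For 10 the shifts are 1 and 3, which is the `(a<<3) + (a<<1)` form shown above.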
Unused method removal (-O1)
---------------------
This is pretty standard; if a method is not used, no code is emitted for it.
Dead code elimination (-O1)
---------------------
Within functions if code can obviously never be reached it is also removed. So for instance in something like:
```
CON
  pin = 1
...
  if (pin == 2)
    foo
```
The if statement and call to `foo` are removed since the condition is always false.
Small Method inlining (-O1)
---------------------
Very small methods are expanded inline.
Register optimization (-O1)
---------------------
The compiler analyzes assignments to registers and attempts to minimize the number of moves (and temporary registers) required.
Branch elimination (-O1)
------------------
Short branch sequences are converted to conditional execution where possible.
Constant propagation (-O1)
--------------------
If a register is known to contain a constant, arithmetic on that register can often be replaced with move of another constant.
Peephole optimization (-O1)
---------------------
In generated assembly code, various shorter combinations of instructions can sometimes be substituted for longer combinations.
Loop optimization (basic in -O1, stronger in -O2)
-----------------
In some circumstances the optimizer can re-arrange counting loops so that the `djnz` instruction may be used instead of a combination of add/sub, compare, and branch. In -O2 a more thorough loop analysis makes this possible in more cases.
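As a rough illustration (a Python model, not what fastspin emits), the rewrite is valid when the loop body never reads the index: a count-up loop and a decrement-and-test loop then run the body the same number of times. The helper names here are hypothetical.

```python
def count_up(n, body):
    """Loop in the add / compare / branch style."""
    i = 0
    while i < n:
        body()
        i += 1


def count_down(n, body):
    """Loop in the djnz style: decrement, repeat while non-zero."""
    if n <= 0:
        return
    remaining = n
    while True:
        body()
        remaining -= 1
        if remaining == 0:
            break


def calls_made(loop, n):
    """Count how many times a loop runs its body."""
    counter = []
    loop(n, lambda: counter.append(1))
    return len(counter)
```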
Fcache (-O1 for P1, -O2 for P2)
------
Small loops are copied to internal memory (COG on P1, LUT on P2) to be executed there. These loops cannot have any non-inlined calls in them.
Single Use Method inlining (-O2)
--------------------------
If a method is called only once in a whole program, it is expanded inline at the call site.
Common Subexpression Elimination (-O2)
--------------------------------
Code like:
```
c := a*a + a*a
```
is automatically converted to something like:
```
tmp := a*a
c := tmp + tmp
```
Loop Strength Reduction (-O2)
-----------------------
### Array indexes
Array lookups inside loops are converted to pointers. So:
```
repeat i from 0 to n-1
  a[i] := b[i]
```
is converted to the equivalent of
```
aptr := @a[0]
bptr := @b[0]
repeat n
  long[aptr] := long[bptr]
  aptr += 4
  bptr += 4
```
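The two forms above can be modeled in Python to confirm that they touch the same memory. This is a sketch under assumptions: memory is a flat `bytearray`, longs are 4 bytes little-endian, and the function names are invented for the example.

```python
import struct


def copy_indexed(mem, a_base, b_base, n):
    """a[i] := b[i], recomputing the address from the index each time."""
    for i in range(n):
        value = struct.unpack_from("<I", mem, b_base + 4 * i)[0]
        struct.pack_into("<I", mem, a_base + 4 * i, value)


def copy_pointers(mem, a_base, b_base, n):
    """Same copy with two running pointers and no per-iteration multiply."""
    aptr, bptr = a_base, b_base
    for _ in range(n):
        value = struct.unpack_from("<I", mem, bptr)[0]
        struct.pack_into("<I", mem, aptr, value)
        aptr += 4
        bptr += 4
```

The pointer version replaces the index-to-address calculation with two additions per iteration.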
### Multiply to addition
An expression like `(i*100)` where `i` is a loop index can be converted to
something like `itmp \ itmp + 100`
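A small Python sketch of the same idea, assuming the `itmp \ itmp + 100` form means "use the current value of `itmp`, then set it to `itmp + 100`". The function names are hypothetical.

```python
def scaled_by_multiply(n, step=100):
    """Values of i*step for i = 0 .. n-1, using a multiply each time."""
    return [i * step for i in range(n)]


def scaled_by_addition(n, step=100):
    """Same values, produced by a running sum instead of a multiply."""
    out = []
    itmp = 0
    for _ in range(n):
        out.append(itmp)  # use the current value...
        itmp += step      # ...then advance it for the next iteration
    return out
```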
... The question is, is it possible to issue a RDFAST while in hubexec to confuse hubexec in a particular way that will consistently give the same effect as a jump with no overhead?
Got me interested ... after some experimenting, short answer is no. Hubexec stalls waiting for the FIFO to refill. Bit 31 of D is ignored. You get the next instruction after the RDFAST, i.e. what's in the instruction pipeline already, before the stall occurs.
So, then, it works reliably as a delayed JMP that doesn't cancel the next instruction already in the pipeline? This sounds like it would still have other uses.
Err, small correction, the cog doesn't stall, the FIFO supplies NOPs until the reload is complete. I say this because I observed another FIFO reload consistent with the length of the REP block without stalls.
So, I guess the intervening RDFAST data lasts until the independent REP position causes a reload.
Oh! If I remove the set bit 31 of D in the RDFAST, the REP loop time gets extended accordingly but the NOP time doesn't change, the NOP time becomes a stall time instead. I expected a change but now I realise that's just how long the hub rotation takes to even start to fill the FIFO.
So bit 31 of D works for this but the FIFO is flushed immediately on RDFAST.
There's an "Optimizations.md" file in the doc folder, but it didn't describe the levels at which the optimizations were enabled and didn't describe all of them. I've updated it with the extra info. A copy of it is:
Thanks Eric. I hadn't checked that you already had that, and thanks for updating it with the levels.
Does bit 31 make a difference, then? And is RDFAST any quicker than the hidden jump in REP?
It trades stalling for NOP'ing. The REP loop timing is affected but that is barely relevant when the execution path is all screwy. With bit 31 set, instructions go missing until the FIFO has reloaded. So, for example, if you time it to match the REP looping then the remainder of the REP block is not executed.
I've updated fastspin to 4.1.11, and binaries are available from github and Patreon. The changes since 4.1.9 are:
Version 4.1.11
- Made _rxraw more forgiving (but slower) in P1
- No longer align P2 output to 32 bytes (matches newer PNut)
- Fixed various bugs in POSIX file functions
- Special case @ operation in REP @x, #N in inline asm
Version 4.1.10
- Added APPEND mode for open
- Added +<= and +>= aliases for +=< and +=> in Spin2
- Allow @"stuff" notation in Spin; means the same as STRING("stuff")
- Added COGCHK function
- Allowed assembly-only C and BASIC files (similar to Spin with only DAT)
- Implemented ONES, QEXP, and QLOG Spin2 operators
- Fixed a problem with negative numbers in Spin CASE statements
- Fixed bytemove, wordmove, longmove to work with overlapping extents
- Fixed a bogus name conflict with some internal variables like "_dir"
- Made OPTION EXPLICIT apply to FOR loop variables
(Version 4.1.10 didn't get a binary release, except as part of the FlexGUI 4.1.10 that I put on my Patreon page.)
Eric,
Is there a way to force an FCACHE around a piece of inline pasm code, or something similar? How big a lump can FCACHE handle? I've started optimising sdspi_bashed.spin2 and I want to handcraft some bit-bashing timing for better speed. I suppose spin code could be trusted, but eventually I'll want to replace it with even faster streamer ops in bursts. So the FIFO will be dual purpose then.
If code doesn't work with -O2 set, are there any clues as for how to fix it?
Look at the .lst file and try to figure out what's wrong? One thing you could try is setting the fcache size to 0 with -O2 --fcache=0 to see if the problem is in fcache somehow.
For that matter I think you can enable fcache separately from -O2 by giving --fcache=N, where N is the number of longs to reserve in LUT.
Can you get it to do anything useful if you omit the REP entirely and rely on just the RDFAST to do your branching?
In the new code I just checked in to github (source only for now) ORG/END will also flag its block of assembly to be copied to FCACHE, i.e. LUT memory, before execution. In C `__asm const` does the same, and in BASIC it's `ASM CPU`. FCACHE is turned on for P2 now by default. If you're able to build from source, please give it a try, I'd like to shake out the bugs.
EDIT: add "an alternative to"
You need to use @ in pnut to get loop length for REP blocks.
I think inline assembly in FastSpin behaves differently...