Fast Block Fills
TonyB_
Posts: 2,196
in Propeller 2
I think Chip said he was dead out of space for the interpreter; I wonder if that drove the decision to limit the timing to 32 bits.
With more code, it could be achieved. There are 15 identical instructions in the BYTEFILL/WORDFILL/LONGFILL routine that would sure be nice to condense, somehow. Then, we'd have some pennies left over for a 64-bit WAITMS/WAITUS.
Above is copied from another thread.
This is probably no help for the P2, but here is an idea for an optional enhancement that turns Fast Block Moves into Fast Block Fills: a single long is read and then written multiple times. If Q[31] = 1, SETQ+RDLONG reads only the first hub long and SETQ+WRLONG/WMLONG reads only the first cog long (similarly for SETQ2).
I wish I'd thought of this at the time as it would have saved 12 longs in the Spin2 interpreter with no more cycles than now on average.
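For anyone who wants to poke at the proposed semantics, here is a rough Python model of it. This is only a sketch of the idea, not how the silicon works. The assumptions are mine: hub RAM is modelled as a list of longs indexed by long (the real hub is byte-addressed), the remaining Q bits are taken to hold the long count minus one as with ordinary block moves, WMLONG is not modelled, and the names `FILL_FLAG`, `COUNT_MASK`, `setq_wrlong` and `setq_rdlong` are made up for illustration.

```python
FILL_FLAG = 1 << 31          # proposed: Q[31] = 1 selects fill mode
COUNT_MASK = FILL_FLAG - 1   # assumption: the remaining bits hold (count - 1)


def setq_wrlong(cog, hub, q, cog_reg, hub_index):
    """Model SETQ q + WRLONG: copy (or fill) from cog registers to hub longs."""
    fill = bool(q & FILL_FLAG)
    for i in range((q & COUNT_MASK) + 1):
        # Fill mode keeps re-using the first cog long instead of stepping.
        src = cog_reg if fill else cog_reg + i
        hub[hub_index + i] = cog[src]


def setq_rdlong(cog, hub, q, cog_reg, hub_index):
    """Model SETQ q + RDLONG: copy (or fill) from hub longs to cog registers."""
    fill = bool(q & FILL_FLAG)
    for i in range((q & COUNT_MASK) + 1):
        # Fill mode keeps re-using the first hub long instead of stepping.
        src = hub_index if fill else hub_index + i
        cog[cog_reg + i] = hub[src]
```

With Q[31] clear the model behaves like a normal block move; with it set, every destination long receives the first source long.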
Comments
Yes, we don't have any firepower when it comes to efficiently replicating a value into cog registers.
The alternative is to use a post-increment WRLONG PB, PTRA++. But then it's slower, at 9 sysclocks per longword.
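To put the 9-sysclock figure in perspective, here is a back-of-the-envelope comparison in Python. It assumes a block transfer moves one long per clock and ignores setup and hub alignment, so treat the numbers as rough; `fill_clocks` is just an illustrative helper.

```python
CLOCKS_PER_LONG_LOOP = 9    # WRLONG PB, PTRA++ figure quoted above
CLOCKS_PER_LONG_BLOCK = 1   # assumption: block transfer moves one long per clock


def fill_clocks(num_bytes, clocks_per_long):
    """Clocks to fill num_bytes of hub RAM, ignoring setup and alignment."""
    longs = num_bytes // 4
    return longs * clocks_per_long


block_64k = 64 * 1024
loop_total = fill_clocks(block_64k, CLOCKS_PER_LONG_LOOP)
block_total = fill_clocks(block_64k, CLOCKS_PER_LONG_BLOCK)
print(loop_total, block_total, loop_total / block_total)
```

Under those assumptions a 64 KB fill costs 147,456 clocks with the post-increment loop against 16,384 clocks of pure transfer time.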
With a future Fast Block Fill, the last 15 MOVs could be replaced by:
Average execution time ~ 30 cycles, the same as now, or slightly quicker with setq reg.
Filling a 64kB block
This code is strictly cogexec.
So here's a question then. In TAQOZ I reserve some cog memory to be able to load modules that benefit from or can only run from cogexec. The load using setq is superfast, so there is minimal overhead. Does Spin2 allow this type of "fcache"?
Here's the existing fill portion of the Spin2 code that transfers the 16 pre-filled registers to hub RAM in bursts:
Once it is up and running, the loop seems to take 20 clocks for the 9 instructions + 16 cycles for the transfer, plus a hub access window delay of 4 additional clocks to align back to a multiple of 8 clock cycles for the egg-beater. To me that implies that the fill efficiency is 40% (16/40 clocks). Perhaps if the loop could be shrunk by removing the jump instruction then we can boost it to 50% (16/32), which makes the fill 25% faster. So how about this variant below...can it help us?
It doesn't take any more COG RAM space, and it doesn't try to branch at the end of the REP loop either, which is safe I think, right @evanh? Plus, when the source and dest buffers are aligned to the same 8-long boundary, I think it might be able to speed up the copy as well, because we are again on a multiple of 8 clocks (16 for instructions + 16 for reads + 16 for writes = 48) and the bus transfer efficiency is then about 67% (32/48).
Is my simple reasoning here correct or am I missing something?
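The arithmetic in this post can be checked with a few lines of Python. This is only a model of the reasoning as stated: it assumes every loop iteration is padded up to the next multiple of 8 clocks, and it takes the instruction and transfer clock counts straight from the text. `HUB_WINDOW`, `loop_clocks` and `efficiency` are illustrative names.

```python
HUB_WINDOW = 8  # clocks; each iteration is assumed to realign to a multiple of 8


def loop_clocks(instr_clocks, transfer_clocks):
    """Clocks per iteration, rounded up to the next hub window boundary."""
    busy = instr_clocks + transfer_clocks
    return -(-busy // HUB_WINDOW) * HUB_WINDOW


def efficiency(instr_clocks, transfer_clocks):
    """Fraction of each iteration spent actually moving longs."""
    return transfer_clocks / loop_clocks(instr_clocks, transfer_clocks)


current_fill = loop_clocks(20, 16)   # 36 busy clocks padded to 40
variant_fill = loop_clocks(16, 16)   # 32, already aligned
aligned_copy = loop_clocks(16, 32)   # 16 reads + 16 writes, 48 total

print(current_fill, efficiency(20, 16))
print(variant_fill, efficiency(16, 16), current_fill / variant_fill)
print(aligned_copy, efficiency(16, 32))
```

The model reproduces the 40% and 50% figures and the 1.25x speed-up for the fill, and gives 32/48 for the aligned copy. It says nothing about interrupt latency.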
There is one thing you are missing: interrupts. They will be stalled for a long time during a fill, due to REP. The way the interpreter is set up now, nothing hogs much time, so interrupts can be very granular in time.
The first $138 cog registers ($000..$137) are available. You can also have terminate-stay-resident PASM programs in that space.
Whenever PASM is called from Spin, including inline assembly, the FIFO is available to the code, since it is reinitialized afterwards.
Yeah I thought it was a bit too easy...
What's the current longest time that interrupts are stalled in the Spin2 interpreter?
At the expense of one extra long it looks like both could be accommodated.
A compile time decision could be made that adjusts the entry point to select between fast and interrupt friendly versions.
That's a lot more than expected, very nice!
Your PASM code just needs to not touch registers $138..$1D7 and the LUT. You can use all the I/O registers, 6 stack levels, and the FIFO. Pretty complicated programs can run in the background on interrupts that you launch from in-line PASM or REGEXEC.
SETQ + MOV to move S to D, D+1, ..., utilising part of SETQ + RDLONG logic.
In the Spin2 code above, RDLONG has been done to copy fill data to a cog register so no need to do RDLONG again.
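As a sketch of what that would mean, here is a tiny Python model of the suggested SETQ + MOV behaviour. It is hypothetical throughout: no such instruction pairing exists on the P2, `setq_mov` is an invented name, and I am assuming Q would hold the register count minus one, as it does for block transfers.

```python
def setq_mov(cog, q, d, s):
    """Model the suggested SETQ q + MOV d, s: replicate cog[s] into d..d+q."""
    value = cog[s]              # S is read once
    for i in range(q + 1):      # assumption: Q holds (count - 1)
        cog[d + i] = value


cog = [0] * 32
cog[20] = 0xDEADBEEF
setq_mov(cog, 14, 1, 20)        # stands in for a run of 15 identical MOVs
```

In this model one SETQ + MOV pair does the work of the 15 MOVs mentioned earlier in the thread.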
Correct, unless you determine what mode the FIFO is in and where it's at so that you can restore it. There could also be a SKIP pattern in progress from the main code. What could be simply done in an interrupt would be some I/O pin servicing, buffer updating, and flag setting. Better tread lightly.