1 cycle/long memory zero
moony
in Propeller 2
    setq    len
    wrlong  #0, ptr
@cgracey This seems to be unintended behavior in the P2, but it's quite useful, so I was hoping you could declare it honorary intended behavior for me.
What's going on:
When the P2 sees an immediate there, it treats it as a 9-bit value and fills every 32-bit long of the block with it. AUGD does not work here, but if it did, this would be a full-blown memset. Maybe, if this is declared Intended(tm), AUGD could be fixed to work in future revs, enabling a high-speed memset.
I haven't verified that this works for lengths >511, so hopefully Chip can answer that.
Thanks to @Wuerfel_21 for asking me to test this behavior in the first place, allowing us to find this neat little trick. Not the memset we were hoping for, but bzero is good too.
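For anyone who wants to reason about the behaviour off-chip, here is a minimal Python model of it as described in this post. It is a sketch, not a hardware spec: `setq_wrlong_imm` and the `hub` bytearray are made-up names, and the only assumptions are that a plain immediate is 9 bits wide, that hub longs are little-endian, and that the SETQ value is the long count minus one.

```python
import struct

def setq_wrlong_imm(hub, ptr, q, imm9):
    """Model of 'SETQ q' + 'WRLONG #imm9, ptr' as described in this thread."""
    value = imm9 & 0x1FF              # plain immediate: 9 bits, zero-extended to 32
    for i in range(q + 1):            # SETQ holds count - 1, so q + 1 longs get written
        struct.pack_into("<I", hub, ptr + 4 * i, value)
    return hub

hub = bytearray(b"\xAA" * 32)         # 8 longs of pretend hub RAM
setq_wrlong_imm(hub, 8, 3, 0)         # zero 4 longs starting at byte address 8
print(hub.hex())
```

Only the four longs at byte addresses 8..23 change; everything around them keeps its old contents.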
Comments
Remember that the value you set in setq (the contents of len) should actually be 1 less than the total length.
Also, why would you need AUGD to make it a memset? If you want to fill in "len+1" bytes you could use "wrbyte #0, ptr" instead of "wrlong".
@ersmith because wrbyte does not have this behavior. I tried. wrbyte simply does not implement the block copy functionality.
Ah, interesting, I hadn't thought that the setq was restricted to wrlong / rdlong, but now that you mention it it does kind of make sense. Thanks for testing this!
Now that I've tried it myself, it works for me, so idk
As it turns out, AUGD does work, and I simply made a mistake. This means this is a general purpose memset should you use SMC. Perfect!
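To make the self-modifying part concrete: AUGD supplies the upper 23 bits of the next immediate D value and the instruction's own D field supplies the lower 9, so an arbitrary 32-bit fill value has to be split across two instructions. A small Python sketch of that split (the function names are made up for illustration, and how the two pieces get patched into the instructions is left out):

```python
def split_for_augd(value):
    """Return (aug23, imm9): bits 31..9 go to AUGD, bits 8..0 go to the WRLONG D field."""
    value &= 0xFFFFFFFF
    return value >> 9, value & 0x1FF

def join_augd(aug23, imm9):
    """Rebuild the 32-bit value the way the augmented immediate is assembled."""
    return ((aug23 << 9) | (imm9 & 0x1FF)) & 0xFFFFFFFF

aug23, imm9 = split_for_augd(0xDEADBEEF)
print(hex(aug23), hex(imm9), hex(join_augd(aug23, imm9)))
```

Without the AUGD half, only fill values 0..511 are reachable, which is why a bare immediate gives you bzero plus a few small patterns rather than a full memset.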
Ha, cool. It took me a while to notice what's being advertised here: the #0 immediate mode for source data doesn't block copy cogRAM data the way SETQ+WRLONG is documented to, but just keeps reusing the immediate data, hence it achieves a hubRAM fill.
This needs to be added to the documentation.
Being designed for cogRAM transfers, I strongly suspect it would have been limited to 512 transfers at a time with a 9-bit counter. You could add REPs around it though, and deal with the start/end cases outside the loop to work around that, and it would still be very efficient. Good find!
I am pretty sure it uses an 18-bit counter, in order to be able to fill the whole hub memory. The Cog RAM or LUT will just wrap. There is a hardware bug that stops PTRx from automatically incrementing or decrementing by the full amount when SETQ/SETQ2 are used, if there is an intervening instruction such as ALTx or AUGx.
Dang! ALTD is needed for specifying an arbitrary fill value. Still, a fast erase is cool.
It will fill, it just won't automatically inc/dec PTRx correctly. You probably don't care, in the case of a fill.
The arbitrary fill works perfectly fine, just have to self-modify the value into the right place.
Me and @moony have been hacking at VJET, the polygon rendering thing from the P1. I have it running right now at 1024x768, displaying a 181-sided shape with just two renderer cogs, but I broke something when converting it to use FIFO ops to read display list data. The fast fill trick is used for both filling the background and filling the polygon spans. The latter looks like this:
Oh, cool. No issue really. Just a minor detail.
ALTD is better than AUGD. It provides a plain specified fill value. eg:
PS: ALTI will work too.
Very nice to have a fast fill, especially for graphics use with simple patterns (eg, fine checkerboard or other solid backgrounds). Maybe you can have a ret on the wrlong to save further space and two clocks?
That doesn't work, ALTD masks to 9 bits
Oh, doh, you're right, immediate is 9 bits unless AUG'd. Shows I haven't tested it.
Does the Spin2 bytefill(), longfill(), etc. use this?
It did, at one point. I can't remember if it still does that, or if it fills a buffer with data and then just keeps writing the buffer.
Cool, RET can be used on that WRLONG too. eg:
Lol, the underscores turned it into italics.
But not in the code block. That's nice.
Here are the speed results, in clock cycles, including calling and handling overheads, of five routines all filling 64 kB of hubRAM:
- Straight line incrementing WRLONGs, first to last (done in two instructions): 147464 (every 9 clocks)
- Straight line decrementing WRLONGs, last to first: 114705 (every 7 clocks)
- Interleaved WRLONG addressing, 16 byte granularity, split into four REPs: 65584 (every 4 clocks)
- Straight line REP'd WFLONG using the FIFO: 32785 (every 2 clocks)
- And the above SETQ+WRLONG block copy with immediate data: 16410 (every clock)
Note: The first two are both interrupt friendly. A REP can't speed them up so they're using DJNZ instead.
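As a sanity check on the "every N clocks" figures, here is a quick Python calculation using only the totals listed above. 64 kB is 16384 longs, so whatever is left after subtracting N clocks per long is the fixed call/setup overhead of each routine. The dictionary labels are just shorthand for the five routines.

```python
LONGS = 64 * 1024 // 4    # 64 kB = 16384 longs

# name: (reported total clocks, stated clocks per long)
results = {
    "incrementing WRLONG": (147464, 9),
    "decrementing WRLONG": (114705, 7),
    "interleaved WRLONG":  (65584, 4),
    "REP'd WFLONG (FIFO)": (32785, 2),
    "SETQ+WRLONG #imm":    (16410, 1),
}

def overhead(total_clocks, clocks_per_long, longs=LONGS):
    """Clocks left over once the per-long cost is accounted for."""
    return total_clocks - clocks_per_long * longs

for name, (clocks, per_long) in results.items():
    print(f"{name:22s} {clocks / LONGS:6.3f} clocks/long, "
          f"{overhead(clocks, per_long):3d} clocks fixed overhead")
```

Every routine lands within a few dozen clocks of an exact multiple of 16384, so the per-long rates quoted in the list hold up.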
I've managed a low of 20978 clock cycles for 64 kB using SETQ+WRLONG with DJNZ and broken up chunks of 31 longwords at a time. So this is reasonably interrupt friendly. Next best is chunks of 24 longwords achieving 21938 clock cycles.
PS: I'm guessing this might be similar to what Chip has done in the newer Spin2 engine.
PPS: For an extra small chunk buffer size of 12 you score 30122, which is still better than the REP'd WFLONG.
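For reference, this is how a 64 kB fill divides up for the three chunk sizes mentioned, as a Python sketch. `chunk_plan` is a hypothetical helper, not something from the posted source; it only does the arithmetic for whole chunks, the left-over longs, and the SETQ values (count minus one) each part would need.

```python
def chunk_plan(total_longs, chunk_longs):
    """Split a fill into whole chunks plus a remainder, with SETQ values (count - 1)."""
    full, rem = divmod(total_longs, chunk_longs)
    return {
        "full_chunks": full,
        "setq_full": chunk_longs - 1,
        "remainder_longs": rem,
        "setq_remainder": rem - 1 if rem else None,   # None: no tail fill needed
    }

LONGS = 16384                                          # 64 kB
scores = {31: 20978, 24: 21938, 12: 30122}             # totals reported above
for chunk, clocks in scores.items():
    print(chunk, chunk_plan(LONGS, chunk), f"{clocks / LONGS:.3f} clocks/long")
```

None of these sizes divides 16384 evenly, so each of them needs a tail fill after the main loop.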
Here's the source code that achieves that. I've used lutRAM for the chunk copy buffer but cogRAM can be used just as easily.
PS: I've not verified it fills correctly so it could be horribly broken.
Yep. I do SETQ+WRLONG for 16 longs at a whack.
@evanh try the self-modifying solution with AUGD instead of filling LUT. It'd be just as fast, I think (Unless the extra AUGD has a big impact?), but have less overhead per fill and of course doesn't need a buffer.
Yes, I should do that too. I'd sort of written it off because of the extra setup overhead, but there is so much of that now it'll hardly be notable. Off to work now though ...
Ye, filling any decent size of LUT/cog is much slower than poking that immediate in there.
Okay, so my code above is two-way interleaved chunking. Single-way, like how Chip has it, isn't as fast, but basically its speed curve is something like a flipped log of chunk size (it improves with larger sizes but with diminishing returns). So for 64 kB with a chunk size of 16 longwords, as Chip stated, its score is 32852.
Making Wuerfel's immediate mode into a single-way chunking algorithm tracks Chip's speed almost exactly. Very little diff and very little noise between them. eg: 64 kB at size 16 is 32854. The extra instructions within the inner loop don't add any loop time.
Chip's method (guessed):
Wuerfel's method:
Two-way interleaving is negatively impacted by the bigger inner loop compared to Chip's method. It still helps though, size 16 scores 28766. The best I found was at chunk size 28 where it scored a 22322. But then single-way at size 28 also scores 25830. They all get better with larger chunks.
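One way to read these scores, as a rough Python sketch: assume the fill itself costs one clock per long and that everything else is a fixed cost per chunk, then back out that per-chunk cost from the reported totals. That cost model is an assumption made here for illustration, not something measured in this thread.

```python
import math

LONGS = 16384    # 64 kB

def per_chunk_overhead(score, chunk_longs, longs=LONGS):
    """Clocks per chunk beyond 1 clock/long, under the assumed cost model."""
    chunks = math.ceil(longs / chunk_longs)
    return (score - longs) / chunks

reported = [
    ("single-way, size 16", 32852, 16),
    ("single-way, size 28", 25830, 28),
    ("two-way,    size 16", 28766, 16),
    ("two-way,    size 28", 22322, 28),
]
for name, score, chunk in reported:
    print(f"{name}: {per_chunk_overhead(score, chunk):5.2f} clocks per chunk")
```

Under that assumption the single-way numbers come out at about 16 clocks per chunk for both sizes, which would explain the diminishing returns: the total time behaves like a constant plus something proportional to 1/chunk size.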
Source code:
Hmm, the three interleaved routines are failing the verify test. I guess I shouldn't be surprised ...
EDIT: Double hmm, I think the PTRA++/PTRB++ steps seem to be going four times too far at the end of each chunk. Doh! No, not it, just me confusing hub addresses with word count.
EDIT2: Ah, and that's what's wrong with the interleaving routines too. I'd missed one case of scaling the count by 4 and then duplicated the bug.
So the source above changes from:
to:
Grr, found another small bug with the remainder section. It can need more than a chunk size. Too tired to fix now ...
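A guess at why the remainder can exceed one chunk, sketched in Python: if the main loop consumes two chunks per pass (the two-way interleave), whatever is left over can be anything up to two chunks minus one long, so the tail may need more than one SETQ+WRLONG. `two_way_plan` and `byte_step` are hypothetical helpers, and the two-chunks-per-pass loop shape is an assumption, not taken from the posted source.

```python
def two_way_plan(total_longs, chunk_longs):
    """Return (passes, tail): full two-chunk passes, then tail fills of at most one chunk each."""
    passes, rem = divmod(total_longs, 2 * chunk_longs)
    tail = []
    while rem > 0:
        n = min(rem, chunk_longs)     # a single block fill covers at most one chunk here
        tail.append(n)
        rem -= n
    return passes, tail

def byte_step(longs):
    """Hub pointers advance in bytes, counts are in longs: scale by 4."""
    return longs * 4

passes, tail = two_way_plan(16384, 28)
print(passes, tail, [byte_step(n) for n in tail])
```

With 64 kB and a chunk size of 28 this leaves 32 longs after the main loop, which is more than one chunk, so the tail needs two fills (28 longs, then 4).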