1 cycle/long memory zero

moony · 2021-02-21 15:13

        setq len
        wrlong #0, ptr

@cgracey This seems to be unintended behavior in the P2, but it's quite useful, so I was hoping you could declare it honorary intended behavior for me.
What's going on:
When the P2 sees an immediate there, it actually treats it as a 9-bit value to fill in 32-bit blocks. AUGD does not work here, but if it did, this would be a full-blown memset. Maybe, if this is declared Intended(tm), AUGD could be fixed to work in future revs, enabling a high-speed memset.
I haven't verified this works for lengths >511, so hopefully Chip can answer that.
Thanks to @Wuerfel_21 for asking me to test this behavior in the first place, allowing us to find this neat little trick. Not the memset we were hoping for, but bzero is good too.

ersmith · 2021-02-21 15:24

Remember that the value you set in setq (the contents of len) should actually be 1 less than the total length.

Also, why would you need AUGD to make it a memset? If you want to fill in "len+1" bytes you could use "wrbyte #0, ptr" instead of "wrlong".

moony · 2021-02-21 15:26

@ersmith because wrbyte does not have this behavior. I tried. wrbyte simply does not implement the block copy functionality.

ersmith · 2021-02-21 15:38

Ah, interesting, I hadn't thought that the setq was restricted to wrlong / rdlong, but now that you mention it it does kind of make sense. Thanks for testing this!

Wuerfel_21 · 2021-02-21 15:50

@moony said:
AUGD does not work here

Now that I've tried it myself, it works for me, so idk

moony · 2021-02-21 16:09

As it turns out, AUGD does work, and I simply made a mistake. This means this is a general purpose memset should you use SMC (self-modifying code). Perfect!

evanh · 2021-02-21 21:07

Ha, cool. It took me a while to notice what's being advertised here: with the #0 immediate as source data, SETQ+WRLONG doesn't block copy cogRAM data the way it is documented to, but just keeps reusing the immediate data, hence it achieves a hubRAM fill function.

This needs to be added to the documentation.
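
For readers who want the difference spelled out, here is a small Python model (not P2 code) of the behavior described so far. The function name, the list-based `hub` and `cog` memories, and the use of long indexes instead of byte addresses are all assumptions made for illustration. The only facts taken from the thread are that SETQ holds the count minus one, that a register source streams consecutive cogRAM longs, and that an immediate source is simply reused for every long.

```python
def setq_wrlong(hub, cog, q, src, long_index, immediate=False):
    """Toy model of SETQ q followed by WRLONG src, ptr.

    hub and cog are lists of 32-bit longs. long_index is a long index
    into hub, not a byte address (a simplification for this sketch).
    """
    count = q + 1  # SETQ holds the number of longs minus one
    for i in range(count):
        if immediate:
            value = src & 0x1FF  # 9-bit immediate, reused for every long
        else:
            value = cog[(src + i) % len(cog)]  # consecutive cog registers
        hub[long_index + i] = value & 0xFFFFFFFF
    return hub


if __name__ == "__main__":
    cog = [100 + i for i in range(512)]
    hub = [0xDEADBEEF] * 8
    setq_wrlong(hub, cog, 3, 0, 2, immediate=True)  # zero 4 longs at index 2
    print([hex(v) for v in hub])
```

With `immediate=False` the same call copies `cog[src]`, `cog[src+1]`, and so on, which is the documented block-copy case.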

rogloh · 2021-02-21 21:22

Being designed for COGRAM transfers, I strongly suspect it would have been limited to 512 transfers at a time with a 9-bit counter. You could add REPs around it, though, and deal with the start/end cases outside the loop to work around that, and it would still be very efficient. Good find!

cgracey · 2021-02-21 21:57

I am pretty sure it uses an 18-bit counter, in order to be able to fill the whole hub memory. The Cog RAM or LUT will just wrap. There is a hardware bug that stops PTRx from automatically incrementing or decrementing by the full amount when SETQ/SETQ2 are used, if there is an intervening instruction such as ALTx or AUGx.

evanh · 2021-02-21 22:07

@cgracey said:
... if there is an intervening instruction such as ALTx or AUGx.

Dang! ALTD is needed for specifying an arbitrary fill value. Still, a fast erase is cool.

cgracey · 2021-02-21 22:15

@evanh said:

@cgracey said:
... if there is an intervening instruction such as ALTx or AUGx.

Dang! ALTD is needed for specifying an arbitrary fill value. Still, a fast erase is cool.

It will fill, it just won't automatically inc/dec PTRx correctly. You probably don't care, in the case of a fill.

Wuerfel_21 · 2021-02-21 22:16

The arbitrary fill works perfectly fine, just have to self-modify the value into the right place.

Me and @moony have been hacking at VJET, the polygon rendering thing from the P1. I have it running right now at 1024x768, displaying a 181-sided shape with just two renderer cogs, but I broke something when converting it to use FIFO ops to read display list data. The fast fill trick is used for both filling the background and filling the polygon spans. The latter looks like this:

draw_span
        ' range check edges
        mov span_right,poly_right
        sar span_right,#16
        fles span_right,render_width
        mov ptrb,poly_left
        sar ptrb,#16
        fges ptrb,#0

        mov span_length,span_right
        subs span_length,ptrb wcz
if_be   ret ' return if span length/width =< 0

        ' draw span
        add ptrb,display_base

        mov span_color,poly_color
        movbyts span_color,#%%0000

        shr span_length,#1 wc
if_c    wrbyte span_color,ptrb++
        shr span_length,#1 wcz
if_c    wrword span_color,ptrb++
if_z    ret

        setd .thewr,span_color
        ror span_color,#9
        setq augd_mask
        muxq .theaug,span_color
        sub span_length,#1

        setq span_length
.theaug augd #0-0
.thewr  wrlong #0-0,ptrb
        ret
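
The self-modifying part of that routine (SETD, ROR #9, SETQ + MUXQ) can be hard to follow, so here is a rough Python model of the bit shuffling. It is a sketch under assumptions: `AUG_MASK` stands in for `augd_mask`, whose value is not shown in the thread and is assumed here to be the low 23 bits, and all function names are made up for illustration. What it relies on is that AUGD supplies the upper 23 bits of the next immediate D, while the WRLONG instruction's own D field supplies the low 9 bits.

```python
AUG_MASK = 0x7FFFFF  # assumed value of augd_mask: the low 23 bits


def replicate_byte(color):
    """Model of MOVBYTS x, #%%0000: copy byte 0 into all four bytes."""
    return (color & 0xFF) * 0x01010101


def split_fill_value(value):
    """Split a 32-bit fill value the way the SETD / ROR #9 / MUXQ steps do."""
    value &= 0xFFFFFFFF
    low9 = value & 0x1FF  # goes into the D field of the WRLONG
    rotated = ((value >> 9) | (value << 23)) & 0xFFFFFFFF  # ror value, #9
    high23 = rotated & AUG_MASK  # goes into the AUGD instruction
    return high23, low9


def join_fill_value(high23, low9):
    """What the hardware reassembles from AUGD plus the 9-bit immediate."""
    return ((high23 << 9) | low9) & 0xFFFFFFFF


if __name__ == "__main__":
    fill = replicate_byte(0x12345A7)
    high23, low9 = split_fill_value(fill)
    print(hex(fill), hex(high23), hex(low9), hex(join_fill_value(high23, low9)))
```

Joining the two pieces always gives back the original 32-bit value, which is why the patched AUGD + WRLONG pair behaves like a long fill with an arbitrary value.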

evanh · 2021-02-21 22:28

Oh, cool. No issue really. Just a minor detail.

ALTD is better than AUGD. It provides a plain specified fill value. eg:

fill_hub
'  PA    = loop count
'  PTRA  = start address
'  PB    = fill value
'-------------
        setq    pa
        altd    pb, #0
        wrlong  #0-0, ptra
        ret

PS: ALTI will work too.

rogloh · 2021-02-21 22:46

Very nice to have a fast fill, especially for graphics use with simple patterns (eg, fine checkerboard or other solid backgrounds). Maybe you can have a ret on the wrlong to save further space and two clocks?

Wuerfel_21 · 2021-02-21 22:46

That doesn't work, ALTD masks to 9 bits

evanh · 2021-02-21 22:50

@Wuerfel_21 said:
That doesn't work, ALTD masks to 9 bits

Oh, doh, you're right, immediate is 9 bits unless AUG'd. Shows I haven't tested it.

Rayman · 2021-02-21 23:06

Does the Spin2 bytefill(), longfill(), etc. use this?

cgracey · 2021-02-21 23:16

@Rayman said:
Does the Spin2 bytefill(), longfill(), etc. use this?

It did, at one point. I can't remember if it still does that, or if it fills a buffer with data and then just keeps writing the buffer.

evanh · 2021-02-22 00:09

Cool, RET can be used on that WRLONG too. eg:

        setq span_length
.theaug augd #0-0
.thewr  _ret_  wrlong #0-0,ptrb

evanh · 2021-02-22 00:10

Lol, the underscores turned it into italics.
But not in the code block. That's nice.

evanh · 2021-02-22 00:28

Here's the speed results, in clock cycles, including calling and handling overheads, of five routines all filling 64 kB of hubRAM:
- Straight line incrementing WRLONGs, first to last (done in two instructions): 147464 (every 9 clocks)
- Straight line decrementing WRLONGs, last to first: 114705 (every 7 clocks)
- Interleaved WRLONG addressing, 16 byte granularity, split into four REPs: 65584 (every 4 clocks)
- Straight line REP'd WFLONG using the FIFO: 32785 (every 2 clocks)
- And the above SETQ+WRLONG block copy with immediate data: 16410 (every clock)

Note: The first two are both interrupt friendly. A REP can't speed them up so they're using DJNZ instead.
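
The "every N clocks" figures follow directly from the totals, since 64 kB is 16384 longs. A quick Python check of that arithmetic (the dictionary labels are shortened by me, and treating the leftover clocks as call and setup overhead is my interpretation, not something measured separately):

```python
LONGS = 64 * 1024 // 4  # 64 kB of hubRAM is 16384 longs

results = {
    "incrementing WRLONG": 147464,
    "decrementing WRLONG": 114705,
    "interleaved WRLONG in four REPs": 65584,
    "REP'd WFLONG via FIFO": 32785,
    "SETQ+WRLONG with immediate": 16410,
}


def per_long(total_clocks, longs=LONGS):
    """Return (clocks per long, leftover clocks) using integer rounding."""
    clocks = (total_clocks + longs // 2) // longs
    leftover = total_clocks - clocks * longs
    return clocks, leftover


if __name__ == "__main__":
    for name, total in results.items():
        clocks, leftover = per_long(total)
        print(f"{name}: {clocks} clocks/long, {leftover} clocks left over")
```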

evanh · 2021-02-23 09:20

I've managed a low of 20978 clock cycles for 64 kB using SETQ+WRLONG with DJNZ and broken up chunks of 31 longwords at a time. So this is reasonably interrupt friendly. Next best is chunks of 24 longwords achieving 21938 clock cycles.

PS: I'm guessing this might be similar to what Chip has done in newer spin2 engine.
PPS: For extra small chunk buffer size of 12 you score 30122, which is still better than the REP'd WFLONG.

evanh · 2021-02-23 09:35

Here's the source code that achieves that. I've used lutRAM for the chunk copy buffer but cogRAM can be used just as easily.

PS: I've not verified it fills correctly so it could be horribly broken.

'----------------------------------------------------------
fill_hub5
'  PA    = loop count
'  PTRA  = start address
'  PB    = fill value
'-------------
        qdiv    pa, #chunk*2        'loop count / chunk size

        mov ptrb, #0
        rep #1, #chunk      'fill lutRAM chunk buffer with the fill value
        wrlut   pb, ptrb++

        getqy   pb          'remainder of count/chunk
        sub pb, #1      wc
    if_nc   setq2   pb          'chunk copy from lutRAM to hubRAM
    if_nc   wrlong  0, ptra++

        getqx   pa      wz  'quotient of count/chunk
        mov ptrb, ptra
        mov pb, pa
        mul pb, #chunk
        add ptrb, pb
.floop
    if_nz   setq2   #chunk-1        'chunk copy from lutRAM to hubRAM
    if_nz   wrlong  0, ptra++
    if_nz   setq2   #chunk-1        'chunk copy from lutRAM to hubRAM
    if_nz   wrlong  0, ptrb++
    if_nz   djnz    pa, #.floop
        ret         wcz
'----------------------------------------------------------

cgracey · 2021-02-23 15:13

@evanh said:
I've managed a low of 20978 clock cycles for 64 kB using SETQ+WRLONG with DJNZ and broken up chunks of 31 longwords at a time. So this is reasonably interrupt friendly. Next best is chunks of 24 longwords achieving 21938 clock cycles.

PS: I'm guessing this might be similar to what Chip has done in newer spin2 engine.
PPS: For extra small chunk buffer size of 12 you score 30122, which is still better than the REP'd WFLONG.

Yep. I do SETQ+WRLONG for 16 longs at a whack.

Wuerfel_21 · 2021-02-23 15:58

@evanh try the self-modifying solution with AUGD instead of filling LUT. It'd be just as fast, I think (Unless the extra AUGD has a big impact?), but have less overhead per fill and of course doesn't need a buffer.

evanh · 2021-02-23 17:43

@Wuerfel_21 said:
@evanh try the self-modifying solution with AUGD instead of filling LUT. It'd be just as fast, I think (Unless the extra AUGD has a big impact?), but have less overhead per fill and of course doesn't need a buffer.

Yes, I should do that too. I'd sort of written it off because of the extra setup overhead, but there is so much of that now it'll hardly be notable. Off to work now though ...

Wuerfel_21 · 2021-02-23 17:57

Ye, filling any decent size of LUT/cog is much slower than poking that immediate in there.

evanh · 2021-02-24 12:07

Okay, so my code above is two-way interleaved chunking. Single-way, like how Chip has it, isn't as fast. Basically, its speed is a flipped(?) log of chunk size (it improves with larger sizes but with diminishing returns). So for 64 kB with a chunk size of 16 longwords, as Chip stated, its score is 32852.

Making Wuerfel's immediate mode into a single-way chunking algorithm tracks Chip's speed almost exactly. There is very little difference and very little noise between them, e.g. 64 kB at size 16 is 32854. The extra instructions within the inner loop don't add any loop time.

Chip's method (guessed):

fill_hub7
'  PA    = loop count
'  PTRA  = start address
'  PB    = fill value
'-------------
        qdiv    pa, #chunk      'loop count / chunk size

        mov ptrb, #0
        rep #1, #chunk      'fill lutRAM chunk buffer with the fill value
        wrlut   pb, ptrb++

        getqy   pb          'remainder of count/chunk
        sub pb, #1      wc
    if_nc   setq2   pb          'chunk copy from lutRAM to hubRAM
    if_nc   wrlong  0, ptra++

        getqx   pa      wz  'quotient of count/chunk
.floop
    if_nz   setq2   #(chunk - 1)        'chunk copy from lutRAM to hubRAM
    if_nz   wrlong  0, ptra++
    if_nz   djnz    pa, #.floop
        ret         wcz

Wuerfel's method:

fill_hub8
'  PA    = loop count
'  PTRA  = start address
'  PB    = fill value
'-------------
        qdiv    pa, #chunk      'loop count / chunk size

        setd    .thewr, pb
        ror pb, #9
        setq    aug_mask
        muxq    .theaug, pb

        getqy   pb          'remainder of count/chunk
        sub pb, #1      wc
    if_nc   setq2   pb          'chunk copy from lutRAM to hubRAM
    if_nc   wrlong  0, ptra++

        getqx   pa      wz  'quotient of count/chunk
.floop
    if_nz   setq    #(chunk - 1)
.theaug if_nz   augd    #0-0
.thewr  if_nz   wrlong  #0-0, ptra
        add ptra, #(chunk * 4)
    if_nz   djnz    pa, #.floop
        ret         wcz
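
Both single-way routines share the same plan: divide the long count by the chunk size, write the remainder as one short burst, then write the quotient as full-size bursts. A minimal Python sketch of that plan, assuming byte addresses that advance by 4 per long (the function name and the tuple layout are hypothetical, chosen only for this illustration):

```python
def plan_bursts(start_addr, n_longs, chunk):
    """Return a list of (byte_address, setq_value) bursts.

    Mirrors the QDIV / GETQY / GETQX structure: the remainder burst
    goes first, then one burst per full chunk. setq_value is the burst
    length minus one, as SETQ expects.
    """
    quotient, remainder = divmod(n_longs, chunk)
    bursts = []
    addr = start_addr
    if remainder:
        bursts.append((addr, remainder - 1))
        addr += remainder * 4  # 4 bytes per long
    for _ in range(quotient):
        bursts.append((addr, chunk - 1))
        addr += chunk * 4
    return bursts


if __name__ == "__main__":
    for addr, q in plan_bursts(0x1000, 35, 16):
        print(f"burst at ${addr:05X}: SETQ #{q} ({q + 1} longs)")
```

Summing `setq_value + 1` over the bursts gives back the requested long count, which is a handy sanity check when changing the chunk size.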

evanh · 2021-02-24 12:36

Two-way interleaving is negatively impacted by the bigger inner loop compared to Chip's method. It still helps though, size 16 scores 28766. The best I found was at chunk size 28 where it scored a 22322. But then single-way at size 28 also scores 25830. They all get better with larger chunks.

Source code:

fill_hub9
'  PA    = loop count
'  PTRA  = start address
'  PB    = fill value
'-------------
        qdiv    pa, #(chunk * 2)    'loop count / chunk size

        setd    .wr1, pb
        setd    .wr2, pb
        ror pb, #9
        setq    aug_mask
        muxq    .aug1, pb
        muxq    .aug2, pb

        getqy   pb          'remainder of count/chunk
        sub pb, #1      wc
    if_nc   setq2   pb          'chunk copy from lutRAM to hubRAM
    if_nc   wrlong  0, ptra++

        getqx   pa      wz  'quotient of count/chunk
        mov ptrb, ptra
        mov pb, pa
        mul pb, #chunk
        add ptrb, pb
.floop
    if_nz   setq    #(chunk - 1)
.aug1   if_nz   augd    #0-0
.wr1    if_nz   wrlong  #0-0, ptra
        add ptra, #(chunk * 4)

    if_nz   setq    #(chunk - 1)
.aug2   if_nz   augd    #0-0
.wr2    if_nz   wrlong  #0-0, ptrb
        add ptrb, #(chunk * 4)

    if_nz   djnz    pa, #.floop
        ret         wcz

evanh · 2021-02-26 07:51

Hmm, the three interleaved routines are failing the verify test. I guess I shouldn't be surprised ...

EDIT: Double hmm, I think the PTRA++/PTRB++ steps seem to be going four times too far at the end of each chunk. Doh! No, not it, just me confusing hub addresses with word count.

EDIT2: Ah, and that's what's wrong with the interleaving routines too. I'd missed one case of scaling the count by 4 and then duplicated the bug.

So the source above changes from:

        ...
        mov pb, pa
        mul pb, #chunk
        add ptrb, pb
        ...

to:

        ...
        mov pb, pa
        mul pb, #(chunk * 4)
        add ptrb, pb
        ...
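
To see why the factor of 4 matters, here is a small Python model of the two-way interleave. It assumes, as in the routines above, that each loop pass writes one chunk through PTRA and one through PTRB, and that PTRB starts at the half-way point. All names are mine and purely illustrative.

```python
def second_stream_start(ptra, iterations, chunk, scale_to_bytes=True):
    """Start address for the PTRB stream, with or without the *4 fix."""
    longs_in_first_half = iterations * chunk
    offset = longs_in_first_half * (4 if scale_to_bytes else 1)
    return ptra + offset


def covered_bytes(ptra, iterations, chunk, scale_to_bytes=True):
    """Set of byte addresses written by both streams over all passes."""
    ptrb = second_stream_start(ptra, iterations, chunk, scale_to_bytes)
    touched = set()
    for i in range(iterations):
        for base in (ptra + i * chunk * 4, ptrb + i * chunk * 4):
            touched.update(range(base, base + chunk * 4))
    return touched


if __name__ == "__main__":
    total = 2 * 4 * 16 * 4  # two streams, 4 passes, 16 longs, 4 bytes each
    print(len(covered_bytes(0, 4, 16, scale_to_bytes=True)), "of", total)
    print(len(covered_bytes(0, 4, 16, scale_to_bytes=False)), "of", total)
```

Without the scaling, the second stream starts too early, overlaps the first, and the tail of the buffer is never written.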

evanh · 2021-02-26 10:35

Grr, found another small bug with the remainder section. It can need more than a chunk size. Too tired to fix now ...
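
One possible reading of that bug, sketched in Python: the interleaved routines divide by `chunk * 2`, so the remainder can be as large as `2 * chunk - 1` longs, while the LUT-buffer version only has `chunk` longs prepared. No fix was posted in the thread, so the splitting helper here is just one hypothetical way to handle it, not the author's solution.

```python
def worst_case_remainder(chunk):
    """Largest remainder when the long count is divided by chunk * 2."""
    return 2 * chunk - 1


def split_remainder(remainder, chunk):
    """Break a remainder into bursts no longer than one chunk."""
    bursts = []
    while remainder > 0:
        n = min(remainder, chunk)
        bursts.append(n)
        remainder -= n
    return bursts


if __name__ == "__main__":
    chunk = 16
    worst = worst_case_remainder(chunk)
    print(worst, split_remainder(worst, chunk))
```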
