1 cycle/long memory zero — Parallax Forums


        setq len
        wrlong #0, ptr

@cgracey This seems to be unintended behavior in the P2, but it's quite useful, so I was hoping you could declare it honorary intended behavior for me :)
What's going on:
When the P2 sees an immediate there, it treats it as a 9-bit value and writes that value to every 32-bit long in the block, instead of copying from cog RAM. AUGD does not work here, but if it did, this would be a full-blown memset. Maybe, if this is declared Intended(tm), AUGD could be fixed to work in future revs, enabling a high-speed memset.
I haven't verified this works for lengths >511, so hopefully Chip can answer that.
Thanks to @Wuerfel_21 for asking me to test this behavior in the first place, allowing us to find this neat little trick. Not the memset we were hoping for, but bzero is good too.
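
A minimal model of the behaviour described above, written in Python rather than PASM2 so it can be run anywhere. It is only a sketch: the function name `setq_wrlong_imm` and the `hub` byte array are made up for illustration, and the "SETQ value is count minus one" detail comes from the first reply in the comments.

```python
def setq_wrlong_imm(hub: bytearray, addr: int, q: int, imm9: int) -> None:
    """Model SETQ q + WRLONG #imm9, addr: write q + 1 longs of imm9."""
    value = imm9 & 0x1FF              # a plain immediate is only 9 bits wide
    for i in range(q + 1):            # SETQ holds (number of longs - 1)
        start = addr + 4 * i
        hub[start:start + 4] = value.to_bytes(4, "little")


hub = bytearray(b"\xAA" * 64)               # hypothetical hub image: 16 longs of junk
setq_wrlong_imm(hub, addr=8, q=3, imm9=0)   # zero 4 longs starting at byte 8
print(hub.hex(" ", 4))
```

Only the four longs at byte offsets 8 to 23 change; everything else in the buffer is left alone.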


Comments

  • Remember that the value you set in setq (the contents of len) should actually be 1 less than the total length.

    Also, why would you need AUGD to make it a memset? If you want to fill in "len+1" bytes you could use "wrbyte #0, ptr" instead of "wrlong".

  • @ersmith because wrbyte does not have this behavior. I tried. wrbyte simply does not implement the block copy functionality.

  • Ah, interesting, I hadn't thought that the setq was restricted to wrlong / rdlong, but now that you mention it it does kind of make sense. Thanks for testing this!

  • @moony said:
    AUGD does not work here

    Now that I've tried it myself, it works for me, so idk

  • As it turns out, AUGD does work, and I simply made a mistake. This means it is a general-purpose memset, provided you use self-modifying code (SMC) to patch in the fill value. Perfect!

  • evanh Posts: 15,126

    Ha, cool. It took me a while to notice what's being advertised here: with the #0 immediate as source data, SETQ+WRLONG doesn't block copy cogRAM data the way it is documented to, but just keeps reusing the immediate, hence it works as a hubRAM fill. :)

    This needs to be added to the documentation.

  • Being designed for COGRAM transfers, I strongly suspect it would have been limited to 512 transfers at a time, with a 9-bit counter. You could add REPs around it, though, and deal with the start/end cases outside the loop to work around that, and it would still be very efficient. Good find!

  • cgracey Posts: 14,133
    edited 2021-02-21 21:59

    I am pretty sure it uses an 18-bit counter, in order to be able to fill the whole hub memory. The Cog RAM or LUT will just wrap. There is a hardware bug that stops PTRx from automatically incrementing or decrementing by the full amount when SETQ/SETQ2 are used, if there is an intervening instruction such as ALTx or AUGx.

  • evanh Posts: 15,126

    @cgracey said:
    ... if there is an intervening instruction such as ALTx or AUGx.

    Dang! ALTD is needed for specifying an arbitrary fill value. Still, a fast erase is cool.

  • cgracey Posts: 14,133

    @evanh said:

    @cgracey said:
    ... if there is an intervening instruction such as ALTx or AUGx.

    Dang! ALTD is needed for specifying an arbitrary fill value. Still, a fast erase is cool.

    It will fill, it just won't automatically inc/dec PTRx correctly. You probably don't care, in the case of a fill.

  • The arbitrary fill works perfectly fine, just have to self-modify the value into the right place.

    Me and @moony have been hacking at VJET, the polygon rendering thing from the P1. I have it running right now at 1024x768, displaying a 181-sided shape with just two renderer cogs, but I broke something when converting it to use FIFO ops to read display list data. The fast fill trick is used for both filling the background and filling the polygon spans. The latter looks like this:

    draw_span
            ' range check edges
            mov span_right,poly_right
            sar span_right,#16
            fles span_right,render_width
            mov ptrb,poly_left
            sar ptrb,#16
            fges ptrb,#0
    
            mov span_length,span_right
            subs span_length,ptrb wcz
    if_be   ret ' return if span length/width <= 0
    
            ' draw span
            add ptrb,display_base
    
            mov span_color,poly_color
            movbyts span_color,#%%0000
    
            shr span_length,#1 wc
    if_c    wrbyte span_color,ptrb++
            shr span_length,#1 wcz
    if_c    wrword span_color,ptrb++
    if_z    ret
    
            setd .thewr,span_color
            ror span_color,#9
            setq augd_mask
            muxq .theaug,span_color
            sub span_length,#1
    
            setq span_length
    .theaug augd #0-0
    .thewr  wrlong #0-0,ptrb
            ret
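
The last few instructions of draw_span split the 32-bit fill value between the AUGD and the WRLONG. A Python sketch of that bit arithmetic follows. It is a model, not P2 code: the helper names are hypothetical, and `AUGD_MASK = $007FFFFF` is an assumption, since the post never shows the value of `augd_mask`.

```python
MASK32 = 0xFFFFFFFF
AUGD_MASK = 0x007FFFFF      # assumed value of augd_mask: AUGD's 23-bit payload


def ror32(x, n):
    """32-bit rotate right, like ROR."""
    x &= MASK32
    return ((x >> n) | (x << (32 - n))) & MASK32


def movbyts_0000(x):
    """MOVBYTS x,#%%0000: copy byte 0 into all four byte lanes."""
    return (x & 0xFF) * 0x01010101


def split_fill_value(value):
    """Return (augd_payload, d_field) as patched in by MUXQ and SETD."""
    d_field = value & 0x1FF                     # SETD takes the low 9 bits
    augd_payload = ror32(value, 9) & AUGD_MASK  # bits 31..9 moved down to 22..0
    return augd_payload, d_field


def effective_immediate(augd_payload, d_field):
    """The 32-bit data an AUGD + WRLONG # pair then presents."""
    return ((augd_payload << 9) | d_field) & MASK32


color = movbyts_0000(0xA5)                      # 8-bit colour in all four lanes
payload, d_field = split_fill_value(color)
print(hex(payload), hex(d_field), hex(effective_immediate(payload, d_field)))
```

Recombining the two patched fields gives back the original value, which is why the self-modified pair can fill with any 32-bit pattern.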
    
  • evanh Posts: 15,126
    edited 2021-02-21 22:32

    Oh, cool. No issue really. Just a minor detail.

    ALTD is better than AUGD. It provides a plain specified fill value. eg:

    fill_hub
    '  PA    = loop count
    '  PTRA  = start address
    '  PB    = fill value
    '-------------
            setq    pa
            altd    pb, #0
            wrlong  #0-0, ptra
            ret
    

    PS: ALTI will work too.

  • Very nice to have a fast fill, especially for graphics use with simple patterns (eg, fine checkerboard or other solid backgrounds). Maybe you can have a ret on the wrlong to save further space and two clocks?

  • That doesn't work, ALTD masks to 9 bits

  • evanh Posts: 15,126

    @Wuerfel_21 said:
    That doesn't work, ALTD masks to 9 bits

    Oh, doh, you're right, immediate is 9 bits unless AUG'd. Shows I haven't tested it.

  • Rayman Posts: 13,797

    Does the Spin2 bytefill(), longfill(), etc. use this?

  • cgracey Posts: 14,133

    @Rayman said:
    Does the Spin2 bytefill(), longfill(), etc. use this?

    It did, at one point. I can't remember if it still does that, or if it fills a buffer with data and then just keeps writing the buffer.

  • evanh Posts: 15,126

    Cool, RET can be used on that WRLONG too. eg:

            setq span_length
    .theaug augd #0-0
    .thewr  _ret_  wrlong #0-0,ptrb
    
  • evanh Posts: 15,126
    edited 2021-02-22 00:10

    Lol, the underscores turned it into italics.
    But not in the code block. That's nice.

  • evanh Posts: 15,126
    edited 2021-02-22 00:51

    Here's the speed results, in clock cycles, including calling and handling overheads, of five routines all filling 64 kB of hubRAM:
    - Straight line incrementing WRLONGs, first to last (done in two instructions): 147464 (every 9 clocks)
    - Straight line decrementing WRLONGs, last to first: 114705 (every 7 clocks)
    - Interleaved WRLONG addressing, 16 byte granularity, split into four REPs: 65584 (every 4 clocks)
    - Straight line REP'd WFLONG using the FIFO: 32785 (every 2 clocks)
    - And the above SETQ+WRLONG block copy with immediate data: 16410 (every clock)

    Note: The first two are both interrupt friendly. A REP can't speed them up so they're using DJNZ instead.
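
The "every N clocks" figures can be checked directly from the totals. The Python sketch below assumes 64 kB means 65536 bytes, i.e. 16384 longs; the short labels are just names for the five routines listed above.

```python
LONGS = 64 * 1024 // 4   # 64 kB of hubRAM is 16384 longs

results = {              # total clocks reported in the post above
    "incrementing WRLONG": 147464,
    "decrementing WRLONG": 114705,
    "interleaved WRLONG": 65584,
    "REP'd WFLONG (FIFO)": 32785,
    "SETQ+WRLONG immediate": 16410,
}


def per_long_and_overhead(total_clocks, longs=LONGS):
    """Split a total into whole clocks per long plus leftover overhead."""
    return divmod(total_clocks, longs)


for name, clocks in results.items():
    per_long, overhead = per_long_and_overhead(clocks)
    print(f"{name:24s} {per_long} clocks/long + {overhead} clocks overhead")
```

Each total comes out as a whole number of clocks per long plus a few dozen clocks at most, which fits the stated call and handling overheads.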

  • evanh Posts: 15,126
    edited 2021-02-23 10:18

    I've managed a low of 20978 clock cycles for 64 kB using SETQ+WRLONG with DJNZ and broken up chunks of 31 longwords at a time. So this is reasonably interrupt friendly. Next best is chunks of 24 longwords achieving 21938 clock cycles.

    PS: I'm guessing this might be similar to what Chip has done in newer spin2 engine.
    PPS: For extra small chunk buffer size of 12 you score 30122, which is still better than the REP'd WFLONG.

  • evanh Posts: 15,126
    edited 2021-02-23 09:41

    Here's the source code that achieves that. I've used lutRAM for the chunk copy buffer but cogRAM can be used just as easily.

    PS: I've not verified it fills correctly so it could be horribly broken.

    '----------------------------------------------------------
    fill_hub5
    '  PA    = loop count
    '  PTRA  = start address
    '  PB    = fill value
    '-------------
            qdiv    pa, #chunk*2        'loop count / chunk size
    
            mov ptrb, #0
            rep #1, #chunk      'fill lutRAM chunk buffer with the fill value
            wrlut   pb, ptrb++
    
            getqy   pb          'remainder of count/chunk
            sub pb, #1      wc
        if_nc   setq2   pb          'chunk copy from lutRAM to hubRAM
        if_nc   wrlong  0, ptra++
    
            getqx   pa      wz  'quotient of count/chunk
            mov ptrb, ptra
            mov pb, pa
            mul pb, #chunk
            add ptrb, pb
    .floop
        if_nz   setq2   #chunk-1        'chunk copy from lutRAM to hubRAM
        if_nz   wrlong  0, ptra++
        if_nz   setq2   #chunk-1        'chunk copy from lutRAM to hubRAM
        if_nz   wrlong  0, ptrb++
        if_nz   djnz    pa, #.floop
            ret         wcz
    '----------------------------------------------------------
    
  • cgracey Posts: 14,133

    @evanh said:
    I've managed a low of 20978 clock cycles for 64 kB using SETQ+WRLONG with DJNZ and broken up chunks of 31 longwords at a time. So this is reasonably interrupt friendly. Next best is chunks of 24 longwords achieving 21938 clock cycles.

    PS: I'm guessing this might be similar to what Chip has done in newer spin2 engine.
    PPS: For extra small chunk buffer size of 12 you score 30122, which is still better than the REP'd WFLONG.

    Yep. I do SETQ+WRLONG for 16 longs at a whack.

  • Wuerfel_21 Posts: 4,369
    edited 2021-02-23 15:58

    @evanh try the self-modifying solution with AUGD instead of filling LUT. It'd be just as fast, I think (Unless the extra AUGD has a big impact?), but have less overhead per fill and of course doesn't need a buffer.

  • evanh Posts: 15,126

    @Wuerfel_21 said:
    @evanh try the self-modifying solution with AUGD instead of filling LUT. It'd be just as fast, I think (Unless the extra AUGD has a big impact?), but have less overhead per fill and of course doesn't need a buffer.

    Yes, I should do that too. I'd sort of written it off because of the extra setup overhead, but there is so much of that now it'll hardly be notable. Off to work now though ...

  • Ye, filling any decent size of LUT/cog is much slower than poking that immediate in there.

  • evanh Posts: 15,126
    edited 2021-02-24 12:47

    Okay, so my code above is two-way interleaved chunking. Single-way, like how Chip has it, isn't as fast. Its speed seems to follow something like an inverted log of chunk size (it improves with larger sizes, but with diminishing returns). For 64 kB with a chunk size of 16 longwords, as Chip stated, its score is 32852.

    Making Wuerfel's immediate mode into a single-way chunking algorithm tracks Chip's speed almost exactly. Very little diff and very little noise between them, e.g. 64 kB at size 16 is 32854. The extra instructions within the inner loop don't add any loop time.

    Chip's method (guessed):

    fill_hub7
    '  PA    = loop count
    '  PTRA  = start address
    '  PB    = fill value
    '-------------
            qdiv    pa, #chunk      'loop count / chunk size
    
            mov ptrb, #0
            rep #1, #chunk      'fill lutRAM chunk buffer with the fill value
            wrlut   pb, ptrb++
    
            getqy   pb          'remainder of count/chunk
            sub pb, #1      wc
        if_nc   setq2   pb          'chunk copy from lutRAM to hubRAM
        if_nc   wrlong  0, ptra++
    
            getqx   pa      wz  'quotient of count/chunk
    .floop
        if_nz   setq2   #(chunk - 1)        'chunk copy from lutRAM to hubRAM
        if_nz   wrlong  0, ptra++
        if_nz   djnz    pa, #.floop
            ret         wcz
    

    Wuerfel's method:

    fill_hub8
    '  PA    = loop count
    '  PTRA  = start address
    '  PB    = fill value
    '-------------
            qdiv    pa, #chunk      'loop count / chunk size
    
            setd    .thewr, pb
            ror pb, #9
            setq    aug_mask
            muxq    .theaug, pb
    
            getqy   pb          'remainder of count/chunk
            sub pb, #1      wc
        if_nc   setq2   pb          'remainder copy from lutRAM (note: this routine never fills lutRAM)
        if_nc   wrlong  0, ptra++
    
            getqx   pa      wz  'quotient of count/chunk
    .floop
        if_nz   setq    #(chunk - 1)
    .theaug if_nz   augd    #0-0
    .thewr  if_nz   wrlong  #0-0, ptra
            add ptra, #(chunk * 4)
        if_nz   djnz    pa, #.floop
            ret         wcz
    
  • evanh Posts: 15,126
    edited 2021-02-24 12:41

    Two-way interleaving is negatively impacted by the bigger inner loop compared to Chip's method. It still helps though, size 16 scores 28766. The best I found was at chunk size 28 where it scored a 22322. But then single-way at size 28 also scores 25830. They all get better with larger chunks.

    Source code:

    fill_hub9
    '  PA    = loop count
    '  PTRA  = start address
    '  PB    = fill value
    '-------------
            qdiv    pa, #(chunk * 2)    'loop count / chunk size
    
            setd    .wr1, pb
            setd    .wr2, pb
            ror pb, #9
            setq    aug_mask
            muxq    .aug1, pb
            muxq    .aug2, pb
    
            getqy   pb          'remainder of count/chunk
            sub pb, #1      wc
        if_nc   setq2   pb          'remainder copy from lutRAM (note: this routine never fills lutRAM)
        if_nc   wrlong  0, ptra++
    
            getqx   pa      wz  'quotient of count/chunk
            mov ptrb, ptra
            mov pb, pa
            mul pb, #chunk
            add ptrb, pb
    .floop
        if_nz   setq    #(chunk - 1)
    .aug1   if_nz   augd    #0-0
    .wr1    if_nz   wrlong  #0-0, ptra
            add ptra, #(chunk * 4)
    
        if_nz   setq    #(chunk - 1)
    .aug2   if_nz   augd    #0-0
    .wr2    if_nz   wrlong  #0-0, ptrb
            add ptrb, #(chunk * 4)
    
        if_nz   djnz    pa, #.floop
            ret         wcz
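
One rough way to read the single-way scores quoted in the posts above is as a fixed cost per chunk on top of the data transfer. The Python sketch below assumes 1 clock per long inside a burst (suggested by the 16410-clock result for 64 kB) and ignores the remainder pass and call overhead, so treat the output as an estimate, not a measurement. The labels in the table are descriptive only.

```python
LONGS = 64 * 1024 // 4          # 16384 longs in the 64 kB test

single_way_scores = {           # (label, chunk size in longs): total clocks
    ("LUT buffer", 16): 32852,
    ("immediate", 16): 32854,
    ("unspecified", 28): 25830,
}


def overhead_per_chunk(total_clocks, chunk, longs=LONGS):
    """Clocks per chunk beyond an assumed 1 clock per long written."""
    chunks = longs / chunk      # fractional: the remainder pass is ignored
    return total_clocks / chunks - chunk


for (label, chunk), clocks in single_way_scores.items():
    print(f"{label:12s} chunk={chunk:2d} "
          f"overhead ~ {overhead_per_chunk(clocks, chunk):.2f} clocks/chunk")
```

Under those assumptions all three scores work out to roughly 16 clocks of overhead per chunk, i.e. about 1 + 16/chunk clocks per long, which would account for the diminishing returns as chunks grow.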
    
  • evanh Posts: 15,126
    edited 2021-02-26 08:58

    Hmm, the three interleaved routines are failing the verify test. :( I guess I shouldn't be surprised ...

    EDIT: Double hmm, I think the PTRA++/PTRB++ steps seem to be going four times too far at the end of each chunk. Doh! No, not it, just me confusing hub addresses with word count.

    EDIT2: Ah, and that's what's wrong with the interleaving routines too. I'd missed one case of scaling the count by 4 and then duplicated the bug.

    So the source above changes from:

            ...
            mov pb, pa
            mul pb, #chunk
            add ptrb, pb
            ...
    

    to:

            ...
            mov pb, pa
            mul pb, #(chunk * 4)
            add ptrb, pb
            ...
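
To see why the scaling matters, here is a small Python model of the two-way interleaved address arithmetic. It is a sketch under assumptions: `write_block` stands in for one SETQ+WRLONG burst, the hub is a plain byte array starting at address 0, and the remainder is written as a single block. All names are hypothetical.

```python
def write_block(hub, addr, n_longs, value):
    """Model one SETQ+WRLONG burst: n_longs copies of value from byte addr."""
    pattern = value.to_bytes(4, "little")
    for i in range(n_longs):
        hub[addr + 4 * i: addr + 4 * i + 4] = pattern
    return addr + 4 * n_longs          # byte address just past the burst


def fill_two_way(count, chunk, value, bytes_per_long=4):
    """Two-way interleaved fill of count longs; returns the hub image."""
    hub = bytearray(4 * count)
    pairs, remainder = divmod(count, 2 * chunk)   # QDIV by chunk*2
    ptra = write_block(hub, 0, remainder, value)  # leftover longs first
    ptrb = ptra + pairs * chunk * bytes_per_long  # start of the second half
    for _ in range(pairs):
        ptra = write_block(hub, ptra, chunk, value)
        ptrb = write_block(hub, ptrb, chunk, value)
    return hub


def fully_filled(hub, value):
    """True if every long in the hub image equals value."""
    return bytes(hub) == value.to_bytes(4, "little") * (len(hub) // 4)


good = fill_two_way(70, 4, 0x55AA55AA)                    # offset scaled by 4
bad = fill_two_way(70, 4, 0x55AA55AA, bytes_per_long=1)   # unscaled offset
print(fully_filled(good, 0x55AA55AA), fully_filled(bad, 0x55AA55AA))
```

With the offset left in longs instead of bytes, the second pointer starts too early, the two halves overlap, and the tail of the buffer is never written.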
    
  • evanh Posts: 15,126

    Grr, found another small bug with the remainder section. It can need more than a chunk size. Too tired to fix now ...
