
All PASM2 gurus - help optimizing a text driver over DVI?


Comments

  • jmgjmg Posts: 15,140
    edited 2019-10-16 02:21
    TonyB_ wrote: »
    Here is how I see it.
    The FIFO was reduced from 16 longs to 8 longs when 16 cogs didn't fit, otherwise I agree with Evan. Block size for RDFAST/WRFAST was kept at 64 bytes so that 16 cogs will work most easily in the future. Fast block moves with setq don't use the one-and-only FIFO because they don't need to - these transfers happen in one clock cycle, once correct slice is selected and provided FIFO is not accessing hub RAM for its burst reads/writes.
    Yes; where the documentation is vague is on what happens when that slot is taken.
    If data travels in order, the fast block move must defer to +8 clocks, and from there run until the next defer, or until it is done.
    Q: Can a fast block move transfer data out of order? (The documentation does not say so.)

  • jmgjmg Posts: 15,140
    rogloh wrote: »
    I think I'm thankfully getting better than 17% setq throughput in 32bpp modes. It's at least over 50%. :smile:

    EDIT: Seems to be ~66% when I factor out overheads.
    That's good. Is that with the streamer running 1 in 10 at the same time?

  • evanhevanh Posts: 15,091
    edited 2019-10-16 04:50
    The instruction's data bursts can be interrupted by the FIFO's bursts, possibly multiple times within a single opcode if the instruction is copying a large number of longwords.

    HubRAM burst data order for both the FIFO and instructions is always sequential - incrementing consecutive longword addresses.

    PS: Minimum RDLONG execution time is the burst length + 8 clocks, so longer burst lengths have a throughput advantage within the cog. This won't limit the FIFO's access to hubRAM.
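
    For illustration, a minimal sketch of the fast block move being timed here (buf and hubptr are hypothetical register names, and 40 is an arbitrary burst length):
    		setq	#40-1			'queue a block transfer of 40 longwords
    		rdlong	buf, hubptr		'burst-read them from hubRAM into cogram at buf
    						'minimum time ~40+8 clocks; longer if FIFO
    						'bursts intercede mid-transfer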
  • That's good. Is that with the streamer running 1 in 10 at the same time?
    Yes, it's for the 32bpp mode, which averages 1 long read from the hub every 10 clocks via the streamer.
  • roglohrogloh Posts: 5,119
    edited 2019-10-16 03:55
    If it takes around 1900 clocks to do the 1280 transfers in the non-pixel-doubled 32bpp mode, then a pixel-doubled variant, which only reads in 320 longs and writes out 640 longs, should take just 75% of this, or 1425 clocks. The optimized unrolled copy routine I came up with then uses up 1280 more cycles for pixel doubling, so 1425 + 1280 = 2705, leaving ~500 clocks for any loop overheads, the time to dynamically build the 64 copy instructions in COG RAM, and per-scanline housekeeping. I think we can still do it but it will be tight!
    The only thing I can see that might be different is that in this mode I might burst-transfer 32 longs at a time instead of 40 (for space reasons), which may interfere with the streamer in a different way. If that doesn't work I could try moving down to 20 at a time. I can't do over 32 at a time as I don't have room for the unrolled instructions. Building the unrolled instruction sequence actually takes a while, though.

    To generate this unrolled sequence:
    wrlut buffer, ptra++
    wrlut buffer, ptra++
    wrlut buffer+1, ptra++
    wrlut buffer+1, ptra++
    wrlut buffer+2, ptra++
    wrlut buffer+2, ptra++
    ...
    
    I think I will need something like this loop... which itself takes 320 clocks! Does anyone know a faster way to do it without reading the code in dynamically (which I could do, but that's sort of cheating)?
    rep #5, #32
    mov 0-0, instr        'write one copy of the template instruction
    add $-1, _d1          'step this mov's destination by 2 for the next pass
    mov 1-0, instr        'write the second (doubling) copy alongside it
    add $-1, _d1          'step this mov's destination by 2 for the next pass
    add instr, _d0        'advance the template's source from buffer to buffer+1, etc.
    ...
    instr wrlut buffer, ptra++    'template instruction being replicated
    _d0 long $200                 '1 << 9 = +1 in the D field
    _d1 long $400                 '2 << 9 = +2 in the D field (the movs write pairs)
    

    UPDATE: I've added things up for what I'd consider a heavily optimized case for this processing and got these numbers...

    1) hub transfers: ~1425 clocks (anticipated based on measured values but could change)
    2) pixel doubling unrolled loop setup: 320 clocks
    3) pixel doubling: 1280 clocks
    4) outer call overhead + scanline housekeeping: 210 clocks
    Total = 3235 clocks.

    The total is just over my desired 3200-clock budget but may still be okay... To put this in perspective, a TERC4 packet encode of 2 audio samples plus housekeeping appears to take ~3142 clocks when I counted it up, and I am allowing 3200 for that...
  • evanhevanh Posts: 15,091
    edited 2019-10-16 05:05
    I'm measuring the FIFOs to be 19 longwords deep. This has a bearing on the high and low water marks, which in turn determine which slot/slice is the refilling position. So it's not as simple as just the start address, but it is still deterministic.
  • jmgjmg Posts: 15,140
    edited 2019-10-16 05:24
    evanh wrote: »
    I'm measuring the FIFOs to be 19 longwords deep. This has a bearing on the high and low water marks, which in turn determine which slot/slice is the refilling position. So it's not as simple as just the start address, but it is still deterministic.
    Good point about high and low water marks; do you know what those are on the streamer?
    I was basing my spacing/interactions on the streamer filling, and then taking a long when available.

    If it instead has hysteresis, and takes more longs at a time, then it needs to do so much less often, and that makes a big difference to blocking effects.
    e.g. a hysteresis of 8 means it reads 8 longs every 80 cycles (at the 32bpp rate of one long per 10 clocks), leaving ~70 free adjacent slots for any fast block move, so the impact on the fast block move is much reduced.

  • evanhevanh Posts: 15,091
    When I talk about the streamer doing DMA ops, the FIFO is always in between. From the prop2 doc:
    There are three ways the hub FIFO interface can be used, and it can only be used for one of these, at a time:
    Hub execution (when the PC is $00400..$FFFFF)
    Streamer usage (background transfers from hub RAM → pins/DACs, or from pins → hub RAM)
    Software usage (fast sequential-reading or sequential-writing instructions)

    The FIFO's 8-clock (one rotation) bursts are how the streamer gets its data. Presumably, the FIFO's high water is 19 and the low water is 11.
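
    As an aside, the third ("software usage") mode in that list looks like this in code - a minimal sketch, with hubaddr and COUNT as hypothetical names:
    		rdfast	#0, hubaddr		'attach the FIFO for fast sequential reading
    		rep	@.done, #COUNT
    		rflong	pa			'pull the next sequential long out of the FIFO
    .done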
  • evanhevanh Posts: 15,091
    jmg wrote: »
    e.g. a hysteresis of 8 means it reads 8 longs every 80 cycles, leaving ~70 free adjacent slots for any fast block move.
    Correct.
  • ozpropdev
    Roger
    Maybe using register indirection can help.
    	setd	ptrs,#buffer		'D half of ptrs = cogram source address
    	sets	ptrs,#0			'S half of ptrs = lutram destination address
    	rep	#4,#32
    	alti	ptrs,#%100_111		'substitute D; substitute and increment S
    	wrlut	0-0,#0-0
    	alti	ptrs,#%111_111		'substitute and increment both D and S
    	wrlut	0-0,#0-0
    
    
  • evanhevanh Posts: 15,091
    That's smooth, Brian.

    Kind of amusing how there are 36 unused machine-code bits.
  • ozpropdev wrote: »
    Roger
    Maybe using register indirection can help.
    	setd	ptrs,#buffer
    	sets	ptrs,#0
    	rep	#4,#32
    	alti	ptrs,#%100_111
    	wrlut	0-0,#0-0
    	alti	ptrs,#%111_111
    	wrlut	0-0,#0-0
    
    

    I want to believe, but I don't quite get it yet. How exactly does it move the 64 wrlut instructions into the 64-long buffer space in COG RAM for later execution? ALTI sometimes gets confusing.
  • evanhevanh Posts: 15,091
    edited 2019-10-16 08:08
    The ALTI instructions entirely take over the addressing of both cogram and lutram. Register "ptrs" holds both working addresses.

    The lutram address (the S field of the WRLUTs) is incremented in ptrs by both ALTI instructions, whereas the cogram address (the D field of the WRLUTs) is incremented in ptrs only by the second ALTI.
  • roglohrogloh Posts: 5,119
    edited 2019-10-16 10:16
    I have some mouse sprite code coming together... it's bloating a little more than I'd like, but I might still squeeze it in.

    It's still incomplete, as I still need to compute the X offset and clip the sides, accounting for the X hotspot etc. Hopefully not too many more instructions... then time to test.
    do_mouse
                getword a, mouse_xy, #1         'get mouse y co-ordinate
                getnib  b, mouseptr, #7         'get y hotspot (0-15)
                sub     a, b                    'compensate for it
                subr    a, scanline wc          'check if sprite is on scanline
        if_nc   cmpr    a, #15 wc               'check if sprite is on scanline
        if_c    ret     wcz                     'mouse sprite not in range, exit
                
                alts    bpp, #bittable          'bpp starts out as index from 0-5
                mov     bitmask, 0-0            'get table entry
                mul     a, bitmask              'multiply mouse row by its length
                shr     bitmask, #16 wz         'get mask portion
        if_z    not     bitmask                 'fix up the 32 bpp case
                mov     bpp, bitmask
                ones    bpp                     'convert to real bpp
                add     a, mouseptr             'add offset to base mouse address
    
                setq2   #17-1                   'read 17 longs max, mask is first
                rdlong  $120, a                 '..followed by mouse image data
                setq2   #16-1                   'read 16 scanline longs into LUT
                rdlong  $110, linebufaddr       ' TODO: determine X offset/clipping etc    
                    
                mov     ptra, #$110             'ptra used for source image data
                mov     ptrb, #$120             'ptrb used for mouse image data
                rdlut   c, ptrb++               'the mouse mask is stored first
    
                mul     offset, bpp             'calculate mouse offset (0-30)
                mov     mask, bitmask           'setup mask for pixel's data size
                rol     mask, offset            'align first pixel
                rdlut   a, ptra                 'get original scanline pixel data
                testb   bitmask, #0 wc          'set c initially to read sprite in
    
                rep     @endmouse, #16          'repeat loop for 16 pixels
        if_c    rdlut   b, ptrb++               'get next mouse sprite pixel(s)
        if_c    rol     b, offset               'align with the source input data
                shr     c, #1 wc                'get mask bit 1=set, 0=transparent
        if_c    setq    mask                    'apply bit mask to muxq mask
        if_nc   setq    #0                      'setup a transparent mouse pixel
                muxq    a, b                    'select original or mouse pixel
                rol     mask, bpp wc            'advance mask by 1,2,4,8,16,32
        if_c    wrlut   a, ptra++               'write back data once we filled it 
        if_c    rdlut   a, ptra                 '...and read next source pixel(s)
                rol     bitmask, bpp wc         'rotate mouse data refill tracker 
    endmouse
                setq2   #16-1
                wrlong  $110, linebufaddr       'write LUT image data back to hub
                ret
    
    bittable
                long    $0001_00_08             '1bpp
                long    $0003_00_08             '2bpp
                long    $000F_00_0C             '4bpp
                long    $00FF_00_14             '8bpp
                long    $FFFF_00_24             '16bpp
                long    $0000_00_44             '32bpp 
    
    bitmask     long    0  'TODO eventually share other register names to eliminate
    bpp         long    0
    offset      long    0
    mask        long    0
    
  • roglohrogloh Posts: 5,119
    edited 2019-10-16 08:16
    evanh wrote: »
    The ALTI instructions entirely take over the addressing of both cogram and lutram. Register "ptrs" holds both working addresses.

    The lutram address (the S field of the WRLUTs) is incremented in ptrs by both ALTI instructions, whereas the cogram address (the D field of the WRLUTs) is incremented in ptrs only by the second ALTI.

    Yeah, but it doesn't construct new unrolled code anywhere, does it? If this is for moving data to LUT, it takes 4 clocks per long vs the unrolled loop's 2 clocks per long.
  • evanhevanh Posts: 15,091
    edited 2019-10-16 08:21
    Correct. There is still a trade-off: space and construction time for the unrolled 2 clocks per longword, vs the compact 4 clocks per longword. Better than the earlier loop speeds, though.
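
    To put rough numbers on that trade-off for the 640-longword doubled scanline discussed above: the unrolled code costs 640 x 2 = 1280 clocks plus ~320 clocks to construct it, about 1600 in total, while the compact indirection loop costs 640 x 4 = 2560 clocks, so unrolling saves roughly 960 clocks per scanline once built.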
  • roglohrogloh Posts: 5,119
    edited 2019-10-16 08:25
    I'd like to create the unrolled loop code using indirection if it could get the loop to 4 instructions instead of 5. I wonder if that is possible. It would buy another 64 clocks.
  • evanhevanh Posts: 15,091
    edited 2019-10-16 08:36
    Brian's loop is four instructions.

    BTW: I don't know if it's useful but he's freed up the PTRA register too.

  • evanhevanh Posts: 15,091
    edited 2019-10-16 09:51
    Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
    'copy buff to lutram with longword doubling
    		setd	.dptr, #buff		'cogram address
    		mov	ptra, #0		'lutram address
    
    		rep	@.rend, #LEN
    		alti	.dptr, #%111_000
    		wrlut	0-0, ptra++
    .dptr		wrlut	0, ptra++
    .rend
    
  • roglohrogloh Posts: 5,119
    evanh wrote: »
    Brian's loop is four instructions.

    BTW: I don't know if it's useful but he's freed up the PTRA register too.

    Brian's code copies longs from a COGRAM buffer to LUT at 4 clocks per long using indirection, with 4 instructions in the loop. Yes, it is faster than my current duplication code, which takes less than 6400 clocks, but still not fast enough for a 3200-cycle budget. To hit that lower timing budget I found I will need an unrolled loop of 64 COGRAM instructions, which then copies longs at a rate of one long per 2 clocks into LUT RAM (i.e. 2x faster than Brian's code).

    The problem is that the actual code to create this unrolled loop dynamically is itself taking 5 instructions to create each pair of WRLUT instructions, or 5 x 2 x 32 = 320 clocks in total. If I can shrink that creation loop down to 4 instructions somehow with indirection perhaps, it will then drop to 256 clocks and buy me a bit more of a safety buffer.
  • roglohrogloh Posts: 5,119
    edited 2019-10-16 09:56
    evanh wrote: »
    Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.

    That's pretty clever. Can we do something like that for the unrolled loop creation code?
  • evanhevanh Posts: 15,091
    rogloh wrote: »
    The problem is that the actual code to create this unrolled loop dynamically is itself taking 5 instructions to create each pair of WRLUT instructions, or 5 x 2 x 32 = 320 clocks in total.
    Oh! Must be time to not do the construction method then. See my latest rolled up version above. :)
  • rogloh wrote: »
    evanh wrote: »
    Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.

    That's pretty clever. Can we do something like that for the unrolled loop creation code?

    I don't see how the dynamically generated unrolled loop can beat evanh's code, and you can do 40-long bursts like the other routines.
  • roglohrogloh Posts: 5,119
    edited 2019-10-16 10:04
    Let's see.

    Read 320 pixels into COG at 66% efficiency.
    Generate 640 pixels in LUT, 3 instructions for each of the 320 source pixels = 960 instructions x 2 clocks = 1920
    Write 640 pixels from LUT to hub at 66% efficiency.
    320 * 3/2 + 1920 + 640 * 3/2 = 3360 cycles, not accounting for loop overheads and other things.

    The unrolled version is still a bit faster.
    320 * 3/2 + 1280 + 640 * 3/2 + 320 (for unrolled code loop creation) = 3040. Still 320 cycles less.

    Edit: updated earlier compensation factors from 4/3 to 3/2.
  • evanh wrote: »
    Three instructions per loop - Tested and working. It's very specific to the prop2 instruction pipeline because of the self-modifying code! This also means it's bound to executing from cogram only.
    'copy buff to lutram with longword doubling
    		setd	.dptr, #buff		'cogram address
    		mov	ptra, #0		'lutram address
    
    		rep	@.rend, #LEN
    		alti	.dptr, #%111_000
    		wrlut	0-0, ptra++
    .dptr		wrlut	0, ptra++
    .rend
    

    That's a nifty use of the pipeline there Evan! Very Cool :smiley:

  • evanhevanh Posts: 15,091
    All trial and error. Lots of printing in the console.

    And that's powerful partly because there is a recent history of output I can look back on to compare with what I got on previous attempts.
  • evanhevanh Posts: 15,091
    edited 2019-10-16 11:13
    As requested, a smaller construction loop. I've cheated a little by using lutram as the target.
    'construct in lutram an unrolled version of: copy buff to lutram with longword doubling
    		setd	.instr, #buff			'reset cogram address
    		mov	ptra, #(LUT_end - LUT_code)	'lutram address
    
    		rep	@.rend, #LEN
    		wrlut	.instr, ptra++			'copy instruction to lutram
    		wrlut	.instr, ptra++			'  ditto, double up
    		alti	.instr, #%011_000		'increment buff pixel address in D field
    .rend
    		...
    
    .instr		wrlut	buff, ptra++
    
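
    A hypothetical follow-on, under a couple of assumptions not stated above (that the generated sequence is meant to be CALLed, and that lutram executes at cog addresses $200..$3FF): append a RET image after the loop so the unrolled copy can be invoked and can return.
    		wrlut	.retins, ptra++			'append a RET so the generated code can return
    		...
    		call	#\($200 + LUT_end - LUT_code)	'hypothetical entry point: $200 + lutram offset

    .retins	ret					'instruction image for the terminator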
  • roglohrogloh Posts: 5,119
    edited 2019-10-16 11:24
    Very cool evanh! Plus LUTRAM may work out OK for this special optimized 32bpp doubler, because the 256-entry palette is not required in this 32bpp mode and I can reclaim it. Also, it would let me use more of the COGRAM as the initial buffer area, so I could get 40- or 80-long transfer loops. If it works, I like! :lol: You might have just saved me about another 128 clocks.
  • AJLAJL Posts: 511
    edited 2019-10-16 13:42
    rogloh wrote: »
    Let's see.

    Read 320 pixels into COG at 66% efficiency.
    Generate 640 pixels in LUT, 3 instructions for each of the 320 source pixels = 960 instructions x 2 clocks = 1920
    Write 640 pixels from LUT to hub at 66% efficiency.
    320 * 3/2 + 1920 + 640 * 3/2 = 3360 cycles, not accounting for loop overheads and other things.

    The unrolled version is still a bit faster.
    320 * 3/2 + 1280 + 640 * 3/2 + 320 (for unrolled code loop creation) = 3040. Still 320 cycles less.

    Edit: updated earlier compensation factors from 4/3 to 3/2.

    For some reason I had it in my head that you were dynamically generating the code before each burst, rather than each line.

    As you aren't likely to be switching modes from one line to the next, could you set a flag somewhere when the code is generated, and skip the generation if it is already there?

    Then the generation happens once at the start of a frame, and it collapses to a simple flag check each line (sketched below).

    If you do switch modes from line to line, you regenerate as necessary, with the code that overwrites the routine clearing the flag.

    Addit: Or use self-modifying code to switch between the generation routine and the generated routine as needed.
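
    A minimal sketch of that flag idea, using hypothetical names (gen_done, build_unrolled) rather than anything from rogloh's actual driver:
    do_line	tjnz	gen_done, #.have_code	'flag already set: generated code is present
    		call	#build_unrolled		'otherwise build the unrolled copy loop now
    		mov	gen_done, #1		'and mark it as generated
    .have_code	...				'fall through to the per-line work

    gen_done	long	0			'cleared by whatever overwrites the routine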
  • cgraceycgracey Posts: 14,131
    Seeing there was some concern here about FIFO metrics, I just added this to the documentation:

    The FIFO contains (cogs+11) stages. When in read mode, the FIFO loads continuously until (cogs+6) stages are filled, after which 5 more longs stream in, potentially filling all (cogs+11) stages, in case no intervening read(s) had occurred. These metrics ensure that the FIFO never underflows under all potential scenarios.
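
    Working that through for the current 8-cog silicon: the FIFO has 8 + 11 = 19 stages, it loads until 8 + 6 = 14 stages are filled, and the 5 extra longs already in flight can top it up to all 19. That matches evanh's measured depth of 19 longwords earlier on this page.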