Fast Block Fills — Parallax Forums

cgracey wrote: »
JonnyMac wrote: »
I think Chip said he was dead out of space for the interpreter; I wonder if that drove the decision to limit the timing to 32 bits.

With more code, it could be achieved. There are 15 identical instructions in the BYTEFILL/WORDFILL/LONGFILL routine that would sure be nice to condense, somehow. Then, we'd have some pennies left over for a 64-bit WAITMS/WAITUS.

Above is copied from another thread.

This is probably no help for the P2, but here is an idea for an optional enhancement to Fast Block Moves that would create Fast Block Fills, in which a single long is read and then written multiple times. If Q[31] = 1, SETQ+RDLONG would read only the first hub long and SETQ+WRLONG/WMLONG would read only the first cog long (similarly for SETQ2).

I wish I'd thought of this at the time as it would have saved 12 longs in the Spin2 interpreter with no more cycles than now on average.

Comments

  • cgracey Posts: 14,133
    Yes, we don't have any firepower when it comes to efficiently replicating a value into cog registers.
  • evanh Posts: 15,192
    Here's a compact one for non-bursting and interruptible. 7 sysclocks per longword.
    '----------------------------------------------------------
    fill_hub
    '  PA    = loop count
    '  PTRA  = start address
    '  PB    = fill value
    '-------------
    		add	ptra, pa
    .floop
    		wrlong	pb, --ptra
    	_ret_	djnz	pa, #.floop
    '----------------------------------------------------------
    
  • Rayman Posts: 13,903
    Wouldn't REP and RFLONG be faster?
  • evanh Posts: 15,192
    In this case, hub timing slot occurs only every 7 clocks, so won't be faster. And a REP will block all interrupts as well.

  • evanh Posts: 15,192
    edited 2020-03-13 22:28
    Oops, the ADD is flawed ... bugger, fixing that ruins the compactness. :( It would work correctly for WRBYTE.

    The alternative is to use post-increment, WRLONG PB, PTRA++. But then it's slower, at 9 sysclocks per longword.
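
One plausible reading of the flaw (an assumption, since the post does not spell it out): PA counts longs, while `--ptra` in a WRLONG steps the pointer by 4 bytes, so `add ptra, pa` leaves PTRA short of the true end of the buffer. A minimal Python model of the address arithmetic, with hypothetical helper names, shows why the same code would be fine for WRBYTE:

```python
def predecrement_fill_addresses(start, count, unit_bytes, scale_count):
    """Model 'add ptra, pa' followed by 'count' writes through --ptra.

    unit_bytes  : bytes per write (4 for WRLONG, 1 for WRBYTE)
    scale_count : if True, add count * unit_bytes instead of count
                  (the fix that would cost an extra instruction)
    """
    ptra = start + (count * unit_bytes if scale_count else count)
    addresses = []
    for _ in range(count):
        ptra -= unit_bytes          # --ptra steps by the access size
        addresses.append(ptra)
    return addresses


start, count = 0x1000, 4

# Long fill as posted: the pointer only advances by 'count' bytes.
longs_as_posted = predecrement_fill_addresses(start, count, 4, scale_count=False)

# Long fill with the count scaled to bytes: covers the intended buffer.
longs_scaled = predecrement_fill_addresses(start, count, 4, scale_count=True)

# Byte fill as posted: count and byte length coincide, so it is correct.
bytes_as_posted = predecrement_fill_addresses(start, count, 1, scale_count=False)

print([hex(a) for a in longs_as_posted])
print([hex(a) for a in longs_scaled])
print([hex(a) for a in bytes_as_posted])
```

In this model the unscaled long fill writes one long at the start address and the rest below the buffer, which matches the remark that fixing it ruins the compactness.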

  • Rayman Posts: 13,903
    I think a WFLONG loop with REP can send a long every two clocks...
  • evanh Posts: 15,192
    Yep, but needs the FIFO.
  • TonyB_ Posts: 2,127
    edited 2020-03-13 23:42
    cgracey wrote: »
    Yes, we don't have any firepower when it comes to efficiently replicating a value into cog registers.
    Here's the Spin2 code as it is:
    longmove_	shl	x,#1			'cnt<<2 for long
    wordmove_	shl	x,#1			'cnt<<1 for word
    bytemove_
    		setq	#2-1			'pop dst into buff+14
    		rdlong	buff+14,--ptra		'pop src/val into buff+15
    
    		mov	y,ptra			'save ptra
    
    		mov	ptra,buff+14		'set ptra to dst
    		mov	ptrb,buff+15		'set ptrb to src/val
    
    		testbn	pa,#1		wz	'move (Z=0) or fill (Z=1)?
    
      if_nz		cmp	ptrb,ptra	wc	'forward or reverse move?
      if_nz_and_nc	jmp	#move_fwd
      if_nz_and_c	jmp	#move_rev
    
    
    		cmp	pa,#bc_longfill	wc	'word fill?
      if_c		movbyts	buff+15,#%%1010
    		cmp	pa,#bc_wordfill	wc	'byte fill?
      if_c		movbyts	buff+15,#%%0000
    
    		mov	buff+00,buff+15		'fill buff
    		mov	buff+01,buff+15
    		mov	buff+02,buff+15
    		mov	buff+03,buff+15
    		mov	buff+04,buff+15
    		mov	buff+05,buff+15
    		mov	buff+06,buff+15
    		mov	buff+07,buff+15
    		mov	buff+08,buff+15
    		mov	buff+09,buff+15
    		mov	buff+10,buff+15
    		mov	buff+11,buff+15
    		mov	buff+12,buff+15
    		mov	buff+13,buff+15
    		mov	buff+14,buff+15
    
    With a future Fast Block Fill, the last 15 MOVs could be replaced by:
    		setq	##(15-1) | (1 << 31)	'count-1 in the low bits, Q[31] = 1 requests a fill
    		rdlong	buff+00,y		'only one rdlong needed for 15 cog reg writes
    
    Average execution time ~ 30 cycles, the same as now, or slightly quicker with setq reg.
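
The proposed behaviour can be sketched as a small Python model. This is only an illustration of the idea in this post, not existing P2 hardware behaviour; the names `FILL_FLAG`, `decode_q` and `block_rdlong` are hypothetical.

```python
FILL_FLAG = 1 << 31        # hypothetical: Q[31] selects fill instead of move
COUNT_MASK = 0x7FFFFFFF    # remaining bits hold (transfer count - 1)


def decode_q(q):
    """Split a SETQ value into (transfer count, fill flag) under the proposal."""
    count = (q & COUNT_MASK) + 1
    return count, bool(q & FILL_FLAG)


def block_rdlong(hub_longs, index, q):
    """Return the values that would land in consecutive cog registers."""
    count, fill = decode_q(q)
    if fill:
        # Proposed fast block fill: one hub read, replicated 'count' times.
        return [hub_longs[index]] * count
    # Ordinary fast block move: 'count' consecutive hub longs.
    return hub_longs[index:index + count]


hub = [0xAAAA5555, 1, 2, 3]

q_fill = (15 - 1) | FILL_FLAG
filled = block_rdlong(hub, 0, q_fill)      # 15 copies of the first long

q_move = 4 - 1
moved = block_rdlong(hub, 0, q_move)       # the first four longs, unchanged

print(decode_q(q_fill), len(filled), moved)
```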
  • Peter Jakacki Posts: 10,193
    edited 2020-03-13 23:09
    This is what I use for byte fills, which works out to 2 cycles per byte, or the same per long if it were a wflong.
    ' FILL ( addr cnt fillch -- )
    CFILL                   wrfast  #0,c
                            rep     @.L0,b
                            wfbyte  a
    

    Filling a 64kB block
    TAQOZ# $2.0000 $1.0000 'a' LAP FILL LAP .LAP --- 131,168 cycles= 655,840ns @200MHz ok
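
As a rough cross-check of the figures in this thread, the Python sketch below converts the reported cycle count to time and compares the per-long costs quoted so far (7 and 9 sysclocks for the WRLONG loops, 2 for a FIFO write). Loop setup overhead is ignored, and the estimates are simple multiplications rather than measurements.

```python
CLOCK_HZ = 200_000_000          # clock used for the benchmark above
BLOCK_BYTES = 64 * 1024         # 64 kB block


def cycles_to_ns(cycles, clock_hz):
    """Convert a cycle count to nanoseconds at the given clock."""
    return cycles * 1_000_000_000 / clock_hz


# Figure reported by the TAQOZ benchmark above.
measured_cycles = 131_168
measured_ns = cycles_to_ns(measured_cycles, CLOCK_HZ)
cycles_per_byte = measured_cycles / BLOCK_BYTES

# Per-long costs quoted earlier in the thread (not measured here).
clocks_per_long = {
    "wrlong pb, --ptra loop": 7,
    "wrlong pb, ptra++ loop": 9,
    "rep + wflong (FIFO)": 2,
}
longs = BLOCK_BYTES // 4
estimated_cycles = {name: longs * c for name, c in clocks_per_long.items()}

print(f"{measured_ns:,.0f} ns, {cycles_per_byte:.4f} cycles/byte")
for name, cycles in estimated_cycles.items():
    print(f"{name}: {cycles:,} cycles, {cycles_to_ns(cycles, CLOCK_HZ):,.0f} ns")
```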
    
  • Rayman Posts: 13,903
    Hmm... I guess you can't use the fifo with inline assembly and Spin2?
  • Rayman wrote: »
    Hmm... I guess you can't use the fifo with inline assembly and Spin2?

    This code is strictly cogexec.

    So here's a question then. In TAQOZ I reserve some cog memory to be able to load modules that benefit from or can only run from cogexec. The load using setq is superfast, so there is minimal overhead. Does Spin2 allow this type of "fcache"?
  • evanh Posts: 15,192
    SETQ method doesn't use the FIFO so that one works as hubexec.
  • Sorry, I meant the actual code that is executed in cog although it is loaded from hub using setq rdlong etc. Does Spin2 reserve cog memory, even a handful of longs?
  • evanh Posts: 15,192
    I think so. Not seeing the info in the docs though.
  • Rayman Posts: 13,903
    Spin2 leaves a lot of lower cog ram free for inline asm
  • I believe the first 128 longs are available for inline asm.
  • rogloh Posts: 5,171
    edited 2020-03-14 01:21
    Maybe the Spin2 interpreter's fill operation can be sped up...

    Here's the existing fill portion of the Spin2 code that transfers the 16 pre-filled registers to hub RAM in bursts:
    move_fwd_loop	mov	w,#16
    		fle	w,x
    		sub	x,w
    		djf	w,#move_done
      if_nz		setq	w
      if_nz		rdlong	buff,ptrb++
    		setq	w
    		wrlong	buff,ptra++
    		jmp	#move_fwd_loop
    

    Once it is up and running, the loop seems to take 20 clocks for the 9 instructions + 16 cycles for the transfer, plus a hub access window delay of 4 additional clocks to align back to a multiple of 8 clock cycles for the egg-beater. To me that implies that the fill efficiency is 40% (16/40 clocks). Perhaps if the loop could be shrunk by removing the jump instruction then we can boost it to 50% (16/32), which makes the fill 25% faster. So how about this variant below...can it help us?

    move_fwd_loop	
    		rep     #8, #0
    		mov	w,#16
    		fle	w,x
    		sub	x,w
    		djf	w,#move_done
      if_nz		setq	w
      if_nz		rdlong	buff,ptrb++
    		setq	w
    		wrlong	buff,ptra++
    

    It doesn't take any more cog RAM space, and it doesn't try to branch at the end of the REP loop either, which I think is safe, right @evanh? Plus, when the source and destination buffers are aligned to the same 8-long boundary, I think it might speed up the copy as well, because we are again on a multiple of 8 clocks (16 for instructions + 16 for reads + 16 for writes), and the bus transfer efficiency is then 66%.

    Is my simple reasoning here correct or am I missing something?
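
The arithmetic in this post can be reproduced with a few lines of Python. This is a sketch under the post's own assumptions (2 clocks per ordinary instruction, 4 for the JMP, one clock per long transferred, and each loop pass rounded up to a multiple of 8 clocks); the function names are hypothetical.

```python
HUB_WINDOW = 8  # clocks; each loop pass is assumed to realign to a multiple of this


def round_up_to_window(clocks, window=HUB_WINDOW):
    """Round a clock count up to the next multiple of the hub window."""
    return -(-clocks // window) * window


def loop_efficiency(instruction_clocks, transfer_clocks):
    """Return (fraction of time spent transferring, total clocks per pass)."""
    total = round_up_to_window(instruction_clocks + transfer_clocks)
    return transfer_clocks / total, total


# Fill with the JMP at the end: 20 instruction clocks + 16 transfer clocks.
fill_jmp = loop_efficiency(20, 16)

# Fill with REP instead of JMP: 16 instruction clocks + 16 transfer clocks.
fill_rep = loop_efficiency(16, 16)

# Aligned copy with REP: 16 instruction clocks + 16 reads + 16 writes.
copy_rep = loop_efficiency(16, 16 + 16)

# Throughput gain of the REP fill over the JMP fill.
speedup = fill_jmp[1] / fill_rep[1]

print(fill_jmp, fill_rep, copy_rep, speedup)
```

Under these assumptions the model gives 40% and 50% for the two fill variants, about 66% for the aligned copy, and a 1.25x throughput ratio between the fills.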
  • cgracey Posts: 14,133
    edited 2020-03-14 01:37
    rogloh wrote: »
    Is my simple reasoning here correct or am I missing something?

    There is one thing you are missing: interrupts. They will be stalled for a long time during a fill, due to REP. The way the interpreter is set up now, nothing hogs much time, so interrupts can be very granular in time.
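
To put a rough number on that stall, here is a Python estimate of how long interrupts would be held off if an entire fill ran inside one REP block. It assumes rogloh's figure of 32 clocks per 16-long burst and, purely for illustration, a 200 MHz clock; both are assumptions, not measurements.

```python
def rep_fill_stall_clocks(total_longs, burst_longs=16, clocks_per_burst=32):
    """Clocks spent inside a single REP block for a fill of 'total_longs'."""
    bursts = -(-total_longs // burst_longs)   # ceiling division
    return bursts * clocks_per_burst


CLOCK_MHZ = 200                    # assumed system clock

longs_in_64k = 64 * 1024 // 4      # 64 kB fill expressed in longs
stall_clocks = rep_fill_stall_clocks(longs_in_64k)
stall_us = stall_clocks / CLOCK_MHZ

# For comparison, one pass of the JMP-based loop is about 40 clocks,
# after which a pending interrupt can be taken.
single_pass_us = 40 / CLOCK_MHZ

print(stall_clocks, stall_us, single_pass_us)
```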
  • cgracey Posts: 14,133
    Roy Eltham wrote: »
    I believe the first 128 longs are available for inline asm.

    The first $138 longs are available. You can also have terminate-stay-resident PASM programs in that space.

    Whenever PASM is called from spin, including inline assembly, the FIFO is available to the code, since it is reinitialized afterwards.
  • evanh Posts: 15,192
    edited 2020-03-14 02:02
    Nice. The FIFO solution is most compact then.

  • cgracey wrote: »
    rogloh wrote: »
    Is my simple reasoning here correct or am I missing something?

    There is one thing you are missing: interrupts. They will be stalled for a long time during a fill, due to REP. The way the interpreter is set up now, nothing hogs much time, so interrupts can be very granular in time.

    Yeah I thought it was a bit too easy... :smile:
  • rogloh wrote: »
    cgracey wrote: »
    rogloh wrote: »
    Is my simple reasoning here correct or am I missing something?

    There is one thing you are missing: interrupts. They will be stalled for a long time during a fill, due to REP. The way the interpreter is set up now, nothing hogs much time, so interrupts can be very granular in time.

    Yeah I thought it was a bit too easy... :smile:

    What's the current longest time that interrupts are stalled in the SPIN2 interpreter?

    At the expense of one extra long it looks like both could be accommodated.
    f_move_fwd_loop rep #8, #0
    move_fwd_loop	mov	w,#16
    		fle	w,x
    		sub	x,w
    		djf	w,#move_done
      if_nz		setq	w
      if_nz		rdlong	buff,ptrb++
    		setq	w
    		wrlong	buff,ptra++
    		jmp	#move_fwd_loop
    

    A compile time decision could be made that adjusts the entry point to select between fast and interrupt friendly versions.
  • Chip,
    That's a lot more than expected, very nice!
  • cgracey Posts: 14,133
    Roy Eltham wrote: »
    Chip,
    That's a lot more than expected, very nice!

    Your PASM code just needs to not touch registers $138..$1D7 and the LUT. You can use all the I/O registers, 6 stack levels, and the FIFO. Pretty complicated programs can run in the background on interrupts that you launch from in-line PASM or REGEXEC.
  • Rayman Posts: 13,903
    Interrupt driven code can’t touch fifo though I bet
  • evanh Posts: 15,192
    Rayman wrote: »
    Interrupt driven code can’t touch fifo though I bet
    Time to start up your own cog.

  • TonyB_ Posts: 2,127
    edited 2020-03-14 11:31
    Another possible future option:

    SETQ + MOV to move S to D, D+1, ..., utilising part of SETQ + RDLONG logic.

    In the Spin2 code above, RDLONG has been done to copy fill data to a cog register so no need to do RDLONG again.
  • cgracey Posts: 14,133
    Rayman wrote: »
    Interrupt driven code can’t touch fifo though I bet

    Correct, unless you determine what mode the FIFO is in and where it's at so that you can restore it. There could also be a SKIP pattern in progress from the main code. What could be simply done in an interrupt would be some I/O pin servicing, buffer updating, and flag setting. Better tread lightly.