Fast Block Fills

TonyB_ · 2020-03-13 19:16

cgracey wrote: »

JonnyMac wrote: »

I think Chip said he was dead out of space for the interpreter; I wonder if that drove the decision to limit the timing to 32 bits.

With more code, it could be achieved. There are 15 identical instructions in the BYTEFILL/WORDFILL/LONGFILL routine that would sure be nice to condense, somehow. Then, we'd have some pennies left over for a 64-bit WAITMS/WAITUS.

Above is copied from another thread.

This is probably no help for the P2, but here is an idea for an optional enhancement for Fast Block Moves to create Fast Block Fills, such that a single long is read which is then written multiple times. For SETQ+RDLONG only the first hub long is read and for SETQ+WRLONG/WMLONG only the first cog long is read if Q[31] = 1 (similarly for SETQ2).

I wish I'd thought of this at the time as it would have saved 12 longs in the Spin2 interpreter with no more cycles than now on average.

cgracey · 2020-03-13 21:45

TonyB_ wrote: »

cgracey wrote: »

JonnyMac wrote: »

I think Chip said he was dead out of space for the interpreter; I wonder if that drove the decision to limit the timing to 32 bits.

With more code, it could be achieved. There are 15 identical instructions in the BYTEFILL/WORDFILL/LONGFILL routine that would sure be nice to condense, somehow. Then, we'd have some pennies left over for a 64-bit WAITMS/WAITUS.

Above is copied from another thread.

This is probably no help for the P2, but here is an idea for an optional enhancement for Fast Block Moves to create Fast Block Fills, such that a single long is read which is then written multiple times. For SETQ+RDLONG only the first hub long is read and for SETQ+WRLONG/WMLONG only the first cog long is read if Q[31] = 1 (similarly for SETQ2).

I wish I'd thought of this at the time as it would have saved 12 longs in the Spin2 interpreter with no more cycles than now on average.

Yes, we don't have any firepower when it comes to efficiently replicating a value into cog registers.

evanh · 2020-03-13 21:59

Here's a compact one that is non-bursting and interruptible. It takes 7 sysclocks per longword.

'----------------------------------------------------------
fill_hub
'  PA    = loop count
'  PTRA  = start address
'  PB    = fill value
'-------------
		add	ptra, pa
.floop
		wrlong	pb, --ptra
	_ret_	djnz	pa, #.floop
'----------------------------------------------------------

Rayman · 2020-03-13 22:09

Wouldn't REP and RFLONG be faster?

evanh · 2020-03-13 22:19

In this case, hub timing slot occurs only every 7 clocks, so won't be faster. And a REP will block all interrupts as well.

evanh · 2020-03-13 22:22

Oops, the ADD is flawed ... bugger, fixing that ruins the compactness.

It would work correctly for WRBYTE.

The alternative is to use post-increment, WRLONG PB, PTRA++. But then it's slower at 9 sysclocks per longword.
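
The flaw is presumably one of units: PA counts longs while PTRA is a byte address, and `--ptra` on a WRLONG steps back four bytes at a time, so the end pointer would have to be PTRA + 4*PA, which costs an extra instruction. A small Python model of the loop (not PASM2; the function name and the dict standing in for hub RAM are illustrative assumptions) shows the difference:

```python
def fill_hub_predec(start, count, value, scale_count):
    """Model of the pre-decrement fill loop; returns {byte_address: value}.

    scale_count=False mirrors the posted `add ptra, pa` (count added unscaled).
    scale_count=True converts the long count to a byte offset first.
    """
    hub = {}
    ptra = start + (count * 4 if scale_count else count)
    for _ in range(count):      # djnz pa, #.floop
        ptra -= 4               # --ptra steps back one long (4 bytes)
        hub[ptra] = value       # wrlong pb, --ptra
    return hub

buggy = fill_hub_predec(0x1000, 4, 0xDEADBEEF, scale_count=False)
fixed = fill_hub_predec(0x1000, 4, 0xDEADBEEF, scale_count=True)
print(sorted(hex(a) for a in buggy))
print(sorted(hex(a) for a in fixed))
```

In this model the unscaled add writes only the first long of the target and puts the rest below the start address. For WRBYTE the step is one byte, so the count and the byte offset coincide and the original ADD is fine.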

Rayman · 2020-03-13 22:40

I think wflong loop with rep can send long every two clocks...

evanh · 2020-03-13 22:57

Yep, but needs the FIFO.

TonyB_ · 2020-03-13 23:08

cgracey wrote: »

TonyB_ wrote: »

This is probably no help for the P2, but here is an idea for an optional enhancement for Fast Block Moves to create Fast Block Fills, such that a single long is read which is then written multiple times. For SETQ+RDLONG only the first hub long is read and for SETQ+WRLONG/WMLONG only the first cog long is read if Q[31] = 1 (similarly for SETQ2).

I wish I'd thought of this at the time as it would have saved 12 longs in the Spin2 interpreter with no more cycles than now on average.

Yes, we don't have any firepower when it comes to efficiently replicating a value into cog registers.

Here's the Spin2 code as it is:

longmove_	shl	x,#1			'cnt<<2 for long
wordmove_	shl	x,#1			'cnt<<1 for word
bytemove_
		setq	#2-1			'pop dst into buff+14
		rdlong	buff+14,--ptra		'pop src/val into buff+15

		mov	y,ptra			'save ptra

		mov	ptra,buff+14		'set ptra to dst
		mov	ptrb,buff+15		'set ptrb to src/val

		testbn	pa,#1		wz	'move (Z=0) or fill (Z=1)?

  if_nz		cmp	ptrb,ptra	wc	'forward or reverse move?
  if_nz_and_nc	jmp	#move_fwd
  if_nz_and_c	jmp	#move_rev


		cmp	pa,#bc_longfill	wc	'word fill?
  if_c		movbyts	buff+15,#%%1010
		cmp	pa,#bc_wordfill	wc	'byte fill?
  if_c		movbyts	buff+15,#%%0000

		mov	buff+00,buff+15		'fill buff
		mov	buff+01,buff+15
		mov	buff+02,buff+15
		mov	buff+03,buff+15
		mov	buff+04,buff+15
		mov	buff+05,buff+15
		mov	buff+06,buff+15
		mov	buff+07,buff+15
		mov	buff+08,buff+15
		mov	buff+09,buff+15
		mov	buff+10,buff+15
		mov	buff+11,buff+15
		mov	buff+12,buff+15
		mov	buff+13,buff+15
		mov	buff+14,buff+15
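
For readers unfamiliar with MOVBYTS, the two selector constants above replicate the low word and the low byte of the fill value across the long. A minimal Python model, assuming each 2-bit field of the selector picks the source byte for the corresponding destination byte (the function name is illustrative):

```python
def movbyts(d, s):
    """Model of MOVBYTS D,#S: each 2-bit field of S picks a source byte of D."""
    src = [(d >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):                  # i = destination byte, 0 = lowest
        sel = (s >> (2 * i)) & 0b11     # 2-bit selector for this byte
        out |= src[sel] << (8 * i)
    return out

val = 0x12345678
word_fill = movbyts(val, 0b01_00_01_00)        # %%1010: low word in both halves
byte_fill = movbyts(word_fill, 0b00_00_00_00)  # %%0000: low byte in all four bytes
print(hex(word_fill), hex(byte_fill))
```

Because the byte-fill case falls through both MOVBYTS instructions, applying %%0000 after %%1010 still leaves the low byte in every position.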

With a future Fast Block Fill, the last 15 MOVs could be replaced by:

		setq	##15-1 | 1 << 31
		rdlong	buff+00,y		'only one rdlong needed for 15 cog reg writes

Average execution time ~ 30 cycles, the same as now, or slightly quicker with setq reg.
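
To make the proposed semantics concrete, here is a rough Python model of SETQ+RDLONG with the hypothetical Q[31] fill flag. This is a sketch of the idea only; the feature does not exist in the P2, and the function name and the long-indexed lists standing in for cog and hub memory are assumptions made for illustration.

```python
FILL_FLAG = 1 << 31   # hypothetical: Q[31] = 1 selects fill mode

def setq_rdlong(cog, hub, d, addr, q):
    """Model SETQ q + RDLONG d,addr on long-indexed lists.

    Loads (q without bit 31) + 1 longs into cog[d], cog[d+1], ...
    In the proposed fill mode only hub[addr] is read and it is
    replicated into every destination register.
    """
    count = (q & ~FILL_FLAG) + 1
    fill = bool(q & FILL_FLAG)
    for i in range(count):
        cog[d + i] = hub[addr] if fill else hub[addr + i]

hub = [0xAA, 0xBB, 0xCC, 0xDD]

buff = [0] * 16
setq_rdlong(buff, hub, 0, 0, (15 - 1) | FILL_FLAG)   # fill: 15 copies of hub[0]

block = [0] * 4
setq_rdlong(block, hub, 0, 0, 4 - 1)                 # normal fast block move
print(buff, block)
```

With the flag clear the model behaves like an ordinary fast block read, which is why the idea was pitched as an optional enhancement rather than a new instruction.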

Peter Jakacki · 2020-03-13 23:08

This is what I use for byte fills which works out 2 cycles per byte, or the same if it were a wflong.

' FILL ( addr cnt fillch -- )
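CFILL                   wrfast  #0,c            ' start FIFO writes at hub address c
                        rep     @.L0,b          ' repeat the block up to .L0, b times
                        wfbyte  a               ' write fill byte a through the FIFO
.L0                                             ' end of REP block (label and return not shown in the original post)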

Filling a 64kB block

TAQOZ# $2.0000 $1.0000 'a' LAP FILL LAP .LAP --- 131,168 cycles= 655,840ns @200MHz ok
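
As a quick sanity check of that figure, the arithmetic can be reproduced in Python using only the numbers quoted above (2 cycles per byte, a $1.0000-byte block, 200 MHz); the variable names are illustrative:

```python
CLOCK_HZ = 200_000_000      # 200 MHz system clock, as stated in the post
BLOCK_BYTES = 0x10000       # 64 kB fill
CYCLES_PER_BYTE = 2         # one WFBYTE per REP iteration
measured_cycles = 131_168   # LAP result from the post

ideal_cycles = BLOCK_BYTES * CYCLES_PER_BYTE
overhead_cycles = measured_cycles - ideal_cycles
elapsed_ns = measured_cycles * 1_000_000_000 // CLOCK_HZ
print(ideal_cycles, overhead_cycles, elapsed_ns)
```

The measured count sits a little above the ideal 2 cycles per byte; the difference would be setup and measurement overhead around the REP loop.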

Rayman · 2020-03-13 23:29

Hmm... I guess you can't use the fifo with inline assembly and Spin2?

Peter Jakacki · 2020-03-13 23:37

Rayman wrote: »

Hmm... I guess you can't use the fifo with inline assembly and Spin2?

This code is strictly cogexec.

So here's a question then. In TAQOZ I reserve some cog memory to be able to load modules that benefit from or can only run from cogexec. The load using setq is superfast, so there is minimal overhead. Does Spin2 allow this type of "fcache"?

evanh · 2020-03-14 00:12

SETQ method doesn't use the FIFO so that one works as hubexec.

Peter Jakacki · 2020-03-14 00:19

Sorry, I meant the actual code that is executed in cog although it is loaded from hub using setq rdlong etc. Does Spin2 reserve cog memory, even a handful of longs?

evanh · 2020-03-14 00:43

I think so. Not seeing the info in the docs though.

Rayman · 2020-03-14 01:08

Spin2 leaves a lot of lower cog ram free for inline asm

Roy Eltham · 2020-03-14 01:13

I believe the first 128 longs are available for inline asm.

rogloh · 2020-03-14 01:17

Maybe the Spin2 interpreter's fill operation can be sped up...

Here's the existing fill portion of the Spin2 code that transfers the 16 pre-filled registers to hub RAM in bursts:

move_fwd_loop	mov	w,#16
		fle	w,x
		sub	x,w
		djf	w,#move_done
  if_nz		setq	w
  if_nz		rdlong	buff,ptrb++
		setq	w
		wrlong	buff,ptra++
		jmp	#move_fwd_loop

Once it is up and running, the loop seems to take 20 clocks for the 9 instructions + 16 cycles for the transfer, plus a hub access window delay of 4 additional clocks to align back to a multiple of 8 clock cycles for the egg-beater. To me that implies that the fill efficiency is 40% (16/40 clocks). Perhaps if the loop could be shrunk by removing the jump instruction then we can boost it to 50% (16/32), which makes the fill 25% faster. So how about this variant below...can it help us?

move_fwd_loop	
		rep     #8, #0
		mov	w,#16
		fle	w,x
		sub	x,w
		djf	w,#move_done
  if_nz		setq	w
  if_nz		rdlong	buff,ptrb++
		setq	w
		wrlong	buff,ptra++

It doesn't take any more COG RAM space and it doesn't try to branch at the end of the rep loop either which is safe I think, right @evanh? Plus when the source and dest buffers are aligned to the same 8 long boundary I think it might be able to speed up the copy as well because we are again on a multiple of 8 clocks (16 for instructions + 16 for reads + 16 for writes) and the bus transfer efficiency is then 66%.

Is my simple reasoning here correct or am I missing something?
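
The percentages in this estimate can be reproduced with a short Python sketch. It takes the clock counts from the post as given and assumes, as the post does, that each pass is padded to a multiple of 8 clocks; the helper names are illustrative and this is not a cycle-accurate model of the hub.

```python
HUB_WINDOW = 8   # clocks per hub rotation assumed in the post

def round_up(clocks, multiple=HUB_WINDOW):
    """Pad a clock count up to the next multiple of the hub window."""
    return -(-clocks // multiple) * multiple

def efficiency(transfer_clocks, instr_clocks):
    """Return (fraction of clocks spent transferring, total clocks per pass)."""
    total = round_up(transfer_clocks + instr_clocks)
    return transfer_clocks / total, total

fill_now, t_now = efficiency(16, 20)    # existing loop: 9 instructions incl. JMP
fill_rep, t_rep = efficiency(16, 16)    # REP variant: JMP removed
copy_rep, t_copy = efficiency(32, 16)   # aligned copy: 16 read + 16 write clocks
speedup = t_now / t_rep
print(fill_now, fill_rep, copy_rep, speedup)
```

Under these assumptions the fill goes from 40 to 32 clocks per 16 longs, which is where the 25% figure comes from.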

cgracey · 2020-03-14 01:36

rogloh wrote: »

Maybe the Spin2 interpreter's fill operation can be sped up...

Here's the existing fill portion of the Spin2 code that transfers the 16 pre filled registers to hub RAM in bursts:

move_fwd_loop	mov	w,#16
		fle	w,x
		sub	x,w
		djf	w,#move_done
  if_nz		setq	w
  if_nz		rdlong	buff,ptrb++
		setq	w
		wrlong	buff,ptra++
		jmp	#move_fwd_loop

Once it is up and running, the loop seems to take 20 clocks for the 9 instructions + 16 cycles for the transfer, plus a hub access window delay of 4 additional clocks to align back to a multiple of 8 clock cycles for the egg-beater. To me that implies that the fill efficiency is 40% (16/40 clocks). Perhaps if the loop could be shrunk by removing the jump instruction then we can boost it to 50% (16/32), which makes the fill 25% faster. So how about this variant below...can it help us?

move_fwd_loop	
		rep     #8, #0
		mov	w,#16
		fle	w,x
		sub	x,w
		djf	w,#move_done
  if_nz		setq	w
  if_nz		rdlong	buff,ptrb++
		setq	w
		wrlong	buff,ptra++

It doesn't take any more COG RAM space and it doesn't try to branch at the end of the rep loop either which is safe I think, right @evanh? Plus when the source and dest buffers are aligned to the same 8 long boundary I think it might be able to speed up the copy as well because we are again on a multiple of 8 clocks (16 for instructions + 16 for reads + 16 for writes) and the bus transfer efficiency is then 66%.

Is my simple reasoning here correct or am I missing something?

There is one thing you are missing: interrupts. They will be stalled for a long time during a fill, due to REP. The way the interpreter is set up now, nothing hogs much time, so interrupts can be very granular in time.

cgracey · 2020-03-14 01:44

Roy Eltham wrote: »

I believe the first 128 longs are available for inline asm.

The first $138 (312) longs are available. You can also have terminate-stay-resident PASM programs in that space.

Whenever PASM is called from spin, including inline assembly, the FIFO is available to the code, since it is reinitialized afterwards.

evanh · 2020-03-14 01:54

Nice. The FIFO solution is most compact then.

rogloh · 2020-03-14 01:55

cgracey wrote: »

rogloh wrote: »

Is my simple reasoning here correct or am I missing something?

There is one thing you are missing: interrupts. They will be stalled for a long time during a fill, due to REP. The way the interpreter is set up now, nothing hogs much time, so interrupts can be very granular in time.

Yeah I thought it was a bit too easy...

AJL · 2020-03-14 03:54

rogloh wrote: »

cgracey wrote: »

rogloh wrote: »

Is my simple reasoning here correct or am I missing something?

There is one thing you are missing: interrupts. They will be stalled for a long time during a fill, due to REP. The way the interpreter is set up now, nothing hogs much time, so interrupts can be very granular in time.

Yeah I thought it was a bit too easy...

What's the current longest time that interrupts are stalled in the SPIN2 interpreter?

At the expense of one extra long it looks like both could be accommodated.

f_move_fwd_loop rep #8, #0
move_fwd_loop	mov	w,#16
		fle	w,x
		sub	x,w
		djf	w,#move_done
  if_nz		setq	w
  if_nz		rdlong	buff,ptrb++
		setq	w
		wrlong	buff,ptra++
		jmp	#move_fwd_loop

A compile time decision could be made that adjusts the entry point to select between fast and interrupt friendly versions.

Roy Eltham · 2020-03-14 04:02

Chip,
That's a lot more than expected, very nice!

cgracey · 2020-03-14 05:56

Roy Eltham wrote: »

Chip,
That's a lot more than expected, very nice!

Your PASM code just needs to not touch registers $138..$1D7 and the LUT. You can use all the I/O registers, 6 stack levels, and the FIFO. Pretty complicated programs can run in the background on interrupts that you launch from in-line PASM or REGEXEC.

Rayman · 2020-03-14 10:37

Interrupt driven code can’t touch fifo though I bet

evanh · 2020-03-14 10:58

Rayman wrote: »

Interrupt driven code can’t touch fifo though I bet

Time to start up your own cog.

TonyB_ · 2020-03-14 11:27

Another possible future option:

SETQ + MOV to move S to D, D+1, ..., utilising part of SETQ + RDLONG logic.

In the Spin2 code above, RDLONG has been done to copy fill data to a cog register so no need to do RDLONG again.

cgracey · 2020-03-14 15:05

Rayman wrote: »

Interrupt driven code can’t touch fifo though I bet

Correct, unless you determine what mode the FIFO is in and where it's at so that you can restore it. There could also be a SKIP pattern in progress from the main code. What could be simply done in an interrupt would be some I/O pin servicing, buffer updating, and flag setting. Better tread lightly.
