RDLUT/WRLUT with auto-incrementing address

TonyB_ · 2018-11-09 02:32

TonyB_ wrote: »
ozpropdev wrote: »
Actually thinking about it you can do this little trick.
		setq2	#10-1
		rdlong	0,ptrb++[10]
PTRB increments +40.
Tautological but it saves two cycles.

[10] is the index and is limited to -16 to +15 without AUGS, which means it won't save us two cycles in the soft HDMI code.

ozpropdev · 2018-11-09 02:46

rogloh wrote: »

Cool ozpropdev, so you actually tested this on real P2 HW and it behaves that way?

Yes, confirmed on hardware.

BTW. It has a limited reach of 15 though.
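
To make the reach limit concrete, here is a small Python model of the trick above. The function name `index_trick_advance` is hypothetical, and the 1..15 bound is simply the reach quoted in this thread for a plain [index] without AUGS; treat it as an assumption of the sketch rather than a statement about any particular silicon revision.

```python
LONG_BYTES = 4
MAX_INDEX_NO_AUGS = 15   # assumed: the reach quoted in the thread for a plain [index]

def index_trick_advance(block_longs):
    """Bytes PTRx advances when [index] is set equal to the block length.

    Returns None when the block is too long for a plain [index],
    in which case the thread's options are AUGS or a separate ADD.
    """
    if 1 <= block_longs <= MAX_INDEX_NO_AUGS:
        return block_longs * LONG_BYTES
    return None

print(index_trick_advance(10))   # 40, matching "PTRB increments +40"
print(index_trick_advance(100))  # None: out of reach without AUGS
```

So the trick only helps for short bursts; a 100-long block falls outside it.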

cgracey · 2018-11-09 07:35

There is no obvious policy for updating PTRx on 'SETQ(2)+RD/WRLONG reg,PTRx_expression'. Here goes, though:

	SETQ	#99			'move 100 longs
	RDLONG	reg,PTRA		'read from PTRA (already does this)

	SETQ	#99			'move 100 longs
	RDLONG	reg,PTRA++		'read from PTRA, PTRA += 100*4

	SETQ	#99			'move 100 longs
	RDLONG	reg,PTRA--		'read from PTRA, PTRA -= 100*4

	SETQ	#99			'move 100 longs
	RDLONG	reg,++PTRA		'read from PTRA+100*4, PTRA += 100*4

	SETQ	#99			'move 100 longs
	RDLONG	reg,--PTRA		'read from PTRA-100*4, PTRA -= 100*4

Here are the more problematic cases:

	SETQ	#99			'move 100 longs
	RDLONG	reg,PTRA[index]		'read from PTRA+index*4 (already does this)

	SETQ	#99			'move 100 longs
	RDLONG	reg,PTRA++[index]	'read from PTRA+index*4, PTRA += (100+index)*4

	SETQ	#99			'move 100 longs
	RDLONG	reg,PTRA--[index]	'read from PTRA-index*4, PTRA -= (100+index)*4

	SETQ	#99			'move 100 longs
	RDLONG	reg,++PTRA[index]	'read from PTRA+(100+index)*4, PTRA += (100+index)*4

	SETQ	#99			'move 100 longs
	RDLONG	reg,--PTRA[index]	'read from PTRA-(100+index)*4, PTRA -= (100+index)*4

In fact, the problematic cases involving [index] expressions can't even be supported because the non-index ++/-- cases already use %00001 (+1) or %11111 (-1) for index values within the 8-bit PTRx bitfield.

So, changing the behavior of SETQ(2)+RD/WRLONG would only work for index values of 0, +1, and -1. So, forget that second example set. Only the first would be sensible.

cgracey · 2018-11-09 08:04

Thinking more about this, we could fully involve the 5-bit index in the PTRx bitfield by multiplying the index by (Q+1)*4 for the full size of the operation. This would involve a 6*18-bit multiplier, though, which would eat lots of gates and probably not be of much practical use. If we only allow 0/+1/-1 then it's just an adder, but that would preclude '[index]' usage for SETQ(2)+RD/WRLONG operations.

cgracey · 2018-11-09 10:11

I'm tempted to put the index multiplier in, so that the unit size becomes (Q+1)*4, as opposed to just 4 for normal RDLONG/WRLONG operations.

The multiplier needs 5 bits (signed) for the index and then 19 bits (unsigned) for the Q+1 value. The Q+1 value needs 19 bits for the case of Q=$3FFFF, which represents the entire memory map, which is only a use case for something like "WRLONG #0,address" which probably wouldn't use PTRA/B, as it wraps completely around.

For practical transfer sizes, the limit is 512 longs (Q=$1FF), as this represents the whole register or LUT memory. It would only take 10 bits to represent 512. So, we'd need a 5-bit signed by 10-bit unsigned multiplier, which isn't too big and slow. This would let you transfer blocks of memory with integral-block granularity using the [index] expression. I don't know if that's even worth it. Then, there's the limitation of that 10-bit multiplier. It wouldn't work for memory clearing where your block size is over 512 longs and you're doing "SETQ+WRLONG #0,{++/--}PTRx{++/--}".
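
The bit widths quoted above can be checked with a few lines of Python. This is only a sanity check of the arithmetic in the post; the helper names `bits_unsigned` and `signed_range` are made up for illustration.

```python
def bits_unsigned(value):
    """Minimum number of bits needed to hold a non-negative integer."""
    return max(1, value.bit_length())

def signed_range(bits):
    """Inclusive (min, max) of a two's-complement field of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

full_map_q = 0x3FFFF      # largest Q, covering the whole memory map
lut_q = 0x1FF             # Q for a full 512-long register or LUT block

print(bits_unsigned(full_map_q + 1))  # 19
print(bits_unsigned(lut_q + 1))       # 10
print(signed_range(5))                # (-16, 15)
```

Q+1 needs one more bit than Q in both cases because each value is an exact power of two.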

I think this is maybe best left untouched. If the programmer is doing SETQ2+RD/WRLONG, he needs to be taking responsibility for the address, and not using PTRA/PTRB with automatic updating.

rogloh · 2018-11-09 10:27

Yeah, it is a real pity there are multiple cases which add further complexity here. Do byte/word burst transfers with auto-incrementing pointers need to be handled too? Only the LONG cases are shown above. I wonder how often a burst transfer with SETQ needs to start from an index offset rather than from the current address pointer, which will be more common, but I suspect someone would eventually want to try it. Perhaps it is useful for accumulating burst transfers of randomly accessed subsets of arrays or something. Pretty special case. For reads you are probably doing pointer changes outside this. For burst writes to hub, you might be chaining data from different subsets held in LUT/COG to hub in a sequence, so it may be of more value. Don't know, that seems far less common than just writing data repeatedly into a growing buffer like memcpy/memmove might do.

Perhaps some restriction is required on pointer updates for those index cases involving SETQ/SETQ2 with pre- and post-increment etc., though no doubt differences like this can also irritate people who expect all possible cases to be supported in a consistent way. It would be nice to have something simple that just updates the pointer on each iteration with sequential addition/subtraction, not extra multipliers with lots of logic. If this is found to be too hard or risky we should drop it.

cgracey · 2018-11-09 10:29

Actually, supporting index values of just 0/+1/-1 would be easy and practical, as most block access will be done sequentially, anyway.

I'm thinking this is what we'll support (note no random [index] term will be allowed, as only 0/1/-1 are implied):

	SETQ	#99			'move (99+1) longs
	RDLONG	reg,PTRA		'read from PTRA (already does this)

	SETQ	#99			'move (99+1) longs
	RDLONG	reg,PTRA++		'read from PTRA, PTRA += (99+1)*4

	SETQ	#99			'move (99+1) longs
	RDLONG	reg,PTRA--		'read from PTRA, PTRA -= (99+1)*4

	SETQ	#99			'move (99+1) longs
	RDLONG	reg,++PTRA		'read from PTRA+(99+1)*4, PTRA += (99+1)*4

	SETQ	#99			'move (99+1) longs
	RDLONG	reg,--PTRA		'read from PTRA-(99+1)*4, PTRA -= (99+1)*4

This is a little interesting in that the block size (Q+1) could be set by a register, as opposed to #immediate, making the step size variable.
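
For readers who want to check the five cases listed above, here is a minimal Python model of the proposed policy. It reflects only what this post proposes, not confirmed hardware behaviour; the function name `proposed_setq_rdlong` is hypothetical, and hub address wrap-around is deliberately ignored.

```python
LONG_BYTES = 4

def proposed_setq_rdlong(ptra, q, mode):
    """Return (read_address, new_ptra) under the policy proposed above.

    mode is one of "PTRA", "PTRA++", "PTRA--", "++PTRA", "--PTRA".
    q is the SETQ value, so the block is (q + 1) longs.
    Address wrap-around is not modelled (assumption of this sketch).
    """
    step = (q + 1) * LONG_BYTES
    if mode == "PTRA":
        return ptra, ptra
    if mode == "PTRA++":
        return ptra, ptra + step
    if mode == "PTRA--":
        return ptra, ptra - step
    if mode == "++PTRA":
        return ptra + step, ptra + step
    if mode == "--PTRA":
        return ptra - step, ptra - step
    raise ValueError("unsupported pointer expression: " + mode)

for mode in ("PTRA", "PTRA++", "PTRA--", "++PTRA", "--PTRA"):
    print(mode, proposed_setq_rdlong(0x1000, 99, mode))
```

Because `q` is an ordinary argument here, the same model also covers the register-driven block size mentioned above.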

rogloh · 2018-11-09 10:35

I can see that being able to transfer bursts using a variable in a register to define the transfer length is handy. You might have something that runs periodically and needs to transfer newly arriving data into hub, the amount of which can change from one time interval to the next. A fixed SETQ limits this, unless alts/altd is used, which I guess is still possible. It's only going to save one instruction though, so not that critical.

cgracey · 2018-11-09 11:00

rogloh wrote: »

I see being able to transfer bursts using a variable in a register to define the transfer length is handy. You might have something that runs periodically and needs to transfer newly arriving data into hub, the particular amount of which can change from time to time in each given time interval. A fixed SETQ limits this, unless alts/altd is used which I guess is still possible. It's only going to save one instruction though so not that critical.

Right. And I was just remembering that we can do things like this, using AUGS:

	setq2	#$1FF			'ready to write 512 LUT longs
	wrlong	0,--ptra[##$800]	'write to PTRA-$800, update PTRA

I think I'm NOT going to make any changes for this SETQ(2)+RD/WRLONG functionality.

rogloh · 2018-11-09 11:10

Fair enough, it seems complex. It's probably worth mentioning this particular pointer increment behaviour after SETQ in the documentation, or maybe even in an errata list for P2 that gives no date for any HW fix but at least shows the workaround using extra instructions, i.e. adjusting the new PTR value by the additional burst transfer size in software. Otherwise, I think some people might simply assume the pointer advances by the whole burst.

cgracey · 2018-11-09 11:22

"SETQ(2) does not alter PTRx behavior in SETQ(2)+RD/WRLONG-PTRx instruction sequences."

TonyB_ · 2018-11-09 12:08

cgracey wrote: »
rogloh wrote: »

I see being able to transfer bursts using a variable in a register to define the transfer length is handy. You might have something that runs periodically and needs to transfer newly arriving data into hub, the particular amount of which can change from time to time in each given time interval. A fixed SETQ limits this, unless alts/altd is used which I guess is still possible. It's only going to save one instruction though so not that critical.

Right. And I was just remembering that we can do things like this, using AUGS:
	setq2	#$1FF			'ready to write 512 LUT longs
	wrlong	0,--ptra[##$800]	'write to PTRA-$800, update PTRA
I think I'm NOT going to make any changes for this SETQ(2)+RD/WRLONG functionality.

Saving one instruction could be critical.

I can't see that AUGS is an alternative because the whole point is to save two cycles in a loop, which could end up saving 100s if the eggbeater slice could be hit one rotation earlier.

There seems to be an existing fix for adjusting ptrx correctly when using index, but we can't use that for the soft HDMI. Could ptrx be incremented correctly when index is not used?

No change is a real disappointment. Also the pointer value on exit is not what users will expect.

cgracey · 2018-11-09 12:19

TonyB_, do you realize that when/if these changes get added into the future silicon, real HDMI will be there, too, making digital video quite mindless? There will be no need for soft HDMI then. Am I missing something?

TonyB_ · 2018-11-09 12:29

cgracey wrote: »

TonyB_, do you realize that when/if these changes get added into the future silicon, real HDMI will be there, too, making digital video quite mindless? There will be no need for soft HDMI then. Am I missing something?

The soft HDMI is just an example but there will be other applications that come up against the same problem. It's not about saving two cycles. In time-critical loops that involve hub reading/writing, a single instruction avoided might save hundreds of cycles.

If what I've requested is not doable then I accept that.

cgracey · 2018-11-09 15:45

TonyB_ wrote: »

cgracey wrote: »

TonyB_, do you realize that when/if these changes get added into the future silicon, real HDMI will be there, too, making digital video quite mindless? There will be no need for soft HDMI then. Am I missing something?

The soft HDMI is just an example but there will be other applications that come up against the same problem. It's not about saving two cycles. In time-critical loops that involve hub reading/writing, a single instruction avoided might save hundreds of cycles.

If what I've requested is not doable then I accept that.

I have been thinking more about it. It might be pretty easy to do. It would only allow for 0, 1, -1 index values, though, that would represent the entire transfer block size. Would that satisfy your needs?

TonyB_ · 2018-11-09 16:03

cgracey wrote: »

TonyB_ wrote: »

cgracey wrote: »

TonyB_, do you realize that when/if these changes get added into the future silicon, real HDMI will be there, too, making digital video quite mindless? There will be no need for soft HDMI then. Am I missing something?

The soft HDMI is just an example but there will be other applications that come up against the same problem. It's not about saving two cycles. In time-critical loops that involve hub reading/writing, a single instruction avoided might save hundreds of cycles.

If what I've requested is not doable then I accept that.

I have been thinking more about it. It might be pretty easy to do. It would only allow for 0, 1, -1 index values, though, that would represent the entire transfer block size. Would that satisfy your needs?

Chip, thanks for looking into this. If it would reduce the following to just SETQ2+WRLONG, then it would satisfy my needs. Block size is small enough for x to be a 9-bit immediate but too big to use the simple offset to adjust ptra:

	setq2	#x-1
	wrlong	#buf,ptra
	add	ptra,#x*4

EDIT:
Mistake corrected.

cgracey · 2018-11-09 16:10

TonyB_ wrote: »
cgracey wrote: »

TonyB_ wrote: »

cgracey wrote: »

TonyB_, do you realize that when/if these changes get added into the future silicon, real HDMI will be there, too, making digital video quite mindless? There will be no need for soft HDMI then. Am I missing something?

The soft HDMI is just an example but there will be other applications that come up against the same problem. It's not about saving two cycles. In time-critical loops that involve hub reading/writing, a single instruction avoided might save hundreds of cycles.

If what I've requested is not doable then I accept that.

I have been thinking more about it. It might be pretty easy to do. It would only allow for 0, 1, -1 index values, though, that would represent the entire transfer block size. Would that satisfy your needs?

Chip, thanks for looking into this. If it would reduce the following to just SETQ2+WRLONG, then it would satisfy my needs. Block size is small enough for x to be a 9-bit immediate but too big to use the simple offset to adjust ptra:
	setq2	#x-1
	wrlong	#buf,ptra
	add	ptra,#x

I think you meant: ADD PTRA,#x*4

With the change, you would just do:

	SETQ2	#x-1
	WRLONG	#buf,PTRA++

PTRA would be incremented by x*4.

TonyB_ · 2018-11-09 16:12

cgracey wrote: »

I think you meant:

ADD PTRA,#x*4

Yes, sorry for the confusion.

Cluso99 · 2018-11-09 19:31

cgracey wrote: »
TonyB_ wrote: »
cgracey wrote: »

TonyB_ wrote: »

cgracey wrote: »

TonyB_, do you realize that when/if these changes get added into the future silicon, real HDMI will be there, too, making digital video quite mindless? There will be no need for soft HDMI then. Am I missing something?

The soft HDMI is just an example but there will be other applications that come up against the same problem. It's not about saving two cycles. In time-critical loops that involve hub reading/writing, a single instruction avoided might save hundreds of cycles.

If what I've requested is not doable then I accept that.

I have been thinking more about it. It might be pretty easy to do. It would only allow for 0, 1, -1 index values, though, that would represent the entire transfer block size. Would that satisfy your needs?

Chip, thanks for looking into this. If it would reduce the following to just SETQ2+WRLONG, then it would satisfy my needs. Block size is small enough for x to be a 9-bit immediate but too big to use the simple offset to adjust ptra:
	setq2	#x-1
	wrlong	#buf,ptra
	add	ptra,#x
I think you meant: ADD PTRA,#x*4

With the change, you would just do:

SETQ2 #x-1
WRLONG #buf,PTRA++

PTRA would be incremented by x*4.

This saves 1 instruction (2 clocks), or 2 instructions if x*4 > 511, in a SETQ2+WRLONG loop that takes x*2+2 clocks. So for x = 100 (202 clocks) we save at most 4 clocks and 2 instructions that are executed once. Definitely not worth the trouble and risk IMHO.

TonyB_ · 2018-11-09 20:26

Cluso99 wrote: »

cgracey wrote: »

With the change, you would just do:

SETQ2 #x-1
WRLONG #buf,PTRA++

PTRA would be incremented by x*4.

This saves 1 instruction (2 clocks) or 2 instructions if x*4 > 511 in a SETQ2+WRLONG loop that takes x*2+2 clocks. So for 100 loops = 202 clocks we can save 4 clocks and 2 instructions that are executed once. Definitely not worth the trouble and risk IMHO.

In the case of interest, the SETQ2+WRLONG fast writes are inside a repeat block with lots of other instructions. Each fast write starts at the same eggbeater slice. Eliminating one instruction could mean hitting this slice 8 cycles earlier, or not. If the block is repeated 40 times the overall saving will be 314 cycles, or two.
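
One reading of the "314 cycles, or two" figures can be sketched in Python. The model below is an assumption made for illustration, not something stated in the post: every pass except the last is followed by another fast write waiting for the same slice, so it gains 8 cycles only if an earlier slice is reached, while the last pass saves just the 2 clocks of the removed instruction. The name `total_saving` is hypothetical.

```python
def total_saving(repeats, slot_gained, rotation=8, instr_clocks=2):
    """Clocks saved over a repeat block when one 2-clock instruction is removed.

    Assumed reading of the post: all passes but the last save a full
    rotation step (8 clocks) if an earlier hub slice is hit, otherwise
    nothing; the last pass saves only the instruction itself.
    """
    per_pass = rotation if slot_gained else 0
    return (repeats - 1) * per_pass + instr_clocks

print(total_saving(40, True))   # 314
print(total_saving(40, False))  # 2
```

Under this reading the benefit is all-or-nothing, which is why the same change can matter a lot in one loop and not at all in another.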

Cluso99 · 2018-11-09 21:31

TonyB_ wrote: »

Cluso99 wrote: »

cgracey wrote: »

With the change, you would just do:

SETQ2 #x-1
WRLONG #buf,PTRA++

PTRA would be incremented by x*4.

This saves 1 instruction (2 clocks) or 2 instructions if x*4 > 511 in a SETQ2+WRLONG loop that takes x*2+2 clocks. So for 100 loops = 202 clocks we can save 4 clocks and 2 instructions that are executed once. Definitely not worth the trouble and risk IMHO.

In the case of interest, the SETQ2+WRLONG fast writes are inside a repeat block with lots of other instructions. Each fast write starts at the same eggbeater slice. Eliminating one instruction could mean hitting this slice 8 cycles earlier, or not. If the block is repeated 40 times the overall saving will be 314 cycles, or two.

Are you sure that a REP can contain a SETQ2+WRLONG pair?

Cluso99 · 2018-11-09 21:53

I am very concerned about playing with the PTR options at this final stage in the process. A little bug here could render all the PTR modes useless. If the next silicon isn't golden there is a real possibility that there may never be a P2. And what you are after is a very specific case where the extra 2 clocks and an instruction may be important.

However, what you are asking for seems to not be the easy way to add it in. Let's look at this...

What I believe you really want is for the PTRA/PTRB to be updated with the next value at the completion of the SETQ+xxLONG loop?

If so, then a bit in the SETQ/SETQ2 instruction, or the value (eg b31), could be set to cause a write back of the internal updated pointer to PTRA/PTRB. This may be much easier to do, and would perhaps be useful generally.

Thoughts?

rogloh · 2018-11-09 21:58

If it worked that easily Cluso with the P2 pipeline etc, yes that might be a nice way to get the updated PTR register at the end.

Update: the problem is there does not seem to be such a bit in the SETQ/SETQ2 instruction, like WC etc., when I looked just now. Using the MSB seems a bit hacky.

TonyB_ · 2018-11-09 23:39

Cluso99 wrote: »

I am very concerned about playing with the PTR options at this final stage in the process. A little bug here could render all the PTR modes useless. If the next silicon isn't golden there is a real possibility that there may never be a P2. And what you are after is a very specific case where the extra 2 clocks and an instruction may be important.

However, what you are asking for seems to not be the easy way to add it in. Let's look at this...

What I believe you really want is for the PTRA/PTRB to be updated with the next value at the completion of the SETQ+xxLONG loop ?

If so, then a bit in the SETQ/SETQ2 instruction, or the value (eg b31), could be set to cause a write back of the internal updated pointer to PTRA/PTRB. This may be much easier to do, and would perhaps be useful generally.

Yes, I think PTRx should point to the next address automatically at the end of the fast block move. I don't mind how the change is done internally and I didn't actually suggest a way. Perhaps the SETQ(2) itself could act as the flag for this change? It would correct the current anomaly with PTRx incrementing by +4 no matter how big the block.

Cluso99 · 2018-11-09 23:54

Oh? Does PTRA/PTRB change after a SETQ + RDLONG X,PTRA/PTRB now?

rogloh · 2018-11-10 00:29

Yes, but only by one transfer if ptr++ is used, I think. It doesn't take into account the number of transfers at the moment.
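
Taking that description at face value, the software fix-up after a burst can be worked out as below. This is a sketch under the behaviour described in this thread (plain PTRx leaves the pointer alone, PTRx++ advances it by one long only); the name `fixup_after_burst` is made up for illustration.

```python
LONG_BYTES = 4

def fixup_after_burst(block_longs, used_post_increment):
    """Bytes to ADD to PTRx so it points just past a burst of block_longs.

    Assumes the behaviour described in this thread: plain PTRx leaves the
    pointer alone, while PTRx++ advances it by a single long (+4) no
    matter how long the burst is.
    """
    already_advanced = LONG_BYTES if used_post_increment else 0
    return block_longs * LONG_BYTES - already_advanced

print(fixup_after_burst(100, False))  # 400
print(fixup_after_burst(100, True))   # 396
```

With plain PTRA the fix-up is the familiar x*4.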

Cluso99 · 2018-11-10 01:30

I would expect that a write to update PTRx would need an extra clock.

ozpropdev · 2018-11-10 02:31

Cluso99 wrote: »

Are you sure that a REP can contain a SETQ2+WRLONG pair?

Sure can.
I have tried it and it works fine.

ozpropdev · 2018-11-10 02:33

Cluso99 wrote: »

I would expect that an update (write) to update PTRx would need an extra clock.

It already updates now, but not in the way most would expect.

Cluso99 · 2018-11-10 02:56

So the RDLONG takes 1 clock for each execution of the SETQ loop. But if there is an update to the PTR then it would require an extra clock for the PTR write. I guess this is why PTR doesn't get updated (beyond the first update, which is probably an artifact).

I wonder if the first SETQ takes an additional clock for that PTRx update?
