RDLUT/WRLUT with auto-incrementing address

TonyB_ · 2018-11-08 18:31

As things stand, there is going to be an awful lot of this

	rdlut	lut_data,lut_ptr	' or wrlut
	add	lut_ptr,#1

Would it be possible to post-increment PTRA/PTRB if S = $1F8/1F9 in RDLUT/WRLUT? The above code would then be

	rdlut	lut_data,ptra		' or wrlut/ptrb

For the software-only 640x480x256 HDMI, this would reduce a loop of 33 cycles to 31, a saving of 6% and shorter loops would benefit more. The TMDS encoding for rev B will be done in hardware, but that won't eliminate the wish/need to save every cycle possible when using the two LUT instructions.
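
Editor's sketch, not part of the original post: the two forms side by side with cycle counts, assuming RDLUT takes 3 cycles and ADD takes 2. The second form shows the behaviour proposed here, not something the hardware does at this point in the thread.

```pasm2
' Today: 5 cycles per LUT read
		rdlut	lut_data,lut_ptr	' 3 cycles: read the long at LUT address lut_ptr
		add	lut_ptr,#1		' 2 cycles: step to the next LUT long

' Proposed (hypothetical): 3 cycles per LUT read
		rdlut	lut_data,ptra		' 3 cycles: read, then hardware adds 1 to PTRA

' Loop time: 33 - 2 = 31 cycles, and 2/33 = 6.06%, the "6%" quoted above
```

The same 2 cycles weigh more in a shorter loop: dropping the ADD from a 10-cycle loop, for instance, would save 20%.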

cgracey · 2018-11-08 20:48

I will look into that today.

ersmith · 2018-11-08 20:58

Whoa! What if the user doesn't want to post-increment? That should be something explicitly requested in the source code via a "ptra++". Which means having some kind of flag in the instruction encoding (like the one we already have for rdbyte/wrbyte/etc.). In principle this would be a nice thing, but do we really want to be making such big changes this late in the game?

There's already the setq/rdlut combination for quickly filling the LUT, which is probably the most common use case.

cgracey · 2018-11-08 21:05

The rule would be that if PTRA is used as the address in RD/WRLUT, it would always increment. Pretty low risk. Just something for people to be aware of as they program.

TonyB_ · 2018-11-08 21:10

Thanks, Chip. People could use the 510 other registers to avoid the post-increment. The use-case here is read a long from LUT, process it, then read the next long, etc. (or process, then write, etc.) in a loop as fast as possible.

David Betz · 2018-11-08 21:11

There is no chance someone might be using PTRA with RD/WRLUT but not want the auto increment feature? At the very least the assembler should probably require a "++" for this instruction form and issue an error if it is not present even though it doesn't actually change the binary encoding of the instruction.

TonyB_ · 2018-11-08 21:14

Related to the soft HDMI, will the following code (a) work and (b) add 40 to ptrb?

	setq2	#10-1
	wrlong	first_lut,ptrb++

cgracey · 2018-11-08 21:24

I think it will just add one.

cgracey · 2018-11-08 21:31

The rule should be that if either PTRA or PTRB are used by RD/WRLUT, they will auto-increment.

TonyB_ · 2018-11-08 21:31

cgracey wrote: »
TonyB_ wrote: »
Related to the soft HDMI, will the following code (a) work and (b) add 40 to ptrb?
	setq2	#10-1
	wrlong	first_lut,ptrb++
I think it will just add one.

Is incrementing ptrx every write not possible?

TonyB_ · 2018-11-08 21:35

cgracey wrote: »

The rule should be that if either PTRA or PTRB are used by RD/WRLUT, they will auto-increment.

Yes, that was my intention.

jmg · 2018-11-08 21:35

David Betz wrote: »

There is no chance someone might be using PTRA with RD/WRLUT but not want the auto increment feature? At the very least the assembler should probably require a "++" for this instruction form and issue an error if it is not present even though it doesn't actually change the binary encoding of the instruction.

Yes, I agree, you do not want any invisible ++ , so the Assembler needs to be explicit here.

cgracey · 2018-11-08 21:35

TonyB_ wrote: »

Is incrementing ptrx every long not possible?

I'm sure it's possible. It just hurts to think about. You are talking about SETQ(2)+RD/WRLONG cases, right?

TonyB_ · 2018-11-08 21:55

cgracey wrote: »

TonyB_ wrote: »

Is incrementing ptrx every long not possible?

I'm sure it's possible. It just hurts to think about. You are talking about SETQ(2)+RD/WRLONG cases, right?

Yes, the fast block moves, read and write. Again it's to avoid an extra instruction to adjust the pointer.

If both of my suggestions here were implemented, that would save 400 cycles per line for soft 640x480x256 HDMI and there are only 8000 in total.

EDIT:
~7500 of the 8000 have been used, so a saving of 400 would almost double the time available for other stuff.

Tubular · 2018-11-08 21:55

ersmith wrote: »

In principle this would be a nice thing, but do we really want to be making such big changes this late in the game?

I think that it's absolutely critical that the instruction set is stable, and also perceived as stable, during this next period where we test with real silicon.

What is being proposed is 'nice' and undoubtedly useful, but the action involved undermines testing (which should uncover other issues needing attention). We've seen this play out several times already, most recently with Peter and Ray's ROM testing.

There will likely be an opportunity to gather up and slip this stuff in, but this *really* has to be after a good number of people have tested with the P2 evaluation boards.

ersmith · 2018-11-08 23:39

cgracey wrote: »

The rule would be that if PTRA is used as the address in RD/WRLUT, it would always increment. Pretty low risk. Just something for people to be aware of as they program.

Umm... so in order to make things easier for programmers, we're going to require them to remember an additional special case? This really seems dubious to me. The more special cases, exceptions, etc. in the instruction set, the harder the chip is to program.

Moreover, PTRA and PTRB aren't just any registers. They're very useful (because there are many addressing modes for these in rdlong/wrlong/rdbyte/wrbyte/etc.) and so likely to be heavily used. Hence a lot of people are likely to be tripped up by instructions using them in a non-standard way.

Worse, tools are going to be made more complicated. Every compiler for the P2 is going to have to special case these instructions, and it's going to be a real pain because PTRA and/or PTRB are going to be actively in use (e.g. fastspin uses PTRA for the stack pointer) and so code is going to be generated using these.

I *strongly* oppose this change. Please, if you're going to make a register auto-increment, use a new instruction or set some bit in the instruction to indicate that it's being handled differently. Or use a new hardware register.

Rayman · 2018-11-08 23:47

I think I'm OK with this one change...

PTRA/B can get modified in other instructions, so why not here too.

What are the chances you'd want the same PTRA value for both hub/cog and LUT access?

TonyB_ · 2018-11-08 23:52

ersmith wrote: »

cgracey wrote: »

The rule would be that if PTRA is used as the address in RD/WRLUT, it would always increment. Pretty low risk. Just something for people to be aware of as they program.

Umm... so in order to make things easier for programmers, we're going to require them to remember an additional special case? This really seems dubious to me. The more special cases, exceptions, etc. in the instruction set, the harder the chip is to program.

Moreover, PTRA and PTRB aren't just any registers. They're very useful (because there are many addressing modes for these in rdlong/wrlong/rdbyte/wrbyte/etc.) and so likely to be heavily used. Hence a lot of people are likely to be tripped up by instructions using them in a non-standard way.

Worse, tools are going to be made more complicated. Every compiler for the P2 is going to have to special case these instructions, and it's going to be a real pain because PTRA and/or PTRB are going to be actively in use (e.g. fastspin uses PTRA for the stack pointer) and so code is going to be generated using these.

I *strongly* oppose this change. Please, if you're going to make a register auto-increment, use a new instruction or set some bit in the instruction to indicate that it's being handled differently. Or use a new hardware register.

There are no spare instructions slots now, nor spare opcode bits in RD/WRLUT. You're right, PTRA/B are special registers and if compilers use them already then other registers would have to be used for RD/WRLUT which would execute exactly the same as now.

ozpropdev · 2018-11-08 23:54

TonyB_ wrote: »
Related to the soft HDMI, will the following code (a) work and (b) add 40 to ptrb?
	setq2	#10-1
	wrlong	first_lut,ptrb++

(b) PTRB would increment by 4.

TonyB_ · 2018-11-08 23:57

ozpropdev wrote: »
TonyB_ wrote: »
Related to the soft HDMI, will the following code (a) work and (b) add 40 to ptrb?
	setq2	#10-1
	wrlong	first_lut,ptrb++
(b) PTRB would increment by 4.

Thanks for the info. +4 is not what users would expect at all and it needs fixing. When one orders ++, one expects to be served ++ !

ersmith · 2018-11-08 23:58

Rayman wrote: »

PTRA/B can get modified in other instructions, so why not here too.

Not silently... if they are modified it's explicit (the programmer asks for the modification by writing ++ or -- before or after the PTRA/B reference).

What are the chances you'd want the same PTRA value for both hub/cog and LUT access?

Pretty high? If PTRA is the stack pointer, it's pretty likely that you might want to copy a value from the HUB stack to LUT. Or if you're copying from an array in HUB to LUT, you might want to use PTRB to index the array.

Again, the general idea (have a way to auto-increment a pointer that's reading into LUT) is not a bad one. But silently modifying a pointer that the user hasn't asked to modify, and that no other instruction modifies? That's ugly. Do *something* to indicate this behavior is special. Use the wc bit on the instruction to indicate an auto-increment. Or write a magic value with SETQ or SETQ2 sometime before the copy loop. Or, have some new registers which always auto-increment when dereferenced.
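
Editor's sketch, not part of the original post: what the objection looks like in source code. The registers tmp and lut_idx are hypothetical names, and the last instruction shows the proposed rule, not existing behaviour.

```pasm2
' PTRB walks an array of longs in hub RAM
		rdlong	tmp,ptrb++		' explicit ++: hub read, then PTRB += 4
		wrlut	tmp,lut_idx		' ordinary register as LUT address: only the LUT changes

' Under the proposed rule this line would also change PTRB,
' although no ++ appears anywhere in it:
		wrlut	tmp,ptrb		' LUT write, then a silent PTRB += 1 (proposal only)

tmp		res	1			' hypothetical scratch register
lut_idx		res	1			' hypothetical LUT address register
```

Nothing in the last WRLUT tells a reader, or a compiler that keeps PTRA/PTRB live, that the pointer has moved.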

ozpropdev · 2018-11-08 23:58

Is the RDLUT/WRLUT post increment only needed to achieve a soft HDMI, and not needed for the streamer HDMI mode?
I'm a bit foggy on this HDMI stuff.

TonyB_ · 2018-11-09 00:08

ozpropdev wrote: »

Is the RDLUT/WRLUT post increment only needed to achieve a soft HDMI, and not needed for the streamer HDMI mode?
I'm a bit foggy on this HDMI stuff.

It would help the soft HDMI considerably but that's just an example, a good one mind you. It would help with hard HDMI and sprites. It would help anywhere and everywhere. Don't forget RDLUT takes three cycles, not two.

rogloh · 2018-11-09 00:25

TonyB_ wrote: »
ozpropdev wrote: »
TonyB_ wrote: »
Related to the soft HDMI, will the following code (a) work and (b) add 40 to ptrb?
	setq2	#10-1
	wrlong	first_lut,ptrb++
(b) PTRB would increment by 4.
Thanks for the info. +4 is not what users would expect at all and it needs fixing. When one orders ++, one expects to be served ++ !

Ignoring soft HDMI (I think we can live with the LUT addressing limitations), I think block transfers of data between HUB and COGRAM should be optimized if possible with the pointer registers. This will be a common thing to do for lots of code reading and writing data in bursts. This is a form of memcpy / memmove between memory regions and it will be widely used.

So I am more concerned with SETQ and PTRA++/PTRB++ use with WRLONG / RDLONG not incrementing the PTRA/PTRB fully, if that is how it behaves now. I would have expected PTRA would increment by the number of transfers requested*size if you use some ++ form. If it is just by one transfer that seems unexpected.

ozpropdev · 2018-11-09 00:51

My guess is the block move function is simply setting its own internal pointer to #/S.

The block move also has to handle the immediate #S scenario

	SETQ2	#10-1
	RDLONG	first_lut,#$20

TonyB_ · 2018-11-09 00:58

rogloh wrote: »
TonyB_ wrote: »
ozpropdev wrote: »
TonyB_ wrote: »
Related to the soft HDMI, will the following code (a) work and (b) add 40 to ptrb?
	setq2	#10-1
	wrlong	first_lut,ptrb++
(b) PTRB would increment by 4.
Thanks for the info. +4 is not what users would expect at all and it needs fixing. When one orders ++, one expects to be served ++ !
Ignoring soft HDMI (I think we can live with the LUT addressing limitations), I think block transfers of data between HUB and COGRAM should be optimized if possible with the pointer registers. This will be a common thing to do for lots of code reading and writing data in bursts. This is a form of memcpy / memmove between memory regions and it will be widely used.

So I am more concerned with SETQ and PTRA++/PTRB++ use with WRLONG / RDLONG not incrementing the PTRA/PTRB fully, if that is how it behaves now. I would have expected PTRA would increment by the number of transfers requested*size if you use some ++ form. If it is just by one transfer that seems unexpected.

Well, we have to live with how things are for rev A, but not for rev B! The biggest cycle saving (320) in our 640x480x256 loop would be due to the RDLUT address auto-incrementing and from the timing viewpoint I'm much less concerned about SETQ+WRLONG incrementing PTRx correctly as the cycle saving is a lot less (64-80 probably).

This is not about one particular application, though.

ozpropdev · 2018-11-09 01:12

Actually thinking about it you can do this little trick.

		setq2	#10-1
		rdlong	0,ptrb++[10]

PTRB increments +40.

rogloh · 2018-11-09 01:17

Rev B probably won't need to worry about TMDS encoding for graphics portions of the line, though soft HDMI in general will still be useful for that device too and would remain.

Perhaps POPLUT/PUSHLUT would be a good instruction pair to have, and could be leveraged when reading LUT data to give automatic address incrementing, but I'm not advocating it hard right now. A read using POPLUT could potentially take just 3 cycles instead of 5, with the following increment operation no longer required.

TonyB_ · 2018-11-09 01:31

ozpropdev wrote: »
Actually thinking about it you can do this little trick.
		setq2	#10-1
		rdlong	0,ptrb++[10]
PTRB increments +40.

Tautological but it saves two cycles. Auto-increment RD/WRLUT with PTRx and we're done!

rogloh · 2018-11-09 02:07

Cool ozpropdev, so you actually tested this on real P2 HW and it behaves that way?

rogloh · 2018-11-09 02:21

TonyB_ wrote: »

This is not about one particular application, though.

Yeah, I was coming at it less from a performance aspect and more from a general expectation of what the opcodes should do. If you set up X transfers and ask for auto-increment with PTRA++, in my view you'd want to increment by the actual number of transfers done, not just one transfer's worth. I see ozpropdev found something useful to help improve performance, but that may only work for fixed sizes. Transfer sizes might depend on the application and be totally dynamic, even at run time, as setq takes register values as well as constants.

But if this is not something simple to resolve, it shouldn't go into rev B, far too risky.
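
Editor's sketch, not part of the original post: a block read whose length is only known at run time, with the pointer advanced by hand. The registers count and count_m1 are hypothetical names. The sketch assumes SETQ2 takes the number of longs minus 1 (as in the #10-1 examples above) and must sit directly before the RDLONG.

```pasm2
' count = number of longs to copy from hub to LUT, known only at run time
		mov	count_m1,count
		sub	count_m1,#1		' SETQ2 wants longs-1
		setq2	count_m1		' must immediately precede the RDLONG
		rdlong	0,ptrb			' block read: hub[PTRB...] -> LUT[0...], PTRB unchanged
		shl	count,#2		' longs -> bytes (overwrites count)
		add	ptrb,count		' advance PTRB past the block just read

count		res	1			' hypothetical: longs to transfer
count_m1	res	1			' hypothetical: scratch for longs-1
```

The SHL and ADD are the extra instructions that a PTRB++ covering the whole block would make unnecessary. The ptrb++[10] trick removes them only when the length is a constant.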
