RDLUT/WRLUT with auto-incrementing address
TonyB_
Posts: 2,196
in Propeller 2
As things stand, there is going to be an awful lot of this:
        rdlut   lut_data,lut_ptr        ' or wrlut
        add     lut_ptr,#1
Would it be possible to post-increment PTRA/PTRB if S = $1F8/1F9 in RDLUT/WRLUT? The above code would then be
rdlut lut_data,ptra ' or wrlut/ptrb
For the software-only 640x480x256 HDMI, this would reduce a loop of 33 cycles to 31, a saving of 6%, and shorter loops would benefit more. The TMDS encoding for rev B will be done in hardware, but that won't eliminate the wish/need to save every cycle possible when using the two LUT instructions.
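As a rough check on the 6% figure, the saving from dropping the separate ADD can be worked out directly. The sketch below is only arithmetic on the 33- and 31-cycle loop lengths quoted in the post. The function name `loop_saving` is made up for illustration, and the 8-cycle loop is a hypothetical "shorter loop", not a measured one.

```python
def loop_saving(cycles_before, cycles_after):
    """Return the fraction of loop time saved by the shorter loop."""
    return (cycles_before - cycles_after) / cycles_before

# Figures quoted in the post: 33 cycles down to 31.
hdmi_saving = loop_saving(33, 31)

# Hypothetical shorter loop (assumed numbers, for comparison only).
short_saving = loop_saving(8, 6)

print(f"33 -> 31 cycles: {hdmi_saving:.1%}")
print(f" 8 ->  6 cycles: {short_saving:.1%}")
```

The same two-cycle saving is a larger share of a shorter loop, which is the point of "shorter loops would benefit more".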
Comments
There's already the setq/rdlut combination for quickly filling the LUT, which is probably the most common use case.
Is incrementing ptrx every write not possible?
Yes, that was my intention.
Yes, I agree, you do not want any invisible ++, so the assembler needs to be explicit here.
I'm sure it's possible. It just hurts to think about. You are talking about SETQ(2)+RD/WRLONG cases, right?
Yes, the fast block moves, read and write. Again it's to avoid an extra instruction to adjust the pointer.
If both of my suggestions here were implemented, that would save 400 cycles per line for soft 640x480x256 HDMI and there are only 8000 in total.
EDIT:
~7500 of the 8000 have been used, so a saving of 400 would almost double the time available for other stuff.
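The "almost double" claim follows from the numbers given here. This is a minimal sketch using only the three figures in the post (8000 cycles per line, about 7500 used, 400 saved); the constant names are invented for illustration.

```python
CYCLES_PER_LINE = 8000  # total cycles per scan line, from the post
CYCLES_USED = 7500      # approximate current usage, from the EDIT
CYCLES_SAVED = 400      # claimed saving if both suggestions were implemented

free_before = CYCLES_PER_LINE - CYCLES_USED   # cycles left for other work today
free_after = free_before + CYCLES_SAVED       # cycles left after the saving
ratio = free_after / free_before              # growth in spare time

print(f"free before: {free_before}, free after: {free_after}, ratio: {ratio}")
```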
I think that it's absolutely critical that the instruction set is stable, and also perceived as stable, during this next period where we test with real silicon.
What is being proposed is 'nice' and undoubtedly useful, but the action involved undermines testing (which should uncover other issues needing attention). We've seen this play out several times already, most recently with Peter and Ray's ROM testing.
There will likely be an opportunity to gather up and slip this stuff in, but this *really* has to be after a good number of people have tested with the P2 evaluation boards.
Umm... so in order to make things easier for programmers, we're going to require them to remember an additional special case? This really seems dubious to me. The more special cases, exceptions, etc. in the instruction set, the harder the chip is to program.
Moreover, PTRA and PTRB aren't just any registers. They're very useful (because there are many addressing modes for these in rdlong/wrlong/rdbyte/wrbyte/etc.) and so likely to be heavily used. Hence a lot of people are likely to be tripped up by instructions using them in a non-standard way.
Worse, tools are going to be made more complicated. Every compiler for the P2 is going to have to special case these instructions, and it's going to be a real pain because PTRA and/or PTRB are going to be actively in use (e.g. fastspin uses PTRA for the stack pointer) and so code is going to be generated using these.
I *strongly* oppose this change. Please, if you're going to make a register auto-increment, use a new instruction or set some bit in the instruction to indicate that it's being handled differently. Or use a new hardware register.
PTRA/B can get modified in other instructions, so why not here too?
What are the chances you'd want the same PTRA value for both hub/cog and LUT access?
There are no spare instructions slots now, nor spare opcode bits in RD/WRLUT. You're right, PTRA/B are special registers and if compilers use them already then other registers would have to be used for RD/WRLUT which would execute exactly the same as now.
(b) PTRB would increment by 4.
Thanks for the info. +4 is not what users would expect at all and it needs fixing. When one orders ++, one expects to be served ++ !
Pretty high? If PTRA is the stack pointer, it's pretty likely that you might want to copy a value from the HUB stack to LUT. Or if you're copying from an array in HUB to LUT, you might want to use PTRB to index the array.
Again, the general idea (have a way to auto-increment a pointer that's reading into LUT) is not a bad one. But silently modifying a pointer that the user hasn't asked to modify, and that no other instruction modifies? That's ugly. Do *something* to indicate this behavior is special. Use the wc bit on the instruction to indicate an auto-increment. Or write a magic value with SETQ or SETQ2 sometime before the copy loop. Or, have some new registers which always auto-increment when dereferenced.
I'm a bit foggy on this HDMI stuff.
It would help the soft HDMI considerably but that's just an example, a good one mind you. It would help with hard HDMI and sprites. It would help anywhere and everywhere. Don't forget RDLUT takes three cycles, not two.
Ignoring soft HDMI (I think we can live with the LUT addressing limitations), I think block transfers of data between HUB and COGRAM should be optimized if possible with the pointer registers. This will be a common thing to do for lots of code reading and writing data in bursts. This is a form of memcpy / memmove between memory regions and it will be widely used.
So I am more concerned with SETQ and PTRA++/PTRB++ use with WRLONG / RDLONG not incrementing the PTRA/PTRB fully, if that is how it behaves now. I would have expected PTRA would increment by the number of transfers requested*size if you use some ++ form. If it is just by one transfer that seems unexpected.
The block move also has to handle the immediate #S scenario
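To make the two behaviours being compared concrete, here is a small model of where PTRx ends up after a SETQ block transfer of longs with PTRx++. It is a sketch, not a description of the silicon: it assumes the usual P2 convention that a SETQ value of N requests N+1 long transfers, the "one transfer only" case is the rev A behaviour as reported in this thread rather than something verified here, and the function name `ptr_after_block` is hypothetical.

```python
LONG_BYTES = 4  # hub addresses are byte addresses; one long is 4 bytes

def ptr_after_block(ptr, setq_value, whole_block=True):
    """Model PTRx after SETQ + RDLONG/WRLONG using PTRx++.

    setq_value  -- value given to SETQ (assumed: transfers = setq_value + 1)
    whole_block -- True:  advance by the full block (the expectation above)
                   False: advance by one long only (as reported for rev A)
    """
    transfers = setq_value + 1
    step = transfers * LONG_BYTES if whole_block else LONG_BYTES
    return ptr + step

# A 16-long transfer starting at hub address $1000.
expected = ptr_after_block(0x1000, 15)
reported = ptr_after_block(0x1000, 15, whole_block=False)
print(hex(expected), hex(reported))
```

In the second case an extra ADD would be needed after each block move to bring the pointer to the end of the block, which is the instruction the request is trying to avoid.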
Well, we have to live with how things are for rev A, but not for rev B! The biggest cycle saving (320) in our 640x480x256 loop would be due to the RDLUT address auto-incrementing and from the timing viewpoint I'm much less concerned about SETQ+WRLONG incrementing PTRx correctly as the cycle saving is a lot less (64-80 probably).
This is not about one particular application, though.
Perhaps POPLUT/PUSHLUT would be a good instruction pair to have, and it could be leveraged when reading LUT data to give automatic address incrementing, but I'm not advocating it hard right now. A read using POPLUT could potentially take just 3 cycles instead of 5, with the following increment operation no longer required.
Tautological but it saves two cycles. Auto-increment RD/WRLUT with PTRx and we're done!
Yeah, I was coming at it somewhat less from a performance aspect and more from a general expectation of what the opcodes would be expected to do. If you set up X transfers and ask for auto-increment with PTRA++, in my view you'd kind of want to increment by the actual number of transfers done, not just one transfer's worth. I see ozpropdev found something useful to help improve performance, but that may only work for fixed sizes. Transfer sizes might depend on the application and be totally dynamic, even at run time, as SETQ takes register values as well as constants.
But if this is not something simple to resolve, it shouldn't go into rev B, far too risky.