New Pin Instructions

TonyB_ · 2018-11-15 02:04

cgracey wrote: »

rogloh wrote: »

cgracey wrote: »

Thanks for commenting on this, Eric.

I would like to use the full PTRx expression rules for RDLUT and WRLUT, but it would mean that the LUT could only be immediately addressed from $000..$0FF. Addressing LUT locations $100..$1FF would require a PTRx expression or a register.

How do people feel about that?

Please no no no. For performance reasons in unrolled loops etc. we have found that we need to be able to read/write LUT registers directly using the immediate form. I guess the PTR increment helps somewhat, but not always, as we want not only faster sequential access but also random access. Case in point: the HDMI code, which shares work and writes at some addresses with one COG and at others with the other.

How many unique random addresses are you writing to in the LUT?

You could always use registers to hold those addresses if there aren't too many.

32-40, too many for registers, but all in the same half of the LUT.

cgracey wrote: »

Would it not be a net win to be able to do this: "WRLUT reg,PTRA++"?

This is typical code:

	rdlut	sprite_x_y,sprite_ptr
	add	sprite_ptr,#1 
	rdlut	sprite_pat,sprite_ptr	wc
	add	sprite_ptr,#1
	...
	rdlut	pixels,pixel_ptr
	add	pixel_ptr,#1
	...

cgracey · 2018-11-15 02:05

jmg wrote: »

cgracey wrote: »

No, it's the LUT registers that would be limited to $000..$0FF immediate addressing, not the cog registers.

I've been looking through my PASM code base and I never do RDLUT/WRLUT with immediate addressing. It's a register, if anything. Mainly, I just load code into the LUT using SETQ2+RDLONG.

Looking at the Verilog, it would be very easy to make RDLUT/WRLUT work with PTRx expressions. It's almost nothing to add.

I'm not quite following - do you mean immediate address of LUT has to go, or that it is limited to $000..$0FF immediate addressing - which is what, half the LUT ?
It seems the gain of PTRx expressions makes things more orthogonal and consistent. Access to half of LUT via immediate seems fine ?

I would say that users will expect PTRx access into all memory areas.

It just limits immediate addresses to the 1st half of the LUT.

PTRx access to cog registers is not possible, though. There are nice workarounds for indexed longs, words, bytes, nibbles, and bits. So there is no need for PTRx access in cog registers anyway.

cgracey · 2018-11-15 02:07

TonyB_ wrote: »
cgracey wrote: »

rogloh wrote: »

cgracey wrote: »

Thanks for commenting on this, Eric.

I would like to use the full PTRx expression rules for RDLUT and WRLUT, but it would mean that the LUT could only be immediately addressed from $000..$0FF. Addressing LUT locations $100..$1FF would require a PTRx expression or a register.

How do people feel about that?

Please no no no. For performance reasons in unrolled loops etc. we have found that we need to be able to read/write LUT registers directly using the immediate form. I guess the PTR increment helps somewhat, but not always, as we want not only faster sequential access but also random access. Case in point: the HDMI code, which shares work and writes at some addresses with one COG and at others with the other.

How many unique random addresses are you writing to in the LUT?

You could always use registers to hold those addresses if there aren't too many.

32-40, too many for registers, but all in the same half of the LUT.

cgracey wrote: »

Would it not be a net win to be able to do this: "WRLUT reg,PTRA++"?

This is typical code:
	rdlut	sprite_x_y,sprite_ptr
	add	sprite_ptr,#1 
	rdlut	sprite_pat,sprite_ptr	wc
	add	sprite_ptr,#1
	...
	rdlut	pixels,pixel_ptr
	add	pixel_ptr,#1
	...

Your code would definitely benefit, then, right?

rogloh · 2018-11-15 02:16

cgracey wrote: »

I don't see any other way. No WC possible in WRLUT, either.

Look at your code. I bet it's not a problem to implement this. Tell me what you see.

Yeah, I did see that it is not a problem for my example code right now, as our particular unrolled loops only randomly write to about 60 longs at a time before they are reused, and they could be set up to always remain in one half of the LUT RAM. I was thinking more generally. But if we need to lose a bit of the S register and half the immediately addressable LUT space to gain high speed sequential reads/writes into COG RAM then maybe it's a reasonable tradeoff. I really don't feel comfortable about it but I suspect people would probably learn to live with it.

With auto-incrementing pointers we can at least get a 1.66x to 2x speedup in block transfers in rep loops, though I haven't seen us need to do that (yet).
e.g.

rep #1, #16
wrlut reg, ptra++

1 COG->LUT transfer every 2 clocks instead of 4

rep #1, #16
rdlut reg, ptra++

1 LUT->COG transfer every 3 clocks instead of 5
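
The 1.66x to 2x range follows directly from the clock counts quoted above. A minimal Python check, using only those figures (the dictionary layout is just for illustration, and 5/3 is 1.666..., which the post truncates to 1.66):

```python
# Clocks per transfer quoted in the post: (without ptra++, with ptra++).
clocks = {
    "COG->LUT (wrlut)": (4, 2),
    "LUT->COG (rdlut)": (5, 3),
}

# Speedup is simply old clocks divided by new clocks.
speedups = {name: before / after for name, (before, after) in clocks.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

These ratios only hold if nothing else has to happen inside the loop body.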

cgracey · 2018-11-15 02:16

Unless anyone comes up with a compelling reason not to, I'm going to implement PTRx addressing for RDLUT/WRLUT. This will give relative addressing with automatic indexing. The only drawback is that you won't be able to immediately address the 2nd half of the LUT ($100..$1FF). This is going to be a big improvement. It involves almost nothing.
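
One way to picture the drawback is as a split of the 9-bit immediate S field. The sketch below assumes the same convention as the hub RDLONG/WRLONG immediates, where bit 8 clear means a literal address and bit 8 set means a PTRx expression. The name `classify_s_immediate` is made up for illustration, and the PTRx expression bits themselves are not decoded.

```python
def classify_s_immediate(s_field: int) -> tuple[str, int]:
    """Classify a 9-bit immediate S field under the assumed split."""
    if not 0 <= s_field <= 0x1FF:
        raise ValueError("S field must fit in 9 bits")
    if s_field & 0x100:
        # Assumption: bit 8 set selects a PTRx expression; the low 8 bits
        # carry the expression encoding, which is not decoded here.
        return ("ptr_expression", s_field & 0xFF)
    # Bit 8 clear: the low 8 bits are a literal LUT address $000..$0FF.
    return ("immediate", s_field)


# Literal addresses reachable without a register or PTRx expression.
reachable = [s for s in range(0x200) if classify_s_immediate(s)[0] == "immediate"]
print(f"${reachable[0]:03X}..${reachable[-1]:03X}, {len(reachable)} longs")
```

Under that assumption, half of the 512 LUT longs stay directly addressable, and $100..$1FF need a register or a PTRx expression.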

cgracey · 2018-11-15 02:18

rogloh wrote: »

cgracey wrote: »

I don't see any other way. No WC possible in WRLUT, either.

Look at your code. I bet it's not a problem to implement this. Tell me what you see.

Yeah, I did see that it is not a problem for my example code right now, as our particular unrolled loops only randomly write to about 60 longs at a time before they are reused, and they could be set up to always remain in one half of the LUT RAM. I was thinking more generally. But if we need to lose a bit of the S register and half the immediately addressable LUT space to gain high speed sequential reads/writes into COG RAM then maybe it's a reasonable tradeoff. I really don't feel comfortable about it but I suspect people would probably learn to live with it.

With auto-incrementing pointers we can at least get a 1.66x to 2x speedup in block transfers in rep loops, though I haven't seen us need to do that (yet).
e.g.

rep #1, #16
wrlut reg, ptra++

1 COG->LUT transfer every 2 clocks instead of 4

rep #1, #16
rdlut reg, ptra++

1 LUT->COG transfer every 3 clocks instead of 5

Let's sleep on it and see if any alarms go off. I suspect it'll be fine to do.

rogloh · 2018-11-15 02:24

Actually this code above doesn't really work, because the COG register being read or written doesn't change, so you still need to increment that too. That adds another instruction in the rep loop, so the speedup becomes only:
4 clocks instead of 6 for block writes to LUTRAM from COGRAM
5 clocks instead of 7 for block reads to COGRAM from LUTRAM

Not quite as high.
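
These revised figures are consistent with a simple per-instruction cost model. The sketch below assumes 2 clocks for WRLUT and for an ordinary instruction such as ADD, and 3 clocks for RDLUT. Those costs are inferred from the numbers in this thread, not taken from a datasheet, and `loop_clocks` is a hypothetical helper.

```python
# Assumed per-instruction costs, inferred from the figures in this thread.
CLOCKS = {"wrlut": 2, "rdlut": 3, "add": 2}


def loop_clocks(*instructions: str) -> int:
    """Sum the clocks of one pass through a REP loop body."""
    return sum(CLOCKS[name] for name in instructions)


# With ptra++ only the COG register address still needs its own increment.
# Without it, both the LUT pointer and the COG register address do.
write_ptr = loop_clocks("wrlut", "add")
write_reg = loop_clocks("wrlut", "add", "add")
read_ptr = loop_clocks("rdlut", "add")
read_reg = loop_clocks("rdlut", "add", "add")

print(f"writes: {write_ptr} vs {write_reg} clocks, {write_reg / write_ptr:.2f}x")
print(f"reads:  {read_ptr} vs {read_reg} clocks, {read_reg / read_ptr:.2f}x")
```

So the gain drops from the earlier 1.66x to 2x range to roughly 1.4x to 1.5x once the extra increment is counted.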

rogloh · 2018-11-15 02:33

Actually, doesn't the sample above now prove that we gain nothing from incrementing pointers for block transfers between LUT & COG RAMs, given that alti achieves the same number of clocks per transfer? E.g.

rep #2, #16
alti xxx,#yyy ' where xxx,yyy means increment the next instruction's D and S fields - too hard for me to figure values of xxx,yyy out quickly right now
rdlut 0-0, #0-0

Still takes 5 cycles for each transfer, excluding initial setup time.

If this is the case, then perhaps auto incrementing pointers during LUT access is slightly less useful than I thought.
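
Put as totals, the ALTI loop and the ptra++ loop come out level. This sketch assumes ALTI costs 2 clocks like an ordinary instruction and RDLUT costs 3, both inferred from the 5-cycle figure above and not from documentation. Setup time is excluded, as in the post.

```python
# Assumed costs in clocks, inferred from the figures quoted in this thread.
ALTI, RDLUT, EXTRA = 2, 3, 2

TRANSFERS = 16  # matches the rep count in the example

# ALTI steps both the D and S fields, so the loop body is alti + rdlut.
alti_loop = (ALTI + RDLUT) * TRANSFERS

# ptra++ steps the LUT address for free, but the COG register address
# still costs one extra instruction per pass.
ptr_loop = (RDLUT + EXTRA) * TRANSFERS

print(alti_loop, ptr_loop)
```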

Update: I guess if the number of transfers is small and you can unroll, it could still help, e.g. if you have code like this...

RDLUT reg, ptra++
RDLUT reg+1, ptra++
RDLUT reg+2, ptra++
RDLUT reg+3, ptra++

but then again, with immediates at known addresses the same can be done with this:

RDLUT reg, #100
RDLUT reg+1, #101
RDLUT reg+2, #102
RDLUT reg+3, #103
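
The equivalence of the two unrolled forms can be checked with a toy model. The sketch below treats LUT and COG RAM as plain Python lists and assumes `ptra++` post-increments by one long. `rdlut_imm` and `rdlut_ptr_postinc` are hypothetical helpers, and timing is not modeled.

```python
LUT_SIZE = 512

lut = [addr * 3 + 1 for addr in range(LUT_SIZE)]  # arbitrary test pattern


def rdlut_imm(cog: list[int], reg: int, addr: int) -> None:
    """Model RDLUT reg, #addr: copy one LUT long into a COG register."""
    cog[reg] = lut[addr]


def rdlut_ptr_postinc(cog: list[int], reg: int, state: dict) -> None:
    """Model RDLUT reg, ptra++: read at ptra, then advance ptra by one."""
    cog[reg] = lut[state["ptra"]]
    state["ptra"] += 1


cog_a = [0] * 4
cog_b = [0] * 4
state = {"ptra": 100}

for i in range(4):
    rdlut_ptr_postinc(cog_a, i, state)   # RDLUT reg+i, ptra++
    rdlut_imm(cog_b, i, 100 + i)         # RDLUT reg+i, #100+i

print(cog_a == cog_b, state["ptra"])
```

Both forms read the same four longs. The immediate form needs the addresses to be known in advance, while the pointer form works from whatever ptra holds at run time.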

cgracey · 2018-11-15 02:39

rogloh wrote: »

Actually, doesn't the sample above now prove that we gain nothing from incrementing pointers for block transfers between LUT & COG RAMs, given that alti achieves the same number of clocks per transfer? E.g.

rep #2, #16
alti xxx,#yyy ' where xxx,yyy means increment the next instruction's D and S fields - too hard for me to figure values of xxx,yyy out quickly right now
rdlut 0-0, #0-0

Still takes 5 cycles for a block transfer, excluding initial setup time.

If this is the case, then perhaps auto incrementing pointers during LUT access is slightly less useful than I thought.

Interesting. Yes, maybe this is not worth doing. Glad you pointed that out. Keep thinking about it, if there's anything left to think about. Simpler is sometimes better. Maybe things are already okay with how RDLUT/WRLUT work.

cgracey · 2018-11-15 02:40

rogloh wrote: »

Actually, doesn't the sample above now prove that we gain nothing from incrementing pointers for block transfers between LUT & COG RAMs, given that alti achieves the same number of clocks per transfer? E.g.

rep #2, #16
alti xxx,#yyy ' where xxx,yyy means increment the next instruction's D and S fields - too hard for me to figure values of xxx,yyy out quickly right now
rdlut 0-0, #0-0

Still takes 5 cycles for each transfer, excluding initial setup time.

If this is the case, then perhaps auto incrementing pointers during LUT access is slightly less useful than I thought.

Update: I guess if the number of transfers is small and you can unroll, it could still help, e.g. if you have code like this...

RDLUT reg, ptra++
RDLUT reg+1, ptra++
RDLUT reg+2, ptra++
RDLUT reg+3, ptra++

Good point. That would save two clocks per long.

rogloh · 2018-11-15 02:42

I also updated after you responded, it can also be achieved with the immediate form with the same savings.

cgracey · 2018-11-15 02:42

rogloh wrote: »

I also updated after you responded, it can also be achieved with the immediate form with the same savings.

Ah, of course. What do we do, then?

rogloh · 2018-11-15 02:44

Move to the next one in the list and we can keep thinking about the real benefits of pointers with LUT RAM before committing to further changes?

cgracey · 2018-11-15 02:44

rogloh wrote: »

Move to the next one in the list and we can keep thinking about the real benefits of pointers with LUT RAM before committing to further changes?

Good idea! Will do.

jmg · 2018-11-15 02:55

cgracey wrote: »

rogloh wrote: »

I also updated after you responded, it can also be achieved with the immediate form with the same savings.

Ah, of course. What do we do, then?

I would say this is well worth doing, for compilers and array handling especially. Fast copying of arrays between COG pairs is going to be very common.

TonyB_ wrote: »

32-40, too many for registers, but all in the same half of the LUT.

So that code is unaffected, because they are all (comfortably) in the same half of LUT. That makes the case for PTRx addressing stronger.

cgracey wrote: »

Unless anyone comes up with a compelling reason not to, I'm going to implement PTRx addressing for RDLUT/WRLUT. This will give relative addressing with automatic indexing. The only drawback is that you won't be able to immediately address the 2nd half of the LUT ($100..$1FF). This is going to be a big improvement. It involves almost nothing.

Sounds good to me.

Cluso99 · 2018-11-15 03:38

I use LUT for lookup tables so access is random.

I haven't found the need to have PTR increment, either with RD/WRxxxx or with RD/WRLUT. I wasn't even aware that PTR was being updated (incorrectly I might add). I have always needed to reload PTR with the address before using SETQ/SETQ2.

The gotcha I have found coming from P1 is the <hides from Peter> JMPRET instruction because of its 9-bit cog addresses. I'll leave this for another thread.

I cannot see the point in self-modifying instructions in LUT. There aren't any support instructions like SETS/D/I here.

Cluso99 · 2018-11-15 03:55

I am not seeing the benefit from this change.

However, I can live with the immediate only accessing the lower half of LUT, if it is necessary.

I am not seeing how RDLUT xx,PTRx++ can avoid requiring an extra clock, though. There is only 1 write port into the cog RAM. So there is the write of the LUT data to the cog RAM, and then there is the writeback of the PTRx++ value into the cog RAM. So either RDLUT xx,PTRx++ will take 1 clock more than RDLUT xx,PTRx, or RDLUT is going to take an extra clock either way and waste a clock when it doesn't need to.

TonyB_ · 2018-11-15 03:57

cgracey wrote: »

rogloh wrote: »

I also updated after you responded, it can also be achieved with the immediate form with the same savings.

Ah, of course. What do we do, then?

The immediate form won't work in REP blocks. Keep the change.

cgracey · 2018-11-15 03:58

Cluso99 wrote: »

I am not seeing the benefit from this change.

However, I can live with the immediate only accessing the lower half of LUT, if it is necessary.

I am not seeing how RDLUT xx,PTRx++ can avoid requiring an extra clock, though. There is only 1 write port into the cog RAM. So there is the write of the LUT data to the cog RAM, and then there is the writeback of the PTRx++ value into the cog RAM. So either RDLUT xx,PTRx++ will take 1 clock more than RDLUT xx,PTRx, or RDLUT is going to take an extra clock either way and waste a clock when it doesn't need to.

PTRx are maintained as registers, not RAM. That's why an extra clock is not required.

Electrodude · 2018-11-15 06:12

Can you keep full immediate LUT addressing and instead add four single-operand instructions equivalent to RD/WRLUT D, ptra++/--?
Reposted on RDLUT/WRLUT with auto-incrementing address thread.

Cluso99 · 2018-11-15 07:22

cgracey wrote: »

Cluso99 wrote: »

I am not seeing the benefit from this change.

However, I can live with the immediate only accessing the lower half of LUT, if it is necessary.

I am not seeing how RDLUT xx,PTRx++ can avoid requiring an extra clock, though. There is only 1 write port into the cog RAM. So there is the write of the LUT data to the cog RAM, and then there is the writeback of the PTRx++ value into the cog RAM. So either RDLUT xx,PTRx++ will take 1 clock more than RDLUT xx,PTRx, or RDLUT is going to take an extra clock either way and waste a clock when it doesn't need to.

PTRx are maintained as registers, not RAM. That's why an extra clock is not required.

Aha. Good to know.
