I would like to use the full PTRx expression rules for RDLUT and WRLUT, but it would mean that the LUT could only be immediately addressed from $000..$0FF. Addressing LUT locations $100..$1FF would require a PTRx expression or a register.
How do people feel about that?
Please no no no. For performance reasons in unrolled loops etc. we have found that we need to be able to read/write LUT registers directly using the immediate form. I guess the PTR increment helps somewhat, but not always, as we don't only want sequential access at higher speed but also random access. Case in point: the HDMI code that shares the work, writing some addresses with one COG and others with the other.
How many unique random addresses are you writing to in the LUT?
You could always use registers to hold those addresses if there aren't too many.
32-40, too many for registers, but all in the same half of the LUT.
No, it's the LUT registers that would be limited to $000..$0FF immediate addressing, not the cog registers.
I've been looking through my PASM code base and I never do RDLUT/WRLUT with immediate addressing. It's a register, if anything. Mainly, I just load code into the LUT using SETQ2+RDLONG.
Looking at the Verilog, it would be very easy to make RDLUT/WRLUT work with PTRx expressions. It's almost nothing to add.
I'm not quite following - do you mean immediate addressing of the LUT has to go, or that it is limited to $000..$0FF immediate addressing - which is what, half the LUT?
It seems the gain of PTRx expressions makes things more orthogonal and consistent. Access to half of the LUT via immediate seems fine?
I would say that users will expect PTRx access into all memory areas.
It just limits immediate addresses to the 1st half of the LUT.
PTRx access to cog registers is not possible, though. There are nice workarounds for indexed longs, words, bytes, nibbles, and bits. So, no need for PTRx access in cog registers, anyway.
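The trade-off being discussed can be sketched in a few lines of Python. This is only a model of the encoding argument, not the actual Verilog or instruction encoding: it assumes the S field is 9 bits wide and that enabling PTRx expressions costs exactly one of those bits, which is what would leave $000..$0FF for immediates. The names `immediate_range` and `needs_register_or_ptr` are made up for illustration.

```python
LUT_LONGS = 0x200      # LUT spans $000..$1FF, per the thread
S_FIELD_BITS = 9       # assumption: S is a 9-bit field


def immediate_range(ptr_expressions: bool) -> range:
    """LUT addresses reachable through an immediate S operand."""
    # Assumption: PTRx expressions take one S bit as a selector,
    # leaving one bit fewer for the immediate address.
    bits = S_FIELD_BITS - 1 if ptr_expressions else S_FIELD_BITS
    return range(1 << bits)


def needs_register_or_ptr(addr: int, ptr_expressions: bool) -> bool:
    """True if addr cannot be encoded as an immediate."""
    if not 0 <= addr < LUT_LONGS:
        raise ValueError("LUT address out of range")
    return addr not in immediate_range(ptr_expressions)


if __name__ == "__main__":
    for addr in (0x000, 0x0FF, 0x100, 0x1FF):
        blocked = needs_register_or_ptr(addr, ptr_expressions=True)
        print(f"${addr:03X}: needs register/PTRx = {blocked}")
```

Under these assumptions, code that keeps its randomly accessed longs in one half of the LUT can place them in $000..$0FF and keep using immediates.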
Unless anyone comes up with a compelling reason not to, I'm going to implement PTRx addressing for RDLUT/WRLUT. This will give relative addressing with automatic indexing. The only drawback is that you won't be able to immediately address the 2nd half of the LUT ($100..$1FF). This is going to be a big improvement. It involves almost nothing.
I don't see any other way. No WC possible in WRLUT, either.
Look at your code. I bet it's not a problem to implement this. Tell me what you see.
Yeah, I did see it is not a problem for my example code right now, as our particular unrolled loops only randomly write to about 60 longs at a time before they are reused, and they could be set up to always remain in one half of the LUT RAM. I was thinking more generally. But if we need to lose a bit of the S register and half the immediately addressable LUT space to gain high speed sequential reads/writes into COG RAM then maybe it's a reasonable tradeoff. I really don't feel comfortable about it but I suspect people would probably learn to live with it.
With auto-incrementing pointers we can at least get a 1.66x to 2x speedup on block transfers in REP loops, though I haven't seen us need to do that (yet).
e.g.
rep #1, #16
wrlut reg, ptra++
1 COG->LUT transfer every 2 clocks instead of 4
rep #1, #16
rdlut reg, ptra++
1 LUT->COG transfer every 3 clocks instead of 5
Let's sleep on it and see if any alarms go off. I suspect it'll be fine to do.
Actually this code above doesn't really work, because the COG register being read or written doesn't change, so you still need to increment that too. That adds another instruction in the REP loop, so the speedup becomes only:
4 clocks instead of 6 for block writes to LUTRAM from COGRAM
5 clocks instead of 7 for block reads to COGRAM from LUTRAM
Actually, doesn't this sample above now prove that we gain nothing from incrementing pointers for block transfers between LUT and COG RAM, if ALTI achieves the same number of clocks for each transfer? E.g.
rep #2, #16
alti xxx,#yyy ' xxx,yyy are placeholders meaning "increment the next instruction's D and S fields" - too hard for me to figure the values out quickly right now
rdlut 0-0, #0-0
Still takes 5 cycles for each transfer, excluding initial setup time.
If this is the case, then perhaps auto incrementing pointers during LUT access is slightly less useful than I thought.
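To make the arithmetic in these posts easier to check, here is a rough Python tally. It is a sketch, not a timing reference: the 2-clock WRLUT and 3-clock RDLUT figures are the ones quoted above, while the 2-clock cost of each additional instruction (an increment or ALTI) is inferred from the quoted totals. `clocks_per_long` and `burst_clocks` are hypothetical helper names.

```python
# Clock counts as quoted in the thread.
LUT_OP_CLOCKS = {"wrlut": 2, "rdlut": 3}
# Inferred from the quoted totals: every other instruction in the
# REP body (an increment, or ALTI) adds 2 clocks.
PLAIN_CLOCKS = 2


def clocks_per_long(op: str, extra_instructions: int) -> int:
    """Clocks per transferred long for one pass of the REP body."""
    return LUT_OP_CLOCKS[op] + extra_instructions * PLAIN_CLOCKS


def burst_clocks(op: str, extra_instructions: int, longs: int = 16) -> int:
    """Total clocks for a burst, excluding initial setup."""
    return longs * clocks_per_long(op, extra_instructions)


if __name__ == "__main__":
    for op in ("wrlut", "rdlut"):
        ptr_fixed_reg = clocks_per_long(op, 0)  # ptra++, cog register fixed
        ptr_plus_inc = clocks_per_long(op, 1)   # ptra++ plus cog-register increment
        two_incs = clocks_per_long(op, 2)       # separate increments for both addresses
        alti_loop = clocks_per_long(op, 1)      # one ALTI stepping D and S
        print(op, ptr_fixed_reg, ptr_plus_inc, two_incs, alti_loop)
```

With these numbers the ALTI loop and the ptra++ loop with an added increment both come to 5 clocks per long read, which is the tie pointed out above.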
Interesting. Yes, maybe this is not worth doing. Glad you pointed that out. Keep thinking about it, if there's anything left to think about. Simpler is sometimes better. Maybe things are already okay with how RDLUT/WRLUT work.
I haven't found the need to have PTR increment, either with RD/WRxxxx or with RD/WRLUT. I wasn't even aware that PTR was being updated (incorrectly I might add). I have always needed to reload PTR with the address before using SETQ/SETQ2.
The gotcha I have found coming from P1 is the <hides from Peter> JMPRET instruction because of its 9-bit cog addresses. I'll leave this for another thread.
I cannot see the point in self-modifying instructions in LUT. There aren't any support instructions like SETS/D/I here.
However, I can live with the immediate only accessing the lower half of LUT, if it is necessary.
I don't see how RDLUT xx,PTRx++ can avoid requiring an extra clock, though. There is only 1 write port into the cog RAM. So, there is the write of the LUT data to the cog RAM, and then there is the writeback of the PTR++ value into the cog RAM. So either RDLUT xx,PTRx++ will take 1 clock more than RDLUT xx,PTRx, or RDLUT is going to take an extra clock either way and waste a clock when it doesn't need to.
PTRx are maintained as registers, not RAM. That's why an extra clock is not required.
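The write-port argument and the answer to it can be put as a toy model. This is a sketch of the reasoning only, under the assumption that each cog RAM write needs its own slot on the single write port; it says nothing about the real pipeline. `extra_clocks` is a hypothetical name.

```python
COG_RAM_WRITE_PORTS = 1  # "only 1 write port", as stated in the thread


def extra_clocks(ptr_in_cog_ram: bool, auto_increment: bool) -> int:
    """Extra clocks RDLUT would need beyond its single result write."""
    writes = 1  # the LUT data is written to the D register in cog RAM
    if auto_increment and ptr_in_cog_ram:
        writes += 1  # the updated pointer would also need the RAM port
    # Assumption: one RAM write per port per slot.
    return max(0, writes - COG_RAM_WRITE_PORTS)


if __name__ == "__main__":
    print(extra_clocks(ptr_in_cog_ram=True, auto_increment=True))
    print(extra_clocks(ptr_in_cog_ram=False, auto_increment=True))
```

Only the case where the pointer lives in cog RAM costs an extra slot in this model; a pointer kept in separate registers does not compete for the port.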
Can you keep full immediate LUT addressing and instead add four single-operand instructions equivalent to RD/WRLUT D, ptra++/--?
Reposted on RDLUT/WRLUT with auto-incrementing address thread.
Comments
32-40, too many for registers, but all in the same half of the LUT.
This is typical code:
Your code would definitely benefit, then, right?
4 clocks instead of 6 for block writes to LUTRAM from COGRAM
5 clocks instead of 7 for block reads to COGRAM from LUTRAM
Not quite as high.
Update: I guess if the number of transfers is small and you can unroll, it could still help, e.g. if you have code like this...
RDLUT reg, ptra++
RDLUT reg+1, ptra++
RDLUT reg+2, ptra++
RDLUT reg+3, ptra++
But then again, with immediates at known addresses, the same can be done with this:
RDLUT reg, #100
RDLUT reg+1, #101
RDLUT reg+2, #102
RDLUT reg+3, #103
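The two unrolled forms above differ only in the S operand, which a small generator makes explicit. This is an illustrative Python sketch, not an assembler feature: `unrolled_rdlut` and its `imm_limit` parameter are hypothetical, and the default limit of $1FF assumes the full LUT is currently reachable by immediate. Passing `imm_limit=0xFF` models the proposed restriction.

```python
def unrolled_rdlut(count, lut_base=None, imm_limit=0x1FF):
    """Return PASM-style text lines for an unrolled run of RDLUTs.

    lut_base=None emits the ptra++ form; otherwise immediates
    lut_base, lut_base+1, ... are emitted (shown in decimal, as above).
    """
    lines = []
    for i in range(count):
        dest = "reg" if i == 0 else f"reg+{i}"
        if lut_base is None:
            src = "ptra++"
        else:
            addr = lut_base + i
            if addr > imm_limit:
                # Beyond the immediate range: would need PTRx or a register.
                raise ValueError(f"LUT address ${addr:03X} is not an immediate")
            src = f"#{addr}"
        lines.append(f"RDLUT {dest}, {src}")
    return lines


if __name__ == "__main__":
    print("\n".join(unrolled_rdlut(4)))
    print("\n".join(unrolled_rdlut(4, lut_base=100)))
```

Both listings have the same instruction count, so the immediate form only loses out once an address crosses the limit.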
Good point. That would save two clocks per long.
Ah, of course. What do we do, then?
Good idea! Will do.
I would say this is well worth doing, for compilers and array handling especially. Fast copying of arrays between COG pairs is going to be very common.
So that code is unaffected, because they are all (comfortably) in the same half of LUT. That makes the case for PTRx addressing stronger.
Sounds good to me.
The immediate form won't work in REP blocks. Keep the change.