The way I see it, the xxLONG is used to set up the block move mechanism and is executed once.
Then the block move mechanism takes over, which explains the single PTRx action and the SETQ(2) count being #N-1.
So the RDLONG takes 1 clock for each execution of the SETQ loop. But if there is an update to the PTR, then it would require an extra clock for the PTR write. I guess this is why PTR doesn't get updated (beyond the first update, which is probably an artifact).
I wonder if the first SETQ takes an additional clock for that PTRx update?
PTRx gets updated on the last clock of the instruction.
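For anyone following along, the block-move sequence under discussion looks like this (a sketch; the buffer name, address register, and count are made up):

        SETQ    #16-1           'Q = count-1 arms a 16-long block move
        RDLONG  buf,addr        'executes once; the block mover then streams the rest, one long per clock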
We can change RDLUT/WRLUT to use PTRx expressions like the hub RD/WR/WMxxxx instructions use.
This introduces a limitation: immediate LUT addressing (#S) can only reach the first half of the LUT ($000..$0FF); to get to the second half, you must use a register or a PTRx expression.
RDLUT   reg,#$000       'first immediately-addressable LUT location
RDLUT   reg,#$0FF       'last immediately-addressable LUT location
RDLUT   reg,PTRA++      'PTRx encodings will use what was immediate #$100..$1FF
RDLUT   reg,addr        'register addresses can span $000..$1FF

SETQ2   #$FF            'load 256 LUT locations from hub into $100..$1FF
RDLONG  $100,addr       '$000..$1FF can still be used as start addresses for hub loading
Here's an example of the savings an auto-incrementing LUT address will produce (including the RD/WM/WRLONG PTRx change). Only five instructions saved, but at least 400 cycles.
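To make the saving concrete, here is the kind of inner fragment that shrinks (hypothetical register names, and it assumes the proposed RDLUT PTRx syntax):

        'today: the LUT address lives in a register and is advanced by hand
        RDLUT   data,lutptr     'read one long from LUT
        ADD     lutptr,#1       'advance the address: extra clocks on every item

        'proposed: the post-increment rides along with the access
        RDLUT   data,PTRA++     'read and advance in one instruction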
Yes Cluso, I was thinking the same, and a larger immediate access space is far better: rather than losing 50% of the LUT, we'd only lose 1/32 perhaps, and I'd be far more comfortable with that.
Of course, this may be somewhat complex to implement and take more LEs. It does give you the increment and decrement ops, though, and those are needed for a stack.
Anything useful for the other unused patterns? Some other operation?
Looking further at your encoding, Cluso, I see you have optimized it for one-hot control bits: one bit selects the A/B pointer, one bit selects whether to do a PTR update, one bit selects inc/dec, and the other selects post/pre. That should help decode it quickly. It does leave the cases of
1_1111_00xx and 1_1111_10xx.
I guess they should all collapse down and do the same as 1_1111_0000 and 1_1111_1000 respectively in order to keep the one-hot type of decoding working fast.
If you must change RD/WRLUT to use immediate PTRx/PTR++/PTR--/++PTR/--PTR, then I suggest the following. IMHO, there is no need for an INDEX for LUT, since the LUT is small and SCALE = 1 always (long).
This means that the LUT can be directly accessed with an immediate up to the last 16 longs, i.e. $000-$1EF is LUT, $1F0-$1FF is PTR variants.
This is a clever idea. The index is still there but it's only one bit. It's very easy to recreate the 1SUP????? bits used for hub RAM from 11111SUPN (->1SUPNNNNU) to share the decoding.
$1F0-$1FF would be special or different anyway in both COG and LUT.
Should be easy to implement. Unused patterns are contradictions, e.g. stay the same and increment.
[Putting this new post now over in this thread where it belongs...]
Let's try to think through the different use cases to see where the benefit of PTRs for LUTRAM access is. TonyB_ has identified an example of a good performance gain in the HDMI code, but that was somewhat indirect: the additional overhead pushed a critical code loop just beyond a hub boundary, and any reduction in cycles anywhere else in that loop would have captured the same savings. So it is not really a perfect example of why this is required, in my view.
1) LUTRAM holding lookup tables:
LUTRAM is perfect for holding lookup tables, and random access is definitely needed there; an auto-incrementing PTR form is not of much benefit in that situation unless you are indexing tables with multiple sequential items from the first calculated index. You might very well be doing this for accessing something like a sprite list with multiple longs per sprite.
2) LUTRAM table loading:
You can already transfer between LUT and hub at maximum speed with SETQ2; no need for any LUT PTR stuff for that.
3) LUTRAM loading from COGRAM:
You can transfer small blocks of data between LUTRAM and COGRAM at known addresses in a small batch, with one long transferred every 2 or 3 clocks (for writing and reading LUTRAM respectively), but only at fixed addresses: either immediates, or known registers holding LUT RAM addresses, set up initially or fixed in advance. This approach doesn't work with REP loops for transferring continuous data streams, however, unless you also patch the addresses on the fly, which costs more cycles.
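For example, a fixed-address batch looks like this (register and address choices are arbitrary; the clock figures are the ones quoted above):

        WRLUT   reg0,#$000      'COG -> LUT at a fixed address, 2 clocks
        WRLUT   reg1,#$001
        RDLUT   reg2,#$002      'LUT -> COG, 3 clocks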
If you want to do larger block transfers in REP loops, then yes, without PTRA++ you will need one more instruction in the loop. But if the data being copied ever goes into more than one register (i.e. an actual block transfer), you will still need to increment the destination or source COGRAM register, and you could then simultaneously increment an immediate constant for the same timing in a REP loop with ALTI. So a PTRA++ form doesn't help quite as much there, though it can help retain knowledge of where you are up to.
The only benefit I see is when you have to read through a list of longs in LUT, processing the data one long at a time while keeping track of where you are. For this class of processing the PTR form is very useful and saves two clocks each time. Depending on how tight the loop is, 2 clocks could be significant.
4) Executing from LUTRAM:
I don't see much benefit in auto-incrementing PTR access here, unless dynamic code sequences are being modified in bursts on the fly. Perhaps when reading some custom "long code" (not bytecode) from LUT RAM, you could use an auto-incrementing PTR as a type of program counter? A weird application, but it might help if you are really COGRAM-limited.
5) A LUTRAM stack:
I can see that LUTRAM with the pointer operations could be very useful for supporting a decently sized stack. Automatic pointer updating would be very useful for that use case, and it would be atomic with respect to interrupts (very important), but you'd have to support both incrementing and decrementing update modes.
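A push/pop pair would then be as simple as this (assuming the proposed RDLUT/WRLUT PTRx forms; using PTRA as the stack pointer is an arbitrary choice):

        WRLUT   val,PTRA++      'push: write, then advance (stack grows upward)
        RDLUT   val,--PTRA      'pop: back up, then read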
6) A LUTRAM fifo for passing data between a COG pair in shared LUT mode:
A FIFO data pipeline could also be made between two COGs with just incrementing pointers, but the FIFO overflow/underflow cases need some other handling and signaling between the COGs, which could become problematic if the LUT RAM is already fully used for the FIFO (as it probably would be, unless it can somehow wrap on a smaller, half-LUTRAM boundary). This might have to involve the hub to signal the amount of data present, and once you involve the hub it may become faster to copy your data in/out of hub anyway, as that already runs at one long per clock once the initial latency is overcome. There will also be cases where you work on one data item at a time rather than transferring in larger batches, and a LUT-based FIFO is probably quite useful then.
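A minimal producer sketch, assuming shared-LUT mode is enabled between the cog pair, using a register pointer with INCMOD for the wrap (the head/item names are hypothetical, and the fill-count signaling is deliberately left out):

        WRLUT   item,head       'append one long to the FIFO
        INCMOD  head,#$1FF      'advance, wrapping within the 512-long LUT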
There are definite benefits to be had from RDLUT & WRLUT with the pointers. I think sacrificing 16 longs of immediate-mode addressing range to enable it, as Cluso proposed, would be worth the change.
I'm kind of partial to keeping the -16..+15 range for the index.
I think immediate-addressability for the first 256 locations is not really improved by going to the first 496. How many hard-coded addresses are you likely to have in your program? 20? I haven't even had 1, yet.
Having a 5-bit index is very useful for navigating stack frames and records. I think it's much more valuable than getting more immediately-addressable LUT locations that your code is never going to use.
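For instance, frame locals could be reached straight off the stack pointer (assuming the proposed indexed forms; register names are made up):

        RDLUT   x,PTRA[-1]      'first local, just below the stack pointer
        RDLUT   y,PTRA[-2]      'second local
        WRLUT   x,PTRA[-1]      'store it back, pointer unchanged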
If you do this, will it at least be possible to do a hardcoded LUT access via RD/WRLUT x, ##addr, where 0 <= addr < 512, even though it's an extended immediate?
Or, could you instead add an instruction that sets a 9-bit register to a value that should be used to increment ptra by on every RD/WRLUT D, ptra?
Just thinking about this.
A fairly extreme but practicable example of hard-coded LUT addresses is the 8bpp soft HDMI with the TMDS loop unrolled eight times. It would use 128 instructions and need a block of 80 LUT addresses, still way below the 256 limit.
Yeah Chip, I guess thinking about the number of immediate addresses that could actually be used in a COGRAM program lessens the impact of the reduced LUTRAM range for immediate addressing. To actually use over 256 immediately-addressable LUTRAM entries, you either need to patch addresses in loops (which could still potentially be done by changing a register instead, if you have room to hold it), or you'd have at least that many individual RDLUT instructions in your code, which is getting fairly ridiculous. Though I do think back on all the optimized video code on the P1, and I sort of remember seeing things a bit like that, with back-to-back waitvids etc. An incrementing LUTRAM pointer form should help those types of applications, I would hope.
In our case with the HDMI code we did hit needing 60 random accesses in an unrolled loop, and might have wanted 80 at some point, but that was the only case where it was essential to randomly access LUT RAM as fast as possible, and at that point we started to run out of COGRAM space anyway. It is probably unlikely someone would need more than 256 in a row in most cases, but I'd bet someone will eventually try, for some reason or another: bursting out pre-prepared pattern data to smartpin registers in some non-sequential manner, or maybe to/from the CORDIC engine as fast as possible. I don't know. LOL.
Initially I thought that accessing up to -16..+15 longs away from the pointer in the LUT seemed a bit extreme, given the smallish size of the LUT compared to hub RAM. But once we get the LUT available as a proper stack, I can see it would be very useful for stack-frame situations, where we could benefit from holding locals/parameters there: they would be so much faster to access than always reading them from hub RAM. It is probably worth supporting the larger span of indexed addressing for this reason, and that necessitates the reduced LUT address range for immediate access. I admit it.
If you do this, will it at least be possible to do a hardcoded LUT access via RD/WRLUT x, ##addr, where 0 =< addr < 512 even though it's an extended immediate?
Or, could you instead add an instruction that sets a 9-bit register to a value that should be used to increment ptra by on every RD/WRLUT D, ptra?
You could just do this:
MOV addr,#$1FF
RDLONG reg,addr
If you use an extended ## immediate with a value that fits in 9 bits on an instruction that uses bit 8 to indicate PTRx auto-increment, will the ## disable the PTRx auto-increment? If not, would it be useful and practical to change it so that it does?
Yes, ## is only interpreted as a PTRx expression if the top 9 of 32 bits are %000000001.
What's the purpose of being able to enable PTRx with ## at all?
For the hub RD/WRxxxx instructions, it affords 20-bit unscaled offsets.
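As I read it, the 32-bit extended form lays out like this (my inference from the %000000001 rule and the 20-bit offset; treat the exact bit layout as unconfirmed):

        '%000000001_SUP_NNNNNNNNNNNNNNNNNNNN   (9 + 3 + 20 = 32 bits)
        'e.g. ##PTRA with offset $12345  ->  S=0, U=0, P=0, N=$12345, unscaled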
The PTRx encoding seems to have a spare bit that could be used to increase the index when PTRx does not change. A larger index means less need to modify the base pointer. I could explain it, but it's easier to see:
Existing:
1SUPNNNNN  Expression     Address used        PTRx after instr      INDEX
1000NNNNN  PTRA[INDEX]    PTRA + INDEX*SCALE  PTRA                  -16..+15
1100NNNNN  PTRB[INDEX]    PTRB + INDEX*SCALE  PTRB                  -16..+15
1011NNNNN  PTRA++[INDEX]  PTRA                PTRA ++= INDEX*SCALE  -16..+15
1111NNNNN  PTRB++[INDEX]  PTRB                PTRB ++= INDEX*SCALE  -16..+15
1010NNNNN  ++PTRA[INDEX]  PTRA + INDEX*SCALE  PTRA ++= INDEX*SCALE  -16..+15
1110NNNNN  ++PTRB[INDEX]  PTRB + INDEX*SCALE  PTRB ++= INDEX*SCALE  -16..+15
S = 0 for PTRA, 1 for PTRB
U = 0 to keep PTRx same, 1 to update PTRx (PTRx += INDEX*SCALE)
P = 0 to use PTRx + INDEX*SCALE, 1 to use PTRx (post-modify)
NNNNN = INDEX (signed)
--------------------------------------------------------------------------------
Proposed:
1SUXNNNNN  Expression     Address used        PTRx after instr      INDEX
100NNNNNN  PTRA[INDEX]    PTRA + INDEX*SCALE  PTRA                  -32..+31
110NNNNNN  PTRB[INDEX]    PTRB + INDEX*SCALE  PTRB                  -32..+31
1011NNNNN  PTRA++[INDEX]  PTRA                PTRA ++= INDEX*SCALE  -16..+15
1111NNNNN  PTRB++[INDEX]  PTRB                PTRB ++= INDEX*SCALE  -16..+15
1010NNNNN  ++PTRA[INDEX]  PTRA + INDEX*SCALE  PTRA ++= INDEX*SCALE  -16..+15
1110NNNNN  ++PTRB[INDEX]  PTRB + INDEX*SCALE  PTRB ++= INDEX*SCALE  -16..+15
If U = 0, then P = 0 and INDEX[5] = X
If U = 1, then P = X and INDEX[5] = INDEX[4]
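A couple of worked encodings, per my reading of the proposed table (illustrations only, not a spec):

        'PTRA[20]   -> %1_00_010100   U=0, so the old P bit extends INDEX to 6 bits (+20)
        'PTRB++[3]  -> %1_11_1_00011  U=1, post-modify, 5-bit INDEX = +3, same as today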
Comments
PTRx gets updated on the last clock of the instruction.
No, it just happens on the last clock, not on an extra clock.
The LUT is a great place to store tables and arrays, which can be shared by two cogs, as in the soft HDMI.
I agree: the gain outweighs the loss.
I guess they should all collapse down and do the same as 1_1111_0000 and 1_1111_1000 respectively in order to keep the one-hot type of decoding working fast.
Roger.
I just took the SUPN table from Chip's PTR code used for RD/WRxxxx and reduced it.
Thanks for your in-depth discussion. Very enlightening.
Small benefit, small downside.
We've got a little opcode space left, but I want to put the shorthand WRPIN instructions in there.
It is more likely we are going to use indirect or PTR LUT addresses.
Perhaps it will not use too much silicon, as hopefully it will share similar silicon with rd/wrxxxx.