New Pin Instructions

TonyB_ · 2018-11-14 20:40

Roy Eltham wrote: »

Sorry, I am a bit frustrated with what is deemed important vs what is not.

I feel like this change is not needed at all, and the LUT PTRx change is only being done for software HDMI that won't be used in the actual chip since you are doing hardware assisted HDMI. Both seem like things that should be very low priority compared to anything else.

Auto-incrementing PTRx with RD/WRLUT will be useful for all sorts of things. If someone else had suggested it, I would have been the first person to say "That's a great idea!" It will save code and cycles generally and where timing is critical in particular. The P2 will be able to do more, more quickly. The same applies to automatic adjustments to ptrx during fast block moves.

More than one change can be made for rev B and it's already happened.

cgracey · 2018-11-14 20:42

Thanks, Jmg.

I'm thinking that just a way to take a snapshot of the whole 64 bits would be sufficient. Maybe protect GETCT from interrupts so that a second GETCT returns the top 32 bits, time-aligned.
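
A minimal sketch of the snapshot idea, written as a Python toy model rather than real P2 behaviour. It assumes the first read returns the low long and latches the matching high long, so the second read stays time-aligned even if the counter has moved on. The names `Counter64`, `getct`, and `read64` are hypothetical and exist only for illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


class Counter64:
    """Toy model of a 64-bit free-running counter read over a 32-bit path."""

    def __init__(self, start=0):
        self.value = start & MASK64
        self._latched_high = None  # high long captured by the first read

    def tick(self, n=1):
        # Advance the counter by n clocks, wrapping at 64 bits.
        self.value = (self.value + n) & MASK64

    def getct(self):
        # First read: return the low long and latch the matching high long.
        if self._latched_high is None:
            self._latched_high = (self.value >> 32) & MASK32
            return self.value & MASK32
        # Second read: return the latched high long, then re-arm.
        high = self._latched_high
        self._latched_high = None
        return high


def read64(ct, ticks_between=0):
    """Assemble a 64-bit snapshot from two consecutive 32-bit reads."""
    low = ct.getct()
    ct.tick(ticks_between)  # time passes between the two reads
    high = ct.getct()
    return (high << 32) | low
```

Without the latch, a low long read just before a carry into the high long would be paired with the already-incremented high long, giving a value that is off by 2^32.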

TonyB_ · 2018-11-14 21:08

Question:

Does the eggbeater use the low bits of CT for its slice addresses?

cgracey · 2018-11-14 21:17

TonyB_ wrote: »

Question:

Does the eggbeater use the low bits of CT for its slice addresses?

No, but there is a fixed relationship between the two. They both start cycling from reset.

TonyB_ · 2018-11-14 21:22

cgracey wrote: »

TonyB_ wrote: »

Question:

Does the eggbeater use the low bits of CT for its slice addresses?

No, but there is a fixed relationship between the two. They both start cycling from reset.

Thanks, Chip. So we could deduce the phase difference and it will never change from one reset to the next?

Would a 64-bit CT require another 32-bit bus from hub to cog? If so, I have an idea.

jmg · 2018-11-14 21:48

TonyB_ wrote: »

cgracey wrote: »

No, but there is a fixed relationship between the two. They both start cycling from reset.

Thanks, Chip. So we could deduce the phase difference and it will never change from one reset to the next?

That depends on whether CT is cleared during reset. A 64b one should be cleared, to give time-from-start.

The 64b read by way of consecutive reads, which Chip mentions above, does not need another 32-bit bus.

cgracey · 2018-11-14 22:21

CT is cleared during reset.

The phase difference between CT and each cog's hub access is static. Never changes.

cgracey · 2018-11-14 22:22

TonyB_ wrote: »

cgracey wrote: »

TonyB_ wrote: »

Question:

Does the eggbeater use the low bits of CT for its slice addresses?

No, but there is a fixed relationship between the two. They both start cycling from reset.

Thanks, Chip. So we could deduce the phase difference and it will never change from one reset to the next?

Would a 64-bit CT require another 32-bit bus from hub to cog? If so, I have an idea.

Yes, there's another 32-bit bus involved. We can't mux high and low longs, though, because the timer events are still looking at the lower long in the background. What was your idea?

TonyB_ · 2018-11-14 23:22

deleted

Peter Jakacki · 2018-11-14 23:33

Cluso99 wrote: »

Chip,
OT. Could the CNT be extended to 64 bits?
Then just need a new instruction to copy the 64-bit CNT to the GETX and GETY internal result registers.
Everything would work as-is for 32-bit CNT, but to get a 64-bit CNT, you do a COPYCNT instruction followed by GETX and/or GETY (or whatever these instructions are called).
WAITCTx etc do not work on the 64-bit CNT.
This just allows a bigger timer to be implemented in software, should someone want longer timers.
What do you think? Presume little risk, little silicon?

And, do we need it ???

I wondered how this PIN thread started to go so OT, then I found the reason.

cgracey · 2018-11-15 00:09

Peter Jakacki wrote: »

Cluso99 wrote: »

Chip,
OT. Could the CNT be extended to 64 bits?
Then just need a new instruction to copy the 64-bit CNT to the GETX and GETY internal result registers.
Everything would work as-is for 32-bit CNT, but to get a 64-bit CNT, you do a COPYCNT instruction followed by GETX and/or GETY (or whatever these instructions are called).
WAITCTx etc do not work on the 64-bit CNT.
This just allows a bigger timer to be implemented in software, should someone want longer timers.
What do you think? Presume little risk, little silicon?

And, do we need it ???

I wondered how this PIN thread started to go so OT, then I found the reason.

Expanding the counter was easy compared to other stuff I've been working on.

I've got all the pin/bit instructions handling spans of bits now. That was an adventure.

There are four things left on my Verilog to-do list, for those wondering if the tinkering will ever end:

1) Make SETQ(2)+RD/WR/WMLONG behave sensibly with {++}PTRx{--} addressing.
2) Auto-increment PTRx on "RD/WRLUT reg,PTRx".
3) Make shorthand instructions to simplify common WRPIN operations. Very nice with simultaneous pin spanning.
4) Finish modifying the streamer to support ADC input, DAC output, LUT-sequence output, and logic analyzer modes.

1 and 2 are perfunctory, while 3 and 4 require creativity. I'll do the boring stuff next (1 and 2).

jmg · 2018-11-15 00:22

cgracey wrote: »

There are four things left on my Verilog to-do list, for those wondering if the tinkering will ever end:

...4) Finish modifying the streamer to support ADC input, DAC output, LUT-sequence output, and logic analyzer modes.

An adjunct of 4) could be the Clock out to SysCLK speed, to actually allow SYNC clock to streamer speeds. (currently missing)
I think that needs =\_ and _/= D-FF's plus XOR gate, to largely keep timing to master SysCLK. FPGA synth should manage that ?

I wonder if an optional =\_ D-FF on the clock, would allow a half-period-shift of clock timing, to give Tsu/th margin control ? Or make that an XOR/XNOR choice ?

rogloh · 2018-11-15 00:51

cgracey wrote: »

There are four things left on my Verilog to-do list, for those wondering if the tinkering will ever end:

1) Make SETQ(2)+RD/WR/WMLONG behave sensibly with {++}PTRx{--} addressing.
2) Auto-increment PTRx on "RD/WRLUT reg,PTRx".
3) Make shorthand instructions to simplify common WRPIN operations. Very nice with simultaneous pin spanning.
4) Finish modifying the streamer to support ADC input, DAC output, LUT-sequence output, and logic analyzer modes.

Wasn't there also this one talked about:
5) Allow HDMI pin order to be reversed in its 8-pin group?

Or was this one very simple and has already been completed? I can't recall if it was fully finished and locked down yet or not.

cgracey · 2018-11-15 00:54

rogloh wrote: »

cgracey wrote: »

There are four things left on my Verilog to-do list, for those wondering if the tinkering will ever end:

1) Make SETQ(2)+RD/WR/WMLONG behave sensibly with {++}PTRx{--} addressing.
2) Auto-increment PTRx on "RD/WRLUT reg,PTRx".
3) Make shorthand instructions to simplify common WRPIN operations. Very nice with simultaneous pin spanning.
4) Finish modifying the streamer to support ADC input, DAC output, LUT-sequence output, and logic analyzer modes.

Wasn't there also this one talked about:
5) Allow HDMI pin order to be reversed in its 8-pin group?

Or was this one very simple and has already been completed? I can't recall if it was fully finished and locked down yet or not.

That's already done.

cgracey · 2018-11-15 00:54

jmg wrote: »

cgracey wrote: »

There are four things left on my Verilog to-do list, for those wondering if the tinkering will ever end:

...4) Finish modifying the streamer to support ADC input, DAC output, LUT-sequence output, and logic analyzer modes.

An adjunct of 4) could be the Clock out to SysCLK speed, to actually allow SYNC clock to streamer speeds. (currently missing)
I think that needs =\_ and _/= D-FF's plus XOR gate, to largely keep timing to master SysCLK. FPGA synth should manage that ?

I wonder if an optional =\_ D-FF on the clock, would allow a half-period-shift of clock timing, to give Tsu/th margin control ? Or make that an XOR/XNOR choice ?

Yes, I will see about exposing the CLK with some polarity control.

ersmith · 2018-11-15 01:16

cgracey wrote: »

There are four things left on my Verilog to-do list, for those wondering if the tinkering will ever end:

1) Make SETQ(2)+RD/WR/WMLONG behave sensibly with {++}PTRx{--} addressing.
2) Auto-increment PTRx on "RD/WRLUT reg,PTRx".

1 and 2 are perfunctory, while 3 and 4 require creativity. I'll do the boring stuff next (1 and 2).

Please re-consider (2). It's really nasty for a programmer if the hardware is silently changing a register behind his/her back.

I can offer some suggestions for how to get the same functionality:

(2a) Make RD/WRLUT have the same addressing modes as RD/WRLONG (so ++PTRx is available, but not forced on the programmer). I think this would be the nicest option, since it makes the instruction set more uniform rather than having a weird exception for two registers in two instructions.

(2b) If 2a is too complicated, how about implementing a tiny subset of the RD/WRLONG functionality in RD/WRLUT? That is, if (and only if) the immediate bit is set and bit 9 of S is set, then make PTRA and PTRB be auto-incrementing.

(2c) If neither 2a nor 2b is feasible, could you pick two other registers (other than PTRA and PTRB) to make the auto-incrementing ones? PTRA and PTRB are both going to be heavily used in compilers, and having to special case the RDLUT/WRLUT instructions is going to be a pain.

cgracey · 2018-11-15 01:20

ersmith wrote: »

cgracey wrote: »

There are four things left on my Verilog to-do list, for those wondering if the tinkering will ever end:

1) Make SETQ(2)+RD/WR/WMLONG behave sensibly with {++}PTRx{--} addressing.
2) Auto-increment PTRx on "RD/WRLUT reg,PTRx".

1 and 2 are perfunctory, while 3 and 4 require creativity. I'll do the boring stuff next (1 and 2).

Please re-consider (2). It's really nasty for a programmer if the hardware is silently changing a register behind his/her back.

I can offer some suggestions for how to get the same functionality:

(2a) Make RD/WRLUT have the same addressing modes as RD/WRLONG (so ++PTRx is available, but not forced on the programmer). I think this would be the nicest option, since it makes the instruction set more uniform rather than having a weird exception for two registers in two instructions.

(2b) If 2a is too complicated, how about implementing a tiny subset of the RD/WRLONG functionality in RD/WRLUT? That is, if (and only if) the immediate bit is set and bit 9 of S is set, then make PTRA and PTRB be auto-incrementing.

(2c) If neither 2a nor 2b is feasible, could you pick two other registers (other than PTRA and PTRB) to make the auto-incrementing ones? PTRA and PTRB are both going to be heavily used in compilers, and having to special case the RDLUT/WRLUT instructions is going to be a pain.

Thanks for commenting on this, Eric.

I would like to use the full PTRx expression rules for RDLUT and WRLUT, but it would mean that the LUT could only be immediately addressed from $000..$0FF. Addressing LUT locations $100..$1FF would require a PTRx expression or a register.

How do people feel about that?
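
A rough Python sketch of the trade-off being asked about. It assumes, as with the RDLONG/WRLONG restriction mentioned in the thread, that bit 8 of an immediate 9-bit S field would mark a PTRx expression, leaving only $000..$0FF for direct LUT addresses. The function name and the returned tuples are hypothetical, and the internal layout of the PTRx expression bits is deliberately not modelled.

```python
def decode_lut_operand(imm, s_field):
    """Classify a 9-bit S field under the proposed rule (hypothetical model)."""
    s_field &= 0x1FF
    if not imm:
        # Register form: the LUT address comes from cog register s_field,
        # so the full $000..$1FF LUT range stays reachable.
        return ("register", s_field)
    if s_field & 0x100:
        # Immediate form with bit 8 set: treated as a PTRx expression.
        # The low 8 bits are passed through undecoded.
        return ("ptr_expression", s_field & 0xFF)
    # Immediate form with bit 8 clear: direct LUT address $000..$0FF.
    return ("immediate", s_field)
```

Under this model an instruction such as `RDLUT reg,#$100` could no longer name LUT location $100 directly; that address would have to come from a register or a PTRx expression.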

jmg · 2018-11-15 01:23

ersmith wrote: »

cgracey wrote: »

2) Auto-increment PTRx on "RD/WRLUT reg,PTRx".

Please re-consider (2). It's really nasty for a programmer if the hardware is silently changing a register behind his/her back.

I agree it's not nice if that ++ is always imposed, but I thought this ++ on the index, was optional ? (ie gives 2 opcodes)

cgracey · 2018-11-15 01:27

Amazingly, my Spin2 interpreter doesn't contain a single RDLUT or WRLUT instruction, just a single SETQ+RDLONG to load interpreter code into the LUT.

If we could use PTRx expressions in RDLUT/WRLUT, we could do things like PUSH and POP. The implications are big.
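
To illustrate the PUSH/POP idea, here is a small Python model of a stack kept in the LUT. It assumes a push behaves like "WRLUT reg,PTRA++" (write, then post-increment) and a pop like a hypothetical "RDLUT reg,--PTRA" (pre-decrement, then read). The class name, the 512-long LUT size taken from the $000..$1FF range, and the pointer wrap-around are assumptions made for the sketch.

```python
LUT_SIZE = 512  # LUT addresses $000..$1FF


class LutStack:
    """Toy model of a stack in LUT RAM driven by one pointer (PTRA)."""

    def __init__(self):
        self.lut = [0] * LUT_SIZE
        self.ptra = 0

    def push(self, value):
        # Models "WRLUT value,PTRA++": write at PTRA, then post-increment.
        self.lut[self.ptra] = value & 0xFFFFFFFF
        self.ptra = (self.ptra + 1) % LUT_SIZE  # wrap is an assumption

    def pop(self):
        # Models a hypothetical "RDLUT reg,--PTRA": pre-decrement, then read.
        self.ptra = (self.ptra - 1) % LUT_SIZE  # wrap is an assumption
        return self.lut[self.ptra]
```

Each push or pop is a single instruction in this scheme, because the pointer update rides along with the LUT access instead of needing a separate ADD or SUB.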

rogloh · 2018-11-15 01:31

cgracey wrote: »

Thanks for commenting on this, Eric.

I would like to use the full PTRx expression rules for RDLUT and WRLUT, but it would mean that the LUT could only be immediately addressed from $000..$0FF. Addressing LUT locations $100..$1FF would require a PTRx expression or a register.

How do people feel about that?

Please no no no. For performance reasons in unrolled loops etc. we have found that we need to be able to read/write LUT registers directly using the immediate form. The PTR increment helps somewhat, but not always: we want fast random access as well as fast sequential access. Case in point: the HDMI code shares the work between two COGs, with one COG writing some addresses and the other COG writing others.

ersmith · 2018-11-15 01:32

cgracey wrote: »

I would like to use the full PTRx expression rules for RDLUT and WRLUT, but it would mean that the LUT could only be immediately addressed from $000..$0FF. Addressing LUT locations $100..$1FF would require a PTRx expression or a register.

How do people feel about that?

Isn't the S value the HUB address? So it would mean that only HUB addresses $000..$0FF could be directly loaded into LUT, which is the same restriction currently imposed on RDLONG/WRLONG (and seems not too onerous). Or have I misremembered something?

rogloh · 2018-11-15 01:35

cgracey wrote: »

Amazingly, my Spin2 interpreter doesn't contain a single RDLUT or WRLUT instruction, just a single SETQ+RDLONG to load interpreter code into the LUT.

If we could use PTRx expressions in RDLUT/WRLUT, we could do things like PUSH and POP. The implications are big.

Yeah I was sort of hoping for something like a PUSHLUT and a POPLUT type of operation as well, especially if the pop still only takes 3 cycles and you get the increment included as well as returning the data. If you allowed for both increment and decrement on some pointer you could probably implement a fifo to pass data between both COGs that share some processing workload. This could be pretty useful.

cgracey · 2018-11-15 01:37

ersmith wrote: »

cgracey wrote: »

I would like to use the full PTRx expression rules for RDLUT and WRLUT, but it would mean that the LUT could only be immediately addressed from $000..$0FF. Addressing LUT locations $100..$1FF would require a PTRx expression or a register.

How do people feel about that?

Isn't the S value the HUB address? So it would mean that only HUB addresses $000..$0FF could be directly loaded into LUT, which is the same restriction currently imposed on RDLONG/WRLONG (and seems not too onerous). Or have I misremembered something?

RDLUT reads the LUT into a register.

WRLUT writes a constant/register into the LUT.

The only way between hub and LUT is SETQ2+RD/WR/WMLONG.

cgracey · 2018-11-15 01:39

rogloh wrote: »

cgracey wrote: »

Thanks for commenting on this, Eric.

I would like to use the full PTRx expression rules for RDLUT and WRLUT, but it would mean that the LUT could only be immediately addressed from $000..$0FF. Addressing LUT locations $100..$1FF would require a PTRx expression or a register.

How do people feel about that?

Please no no no. For performance reasons in unrolled loops etc. we have found that we need to be able to read/write LUT registers directly using the immediate form. The PTR increment helps somewhat, but not always: we want fast random access as well as fast sequential access. Case in point: the HDMI code shares the work between two COGs, with one COG writing some addresses and the other COG writing others.

How many unique random addresses are you writing to in the LUT?

You could always use registers to hold those addresses if there aren't too many.

Would it not be a net win to be able to do this: "WRLUT reg,PTRA++"?

cgracey · 2018-11-15 01:44

I just looked in my ROM Booter code and there is one each of RDLUT and WRLUT, and they both use registers as addresses. So, if I changed over to PTRx addressing for RDLUT/WRLUT, nothing would break in my booter.

ersmith · 2018-11-15 01:46

cgracey wrote: »

RDLUT reads the LUT into a register.

WRLUT writes a constant/register into the LUT.

The only way between hub and LUT is SETQ2+RD/WR/WMLONG.

Aargh, of course. In that case we probably don't want to change the PTRA addressing; being able to address all of the COG registers is important.

How would everyone feel about using WC to indicate auto-increment in RD/WRLUT?

rogloh · 2018-11-15 01:48

cgracey wrote: »

How many unique random addresses are you writing to in the LUT?

You could always use registers to hold those addresses if there aren't too many.

Would it not be a net win to be able to do this: "WRLUT reg,PTRA++"?

Right now our current code could be made to fit as we don't necessarily use the full range in this case, but I'm thinking in a more general case outside our own implemented code which is only one example. I think it could still be useful to have full random access to the entire range of addresses and it allows normal patching of S-fields in instructions in loops etc. It could of course be worked around but you'll need to be aware of this addressing limitation and it may restrict access to look up tables to be less than 256 sized etc.

I agree it would be a net win to be able to do WRLUT reg, PTRA++, as well as reads. I guess reallocating bit 8 of the instruction for this is the only way or is there another way it could be achieved?

cgracey · 2018-11-15 01:50

ersmith wrote: »

cgracey wrote: »

RDLUT reads the LUT into a register.

WRLUT writes a constant/register into the LUT.

The only way between hub and LUT is SETQ2+RD/WR/WMLONG.

Aargh, of course. In that case we probably don't want to change the PTRA addressing; being able to address all of the COG registers is important.

How would everyone feel about using WC to indicate auto-increment in RD/WRLUT?

No, it's the LUT registers that would be limited to $000..$0FF immediate addressing, not the cog registers.

I've been looking through my PASM code base and I never do RDLUT/WRLUT with immediate addressing. It's a register, if anything. Mainly, I just load code into the LUT using SETQ2+RDLONG.

Looking at the Verilog, it would be very easy to make RDLUT/WRLUT work with PTRx expressions. It's almost nothing to add.

cgracey · 2018-11-15 02:01

rogloh wrote: »

cgracey wrote: »

How many unique random addresses are you writing to in the LUT?

You could always use registers to hold those addresses if there aren't too many.

Would it not be a net win to be able to do this: "WRLUT reg,PTRA++"?

...I think it could still be useful to have full random access to the entire range of addresses and it allows normal patching of S-fields in instructions in loops etc.

I found it best to locate all self-modifying code into the cog registers, where it can easily be worked on. To modify LUT code, you'd need to do a RDLUT+modify+WRLUT, which is not competitive with cog registers, at all.

...I agree it would be a net win to be able to do WRLUT reg, PTRA++, as well as reads. I guess reallocating bit 8 of the instruction for this is the only way or is there another way it could be achieved?

I don't see any other way. No WC possible in WRLUT, either.

Look at your code. I bet it's not a problem to implement this. Tell me what you see.

jmg · 2018-11-15 02:02

cgracey wrote: »

No, it's the LUT registers that would be limited to $000..$0FF immediate addressing, not the cog registers.

I've been looking through my PASM code base and I never do RDLUT/WRLUT with immediate addressing. It's a register, if anything. Mainly, I just load code into the LUT using SETQ2+RDLONG.

Looking at the Verilog, it would be very easy to make RDLUT/WRLUT work with PTRx expressions. It's almost nothing to add.

I'm not quite following - do you mean immediate address of LUT has to go, or that it is limited to $000..$0FF immediate addressing - which is what, half the LUT ?
It seems the gain of PTRx expressions makes things more orthogonal and consistent. Access to half of LUT via immediate seems fine ?

I would say that users will expect PTRx access into all memory areas.
