As the stack is now 32-bit, would it be better for C and Z to be bits 31 and 30 for CALLs and POPs?
The CALLD instructions use the same bit locations to store the flags, too. If they all changed, it could pave the way for future program counter expansion.
Is that a yes, then?
Well, I don't know. You're talking maybe 5 minutes of work here.
Since so many changes are being made again I'm wondering if you might add a TLB so we can do virtual memory.
These changes are just minor refinements, not big things that would require deep re-thinking.
Yeah, I knew that would be the answer but I'm surprised that things are still in flux. Wasn't it a few months ago when we were told that the design was frozen except for the ROM?
Now that the stack has been widened to 32 bits, I expect more data to be put in it. Is 8 levels enough? I know it's not meant to be a general-purpose stack, but it does have the advantage of being a "standard location" for passing parameters and return values between third-party code.
The problem is that the stack is currently built from flipflops. To get any real size increase, we would have to go to a RAM. That might be a big change with OnSemi. I think we would have to buffer the top of the stack with 32 flipflops.
Ahh. I have to say I miss the P2-hot LUT stack pointers.
Thanks, Cluso. I will try to get my head around that tomorrow.
We need to be able to put whole bytes into it, right?
That wasn't my intention. My idea was a single bit at a time, and any width CRC with any formula.
A REP loop could perform a shift instruction followed by the CRCBIT instruction.
My thought was to make the instruction as generic as possible to allow for all the various CRC incarnations.
But how wide must the rotator be? 16 bits with options for 8 and 5 bits?
I was hoping it might be 32 bits, but 16 would be fine. There is no need to have width options as the unused upper bits are just "AND"ed out by user code after the whole byte is done.
But this thing would need a 32-bit rotator, right?
Can anyone think of any gotchas involving putting the C and Z flags into bits 31 and 30, instead of into bits 21 and 20, for return-address storage in stack and registers?
I've got all the Verilog lines bookmarked that would need modification to implement this change, but before I do it, I'm wondering if this will compromise anything.
I can think of this: Having bits 31..22 free, as they are now, means that programmers could freely use the upper byte of the long for some purpose associated with the return address. By moving C and Z to bits 31 and 30, we clobber this option.
Why not just widen the stack to 34 bits?
Because then it would not be possible to manipulate the C and Z storage.
I have some old products that have been selling well for years, and I know that if I tweaked them just a little bit they could be more useful. Instead I leave them exactly as they are: they have been proven, and if I tweak them they may fall apart somehow, even after much verification. I put all these features into new products instead.
David Betz is rightly alarmed. Even though these are "minor" tweaks, I can definitely see them snowballing yet again, and they all need verification and time to test; otherwise there is the danger that P2 might fall apart somehow.
You may have noticed during the whole long P2 saga that I have practically never put my hand up to request features. This is my mindset: if my wife was baking a pie (I wish she did), I would leave her well enough alone so I could have some of that pie. There are other days for other pies.
Let's be practical and enjoy some pie. Hopefully there will be other days for other pies.
There's a spare D,{#}S {WC/WZ/WCZ} instruction slot after ANYB. Could S be the (up to) 32-bit CRC polynomial?
That would be where I'd put it, yes. But, if this circuit involves a 32-bit rotator, forget it. That's a LOT of logic that we don't have room for.
If it's only a rotate by 1 for each iteration then that's not huge. However, I haven't tried to do my own CRC so don't know if this is the case or not.
Would C and Z at bits 1 and 0 be mad?
It used to be that way when long addresses' two LSBs were don't-care. At this point, it would be a mess, yes.
I generally agree, Peter. Until OnSemi requests Verilog, we can still tweak things. I feel like we are rapidly cleaning up little warts that have bugged me for a while. I think it could be done right now. I must look into the CRC thing, though, because that is rather important for USB.
Bits 31/30 are more future-proof, and allow a linear address to mean something.
You do not really clobber choices with 21.20 -> 31.30, as the number of spare bits is exactly the same. Some mask & shift is needed in both cases.
What you gain is that the address can be an unbroken 30 bits, which is more useful than some left-justified 8 bits.
If a future P2 adds the hardware for XIP Serial Flash, you do not want bits stuck at 21.20 breaking that.
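To make the mask-and-shift point concrete, here is a small Python sketch of the two layouts being compared. It is only an illustration, not the Verilog: the function names are made up for this example, and it assumes C sits in the higher of the two flag bits (31 or 21) and Z in the lower (30 or 20), with a 20-bit address in the current layout.

```python
# Hypothetical helpers for illustration only; not taken from the P2 sources.
# Assumed ordering: C in the higher flag bit, Z in the lower one.

def pack_old(addr, c, z):
    """Current layout: C = bit 21, Z = bit 20, address = bits 19..0."""
    return ((c & 1) << 21) | ((z & 1) << 20) | (addr & 0xFFFFF)

def unpack_old(word):
    """Return (address, C, Z) from the current layout."""
    return word & 0xFFFFF, (word >> 21) & 1, (word >> 20) & 1

def pack_new(addr, c, z):
    """Proposed layout: C = bit 31, Z = bit 30, address = bits 29..0."""
    return ((c & 1) << 31) | ((z & 1) << 30) | (addr & 0x3FFFFFFF)

def unpack_new(word):
    """Return (address, C, Z) from the proposed layout."""
    return word & 0x3FFFFFFF, (word >> 31) & 1, (word >> 30) & 1

# Both layouts leave 10 bits unused with a 20-bit address:
# bits 31..22 in the current one, bits 29..20 in the proposed one.
word = pack_new(0x12345678, 1, 0)
print(hex(word), unpack_new(word))
```

In the proposed layout the spare bits sit directly above the address, so a wider program counter can grow into them without moving the flags. In the current layout a wider address would collide with the flags at bits 21 and 20.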
The hardware stack is intentionally there just for CALL and RET. Using it for PUSH and POP is a disadvantage simply because it will overflow all too easily.
Adding more hardware stack depth is overkill. We already have indirection for LUT and CogRAM and we also have PTRA/B for HubRAM.
CRCBIT D,[#]S
where D = CRC Register, C (carry flag) = current data bit, [#]S = polynomial
The CRCBIT instruction performs the following...
(1) X := C XOR D[0]
(2) D := D >> 1
(3) if X == 1 then D := D XOR POLY
So a full 8-bit CRC16 would be (using the CRC16 with initial=$0000, polynomial=$8005)
' calculate the CRC16 to include the 8-bit DATA byte
        REPS    #2, #8          '\\ 2 instructions x 8 loops
        SHR     DATA, #1  WC    '\\ C:=DATA[0]
        CRCBIT  CRC16, POLY     '// accumulate 1bit into crc

DATA    long    0               ' data byte to be added to the CRC calculation
CRC16   long    $0000           ' current CRC16 calculation (initially $0000)
POLY    long    $8005           ' polynomial
Now, rather than use a full instruction slot, it would be possible to store the polynomial (or the CRC, or both) into an internal register such as the SETQ, SETQ2, X, Y, etc, or a new one.
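For anyone who wants to check the three steps in software first, here is a rough Python model of the proposed CRCBIT and the 8-iteration loop. The function names are invented for this sketch. One assumption to flag loudly: because the register shifts right (LSB first), the model uses the bit-reversed form of the $8005 polynomial, which is $A001. With that value and an initial CRC of $0000 it reproduces the well-known CRC-16/ARC check value.

```python
# Software model of the proposed CRCBIT; names are hypothetical.

def crcbit(d, c, poly):
    """One CRCBIT step: d = CRC register, c = current data bit."""
    x = c ^ (d & 1)        # (1) X := C XOR D[0]
    d >>= 1                # (2) D := D >> 1
    if x:                  # (3) if X == 1 then D := D XOR POLY
        d ^= poly
    return d

def crc_add_byte(crc, byte, poly):
    """Model of the 2-instruction loop repeated 8 times: SHR DATA,#1 WC then CRCBIT."""
    for _ in range(8):
        c = byte & 1       # SHR DATA,#1 WC puts DATA[0] into C
        byte >>= 1
        crc = crcbit(crc, c, poly)
    return crc

def crc16(data, init=0x0000, poly=0xA001):
    """Assumption: poly is the bit-reversed $8005, since the register shifts right."""
    crc = init
    for b in data:
        crc = crc_add_byte(crc, b, poly)
    return crc

print(hex(crc16(b"123456789")))   # CRC-16/ARC check value: 0xbb3d
```

Nothing in the model depends on the register width, so the same three steps should serve 5-bit, 8-bit, 16-bit or 32-bit CRCs just by changing the polynomial and the initial value.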
Comments
That sounds right.
If I understood it correctly, having 32 flip-flops to buffer the top level of the stack means that you gain at least two clock cycles to access the stack RAM and store/retrieve each pushed/popped item, while keeping direct access to the buffered top level in the meantime.
This also means that the stack RAM size could increase at will, limited only by physical space availability.
Based on the assumption that stack overflow/underflow is mostly a software concern, it means you could maintain two independent binary pointers that are only realigned (zeroed) at each cog start.
Also, because it's not a dual-ported FIFO meant for simultaneous read and write access, all the complexity of Hamming coded empty/full counters/detectors and synced read/write clocks could be avoided.
Hope I've figured it right!
P.S. Sorry, I didn't figure it right.
Stacks are LIFOs by nature, so a single write/read pointer suffices for accessing one.
If it is meant to be directly accessible only by call/push and ret/pop, its addressing pointer doesn't need to be initialized to any particular value at all.
Sometimes, drinking a lot of coffee has the same effect as drinking a lot of beer or wine. My fault.
Henrique
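A quick behavioural sketch of that idea in Python, in case it helps: the top of the stack lives in a register (the 32 flip-flops), everything below it lives in a RAM, and a single wrapping pointer addresses the RAM. The class and parameter names are invented for the example, and it says nothing about timing or the real Verilog. It only shows that the pointer's starting value does not matter as long as pushes and pops are balanced.

```python
# Behavioural model only; class and parameter names are hypothetical.

class BufferedStack:
    def __init__(self, depth=8, start_ptr=0):
        self.depth = depth
        self.ram = [0] * depth          # stack RAM below the top entry
        self.ptr = start_ptr % depth    # single read/write pointer, any start value
        self.top = 0                    # top of stack held in 32 flip-flops

    def push(self, value):
        # The old top spills into RAM; the new value lands in the top register.
        self.ram[self.ptr] = self.top
        self.ptr = (self.ptr + 1) % self.depth
        self.top = value & 0xFFFFFFFF

    def pop(self):
        # The top register is returned at once; RAM refills it afterwards.
        value = self.top
        self.ptr = (self.ptr - 1) % self.depth
        self.top = self.ram[self.ptr]
        return value

# Same results whether the pointer starts at 0 or at some arbitrary value.
for start in (0, 5):
    s = BufferedStack(depth=8, start_ptr=start)
    for v in (1, 2, 3):
        s.push(v)
    print([s.pop() for _ in range(3)])
```

Overflow and underflow simply wrap in this model and silently overwrite the oldest entries, which matches the assumption that they are a software concern.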
What does the CRC code look like now in PASM2?
So far, it looks like this:
I agree with jmg.
Adding more hardware stack depth is overkill. We already have indirection for LUT and CogRAM and we also have PTRA/B for HubRAM.
That looks really good.
After looking at your explanation above, I tried to make a more compact instruction, but the problem is we have three inputs and one working register:
NUMBITS (input)
DATA (input)
POLY (input)
CRC (working register)
This is much easier to implement with little logic if it's broken up like you've done above.
So, we need this CRCBIT instruction. I'm on it.
In playing with the same ideas, I had just written the same code:
Q would be shifted right on each CRCQ. NUMBITS is expressed by the number of contiguous CRCQs.
This would be very fast, anyway.