Here is a diagram of what is happening. CRC' is just the new value of CRC after the 1-bit addition.
With the CRCBIT instruction as I have described it, it is a 2-clock instruction that can be executed as each bit is being received/sent. Thus it is only one additional 2-clock instruction executed in the character assembly/sending loop.
However, for this to work in this instance, any internal registers used plus any interrupts etc must still be as per normal.
So perhaps there could be one or two internal registers (CRC and POLY), together with two instructions to load them, plus an instruction to accumulate the CRC from the C flag, plus an instruction to read the CRC register at the end?
' to use this...
SETPOLY ##$8005 ' initial value of internal POLY register
SETCRC #0 ' initial value of internal CRC register (could be #FFFF)
.....
REPS #1, #8 ' 2 instructions x 8 loops
SHR DATA, #1 WC ' C:=DATA[0]
CRCBIT ' accumulate 1-bit into CRC
GETCRC CRC16 ' read final CRC value
To utilise the instruction as each bit in a block of characters is received/sent...
' or as receiving/sending each bit...
SETPOLY ##$8005 ' initial value of internal POLY register
SETCRC #0 ' initial value of internal CRC register (could be #FFFF)
.....
CRCBIT ' accumulate 1-bit (in Carry) into CRC (maybe 8*block)
.....
GETCRC CRC16 ' read final CRC value
In theory, we could use SETQ and SETQ2 instructions to load the initial values of CRC and POLY. Then we would only require the GETCRC D instruction and a simple CRCBIT instruction using the Carry flag.
These could be used in conjunction with the NRZI helper instruction that I proposed for bit-banging USB, etc.
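The internal-register proposal above can be modelled in software to see how the per-bit accumulation would interleave with a receive loop. This is only a sketch under stated assumptions: the class and method names are hypothetical stand-ins for SETPOLY/SETCRC/CRCBIT/GETCRC, the per-bit step is the right-shifting form described later in this thread, and the polynomial is entered as $A001 (the bit-reversed form of $8005) because a right-shifting CRC needs the reflected polynomial.

```python
class CrcUnit:
    """Hypothetical software model of internal CRC and POLY registers."""

    def __init__(self, width=16):
        self.mask = (1 << width) - 1
        self.poly = 0
        self.crc = 0

    def setpoly(self, value):
        # Models SETPOLY: load the polynomial register.
        self.poly = value & self.mask

    def setcrc(self, value):
        # Models SETCRC: load the initial CRC value.
        self.crc = value & self.mask

    def crcbit(self, c):
        # Models CRCBIT: accumulate the single bit held in the carry flag.
        x = (c ^ self.crc) & 1
        self.crc >>= 1
        if x:
            self.crc ^= self.poly

    def getcrc(self):
        # Models GETCRC: read back the accumulated value.
        return self.crc


def receive_byte(unit, data):
    """Shift a byte out LSB-first (like SHR DATA,#1 WC) and feed each bit to CRCBIT."""
    for _ in range(8):
        c = data & 1
        data >>= 1
        unit.crcbit(c)


unit = CrcUnit()
unit.setpoly(0xA001)  # assumption: reflected form of $8005
unit.setcrc(0)        # could be 0xFFFF for other CRC-16 variants
for byte in b"123456789":
    receive_byte(unit, byte)
print(hex(unit.getcrc()))
```

With these assumed settings (the variant usually catalogued as CRC-16/ARC), the model yields 0xBB3D for the ASCII string "123456789", which is the commonly published check value for that variant.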
I think it will be much simpler to just use regular register conventions, like you first showed:
CRCBIT crc,poly
...Where C is used by CRCBIT. What could be simpler than that?
I made some instruction room for the CRCBIT instruction. I moved ZEROX and SIGNX to near TESTN/TEST, so they now have C and Z output. This opens up room where they used to be for the CRCBIT instruction. After making room, everything seems fine, so I'm going to sleep some and then implement the CRCBIT instruction.
If I understood it correctly, having 32 flip-flops to buffer the top level of the stack means that you gain at least two clock cycles to access the stack RAM and store/retrieve each pushed/popped item, with direct access to the buffered top level in the meantime.
This also means that the stack RAM size could increase at will, limited only by physical space availability.
Based on the assumption that stack overflow/underflow is mostly a software concern, it means you could maintain two independent binary pointers that are only realigned (zeroed) at each cog start.
Also, because it's not a dual-ported FIFO, meant for simultaneous read and write access, all the complexity of Hamming coded empty/full counters/detectors and synced read/write clocks could be avoided.
Hope I've figured it right!
P.S. Sorry, I didn't figure it right.
Stacks are LIFOs by nature, so a single write/read pointer suffices for accessing them.
If it is meant to be directly accessible only by call/push and ret/pop, its addressing pointer doesn't need to be initialized to any particular value at all.
Sometimes, drinking a lot of coffee has the same effect as drinking a lot of beer or wine. My fault.
Now that the stack has been widened to 32 bits, I expect more data to be put in it. Is 8 levels enough? I know it's not meant to be a general-purpose stack, but it does have the advantage of being a "standard location" for passing parameters and return values between third-party code.
The problem is that the stack is currently built from flipflops. To get any real size increase, we would have to go to a RAM. That might be a big change with OnSemi. I think we would have to buffer the top of the stack with 32 flipflops.
Right, it's a LIFO. Just one pointer would be needed. And the top could be buffered in flops to make it immediately accessible. It's too late to implement something like this, though. We've got OnSemi going with a fixed set of memories. Any deviation would probably cost more money.
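To make the idea discussed above concrete, here is a minimal behavioural sketch of a RAM-backed LIFO whose top entry is held in a separate register (standing in for the 32 flops) and which uses a single wrapping pointer. The class name, the default depth, and the `start_pointer` parameter are assumptions made for illustration; nothing here reflects actual P2 hardware.

```python
class BufferedLifo:
    """Hypothetical model: RAM-backed LIFO with a buffered top-of-stack register."""

    def __init__(self, depth=8, start_pointer=0):
        self.depth = depth
        self.ram = [0] * depth               # stands in for the stack RAM
        self.ptr = start_pointer % depth     # single read/write pointer
        self.top = 0                         # stands in for the 32 buffer flops

    def push(self, value):
        # Spill the current top into RAM, then latch the new value on top.
        self.ram[self.ptr] = self.top
        self.ptr = (self.ptr + 1) % self.depth
        self.top = value & 0xFFFFFFFF

    def pop(self):
        # The top is available immediately; refill it from RAM afterwards.
        value = self.top
        self.ptr = (self.ptr - 1) % self.depth
        self.top = self.ram[self.ptr]
        return value


# The pointer's starting value does not matter as long as pushes and pops balance.
for start in (0, 5):
    stack = BufferedLifo(depth=8, start_pointer=start)
    for item in (1, 2, 3):
        stack.push(item)
    print([stack.pop() for _ in range(3)])
```

Both runs pop the values in reverse push order, which illustrates the point that the pointer needs no particular initial value when only push/pop (or call/ret) can reach the stack.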
I don't think a bigger stack would be that useful unless you could also directly address elements other than the top. That would allow you to setup stack frames needed by most high level languages. Just being able to PUSH/POP wouldn't be that useful.
Chip,
I will have to check it out. I did post working spin P1 code.
There are MSB and LSB versions of CRC, plus inversion of bits, starting values, and inversions of the POLY. Ultimately they give the same results once you know how to correct the end result.
I am fairly sure what I gave you is correct.
Unfortunately I will not get time to verify today as we have a family Christmas party.
It's not a normal stack though, is it? It's a special tiny hardware thing that can overflow in a blink. If the PUSH/POP instructions for that hardware stack were double operand then I'd be saying those should be removed too.
We have PTRA/B for the real thing. And I'm sure ALTx instructions can do wonders for fast Cog side operations.
I moved the C/Z bits to 31/30 in all the CALLx and RETx instructions. Everything is 32-bits wide now. This meant adding 10 bits to each level of the hardware stack (80 flops per cog).
Chip, I note there are quite a number of instructions that do a 22-bit C+Z+addr store/restore.
They now do this {C, Z, 10'b0, PC[19:0]}.
Am I not seeing something?
Assuming we want consistency, I count 14 instructions as listed in the Prop2 Instructions v30:
Register addressing versions:
CALLD
CALL/CALLPA/CALLPB/RET
CALLA/RETA
CALLB/RETB
Oddly, your JMP also mentions loading of C & Z from the D register.
Immediate addressing instructions:
CALL
CALLA
CALLB
CALLD
The immediate addressed JMP doesn't mention C & Z.
The hardware stack PUSH/POP were on that list but will be removed now that they do a full 32-bit operation.
Isn't it the case that with USB, where you are getting bytes back, not bits, you'd like an efficient means to process a whole byte, and not want to piece it out, bit by bit?
Is there ever a case where the data to be CRC'd is not a multiple of 8 bits? I know the initial USB stuff is 5 bits, so it might warrant a bitwise CRC function, but isn't most everything else multiples of bytes?
Yes, a CRC instruction for 1 bit doesn't make much sense.
We already have a solution that calculates a CRC16 for a whole byte in 11 clock cycles: forums.parallax.com/discussion/comment/1369775/#Comment_1369775
A loop with 8 one-bit calculations will always be slower.
A CRC calculation in parallel would make sense: The instruction takes the byte and starts a calculation in hardware. After a minimum of 8 cycles you can read the result and add it to the current CRC sum.
Andy
Thanks, Andy. I'll investigate what it would take to do this in hardware, anyway, since it wouldn't require a big chunk of the LUT.
All those instructions now handle C/Z at bits 31/30, with a 10-bit empty space before the 20-bit address.
Thanks for making the list. I'll use it to review the code changes. I think I've got them all, already, but I'll verify.
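The {C, Z, 10'b0, PC[19:0]} layout described above can be sketched as a pair of pack/unpack helpers. The function names are hypothetical and this only models the bit positions stated in the thread (C at bit 31, Z at bit 30, ten zero bits, then the 20-bit address).

```python
def pack_return(c, z, pc):
    """Build the 32-bit word {C, Z, 10'b0, PC[19:0]}."""
    return ((c & 1) << 31) | ((z & 1) << 30) | (pc & 0xFFFFF)


def unpack_return(word):
    """Split a 32-bit word back into (C, Z, PC[19:0])."""
    return (word >> 31) & 1, (word >> 30) & 1, word & 0xFFFFF


word = pack_return(1, 0, 0x12345)
print(hex(word))
print(unpack_return(word))
```

For C=1, Z=0 and an address of $12345 this gives $80012345, with bits 29..20 left at zero.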
The reasoning behind requesting a single bit CRC instruction is the possibility of accumulating the CRC as each bit of the byte/block is transmitted/received.
Often there is insufficient time at the end of a block to perform the CRC calculation for the last/all bytes and get a reply out (ack or nak) in the required time. This is the case with USB and P1.
When I asked for this instruction 4+ years ago, I had spent a lot of time understanding the USB protocol. I previously spent many years designing hardware and writing synchronous communications in the 80's and 90's.
Please just leave it as a single-bit CRC instruction. A lookup table can be used if you want a byte-wide CRC calculation.
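As an illustration of the lookup-table alternative mentioned above, the following sketch builds a 256-entry table from the same one-bit step and checks that the byte-wise result matches the bit-by-bit result. The names are hypothetical, and the polynomial $A001 (reflected $8005) with a zero initial value is an assumption chosen only to have a concrete CRC16 to compare.

```python
POLY = 0xA001  # assumption: reflected form of $8005


def crcbit(crc, poly, c):
    """One-bit step: shift right, XOR in the polynomial if (c XOR crc[0]) is 1."""
    x = (c ^ crc) & 1
    crc >>= 1
    return crc ^ poly if x else crc


def crc16_bitwise(data, crc=0):
    """Accumulate the CRC one bit at a time, LSB of each byte first."""
    for byte in data:
        for i in range(8):
            crc = crcbit(crc, POLY, (byte >> i) & 1)
    return crc


# Each table entry is the result of running 8 one-bit steps on the index.
TABLE = []
for index in range(256):
    entry = index
    for _ in range(8):
        entry = crcbit(entry, POLY, 0)
    TABLE.append(entry)


def crc16_table(data, crc=0):
    """Accumulate the CRC one byte at a time using the lookup table."""
    for byte in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc


message = b"123456789"
print(hex(crc16_bitwise(message)), hex(crc16_table(message)))
```

Both functions return the same value, so the single-bit instruction and a byte-wide table are two routes to the same CRC; the difference is whether the work is spread across the bit times or done once per byte.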
As long as it is consistent, either way works for me.
Comments
I don't have any suggestions... I have no idea what you guys are talking about most of the time, but:
As God is my witness, I am going to have a massively parallel system to manage. I have no abstractions to offer, but I know you probably do.
This seems like the last possible moment to think about it.
Thanks,
Rich
I made the CRCBIT instruction as you described.
I'm not getting any results that match online CRC calculators.
Do you shift data in LSB-first or MSB-first?
I suppose for a byte, you do 8 bits. What if it's CRC16?
The online calculators seem to be shifting data left, not right. Any idea why things look this way?
I made this:
(1) X := C XOR D[0]
(2) D := D >> 1
(3) if X == 1 then D := D XOR POLY
The Verilog looks like this:
wire [31:0] crcbit = {1'b0, d[31:01]} ^ ({32{c ^ d[0]}} & s);
What do you think?
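The three steps and the Verilog expression above can be mirrored in a small software model to compare against an online calculator. This is a sketch, not a statement about the final instruction: the function names are hypothetical, and it assumes data bits are fed LSB-first. One possible explanation for a mismatch with calculators that shift left is that a right-shifting CRC needs the bit-reversed polynomial ($A001 rather than $8005 for CRC16).

```python
def crcbit(d, s, c):
    """Model of {1'b0, d[31:1]} ^ ({32{c ^ d[0]}} & s) on 32-bit values."""
    x = (c ^ d) & 1                      # (1) X := C XOR D[0]
    shifted = (d & 0xFFFFFFFF) >> 1      # (2) D := D >> 1
    mask = 0xFFFFFFFF if x else 0        # (3) XOR with POLY only when X == 1
    return shifted ^ (mask & s)


def crc_lsb_first(data, poly, init):
    """Feed every byte through crcbit, least significant bit first."""
    crc = init
    for byte in data:
        for i in range(8):
            crc = crcbit(crc, poly, (byte >> i) & 1)
    return crc


message = b"123456789"
print(hex(crc_lsb_first(message, 0xA001, 0x0000)))
print(hex(crc_lsb_first(message, 0xA001, 0xFFFF)))
```

With the reflected polynomial, an initial value of 0 gives $BB3D and an initial value of $FFFF gives $4B37 for "123456789"; these are the commonly published check values for the CRC-16/ARC and Modbus variants, so the right-shifting step agrees with calculators once the polynomial and start value are matched.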
Use this web tool with hex data and look at the Modbus result:
http://www.lammertbies.nl/comm/info/crc-calculation.html
Then you have to invert all the bits, and you send the LSByte first.
Here's a Microchip doc that has an overview on both hardware and software implementations:
CRC Generating and Checking
One thing I miss when converting P1 PASM programs is the JMPRET (JMP/CALL/RET) instruction.
It would be nice if it's possible to add an equivalent.
Thinking about handling COG and LUT, for starters, just handle COG. We are never going to be able to store the return address in LUT anyway.
What was decided on moving the CZ flags to bits 31:30 on the hardware stack?
Making it 32 bits keeps things consistent with the PUSHx/POPx hub stack operations.
Those take a lot of muxes to read. They are expensive, in a sense.
Not to mention the instruction encoding space restrictions.
That, too, and there are no unallocated registers from $1F0..$1FF.