The reasoning behind requesting a single-bit CRC instruction is the possibility of accumulating the CRC as each bit of the byte/block is transmitted/received.
Often there is insufficient time at the end of a block to perform the CRC calculation for the last/all bytes and get a reply out (ACK or NAK) in the required time. This is the case with USB and P1.
When I asked for this instruction 4+ years ago, I had spent a lot of time understanding the USB protocol. I previously spent many years designing hardware and writing synchronous communications code in the '80s and '90s.
Please, just leave it as a single-bit CRC instruction. A lookup table can be used if you want byte-wise CRC calculation.
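For anyone who hasn't seen the table approach: a minimal C sketch of a byte-wise, table-driven CRC16 (the CCITT polynomial 0x1021 is assumed here for illustration; the table-build loop is just the single-bit step done 256 times up front):
    #include <stdint.h>
    static uint16_t crc_table[256];
    /* Build the table once: each entry is the CRC of one byte value,
       computed with the ordinary single-bit step (MSB-first, poly 0x1021). */
    static void crc16_init(void)
    {
        for (int b = 0; b < 256; b++) {
            uint16_t crc = (uint16_t)(b << 8);
            for (int i = 0; i < 8; i++)
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                     : (uint16_t)(crc << 1);
            crc_table[b] = crc;
        }
    }
    /* Then each data byte costs one table lookup instead of eight bit steps. */
    static uint16_t crc16_update(uint16_t crc, uint8_t byte)
    {
        return (uint16_t)((crc << 8) ^ crc_table[((crc >> 8) ^ byte) & 0xFF]);
    }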
But it's the Smartpin logic that receives the single bits, so do you even have access to them?
At the end of a received packet there are 2 CRC bytes, which are not part of the CRC calculation for the packet data, so you should have enough time to calculate the last byte of the data while receiving the two CRC bytes. Then it's just a compare to decide if the CRC is correct and whether you have to send an ACK or a NAK.
On transmit you can calculate the CRC one byte ahead, while sending. So you have the CRC value ready when you reach the end of the data.
The case on P2 is totally different from P1, where you have to do every bit with bitbanging at the right time.
Andy
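In C-like terms, Andy's scheme looks something like the sketch below. get_byte() and crc_update() are hypothetical stand-ins for whatever the Smartpin/driver layer actually provides; the point is that byte n is folded into the CRC during the byte time of byte n+1:
    #include <stdint.h>
    extern uint8_t  get_byte(void);                  /* hypothetical: blocks one byte time */
    extern uint16_t crc_update(uint16_t, uint8_t);   /* table or bitwise, as above */
    uint16_t receive_packet(uint8_t *buf, int len)   /* len = data bytes, excl. the 2 CRC bytes */
    {
        uint16_t crc = 0xFFFF;                       /* init value is protocol-dependent */
        uint8_t prev = buf[0] = get_byte();
        for (int i = 1; i < len; i++) {
            buf[i] = get_byte();                     /* wait for byte i...            */
            crc = crc_update(crc, prev);             /* ...and CRC byte i-1 meanwhile */
            prev = buf[i];
        }
        /* The last data byte folds in while the two CRC bytes are still
           arriving, leaving only a compare before the ACK/NAK deadline. */
        crc = crc_update(crc, prev);
        return crc;
    }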
I don't know if it matters for the bit/byte crc discussion, but there are 12-bit CRCs out there, and the one I worked with couldn't be implemented with a table.
Interesting. That would need some hardware, then.
I'm still thinking about how to approach this.
Doing 8 bits at once might take too many gates and too much time, but I'm thinking that 4 bits at once might be a good balance. That would accommodate a 12-bit CRC gracefully.
There could be two 2-clock CRC instructions:
CRCBIT crc,poly 'use C as the input
CRCNIB crc,poly 'use Q[31:28] as the input, Q shifts left by 4
CRCNIB is shielded from interrupts, as is SETQ. Here is a 12-bit CRC operation:
SHL data,#32-12
SETQ data
CRCNIB crc,poly
CRCNIB crc,poly
CRCNIB crc,poly
...and to make a 5-bit CRC operation...
SHL data,#32-4 WC
CRCBIT crc,poly
SETQ data
CRCNIB crc,poly
...and an 8-bit CRC operation in 8 clocks...
SHL data,#32-8
SETQ data
CRCNIB crc,poly
CRCNIB crc,poly
How about that? It might seem a little piecemeal, but it adds no flipflops.
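To make the mechanics concrete, here is one plausible C model of the two proposed instructions. The CRC shift direction and the order in which CRCNIB consumes Q[31:28] are assumptions on my part (the thread doesn't pin them down); the point is how SETQ stages the data and how a nibble step composes from four bit steps:
    #include <stdint.h>
    /* CRCBIT crc,poly with the data bit in C, modeled here as a
       right-shifting (reflected) step: */
    static uint32_t crcbit(uint32_t crc, uint32_t poly, uint32_t c)
    {
        uint32_t fb = (crc ^ c) & 1;      /* feedback = C XOR crc LSB */
        crc >>= 1;
        return fb ? (crc ^ poly) : crc;
    }
    /* CRCNIB crc,poly: consume Q[31:28] (assumed Q[31] first), then Q <<= 4. */
    static uint32_t crcnib(uint32_t crc, uint32_t poly, uint32_t *q)
    {
        for (int i = 0; i < 4; i++) {
            crc = crcbit(crc, poly, *q >> 31);
            *q <<= 1;
        }
        return crc;
    }
    /* The 12-bit sequence above, in this model: */
    static uint32_t crc12(uint32_t crc, uint32_t poly, uint32_t data)
    {
        uint32_t q = data << (32 - 12);   /* SHL data,#32-12 + SETQ data */
        crc = crcnib(crc, poly, &q);      /* three CRCNIBs = 12 bits     */
        crc = crcnib(crc, poly, &q);
        crc = crcnib(crc, poly, &q);
        return crc;
    }
The 5-bit variant works the same way: the SHL ... WC parks the odd leading bit in C for one CRCBIT, and the remaining nibble goes through CRCNIB.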
Could overflow of the 8-deep stack be made to trigger an interrupt?
What are people expecting to use the hardware stack for? I don't think it will be used by either C or Spin, since it isn't big enough and doesn't provide the flexibility to set up stack frames. It will certainly be good for temporaries, but then I would think a depth of 8 would be sufficient. Also, any code that uses it would have to disable interrupts if the interrupt service routines also use the HW stack.
I do use call and ret though, which use the stack. I think I've gotten to about 5 deep in the stack. Maybe I can't get to 8, but it is something that I feel I have to keep in the back of my mind...
I'm still a bit fuzzy on calculating CRC5 and CRC16 in byte chunks, since there can be an odd number of bytes (and bits, in the case of token and start-of-frame packets) of data to calculate. There will be a need for additional "house-keeping" code to ensure that the accumulated CRC value stays within the domain of the polynomial, right? Sorry if this is a stupid question, but my math sucks
For a time-wise comparison using lookup tables: CRC16 takes 13 clocks/byte, and CRC5 takes ~32 clocks for the 11-bit token and start-of-frame data.
Also, contrary to my earlier statement, USB CRC calcs are done LSb->MSb
The CRCBIT instruction will handle the various CRC versions. And the LSb-first order wouldn't be a problem: just use a REV instead of a SHL.
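On the odd-bit-count worry: a bit-at-a-time CRC needs no housekeeping at all; you simply run one bit step per bit, eleven times for a token. A minimal C sketch of USB-style CRC5, assuming the usual published USB parameters (poly x^5+x^2+1, i.e. 0x14 in reflected form, init 0x1F, data LSb first, result complemented):
    #include <stdint.h>
    /* CRC5 over the 11 bits of a USB token (address + endpoint), LSb first. */
    static uint8_t usb_crc5(uint16_t token11)
    {
        uint8_t crc = 0x1F;                       /* init: all ones            */
        for (int i = 0; i < 11; i++) {            /* 11 bits, no padding games */
            uint8_t fb = (uint8_t)((crc ^ (token11 >> i)) & 1);
            crc >>= 1;
            if (fb)
                crc ^= 0x14;                      /* reflected x^5+x^2+1       */
        }
        return (uint8_t)((crc ^ 0x1F) & 0x1F);    /* USB sends the complement  */
    }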
Remember that there are two mainstream CRC16s in common use...
The original IBM version and the CCITT version. Then Microcom implemented the MNP protocol, but because they misunderstood how the CRC16 worked, they implemented it incorrectly - a bug! But it became yet another standard, although it used one of the two CRC16 polynomials.
The variations are mainly to do with the initial value, MSB or LSB first, whether the result is inverted, and which byte (for a 16-bit CRC) comes first. All these variations are covered by selecting the polynomial and the initial value, reversing the result, and inverting the result. So you can calculate all these CRCs by the one method.
Can these CRC instructions accommodate these differences?
I think that, at the heart, they are all the same. You may have to pre-reverse your data if it is LSB first.
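As an illustration of that point, here is a hedged C sketch of the usual parameterized model: one bitwise routine, with the variant differences reduced to parameters. The example values in the trailing comment are the commonly published ones and worth double-checking against the relevant spec:
    #include <stdint.h>
    static uint16_t reflect16(uint16_t v)
    {
        uint16_t r = 0;
        for (int i = 0; i < 16; i++)
            r = (uint16_t)((r << 1) | ((v >> i) & 1));
        return r;
    }
    /* Generic bit-at-a-time CRC16: polynomial, init, reflect-in/out, final XOR. */
    static uint16_t crc16(const uint8_t *p, int n, uint16_t poly, uint16_t init,
                          int refin, int refout, uint16_t xorout)
    {
        uint16_t crc = init;
        while (n--) {
            uint8_t b = *p++;
            for (int i = 0; i < 8; i++) {
                /* take bits LSb first when reflected, MSb first otherwise */
                uint16_t bit = refin ? (uint16_t)((b >> i) & 1)
                                     : (uint16_t)((b >> (7 - i)) & 1);
                uint16_t fb  = (uint16_t)(((crc >> 15) ^ bit) & 1);
                crc = (uint16_t)(crc << 1);
                if (fb) crc ^= poly;
            }
        }
        if (refout) crc = reflect16(crc);
        return (uint16_t)(crc ^ xorout);
    }
    /* e.g. CRC-16/ARC (IBM):  crc16(p,n, 0x8005, 0x0000, 1,1, 0x0000)
            CRC-16/XMODEM:     crc16(p,n, 0x1021, 0x0000, 0,0, 0x0000) */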
On the Prop1, if you have a way to get each bit into a flag already, you can compute a CRC with only 2 additional instructions per bit. For example, if you have the bit in question stored in Z you can do
shr crc, #1 wc
IF_C_NE_Z xor crc, crc_poly
Jonathan
That's a nice way to do it. Almost makes a CRC instruction look silly.
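For reference, here is the C equivalent of those two P1 instructions, with the data bit d in {0,1}. Note that crc_poly must already be the bit-reversed (reflected) polynomial, since the register shifts right:
    #include <stdint.h>
    /* shr crc,#1 wc            -> C gets the old LSB of crc
       if_c_ne_z xor crc,poly   -> XOR in the poly when C differs from the data bit */
    static uint32_t crc_step(uint32_t crc, uint32_t crc_poly, uint32_t d)
    {
        uint32_t c = crc & 1;
        crc >>= 1;
        if (c != d)
            crc ^= crc_poly;
        return crc;
    }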
It's only silly if you have the time for the extra instructions. That is my point.
However, it's the same deal with those other TJx and DJx instructions.
I have a similar NRZI instruction request but I haven't had time to re-verify my request of years ago.
The combination of the NRZI and CRC is essential for bit-banging FS USB or other similar protocols.
While we do have USB with SmartPins, it's possible there are other protocols where the specific SmartPins functions won't work. For the little work/silicon involved, it seems prudent to have the flexible options these two instructions would provide. I would not have asked if these two weren't imperative for the bit-bang approach. This could also cover any shortcomings/bugs in the SmartPins, should we be unlucky enough to find some.
I got the CRC worked out. We have CRCBIT (uses C) and CRCNIB (uses Q[31:28], shifts Q left by 4 bits, shields interrupts). It generates correct results! You can do bits or nibbles at a time. Nibble operations can be stacked, handling four bits every two clocks.
While testing the new SD driver together with improved Ethernet drivers (W5500 block speeds 6x faster = 1 MB/s), I suddenly ran into a problem with my PNut-compiled kernel itself. It started acting strange: it wouldn't load Forth code, and it would keep switching back to binary input mode even though there wasn't anything there telling it to. The exact same version had worked well earlier in the day, so I tried to track down the bug, which made no sense until I remembered the extra nop before the coginits that I had used before but had now disabled. As soon as I enabled the nop the problem went away, and as soon as I disable it the problem reappears.
But the exact same version on the exact same V29 has been working. It really seems to be some marginal timing problem, perhaps only with the A9s, but I have had funny problems for no real reason on previous FPGA versions, as you know. I will try to pin down exactly what is happening while it is still playing up, but I think it has something to do with the cogid I do with every coginit that runs from hubexec, so I will look there.
EDIT: I moved the nop to just before the cogid in hubexec and the problem went away. Here are the sections of code in question:
dat
orgh 0
org
clkset #$FF 'switch to 80MHz (if pll, else 50MHz)
reboot
' nop ' seems to need delay after clkset (otherwise next coginit ids incorrectly)
coginit #7,#@RESET
coginit #6,#@RESET
coginit #5,#@RESET
coginit #4,#@RESET
coginit #3,#@RESET
coginit #2,#@RESET 'vgarun (when DACs are made available)
coginit #1,#@rxcog
coginit #0,#@RESET
org 0
RESET call #INITCOG ' run non-time critical init from hubexec
jmp #doNEXT
'***************************************** HUB CODE ***************************
'
dat
orgh
version long vernum,vertime
vername byte "V28 BOOT"
INITCOG
loc PTRA,#@IDLE
nop ' !!! added the nop here instead and the bug is gone
cogid X 'only cog 0 uses the serial port by default
tjnz X,#INITSTKS
I suspect that this bug has something to do with the hub FIFO interface underflowing in some cases.
I will make a new version which adds a few levels to the FIFO buffer and I will increase the 'full' level. We'll see if this fixes the problem.
I had made a FIFO simulator for the 16-cog version that revealed how many levels were needed. Perhaps my reduction for 8 cogs was too simplistic, or my simulation model was incomplete.
I am interested to know what happens from the prior branch (which resets the FIFO) to the failure point.
Are there any Prop123-A9 boards left for sale? (And if yes: how do I get one to Germany?) Would be great!
.... until I remembered the extra nop before the coginits that I had used before but was now disabled. As soon as I enabled the nop the problem went away, and as soon as I disable the problem reappears.
... EDIT: I moved the nop to just before the cogid in hubexec and the problem went away. Here are the sections of code in question:
Checking what you are saying here - there are (at least?) two locations where a NOP can fix the problem, and either one works?
Both seem to prefix COGxx opcodes? One comes right after CLKSET.
If the issue were only CLKSET-related, should the 2nd NOP have any effect?
What is the exact relative timing of those two lines, as in how many sysclks separate them?
We have some in stock. They are part #60065. They are $475.00.
If you email Chantal at Parallax, she can get the order going: cwoods@parallax.com.
Welcome aboard!!!
I have some further information about that startup bug I'm seeing. Pretty much adding a nop almost anywhere in the startup path fixes the problem. Now if I add three nops instead of one the problem comes back.
INITCOG
nop
nop
nop
loc PTRA,#@IDLE ' default startup into Instruction Pointer
cogid X
tjnz X,#INITSTKS
My console prompt includes a radix symbol which should be # for decimal but this changes to something else as a symptom of the problem.
TAQOZ#
may instead startup as:
TAQOZ%
for instance, but that is only a symptom.
In your post above, does the COGID return the wrong value?
INITCOG
cogid X
nop
wrbyte X,#$0F0
loc PTRA,#@IDLE ' default startup into Instruction Pointer
cogid X
tjnz X,#INITSTKS
Then I examine location $F0, which should be 0 for the console on cog 0, but it's 2. So I comment out the nop and I get a value of 3. This sure is interesting...
edit: I think maybe that test is not that reliable, since it relies upon the streamer and whatever sequence it is in for that cog, so whatever comes last wins; I will try something a little different.
Whoa!!!
Maybe COGID is being stalled by a hubexec fetch, winding up with a stale ID from the hub. I think this is it.
Let me recompile...
Maybe I made a mistake in writing the value to the same location in hub, since it relied upon the streamer. So I wrote to the cog's RAM instead with this code, which produces the radix prompt symptom but indicates that it is cog #0. However, I have a feeling that another cog also thinks it is supposed to be the console cog, so I will check further.
INITCOG
nop
cogid X
mov clkdly,X
or clkdly,#$80
loc PTRA,#@IDLE ' default startup into Instruction Pointer
tjnz X,#INITSTKS
Even though the prompt shows $ instead of #, it seems it correctly registered as cog #0 (OR'd with the $80 check pattern) at Tachyon internal register location #28 (clkdly).
TAQOZ$ #28 COG@ .BYTE 80 ok
The radix or base is stored in "user registers" for each cog in hub RAM, but each user register area should be unique. The base for cog #0 is at $800.
It reports correctly for all cogs except cog #1, which is a serial cog. But this sure looks funny, because they should all remain zeroed. So leave it to me as I will check further; cogid seems to be fine.