Another approach could be to code USB for 1.5MHz, and focus on compact block 'macros' that do the Dual-Bit, and DeStuff operations, that Chip can then turn into opcodes that can allow a jump to 12MHz ?
Because a code picture is worth 1000 words (applying that rule backwards), here is some (untested) Verilog
for the suggested DeStuff-Jump opcode:
// DS_Shift_JEQ8 Reg,Adr Opcode => 3 Fields mapped into active Register, does Skip and Count, and exits on 8 valid bits
// Q, CLK and Din are assumed to be declared by the enclosing module
reg [9:0] ShiftB; // can be as small as 7b, but 10b is full raw copy, might be useful ?
reg [7:0] DataBY;
reg [2:0] BitCtr;
reg DoSwallow, JExit, Z_Flag_Err; // these were used without a declaration
assign Q[31:22] = ShiftB[9:0]; // Field 3 = copy Shifter, No skip
assign Q[18:16] = BitCtr[2:0]; // Field 2 = Bit Counter, skips, exit on last-Rx-bit
assign Q[7:0] = DataBY[7:0]; // Field 1 = RxByte bits, skips
always @(*) // the transmitter inserts a 0 after six (USB) sequential 1's
begin
  DoSwallow = (ShiftB[5:0] == 6'b111111); // next bit is skipped
end
always @(*) // depends on DoSwallow as well as BitCtr
begin
  JExit = (BitCtr == 3'b111) & !DoSwallow; // this clock edge is LAST shift/inc, so exit too
end
always @(posedge CLK)
begin
  Z_Flag_Err <= (ShiftB[6:0] == 7'b1111111); // Store overflow, optional.
  ShiftB <= {ShiftB[8:0],Din}; // live raw bit pattern, no skip
  if ( DoSwallow ) begin // .CE Hold Ctr, No Shift == Skip this Stuff-bit
    BitCtr <= BitCtr;
    DataBY <= DataBY;
  end
  else begin // VALID bit, so INC and Do Shift
    BitCtr <= BitCtr + 1;
    DataBY <= {DataBY[6:0],Din};
  end
end
Can be used rolled, or unrolled, but needs inversion on JUMP, or it could be patched into SerDes, with PairSample, for USB_ByteRX
' Pseudo ASM code, 2 USB helper opcodes.
  Destuff_d = 0 ' Init all 3 fields and Z
'Start Loop:
  PairSampleOpcode ' or can include the Jump ?
  JumpIf_SE0
  DS_Shift_JEQ8 Destuff_d, ByteDone ' updates 3 fields, and jumps if Field_N_Bits = 8 bits, Sets Z on Error.
  JNZ Destuff_ERR ' check if last DeStuff had an error
  PairSampleOpcode
  JumpIf_SE0
  DS_Shift_JEQ8 Destuff_d, ByteDone ' updates 3 fields, and jumps if Field_N_Bits = 8 bits, Sets Z on Error.
  JNZ Destuff_ERR ' check if last DeStuff had an error
  .. repeat unrolled for 10? PairSampleOpcode
ByteDone: ' 8 pin samples with no destuff, 9 or 10 pin samples with 1 or 2 skips
  WrByte
After DS_Shift_JEQ8 jumps, it has 3 fields in register : lower 8 bits = valid USB data, 3 mid bits as counter (000 on exit) and 10 upper bits as USB raw copy.
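For anyone who wants to poke at the behaviour without a simulator, here is a rough Python model of the DeStuff logic described above. It is only a sketch: the function name `ds_shift_model` is made up, the error flag is made sticky for convenience (the Verilog flag is not), and Din is assumed to be the already NRZI-decoded bit.

```python
def ds_shift_model(raw_bits):
    """Behavioural sketch of the DS_Shift_JEQ8 idea (hypothetical helper).

    raw_bits: iterable of 0/1 values, one per USB bit time, already NRZI-decoded.
    Returns (list_of_bytes, error_flag).
    """
    shift = 0      # ShiftB: raw history, newest bit in bit 0
    data = 0       # DataBY: valid (destuffed) bits only
    ctr = 0        # BitCtr: counts valid bits modulo 8
    err = False    # sticky here; Z_Flag_Err in the Verilog is re-evaluated each clock
    out = []
    for din in raw_bits:
        swallow = (shift & 0x3F) == 0x3F      # six raw 1s seen: this bit is a stuff bit
        if (shift & 0x7F) == 0x7F:            # seven raw 1s in a row: stuffing violation
            err = True
        if not swallow:
            # Same shift direction as the Verilog: first bit ends up in bit 7.
            # USB sends LSB first, so a real receiver would shift the other way.
            data = ((data << 1) | din) & 0xFF
            ctr = (ctr + 1) & 7
            if ctr == 0:                      # 8 valid bits collected (the JExit case)
                out.append(data)
        shift = ((shift << 1) | din) & 0x3FF  # raw copy always shifts, no skip
    return out, err
```

Feeding it six 1s, a stuffed 0, then 1 and 0 gives a single byte from nine pin samples, which is the "9 samples with 1 skip" case in the pseudo code.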
Every pin has a 1.5k, a 10k and a 100k resistor, each of which can be configured as a pullup or pulldown. In addition, every pin has a comparator that can compare the levels of two pins and output the difference. Some of this was implemented especially for USB years ago, when Chip asked what pin hardware is needed for USB.
I don't know what the comparator outputs when both pins are low. Can we detect this state reliably? We need that to detect the SE0 (end of packet) state. On the other hand, I have never seen a software USB solution detect an SE1 state as an error case inside the bit receive loop.
For differential output on two pins it is as simple as:
XOR OUTx,MaskDmp
where MaskDmp is the pinmask for D- and D+ pins.
Andy
Thanks Andy. That 1k5 pullup is what is required on the D+ pin for FS (1k5 pullup on D- for LS). That saves a pin.
I think you are correct that SE1 is usually not checked at each bit time. SE1 IIRC continues for some time, so it's not a real problem.
I wasn't worried about the tx side because it is so much easier to do than rx. Basically if we can do rx then we can do tx. The real P2 will run at least 2x the fpga speed so we will be in a much better position when the real silicon is ready. However, for now if it takes 2 cogs that's fine by me. At least we can get something working enough to prove no further instructions/logic is required. I am fairly certain that the 2 instructions I asked for will be of sufficient help.
SERDES should be able to tx anyway - all we need to do is be able to set the no of bits to be sent and pre-do any bitstuffing into the output buffers.
We can always resort to a lookup table for CRC16 but by being able to calculate it for each bit as it is read/written pretty much solves this issue easily.
Sure we may be able to make serdes help, but first I want to understand the precise instructions required to satisfactorily perform the rx by sw bit reading. Then I can look at the top level protocol for endpoints etc. This is the part I don't yet understand although I have seen example code.
BTW 10K pulldowns will most likely work for USB Master. 10K pullups will work for PS2, I2C and lots of other cases. So these internal pullups/pulldowns are going to be a great help to minimise hw.
On the topic of CRC, the code above for 3 fields, could (just) pack to 4, to include CRC16. (not sure about CRC-5-USB ? - operand bit ?)
DataByte:8 CRC:16 BitCtr:3, leaves 5 for LiveBits, ok if register DoSwallow
There may be some alignment that allows init of BitCtr and LiveBits, without clobbering CRC, and still give Byte read-off.
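To make the 4-field packing concrete, here is a small Python sketch. The bit positions are purely an assumption for illustration (the posts above do not fix an alignment): DataByte in bits 7:0, BitCtr in 10:8, LiveBits in 15:11 and CRC16 in 31:16. With that layout, clearing the low word re-inits BitCtr and LiveBits without clobbering the CRC.

```python
# Hypothetical field layout, chosen only for illustration.
DATA_SHIFT, DATA_MASK = 0, 0xFF      # DataByte:8
CTR_SHIFT, CTR_MASK = 8, 0x7         # BitCtr:3
LIVE_SHIFT, LIVE_MASK = 11, 0x1F     # LiveBits:5
CRC_SHIFT, CRC_MASK = 16, 0xFFFF     # CRC:16

def pack(data, ctr, live, crc):
    """Pack the four fields into one 32-bit register value."""
    return ((data & DATA_MASK) << DATA_SHIFT
            | (ctr & CTR_MASK) << CTR_SHIFT
            | (live & LIVE_MASK) << LIVE_SHIFT
            | (crc & CRC_MASK) << CRC_SHIFT)

def unpack(reg):
    """Return (data, ctr, live, crc) from a packed register value."""
    return ((reg >> DATA_SHIFT) & DATA_MASK,
            (reg >> CTR_SHIFT) & CTR_MASK,
            (reg >> LIVE_SHIFT) & LIVE_MASK,
            (reg >> CRC_SHIFT) & CRC_MASK)

def reinit_keep_crc(reg):
    """Clear DataByte, BitCtr and LiveBits; leave the CRC field untouched."""
    return reg & (CRC_MASK << CRC_SHIFT)
```

The byte read-off is then just the low 8 bits of the register, which is the alignment hinted at above.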
Perhaps the above op should be called
GETXP [#]D [WZ],[WC] 'pin into !Z via WZ, xor pin into C via WC (similar to GETP & GETNP)
Just a bit more info: the bit-banging USB FS RX sequence for each bit is currently...
        waitcnt time, bittime       ' wait for next mid-bit sample time
        test    K, ina        wz    ' read usb pin
        muxz    bits, bitmask wc    ' b30 (mux mask for rx inbound xor register)
        shl     bits, #1            ' shift new xor'd in bit to b31 (to prev bit)
        test    JK, ina       wz    ' SE0 ? (ie EOP ?)
  if_z  jmp     #waitforend         ' y: wait for end
        rcl     data, #1            ' accumulate bit into data byte
        rcl     stuff, #6     wz    ' accumulate 6 bit blocks. If zero we need to unstuff next bit
' There is no time to accumulate the crc16 here. A special 1bit crc instruction as suggested in the first post would help here.
  if_z  call    #unstuff
If the special instruction did the following...
GETUSB [#]D WZ,WC where
D = pin no (0..127)
C = C XOR PINx
Z = ! ( PINx OR PINy ) 'ie Z is set if both PINx and PINy are ZERO; PINy = PINx XOR #1
Note1: PINx and PINy are a pair of pins. If PINx is even then PINy := PINx + 1 else if PINx is odd then PINy := PINx - 1
- The allowance for the PINx/PINy pair to be reversed is for USB LS & FS where J/K are effectively swapped between D-/D+.
Note2: WZ & WC could be permanently set on if required.
This instruction would permit the above bit-banging code sequence to be reduced to (replaces 4 instructions)...
        waitcnt time, bittime       ' wait for next mid-bit sample time
        getusb  K             wz,wc ' C has prev bit; C = C XOR PIN; Z = !(PIN OR PIN+/-1) = both pins of the pair are zero
  if_z  jmp     #waitforend         ' y: wait for end
        rcl     data, #1            ' accumulate bit into data byte
        rcl     stuff, #6     wz    ' accumulate 6 bit blocks. If zero (6 zero bits) we need to unstuff next bit.
' There is no time to accumulate the crc16 here. A special 1bit crc instruction as suggested in the first post would help here.
  if_z  call    #unstuff
As you can see, a new single bit 1 clock CRC instruction would help immensely too.
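As a sanity check on the flag definitions, the proposed GETUSB semantics can be written as a few lines of Python. This is only a model of what is described above (the name `getusb` and the use of an integer bitmap for the pin levels are assumptions), not a statement of what the silicon will do.

```python
def getusb(c, pins, d):
    """Model of the proposed GETUSB flags.

    c    : current C flag (0 or 1)
    pins : integer bitmap of pin levels (bit n = level of pin n)
    d    : pin number of PINx; PINy is the other pin of the even/odd pair
    Returns (new_c, z).
    """
    pin_x = (pins >> d) & 1
    pin_y = (pins >> (d ^ 1)) & 1     # PINy = PINx XOR #1 (pair partner)
    new_c = c ^ pin_x                 # C = C XOR PINx
    z = int(not (pin_x | pin_y))      # Z set only when both pins are low (SE0)
    return new_c, z
```

Passing the odd pin of the pair as D swaps which line is sampled into C, which is the LS/FS reversal mentioned in the notes.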
Here is a working USB CRC5 generation for reference...
' initialisation first
        mov     data, xxxxx         ' get the 5bit data
        and     data, #$1F          ' just in case
        mov     count, #5           ' 5 bits
        mov     crc5, xxxxx         ' preset crc5 register
' calculate CRC5
:loop   mov     temp, data          ' get copy of data bits left to process
        xor     temp, crc5          ' lsb of data xor crc5 required below
        shr     temp, #1      wc    ' result of data[lsb] xor crc5[lsb] from above
        shr     data, #1            ' shift input data
        shr     crc5, #1            ' shift crc5
  if_c  xor     crc5, #$14          ' crc5 polynomial = $14 = %10100
        djnz    count, #:loop
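For cross-checking the loop above on a PC, here is the same shift/XOR structure in Python. It is a sketch under the usual USB assumptions (LSB-first bits, register preset to all 1's, result inverted before sending); the function names are made up for this example.

```python
CRC5_POLY_REFLECTED = 0x14   # same constant as the PASM loop ($14 = %10100)

def crc5_update(crc, data, nbits):
    """Run nbits of data (LSB first) through the CRC5 register."""
    for _ in range(nbits):
        feedback = (data ^ crc) & 1   # data[lsb] XOR crc5[lsb]
        data >>= 1
        crc >>= 1
        if feedback:
            crc ^= CRC5_POLY_REFLECTED
    return crc

def crc5_usb(field, nbits=11):
    """CRC5 as it would be transmitted for an 11-bit token field."""
    return crc5_update(0x1F, field, nbits) ^ 0x1F
```

For an all-zero 11-bit field, working the loop through by hand leaves the register at $1D, so the transmitted CRC5 is $02.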
Analysing the CRC, the breakdown for a single bit is (can someone please verify this is correct)...
' C has the single data bit to be accumulated into the CRC5 register
' POLY stores the polynomial
' COUNT stores the number of CRC bits in the CRC algorithm
        rcl     temp, #1            ' put C into bit 0
        xor     temp, crc5          ' xor the lowest bit of crc5
        and     temp, #1      wz    ' and put result in Z
        shr     crc5, #1            ' CRC5 >> 1
  if_nz xor     crc5, poly          ' if BIT XOR CRC5[0] = 1 then CRC5 XOR POLY
Provided the above is correct then a new special instruction could do the following... (This is slightly different to my proposal for the instruction in the earlier post)
WARNING: There is at least something wrong with the CRC generation below as it does not conform with the block diagram above. Maybe it is just reversed LSB/MSB but I am not sure yet. Can anyone help get this right???
CRCBIT D where
  D = CRCn cog register
  C = the input bit
and two internal registers POLY and COUNT (set by special instructions, or else ACCA & ACCB could be used) are
  POLY = The polynomial (up to 32 bits, unused bits zero) (could be ACCA)
  COUNT = The number of bits in the CRC generation (or a mask???) (could be ACCB)
The instruction would perform the following (can someone please check)...
  if (C XOR D[0]) == 1 then
    D >> 1
    D XOR POLY
  else
    D >> 1
  endif
I cannot see the use for the COUNT (number of bits in the CRC) other than at the end of the whole CRC calculation where an AND mask would extract the relevant bits. If this is correct, then COUNT would not be required. What am I missing?
Now the resulting code would become...
' Note: The internal register(s) POLY and COUNT would be previously set, as would the user's CRCn register
        waitcnt time, bittime       ' wait for next mid-bit sample time
        getusb  K             wz,wc ' C has prev bit; C = C XOR PIN; Z = !(PIN OR PIN+/-1) = both pins of the pair are zero
  if_z  jmp     #waitforend         ' y: wait for end
        rcl     data, #1            ' accumulate bit into data byte
        rcl     stuff, #6     wz    ' accumulate 6 bit blocks. If zero (6 zero bits) we need to unstuff next bit.
        crcbit  CRC                 ' C has data bit; POLY has polynomial; COUNT (if reqd) has no.of.bits/mask; accumulate the CRC
  if_z  call    #unstuff
So the new CRCBIT instruction would replace at least 4 instructions.
Attached is a simple spin program for the P1 to calculate any CRC.
There are various polynomials, number of bits, lsb/msb first, preset crc initial value, xor final value, send LSB/MSB crc byte first.
But a general purpose CRC is better.
Would some of you please test/modify this program and check it works?
What I would like to do is ask Chip for a single-bit CRC instruction for the P2. IMHO the best format for this would be that the data bit would be in the C flag. Because we only have P2 instructions available with a single operand [#]D style, I thought that the polynomial could be written to ACCA (or perhaps ACCB?) and that D would point to the CRC register in cog memory.
This is the CRC calculation in spin for a byte...
d := DATA & $FF
repeat i from 0 to 7
  c := (d ^ crc) & $01 ' data bit 0 XOR crc bit 0
  d := d >> 1 ' data >> 1
  crc := crc >> 1 ' crc >> 1
  if c
    crc := crc ^ poly ' if c==1: crc xor poly
This is a possible P2 CRC bit accumulate instruction format...
CRCBIT D
where D = CRC Register, C = current data bit, ACCA = polynomial
The CRCBIT instruction performs the following...
(1) X := C XOR D[0]
(2) D := D >> 1
(3) if X == 1 then D := D XOR ACCA
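The three steps above translate directly into a few lines of Python, which makes it easy to check the proposal against a known CRC. This is a sketch: `crcbit` models the proposed instruction, and `crc16_usb` is a made-up wrapper that assumes the usual USB conventions (reflected polynomial $A001, preset to $FFFF, final inversion, LSB-first bits).

```python
def crcbit(d, c, acca):
    """Model of the proposed CRCBIT: d = CRC register, c = data bit, acca = polynomial."""
    x = c ^ (d & 1)      # (1) X := C XOR D[0]
    d >>= 1              # (2) D := D >> 1
    if x:                # (3) if X == 1 then D := D XOR ACCA
        d ^= acca
    return d

def crc16_usb(data):
    """Accumulate CRC16 over a byte string, one CRCBIT call per bit (LSB first)."""
    crc = 0xFFFF
    for byte in data:
        for i in range(8):
            crc = crcbit(crc, (byte >> i) & 1, 0xA001)
    return crc ^ 0xFFFF
```

With those assumptions, the standard test string "123456789" should give $B4C8, the usual check value for the USB flavour of CRC16.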
The idea is that for bit-banging, the CRCBIT instruction would be called for each bit sent/received, and the bit would already be in C.
I expect CRCBIT should be capable of being a 1 clock instruction.
So, to accumulate an 8 bit byte (disregarding any reversals and initialisation) the following could be used...
This would take 2+16 clocks per byte, or for 4 bytes in a passed long 2+64 clocks.
Implementing an atomic CRC instruction would be easy to do and a good use of resources. Let's do it, along with the special pin instructions to facilitate USB. These are really good ideas that result in almost no silicon growth, but will cut bit-period processing requirements in half for many protocols.
I suggested a possible version of the CRCBIT instruction (haven't found the thread/posts yet) where we could use the instruction to calculate any polynomial.
However, when Chip looked at the complexity it would take too much silicon. Therefore I suggest we look at the possibility of just 2 polynomial options, those being the two common CRC16 - the IBM/USB and CCITT. The xmodem variant of the CCITT is easily done on the initial and final CRC value by sw. As I have said, I don't think we need CRC5 for USB as we can precalculate most of the CRC5 including our USB address, so it's quite simple. It may be worthwhile to see just what gates are involved.
As I have said previously here, I am quite happy to just get the 2 instructions and start to work with them while Chip moves on to SERDES. Because anything I discover that would help would be quite simple I don't mind suggesting it while Chip is doing SERDES.
There is nothing better than to run code to find the weaknesses. Much better than theory.
I take it this would be for the CRC16. It may not be needed to have any CRC5 hardware support. As already mentioned, once you know your address and endpoint, the CRC5 value is static for most packet types and can therefore be precalculated. The only time where it is dynamic is for the SOF (start of frame) token packets which contain an 11bit incrementing frame counter and are sent once per millisecond. For a slave, unless your application wants to know the frame number at all times, you probably can ignore the CRC5 in this packet type as you won't need to care too much if there is a bit error in the frame counter and the value is occasionally wrong.
Slaves never have to generate the CRC-5, only check it which is easy. But if a P2 implements a USB host we would need to be able to generate it, and in the worst case we could always use an 11 bit indexed lookup table for an exact match if required. It will just burn 2kB of hub RAM for that approach, or we could do some type of 8 bit LUT implementation using stack RAM perhaps.
Update: Wrote this before your previous reply Cluso99, just saw you agree CRC5 is probably not needed too.
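The 11-bit indexed lookup mentioned above is easy to prototype. Here is a sketch in Python: the table is built once with a bitwise CRC5 (reflected polynomial $14, preset $1F, inverted result, all assumptions matching common USB practice) and holds one byte per 11-bit field, so 2048 bytes in total. The names `CRC5_TABLE` and `token_crc_ok` are made up for this example.

```python
def crc5_usb(field, nbits=11):
    """Bitwise CRC5 over an 11-bit token field (LSB first), as it would be sent."""
    crc = 0x1F
    for _ in range(nbits):
        feedback = (field ^ crc) & 1
        field >>= 1
        crc >>= 1
        if feedback:
            crc ^= 0x14
    return crc ^ 0x1F

# One entry per possible 11-bit field: 2048 bytes, i.e. the 2kB mentioned above.
CRC5_TABLE = bytes(crc5_usb(v) for v in range(1 << 11))

def token_crc_ok(field, rx_crc5):
    """Check a received token by exact match against the table."""
    return CRC5_TABLE[field & 0x7FF] == (rx_crc5 & 0x1F)
```

A host would index the same table to get the CRC5 to send, so generation and checking share the one lookup.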
One interesting thing I noticed about the proposed CRCBIT instruction is that at best it takes a minimum of one clock per bit, if you already have your bit in C and have unrolled everything etc. If you have to rotate to C from another register it will take 16 clocks per byte (2 per bit). There is also some CRC initial setup overhead required but I will ignore that for now.
This means this will take 8 clock cycles per byte to complete at best. A LUT implementation will only take 5 instructions per byte. That means if you have the Stack RAM to spare, it will be significantly faster and free more cycles so interestingly the HW is not necessarily adding as much value as we would like in this case. In fact it is lowering performance which is a little counter-intuitive. Just wanted to point that out. It would however let you interleave it within the bit processing workload of the COG which could be useful if there is already a free cycle there for doing it.
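For comparison with the bit-serial approach, a byte-at-a-time LUT version looks roughly like this in Python. It is a sketch assuming the USB CRC16 conventions (reflected polynomial $A001, preset $FFFF, final inversion); the 256-entry table would be what sits in stack RAM, and the names are made up for this example.

```python
def make_crc16_table(poly=0xA001):
    """Build the 256-entry table for a reflected (LSB-first) CRC16."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table

CRC16_TABLE = make_crc16_table()

def crc16_usb_lut(data):
    """One table lookup per byte instead of eight single-bit steps."""
    crc = 0xFFFF
    for byte in data:
        # per byte: XOR, mask, lookup, shift, XOR
        crc = (crc >> 8) ^ CRC16_TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFF
```

The per-byte work is a handful of operations, which is in the same ballpark as the 5 instructions per byte quoted above, but it only helps once a whole byte is available.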
I posted a link to a page that generates Verilog code for both the 5-bit and 16-bit CRC for USB.
CRC could be included on-the-fly in the suggested DS_Shift_JEQ8 opcode, it will fit, but I see some fish hooks.
* The examples show init CRC to all 1's, not a large issue, but can take more code.
* There is no Length element in USB, so the EOP signals when to finish-and-check, problem is, in simplest designs by the time EOP arrives, you have just done a CRC on itself. Hmm...
I think the CRC applies only to the preceding data.
Maybe there is enough time to roll-back those last 16 bits of CRC ? (anyone seen CRC roll-back code ?)
Other option would be a 16bit delay line feeding CRC, so the CRC is that from 16 bits back in time.
That needs some init of that delay line - to what content ?
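On the roll-back question: a single CRC step can be undone if the data bit that went in is still known, because the top bit of the register records whether the polynomial was XORed in. Here is a Python sketch of that idea for the reflected CRC16 ($A001 is assumed; `crc_step` and `crc_unstep` are made-up names).

```python
POLY = 0xA001   # reflected CRC16 polynomial (assumption: USB/IBM variant)

def crc_step(crc, bit):
    """One forward step, LSB first."""
    x = (crc ^ bit) & 1
    crc >>= 1
    return crc ^ POLY if x else crc

def crc_unstep(crc, bit):
    """Undo one step, given the same data bit that was fed forward.

    After the right shift bit 15 is 0, so bit 15 is set only if POLY was XORed in.
    """
    x = crc >> 15
    if x:
        crc ^= POLY
    return ((crc << 1) & 0xFFFF) | (x ^ bit)
```

Rolling back the last 16 bits therefore costs 16 un-steps, using the last 16 received bits in reverse order.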
You may not have to worry about rollback. I'm not a CRC expert but from memory if you include the CRC itself in the CRC accumulation you may end up with a zero or some known constant to check.
I think that is true only for checksums.
I think there is a post-rx check, which relies on pre-load of CRC, which comes naturally with the Opcode-on-register design,
Works like this
Pass1: have full packet, all data and RxCRC and the Calculated CRC, which is 'overcooked' by having run on CRC too.
Save ocCRC for later.
What is needed is an equality comparison, so if we re-prime the CRC register with the RxCRC, and now play through the RxCRC for 16 'clocks', we now have a copy of RxCRC (+) Last 16 bits(=CRC) = ocCRC, and this is compared with the saved ocCRC. We do not need to reverse the CRC, we just need to duplicate the CRC-append, and check that.
This would cost preloads + 16x(RRC+DS_Shift_JEQ8), at the end of a packet. Is that too slow ?
"When appending a CRC to a message, it is possible to detach the transmitted CRC, recompute it, and verify the recomputed value against the transmitted one. However, a simpler technique is commonly used in hardware.
When the CRC is transmitted with the correct bit order (most significant terms first), a receiver can compute an overall CRC, over the message and the CRC, and if the CRC is correct, the result will be zero. This possibility is the reason that most network protocols that include a CRC do so before the ending delimiter; it is not necessary to know whether the end of the packet is imminent to check the CRC."
Cool, I just assumed it was too complex to do that. If that is correct, then life does get a lot simpler.
Just prime the CRC field with the needed 1's and check for 0000 (or whatever fixed residual the CRC variant leaves) at the EOP - no post-RX footwork needed at all.
The Verilog above does USB DeStuff; it may be that an opcode param or two can select HDLC DeStuff or no DeStuff, which would allow the CRC engine in the opcode to be used for transmit?
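One detail worth checking before relying on a 0000 test: running a CRC over its own appended value gives zero only when the CRC is sent un-inverted. USB sends the CRC16 inverted, so the register ends on a fixed non-zero residual instead. The Python sketch below shows both cases; it assumes the reflected CRC16 ($A001, preset $FFFF, CRC sent low byte first), and the function names are made up.

```python
def crc16_register(data, crc=0xFFFF):
    """Raw CRC16 register after running over data (reflected, LSB first, no final XOR)."""
    for byte in data:
        for i in range(8):
            x = (crc ^ (byte >> i)) & 1
            crc >>= 1
            if x:
                crc ^= 0xA001
    return crc

def usb_packet_ok(payload_and_crc):
    """One-pass check: run over data plus the inverted CRC, expect the fixed residual."""
    return crc16_register(payload_and_crc) == 0xB001
```

$B001 is the bit-reversed form of the 1000000000001101 residual given for CRC16 in the USB documentation, so the end-of-packet test is a compare against that constant rather than against 0000.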
It is simple enough to just push the CRC at the end of each byte onto the task's 4-deep stack. At the end, the CRC will be two pops down. QED.
With the 4 Field DS_Shift_JEQ8 proposed, there is not even the need to do that.
The CRC is available in the upper bits of the register, and preserves across bytes.
If the total packet CRC sums over itself to Zero, then you just check that field for 0000 at the EOP
A switch would be needed to make the CRC field accessible for transmit, tho I suppose it could call on every Physical TxBit, in which case the de-stuff does not need disable ?
That allows one opcode to be used both ways, (but it does not stuff-on-tx)
Here is a possible single bit CRC Verilog that should do CRC5usb, CRC16usb and CRC16ccitt...
You will note where the resultant bits are different for the 3 crc polynomials, I have just included 3 lines, first for CRC5usb, then CRC16usb and last CRC16ccitt. These 3 statements need to have some if then or similar decoding depending upon which crc polynomial is chosen.
For the n/a case of crc5, anything can be chosen.
////////////////////////////////////////////////////////////////////////////////
// Copyright (C) 1999-2008 Easics NV.
// This source file may be used and distributed without restriction
// provided that this copyright statement is not removed from the file
// and that any derivative work contains the original copyright notice
// and the associated disclaimer.
//
// THIS SOURCE FILE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS
// OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
// WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
//
// Purpose : synthesizable CRC function
//
// Info : tools@easics.be
//        http://www.easics.com
//
// RR20130310 modified for CRC5, CRC16usb & CRC16ccitt
////////////////////////////////////////////////////////////////////////////////
module CRC;
  // polynomial: CRC5usb=(0 2 5), CRC16usb=(0 2 15 16), CRC16ccitt=(0 5 12 16)
  // data width: 1
  // convention: the first serial bit is D[0]
  // NOTE: where several assignments to the same bit appear, only the one for
  // the chosen polynomial should be kept (or selected); as written, the last one wins.
  function [15:0] nextCRC16_D1;
    input Data;
    input [15:0] crc;
    reg [0:0] d;
    reg [15:0] c;
    reg [15:0] newcrc;
    begin
      d[0] = Data;
      c = crc;
      newcrc[0] = d[0] ^ c[4];            // CRC5usb
      newcrc[0] = d[0] ^ c[15];           // CRC16usb
      newcrc[0] = d[0] ^ c[15];           // CRC16ccitt
      newcrc[1] = c[0];
      newcrc[2] = d[0] ^ c[1] ^ c[4];     // CRC5usb
      newcrc[2] = d[0] ^ c[1] ^ c[15];    // CRC16usb
      newcrc[2] = c[1];                   // CRC16ccitt
      newcrc[3] = c[2];
      newcrc[4] = c[3];
      // CRC5usb: n/a
      newcrc[5] = c[4];                   // CRC16usb
      newcrc[5] = d[0] ^ c[4] ^ c[15];    // CRC16ccitt
      newcrc[6] = c[5];
      newcrc[7] = c[6];
      newcrc[8] = c[7];
      newcrc[9] = c[8];
      newcrc[10] = c[9];
      newcrc[11] = c[10];
      // CRC5usb: n/a
      newcrc[12] = c[11];                 // CRC16usb
      newcrc[12] = d[0] ^ c[11] ^ c[15];  // CRC16ccitt
      newcrc[13] = c[12];
      newcrc[14] = c[13];
      // CRC5usb: n/a
      newcrc[15] = d[0] ^ c[14] ^ c[15];  // CRC16usb
      newcrc[15] = c[14];                 // CRC16ccitt
      nextCRC16_D1 = newcrc;
    end
  endfunction
endmodule
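The per-bit equations in the Verilog are just the unrolled form of "shift left, XOR in the polynomial when the feedback bit is 1". A Python sketch of that equivalence follows; the names are made up, and the byte loop feeds bits MSB first purely so the result can be compared with the well-known CCITT check value (USB itself sends LSB first).

```python
VARIANTS = {"crc5usb": (5, 0x05), "crc16usb": (16, 0x8005), "crc16ccitt": (16, 0x1021)}

def next_crc(d, c, width, poly):
    """One MSB-first step: feedback = d XOR top bit, shift left, XOR poly if feedback."""
    fb = d ^ ((c >> (width - 1)) & 1)
    c = (c << 1) & ((1 << width) - 1)
    return c ^ poly if fb else c

def next_crc16usb_equations(d, c):
    """The CRC16usb lines of the Verilog, written out bit by bit."""
    bit = lambda i: (c >> i) & 1
    new = [bit(i - 1) for i in range(1, 16)]   # newcrc[1..15] = c[0..14]
    new.insert(0, d ^ bit(15))                 # newcrc[0]  = d ^ c[15]
    new[2] ^= d ^ bit(15)                      # newcrc[2]  = d ^ c[1] ^ c[15]
    new[15] ^= d ^ bit(15)                     # newcrc[15] = d ^ c[14] ^ c[15]
    return sum(b << i for i, b in enumerate(new))

def crc_msb_first(data, variant, init):
    """Run whole bytes through next_crc, most significant bit first."""
    width, poly = VARIANTS[variant]
    c = init
    for byte in data:
        for i in range(7, -1, -1):
            c = next_crc((byte >> i) & 1, c, width, poly)
    return c
```

Selecting the polynomial then only changes which bit positions receive the feedback term, which is why the three variants differ in so few lines.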
Thanks for the easics link as I used this to see what Verilog was generated for each polynomial.
I just want to remind you that SD cards also need CRC. In this case we also need CRC7.
The polynomial for CRC7 is 0x89; the polynomial for CRC16 is 0x1021 which is based upon a standard called CRC-CCITT.
Thanks
Mike
Mike,
The poly for CRC16 USB is $8005. $1021 is for CRC16 CCITT. Confusing isn't it. IBM created the original CRC16 (as now used by USB) for use in sync comms back in the 80's or earlier. But as Europe usually does, they had to use a different poly
There is one thing still confusing me about the proposed GETXP instruction. After calling such an instruction you would get carry flag C result being the XOR of the original C flag value and one of the USB data pins. So if C comes back as 1 that means it was different to the sampled pin value, and if it comes back 0 it was the same value as the sampled pin value. This is fine and it detects logical 0/1 NRZI bitstream nicely.
However, unless I am missing something else, it appears you would then want to reuse C again for the next iteration. The problem is that this time around C is not the last pin value; it indicates whether there was a difference between the previous C value and the previous pin value. So some other operation to reset C back to the previous data pin value appears to be required before the next time it gets called, or some trick is required. Are you doing this as well somewhere in your code? I didn't see that mentioned anywhere. Won't this require an additional clock cycle to do?
I think this last value work is done in the background, as part of the opcode.
It means the very first opcode C will be discarded, as the previous value is ?? but the SE0 will be valid from first clock.
Once an edge is sensed, the SW will phase adjust to try to sample in bit-centre.
The USB bit stream allows for edge-resync but that may be harder to achieve at 12MHz, so some limits on Xtal tolerance and data-length may be imposed.
A 1.5MHz P2 probably could manage edge-resync, and 1.5MHz is fine for a lot of tasks.
It may be that the P2 Counters have a capture mode that can help with edge-resync ?
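The C flag concern raised above is easy to see in a small Python model. This is only an illustration (function names made up): `nrzi_decode` keeps the previous pin level internally, as suggested for the opcode, while `chained_xor` simply reuses its own output as the next C. USB NRZI encodes a 0 as a level change and a 1 as no change.

```python
def nrzi_decode(levels, idle=1):
    """Decode NRZI by comparing each sample with the previous pin level."""
    prev = idle
    bits = []
    for level in levels:
        bits.append(0 if level != prev else 1)   # change = 0, no change = 1
        prev = level                             # remember the pin, not the result
    return bits

def chained_xor(levels, c=1):
    """What happens if C = C XOR PIN is simply fed back into the next call."""
    out = []
    for level in levels:
        c ^= level
        out.append(c)
    return out
```

The chained version produces a running parity of the pin samples rather than a per-bit change flag, so the previous pin level has to be kept somewhere, either inside the opcode or with an extra instruction.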
http://forums.parallax.com/showthread.php/151821-P2-Possible-additional-Instructions?p=1221492&viewfull=1#post1221492
I have reproduced this post here, although there have been later updates to it (need to check what precisely)
http://forums.parallax.com/showthread.php/151992-CRC-generation?p=1222728&viewfull=1#post1222728
and copied below... and Chip's reply
I think it was in Propeller II update - BLOG
Don't mind what post it was
Link to that Site.
http://www.easics.be/webtools/crctool
There is also
http://outputlogic.com/?page_id=321
Maybe you are right (I don't know either way) but this is what Wikipedia had to say: http://en.wikipedia.org/wiki/Computation_of_cyclic_redundancy_checks
"One-pass checking
When appending a CRC to a message, it is possible to detach the transmitted CRC, recompute it, and verify the recomputed value against the transmitted one. However, a simpler technique is commonly used in hardware.
When the CRC is transmitted with the correct bit order (most significant terms first), a receiver can compute an overall CRC, over the message and the CRC, and if the CRC is correct, the result will be zero. This possibility is the reason that most network protocols that include a CRC do so before the ending delimiter; it is not necessary to know whether the end of the packet is imminent to check the CRC."
Cool, I just assumed it was too complex to do that. If that is correct, then life does get a lot simpler.
Just Prime CRC field with the needed 1's and check for 0000 at the EOP - no post RX footwork needed at all.
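A quick software model of the one-pass check may help here. One caveat, stated as an assumption of this model: USB sends the CRC inverted, and in that case running the receiver over data plus CRC leaves a fixed non-zero constant rather than 0000. In this MSB-first model it comes out as 0x800D for CRC16 and 0x0C for CRC5, which as far as I recall matches the residuals quoted in the USB spec. The function names are made up for the sketch.

```python
def crc_shift(crc, bit, poly, width):
    """One bit-serial CRC step, MSB-first register form."""
    top = (crc >> (width - 1)) & 1
    crc = (crc << 1) & ((1 << width) - 1)
    return crc ^ poly if top ^ bit else crc

def residue_after_packet(data_bits, poly, width):
    """Build data + inverted CRC, then run the receiver over all of it."""
    ones = (1 << width) - 1

    # Transmit side: CRC preset to all 1's, result inverted, sent MSB first.
    crc = ones
    for bit in data_bits:
        crc = crc_shift(crc, bit, poly, width)
    tx = crc ^ ones
    tx_bits = [(tx >> i) & 1 for i in range(width - 1, -1, -1)]

    # Receive side: same preset, no attempt to spot where the CRC starts.
    rx = ones
    for bit in data_bits + tx_bits:
        rx = crc_shift(rx, bit, poly, width)
    return rx
```

So the EOP check is still a single compare against a constant, with no post-RX footwork; only the constant differs from zero when the transmitted CRC is inverted.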
The Verilog above does USB DeStuff. It may be that an opcode param or two can select HDLC DeStuff or no DeStuff, which would allow the CRC engine in the opcode to be used for transmit?
With the 4 Field DS_Shift_JEQ8 proposed, there is not even the need to do that.
The CRC is available in the upper bits of the register, and preserves across bytes.
If the total packet CRC sums over itself to Zero, then you just check that field for 0000 at the EOP
A switch would be needed to make the CRC field accessible for transmit, tho I suppose it could call on every Physical TxBit, in which case the de-stuff does not need disable ?
That allows one opcode to be used both ways, (but it does not stuff-on-tx)
You will note where the resultant bits are different for the 3 crc polynomials, I have just included 3 lines, first for CRC5usb, then CRC16usb and last CRC16ccitt. These 3 statements need to have some if then or similar decoding depending upon which crc polynomial is chosen.
For the n/a case of crc5, anything can be chosen.
Thanks for the easics link as I used this to see what Verilog was generated for each polynomial.
I am not sure how you specify the inputs to select the 3 possible polynomials. Presuming
00 = crc5
10 = crc16 usb
11 = crc16 ccitt
then how would you write the following selectively? And yes, this one can be simplified.
Or would a complex statement something like this be better?
Best get out my Verilog intro.
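Not Verilog, but a small software model of the selection may make the question more concrete. It uses the 00/10/11 encoding presumed above and treats the polynomial as data chosen by the select bits, so only the XOR tap positions differ between the three cases. The names `POLY_SELECT`, `crc_step`, `crc_bits` and `msb_first` are invented for this sketch.

```python
# select code -> (register width, polynomial without its top term)
POLY_SELECT = {
    0b00: (5, 0x05),      # CRC5 USB:    x^5  + x^2  + 1
    0b10: (16, 0x8005),   # CRC16 USB:   x^16 + x^15 + x^2 + 1
    0b11: (16, 0x1021),   # CRC16 CCITT: x^16 + x^12 + x^5 + 1
}

def crc_step(sel, crc, bit):
    """One bit-serial step; the select code only changes width and taps."""
    width, poly = POLY_SELECT[sel]
    top = (crc >> (width - 1)) & 1
    crc = (crc << 1) & ((1 << width) - 1)
    return crc ^ poly if top ^ bit else crc

def crc_bits(sel, bits, init):
    """Run a whole bit sequence through the selected CRC."""
    crc = init
    for bit in bits:
        crc = crc_step(sel, crc, bit)
    return crc

def msb_first(data):
    """Expand bytes into a bit list, most significant bit first."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]
```

In Verilog the same idea would presumably become a case on the select inputs that picks the tap mask, rather than three separate sets of assignments.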
I just want to remind you that SD card also need CRC. In this case we also need CRC7.
The polynomial for CRC7 is 0x89; the polynomial for CRC16 is 0x1021 which is based upon a standard called CRC-CCITT.
Thanks
Mike
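For reference, a minimal bit-serial model of the SD CRC7. It assumes the usual SD command framing (start bits plus 6-bit index, 32-bit argument, then CRC7 and a stop bit); 0x89 is the polynomial with its x^7 term written out, so the 7-bit register only XORs in 0x09. The helper `sd_command` is made up for the example.

```python
def crc7(data):
    """CRC7 over bytes, MSB first, polynomial x^7 + x^3 + 1, preset 0."""
    crc = 0
    for byte in data:
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            top = (crc >> 6) & 1
            crc = (crc << 1) & 0x7F
            if top ^ bit:
                crc ^= 0x09      # 0x89 without the implicit x^7 term
    return crc

def sd_command(index, arg):
    """Build a 6-byte SD command frame: header, argument, CRC7 + stop bit."""
    frame = bytes([0x40 | index]) + arg.to_bytes(4, "big")
    return frame + bytes([(crc7(frame) << 1) | 1])

# CMD0 with a zero argument ends in the familiar 0x95 byte.
cmd0 = sd_command(0, 0)
```

Note the CRC7 register is preset to 0, unlike the all-1's preset of the USB CRCs, so an init value would also need to be selectable.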
The poly for CRC16 USB is $8005. $1021 is for CRC16 CCITT. Confusing, isn't it? IBM created the original CRC16 (as now used by USB) for use in sync comms back in the 80's or earlier. But as Europe usually does, they had to use a different poly.
There is one thing still confusing me about the proposed GETXP instruction. After calling such an instruction you would get carry flag C result being the XOR of the original C flag value and one of the USB data pins. So if C comes back as 1 that means it was different to the sampled pin value, and if it comes back 0 it was the same value as the sampled pin value. This is fine and it detects logical 0/1 NRZI bitstream nicely.
However, unless I am missing something else, it appears you would then want to reuse C for the next iteration. The problem is that this time around C is not the last pin value; it indicates whether there was a difference between the previous C value and the previous pin value. So either some other operation to reset C back to the previous data pin value is required before the next call, or some trick is needed. Are you doing this somewhere in your code as well? I didn't see it mentioned anywhere. Won't this require an additional clock cycle?
I think this last value work is done in the background, as part of the opcode.
It means the C from the very first opcode will be discarded, as the previous value is unknown, but the SE0 will be valid from the first clock.
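The difference between the two readings of GETXP can be seen in a small software model. This is only a sketch of the behaviour discussed above, not of the real opcode: `changed_flags` assumes the previous pin level is kept as hidden state inside the opcode, while `c_fed_back` models the worry in the question, where the XOR result in C is itself reused as the "previous" value.

```python
def changed_flags(levels, idle_level):
    """C as wanted: 1 when the pin differs from the previously sampled pin."""
    prev = idle_level
    flags = []
    for level in levels:
        flags.append(prev ^ level)
        prev = level              # hidden "last pin" state, updated every call
    return flags

def c_fed_back(levels, c_start):
    """C as feared: each result is XORed with the next pin sample."""
    c = c_start
    out = []
    for level in levels:
        c ^= level
        out.append(c)
    return out

def nrzi_bits(levels, idle_level):
    """NRZI decode: a level change is a 0 bit, no change is a 1 bit."""
    return [1 - flag for flag in changed_flags(levels, idle_level)]

levels = [0, 1, 1, 0, 0, 0, 1]
wanted = changed_flags(levels, 1)   # [1, 1, 0, 1, 0, 0, 1]
feared = c_fed_back(levels, 1)      # [1, 0, 1, 1, 1, 1, 0]
```

The two sequences diverge from the second sample onward, so the scheme only works if the opcode keeps its own copy of the last pin level, as suggested.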
Once an edge is sensed, the SW will phase adjust to try to sample in bit-centre.
The USB bit stream allows for edge-resync but that may be harder to achieve at 12MHz, so some limits on Xtal tolerance and data-length may be imposed.
A 1.5MHz P2 probably could manage edge-resync, and 1.5MHz is fine for a lot of tasks.
It may be that the P2 Counters have a capture mode that can help with edge-resync ?
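To put rough numbers on the Xtal tolerance and data-length limit mentioned above, here is a back-of-envelope model. It assumes sampling starts exactly at bit centre, there is no edge-resync at all, and the budget is half a bit of accumulated drift; stuffed bits and jitter are ignored, and the function names are invented for the sketch.

```python
HALF_BIT_PPM = 500_000  # half a bit time, expressed in millionths of a bit

def bits_until_half_bit_drift(mismatch_ppm):
    """Bits received before the sample point has slid half a bit."""
    return HALF_BIT_PPM // mismatch_ppm

def max_mismatch_ppm(bits_without_resync):
    """Largest clock mismatch that survives this many bits without resync."""
    return HALF_BIT_PPM // bits_without_resync

# A 0.25% (2500 ppm) mismatch uses up the half-bit budget in 200 bits.
bits_at_2500ppm = bits_until_half_bit_drift(2500)

# A 64-byte payload (512 bits) would need the mismatch held under ~976 ppm.
ppm_for_64_bytes = max_mismatch_ppm(64 * 8)
```

So with no resync at all, a full 64-byte packet needs the combined transmitter and receiver clock error to stay under roughly 0.1%, which is where the limits on Xtal tolerance and data length would come from.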