I think this last value work is done in the background, as part of the opcode.
It means the very first opcode C will be discarded, as the previous value is ?? but the SE0 will be valid from first clock.
Maybe but I didn't see it anywhere. It could have been mentioned elsewhere in older posts. It just looked like it would reuse C each time around, but that doesn't quite work right in my mind.
There is one thing still confusing me about the proposed GETXP instruction. After calling such an instruction you would get carry flag C result being the XOR of the original C flag value and one of the USB data pins. So if C comes back as 1 that means it was different to the sampled pin value, and if it comes back 0 it was the same value as the sampled pin value. This is fine and it detects logical 0/1 NRZI bitstream nicely.
However, unless I am missing something else, it appears you would then want to reuse C again for the next iteration. The problem is that this time around C is not the last pin value; it indicates whether there was a difference between the previous C value and the previous pin value. So some other operation to reset C back to the previous data pin value appears to be required before the next time it gets called, or some trick is required. Are you doing this as well somewhere in your code? I didn't see that mentioned anywhere. Won't this require an additional clock cycle to do?
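A minimal Python sketch of the concern above, assuming NRZI where a level change means 0 and no change means 1. The function names are illustrative only and are not part of any proposed instruction. The second function feeds the XOR result back in as the "previous" value, and its output drifts away from the correct decode.

```python
def nrzi_decode(levels, idle_level):
    """Decode NRZI line levels into data bits: no change = 1, change = 0."""
    bits = []
    prev = idle_level              # last *pin* level, not the last XOR result
    for level in levels:
        bits.append(0 if level != prev else 1)
        prev = level               # carry the pin level into the next iteration
    return bits


def nrzi_decode_reusing_flag(levels, idle_level):
    """Faulty variant: reuses the XOR result (the C flag) as the previous value."""
    bits = []
    c = idle_level
    for level in levels:
        c = c ^ level              # C now means "changed?", not the pin level
        bits.append(0 if c else 1)
    return bits


levels = [0, 0, 1, 1, 1, 0]
good = nrzi_decode(levels, idle_level=1)
bad = nrzi_decode_reusing_flag(levels, idle_level=1)
print(good, bad)
```

Keeping the sampled pin level in a register bit, rather than in C, avoids the extra restore step.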
I think you have found a bug when I reduced the logic.
Relooking at it, we need to actually keep the data bit, not the C bit.
I have also been looking at the GETXP and CRCBIT and wondering if I can combine the instructions, together with the unstuff count.
rx
waitcnt time, bittime ' 0 wait for next mid-bit sample time
test K, pina wz ' 1 read usb pin
muxz bits, bit30 wc ' 2 b31=previous, b30=new; C=parity 00/11=odd, 01/10=even
shl bits, #1 ' 3 shift new b30 bit into previous b31
test JK, pina wz ' 4 check for SE0 (ie EOP) ?
if_z jmp #waitforend ' 5 y: wait for end
rcr data, #1 ' 6 accum new bit (rotate carry from xor into top bit)
rcl stuffcnt, #6 wz ' 7 if entire register is zero, we need to unstuff (6 bits in)
if_z jmp #unstuff ' 8 y: unstuff next bit
sub bitcnt, #1 wz ' 9 bitcnt--
if_nz jmp #rx ' 10
' 8/32 bits so save long...
J long 1<<DM_PIN ' D-
K long 1<<DP_PIN ' D+
JK long 1<<DP_PIN | 1<<DM_PIN ' J & K bit mask
enable long 1<<EN_PIN ' Enable 1K5 pullup for LS
bittime long BIT_DLY ' USB bit time
bit30 long 1<<30 '$40000000 ' MUX mask for RX inbound xor register
data long 0
stuffcnt long 0 ' counts 1 bits (was dc in rx and db in tx)
bits long 0 ' b31=previous, b30=new
time long 0
I have also been looking at the GETXP and CRCBIT and wondering if I can combine the instructions, together with the unstuff count.
In my example I did two opcodes, split as
a) a ReadPinPair opcode and
b) a Destuff_CRC_Jump opcode, which reads CY as the ip, and you can read CRC and RxDATA from the register.
This jumps when Counter = 8, which can be any of 8/9/10 physical bits.
That split means the Destuff_CRC_Jump can apply to the Tx bitstream, (via CY) and collect the CRC as it sends Data too.
In your code above, pretty much 0..4 is one opcode, 5 is JNZ, and 6..10 is the other opcode (but including CRC)
The CRC Field part of D, inits with 1's.
The a) opcode could be ReadPinPair_JNZ, I think and still work ? (ie includes JNZ )
I think you have found a bug when I reduced the logic.
Relooking at it, we need to actually keep the data bit, not the C bit.
I have also been looking at the GETXP and CRCBIT and wondering if I can combine the instructions, together with the unstuff count.
Yeah I thought there was something missing. Now I wonder if it makes sense to combine the destuff and pin sampling work together into one instruction but keep the CRC work separate, so that it remains independent of USB/NRZI and could therefore be used by other non-USB software as well. If we end up having two CRC H/W blocks in the COG, let's call them CRCA and CRCB, it could allow different polynomials/algorithms such as CRC-5 and CRC-16 to be dynamically selected depending on where we are up to in the packet, as we pick either CRCA or CRCB instructions.
Now we have four possible values of Z, C flags which could get returned after the combined USB sampling/destuffing operation, while D can hold current stuff counts, the accumulated byte data and an end of byte marker bit. We can reinitialize independent fields of this register such as D/S/I/X as required, once we process our byte. We just need a way to identify the pin pair, either via including S in the opcode (if we have that luxury in the instruction encodings remaining) or via some other means like a separate SETUSB xxx instruction for example.
Output flags
ZC = 10 - indicates SE0 detected
ZC = 11 - indicates bit was destuffed and should be ignored
ZC = 01 - good data inserted, data bit = 1 (no pin changed detected)
ZC = 00 - good data inserted, data bit = 0 (pin change detected)
New bit always gets inserted into D[8], D[9] also remembers the last pin value.
D[7].. D[1] bits contain previous data, D[0] holds end of byte marker, this is shifted downwards later by other code. So D[8]=~(old D[8] XOR pinvalue) in the USBSTUFF instruction, but D[7] = old D[8], D[6] = old D[7] etc down to D[0] when we rotate.
The stuff count could be indicated by copying the last data bit value also into bit31 of the data register, if a 1 reaches bit 25 as we shift downwards you have to start to destuff. So no need for an actual destuff counter, just make use of the shift operation. When the bit is unstuffed once we decode a zero, the top 6 bits of the register are reset to 0.
USBSTUFF data WC WZ ' this does the USB pin sampling and destuffing; we shift down the 8 bit data and do the CRC operation ourselves below
if_z_and_nc JMP #se0_detected 'exit loop if SE0
if_nz CRCA crcval ' accumulate CRC using C flag as data
if_nz SHR data, #1 WC ' if C is set we are at the end of the byte
if_nz_and_c JMP #byte_done
If you wrap this up into a REP loop you get down to 5 instructions per bit, or 6 with the WAITCNT, but the byte exit jumps will take 4 cycles. Not sure if that blows the timing/budget.
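The shift-history idea for destuffing can be modelled roughly as below. This is only a sketch: it keeps the last six decoded bits in a small integer rather than in the top bits of the D register, and the function name `unstuff` is made up for illustration.

```python
def unstuff(bits):
    """Drop the stuffed 0 that follows six consecutive 1 bits."""
    out = []
    history = 0                        # last six accepted bits, newest in bit 0
    for bit in bits:
        if history == 0b111111:
            # Six 1s in a row: this bit must be the stuffed 0.
            if bit != 0:
                raise ValueError("bit-stuffing violation")
            history = 0                # clear the history, as the proposal clears the top bits
            continue
        out.append(bit)
        history = ((history << 1) | bit) & 0b111111
    return out


print(unstuff([1, 1, 1, 1, 1, 1, 0, 1, 0]))
```

No separate counter is needed; the all-ones test on the history plays the role of "count == 6".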
Here is a possible special USB instruction that should take 1 clock (4 via the pipeline).
D would hold all the values required (data byte accumulator, stuff counter, last pin values, and the CRC).
D could be a fixed register such as $1F0.
The P2 instruction would be...
RECVUSB D, S/# WZ,WC
where
S/# is the PinPair# and Poly bits
S[31..9] = unused
S[8..7] = CRC polynomial select:
    00 = CRC16 USB
    01 = CRC5 USB
    10 = CRC16 CCITT
    11 = undefined
S[6..0] = D-/D+ Pin Pair #0..127
The pin pair is always a pair of pins mod 2. ie nnnnnnx where x=0 and x=1 for the pair.
If the pin pair is even (S[0]=0) then J is the lowest pin and K is the higher pin of the consecutive pair
If the pin pair is odd (S[0]=1) then K is the lowest pin and J is the higher pin of the consecutive pair.
This arrangement allows for simple LS and FS by making the pin pair even or odd.
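A small Python sketch of how the S operand described above might be decoded. The function name `decode_s` and the returned dictionary keys are assumptions for illustration; only the bit positions come from the description.

```python
POLY_NAMES = {
    0b00: "CRC16 USB",
    0b01: "CRC5 USB",
    0b10: "CRC16 CCITT",
    0b11: "undefined",
}


def decode_s(s):
    """Split S into the polynomial select (S[8..7]) and the J/K pin numbers (S[6..0])."""
    pair = s & 0x7F
    poly = (s >> 7) & 0b11
    low_pin = pair & ~1                # the pair is always pins nnnnnn0 and nnnnnn1
    high_pin = low_pin | 1
    if pair & 1 == 0:
        j_pin, k_pin = low_pin, high_pin   # even: J is the lower pin
    else:
        k_pin, j_pin = low_pin, high_pin   # odd: K is the lower pin
    return {"j_pin": j_pin, "k_pin": k_pin, "poly": POLY_NAMES[poly]}


print(decode_s(10))
print(decode_s(11 | (0b01 << 7)))
```

Swapping J and K with the low bit of the pair number is what lets the same pins serve LS or FS wiring.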
D is the cog register storing a 32 bit field...
D[31..16] = crc16
D[15] = K new pin value
D[14] = J new pin value
D[13..11] = undefined
D[10..8] = unstuff counter 3 bits
D[7..0] = data byte accumulation
Z = new D[15] ie K new value
C = new D[14] ie J new value
ZC
00 = SE0
01 = J ?
10 = K ?
11 = SE1
(may want to invert Z ??? and swap D[14] - D[15] ???)
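For reference, a Python sketch of the D field layout and the ZC decode listed above. The helper names `unpack_d` and `bus_state` are hypothetical; the bit positions and the ZC table are taken from the description (Z = K, C = J).

```python
def unpack_d(d):
    """Split the 32-bit D value into the fields described above."""
    return {
        "crc16": (d >> 16) & 0xFFFF,   # D[31..16]
        "k": (d >> 15) & 1,            # D[15]
        "j": (d >> 14) & 1,            # D[14]
        "stuffcnt": (d >> 8) & 0b111,  # D[10..8]
        "data": d & 0xFF,              # D[7..0]
    }


def bus_state(z, c):
    """Map the returned flags (Z = K pin, C = J pin) to a bus state name."""
    return {(0, 0): "SE0", (0, 1): "J", (1, 0): "K", (1, 1): "SE1"}[(z, c)]


d = (0xFFFF << 16) | (1 << 14) | (3 << 8) | 0xA5
fields = unpack_d(d)
print(fields, bus_state(fields["k"], fields["j"]))
```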
Here is what I have come up with for possible Verilog code - none of it is tested - I am not a Verilog coder.
module RECVUSB;
// polynomial: CRC5usb=(0 2 5), CRC16usb=(0 2 15 16), CRC16=(0 5 12 16)
// data width: 1
// convention: the first serial bit is data[0]
function [31:0] RxUSB;
input [1:0] poly; // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
input [31:0] dest; // original D value
input [1:0] pins; // K:J pin values
reg [15:0] c; // original CRC (accumulated)
// reg [2:0] spare; // undefined
reg [0:0] k; // K new pin value
reg [0:0] j; // J new pin value
reg [2:0] stuffcnt; // stuff counter 3 bits
reg [7:0] data; // data byte (accumulated)
// 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
always @(poly) begin
crc16usb = (poly == 2'b00);
crc05usb = (poly == 2'b01);
crc16itt = (poly == 2'b10);
end
begin
c = dest[31:16]; // original crc value (accum)
stuffcnt = dest[10:8]; // original stuff counter value (3 bits, D[10..8])
k = pins[1]; // new pin values
j = pins[0]; // new pin values
// calculate the new crc...
newcrc[1] = c[0];
newcrc[3] = c[2];
newcrc[4] = c[3];
newcrc[6] = c[5];
newcrc[7] = c[6];
newcrc[8] = c[7];
newcrc[9] = c[8];
newcrc[10] = c[9];
newcrc[11] = c[10];
newcrc[13] = c[12];
newcrc[14] = c[13];
if crc05usb then begin
newcrc[0] = k ^ c[4];
newcrc[2] = c[1] ^ k ^ c[4];
end
if crc16usb or crc16itt then begin
newcrc[0] = k ^ c[15];
end
if crc16usb then begin
newcrc[2] = c[1] ^ k ^ c[15];
newcrc[5] = c[4];
newcrc[12] = c[11];
newcrc[15] = c[14] ^ k ^ c[15];
end
if crc16itt then begin
newcrc[2] = c[1];
newcrc[5] = c[4] ^ k ^ c[15];
newcrc[12] = c[11] ^ k ^ c[15];
newcrc[15] = c[14];
end
// check for bit unstuff
if stuffcnt == 3'b110 then begin
// unstuff: drop this bit, clear the counter, keep the data byte
stuffcnt = 3'b000;
data[7:0] = dest[7:0];
else
// accum data bit into byte
stuffcnt = stuffcnt + 1;
data[7:1] = dest[6:0];
data[0] = k ^ dest[15]; // k ^ previous pin value
end
RxUSB[31:16] = newcrc[15:0];
RxUSB[15] = k; // D[15] = K new pin value
RxUSB[14] = j; // D[14] = J new pin value
RxUSB[13:11] = 3'b000;
RxUSB[10:8] = stuffcnt;
RxUSB[7:0] = data[7:0];
if WZ then begin
Z[0] = k;
end
if WC then begin
C[0] = j;
end
end
endfunction
endmodule
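The per-bit CRC assignments in the listing above amount to a shift left with the feedback bit XORed into the tap positions. A Python sketch of that equivalence follows; the dictionary keys and function names are illustrative, and the model feeds in a plain data bit where the listing uses k.

```python
# (tap mask, register width) for each polynomial; names are illustrative.
POLYS = {
    "crc16usb": (0x8005, 16),     # x^16 + x^15 + x^2 + 1
    "crc05usb": (0x05, 5),        # x^5 + x^2 + 1
    "crc16ccitt": (0x1021, 16),   # x^16 + x^12 + x^5 + 1
}


def crc_step(crc, bit, poly_name):
    """Advance the CRC register by one input bit."""
    taps, width = POLYS[poly_name]
    mask = (1 << width) - 1
    feedback = bit ^ ((crc >> (width - 1)) & 1)   # like k ^ c[15] (or k ^ c[4])
    crc = (crc << 1) & mask                       # newcrc[n] = c[n-1]
    if feedback:
        crc ^= taps                               # XOR feedback into the taps
    return crc


def crc_bits(bits, poly_name):
    """Run a whole bit sequence through the register, preset to all ones."""
    _, width = POLYS[poly_name]
    crc = (1 << width) - 1
    for bit in bits:
        crc = crc_step(crc, bit, poly_name)
    return crc


print(hex(crc_step(0xFFFF, 0, "crc16usb")))
```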
jmg: I will take a look at what you have done but my understanding of Verilog is quite poor. Would you mind looking at this please?
You will note in my previous post that the single instruction RECVUSB checks the pin pairs, calculates/accumulates the CRC (5 or 16usb or 16ccitt), unstuffs bits, and accumulates this bit to the byte. Everything is held in the one D long/register. In particular, the CRC is in the upper word, and the lowest byte contains the data byte.
The Z & C flags are set to the current KJ pins so that 4 conditions can be decoded automatically (SE0, SE1 plus J, K).
The sw can keep count of 8 bits (Just realised I need a way to test for this as the instruction does not advise of unstuffing without further testing).
I think I will have the instruction keep a count of bits (excludes unstuffing) and set Z when done, and C for SE0/SE1.
Yeah I thought there was something missing. Now I wonder if it makes sense to combine the destuff and pin sampling work together into one instruction but keep the CRC work separate, so that it remains independent of USB/NRZI and could therefore be used by other non-USB software as well. If we end up having two CRC H/W blocks in the COG, let's call them CRCA and CRCB, it could allow different polynomials/algorithms such as CRC-5 and CRC-16 to be dynamically selected depending on where we are up to in the packet, as we pick either CRCA or CRCB instructions.
Now we have four possible values of Z, C flags which could get returned after the combined USB sampling/destuffing operation, while D can hold current stuff counts, the accumulated byte data and an end of byte marker bit. We can reinitialize independent fields of this register such as D/S/I/X as required, once we process our byte. We just need a way to identify the pin pair, either via including S in the opcode (if we have that luxury in the instruction encodings remaining) or via some other means like a separate SETUSB xxx instruction for example.
Output flags
ZC = 10 - indicates SE0 detected
ZC = 11 - indicates bit was destuffed and should be ignored
ZC = 01 - good data inserted, data bit = 1 (no pin changed detected)
ZC = 00 - good data inserted, data bit = 0 (pin change detected)
New bit always gets inserted into D[8], D[7].. D[1] bits contain previous data, D[0] holds end of byte marker, this is shifted downwards later by other code. So D[8]=~(old D[8] XOR pinvalue) in the USBSTUFF instruction, but D[7] = old D[8], D[6] = old D[7] etc down to D[0] when we rotate.
The stuff count could be indicated by copying the last data bit value also into bit31 of the data register, if a 1 reaches bit 25 as we shift downwards you have to start to destuff. So no need for an actual destuff counter, just make use of the shift operation. When the bit is unstuffed once we decode a zero, the top 6 bits of the register are reset to 0.
USBSTUFF data WC WZ ' this does the USB pin sampling and destuffing; we shift down the 8 bit data and do the CRC operation ourselves below
if_z_and_nc JMP #se0_detected 'exit loop if SE0
if_nz CRCA crcval ' accumulate CRC using C flag as data
if_nz SHR data, #1 WC ' if C is set we are at the end of the byte
if_nz_and_c JMP #byte_done
If you wrap this up into a REP loop you get down to 5 instructions per bit, or 6 with the WAITCNT, but the byte exit jumps will take 4 cycles. Not sure if that blows the timing/budget.
Unfortunately you cannot have a waitcnt/passcnt within a repx loop.
I thought I would post where I am up to (before I retire for the evening) with the Verilog for the P2 USB instruction. It still needs some work wrt unstuffing as I don't reset the counter when I get a clocked bit.
////////////////////////////////////////////////////////////////////////////////
// Acknowledgements: Verilog code for CRC's http://www.easics.com
// RR20140310 start
// RR20140311 continued
////////////////////////////////////////////////////////////////////////////////
// polynomial: CRC5usb=(0 2 5), CRC16usb=(0 2 15 16), CRC16ccitt=(0 5 12 16)
// data width: 1, LSB first
//
// inputs: D, S, PINS
// outputs: D, Z, C
module RxUSB
(
input [31:0] s, // S operand
input [31:0] d, // D operand
input [127:0] p, // input pins
output [31:0] r, // D result
output z, // Z flag
output c // C flag
);
reg [15:0] crc; // original CRC (accumulated)
reg [2:0] bitcnt; // data bit counter 3 bits
reg k; // K new pin value
reg j; // J new pin value
reg [2:0] stuffcnt; // stuff counter 3 bits
reg [7:0] data; // data byte (accumulated)
reg [8:7] poly; // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
reg [6:0] pinno; // pin pair numbers 0-127
reg [15:0] newcrc; // new crc
///////////////////////////////////////////////////////////////////////////////
begin
crc = d[31:16]; // original crc value (accum)
k0 = d[15]; // previous K
j0 = d[14]; // previous J
stuffcnt = d[13:11]; // original stuff counter value
bitcnt = d[10:8]; // original bit counter value
data = d[7:0]; // original data value (accum)
poly = s[8:7]; // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
? kpin = value(s[6:0]); // K pin no.
? jpin = value(s[6:0]) ^1 // J pin no.
k = p[kpin]; // new pin value (p is the pin input port)
j = p[jpin]; // new pin value
// 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
always @(poly) begin
crc16usb = (poly == 2'b00); // CRC16usb =(0 2 15 16)
crc05usb = (poly == 2'b01); // CRC5usb =(0 2 5)
crc16itt = (poly == 2'b10); // CRC16ccitt=(0 5 12 16)
end
// calculate the new crc...
if crc05usb then
kr0 = k ^ crc[4];
kr2 = k ^ crc[4];
kr5 = 1'b0;
kr12 = 1'b0;
kr15 = 1'b0;
else if crc16usb then
kr0 = k ^ crc[15];
kr2 = k ^ crc[15];
kr5 = 1'b0;
kr12 = 1'b0;
kr15 = k ^ crc[15];
else if crc16itt then
kr0 = k ^ crc[15];
kr2 = 1'b0;
kr5 = k ^ crc[15];
kr12 = k ^ crc[15];
kr15 = 1'b0;
end;
newcrc[0] = kr0;
newcrc[1] = crc[0];
newcrc[2] = crc[1] ^ kr2;
newcrc[3] = crc[2];
newcrc[4] = crc[3];
newcrc[5] = crc[4] ^ kr5;
newcrc[6] = crc[5];
newcrc[7] = crc[6];
newcrc[8] = crc[7];
newcrc[9] = crc[8];
newcrc[10] = crc[9];
newcrc[11] = crc[10];
newcrc[12] = crc[11] ^ kr12;
newcrc[13] = crc[12];
newcrc[14] = crc[13];
newcrc[15] = crc[14] ^ kr15;
?? check "1' bit first
// check for bit unstuff
if stuffcnt == 3'b110 then
// unstuff
stuffcnt = 3'b000;
r[7:0] = data[7:0];
zero = 1'b0;
else
// inc bit count & accum data bit into byte
bitcnt = bitcnt + 1;
if bitcnt == 3'b000 then
zero = 1'b1;
else
zero = 1'b0;
end
stuffcnt = stuffcnt + 1;
r[7:1] = data[6:0];
r[0] = k ^ k0; // k ^ previous pin value
end
r[31:16] = newcrc[15:0];
r[15] = k;
r[14] = j;
r[13:11] = stuffcnt;
r[10:8] = bitcnt[2:0];
if WZ then begin
z = zero;
end
if WC then begin
c = k ^ j;
end
end
endfunction
endmodule
You will note in my previous post that the single instruction RECVUSB checks the pin pairs, calculates/accumulates the CRC (5 or 16usb or 16ccitt), unstuffs bits, and accumulates this bit to the byte. Everything is held in the one D long/register. In particular, the CRC is in the upper word, and the lowest byte contains the data byte.
There may be some issues with starting this (eg first clock has unknown last-pin state).
If you go to a single opcode Verilog scheme, that opens up more options
( note my variant includes a Bit counter )
addit: I see your #71 includes Bit Counter too, so that is pretty much ready to timer-trigger.
The Classic opcode usage is called once per bit
or, instead of PC/execute trigger, consider that same Verilog is now Counter triggered.
The launch can morph slightly, to a WAITUSB style that is asked for 8 bits or exits on SE0
This is almost the same verilog, but now it is SW executed once per byte, slashing the code-overhead.
The once-per-bit internal engine is timer triggered and exits the WAITUSB-style SW interface on 8 bits or a trap event (SE0, Error)
It is also then fairly simple to do edge-snap on the Timer reloads, which allows longer packets with normal tolerances.
There would be two opcodes, one to Prime on Write and Read when done,and the WAITUSB form
Init Write values are Counter divide, preset of CRC and presets/clears of counters
Init read gives CRC result, Data Bits, and optional other info
Here is what I have come up with for possible Verilog code - none of it is tested - I am not a Verilog coder.
jmg: I will take a look at what you have done but my understanding of Verilog is quite poor. Would you mind looking at this please?
I'd suggest you grab a copy of Lattice ISPLever, CPLD version.
This is faster at compiling and fitting than the newer Diamond, and with CPLD equation output, is a little easier to read, and check what you coded in EQN form.
eg it will use .CE (clock enable) .D .C on flipflops in the EQN out which are easy to scan.
The USB opcode engine will 'fit' in a modest CPLD.
Compile/FIT is ~ 15seconds, for my code example, and LC4256ZE-B-EVN is a possible hardware test platform.
There may be some issues with starting this (eg first clock has unknown last-pin state).
The first time you call this, the initial state in the D register passed should be known and set up as the idle state of the bus. This is fixed for low speed/full speed. We can keep the last data around from the previous call as well. No need to trash it.
The launch can morph slightly, to a WAITUSB style that is asked for 8 bits or exits on SE0
I quite like the sound of that idea. It would however probably prevent or impact a multi-tasking based implementation like I proposed in my first post due to variable execution timing.
It is also then fairly simple to do edge-snap on the Timer reloads, which allows longer packets with normal tolerances.
I'm intrigued by this idea to remain in sync for long packets. How would you do this in practice?
There would be two opcodes, one to Prime on Write and Read when done,and the WAITUSB form
Init Write values are Counter divide, preset of CRC and presets/clears of counters
Init read gives CRC result, Data Bits, and optional other info
One thing we need to bear in mind that we don't want the final SE0 bit condition received at the end of the packet to have any impact on the last CRC operation. The CRC needs to be preserved intact so we can validate it.
Good point, CRC and Data Rx need to advance only on valid Bits, ie Stuff OR P=M => Skip change of CRC.Data.
Code in #71 does not qualify CRC advance, on Skip,
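A behavioural Python sketch of one receive step with the qualification described above: the CRC, data byte and bit count advance only on valid, non-stuffed bits. The state dictionary, `new_state` and `rx_step` are hypothetical names, and the sketch assumes CRC16 USB with LSB-first data; it is not the Verilog from the earlier posts.

```python
def new_state(idle_k, crc=0xFFFF):
    """Receiver state: last K level, run of 1s, bit count, data byte, CRC."""
    return {"prev_k": idle_k, "ones": 0, "bitcnt": 0, "data": 0, "crc": crc}


def rx_step(state, k, j):
    """Advance one bit time; returns 'se0', 'se1', 'stuffed', 'bit' or 'byte'."""
    if k == j:
        # SE0 or SE1 is not a data bit, so CRC and data are left untouched.
        return "se0" if k == 0 else "se1"
    bit = 1 if k == state["prev_k"] else 0      # NRZI: no change = 1
    state["prev_k"] = k
    if state["ones"] == 6:
        state["ones"] = 0                       # stuffed bit: skip CRC and data
        return "stuffed"
    state["ones"] = state["ones"] + 1 if bit else 0
    state["data"] = (state["data"] >> 1) | (bit << 7)   # LSB arrives first
    feedback = bit ^ (state["crc"] >> 15)
    state["crc"] = (state["crc"] << 1) & 0xFFFF
    if feedback:
        state["crc"] ^= 0x8005
    state["bitcnt"] = (state["bitcnt"] + 1) & 7
    return "byte" if state["bitcnt"] == 0 else "bit"


s = new_state(idle_k=0)
events = [rx_step(s, k, 1 - k) for k in [0, 1, 1, 0, 1, 1, 0, 0]]
print(events, hex(s["data"]))
```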
The launch can morph slightly, to a WAITUSB style that is asked for 8 bits or exits on SE0
I quite like the sound of that idea. It would however probably prevent or impact a multi-tasking based implementation like I proposed in my first post due to variable execution timing.
hehe, some people want everything !!
The first SW-loops suggested were so starved of cycles, there was no multi-tasking option, it was needing special care to even run at 80MHz.
If the USB read becomes byte based with WAITUSB, then the next logical step is to allow buffering on Byte-read, so it behaves very like a conventional UART - ie you have a whole byte time of elbow room.
With no buffering, the get-byte-and restart WAITUSB will have tight constraints.
If we target a 48MHz min CLK, each byte arrives every 32/36/40 sysclks, which might allow some careful multi-tasking with UART style buffering.
I'm sure a 50% usage, 48MHz USB support would have wide appeal
Note you would be limited to one USB per COG, and also on present P2, no Full task swaps or any libraries using locks in the same COG.
If the suggested P2 enhancement of a HW queue for shared resources were done, then libraries and code would need to use LOCK far less often, and a USB thread could run at 50% clocks, with almost anything happening in the other threads.
With no buffering, the get-byte-and restart WAITUSB will have tight constraints.
Yeah, the tight timing constraints make it tricky and could add limitations to the USB implementation. With a fully buffered byte process we have oodles more time to process the data without upsetting the bit capture process, and the driver code is a lot easier to write and understand.
Update: Actually "oodles" probably only means about 14 bits worth of time (from memory) before we need to respond to the USB host with an ACK/NAK/STALL or the data requested, or risk hitting a timeout. A byte buffer will eat into this time, so we need to not buffer more than one byte. But that still gives a lot more instruction breathing room to get the job done. 8 USB bits at say 96/48MHz is 64/32 P2 instruction cycles for example. You can do a lot in that time.
There may be some issues with starting this (eg first clock has unknown last-pin state).
It will be required that you first do a MOV D,setup
You will need to reset the bit and stuff counters, setup the J & K bits, preferably clear the data byte, and preset the CRC16 to $FFFF.
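As a sketch of the setup value, assuming the field layout of the later RxUSB listing (CRC in D[31..16], K in D[15], J in D[14], stuff counter in D[13..11], bit counter in D[10..8], data in D[7..0]). The helper name `initial_d` is hypothetical, and which of K/J is high at idle is left as a parameter because it depends on the speed and the pin assignment.

```python
def initial_d(k_idle, j_idle, crc_preset=0xFFFF):
    """Build the D value to load with MOV D,setup before the first bit."""
    stuffcnt = 0      # stuff counter reset
    bitcnt = 0        # bit counter reset
    data = 0          # data byte cleared
    return ((crc_preset << 16) | (k_idle << 15) | (j_idle << 14)
            | (stuffcnt << 11) | (bitcnt << 8) | data)


print(hex(initial_d(k_idle=0, j_idle=1)))
```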
If you go to a single opcode Verilog scheme, that opens up more options
( note my variant includes a Bit counter )
addit: I see your #71 includes Bit Counter too, so that is pretty much ready to timer-trigger.
I want to be able to use this instruction for writing/outputting USB too. Currently I think it will work if the previous instruction outputs the bits on J & K pins, then call this instruction which will compile the CRC16/5 for you. The sw will need to do the bitstuffing.
The Classic opcode usage is called once per bit
or, instead of PC/execute trigger, consider that same Verilog is now Counter triggered.
The launch can morph slightly, to a WAITUSB style that is asked for 8 bits or exits on SE0
This is almost the same verilog, but now it is SW executed once per byte, slashing the code-overhead.
The once-per-bit internal engine is timer triggered and exits the WAITUSB-style SW interface on 8 bits or a trap event (SE0, Error)
This only becomes useful if this was a task and then PASSCNT has to be used. But then we also need a TX version too.
I would rather control this in sw, especially at this point in time. I am trying to cover the CRC16-CCITT plus both CRC5 & CRC16 for the USB, for both RX & TX cases.
IIRC there is no unstuffing in CRC16-CCITT protocols as they use SYN & DLE.
It is also then fairly simple to do edge-snap on the Timer reloads, which allows longer packets with normal tolerances.
There would be two opcodes, one to Prime on Write and Read when done,and the WAITUSB form
Init Write values are Counter divide, preset of CRC and presets/clears of counters
Init read gives CRC result, Data Bits, and optional other info
We do have to be mindful of creating general cases so we can use them elsewhere. Currently this will do NRZI comms, and the special stuff/unstuff. We also have to be mindful of instruction availability and silicon.
The first time you call this, the initial state in the D register passed should be known and set up as the idle state of the bus. This is fixed for low speed/full speed. We can keep the last data around from the previous call as well. No need to trash it.
Agreed.
I quite like the sound of that idea. It would however probably prevent or impact a multi-tasking based implementation like I proposed in my first post due to variable execution timing.
I wanted to KISS otherwise we may get caught with a bug we cannot get over.
I'm intrigued by this idea to remain in sync for long packets. How would you do this in practice?
You wait on a bit change, then step half a bit and sample. QED.
One thing we need to bear in mind that we don't want the final SE0 bit condition received at the end of the packet to have any impact on the last CRC operation. The CRC needs to be preserved intact so we can validate it.
It is also then fairly simple to do edge-snap on the Timer reloads, which allows longer packets with normal tolerances.
I'm intrigued by this idea to remain in sync for long packets. How would you do this in practice?
If we assume realistic and useful targets of 48MHz CLK to timers and 48MHz sliced 50% to CPU, then each bit is 4 clocks.
Any edge forces the counter to (say) 00 and then it clocks 012301230123 when no edges are present.
Data is sampled when the counter is at 50% ==2, and timer values of 1 and 3 here, are margin.
Timing skew will either shorten the 3 value, or extend the 0, and so it will jitter about the correct clock speed.
At higher clock speeds, the granularity improves.
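The edge-snap counter described above can be simulated in a few lines of Python. This is a sketch under the stated assumption of 4 clocks per bit, with one line-level entry per clock; the function name `sample_with_edge_snap` is invented for illustration.

```python
def sample_with_edge_snap(levels, clocks_per_bit=4):
    """Sample mid-bit using a phase counter that any edge forces back to 0."""
    samples = []
    count = 0
    prev = levels[0]
    for i, level in enumerate(levels):
        if i > 0:
            if level != prev:
                count = 0                             # edge: snap the counter
            else:
                count = (count + 1) % clocks_per_bit  # free-run 0,1,2,3,0,...
        if count == clocks_per_bit // 2:
            samples.append(level)                     # sample at the 50% point
        prev = level
    return samples


# A 5-clock bit followed by a 3-clock bit (skew) still yields one sample each.
skewed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
print(sample_with_edge_snap(skewed))
```

The counts on either side of the sample point (1 and 3 here) are the margin that absorbs the skew.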
If you stick with even divides, the possible clocks are 48MHz, 72MHz, 96MHz, 120MHz etc;
if you allow uneven sides, (which should be ok) then 60MHz, 84MHz, 108MHz are also possible.
72MHz is inside present FPGA builds and 84MHz / 96MHz (+?) may be possible on Cyclone V builds.
At 72MHz and 50% slice and UART style buffering, there is 24/27/30 thread clocks per byte streaming.
Will that be enough to meet packet specs ?
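The clock budgets quoted in this thread follow from simple arithmetic, sketched here in Python. It assumes full-speed USB at 12 Mbit/s and that a byte occupies 8, 9 or 10 wire bits depending on stuffing; `clocks_per_byte` is a made-up helper name.

```python
def clocks_per_byte(sysclk_hz, bit_rate_hz=12_000_000, slice_fraction=1.0):
    """Clocks available to one thread per received byte (8, 9 or 10 wire bits)."""
    clocks_per_bit = sysclk_hz / bit_rate_hz
    return [clocks_per_bit * wire_bits * slice_fraction for wire_bits in (8, 9, 10)]


print(clocks_per_byte(48_000_000))                       # full use of a 48MHz clock
print(clocks_per_byte(72_000_000, slice_fraction=0.5))   # 72MHz, 50% time slice
```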
I want to be able to use this instruction for writing/outputting USB too. Currently I think it will work if the previous instruction outputs the bits on J & K pins, then call this instruction which will compile the CRC16/5 for you. The sw will need to do the bitstuffing.
@Cluso99,
When transmitting I believe the CRC16 position in the frame needs to be padded with 16 bits of zeroes at the end, and the CRC process includes the zeroes in its computation, then outputs this CRC data instead of the 16 zeroes. A streaming bit process for doing CRC on the fly with the wire transitions would probably not do this for you.
jmg: If I were to try and compile the Verilog I would get lost for some considerable time. It's better for me to think the thing thru and let others fix the Verilog syntax so it works properly.
Perhaps you might like to do it? I am sure you could check it out simply with your BeMicro??? FPGA, and since you would be using Quartus it would help Chip too.
You wait on a bit change, then step half a bit and sample. QED.
Um this question was asked with respect to jmg's timer snap idea. I was more interested in the hardware details for changing the timer. I know we can do that 1/2 bit approach in software, but only effectively at the start of the packet during the sync period - it would be difficult to do it every bit in software with RXUSB running at the same time.
@Cluso99,
When transmitting I believe the CRC16 position in the frame needs to be padded with 16 bits of zeroes at the end, and the CRC process includes the zeroes in its computation, then outputs this CRC data instead of the 16 zeroes. A streaming bit process for doing CRC on the fly with the wire transitions would probably not do this for you.
As soon as you send a bit by XOR OUTA,pinmask you do a USBBIT D,pins. Then, when you send the last data byte's last data bit, you SHR D,#16 and now we have the CRC in the lower 16 bits ready to shift out. Not sure which end we need to send from so maybe we don't need to do a SHR anyway. We have a bit time to get this ready, so I think its doable.
Yeah, but the real problem is you need these extra 16 zero bits to do the computation, but the zeroes don't go out on the wire. What are you sending during this time?
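On the zero-padding question, a Python sketch may help. In the bit-serial form where the feedback is the input bit XORed with the top CRC bit (as in the Verilog taps posted earlier), the multiplication by x^16 that the appended zeroes provide is already built in, so no zeroes need to be clocked through. The sketch assumes CRC16 USB preset to all ones, with the inverted CRC sent MSB first after the data; the function names are illustrative.

```python
def crc16_usb(bits, crc=0xFFFF):
    """Bit-serial CRC16 USB: feedback = input bit XOR top bit of the register."""
    for bit in bits:
        feedback = bit ^ (crc >> 15)
        crc = (crc << 1) & 0xFFFF
        if feedback:
            crc ^= 0x8005
    return crc


def bytes_to_bits(data):
    """USB sends each byte LSB first."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]


def append_crc16(bits):
    """Transmit side: data bits, then the inverted CRC, MSB first."""
    crc = crc16_usb(bits) ^ 0xFFFF
    return bits + [(crc >> i) & 1 for i in range(15, -1, -1)]


payload = bytes_to_bits(b"\x00\x01\x02\x03")
frame = append_crc16(payload)
residual = crc16_usb(frame)     # receiver runs over data + CRC
print(hex(residual))
```

The receiver ends on a fixed residual regardless of the payload, which is how the check works without the transmitter ever sending padding.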
Um this question was asked with respect to jmg's timer snap idea. I was more interested in the hardware details for changing the timer. I know we can do that 1/2 bit approach in software, but only effectively at the start of the packet during the sync period - it would be difficult to do it every bit in software with RXUSB running at the same time.
You definitely do not need to do it on every bit so you do it when necessary. IIRC it's also easy to limit the block size in USB, so calc the xtal accuracy and ensure you have the timing set correctly and synchronised at the start, and all should be good to go.
Yeah that has been mentioned earlier about keeping packet sizes down. One worry I had was that if you have a hub environment you could see long packets to other devices on the bus. There is a risk for long packets (even if you are ignoring the data as it is not for your address/endpoint) that you might drift and start sampling on transitions, which could be falsely interpreted as an SE0 EOP if there is slight skew between the differential signal transitions; then you start hunting for syncs again and could resync on random bus data. CRCs will probably save us however and things should eventually recover again. That problem can also be dealt with in the software-only approach by additional sampling in the middle of the bit and ensuring multiple EOPs get detected.
This only becomes useful if this was a task and then PASSCNT has to be used.
I'm not following, byte handling allows much lower clocks, even in one task.
I think lowish (FPGA region) clocks and threads should be a practical goal.
But then we also need a TX version too.
Ideally, but TX is less 'drop dead' as it can take some time to assemble/organize things I think.
I would rather control this in sw, especially at this point in time.
Of course, I think coding a "verilog clone" in SW for 1.5 MHz USB testing should be possible.
If that also allows timer-paced sampling, it is a small step to use counters and a per-byte jump.
The shift to timer-paced operation uses almost identical Verilog, and a data buffer for read is small.
It may also avoid this somewhat complex opcode, pushing down fMAX if it works on register-space.
(timer paced code decouples things a little from register critical paths)
I think maybe the CRC does not need a buffered read, as it is checked on EOP ?
If there are spare virtual Pins, the USB RxRDY flags could hook into some of those ?
Chip would likely need to modify the counters slightly to allow /N reloadable counting, and edge resync.
I'm not sure if those modes are already in the Counters.
Yeah that has been mentioned earlier about keeping packet sizes down. One worry I had was that if you have a hub environment you could see long packets to other devices on the bus. There is a risk for long packets (even if you are ignoring the data as it is not for your address/endpoint) that you might drift and start sampling on transitions, which could be falsely interpreted as an SE0 EOP if there is slight skew between the differential signal transitions; then you start hunting for syncs again and could resync on random bus data. CRCs will probably save us however and things should eventually recover again. That problem can also be dealt with in the software-only approach by additional sampling in the middle of the bit and ensuring multiple EOPs get detected.
I think long packets to other devices is not a problem. You just wait for a new 2 SE0's in 2 successive instructions (or SE1's). The sync up is quite easy. That is being done now without crcs.
jmg: If I were to try and compile the Verilog I would get lost for some considerable time. It's better for me to think the thing thru and let others fix the Verilog syntax so it works properly.
The problem with this is that if the Verilog needs a lot of changes (as this does), it quickly becomes too clumsy to have someone else applying fix-ups. Also, in the form you code, checking is harder as it is not so self-contained.
As always, it is better to code in small pieces, get 'working' equations, and look at the .eq0 & .rpt files to confirm you have counters / clock enables / MUXes as expected, and no logic blow-outs.
Below is the code, edited/modified so Lattice Verilog at least compiles it (with some warnings).
////////////////////////////////////////////////////////////////////////////////
// Acknowledgements: Verilog code for CRC's http://www.easics.com
// RR20140310 start
// RR20140311,12 continued
////////////////////////////////////////////////////////////////////////////////
// polynomial: CRC5usb=(0 2 5), CRC16usb=(0 2 15 16), CRC16ccitt=(0 5 12 16)
// data width: 1, LSB first
//
// inputs: D, S, PINS
// outputs: D, Z, C
module RxUSB
(
input CLK, //
input Load_d, //
input jI, //
input kI, //
input WZ, //
input WC, //
input [31:0] s, // S operand
input [31:0] d, // D operand
input [127:0] p, // input pins
output reg [31:0] r, // D result
output reg zz, // Z flag
output reg cy, // Carry flag
output reg SkipStuff, // move so can see in EQNs better
output reg InvalidPM
);
reg [15:0] crc; // original CRC (accumulated)
reg [2:0] bitcnt; // data bit counter 3 bits
reg k; // K new pin value
reg j; // J new pin value
reg [2:0] stuffcnt; // stuff counter 3 bits
reg [7:0] data; // data byte (accumulated)
reg [8:7] poly; // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
reg [6:0] pinno; // pin pair numbers 0-127
reg [15:0] newcrc; // new crc
reg t; // 1 if k toggles (ie 1 bit)
reg kP; // K old pin value
reg jP; // J old pin value
//reg r,z,c; // D result
reg crc05usb;
reg crc16usb;
reg crc16itt;
reg crc16ndef;
///////////////////////////////////////////////////////////////////////////////
// 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
always @(poly) begin
  crc16usb  = (poly == 2'b00); // CRC16usb  =(0 2 15 16)
  crc05usb  = (poly == 2'b01); // CRC5usb   =(0 2 5)
  crc16itt  = (poly == 2'b10); // CRC16ccitt=(0 5 12 16)
  crc16ndef = (poly == 2'b11); // undefined - alias to one above
end
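// check for a "1" bit toggle
// NB: USB NRZI sends a 0 as a toggle and a 1 as no toggle, and the stuffed
// bit is a 0, so the !t polarity in SkipStuff still needs checking.
always @(kI or jI or kP or stuffcnt) begin
  t = kI ^ kP; // new pin value ^ previous pin value; 1=toggled
  SkipStuff = (!t & (stuffcnt == 3'b110)); // !t needed for ccitt ?
  InvalidPM = (kI==jI); // Signaling states are non-diff
end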
always @(posedge CLK) begin
  if (Load_d) begin // WRITE to register - Value INIT
    // crc <= d[31:16]; // original crc value (accum) moved below
    kP       <= d[15];    // previous K
    jP       <= d[14];    // previous J
    stuffcnt <= d[13:11]; // original stuff counter value
    bitcnt   <= d[10:8];  // original bit counter value
    data     <= d[7:0];   // original data value (accum)
    poly     <= s[8:7];   // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
    //? kpin = value(s[6:0]); // K pin no.
    //? jpin = value(s[6:0]) ^1 // J pin no.
    k        <= kI;       // new pin value
    j        <= jI;       // new pin value
  end // Load_d
  else begin // !Load_d = normal RUN , compiler wants in one block..
    k  <= kI; // new pin value
    j  <= jI; // new pin value
    kP <= kI; // becomes the previous K for the next bit
    jP <= jI; // becomes the previous J for the next bit
    // check for bit unstuff
    if (SkipStuff) begin
      // unstuff
      stuffcnt <= 3'b000;
      // bitcnt <= bitcnt; // implicit, but makes hold action clear
    end
    else if (!InvalidPM) begin
      // inc bit count & accum data bit into byte
      bitcnt <= bitcnt + 3'd1; // ++ is SystemVerilog only
      if (t)
        stuffcnt <= 3'b000; // reset if input bit toggles
      else
        stuffcnt <= stuffcnt + 3'd1;
    end
  end // !Load_d
end // (posedge CLK)
reg kr0;
reg kr2;
reg kr5;
reg kr12;
reg kr15;
reg HoldCRC;
always @(*) begin
  // defaults, so every term is assigned on every path (no inferred latches)
  kr0  = 1'b0;
  kr2  = 1'b0;
  kr5  = 1'b0;
  kr12 = 1'b0;
  kr15 = 1'b0;
  // calculate the new crc... - decoded values, so no overlaps in if
  if (crc05usb) begin
    kr0 = t ^ crc[4];
    kr2 = t ^ crc[4];
  end
  if (crc16usb) begin
    kr0  = t ^ crc[15];
    kr2  = t ^ crc[15];
    kr15 = t ^ crc[15];
  end
  if (crc16itt) begin
    kr0  = t ^ crc[15];
    kr5  = t ^ crc[15];
    kr12 = t ^ crc[15];
  end
  if (crc16ndef) begin // alias crc16itt, so cover ALL decodes.
    kr0  = t ^ crc[15];
    kr5  = t ^ crc[15];
    kr12 = t ^ crc[15];
  end
  HoldCRC = InvalidPM | SkipStuff;
end // always @(*)
always @(posedge CLK) begin
  if (Load_d) begin // WRITE to register - Value INIT
    crc <= d[31:16]; // original crc value (accum)
  end
  else if (!HoldCRC) begin // advance only on a valid, non-stuffed bit
    crc[0]  <= kr0;              //16
    crc[1]  <= crc[0];           //17
    crc[2]  <= crc[1] ^ kr2;     //18
    crc[3]  <= crc[2];           //19
    crc[4]  <= crc[3];           //20
    crc[5]  <= crc[4] ^ kr5;     //21
    crc[6]  <= crc[5];           //22
    crc[7]  <= crc[6];           //23
    crc[8]  <= crc[7];           //24
    crc[9]  <= crc[8];           //25
    crc[10] <= crc[9];           //26
    crc[11] <= crc[10];          //27 - bad eqns??, needed <=
    crc[12] <= crc[11] ^ kr12;   //28
    crc[13] <= crc[12];          //29
    crc[14] <= crc[13];          //30
    crc[15] <= crc[14] ^ kr15;   //31
  end // valid
end // (posedge CLK)
// set results
always @(*) begin
  r[31:16] = crc;
  r[15] = k;
  r[14] = j;
end // always @(*)
always @(*) begin // non register here ? - this is a bit mangled, data needs fixing
  if (t) begin // toggled bit?
    r[13:11] = 3'b000; // reset stuff counter
  end
  else begin
    r[13:11] = stuffcnt;
  end
  r[10:8] = bitcnt;
  if (SkipStuff) begin
    r[7:0] = data;
  end
  else begin
    r[7:1] = data[6:0];
    r[0] = t; // add new data bit
  end
end // @(*)
always @(posedge CLK) begin
  if (WZ) begin
    if ( !SkipStuff & (bitcnt == 3'b000)) begin
      zz <= 1'b1; // byte ready
    end
    else begin
      zz <= 1'b0; // byte not ready
    end
  end
  if (WC) begin
    cy <= k ^ j; // as coded: c = 1 when K != J (valid J/K), c = 0 on SE0/SE1
  end
end // (posedge CLK)
endmodule
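For checking the CRC equations away from the FPGA tools, here is a minimal software sketch of the same bit-serial shift in Python. It assumes the decoded data bit (not the raw toggle) is what feeds the CRC, that the register is preset to all ones, and that the field sent on the wire is the inverted register, MSB first. The function names are mine and are not from the thread.

```python
def crc_step(reg, bit, width, poly):
    """Advance the CRC register by one data bit (shift up, XOR taps on feedback)."""
    mask = (1 << width) - 1
    feedback = bit ^ ((reg >> (width - 1)) & 1)  # same role as kr0/kr2/kr15
    reg = (reg << 1) & mask
    if feedback:
        reg ^= poly
    return reg


def crc_bits(bits, width, poly):
    """Run a whole bit list through the register, starting from all ones."""
    reg = (1 << width) - 1
    for bit in bits:
        reg = crc_step(reg, bit, width, poly)
    return reg


def bytes_to_bits(data):
    """USB sends each byte LSB first."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]


def tx_crc_bits(reg, width):
    """Assumed wire format: inverted register, most significant bit first."""
    inverted = ~reg & ((1 << width) - 1)
    return [(inverted >> i) & 1 for i in reversed(range(width))]


CRC16_POLY = 0x8005  # taps 0, 2, 15 (plus the implicit x^16)
CRC5_POLY = 0x05     # taps 0, 2 (plus the implicit x^5)

payload = bytes_to_bits(b"123456789")
reg16 = crc_bits(payload, 16, CRC16_POLY)
residual16 = crc_bits(payload + tx_crc_bits(reg16, 16), 16, CRC16_POLY)

token = [1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1]  # arbitrary 11-bit address/endpoint field
reg5 = crc_bits(token, 5, CRC5_POLY)
residual5 = crc_bits(token + tx_crc_bits(reg5, 5), 5, CRC5_POLY)

print(f"CRC16 residual {residual16:016b}, CRC5 residual {residual5:05b}")
```

Running a payload followed by its own CRC field through the register should leave the residuals the USB spec quotes, 1000000000001101 for CRC16 and 01100 for CRC5, whatever the payload was. That gives a quick pass/fail to compare a simulated `crc` register against at EOP.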
I think long packets to other devices are not a problem. You just wait for 2 new SE0's in 2 successive instructions (or SE1's). The sync up is quite easy. That is being done now without CRCs.
That's starting to sound like a lot of crossed fingers...?
Chip may already have edge reset modes in the counters, and I think the SW WAIT can then work, with a Counter.
To test at 1.5MHz, and a simple Reload timer, the FPGA needs to clock at either 78MHz or 81MHz , with reload values of 52 or 54, and use SW wait values of 50% of those for mid-bit sampling.
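The reload arithmetic is just sysclk divided by the bit rate. A small Python sketch, assuming a whole number of clocks per bit is required (the helper name is hypothetical):

```python
USB_LOW_SPEED_HZ = 1_500_000  # 1.5 Mbit/s


def reload_value(sysclk_hz, bit_rate_hz=USB_LOW_SPEED_HZ):
    """Clocks per bit for a simple reload timer; rejects non-integer ratios."""
    clocks, remainder = divmod(sysclk_hz, bit_rate_hz)
    if remainder:
        raise ValueError("sysclk is not a whole multiple of the bit rate")
    return clocks


for mhz in (78, 81):
    clocks = reload_value(mhz * 1_000_000)
    # mid-bit sample point is half the reload value
    print(f"{mhz} MHz: reload {clocks}, mid-bit wait {clocks // 2}")
```

Any other FPGA clock that is a whole multiple of 1.5 MHz would work the same way.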
Maybe but I didn't see it anywhere. It could have been mentioned elsewhere in older posts. It just looked like it would reuse C each time around, but that doesn't quite work right in my mind.
Relooking at it, we need to actually keep the data bit, not the C bit.
I have also been looking at the GETXP and CRCBIT and wondering if I can combine the instructions, together with the unstuff count.
In my example I did two opcodes, split as
a) a ReadPinPair opcode and
b) a Destuff_CRC_Jump opcode, which reads CY as the ip, and you can read CRC and RxDATA from the register.
This jumps when Counter = 8, which can be any of 8/9/10 physical bits.
That split means the Destuff_CRC_Jump can apply to the Tx bitstream, (via CY) and collect the CRC as it sends Data too.
In your code above, pretty much 0..4 is one opcode, 5 is JNZ, and 6..10 is the other opcode (but including CRC)
The CRC Field part of D, inits with 1's.
The a) opcode could be ReadPinPair_JNZ, I think and still work ? (ie includes JNZ )
Yeah, I thought there was something missing. Now I wonder if it makes sense to combine the destuff and pin sampling work together into one instruction but keep the CRC work separate, so that it remains independent of USB/NRZI and could therefore be used by other non-USB software as well. If we end up having two CRC H/W blocks in the COG, let's call them CRCA and CRCB, it could allow different polynomials/algorithms such as CRC-5 and CRC-16 to be dynamically selected depending on where we are up to in the packet, as we pick either CRCA or CRCB instructions.
Now we have four possible values of Z, C flags which could get returned after the combined USB sampling/destuffing operation, while D can hold current stuff counts, the accumulated byte data and an end of byte marker bit. We can reinitialize independent fields of this register such as D/S/I/X as required, once we process our byte. We just need a way to identify the pin pair, either via including S in the opcode (if we have that luxury in the instruction encodings remaining) or via some other means like a separate SETUSB xxx instruction for example.
Output flags
ZC = 10 - indicates SE0 detected
ZC = 11 - indicates bit was destuffed and should be ignored
ZC = 01 - good data inserted, data bit = 1 (no pin change detected)
ZC = 00 - good data inserted, data bit = 0 (pin change detected)
New bit always gets inserted into D[8], D[9] also remembers the last pin value.
D[7].. D[1] bits contain previous data, D[0] holds end of byte marker, this is shifted downwards later by other code. So D[8]=~(old D[8] XOR pinvalue) in the USBSTUFF instruction, but D[7] = old D[8], D[6] = old D[7] etc down to D[0] when we rotate.
The stuff count could be indicated by copying the last data bit value also into bit31 of the data register, if a 1 reaches bit 25 as we shift downwards you have to start to destuff. So no need for an actual destuff counter, just make use of the shift operation. When the bit is unstuffed once we decode a zero, the top 6 bits of the register are reset to 0.
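To make the NRZI and destuff rules concrete, here is a rough Python model of the bit-level behaviour only. It uses a plain counter rather than the shift-register trick described above, and it does not model the D register layout or the Z/C flags. The function names and the `idle_level` parameter are my own assumptions.

```python
def nrzi_encode_stuffed(bits, idle_level=1):
    """Data bits -> line levels: 0 toggles, 1 holds, stuff a 0 after six 1s."""
    level = idle_level
    ones = 0
    levels = []
    for bit in bits:
        if bit == 0:
            level ^= 1
            ones = 0
        else:
            ones += 1
        levels.append(level)
        if ones == 6:
            level ^= 1          # stuffed 0 forces a transition
            levels.append(level)
            ones = 0
    return levels


def nrzi_decode_unstuff(levels, idle_level=1):
    """Line levels -> data bits, dropping the bit that follows six 1s."""
    prev = idle_level
    ones = 0
    bits = []
    for level in levels:
        bit = 1 if level == prev else 0   # no change = 1, change = 0
        prev = level
        if ones == 6:
            if bit != 0:
                raise ValueError("bit-stuff error: seventh 1 in a row")
            ones = 0                      # stuffed bit is discarded
            continue
        bits.append(bit)
        ones = ones + 1 if bit else 0
    return bits


sync = [0, 0, 0, 0, 0, 0, 0, 1]           # SYNC field as data bits
message = sync + [1] * 8 + [0, 1]
line = nrzi_encode_stuffed(message)
decoded = nrzi_decode_unstuff(line)
print(len(message), "data bits ->", len(line), "line bits")
```

In this example the last bit of the sync plus five following 1s make six in a row, so exactly one stuffed bit goes onto the wire and the decoder removes it again.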
If you wrap this up into a REP loop you get down to 5 instructions per bit, or 6 with the WAITCNT, but the byte exit jumps will take 4 cycles. Not sure if that blows the timing/budget.
D would hold all the values required (data byte accumulator, stuff counter, last pin values, and the CRC).
D could be a fixed register such as $1F0.
The P2 instruction would be...
Here is what I have come up with for possible Verilog code - none of it is tested - I am not a Verilog coder.
jmg: I will take a look at what you have done but my understanding of Verilog is quite poor. Would you mind looking at this please?
The Z & C flags are set to the current KJ pins so that 4 conditions can be decoded automatically (SE0, SE1 plus J, K).
The sw can keep count of 8 bits (Just realised I need a way to test for this as the instruction does not advise of unstuffing without further testing).
I think I will have the instruction keep a count of bits (excludes unstuffing) and set Z when done, and C for SE0/SE1.
Ok, in that case a DJNZ or unrolled loop would probably be needed then.
There may be some issues with starting this (eg first clock has unknown last-pin state).
If you go to a single opcode Verilog scheme, that opens up more options
( note my variant includes a Bit counter )
addit: I see your #71 includes Bit Counter too, so that is pretty much ready to timer-trigger.
The Classic opcode usage is called once per bit
or, instead of PC/execute trigger, consider that same Verilog is now Counter triggered.
The launch can morph slightly, to a WAITUSB style that is asked for 8 bits or exits on SE0
This is almost the same verilog, but now it is SW executed once per byte, slashing the code-overhead.
The once-per-bit internal engine is timer triggered, and exits the WAITUSB style SW interface on 8 bits OR a trap event (SE0, Error).
It is also then fairly simple to do edge-snap on the Timer reloads, which allows longer packets with normal tolerances.
There would be two opcodes: one to Prime on Write and Read when done, and the WAITUSB form.
Init Write values are Counter divide, preset of CRC and presets/clears of counters
Init read gives CRC result, Data Bits, and optional other info
I'd suggest you grab a copy of Lattice ISPLever, CPLD version.
This is faster at compiling and fitting than the newer Diamond, and with CPLD equation output, is a little easier to read, and check what you coded in EQN form.
eg it will use .CE (clock enable) .D .C on flipflops in the EQN out which are easy to scan.
The USB opcode engine will 'fit' in a modest CPLD.
Compile/FIT is ~ 15seconds, for my code example, and LC4256ZE-B-EVN is a possible hardware test platform.
The first time you call this the initial state in the D register passed should be known and setup as the idle state of the bus. This is fixed for low speed/high speed. We can keep the last data around from the previous call as well. No need to trash it.
I quite like the sound of that idea. It would however probably prevent or impact a multi-tasking based implementation like I proposed in my first post due to variable execution timing.
I'm intrigued by this idea to remain in sync for long packets. How would you do this in practice?
One thing we need to bear in mind that we don't want the final SE0 bit condition received at the end of the packet to have any impact on the last CRC operation. The CRC needs to be preserved intact so we can validate it.
Good point, CRC and Data Rx need to advance only on valid Bits, ie Stuff OR P=M => Skip change of CRC.Data.
Code in #71 does not qualify CRC advance, on Skip,
hehe, some people want everything !!
The first SW-loops suggested were so starved of cycles, there was no multi-tasking option, it was needing special care to even run at 80MHz.
If the USB read becomes byte based with WAITUSB, then the next logical step is to allow buffering on Byte-read, so it behaves very like a conventional UART - ie you have a whole byte time of elbow room.
With no buffering, the get-byte-and restart WAITUSB will have tight constraints.
If we target a 48MHz min CLK, each byte arrives every 32/36/40 sysclks, which might allow some careful multi-tasking with UART style buffering.
I'm sure a 50% usage, 48MHz USB support would have wide appeal
Note you would be limited to one USB per COG, and also on present P2, no Full task swaps or any libraries using locks in the same COG.
If suggested enhance of P2 to HW queue for shared resource were done, then libraries and code need to use LOCK far less often, and a USB thread could run at 50% clocks, with almost anything happening in the other threads.
Yeah, the tight timing constraints make it tricky and could add limitations as to the USB implementation. With a fully buffered byte process we have oodles more time to process the data without upsetting the bit capture process, and the driver code is a lot easier to write and understand.
Update: Actually "oodles" probably only means about ~14 bits worth of time (from memory) before we need to respond to the USB host with an ACK/NAK/STALL or the data requested or risk hitting a timeout. A byte buffer will eat into this time so we need to not buffer more than one byte. But that still gives lots more instructions breathing room to get the job done. 8 USB bits at say 96/48MHz is 64/32 P2 instruction cycles for example. You can do a lot in that time.
You will need to reset the bit and stuff counters, setup the J & K bits, preferably clear the data byte, and preset the CRC16 to $FFFF. I want to be able to use this instruction for writing/outputting USB too. Currently I think it will work if the previous instruction outputs the bits on J & K pins, then call this instruction which will compile the CRC16/5 for you. The sw will need to do the bitstuffing. This only becomes useful if this was a task and then PASSCNT has to be used. But then we also need a TX version too.
I would rather control this in sw, especially at this point in time. I am trying to cover the CRC16-CCITT plus both CRC5 & CRC16 for the USB, for both RX & TX cases.
IIRC there is no unstuffing in CRC16-CCITT protocols as they use SYN & DLE. We do have to be mindful of creating general cases so we can use them elsewhere. Currently this will do NRZI comms, and the special stuff/unstuff. We also have to be mindful of instruction availability and silicon.
If we assume realistic and useful targets of 48MHz CLK to timers and 48MHz sliced 50% to CPU, then each bit is 4 clocks.
Any edge forces the counter to (say) 00 and then it clocks 012301230123 when no edges are present.
Data is sampled when the counter is at 50% ==2, and timer values of 1 and 3 here, are margin.
Timing skew will either shorten the 3 value, or extend the 0, and so it will jitter about the correct clock speed.
At higher clock speeds, the granularity improves.
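The edge-snap idea can be tried in software first. Below is a minimal Python sketch, assuming one line sample per timer clock, a counter that is forced to 0 on any edge, and a data sample taken when the counter reaches half the divide. The function names and the run-length test pattern are made up for illustration.

```python
def sample_with_edge_resync(samples, clocks_per_bit=4):
    """Recover one level per bit from an oversampled line, resyncing on edges."""
    sample_phase = clocks_per_bit // 2    # 50% point, 2 when dividing by 4
    counter = 0
    prev = samples[0]
    bits = []
    for i, level in enumerate(samples):
        if i > 0:
            if level != prev:
                counter = 0               # any edge snaps the counter to 0
            else:
                counter = (counter + 1) % clocks_per_bit
        if counter == sample_phase:
            bits.append(level)            # mid-bit sample
        prev = level
    return bits


def expand(runs):
    """Build a sample list from (level, clocks) runs."""
    samples = []
    for level, clocks in runs:
        samples.extend([level] * clocks)
    return samples


# Nominal 4 clocks per bit, with one run a clock long and one a clock short.
runs = [(1, 4), (0, 9), (1, 3), (0, 4), (1, 8), (0, 4)]
recovered = sample_with_edge_resync(expand(runs))
print(recovered)
```

The 9-clock run (two bits, one clock long) and the 3-clock run (one bit, one clock short) both still land a sample inside each bit, because the following edge pulls the counter back to 0.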
If you stick with even divides, the possible clocks are 48MHz, 72MHz, 96MHz, 120MHz etc;
if you allow uneven sides, (which should be ok) then 60MHz, 84MHz, 108MHz are also possible.
72MHz is inside present FPGA builds and 84MHz / 96MHz (+?) may be possible on Cyclone V builds.
At 72MHz and 50% slice and UART style buffering, there is 24/27/30 thread clocks per byte streaming.
Will that be enough to meet packet specs ?
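The clock and budget figures quoted here follow from 12 Mbit/s and a whole number of clocks per bit. A quick Python sketch of that arithmetic, with the thread share passed as a percentage (helper names are hypothetical):

```python
USB_FULL_SPEED_HZ = 12_000_000  # 12 Mbit/s


def candidate_clocks_mhz(min_divide=4, max_divide=10):
    """Sysclks giving a whole number of clocks per bit, split by even/odd divide."""
    even = [12 * n for n in range(min_divide, max_divide + 1) if n % 2 == 0]
    odd = [12 * n for n in range(min_divide, max_divide + 1) if n % 2 == 1]
    return even, odd


def thread_clocks_per_byte(sysclk_hz, share_percent=100):
    """Clocks available to one thread per byte, for 8, 9 or 10 physical bits."""
    clocks_per_bit = sysclk_hz // USB_FULL_SPEED_HZ
    return [bits * clocks_per_bit * share_percent // 100 for bits in (8, 9, 10)]


even, odd = candidate_clocks_mhz()
print("even divides:", even, "odd divides:", odd)
print("48 MHz, all clocks:", thread_clocks_per_byte(48_000_000))
print("72 MHz, 50% slice:", thread_clocks_per_byte(72_000_000, 50))
```

The 9 and 10 bit cases cover a byte that picked up one or two stuffed bits on the wire.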
@Cluso99,
When transmitting, I believe the CRC16 position in the frame needs to be padded with 16 bits of zeroes at the end, with the CRC process including the zeroes in its computation and then outputting this CRC data instead of the 16 zeroes. A streaming bit process for doing CRC on the fly with the wire transitions would probably not do this for you.
Perhaps you might like to do it? I am sure you could check it out simply with your BeMicro??? FPGA, and since you would be using Quartus it would help Chip too.
Um this question was asked with respect to jmg's timer snap idea. I was more interested in the hardware details for changing the timer. I know we can do that 1/2 bit approach in software, but only effectively at the start of the packet during the sync period - it would be difficult to do it every bit in software with RXUSB running at the same time.
I'm not following, byte handling allows much lower clocks, even in one task.
Updated code, better CRC eqns