P2 and full speed USB slave requirements/ideas

rogloh · 2014-03-10 20:13

jmg wrote: »

I think this last value work is done in the background, as part of the opcode.
It means the very first opcode C will be discarded, as the previous value is ?? but the SE0 will be valid from first clock.

Maybe but I didn't see it anywhere. It could have been mentioned elsewhere in older posts. It just looked like it would reuse C each time around, but that doesn't quite work right in my mind.

Cluso99 · 2014-03-10 21:03

rogloh wrote: »

@Cluso99,

There is one thing still confusing me about the proposed GETXP instruction. After calling such an instruction you would get carry flag C result being the XOR of the original C flag value and one of the USB data pins. So if C comes back as 1 that means it was different to the sampled pin value, and if it comes back 0 it was the same value as the sampled pin value. This is fine and it detects logical 0/1 NRZI bitstream nicely.

However, unless I am missing something else it appears you would then want to reuse C again for the next iteration. The problem is that this time around C is not the last pin value; it indicates whether there was a difference between the previous C value and the previous pin value. So some other operation to reset C back to the previous data pin value appears to be required before the next time it gets called, or some trick is required. Are you doing this as well somewhere in your code? I didn't see that mentioned anywhere. Won't this require an additional clock cycle to do?

I think you have found a bug when I reduced the logic.
Relooking at it, we need to actually keep the data bit, not the C bit.
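
A minimal sketch of the problem rogloh describes, written in Python rather than PASM. It assumes a hypothetical GETXP that computes C = C XOR pin on every call; the function names and the sample pin sequence are illustrative only and come from neither post.

```python
def detect_reusing_c(pins, c=1):
    """Flawed variant: the XOR result left in C is fed straight back in
    as the 'previous' value on the next call."""
    changes = []
    for pin in pins:
        c ^= pin                 # C now means "differed", not "last pin level"
        changes.append(c)
    return changes


def detect_keeping_pin(pins, last_pin=1):
    """Corrected variant: remember the sampled pin level itself."""
    changes = []
    for pin in pins:
        changes.append(pin ^ last_pin)   # 1 = pin changed since last sample
        last_pin = pin
    return changes


pins = [1, 0, 0, 1, 1, 1, 0]             # hypothetical mid-bit samples
print(detect_keeping_pin(pins))
print(detect_reusing_c(pins))
```

The two lists stop agreeing as soon as the pin changes for the first time, which is why the data bit (the pin level) has to be kept rather than the C result.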

I have also been looking at the GETXP and CRCBIT and wondering if I can combine the instructions, together with the unstuff count.

rx
              waitcnt   time, bittime           ' 0 wait for next mid-bit sample time
              test      K, pina         wz      ' 1 read usb pin
              muxz      bits, bit30     wc      ' 2 b31=previous, b30=new; C=parity 00/11=odd, 01/10=even
              shl       bits, #1                ' 3 shift new b30 bit into previous b31
              test      JK, pina        wz      ' 4 check for SE0 (ie EOP) ?
        if_z  jmp       #waitforend             ' 5 y: wait for end
              rcr       data, #1                ' 6 accum new bit (rotate carry from xor into top bit)
              rcl       stuffcnt, #6    wz      ' 7 if entire register is zero, we need to unstuff (6 bits in)
        if_z  jmp       #unstuff                ' 8 y: unstuff next bit
              sub       bitcnt, #1 wz           ' 9 bitcnt-- 
        if_nz jmp       #rx                     ' 10
' 8/32 bits so save long...        

J             long      1<<DM_PIN               ' D-
K             long      1<<DP_PIN               ' D+
JK            long      1<<DP_PIN | 1<<DM_PIN   ' J & K bit mask
enable        long      1<<EN_PIN               ' Enable 1K5 pullup for LS
bittime       long      BIT_DLY                 ' USB bit time
bit30         long      1<<30        '$40000000 ' MUX mask for RX inbound xor register
data          long      0
stuffcnt      long      0                       ' counts 1 bits (was dc in rx and db in tx)
bits          long      0                       ' b31=previous, b30=new
time          long      0

jmg · 2014-03-10 22:33

Cluso99 wrote: »

I have also been looking at the GETXP and CRCBIT and wondering if I can combine the instructions, together with the unstuff count.

In my example I did two opcodes, split as

a) a ReadPinPair opcode and
b) a Destuff_CRC_Jump opcode, which reads CY as the ip, and you can read CRC and RxDATA from the register.
This jumps when Counter = 8, which can be any of 8/9/10 physical bits.

That split means the Destuff_CRC_Jump can apply to the Tx bitstream, (via CY) and collect the CRC as it sends Data too.

In your code above, pretty much 0..4 is one opcode, 5 is JNZ, and 6..10 is the other opcode (but including CRC)
The CRC Field part of D, inits with 1's.

The a) opcode could be ReadPinPair_JNZ, I think and still work ? (ie includes JNZ )

rogloh · 2014-03-11 00:02

Cluso99 wrote: »

I think you have found a bug when I reduced the logic.
Relooking at it, we need to actually keep the data bit, not the C bit.

I have also been looking at the GETXP and CRCBIT and wondering if I can combine the instructions, together with the unstuff count.

Yeah I thought there was something missing. Now I wonder if it makes sense to combine the destuff and pin sampling work together into one instruction but keep the CRC work separate so that it remains independent of USB/NRZI and could therefore be used by other non-USB software as well. If we end up having two CRC H/W blocks in the COG, let's call them CRCA, CRCB, it could allow different polynomials/algorithms such as CRC-5 and CRC-16 to be dynamically selected depending on where we are up to in the packet as we pick either CRCA or CRCB instructions.

Now we have four possible values of Z, C flags which could get returned after the combined USB sampling/destuffing operation, while D can hold current stuff counts, the accumulated byte data and an end of byte marker bit. We can reinitialize independent fields of this register such as D/S/I/X as required, once we process our byte. We just need a way to identify the pin pair, either via including S in the opcode (if we have that luxury in the instruction encodings remaining) or via some other means like a separate SETUSB xxx instruction for example.

Output flags
ZC = 10 - indicates SE0 detected
ZC = 11 - indicates bit was destuffed and should be ignored
ZC = 01 - good data inserted, data bit = 1 (no pin change detected)
ZC = 00 - good data inserted, data bit = 0 (pin change detected)
New bit always gets inserted into D[8], D[9] also remembers the last pin value.
D[7].. D[1] bits contain previous data, D[0] holds end of byte marker, this is shifted downwards later by other code. So D[8]=~(old D[8] XOR pinvalue) in the USBSTUFF instruction, but D[7] = old D[8], D[6] = old D[7] etc down to D[0] when we rotate.

The stuff count could be indicated by copying the last data bit value also into bit31 of the data register, if a 1 reaches bit 25 as we shift downwards you have to start to destuff. So no need for an actual destuff counter, just make use of the shift operation. When the bit is unstuffed once we decode a zero, the top 6 bits of the register are reset to 0.

USBSTUFF data WC WZ  ' this does the USB pin sampling and destuffing, we shift down the 8 bit data and do the CRC operation ourselves below
if_z_and_nc   JMP #se0_detected    'exit loop if SE0
if_nz         CRCA crcval          ' accumulate CRC using C flag as data
if_nz         SHR data, #1 WC      ' if C is set we are at the end of the byte
if_nz_and_c   JMP #byte_done

If you wrap this up into a REP loop you get down to 5 instructions per bit, or 6 with the WAITCNT, but the byte exit jumps will take 4 cycles. Not sure if that blows the timing/budget.
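
The end-of-byte marker idea above can be sketched in Python. This is only a model under stated assumptions: it takes bits that are already NRZI-decoded and destuffed, it presets the marker at bit 7 so that it reaches D[0] just before the eighth shift, and the names MARKER_INIT, push_bit and receive_byte are made up for illustration.

```python
MARKER_INIT = 1 << 7     # assumed preset: marker sits at bit 7 before the first bit


def push_bit(data, bit):
    """Insert the new bit at D[8], then shift right by one.
    Returns (new_data, carry), where carry models SHR ... WC."""
    data |= (bit & 1) << 8
    carry = data & 1          # bit 0 falls out into C
    return data >> 1, carry


def receive_byte(bits):
    """Feed decoded bits until the marker drops into carry.
    Returns (byte, bits_consumed), or None if the marker never came out."""
    data = MARKER_INIT
    for count, bit in enumerate(bits, start=1):
        data, carry = push_bit(data, bit)
        if carry:                        # marker reached C: byte complete
            return data & 0xFF, count
    return None


# 0xA5 sent LSB first, as USB does
print(receive_byte([1, 0, 1, 0, 0, 1, 0, 1]))
```

The point of the trick is that no separate bit counter is needed: the marker takes exactly eight shifts to reach the carry flag, and the first bit received ends up in bit 0 of the byte.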

Cluso99 · 2014-03-11 00:12

Here is a possible special USB instruction that should take 1 clock (4 via the pipeline).

D would hold all the values required (data byte accumulator, stuff counter, last pin values, and the CRC).
D could be a fixed register such as $1F0.

The P2 instruction would be...

              RECVUSB   D, S/#          WZ,WC
where
  S/# is the PinPair# and Poly bits
    S[31..9]  = unused
    S[8..7]   = 00= CRC16 USB
                01= CRC5  USB
                10= CRC16 CCITT
                11= undefined
    S[6..0]   = D-/D+ Pin Pair #0..127
                The pin pair is always a pair of pins mod 2. ie nnnnnnx where x=0 and x=1 for the pair.
                If the pin pair is even (S[0]=0) then J is the lowest pin and K is the higher pin of the consecutive pair
                If the pin pair is odd  (S[0]=1) then K is the lowest pin and J is the higher pin of the consecutive pair.
                This arrangement allows for simple LS and FS by making the pin pair even or odd.                              
  D is the cog register storing a 32 bit field...
    D[31..16] = crc16
    D[15]     = K new pin value
    D[14]     = J new pin value
    D[13..11] = undefined
    D[10..8]  = unstuff counter 3 bits
    D[7..0]   = data byte accumulation
  Z = new D[15] ie K new value
  C = new D[14] ie J new value
    ZC
    00 = SE0
    01 = J ?
    10 = K ?
    11 = SE1
(may want to invert Z ??? and swap D[14] - D[15] ???)
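
To make the proposed S operand encoding concrete, here is a small Python decoder. It is a sketch that follows the prose rule above (even pair: J on the lower pin, K on the higher; odd pair: the reverse); decode_s and POLY_NAMES are illustrative names, not part of any P2 toolchain.

```python
POLY_NAMES = {
    0b00: "CRC16 USB",
    0b01: "CRC5 USB",
    0b10: "CRC16 CCITT",
    0b11: "undefined",
}


def decode_s(s):
    """Split the proposed S operand into (poly_name, j_pin, k_pin)."""
    poly = POLY_NAMES[(s >> 7) & 0b11]    # S[8..7]
    pair = s & 0x7F                       # S[6..0]
    low, high = pair & ~1, pair | 1       # the two consecutive pins of the pair
    if (pair & 1) == 0:
        j_pin, k_pin = low, high          # even: J is the lower pin
    else:
        k_pin, j_pin = low, high          # odd: K is the lower pin
    return poly, j_pin, k_pin


print(decode_s((0b01 << 7) | 4))   # CRC5, even pair on pins 4 and 5
print(decode_s(5))                 # CRC16 USB, same pins with J and K swapped
```

Swapping J and K by flipping S[0] is what lets the same pin pair serve both LS and FS, since the idle polarity differs between the two.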

Here is what I have come up with for possible Verilog code - none of it is tested - I am not a Verilog coder.

module RECVUSB;
  // polynomial: CRC5usb=(0 2 5), CRC16usb=(0 2 15 16), CRC16=(0 5 12 16) 
  // data width: 1
  // convention: the first serial bit is data[0]
  function [31:0] RxUSB;
    input [1:0]   poly;         // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined 
    input [31:0]  dest;         // original D value
    input [1:0]   pins;         // K:J pin values
    reg [15:0]  c;              // original CRC (accumulated)
//  reg [2:0]   spare;          // undefined
    reg [0:0]   k;              // K new pin value
    reg [0:0]   j;              // J new pin value
    reg [2:0]   stuffcnt;       // stuff counter 3 bits
    reg [7:0]   data;           // data byte (accumulated)
// 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
  always @(poly)  begin   
    crc16usb  = (poly == 2'b00); 
    crc05usb  = (poly == 2'b01); 
    crc16itt  = (poly == 2'b10); 
  end
  begin
    c = dest[31:16];            // original crc value (accum)
    stuffcnt = dest[10:8];      // original stuff counter value (3 bits, D[10..8])
    k = pins[1];                // new pin values
    j = pins[0];                // new pin values
// calculate the new crc...
    newcrc[1]  = c[0];
    newcrc[3]  = c[2];
    newcrc[4]  = c[3];
    newcrc[6]  = c[5];
    newcrc[7]  = c[6];
    newcrc[8]  = c[7];
    newcrc[9]  = c[8];
    newcrc[10] = c[9];
    newcrc[11] = c[10];
    newcrc[13] = c[12];
    newcrc[14] = c[13];
    if crc05usb then begin
        newcrc[0]  =         k ^  c[4];
        newcrc[2]  = c[1]  ^ k ^  c[4];
    end
    if crc16usb or crc16itt then begin
        newcrc[0]  =         k ^ c[15];        
    end
    if crc16usb then begin
        newcrc[2]  = c[1]  ^ k ^ c[15];
        newcrc[5]  = c[4];
        newcrc[12] = c[11];
        newcrc[15] = c[14] ^ k ^ c[15];
    end
    if crc16itt then begin
        newcrc[2]  = c[1];
        newcrc[5]  = c[4]  ^ k ^ c[15];
        newcrc[12] = c[11] ^ k ^ c[15];
        newcrc[15] = c[14];
    end
// check for bit unstuff
    if stuffcnt == 3b'110' then begin
        // unstuff
        RxUSB[10:8] = 3b'000';
        rxUSB[7:0]  = dest[7:0];
    else
        // accum data bit into byte
        RxUSB[10:8] = stuffcnt++;
        data[7:1] = dest[6:0];
        data[0]   = k ^ dest[15];   // k ^ previous pin value    
    end        
    RxUSB[31:16] = newcrc[15:0];
    RxUSB[15]    = k;               // D[15] = K new pin value
    RxUSB[14]    = j;               // D[14] = J new pin value
    RxUSB[13:11] = 3'b000;          // D[13..11] undefined, cleared here
    RxUSB[10:8]  = stuffcnt;
    RxUSB[7:0]   = data[7:0];
    if WZ then begin
        Z[0]         = k;
    end
    if WC then begin           
        C[0]         = j;
    end           
  end
  endfunction
endmodule

jmg: I will take a look at what you have done but my understanding of Verilog is quite poor. Would you mind looking at this please?
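
For anyone wanting to check the newcrc equations without a Verilog tool, here is a Python sketch of the same one-bit-per-step CRC register. The shift-left form with polynomials 0x05, 0x8005 and 0x1021 corresponds to the taps listed in the draft (0 2 5, 0 2 15 16, 0 5 12 16). The all-ones preset and the inverted, MSB-first CRC field are standard USB behaviour; the function names are illustrative.

```python
POLYS = {
    "crc5usb":    (5,  0x05),     # x^5  + x^2  + 1
    "crc16usb":   (16, 0x8005),   # x^16 + x^15 + x^2 + 1
    "crc16ccitt": (16, 0x1021),   # x^16 + x^12 + x^5 + 1
}


def crc_bit(crc, bit, name):
    """Advance the CRC register by one serial data bit."""
    width, poly = POLYS[name]
    feedback = bit ^ ((crc >> (width - 1)) & 1)   # data bit XOR top CRC bit
    crc = (crc << 1) & ((1 << width) - 1)
    return crc ^ poly if feedback else crc


def crc_bits(bits, name):
    """Run a bit sequence through the register, preset to all ones."""
    width, _ = POLYS[name]
    crc = (1 << width) - 1
    for bit in bits:
        crc = crc_bit(crc, bit, name)
    return crc


def usb_crc_field(bits, name):
    """CRC field as it goes on the wire: inverted, most significant bit first."""
    width, _ = POLYS[name]
    crc = crc_bits(bits, name) ^ ((1 << width) - 1)
    return [(crc >> i) & 1 for i in range(width - 1, -1, -1)]


token = [1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1]     # 11 token bits in wire order
field = usb_crc_field(token, "crc5usb")
print(field, hex(crc_bits(token + field, "crc5usb")))
```

A receiver that keeps clocking through the CRC field should end on the fixed residual (0x0C for CRC5, 0x800D for CRC16), so validating a packet is a single compare at EOP.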

Cluso99 · 2014-03-11 00:25

You will note in my previous post that the single instruction RECVUSB checks the pin pairs, calculates/accumulates the CRC (5 or 16usb or 16ccitt), unstuffs bits, and accumulates this bit to the byte. Everything is held in the one D long/register. In particular, the CRC is in the upper word, and the lowest byte contains the data byte.

The Z & C flags are set to the current KJ pins so that 4 conditions can be decoded automatically (SE0, SE1 plus J, K).

The sw can keep count of 8 bits (Just realised I need a way to test for this as the instruction does not advise of unstuffing without further testing).

I think I will have the instruction keep a count of bits (excludes unstuffing) and set Z when done, and C for SE0/SE1.

Cluso99 · 2014-03-11 00:26

rogloh wrote: »
Yeah I thought there was something missing. Now I wonder if it makes sense to combine the destuff and pin sampling work together into one instruction but keep the CRC work separate so that it remains independent of USB/NRZI and could therefore be used by other non-USB software as well. If we end up having two CRC H/W blocks in the COG, let's call them CRCA, CRCB, it could allow different polynomials/algorithms such as CRC-5 and CRC-16 to be dynamically selected depending on where we are up to in the packet as we pick either CRCA or CRCB instructions.

Now we have four possible values of Z, C flags which could get returned after the combined USB sampling/destuffing operation, while D can hold current stuff counts, the accumulated byte data and an end of byte marker bit. We can reinitialize independent fields of this register such as D/S/I/X as required, once we process our byte. We just need a way to identify the pin pair, either via including S in the opcode (if we have that luxury in the instruction encodings remaining) or via some other means like a separate SETUSB xxx instruction for example.

Output flags
ZC = 10 - indicates SE0 detected
ZC = 11 - indicates bit was destuffed and should be ignored
ZC = 01 - good data inserted, data bit = 1 (no pin change detected)
ZC = 00 - good data inserted, data bit = 0 (pin change detected)
New bit always gets inserted into D[8], D[7].. D[1] bits contain previous data, D[0] holds end of byte marker, this is shifted downwards later by other code. So D[8]=~(old D[8] XOR pinvalue) in the USBSTUFF instruction, but D[7] = old D[8], D[6] = old D[7] etc down to D[0] when we rotate.

The stuff count could be indicated by copying the last data bit value also into bit31 of the data register, if a 1 reaches bit 25 as we shift downwards you have to start to destuff. So no need for an actual destuff counter, just make use of the shift operation. When the bit is unstuffed once we decode a zero, the top 6 bits of the register are reset to 0.
USBSTUFF data WC WZ  ' this does the USB pin sampling and destuffing, we shift down the 8 bit data and do the CRC operation ourselves below
if_z_and_nc   JMP #se0_detected    'exit loop if SE0
if_nz         CRCA crcval          ' accumulate CRC using C flag as data
if_nz         SHR data, #1 WC      ' if C is set we are at the end of the byte
if_nz_and_c   JMP #byte_done
If you wrap this up into a REP loop you get down to 5 instructions per bit, or 6 with the WAITCNT, but the byte exit jumps will take 4 cycles. Not sure if that blows the timing/budget.

Unfortunately you cannot have a waitcnt/passcnt within a repx loop.

rogloh · 2014-03-11 01:48

Cluso99 wrote: »

Unfortunately you cannot have a waitcnt/passcnt within a repx loop.

Ok, in that case a DJNZ or unrolled loop would probably be needed then.

Cluso99 · 2014-03-11 05:27

I thought I would post where I am up to (before I retire for the evening) with the Verilog for the P2 USB instruction. It still needs some work wrt unstuffing as I don't reset the counter when I get a clocked bit.

////////////////////////////////////////////////////////////////////////////////
// Acknowledgements: Verilog code for CRC's   http://www.easics.com
// RR20140310 start
// RR20140311 continued
////////////////////////////////////////////////////////////////////////////////
// polynomial: CRC5usb=(0 2 5), CRC16usb=(0 2 15 16), CRC16ccitt=(0 5 12 16) 
// data width: 1, LSB first
//
// inputs:  D, S, PINS
// outputs: D, Z, C
module          RxUSB
(
input   [31:0]  s,              // S operand
input   [31:0]  d,              // D operand
input   [127:0] p,              // input pins
output  [31:0]  r,              // D result
output          z,              // Z flag
output          c               // C flag    
);
reg     [15:0]  crc;            // original CRC (accumulated)
reg     [2:0]   bitcnt;         // data bit counter 3 bits
reg             k;              // K new pin value
reg             j;              // J new pin value
reg     [2:0]   stuffcnt;       // stuff counter 3 bits
reg     [7:0]   data;           // data byte (accumulated)
reg     [8:7]   poly;           // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
reg     [6:0]   pinno;          // pin pair numbers 0-127
reg     [15:0]  newcrc;         // new crc

///////////////////////////////////////////////////////////////////////////////
  begin
    crc      = d[31:16];        // original crc value (accum)
    k0       = d[15];           // previous K
    j0       = d[14];           // previous J
    stuffcnt = d[13:11];        // original stuff counter value
    bitcnt   = d[10:8];         // original bit   counter value
    data     = d[7:0];          // original data value (accum)
    poly     = s[8:7];          // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
?   kpin     = value(s[6:0]);   // K pin no.
?   jpin     = value(s[6:0]) ^1 // J pin no.
    k        = pins[kpin];      // new pin value
    j        = pins[jpin];      // new pin value
// 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
    always @(poly)  begin   
        crc16usb  = (poly == 2'b00);    // CRC16usb  =(0 2      15 16)
        crc05usb  = (poly == 2'b01);    // CRC5usb   =(0 2 5)
        crc16itt  = (poly == 2'b10);    // CRC16ccitt=(0   5 12    16)
    end
// calculate the new crc...
    if crc05usb then
        kr0  = k ^ crc[4];
        kr2  = k ^ crc[4];
        kr5  = 1b'0;
        kr12 = 1b'0;
        kr15 = 1b'0; 
    else if crc16usb then
        kr0  = k ^ crc[15];
        kr2  = k ^ crc[15];
        kr5  = 1b'0;
        kr12 = 1b'0;
        kr15 = k ^ crc[15]; 
    else if crc16itt then
        kr0  = k ^ crc[15];
        kr2  = 1b'0;
        kr5  = k ^ crc[15];
        kr12 = k ^ crc[15];
        kr15 = 1b'0; 
    end;    
    newcrc[0]  = kr0;
    newcrc[1]  = crc[0];
    newcrc[2]  = crc[1] ^ kr2;
    newcrc[3]  = crc[2];
    newcrc[4]  = crc[3];
    newcrc[5]  = crc[4] ^ kr5;
    newcrc[6]  = crc[5];
    newcrc[7]  = crc[6];
    newcrc[8]  = crc[7];
    newcrc[9]  = crc[8];
    newcrc[10] = crc[9];
    newcrc[11] = crc[10];
    newcrc[12] = crc[11] ^ kr12;
    newcrc[13] = crc[12];
    newcrc[14] = crc[13];
    newcrc[15] = crc[14] ^ kr15;
    
??  check "1' bit first
// check for bit unstuff
    if stuffcnt == 3b'110' then 
        // unstuff
        stuffcnt = 3b'000';
        r[7:0]  = data[7:0];
        zero = 1b'0;
    else
        // inc bit count & accum data bit into byte
        bitcnt++;
        if bitcnt == 3b'000 then 
            zero = 1b'1;
        else
            zero = 1b'0;
        end
        stuffcnt++;
        r[7:1] = data[6:0];
        r[0]   = k ^ k0;        // k ^ previous pin value    
    end        
    r[31:16] = newcrc[15:0];
    r[15]    = k;
    r[14]    = j;
    r[13:11] = stuffcnt;
    r[10:8]  = bitcnt[2:0];
    if WZ then begin
        z = zero;
    end
    if WC then begin           
        c = k ^ j;
    end           
  end
  endfunction
endmodule

Cluso99 · 2014-03-11 07:09

Here is the latest. (Now I am really retiring for the evening)

////////////////////////////////////////////////////////////////////////////////
// Acknowledgements: Verilog code for CRC's   http://www.easics.com
// RR20140310 start
// RR20140311,12 continued
////////////////////////////////////////////////////////////////////////////////
// polynomial: CRC5usb=(0 2 5), CRC16usb=(0 2 15 16), CRC16ccitt=(0 5 12 16) 
// data width: 1, LSB first
//
// inputs:  D, S, PINS
// outputs: D, Z, C
module          RxUSB
(
input   [31:0]  s,              // S operand
input   [31:0]  d,              // D operand
input   [127:0] p,              // input pins
output  [31:0]  r,              // D result
output          z,              // Z flag
output          c               // C flag    
);

reg     [15:0]  crc;            // original CRC (accumulated)
reg     [2:0]   bitcnt;         // data bit counter 3 bits
reg             k;              // K new pin value
reg             j;              // J new pin value
reg     [2:0]   stuffcnt;       // stuff counter 3 bits
reg     [7:0]   data;           // data byte (accumulated)
reg     [8:7]   poly;           // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
reg     [6:0]   pinno;          // pin pair numbers 0-127
reg     [15:0]  newcrc;         // new crc
reg             t;              // 1 if k toggles (ie 1 bit)

///////////////////////////////////////////////////////////////////////////////
  begin
    crc      = d[31:16];        // original crc value (accum)
    k0       = d[15];           // previous K
    j0       = d[14];           // previous J
    stuffcnt = d[13:11];        // original stuff counter value
    bitcnt   = d[10:8];         // original bit   counter value
    data     = d[7:0];          // original data value (accum)

    poly     = s[8:7];          // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
?   kpin     = value(s[6:0]);   // K pin no.
?   jpin     = value(s[6:0]) ^1 // J pin no.

    k        = pins[kpin];      // new pin value
    j        = pins[jpin];      // new pin value

// 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
    always @(poly)  begin   
        crc16usb  = (poly == 2'b00);    // CRC16usb  =(0 2      15 16)
        crc05usb  = (poly == 2'b01);    // CRC5usb   =(0 2 5)
        crc16itt  = (poly == 2'b10);    // CRC16ccitt=(0   5 12    16)
    end

// check for a "1" bit toggle
    t = k ^ k0;                 // new pin value ^ previous pin value; 1=toggled

// check for bit unstuff
    if !t && (stuffcnt == 3b'110') then 
        // unstuff
        unstuff = 1b'1;         // unstuff true
        stuffcnt = 3b'000';
    else
        // inc bit count & accum data bit into byte
        unstuff = 1b'0;         // unstuff false
        bitcnt++;
        stuffcnt++;             // will be reset at result if input bit toggles
    end        

// calculate the new crc...
    if crc05usb then
        kr0  = t ^ crc[4];
        kr2  = t ^ crc[4];
        kr5  = 1b'0;
        kr12 = 1b'0;
        kr15 = 1b'0; 
    else if crc16usb then
        kr0  = t ^ crc[15];
        kr2  = t ^ crc[15];
        kr5  = 1b'0;
        kr12 = 1b'0;
        kr15 = t ^ crc[15]; 
    else if crc16itt then
        kr0  = t ^ crc[15];
        kr2  = 1b'0;
        kr5  = t ^ crc[15];
        kr12 = t ^ crc[15];
        kr15 = 1b'0; 
    end;    

    newcrc[0]  = kr0;
    newcrc[1]  = crc[0];
    newcrc[2]  = crc[1] ^ kr2;
    newcrc[3]  = crc[2];
    newcrc[4]  = crc[3];
    newcrc[5]  = crc[4] ^ kr5;
    newcrc[6]  = crc[5];
    newcrc[7]  = crc[6];
    newcrc[8]  = crc[7];
    newcrc[9]  = crc[8];
    newcrc[10] = crc[9];
    newcrc[11] = crc[10];
    newcrc[12] = crc[11] ^ kr12;
    newcrc[13] = crc[12];
    newcrc[14] = crc[13];
    newcrc[15] = crc[14] ^ kr15;
    
// set results
    r[31:16] = newcrc;
    r[15]    = k;
    r[14]    = j;
    if t then                   // toggled bit?
        r[13:11] = 3b'000       // reset stuff counter
    else    
        r[13:11] = stuffcnt;
    end    
    r[10:8]  = bitcnt;
    if unstuff then
        r[7:0] = data;
    else
        r[7:1] = data[6:0];
        r[0]   = t;             // add new data bit
    end        

    if WZ then
        if !(unstuff) && (bitcnt == 3b'000) then
            z = 1b'1;           // byte ready
        else
            z = 1b'0;           // byte not ready
        end    
    end

    if WC then           
        c = k ^ j;              // c = SE0/SE1
    end           
  end
  endfunction
endmodule

jmg · 2014-03-11 13:40

Cluso99 wrote: »

You will note in my previous post that the single instruction RECVUSB checks the pin pairs, calculates/accumulates the CRC (5 or 16usb or 16ccitt), unstuffs bits, and accumulates this bit to the byte. Everything is held in the one D long/register. In particular, the CRC is in the upper word, and the lowest byte contains the data byte.

There may be some issues with starting this (eg first clock has unknown last-pin state).

If you go to a single opcode Verilog scheme, that opens up more options
( note my variant includes a Bit counter )
addit: I see your #71 includes Bit Counter too, so that is pretty much ready to timer-trigger.

The Classic opcode usage is called once per bit
or, instead of PC/execute trigger, consider that same Verilog is now Counter triggered.

The launch can morph slightly, to a WAITUSB style that is asked for 8 bits or exits on SE0

This is almost the same verilog, but now it is SW executed once per byte, slashing the code-overhead.
The once-per-bit internal engine is timer triggered, and exits the WAITUSB style SW interface on 8 bits OR a Trapevent (SE0, Error)

It is also then fairly simple to do edge-snap on the Timer reloads, which allows longer packets with normal tolerances.
There would be two opcodes: one to Prime on Write and Read when done, and the WAITUSB form

Init Write values are Counter divide, preset of CRC and presets/clears of counters
Init read gives CRC result, Data Bits, and optional other info

jmg · 2014-03-11 13:49

Cluso99 wrote: »

Here is what I have come up with for possible Verilog code - none of it is tested - I am not a Verilog coder.
jmg: I will take a look at what you have done but my understanding of Verilog is quite poor. Would you mind looking at this please?

I'd suggest you grab a copy of Lattice ISPLever, CPLD version.

This is faster at compiling and fitting than the newer Diamond, and with CPLD equation output, is a little easier to read, and check what you coded in EQN form.

eg it will use .CE (clock enable) .D .C on flipflops in the EQN out which are easy to scan.

The USB opcode engine will 'fit' in a modest CPLD.

Compile/FIT is ~ 15seconds, for my code example, and LC4256ZE-B-EVN is a possible hardware test platform.

rogloh · 2014-03-11 15:57

jmg wrote: »

There may be some issues with starting this (eg first clock has unknown last-pin state).

The first time you call this the initial state in the D register passed should be known and set up as the idle state of the bus. This is fixed for low speed/full speed. We can keep the last data around from the previous call as well. No need to trash it.

The launch can morph slightly, to a WAITUSB style that is asked for 8 bits or exits on SE0

I quite like the sound of that idea. It would however probably prevent or impact a multi-tasking based implementation like I proposed in my first post due to variable execution timing.

It is also then fairly simple to do edge-snap on the Timer reloads, which allows longer packets with normal tolerances.

I'm intrigued by this idea to remain in sync for long packets. How would you do this in practice?

There would be two opcodes, one to Prime on Write and Read when done,and the WAITUSB form

Init Write values are Counter divide, preset of CRC and presets/clears of counters
Init read gives CRC result, Data Bits, and optional other info

One thing we need to bear in mind is that we don't want the final SE0 bit condition received at the end of the packet to have any impact on the last CRC operation. The CRC needs to be preserved intact so we can validate it.

jmg · 2014-03-11 16:36

rogloh wrote: »

One thing we need to bear in mind is that we don't want the final SE0 bit condition received at the end of the packet to have any impact on the last CRC operation. The CRC needs to be preserved intact so we can validate it.

Good point, CRC and Data Rx need to advance only on valid Bits, ie Stuff OR P=M => Skip change of CRC.Data.
Code in #71 does not qualify CRC advance on Skip.

jmg · 2014-03-11 16:46

rogloh wrote: »

jmg wrote:

The launch can morph slightly, to a WAITUSB style that is asked for 8 bits or exits on SE0

I quite like the sound of that idea. It would however probably prevent or impact a multi-tasking based implementation like I proposed in my first post due to variable execution timing.

hehe, some people want everything !!
The first SW-loops suggested were so starved of cycles, there was no multi-tasking option, it was needing special care to even run at 80MHz.

If the USB read becomes byte based with WAITUSB, then the next logical step is to allow buffering on Byte-read, so it behaves very like a conventional UART - ie you have a whole byte time of elbow room.

With no buffering, the get-byte-and restart WAITUSB will have tight constraints.

If we target a 48MHz min CLK, each byte arrives every 32/36/40 sysclks, which might allow some careful multi-tasking with UART style buffering.

I'm sure a 50% usage, 48MHz USB support would have wide appeal

Note you would be limited to one USB per COG, and also on present P2, no Full task swaps or any libraries using locks in the same COG.

If the suggested P2 enhancement (a HW queue for shared resources) were done, then libraries and code would need to use LOCK far less often, and a USB thread could run at 50% clocks, with almost anything happening in the other threads.

rogloh · 2014-03-11 16:55

jmg wrote: »

hehe, some people want everything !!

I try my best.

With no buffering, the get-byte-and restart WAITUSB will have tight constraints.

Yeah the tight timing constraints make it tricky and could add limitations as to the USB implementation. With a fully buffered byte process we have oodles more time to process the data without upsetting the bit capture process and the driver code is a lot easier to write and understand.

Update: Actually "oodles" probably only means about ~14 bits worth of time (from memory) before we need to respond to the USB host with an ACK/NAK/STALL or the data requested or risk hitting a timeout. A byte buffer will eat into this time so we need to not buffer more than one byte. But that still gives lots more instructions breathing room to get the job done. 8 USB bits at say 96/48MHz is 64/32 P2 instruction cycles for example. You can do a lot in that time.
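
The clock budgets quoted in the last few posts follow from simple arithmetic, sketched here in Python. It assumes full speed USB at 12 Mbit/s and a system clock that is an exact multiple of the bit rate; clocks_per_byte is an illustrative helper name.

```python
FS_BIT_RATE = 12_000_000     # full speed USB, bits per second


def clocks_per_byte(sysclk_hz, stuffed_bits=0, bit_rate=FS_BIT_RATE):
    """System clocks spanned by one data byte plus any stuffed bits."""
    clocks_per_bit = sysclk_hz // bit_rate
    return (8 + stuffed_bits) * clocks_per_bit


for mhz in (48, 96):
    budget = [clocks_per_byte(mhz * 1_000_000, n) for n in (0, 1, 2)]
    print(mhz, "MHz:", budget)
```

At 48MHz this gives the 32/36/40 clocks jmg mentions for 8/9/10 physical bits, and 64 clocks for an unstuffed byte at 96MHz.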

Cluso99 · 2014-03-11 17:02

Before I get into the answers, my last code post has a problem that CRC needs to remain the same when unstuffing a bit. I will fix that shortly.
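
As a cross-check on the intended behaviour, here is a Python behavioural model of one bit-time of the receive instruction, with the CRC, data byte and bit counter all held when a stuffed bit or SE0/SE1 is seen. It is a sketch, not the Verilog: it assumes the usual USB conventions (a transition on the line decodes as 0, bytes arrive LSB first, CRC16 only), which is not exactly how the draft above handles the toggle bit, and every name in it is illustrative.

```python
from dataclasses import dataclass


@dataclass
class RxState:
    crc: int = 0xFFFF     # CRC16 preset
    k_prev: int = 1       # last K level, preset to the idle level
    ones: int = 0         # consecutive 1 bits seen (the stuff counter)
    bitcnt: int = 0       # data bits in the current byte, modulo 8
    data: int = 0         # byte accumulator, filled LSB first


def crc16_bit(crc, bit):
    feedback = bit ^ (crc >> 15)
    crc = (crc << 1) & 0xFFFF
    return crc ^ 0x8005 if feedback else crc


def recv_step(st, k, j):
    """One bit time. Returns (byte_ready, se), loosely like Z and C."""
    if k == j:                          # SE0 or SE1: hold all state
        return False, True
    bit = 0 if k != st.k_prev else 1    # NRZI: a transition encodes 0
    st.k_prev = k
    if bit == 0 and st.ones == 6:       # stuffed bit: drop it, CRC untouched
        st.ones = 0
        return False, False
    st.ones = st.ones + 1 if bit else 0
    st.crc = crc16_bit(st.crc, bit)
    st.data = (st.data >> 1) | (bit << 7)
    st.bitcnt = (st.bitcnt + 1) & 7
    return st.bitcnt == 0, False


def nrzi_encode(data_bytes, k=1):
    """Test helper: bit-stuff and NRZI-encode bytes into (k, j) samples + EOP."""
    out, ones = [], 0

    def emit(bit):
        nonlocal k, ones
        if bit == 0:
            k ^= 1
        ones = ones + 1 if bit else 0
        out.append((k, k ^ 1))

    for byte in data_bytes:
        for i in range(8):
            emit((byte >> i) & 1)
            if ones == 6:
                emit(0)                 # stuffed zero after six ones
    return out + [(0, 0), (0, 0)]       # two bit times of SE0


def receive(line):
    st, got = RxState(), []
    for k, j in line:
        ready, se = recv_step(st, k, j)
        if se:
            break
        if ready:
            got.append(st.data)
    return got, st.crc


print(receive(nrzi_encode([0xFF, 0x7E, 0xA5])))
```

Because the stuffed bit and the SE0 return before the CRC update, the CRC left in the state at EOP covers only real data bits, which is the property rogloh and jmg asked for.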

jmg wrote: »

There may be some issues with starting this (eg first clock has unknown last-pin state).

It will be required that you first do a MOV D,setup
You will need to reset the bit and stuff counters, setup the J & K bits, preferably clear the data byte, and preset the CRC16 to $FFFF.
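
A sketch of how that setup long could be built, in Python for illustration. It assumes the field layout of the later drafts (D[31:16] CRC, D[15] K, D[14] J, D[13:11] stuff counter, D[10:8] bit counter, D[7:0] data); make_setup and unpack are hypothetical helper names.

```python
def make_setup(k_idle, j_idle, crc=0xFFFF):
    """Initial D value: CRC preset, idle pin levels, counters and data cleared."""
    return ((crc & 0xFFFF) << 16) | ((k_idle & 1) << 15) | ((j_idle & 1) << 14)


def unpack(d):
    """Split a D value back into its fields (assumed layout, see above)."""
    return {
        "crc":      (d >> 16) & 0xFFFF,
        "k":        (d >> 15) & 1,
        "j":        (d >> 14) & 1,
        "stuffcnt": (d >> 11) & 0b111,
        "bitcnt":   (d >> 8) & 0b111,
        "data":     d & 0xFF,
    }


setup = make_setup(k_idle=1, j_idle=0)
print(hex(setup), unpack(setup))
```

In PASM terms the result would simply be a constant long loaded with MOV D,setup before the first bit is sampled.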

If you go to a single opcode Verilog scheme, that opens up more options
( note my variant includes a Bit counter )
addit: I see your #71 includes Bit Counter too, so that is pretty much ready to timer-trigger.

I want to be able to use this instruction for writing/outputting USB too. Currently I think it will work if the previous instruction outputs the bits on J & K pins, then call this instruction which will compile the CRC16/5 for you. The sw will need to do the bitstuffing.

The Classic opcode usage is called once per bit
or, instead of PC/execute trigger, consider that same Verilog is now Counter triggered.

The launch can morph slightly, to a WAITUSB style that is asked for 8 bits or exits on SE0

This is almost the same verilog, but now it is SW executed once per byte, slashing the code-overhead.
The once-per-bit internal engine is timer triggered, and exits the WAITUSB style SW interface on 8 bits OR a Trapevent (SE0, Error)

This only becomes useful if this was a task and then PASSCNT has to be used. But then we also need a TX version too.
I would rather control this in sw, especially at this point in time. I am trying to cover the CRC16-CCITT plus both CRC5 & CRC16 for the USB, for both RX & TX cases.
IIRC there is no unstuffing in CRC16-CCITT protocols as they use SYN & DLE.

It is also then fairly simple to do edge-snap on the Timer reloads, which allows longer packets with normal tolerances.
There would be two opcodes: one to Prime on Write and Read when done, and the WAITUSB form

Init Write values are Counter divide, preset of CRC and presets/clears of counters
Init read gives CRC result, Data Bits, and optional other info

We do have to be mindful of creating general cases so we can use them elsewhere. Currently this will do NRZI comms, and the special stuff/unstuff. We also have to be mindful of instruction availability and silicon.
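
For reference, here is a small software model of the NRZI decode plus unstuff step. It assumes the USB convention (no transition = 1, transition = 0, and a stuffed 0 after every run of six 1s) and that the line has already been sampled once per bit time; the function name and the single-ended 1/0 levels (J=1, K=0) are assumptions for the sketch only.

```python
def nrzi_decode_unstuff(levels, idle_level=1):
    """Decode one line level per bit time into data bits, dropping stuffed bits."""
    bits = []
    prev = idle_level
    ones = 0
    for level in levels:
        bit = 1 if level == prev else 0   # no change = 1, change = 0
        prev = level
        if ones == 6:
            # This position must hold a stuffed 0; a 1 here is a stuff error.
            if bit == 1:
                raise ValueError("bit-stuff error: seven consecutive 1s")
            ones = 0
            continue                      # drop the stuffed bit
        bits.append(bit)
        ones = ones + 1 if bit else 0
    return bits

# SYNC pattern KJKJKJKK seen from an idle J line decodes to 00000001.
print(nrzi_decode_unstuff([0, 1, 0, 1, 0, 1, 0, 0]))
```

A hardware version would do the same thing one bit per call, with `prev` and `ones` living in the state word rather than in local variables.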

Cluso99 · 2014-03-11 17:10

rogloh wrote: »

The first time you call this, the initial state in the D register passed should be known and set up as the idle state of the bus. This is fixed for low speed/full speed. We can keep the last data around from the previous call as well. No need to trash it.

Agreed.

I quite like the sound of that idea. It would however probably prevent or impact a multi-tasking based implementation like I proposed in my first post due to variable execution timing.

I wanted to KISS otherwise we may get caught with a bug we cannot get over.

I'm intrigued by this idea to remain in sync for long packets. How would you do this in practice?

You wait on a bit change, then step half a bit and sample. QED.

One thing we need to bear in mind that we don't want the final SE0 bit condition received at the end of the packet to have any impact on the last CRC operation. The CRC needs to be preserved intact so we can validate it.

Yes, thought of this overnight too.

jmg · 2014-03-11 17:12

rogloh wrote: »

It is also then fairly simple to do edge-snap on the Timer reloads, which allows longer packets with normal tolerances.

I'm intrigued by this idea to remain in sync for long packets. How would you do this in practice?

If we assume realistic and useful targets of 48MHz CLK to timers and 48MHz sliced 50% to CPU, then each bit is 4 clocks.

Any edge forces the counter to (say) 00 and then it clocks 012301230123 when no edges are present.
Data is sampled when the counter is at 50% ==2, and timer values of 1 and 3 here, are margin.
Timing skew will either shorten the 3 value, or extend the 0, and so it will jitter about the correct clock speed.
At higher clock speeds, the granularity improves.
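
A quick software model of that counter may help. It is only a sketch under the assumptions above (4 clocks per bit, counter forced to 0 on any edge, sample taken when the counter reads 2); the function name and the list-of-samples input are made up for the example.

```python
def edge_snap_sample(samples, clocks_per_bit=4):
    """Return (index, level) pairs taken at mid-bit by an edge-resynced counter."""
    sample_at = clocks_per_bit // 2
    count = 0
    prev = samples[0]
    taken = []
    for i, level in enumerate(samples):
        if level != prev:
            count = 0                          # any edge snaps the counter to 0
        elif i > 0:
            count = (count + 1) % clocks_per_bit
        if count == sample_at:
            taken.append((i, level))           # mid-bit sample point
        prev = level
    return taken

# Third bit is one clock short (skew); the next edge pulls the counter back in step.
stream = [1] * 4 + [0] * 4 + [1] * 3 + [0] * 5 + [0] * 4
print([level for _, level in edge_snap_sample(stream)])   # [1, 0, 1, 0, 0]
```

In this run the shortened bit still gets sampled at count 2, and the following edge restarts the count, which is the jitter-about-the-correct-rate behaviour described above.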

If you stick with even divides, the possible clocks are 48MHz, 72MHz, 96MHz, 120MHz etc;
if you allow uneven sides (which should be ok), then 60MHz, 84MHz, 108MHz are also possible.

72MHz is inside present FPGA builds and 84MHz / 96MHz (+?) may be possible on Cyclone V builds.

At 72MHz, with a 50% slice and UART style buffering, there are 24/27/30 thread clocks per byte when streaming.
Will that be enough to meet packet specs ?
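
The arithmetic behind those figures can be checked with a few lines. This sketch assumes the 12 Mbit/s full-speed bit rate, and it reads the 24/27/30 figures as a byte taking 8, 9 or 10 wire bits depending on how many stuffed bits it contains; that reading, and the helper names, are assumptions.

```python
FULL_SPEED_BPS = 12_000_000  # USB full-speed bit rate

def clocks_per_bit(clk_hz, bit_rate=FULL_SPEED_BPS):
    """Clocks per USB bit; rejects clocks that are not a whole multiple of the bit rate."""
    if clk_hz % bit_rate:
        raise ValueError("clock is not an integer multiple of the bit rate")
    return clk_hz // bit_rate

def thread_clocks_per_byte(clk_hz, slice_fraction=0.5, wire_bits=8):
    """Clocks one task gets while `wire_bits` bits arrive (8 plus any stuffed bits)."""
    return int(clocks_per_bit(clk_hz) * wire_bits * slice_fraction)

for mhz in (48, 60, 72, 84, 96, 108, 120):
    n = clocks_per_bit(mhz * 1_000_000)
    print(mhz, "MHz:", n, "clocks/bit", "(even)" if n % 2 == 0 else "(odd)")

print([thread_clocks_per_byte(72_000_000, wire_bits=b) for b in (8, 9, 10)])  # [24, 27, 30]
```

With `slice_fraction=1.0` the same helper gives 64 and 32 clocks per byte at 96MHz and 48MHz.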

rogloh · 2014-03-11 17:13

Cluso99 wrote: »

I want to be able to use this instruction for writing/outputting USB too. Currently I think it will work if the previous instruction outputs the bits on J & K pins, then call this instruction which will compile the CRC16/5 for you. The sw will need to do the bitstuffing.

@Cluso99,
When transmitting, I believe the CRC16 position in the frame needs to be padded with 16 bits of zeroes at the end, and the CRC process includes the zeroes in its computation, then outputs this CRC data instead of the 16 zeroes. A streaming bit process for doing CRC on the fly with the wire transitions would probably not do this for you.

Cluso99 · 2014-03-11 17:16

jmg: If I were to try and compile the Verilog I would get lost for some considerable time. It's better for me to think the thing thru and let others fix the Verilog syntax so it works properly.

Perhaps you might like to do it? I am sure you could check it out simply with your BeMicro??? FPGA, and since you would be using Quartus it would help Chip too.

rogloh · 2014-03-11 17:21

Cluso99 wrote: »

You wait on a bit change, then step half a bit and sample. QED.

Um this question was asked with respect to jmg's timer snap idea. I was more interested in the hardware details for changing the timer. I know we can do that 1/2 bit approach in software, but only effectively at the start of the packet during the sync period - it would be difficult to do it every bit in software with RXUSB running at the same time.

Cluso99 · 2014-03-11 17:23

rogloh wrote: »

@Cluso99,
When transmitting, I believe the CRC16 position in the frame needs to be padded with 16 bits of zeroes at the end, and the CRC process includes the zeroes in its computation, then outputs this CRC data instead of the 16 zeroes. A streaming bit process for doing CRC on the fly with the wire transitions would probably not do this for you.

As soon as you send a bit by XOR OUTA,pinmask you do a USBBIT D,pins. Then, when you send the last data byte's last data bit, you SHR D,#16 and now we have the CRC in the lower 16 bits ready to shift out. Not sure which end we need to send from so maybe we don't need to do a SHR anyway. We have a bit time to get this ready, so I think it's doable.
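
A bit-serial software model of this idea is sketched below. It assumes the same feedback form as the Verilog posted later in the thread (the input bit is XORed with the top CRC bit, taps at 0, 2 and 15), a preset of $FFFF, and the USB rule that the CRC is inverted and sent high-order bit first. With this form, as far as I can tell, no zero bits have to be clocked through before the CRC is read out. Function names are made up, and bit stuffing of the outgoing stream is left out.

```python
POLY_CRC16_USB = 0x8005  # x^16 + x^15 + x^2 + 1 (the x^16 term is implicit)
RESIDUAL_CRC16 = 0x800D  # register value after a good packet, per the USB spec

def crc16_step(reg, bit):
    """Advance the 16-bit CRC register by one bit (input XORed with the top bit)."""
    feedback = bit ^ (reg >> 15)
    reg = (reg << 1) & 0xFFFF
    return reg ^ POLY_CRC16_USB if feedback else reg

def bits_lsb_first(data):
    """Yield the bits of each byte, least significant bit first, as sent on the wire."""
    for byte in data:
        for i in range(8):
            yield (byte >> i) & 1

def tx_bits_with_crc(data):
    """Return payload bits followed by the inverted CRC, high-order CRC bit first."""
    reg = 0xFFFF
    out = []
    for bit in bits_lsb_first(data):
        reg = crc16_step(reg, bit)   # CRC is updated as each bit goes out
        out.append(bit)
    crc = reg ^ 0xFFFF
    out.extend((crc >> i) & 1 for i in range(15, -1, -1))
    return out

def rx_register(bits):
    """Run every received bit (payload and CRC) through the same register."""
    reg = 0xFFFF
    for bit in bits:
        reg = crc16_step(reg, bit)
    return reg

print(hex(rx_register(tx_bits_with_crc(b"\x00\x01\x02\x03"))))   # 0x800d
```

The receiver does not need to know where the payload ends: it keeps clocking bits through the register and compares against the residual once EOP arrives.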

rogloh · 2014-03-11 17:26

Yeah, but the real problem is you need these extra 16 zero bits to do the computation, but the zeroes don't go out on the wire. What are you sending during this time?

Cluso99 · 2014-03-11 17:27

rogloh wrote: »

Um this question was asked with respect to jmg's timer snap idea. I was more interested in the hardware details for changing the timer. I know we can do that 1/2 bit approach in software, but only effectively at the start of the packet during the sync period - it would be difficult to do it every bit in software with RXUSB running at the same time.

You definitely do not need to do it on every bit so you do it when necessary. IIRC it's also easy to limit the block size in USB, so calc the xtal accuracy and ensure you have the timing set correctly and synchronised at the start, and all should be good to go.

rogloh · 2014-03-11 17:36

Yeah, that has been mentioned earlier about keeping packet sizes down. One worry I had was that in a hub environment you could see long packets to other devices on the bus. There is a risk with long packets (even if you are ignoring the data as it is not for your address/endpoint) that you might drift and start sampling on transitions, which could be falsely interpreted as an SE0 EOP if there is slight skew between the differential signal transitions; then you start hunting for syncs again and could resync on random bus data. CRCs will probably save us, however, and things should eventually recover again. That problem can also be dealt with in the software only approach by additional sampling in the middle of the bit and ensuring multiple EOPs get detected.

jmg · 2014-03-11 17:36

Cluso99 wrote: »

This only becomes useful if this was a task and then PASSCNT has to be used.

I'm not following, byte handling allows much lower clocks, even in one task.
I think lowish (FPGA region) clocks and threads should be a practical goal.

But then we also need a TX version too.

Ideally, but TX is less 'drop dead' as it can take some time to assemble/organize things I think.

I would rather control this in sw, especially at this point in time.

Of course, I think coding a "verilog clone" in SW for 1.5 MHz USB testing should be possible.
If that also allows timer-paced sampling, it is a small step to use counters and a per-byte jump.

The shift to timer-paced operation uses almost identical Verilog, and a data buffer for read is small.
It may also avoid this somewhat complex opcode, pushing down fMAX if it works on register-space.
(timer paced code decouples things a little from register critical paths)

I think maybe the CRC does not need a buffered read, as it is checked on EOP ?

If there are spare virtual Pins, the USB RxRDY flags could hook into some of those ?

Chip would likely need to modify the counters slightly to allow /N reloadable counting, and edge resync.
I'm not sure if those modes are already in the Counters.

Cluso99 · 2014-03-11 18:48

rogloh wrote: »

Yeah, that has been mentioned earlier about keeping packet sizes down. One worry I had was that in a hub environment you could see long packets to other devices on the bus. There is a risk with long packets (even if you are ignoring the data as it is not for your address/endpoint) that you might drift and start sampling on transitions, which could be falsely interpreted as an SE0 EOP if there is slight skew between the differential signal transitions; then you start hunting for syncs again and could resync on random bus data. CRCs will probably save us, however, and things should eventually recover again. That problem can also be dealt with in the software only approach by additional sampling in the middle of the bit and ensuring multiple EOPs get detected.

I think long packets to other devices are not a problem. You just wait for 2 new SE0's (or SE1's) in 2 successive instructions. The sync up is quite easy. That is being done now without CRCs.

jmg · 2014-03-11 18:50

Cluso99 wrote: »

jmg: If I were to try and compile the Verilog I would get lost for some considerable time. It's better for me to think the thing thru and let others fix the Verilog syntax so it works properly.

The problem with this is that if the Verilog needs a lot of changes (as this does), it quickly becomes too clumsy to have someone else applying fix-ups. Also, in the form you code, checking is harder as it is not so self contained.

As always, it is better to code in small pieces, get 'working' equations, and look at the .eq0 & .rpt files to confirm you have counters / clock enables / MUXes as expected, and no logic blow-outs.

Below is the code, edited/modified so Lattice Verilog at least compiles it (with some warnings).

////////////////////////////////////////////////////////////////////////////////
// Acknowledgements: Verilog code for CRC's   http://www.easics.com
// RR20140310 start
// RR20140311,12 continued
////////////////////////////////////////////////////////////////////////////////
// polynomial: CRC5usb=(0 2 5), CRC16usb=(0 2 15 16), CRC16ccitt=(0 5 12 16) 
// data width: 1, LSB first
//
// inputs:  D, S, PINS
// outputs: D, Z, C
module          RxUSB
(
input   CLK,                 // 
input   Load_d,              // 
input   jI,              // 
input   kI,              // 
input   WZ,              // 
input   WC,              // 
input   [31:0]  s,              // S operand
input   [31:0]  d,              // D operand
input   [127:0] p,              // input pins
output reg   [31:0]  r,              // D result
output reg        zz,              // Z flag
output reg        cy,               // Carry flag    

output reg     SkipStuff,   // move so can see in EQNs better
output reg     InvalidPM

);

reg     [15:0]  crc;            // original CRC (accumulated)
reg     [2:0]   bitcnt;         // data bit counter 3 bits
reg             k;              // K new pin value
reg             j;              // J new pin value
reg     [2:0]   stuffcnt;       // stuff counter 3 bits
reg     [7:0]   data;           // data byte (accumulated)
reg     [8:7]   poly;           // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
reg     [6:0]   pinno;          // pin pair numbers 0-127
reg     [15:0]  newcrc;         // new crc
reg             t;              // 1 if k toggles (ie 1 bit)
reg             kP;              // K old pin value
reg             jP;              // J old pin value
//reg     r,z,c;              // D result


reg     crc05usb;
reg     crc16usb;
reg     crc16itt;
reg     crc16ndef;

///////////////////////////////////////////////////////////////////////////////

// 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
    always @(poly)  begin   
        crc16usb  = (poly == 2'b00);    // CRC16usb  =(0 2      15 16)
        crc05usb  = (poly == 2'b01);    // CRC5usb   =(0 2 5)
        crc16itt  = (poly == 2'b10);    // CRC16ccitt=(0   5 12    16)
        crc16ndef = (poly == 2'b11);    // undefined - alias to one above 
    end

// check for a "1" bit toggle
    always @(kI or jI or kP or stuffcnt)  begin   
      t = kI ^ kP;                 // new pin value ^ previous pin value; 1=toggled
      SkipStuff = (!t & (stuffcnt == 3'b110));  // !t needed for ccitt ?
      InvalidPM = (kI==jI);    // Signaling states are non-diff
    end



always  @(posedge CLK) begin
  if (Load_d) begin  // WRITE to register - Value INIT
//    crc      = d[31:16];        // original crc value (accum) moved below
    kP       = d[15];           // previous K
    jP       = d[14];           // previous J
    stuffcnt <= d[13:11];       // original stuff counter value
    bitcnt   <= d[10:8];        // original bit   counter value
    data     = d[7:0];          // original data value (accum)

    poly     = s[8:7];          // 00=crc16usb, 01=crc05usb, 10=crc16ccitt, 11=undefined
//?   kpin     = value(s[6:0]);   // K pin no.
//?   jpin     = value(s[6:0]) ^1 // J pin no.
    k        = kI;      // new pin value
    j        = jI;      // new pin value
  end // Load_d
  else begin // !Load_d  = normal RUN , compiler wants in one block..
    k        = kI;      // new pin value
    j        = jI;      // new pin value
    kP       = k;       // previous K
    jP       = j;       // previous J

// check for bit unstuff
    if (SkipStuff) begin
        // unstuff
        stuffcnt <= 3'b000;
//      bitcnt   <= bitcnt;   //  implicit, but makes hold action clear 
    end
    else if (!InvalidPM) begin
        // inc bit count & accum data bit into byte
        bitcnt <= bitcnt + 3'b001;        // plain Verilog, no ++ operator
        if (t) 
           stuffcnt <= 3'b000; // reset if input bit toggles
        else
           stuffcnt <= stuffcnt + 3'b001;
    end  
  end // !Load_d 
end // (posedge CLK)     


reg     kr0;
reg     kr2;
reg     kr5;
reg     kr12;
reg     kr15;
reg     HoldCRC;

always @(*) begin 
// calculate the new crc... - decoded values, so no overlaps in if
    if (crc05usb) begin
        kr0  = t ^ crc[4];
        kr2  = t ^ crc[4];
        kr5  = 1'b0;
        kr12 = 1'b0;
        kr15 = 1'b0; 
    end
    if (crc16usb) begin
        kr0  = t ^ crc[15];
        kr2  = t ^ crc[15];
        kr5  = 1'b0;
        kr12 = 1'b0;
        kr15 = t ^ crc[15]; 
    end
    if (crc16itt) begin
        kr0  = t ^ crc[15];
        kr2  = 1'b0;
        kr5  = t ^ crc[15];
        kr12 = t ^ crc[15];
        kr15 = 1'b0; 
    end    
    if (crc16ndef) begin  // alias crc16itt, so cover ALL decodes.
        kr0  = t ^ crc[15];
        kr2  = 1'b0;
        kr5  = t ^ crc[15];
        kr12 = t ^ crc[15];
        kr15 = 1'b0; 
    end  
    HoldCRC = InvalidPM | SkipStuff;
end // always @(*)

always  @(posedge CLK) begin
  if (Load_d) begin  // WRITE to register - Value INIT
    crc     <= d[31:16];        // original crc value (accum)
  end 
  else if (!HoldCRC) begin     // advance CRC only on a valid, non-stuffed bit
    crc[0]  <= kr0;              //16
    crc[1]  <= crc[0];           //17
    crc[2]  <= crc[1] ^ kr2;     //18
    crc[3]  <= crc[2];           //19
    crc[4]  <= crc[3];           //20
    crc[5]  <= crc[4] ^ kr5;     //21
    crc[6]  <= crc[5];           //22
    crc[7]  <= crc[6];           //23
    crc[8]  <= crc[7];           //24
    crc[9]  <= crc[8];           //25
    crc[10] <= crc[9];           //26
    crc[11] <= crc[10];          //27 - bad eqns??, needed <= 
    crc[12] <= crc[11] ^ kr12;   //28
    crc[13] <= crc[12];          //29
    crc[14] <= crc[13];          //30
    crc[15] <= crc[14] ^ kr15;   //31
  end // valid 
end // (posedge CLK)     
    
// set results
always @(*) begin 
    r[31:16] = crc;
    r[15]    = k;
    r[14]    = j;
end // always @(*)

always  @(*) begin   // non register here ? - this is a bit mangled, data needs fixing 
    if (t)   begin                // toggled bit?
        r[13:11] = 3'b000;       // reset stuff counter
    end
    else begin   
        r[13:11] = stuffcnt;
    end    
    r[10:8]  = bitcnt;
    if (SkipStuff) begin
        r[7:0] = data;
    end
    else begin
        r[7:1] = data[6:0];
        r[0]   = t;             // add new data bit
    end        
end // @(*)     

always  @(posedge CLK) begin
    if (WZ) begin
        if (  !SkipStuff & (bitcnt == 3'b000)) begin
            zz <= 1'b1;           // byte ready
        end
        else begin
            zz <= 1'b0;           // byte not ready
        end    
    end

    if (WC) begin          
        cy <= ~(k ^ j);           // c = 1 on SE0/SE1 (both lines equal)
    end           
end // (posedge CLK)   


endmodule

Updated code, better CRC eqns

jmg · 2014-03-11 19:00

Cluso99 wrote: »

I think long packets to other devices are not a problem. You just wait for 2 new SE0's (or SE1's) in 2 successive instructions. The sync up is quite easy. That is being done now without CRCs.

That's starting to sound like a lot of crossed fingers...?
Chip may already have edge reset modes in the counters, and I think the SW WAIT can then work, with a Counter.

To test at 1.5MHz with a simple Reload timer, the FPGA needs to clock at either 78MHz or 81MHz, with reload values of 52 or 54, and use SW wait values of 50% of those for mid-bit sampling.
