P2 and full speed USB slave requirements/ideas

Cluso99 · 2014-03-12 17:13

jmg wrote: »

I think Chip meant the earlier, simpler code to allocate pins and manage SE0 and T into the flags?

The code in #107 is not quite 'mission-ready', and Pin mapping and the couple of FF's & XORs to do SE0_SE1 and T should be common to any extended code.

<= assign is verilog that ensures you do get a clocked result. ( ie usually a D-FF )
= within a clocked block sometimes gives a clocked result, but not always. Best to be careful.
(Another reason I suggested you run something like Lattice ispLEVER.)

There is now no point in producing code for the previous GETXP examples.
I am certain Chip understands what I am trying to do regarding extracting the correct pin pair. If not, a fixed P0 & P1 would work for now.
I think the code is close enough for testing - it is the time it will take Chip to do this with appropriate fixes and fitting into his Verilog regime.

As for testing, initially I propose to just output on those pin pairs various conditions, and each time calling the new RxUSB instruction (whatever we call it - and we don't need pnut support as I can code it as a long). This way, I can control what is output and therefore test what the instruction receives. So I can verify the instruction is working as designed in a controlled environment.

Once I have that running, I can snoop a real FS USB and ensure I can read the tokens and packets, and verify the crc5 and crc16usb, the SE0 at the EOP and of course the initial sync sequence.

jmg · 2014-03-12 17:32

Cluso99 wrote: »

As for testing, initially I propose to just output on those pin pairs various conditions, and each time calling the new RxUSB instruction (whatever we call it - and we don't need pnut support as I can code it as a long). This way, I can control what is output and therefore test what the instruction receives. So I can verify the instruction is working as designed in a controlled environment.

That testing approach sounds like a good idea, Chip then just needs to make as much SW-readable as is practical. ie 32b each way.
It probably does not need to mesh into the register-array, just as long as it can R/W in SW. (ie like the counter setups)

rogloh · 2014-03-12 17:44

I've been looking at Cluso99's proposed RXUSB instruction.

I think for the final P2 (not the FPGA) there is scope to be able to use this simply with a byte processing loop in another COG task if the P2 clock is a multiple of 12MHz and >=96MHz.

This could be the bitloop.

LOOP:           SYNCTRA
                RXUSB   data, setup WC WZ
if_nz_and_nc    JMPD    #LOOP
if_c            JMP     #SE_ERROR
if_z            MOV     INDA++, data
if_z            ADD     fifo_counter, #1

EDIT: Sorry, I accidentally hit tab + space while typing, which clicked submit, and this posted too soon. I'm still formulating and thinking about this idea. I want to come up with a 6 (or 7?) instruction bit loop and use SYNCTRA, which will wait until the right time to sample without stopping the other hardware task from running. I think I will need to subtract from PHSA somewhere as well, making this 7 instructions. We would still do the 1:8 task allocation to give the byte processor task its time for the packet.

jmg · 2014-03-12 18:02

Thinking about the 4 phase Digital PLL equivalent I posted in #118.

This can help snoop on a 1.5MHz USB, and so open that testing domain, when this gets to real data flows.

Taking a 80MHz FPGA clock, we can get to 1.5MHz on average, with modest jitter.
80/1.5 = 53.3333333333333333
2^32/(80/1.5) = 80530636.8 round(2^32/(80/1.5)) = 80530637
2^32/(round(2^32/(80/1.5))) = 53.3333332008785675
80M/(2^32/(round(2^32/(80/1.5)))) = 1500000.0037252903
1/(ans-1.5M) = 268.435456s of numeric beat error.
Will add 80530637 every clock (1.5/80)
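
The arithmetic above can be reproduced in a few lines of Python. This is only a sketch: the function name `nco_tuning_word` is made up for illustration, and it assumes a 32-bit phase accumulator that adds FRQ once per system clock.

```python
from fractions import Fraction

ACC_BITS = 32  # assumed width of the PHSx phase accumulator


def nco_tuning_word(sysclk_hz, bit_rate_hz):
    """Return (frq, actual_rate, beat_period_s) for a 32-bit NCO.

    frq is the value added to the accumulator on every system clock so
    that it wraps, on average, bit_rate_hz times per second.
    """
    ideal = Fraction(bit_rate_hz) * 2**ACC_BITS / Fraction(sysclk_hz)
    frq = round(ideal)                      # nearest integer tuning word
    actual = Fraction(sysclk_hz) * frq / 2**ACC_BITS
    error = actual - bit_rate_hz            # Hz, caused only by rounding
    beat = None if error == 0 else abs(1 / error)
    return frq, actual, beat


frq, actual, beat = nco_tuning_word(80_000_000, 1_500_000)
print(frq)            # 80530637
print(float(actual))  # about 1500000.0037
print(float(beat))    # about 268.4 s for the rounding error to add up to one bit
```

Exact fractions are used so the rounding error itself is not blurred by floating point.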

Then the upper 2 bits are the fastest /4 case, and the DPLL rule, from #118 is along the lines of

If edge occurs @ MSB = 3 -> No change (add as usual) (1.5/80)
If edge occurs @ MSB = 2 -> Need to advance 1 quadrant, or add 2^32/4 ONCE ( or 2^32/8 twice )
If edge occurs @ MSB = 0 -> Need to retard 1 quadrant, or add 2^32*3/4 ONCE ( or 2^32*3/8 twice )
Edge @ MSB = 1 should never happen during a data stream.
That case could be flagged, and it can add 2^32*2/4 ONCE ( or 2^32*2/8 twice ) to match the Verilog action.

If a COG is set for 2 threads 50%, we have 2 x 40MHz flows to manage 1.5MBd data.

Code then does

  FRQx  = Default_2e32_1p5d80

and the adjust for advance is two lines
  FRQx  = Advance_S2_2e32d8
  FRQx  = Default_2e32_1p5d80

and the adjust for retard is two lines
  FRQx  = retard_S0_2e32_3d8
  FRQx  = Default_2e32_1p5d80
At 50% slot, FRQx I think will apply twice, before being restored to default locked value.

Or there may be time for a way to read PHSx, for all edges not in quadrant 3, and calculate a (double) add to give quadrant 0 next.
just one adjust code block is then needed.

If quadrant3 is the edge value, then quadrant 1 is the sample point ( and Q0,Q2 are the guard bands)

Each thread has 26 thread cycles per bit.
PHSx.MSB can map to a pin, and be used as a sample-and scope trigger.
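
The quadrant rule above can be modelled in Python as a quick check. This is a sketch under assumptions: the correction is treated as a single extra add to the 32-bit phase on top of the normal FRQ step, and the helper names are hypothetical. All three cases reduce to "add (3 - quadrant) mod 4 quarter-turns", which puts the edge back in quadrant 3.

```python
MOD = 1 << 32          # 32-bit phase accumulator
QUADRANT = MOD // 4    # one quarter of a bit period


def quadrant(phase):
    """Top two bits of the 32-bit phase accumulator (0..3)."""
    return (phase >> 30) & 3


def edge_correction(phase):
    """Phase to add ONCE when a data edge is seen at this phase.

    Edges are expected in quadrant 3, so the correction moves the
    accumulator forward (mod 2**32) until it lands there:
      Q3 -> +0, Q2 -> +1/4, Q1 -> +2/4 (should not happen), Q0 -> +3/4.
    """
    return ((3 - quadrant(phase)) % 4) * QUADRANT


def apply_edge(phase):
    """Return (new_phase, flagged) after an edge at the given phase."""
    flagged = quadrant(phase) == 1          # unexpected case, worth reporting
    return (phase + edge_correction(phase)) % MOD, flagged


for q in range(4):
    start = q * QUADRANT + 12345            # arbitrary offset inside quadrant q
    new_phase, flagged = apply_edge(start)
    print(q, "->", quadrant(new_phase), "flagged" if flagged else "")
```

Adding 3/4 of a turn modulo 2^32 is the same as subtracting 1/4, which is why "retard" can be expressed as an add.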

jmg · 2014-03-12 18:09

rogloh wrote: »

I think for the final P2 (not the FPGA) there is scope to be able to use this simply with a byte processing loop in another COG task if the P2 clock is a multiple of 12MHz and >=96MHz.

The problem with that >=96MHz, is you cannot fully test this in the FPGA, which is close to drop-dead.

Better I think, to include 48MHz ( & 60MHz & 72MHz & maybe 84MHz ) on the Clock targets.
The Auto-pacing (DPLL) code for BYTE level handler, is in #118, and is not large.

rogloh · 2014-03-12 18:25

jmg wrote: »

The problem with that >=96MHz, is you cannot fully test this in the FPGA, which is close to drop-dead.

Better I think, to include 48MHz ( & 60MHz & 72MHz & maybe 84MHz ) on the Clock targets.
The Auto-pacing (DPLL) code for BYTE level handler, is in #118, and is not large.

Yeah I know 96MHz is just too fast for the FPGA. Assuming we were to stick to bit processing loops in software, I see a lot of merit in the 1:8 approach with a byte processing task running as well, as it simplifies the software design and decouples the timing-critical bit work from the other (slower) byte-oriented protocol processing work. The problem is that transferring the data and doing the error checking takes time, and I don't see a way to get it down much more given what Cluso has proposed. Now if we come up with more USB extensions beyond RXUSB that provide bytes for us and deal with timing, that could work out nicely as well. I don't think we are there yet but hopefully we are heading in that direction... it's worth continuing that discussion too.

jmg · 2014-03-12 18:34

rogloh wrote: »

....Now if we come up with more USB extensions beyond RXUSB that provides bytes for us and deals with timing, that could work out nicely as well. I don't think we are there yet but hopefully we are heading in that direction...

See Chip's comment in #118 - we are closer than you think - once you have a bit-counter inside the Verilog, then you just need to 'fire' the per-bit Verilog, on a DPLL timer, (#118) and buffer the DataTX.

I'm not sure if CRC needs buffering, or just preservation-care over a SE0 event.

Sapieha · 2014-03-12 18:57

Hi jmg.

I posted that code's schematic (SCH) on the Prop II blog thread -- nobody even commented on it.

jmg wrote: »

I think Chip meant the earlier, simpler code to allocate pins and manage SE0 and T into the flags?

The code in #107 is not quite 'mission-ready', and Pin mapping and the couple of FF's & XORs to do SE0_SE1 and T should be common to any extended code.

<= assign is verilog that ensures you do get a clocked result. ( ie usually a D-FF )
= within a clocked block sometimes gives a clocked result, but not always. Best to be careful.
(Another reason I suggested you run something like Lattice ispLEVER.)

jmg · 2014-03-12 19:08

Sapieha wrote: »

Hi jmg.
I posted that code's schematic (SCH) on the Prop II blog thread -- nobody even commented on it.

It was too low resolution for me to see clearly, and besides, I can see the fitter equations, which are easier to follow...
It does pay with Verilog (like with most high level languages) to check you got what you expected, and not something else, or logic-bloat.

Cluso99 · 2014-03-12 19:45

jmg wrote: »

That testing approach sounds like a good idea, Chip then just needs to make as much SW-readable as is practical. ie 32b each way.
It probably does not need to mesh into the register-array, just as long as it can R/W in SW. (ie like the counter setups)

??? The instruction uses the D register, so we can preset it, and we can read it.
Chip said there will be instruction space, so no need for a fixed D address.
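
For presetting and reading back D from a test harness, the field layout given in the RxUSB header comment in this thread (D[31..16] crc, D[15] K, D[14] J, D[13..11] unstuff counter, D[10..8] bit counter, D[7..0] data; S[8..7] poly, S[6..0] pin pair) could be packed and unpacked as below. This is a sketch in Python; the helper and constant names are made up for illustration and are not part of any Parallax tool.

```python
from collections import namedtuple

# One field per slice of the D register described in the RxUSB header comment.
RxUsbState = namedtuple("RxUsbState", "crc k j stuffcnt bitcnt data")

# S[8..7] polynomial select values, as listed in the header comment.
POLY_CRC5_USB, POLY_CRC16_USB, POLY_CRC16_CCITT = 0b00, 0b01, 0b10


def pack_d(state):
    """Build the 32-bit D value to preset before the first RxUSB call."""
    return ((state.crc & 0xFFFF) << 16
            | (state.k & 1) << 15
            | (state.j & 1) << 14
            | (state.stuffcnt & 7) << 11
            | (state.bitcnt & 7) << 8
            | (state.data & 0xFF))


def unpack_d(d):
    """Split a 32-bit D value read back after an RxUSB call."""
    return RxUsbState(crc=(d >> 16) & 0xFFFF,
                      k=(d >> 15) & 1,
                      j=(d >> 14) & 1,
                      stuffcnt=(d >> 11) & 7,
                      bitcnt=(d >> 8) & 7,
                      data=d & 0xFF)


def pack_s(poly, pin_pair):
    """S[8..7] = polynomial select, S[6..0] = pin number of the D-/D+ pair."""
    return (poly & 3) << 7 | (pin_pair & 0x7F)


start = RxUsbState(crc=0xFFFF, k=1, j=0, stuffcnt=0, bitcnt=0, data=0)
print(hex(pack_d(start)))               # 0xffff8000
print(hex(pack_s(POLY_CRC16_USB, 2)))   # 0x82
```

A harness can then compare `unpack_d()` of the value read back against the state it expects after each simulated bit.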

jmg · 2014-03-12 19:58

Cluso99 wrote: »

??? The instruction uses the D register, so we can preset it, and we can read it.
Chip said there will be instruction space, so no need for a fixed D address.

The exact implementation is up to Chip, I'm just observing that this is more like a Counter or SerDes in operation, than a memory/register.
The Counters use SETxx and GETxx opcodes , which also have a D address.

rogloh · 2014-03-12 20:43

jmg wrote: »
Thinking about the 4 phase Digital PLL equivalent I posted in #118.

This can help snoop on a 1.5MHz USB, and so open that testing domain, when this gets to real data flows.

Taking a 80MHz FPGA clock, we can get to 1.5MHz on average, with modest jitter.
80/1.5 = 53.3333333333333333
2^32/(80/1.5) = 80530636.8 round(2^32/(80/1.5)) = 80530637
2^32/(round(2^32/(80/1.5))) = 53.3333332008785675
80M/(2^32/(round(2^32/(80/1.5)))) = 1500000.0037252903
1/(ans-1.5M) = 268.435456s of numeric beat error.
Will add 80530637 every clock (1.5/80)

Then the upper 2 bits are the fastest /4 case, and the DPLL rule, from #118 is along the lines of

If edge occurs @ MSB = 3 -> No change (add as usual) (1.5/80)
If edge occurs @ MSB = 2 -> Need to advance 1 quadrant, or add 2^32/4 ONCE ( or 2^32/8 twice )
If edge occurs @ MSB = 0 -> Need to retard 1 quadrant, or add 2^32*3/4 ONCE ( or 2^32*3/8 twice )
Edge @ MSB = 1 should never happen during a data stream.
That case could be flagged, and it can add 2^32*2/4 ONCE ( or 2^32*2/8 twice ) to match the Verilog action.

If a COG is set for 2 threads 50%, we have 2 x 40MHz flows to manage 1.5MBd data.

Code then does
  FRQx  = Default_2e32_1p5d80

and the adjust for advance is two lines
  FRQx  = Advance_S2_2e32d8
  FRQx  = Default_2e32_1p5d80

and the adjust for retard is two lines
  FRQx  = retard_S0_2e32_3d8
  FRQx  = Default_2e32_1p5d80
At 50% slot, FRQx I think will apply twice, before being restored to default locked value.
Or there may be time for a way to read PHSx, for all edges not in quadrant 3, and calculate a (double) add to give quadrant 0 next.
just one adjust code block is then needed.

If quadrant3 is the edge value, then quadrant 1 is the sample point ( and Q0,Q2 are the guard bands)

Each thread has 26 thread cycles per bit.
PHSx.MSB can map to a pin, and be used as a sample-and scope trigger.

I do quite like the sound of this adaptive clocking. I need to get my head around it more. What HW changes or other further instructions would be required to support it? How much is done in HW vs software? I see we would use one of the counters, does it need modifications or do all the clocking tweaks adjusting FRQA happen in software?

I think you meant 26 thread cycles per byte, not per bit above. But that is still nice and already gives us at least 3 hub cycles per byte @80MHz in the byte processing task. At 48MHz this drops down to 16 clocks per thread or 2 hub cycles which I think should still be fairly generous.

So what crystal frequency limitations would this overall approach entail? Anything >=48MHz, or do you still need discrete 12MHz multiples above this? I imagine for receiving, if we get aggressive, we could potentially adjust timing after every byte, which probably means we don't want to slip any more than say 1/4 bit per 8 bits, right? That is ~3% tolerance. But the transmit adds its own complexity, and I expect we want to be able to transmit accurately at the right bit rate, which then means a 12MHz multiple. How does your design deal with that?
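
The cycle budgets and the ~3% figure quoted in this exchange can be checked with a short Python sketch. It assumes a task that gets a 50% share of the cog's clocks, as in the 2-thread example; the function names are hypothetical.

```python
def thread_cycles(sysclk_hz, bit_rate_hz, bits=1, thread_share=0.5):
    """Clock slots one task gets while `bits` USB bits go past."""
    return sysclk_hz / bit_rate_hz * bits * thread_share


def slip_tolerance(max_slip_bits, bits_between_resync):
    """Clock mismatch (as a fraction) that slips max_slip_bits over the interval."""
    return max_slip_bits / bits_between_resync


# Low-speed test case from the DPLL post: 1.5MHz bits, 80MHz clock.
print(thread_cycles(80_000_000, 1_500_000))            # about 26.7 per bit
# Full-speed case: 12MHz bits, counted per byte.
print(thread_cycles(80_000_000, 12_000_000, bits=8))   # about 26.7 per byte
print(thread_cycles(48_000_000, 12_000_000, bits=8))   # 16.0 per byte
# Quarter-bit slip allowed over one byte.
print(slip_tolerance(0.25, 8))                         # 0.03125, i.e. ~3%
```

The same figure of about 26 turns up twice because 80MHz over 1.5MHz per bit equals 80MHz over 12MHz per byte, which may be the source of the per-bit versus per-byte confusion.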

jmg · 2014-03-12 21:45

rogloh wrote: »

I do quite like the sound of this adaptive clocking. I need to get my head around it more. What HW changes or other further instructions would be required to support it? How much is done in HW vs software? I see we would use one of the counters, does it need modifications or do all the clocking tweaks adjusting FRQA happen in software?

The counter form above is just a skeleton of SW-emulation ideas for the Verilog code in #118, and a means to smart-sample a USB stream at 1.5MHz.
The SW-emulation is looking at ways to test/emulate Verilog ideas, using the FPGA in SW, but at modest USB speeds. Luckily there is the low-speed mode.

In #118 you can see Chip's comment suggests he may fit a custom Baud controller more easily than adding modes to a Timer.

Either way fine, it's whatever is easiest and smallest to include. Separate 8bit Baud Div, frees 32b timers for other tasks.

The Code in #118 is pretty much all you need, just a few lines of Verilog -> Silicon.

There are not many changes once the USB code block includes a BitCtr and is called once per bit; it's just a matter of whether you call it in SW, or use a DPLL as in #118.

rogloh wrote: »

I think you meant 26 thread cycles per byte, not per bit above.

No, that is per-bit, but notice that is for a test version, running at 1.5MHz low-speed USB, where things will be easier to probe and hook into.

At 1.5MHz I think FPGA-P2 can fit one DPLL + Diagnostics thread sampling and checking, and one Thread running the USB-Verilog tests.

Once that looks good, the Verilog DPLL would be added to pace the USB engine, instead of SW calls.
In this form, it's not quite as easy to probe or test, so a mixed SW version (aka Verilog emulation) gives a way to bring this up, and get higher level code working.

rogloh wrote: »

So what crystal frequency limitations would this overall approach entail? Anything >=48MHz, or do you still need discrete 12MHz multiples above this?

Yes, the Baud-DPLL assumes N x 12MHz with N >= 4, and Ok up to 200MHz, and can do low-Speed USB at >200MHz SysClk.

The Code in #118 auto-syncs, so tolerance is not so critical, but you would need a crystal or resonator for timing.
( ie RC osc's are probably off the table)
Chip's Xtal PLL is now any-integer, so that gives a few choices of how to get to 12MHz x N.

Sapieha · 2014-03-13 03:23

Hi jmg.

I think this is all the hardware needed for bit-banged send/receive.

Look at the attachment.
I think only one more signal is needed: a MODE 0/1 that inverts TXD/RXD.

I even have a version that includes NRZI in/out.

Cluso99 · 2014-03-13 03:46

Sapieha wrote: »

Hi jmg.

I think this is all the hardware needed for bit-banged send/receive.

Look at the attachment.
I think only one more signal is needed: a MODE 0/1 that inverts TXD/RXD.

I even have a version that includes NRZI in/out.

Sapieha,
Sorry, I don't understand what you are saying/showing.

Sapieha · 2014-03-13 03:53

Hi Cluso

This is the hardware between the 2 pins to read/send bit-banged USB --->
Most of it is needed even if other functions need to connect to the USB differential pins.

You can't omit this part of the hardware in any type of USB communication.

Cluso99 wrote: »

Sapieha,
Sorry, I don't understand what you are saying/showing.

Cluso99 · 2014-03-13 04:55

Sapieha wrote: »

Hi Cluso

This is the hardware between the 2 pins to read/send bit-banged USB --->
Most of it is needed even if other functions need to connect to the USB differential pins.

You can't omit this part of the hardware in any type of USB communication.

Yes. My usb instruction uses a pair of pins.
But your circuit did not show both inputs.

Sapieha · 2014-03-13 05:01

Hi Cluso.

D_n, D_p are the lines that input/output to D-, D+

nOEi ---> selects input/output to USB
TXD, RXD are the NRZI output/input to this circuitry

Cluso99 wrote: »

Yes. My usb instruction uses a pair of pins.
But your circuit did not show both inputs.

Sapieha · 2014-03-13 16:02

Hi

Here is the circuitry with built-in NRZI.

D_n, D_p are the lines that input/output to D-, D+

nOEi ---> selects input/output to USB
TXD_di, RXD_do are the real bits output/input to this circuitry

It gives real bits directly on receive and takes real bits on send.

It needs only one instruction, where field D can control the special signals --->
some of them are read-only and some write/read (nOEi, SOE, SEI, SUSPEND, reset_i),
field S is the port number,
and the bit value is sent/received through flag C (maybe shifted directly IN/OUT of the register specified by the RESD instruction).
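
As a reference for what "real bits" means here, the receive-side NRZI decode and bit unstuffing can be sketched in Python. This follows the USB convention (no level change = 1, a level change = 0, and a 0 is stuffed after six consecutive 1s); the function names and the 'J'/'K' string representation are assumptions for illustration only.

```python
def nrzi_decode(levels, idle="J"):
    """USB NRZI: no level change = 1, a level change = 0."""
    bits, prev = [], idle
    for level in levels:
        bits.append(1 if level == prev else 0)
        prev = level
    return bits


def unstuff(bits):
    """Drop the 0 that the sender inserts after six consecutive 1s."""
    out, ones = [], 0
    for bit in bits:
        if ones == 6:
            if bit != 0:
                raise ValueError("bit-stuff error: seventh consecutive 1")
            ones = 0
            continue                      # stuffed bit carries no data
        out.append(bit)
        ones = ones + 1 if bit else 0
    return out


def bits_to_int(bits):
    """USB sends the least significant bit first."""
    return sum(bit << i for i, bit in enumerate(bits))


sync = nrzi_decode("KJKJKJKK")            # SYNC pattern as seen from idle J
print(sync)                               # [0, 0, 0, 0, 0, 0, 0, 1]
print(hex(bits_to_int(sync)))             # 0x80
print(unstuff([1] * 6 + [0, 1]))          # [1, 1, 1, 1, 1, 1, 1]
```

SE0 detection (both lines low) is deliberately left out; this models only the data path that a hardware NRZI stage would hide from software.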

jmg · 2014-03-13 16:59

Verilog from above, after edits/fixes to get it to compile, and some cleanups on stuff and Data.
Included all logic in ONE place (KISS) and made the stuff counter more self-contained.
For simplicity, the CLK here is considered to be the USB sample point.

Code merged with the DPLL Baud further up would add TSW as a CE gate, to get that sample point aligned correctly.

////////////////////////////////////////////////////////////////////////////////
// RR20140310-12 P2 RxUSB instruction
////////////////////////////////////////////////////////////////////////////////
/*---------------------------------------------------------------------------------------------------------------------
              RxUSB   D, S/#          WZ,WC             ' Receive single NRZI bit pair, accum CRC and byte, unstuff bits
where
  S/# is the PinPair# and Poly bits
    S[31..9]  = unused
    S[8..7]   = 00= CRC5  USB    (0 2 5)  
                01= CRC16 USB    (0 2 15 16)
                10= CRC16 CCITT  (0 5 12 16)
                11= undefined
    S[6..0]   = D-/D+ Pin Pair #0..127
                The pin pair is always a pair of pins mod 2. ie nnnnnnx where x=0 and x=1 for the pair.
                If the pin pair is even (S[0]=0) then J is the lowest pin and K is the higher pin of the consecutive pair
                If the pin pair is odd  (S[0]=1) then K is the lowest pin and J is the higher pin of the consecutive pair.
                This arrangement allows for simple LS and FS by making the pin pair even or odd.                              
  D is the cog register storing a 32 bit field...
    D[31..16] = crc16
    D[15]     = K new pin value
    D[14]     = J new pin value
    D[13..11] = unstuff counter 3 bits
    D[10..8]  = bit counter 3 bits
    D[7..0]   = data byte accumulation
  Z = data byte ready (8 bits)
  C = SE0/SE1
It would be acceptable for D to be at a fixed location eg $1F0.
---------------------------------------------------------------------------------------------------------------------*/
// inputs:  D, S, PINS
// outputs: D, Z, C
////////////////////////////////////////////////////////////////////////////////
module          RxUSB
(
input           CLK,
input           Load_d,
input           jI,             // new J value
input           kI,             // new K value
input   [31:0]  s,              // S operand
input   [31:0]  d,              // D operand
input           wz,             // WZ operand
input           wc,             // WC operand
input   [127:0] p,              // input pins
output reg [31:0]  r,              // D result
output reg      zz,             // Z flag
output reg      cy              // C flag    
);
reg     [15:0]  crc;            // original CRC (accumulated)
reg     [2:0]   bitcnt;         // data bit counter 3 bits
reg             k;              // K new pin value
reg             j;              // J new pin value
reg     [2:0]   stuffcnt;       // stuff counter 3 bits
reg     [7:0]   data;           // data byte (accumulated)
reg     [1:0]   poly;           // crc05usb/crc16usb/crc16ccitt/undef polynomial selection
//reg     [6:0]   pinno;          // pin pair numbers 0-127
reg             kP;             // K previous pin value
reg             jP;             // J previous pin value
// flags/conditions...
reg             crc05usb;       // 00= CRC5  USB    
reg             crc16usb;       // 01= CRC16 USB   
reg             crc16itt;       // 10= CRC16 CCITT 
reg             crc16ndef;      // 11= undefined   
reg             toggle;         // data bit 0 or 1
reg             BitStuff;       // unstuff this bit
reg             SE0_SE1;        // SE0/SE1 condition
///////////////////////////////////////////////////////////////////////////////
// set crc options
    always @(poly)  begin   
        crc05usb  = (poly == 2'b00);                    // CRC5usb   =(0 2 5)
        crc16usb  = (poly == 2'b01);                    // CRC16usb  =(0 2      15 16)
        crc16itt  = (poly == 2'b10);                    // CRC16ccitt=(0   5 12    16)
        crc16ndef = (poly == 2'b11);                    // undefined
    end
// check for a "1" bit =toggle, and SE0/SE1 conditions, and BitStuff condition
    always @(*)  begin   
        toggle    = kI ^ kP;                            // 1 = pin changed since previous sample. NB: USB NRZI sends a 0 bit as a change and a 1 bit as no change, so the data sense may need inverting
        SE0_SE1   = (kI == jI);                         // detect SE0/SE1 (j==k)
        BitStuff  = (!toggle & (stuffcnt == 3'b110) & (crc05usb | crc16usb));  // unstuff this bit - USB only ?
//        BitStuff  = ( (stuffcnt == 3'b110) & (crc05usb or crc16usb));  // unstuff this bit - USB only ?
    end    // Counter alone is enough, once have 6, will get a 0, unless we want to preserve 1111111, not used in USB?
///////////////////////////////////////////////////////////////////////////////
// Set Initial conditions
    always @(posedge CLK) begin
        if (Load_d) begin                               // write initial values to registers
            kP       <= d[15];                           // previous K
            jP       <= d[14];                           // previous J
            stuffcnt <= d[13:11];                        // original stuff counter value
            bitcnt   <= d[10:8];                         // original bit   counter value
            data     <= d[7:0];                          // original data value (accum)
            poly     <= s[8:7];                          // 00=crc05usb, 01=crc16usb, 10=crc16ccitt, 11=undefined
            k        <= kI;                              // new pin value
            j        <= jI;                              // new pin value
        end
        else begin                                      // !Load_d = normal RUN (compiler wants in one block)
// ??? is this correct way around etc ???
            k       <= kI;                              // new pin value
            j       <= jI;                              // new pin value
            kP      <= kI;                              // previous pin value
            jP      <= jI;                              // previous pin value
// check for bit unstuff
             if (!BitStuff & !SE0_SE1) begin    // Collect only valid data bits
                bitcnt    <= bitcnt+1;                      
                data[6:0] <= data[7:1];         // LSB first - shift right
                data[7]   <= toggle;            
             end       
             if (!toggle | (stuffcnt == 3'b110) ) begin  // reset if Din = 0, OR reaches (USB) Threshold.
                stuffcnt<= 3'b000;        
             end          
             else begin
                stuffcnt <= stuffcnt+1;
             end

        end // Load_d
    end                                                                          
///////////////////////////////////////////////////////////////////////////////
// CRC routine
reg             kr0;
reg             kr2;
reg             kr5;
reg             kr12;
reg             kr15;

// calculate the new crc... (decoded values so no overlaps in if)
    always @(*) begin
        if (crc05usb) begin
            kr0  = toggle ^ crc[4];
            kr2  = toggle ^ crc[4];
            kr5  = 1'b0;
            kr12 = 1'b0;
            kr15 = 1'b0;
        end
        if (crc16usb) begin
            kr0  = toggle ^ crc[15];
            kr2  = toggle ^ crc[15];
            kr5  = 1'b0;
            kr12 = 1'b0;
            kr15 = toggle ^ crc[15]; 
        end
        if (crc16itt) begin
            kr0  = toggle ^ crc[15];
            kr2  = 1'b0;
            kr5  = toggle ^ crc[15];
            kr12 = toggle ^ crc[15];
            kr15 = 1'b0; 
        end
        if (crc16ndef) begin
            kr0  = 1'b0;
            kr2  = 1'b0;
            kr5  = 1'b0;
            kr12 = 1'b0;
            kr15 = 1'b0; 
        end
    end        
    always @(posedge CLK) begin
        if (Load_d) begin                     // write to reg initial value
            crc <= d[31:16];                  // original crc value (accum)
        end
        else if (!SE0_SE1 & !BitStuff) begin  // Only valid data 
            crc[0]  <= kr0;
            crc[1]  <= crc[0];
            crc[2]  <= crc[1] ^ kr2;
            crc[3]  <= crc[2];
            crc[4]  <= crc[3];
            crc[5]  <= crc[4] ^ kr5;
            crc[6]  <= crc[5];
            crc[7]  <= crc[6];
            crc[8]  <= crc[7];
            crc[9]  <= crc[8];
            crc[10] <= crc[9];
            crc[11] <= crc[10];
            crc[12] <= crc[11] ^ kr12;
            crc[13] <= crc[12];
            crc[14] <= crc[13];
            crc[15] <= crc[14] ^ kr15;
        end
    end    
        
///////////////////////////////////////////////////////////////////////////////
    
// set D results - optional 32 bit pick-off.
    always @(*)  begin             //                     ??? or @(posedge CLK)
        r[31:16] = crc;
        r[15]    = k;
        r[14]    = j;
        r[13:11] = stuffcnt;
        r[10:8]  = bitcnt;
        r[7:0]   = data;               
    end    
    
// set Z and C flags
    always  @(posedge CLK) begin
        if (wz)  begin
            if (!BitStuff & (bitcnt == 3'b111)) begin    // About to load last bit.. so  
                zz <= 1'b1;                              // byte ready
            end
            else begin    
                zz <= 1'b0;                              // byte not ready
            end
        end
        if (wc) begin          
            cy <= SE0_SE1;          // c = SE0/SE1
        end           
    end
endmodule
// Pre/   Post/ Post loaded
// 000    001   1
// 001    010   2
// 010    011   3
// 011    100   4
// 100    101   5
// 101    110   6
// 110    111   7
// 111    000   8

Cluso99 · 2014-03-13 17:36

Sapieha wrote: »

Hi

Here is the circuitry with built-in NRZI.

D_n, D_p are the lines that input/output to D-, D+

nOEi ---> selects input/output to USB
TXD_di, RXD_do are the real bits output/input to this circuitry

It gives real bits directly on receive and takes real bits on send.

It needs only one instruction, where field D can control the special signals --->
some of them are read-only and some write/read (nOEi, SOE, SEI, SUSPEND, reset_i),
field S is the port number,
and the bit value is sent/received through flag C (maybe shifted directly IN/OUT of the register specified by the RESD instruction).

Thanks Sapieha. Now I understand what you mean.
I am writing the instruction using Verilog now. It does show the circuitry required.

Sapieha · 2014-03-13 17:54

Hi jmg.

Nice code ---> it compiles cleanly in my Quartus.

BUT it is only the receive part ---> it still needs the send part and hardware drivers for ( j, k ) IN/OUT

jmg wrote: »

Verilog from above, after edits/fixes to get it to compile, and some cleanups on stuff and Data.
Included all logic in ONE place (KISS) and made the stuff counter more self-contained.
For simplicity, the CLK here is considered to be the USB sample point.

Code merged with the DPLL Baud further up would add TSW as a CE gate, to get that sample point aligned correctly.

////////////////////////////////////////////////////////////////////////////////
// RR20140310-12 P2 RxUSB instruction
////////////////////////////////////////////////////////////////////////////////
/*---------------------------------------------------------------------------------------------------------------------
              RxUSB   D, S/#          WZ,WC             ' Receive single NRZI bit pair, accum CRC and byte, unstuff bits
where
  S/# is the PinPair# and Poly bits
    S[31..9]  = unused
    S[8..7]   = 00= CRC5  USB    (0 2 5)  
                01= CRC16 USB    (0 2 15 16)
                10= CRC16 CCITT  (0 5 12 16)
                11= undefined
    S[6..0]   = D-/D+ Pin Pair #0..127
                The pin pair is always a pair of pins mod 2. ie nnnnnnx where x=0 and x=1 for the pair.
                If the pin pair is even (S[0]=0) then J is the lowest pin and K is the higher pin of the consecutive pair
                If the pin pair is odd  (S[0]=1) then K is the lowest pin and J is the higher pin of the consecutive pair.
                This arrangement allows for simple LS and FS by making the pin pair even or odd.                              
  D is the cog register storing a 32 bit field...
    D[31..16] = crc16
    D[15]     = K new pin value
    D[14]     = J new pin value
    D[13..11] = unstuff counter 3 bits
    D[10..8]  = bit counter 3 bits
    D[7..0]   = data byte accumulation
  Z = data byte ready (8 bits)
  C = SE0/SE1
It would be acceptable for D to be at a fixed location eg $1F0.
---------------------------------------------------------------------------------------------------------------------*/
// inputs:  D, S, PINS
// outputs: D, Z, C
////////////////////////////////////////////////////////////////////////////////
module          RxUSB
(
input           CLK,
input           Load_d,
input           jI,             // new J value
input           kI,             // new K value
input   [31:0]  s,              // S operand
input   [31:0]  d,              // D operand
input           wz,             // WZ operand
input           wc,             // WC operand
input   [127:0] p,              // input pins
output reg [31:0]  r,              // D result
output reg      zz,             // Z flag
output reg      cy              // C flag    
);
reg     [15:0]  crc;            // original CRC (accumulated)
reg     [2:0]   bitcnt;         // data bit counter 3 bits
reg             k;              // K new pin value
reg             j;              // J new pin value
reg     [2:0]   stuffcnt;       // stuff counter 3 bits
reg     [7:0]   data;           // data byte (accumulated)
reg     [1:0]   poly;           // crc05usb/crc16usb/crc16ccitt/undef polynomial selection
//reg     [6:0]   pinno;          // pin pair numbers 0-127
reg             kP;             // K previous pin value
reg             jP;             // J previous pin value
// flags/conditions...
reg             crc05usb;       // 00= CRC5  USB    
reg             crc16usb;       // 01= CRC16 USB   
reg             crc16itt;       // 10= CRC16 CCITT 
reg             crc16ndef;      // 11= undefined   
reg             toggle;         // data bit 0 or 1
reg             BitStuff;       // unstuff this bit
reg             SE0_SE1;        // SE0/SE1 condition
///////////////////////////////////////////////////////////////////////////////
// set crc options
    always @(poly)  begin   
        crc05usb  = (poly == 2'b00);                    // CRC5usb   =(0 2 5)
        crc16usb  = (poly == 2'b01);                    // CRC16usb  =(0 2      15 16)
        crc16itt  = (poly == 2'b10);                    // CRC16ccitt=(0   5 12    16)
        crc16ndef = (poly == 2'b11);                    // undefined
    end
// check for a "1" bit =toggle, and SE0/SE1 conditions, and BitStuff condition
    always @(*)  begin   
        toggle    = kI ^ kP;                            // 1 = pin changed since previous sample. NB: USB NRZI sends a 0 bit as a change and a 1 bit as no change, so the data sense may need inverting
        SE0_SE1   = (kI == jI);                         // detect SE0/SE1 (j==k)
        BitStuff  = (!toggle & (stuffcnt == 3'b110) & (crc05usb | crc16usb));  // unstuff this bit - USB only ?
//        BitStuff  = ( (stuffcnt == 3'b110) & (crc05usb or crc16usb));  // unstuff this bit - USB only ?
    end    // Counter alone is enough, once have 6, will get a 0, unless we want to preserve 1111111, not used in USB?
///////////////////////////////////////////////////////////////////////////////
// Set Initial conditions
    always @(posedge CLK) begin
        if (Load_d) begin                               // write initial values to registers
            kP       <= d[15];                           // previous K
            jP       <= d[14];                           // previous J
            stuffcnt <= d[13:11];                        // original stuff counter value
            bitcnt   <= d[10:8];                         // original bit   counter value
            data     <= d[7:0];                          // original data value (accum)
            poly     <= s[8:7];                          // 00=crc05usb, 01=crc16usb, 10=crc16ccitt, 11=undefined
            k        <= kI;                              // new pin value
            j        <= jI;                              // new pin value
        end
        else begin                                      // !Load_d = normal RUN (compiler wants in one block)
// ??? is this correct way around etc ???
            k       <= kI;                              // new pin value
            j       <= jI;                              // new pin value
            kP      <= kI;                              // previous pin value
            jP      <= jI;                              // previous pin value
// check for bit unstuff
            if (!BitStuff & !SE0_SE1) begin             // Collect only valid data bits
                bitcnt    <= bitcnt+1;
                data[6:0] <= data[7:1];                 // LSB first - shift right
                data[7]   <= toggle;
            end
            if (!toggle | (stuffcnt == 3'b110) ) begin  // reset if Din = 0, OR reaches (USB) Threshold.
                stuffcnt <= 3'b000;
            end
            else begin
                stuffcnt <= stuffcnt+1;
            end

        end // Load_d
    end                                                                          
///////////////////////////////////////////////////////////////////////////////
// CRC routine
reg             kr0;
reg             kr2;
reg             kr5;
reg             kr12;
reg             kr15;

// calculate the new crc... (decoded values so no overlaps in if)
    always @(*) begin
        // defaults first (these also cover crc16ndef), so every path assigns and no latch is inferred
        kr0  = 1'b0;
        kr2  = 1'b0;
        kr5  = 1'b0;
        kr12 = 1'b0;
        kr15 = 1'b0;
        if (crc05usb) begin
            kr0  = toggle ^ crc[4];
            kr2  = toggle ^ crc[4];
        end
        if (crc16usb) begin
            kr0  = toggle ^ crc[15];
            kr2  = toggle ^ crc[15];
            kr15 = toggle ^ crc[15];
        end
        if (crc16itt) begin
            kr0  = toggle ^ crc[15];
            kr5  = toggle ^ crc[15];
            kr12 = toggle ^ crc[15];
        end
    end
    always @(posedge CLK) begin
        if (Load_d) begin                     // write to reg initial value
            crc <= d[31:16];                  // original crc value (accum)
        end
        else if (!SE0_SE1 & !BitStuff) begin  // Only valid data 
            crc[0]  <= kr0;
            crc[1]  <= crc[0];
            crc[2]  <= crc[1] ^ kr2;
            crc[3]  <= crc[2];
            crc[4]  <= crc[3];
            crc[5]  <= crc[4] ^ kr5;
            crc[6]  <= crc[5];
            crc[7]  <= crc[6];
            crc[8]  <= crc[7];
            crc[9]  <= crc[8];
            crc[10] <= crc[9];
            crc[11] <= crc[10];
            crc[12] <= crc[11] ^ kr12;
            crc[13] <= crc[12];
            crc[14] <= crc[13];
            crc[15] <= crc[14] ^ kr15;
        end
    end    
        
///////////////////////////////////////////////////////////////////////////////
    
// set D results - optional 32 bit pick-off.
    always @(*)  begin             //                     ??? or @(posedge CLK)
        r[31:16] = crc;
        r[15]    = k;
        r[14]    = j;
        r[13:11] = stuffcnt;
        r[10:8]  = bitcnt;
        r[7:0]   = data;               
    end    
    
// set Z and C flags
    always  @(posedge CLK) begin
        if (wz)  begin
            if (!BitStuff & (bitcnt == 3'b111)) begin    // About to load last bit.. so  
                zz <= 1'b1;                              // byte ready
            end
            else begin    
                zz <= 1'b0;                              // byte not ready
            end
        end
        if (wc) begin          
            cy <= SE0_SE1;          // c = SE0/SE1
        end           
    end
endmodule
// Pre/   Post/ Post loaded
// 000    001   1
// 001    010   2
// 010    011   3
// 011    100   4
// 100    101   5
// 101    110   6
// 110    111   7
// 111    000   8
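
For anyone wanting to cross-check the CRC section above in software, here is a rough bit-at-a-time model in Python. It is only a sketch: the names (POLYS, crc_step, usb_crc_field and so on) are made up for illustration. It assumes the usual USB convention of an all-ones start value with the CRC field sent inverted, MSB first. Unlike the Verilog, it masks the CRC5 register to 5 bits instead of letting the upper bits shift on.

```python
# Software model of the CRC shift register (illustrative sketch, hypothetical names).
# Polynomial select follows the decode in the module:
#   0 = CRC5 USB, 1 = CRC16 USB, 2 = CRC16 CCITT
POLYS = {
    0: (5, 0x05),      # CRC5  USB   : x^5  + x^2  + 1          (taps kr2, kr0)
    1: (16, 0x8005),   # CRC16 USB   : x^16 + x^15 + x^2 + 1    (taps kr15, kr2, kr0)
    2: (16, 0x1021),   # CRC16 CCITT : x^16 + x^12 + x^5 + 1    (taps kr12, kr5, kr0)
}


def crc_step(crc, bit, poly_sel):
    """Clock one data bit into the CRC register."""
    width, taps = POLYS[poly_sel]
    feedback = bit ^ ((crc >> (width - 1)) & 1)   # data bit XOR top register bit
    crc = (crc << 1) & ((1 << width) - 1)         # shift towards the MSB
    return crc ^ taps if feedback else crc


def crc_bits(bits, poly_sel, crc):
    """Run a list of bits through the register, starting from value crc."""
    for b in bits:
        crc = crc_step(crc, b, poly_sel)
    return crc


def lsb_first(data):
    """Expand bytes into bits in USB wire order (LSB of each byte first)."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]


def usb_crc_field(bits, poly_sel):
    """CRC field bits as sent on the wire: inverted register, MSB first."""
    width, _ = POLYS[poly_sel]
    all_ones = (1 << width) - 1
    crc = crc_bits(bits, poly_sel, all_ones) ^ all_ones
    return [(crc >> i) & 1 for i in reversed(range(width))]
```

A receiver can run the data bits plus the received CRC field through the same register and compare against a fixed residual, so it never needs to know where the data ends and the CRC starts until EOP.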

Sapieha · 2014-03-13 17:56

Hi Cluso.

For the part shown in my SCH, I already have Verilog code.

Cluso99 wrote: »

Thanks Sapieha. Now I understand what you mean.
I am writing the instruction using Verilog now. It does show the circuitry required.

jmg · 2014-03-13 19:55

Sapieha wrote: »

Hi jmg.

Nice code ---> compile nice in my Quartus.

BUT it is only Receive part ---> Still need Send part and Hardware drivers for ( j, k ) IN/OUT

Correct, no Tx yet - I think P2 has differential out support now, and the Serdes may/(should?) support packed sends.
If the CRC above can be shared (it could snoop on a Tx stream?), that just leaves bit-stuff to do in SW before starting to
send a block.
Receive is a tougher nut to crack, so the focus was on that.

Even doing TxStuff in Verilog is not many gates ( similar to the Rx Side )
Roughly :

// ~~~~~~~~~~~ Stuff counter, INC when sending ones, else clear ~~~~~~~~~~~~~~~~~~~
        if (!DataBY[0] | (StuffCtr == 3'b110) ) begin    // reset when DSend = 0, OR reaches Threshold.  
            StuffCtr <= 3'b000;        
        end          
        else begin
            StuffCtr <= StuffCtr+1;
        end
	
// ~~~~~~~~~~~ Insert 0, or send/Shift data  ~~~~~~~~~~~~~~~~~~~
        if (StuffCtr == 3'b110) begin
            TxT <= !TxT;                      // toggle = insert send 0, skip TxCount, skip shift DataBY
        end
        else begin                            // No insert, normal data send, so INC and Do Shift
            if ( !DataBY[0] ) begin           // send 0 = toggle, send 1 = hold value on TxT
                TxT <= !TxT;
            end
            BitCtr <= BitCtr + 1;
            DataBY <= {Din,DataBY[7:1]};      // LSB first, so shift right; new bit enters at the MSB
        end
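
The stuff-then-NRZI sequence above can also be modelled in software, which is handy for generating test patterns. This is a hedged sketch in Python with made-up function names. The line is modelled as a single abstract level that starts at 1, which is an assumption for illustration and not a statement about the J/K pin polarity.

```python
def stuff_bits(bits):
    """Insert a 0 after every run of six consecutive 1s (USB bit stuffing)."""
    out, ones = [], 0
    for b in bits:
        out.append(b)
        ones = ones + 1 if b else 0
        if ones == 6:
            out.append(0)      # stuffed 0 forces a transition on the line
            ones = 0
    return out


def nrzi_encode(bits, level=1):
    """NRZI as USB uses it: a 0 toggles the line level, a 1 holds it."""
    out = []
    for b in bits:
        if b == 0:
            level ^= 1
        out.append(level)
    return out


def nrzi_decode(levels, prev=1):
    """Inverse of nrzi_encode: no change = 1, change = 0."""
    bits = []
    for lv in levels:
        bits.append(1 if lv == prev else 0)
        prev = lv
    return bits


def unstuff_bits(bits):
    """Drop the bit that follows six consecutive 1s."""
    out, ones, skip = [], 0, False
    for b in bits:
        if skip:               # this is the stuffed bit: discard it, restart the run count
            skip, ones = False, 0
            continue
        out.append(b)
        ones = ones + 1 if b else 0
        if ones == 6:
            skip = True
    return out
```

Running data through stuff_bits and nrzi_encode, then back through nrzi_decode and unstuff_bits, should return the original bits, which gives a quick loopback check before any hardware is involved.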

Cluso99 · 2014-03-13 20:41

Sapieha,
As jmg said, we are concentrating on the harder part - the receive end first. But we can also use the same instruction to do the CRC calcs after outputting each bit. So the instruction is quite powerful as it is.
Currently it can also do CRC16, but it needs a couple of fixes because BiSync/SDLC uses a single bit, not complementary pairs, and it can be NRZ or NRZI.

jmg · 2014-03-25 16:33

I'll add in here some test results from another discussion, as this gives a performance reference point of existing USB devices,
and also shows some issues in the details of settings, and sustained speeds, for when testing USB flows on P2.

Testing on more PCs and USB ports shows some subtle differences between the test PCs:
* USB3 ports (blue) seem to sustain higher baud traffic than a 'standard' port (even though both run at 12MHz - maybe larger buffers?)
* Windows Device Settings defaulted to 16ms, and changing to 1ms did help 2MBd on std USB
* 3MBd on USB3 HW was very close to managing reliable streamed Duplex.
* Adding a 2nd stop bit seemed to help a little.
* Some failure modes looked a little brutal: at 3MBd giving errors, sometimes the USB VCP vanished from Win8, and did fully restore on unplug/replug. Moving to another PC then back seemed to clear things.
(ie maybe more than just dropped data was going on here)
* TX seemed to never drop, but the receive side seemed to have the issues.

As another reference point, Silabs CP2130 specs 3.9 and 2.6MBps on read/write, so that does look to be about the duplex limit.
They also give 5.8MBd(W) and 6.6MBd(R) as one-way limits.

Loopback streaming tests, 100000 blocks, with a Frequency counter and Char counter Terminal.
( This terminal has been crafted to have low overhead, and quiet modes, so the PC SW side does not set the ceiling.)

Propeller Project Board Tests  FT231X (20p) Loopback
FT231X  File of  [U......U]
                                             Shift-Ctr-V                         Right-click Paste
Block Size  Baud    Set    TxSend  RxBack    FreqAv                              FreqAv
100000      3MBd    n,8,1  100000   99128!*  1.49989M Qm, Overrun errors         1.49985MHz Overrun errors
100000      2.4MBd  n,8,1  100000  100000    1.00001M Qm                         1000.018kHz Solid << 2MBd alias
100000      2MBd    n,8,1  100000  100000    1.00001M Qm                         1000.018kHz Solid
100000      1.5MBd  n,8,1  100000  100000    750k quiet mode, less in hex        750.007kHz Solid
100000      1MBd    n,8,1  100000  100000    380~500kHz variable (hex)           500.0062kHz Solid
100000      500kBd  n,8,1  100000  100000    243kHz, sometimes 250kHz (hex)      250.0045kHz Solid

100000      3MBd    n,8,2  100000   99577!*  fewer Overrun errors                1.49985MHz Overrun errors
100000      3MBd    m,8,2  100000   99949!*  Better Rx Yield, still < 100%

* in 3MBd case, external edge TX count is correct, so it is RX side which is dropping chars

Added:
Same tests, SiLabs CP2105 (ENH) channel Duplex, Shift-Ctr-V QuietMode : 
(kHz values under 0.5*Baud, mean added stop bits)
Block Size  Baud    Set    TxSend       RxBack   FreqAv  
100000      1.2Mbd  n,8,1  100000 	100000   ->  441.194kHz
100000      2Mbd    n,8,1  100000 	100000   ->  525.516kHz
100000      3Mbd    n,8,1  100000 	100000   ->  624.674kHz
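
To turn the FreqAv readings into throughput: a 'U' (0x55) framed n,8,1 goes out as 0101010101, so each character gives 5 full cycles, and extra idle time between characters only stretches the last high period. A small Python sketch of that arithmetic follows. The function names are made up, and it assumes the counter sees nothing but these characters.

```python
CYCLES_PER_CHAR = 5   # start(0) + 0x55 LSB first (10101010) + stop(1) = 0101010101


def chars_per_second(freq_hz):
    """Sustained character rate implied by the averaged frequency reading."""
    return freq_hz / CYCLES_PER_CHAR


def line_utilisation(freq_hz, baud):
    """Fraction of the back-to-back rate; the ideal FreqAv is baud / 2."""
    return freq_hz / (baud / 2)


# CP2105 rows from the table above: (baud, measured FreqAv in Hz)
for baud, freq in [(1_200_000, 441_194), (2_000_000, 525_516), (3_000_000, 624_674)]:
    print(baud, round(chars_per_second(freq)), round(line_utilisation(freq, baud), 2))
```

On these numbers the CP2105 keeps raising its character rate as the baud setting goes up, but the fraction of the line actually in use drops, which matches the note about added stop bits.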

It seems the FT231X can sustain 2MBd duplex (with good PC SW), and at 3MBd the PC can send to it with no added stop bits, but it stutters a little on 3MBd duplex, on the receive side.

Expanding to 2 stop bits and mark parity both help, but are not quite enough to make duplex run without overrun.
(SW works to well above this on a FT232H, but that uses different frame speed and drivers)

I think FTDI have somewhat mangled their Baud formula in my data sheet; tests show a more correct description is:

FT231X Virtual Baud Clock of 24MHz, with legal divisors of 8,12,16,17,18,19,20,21...

ie from 16 upwards, steps of 1 are supported; below 16 it is only 8 and 12
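
As a quick check of that divisor list against the test table, here is a small Python sketch. The 24MHz clock and the legal divisors come from the measurements above. The rule of rounding up to the next legal divisor is my assumption, picked because it reproduces the 2.4MBd setting aliasing to 2MBd. No upper limit on the divisor is modelled, since none was measured.

```python
BAUD_CLOCK_HZ = 24_000_000   # virtual baud clock inferred from the tests


def is_legal_divisor(n):
    """Divisors seen to work: 8, 12, then every integer from 16 up."""
    return n in (8, 12) or n >= 16


def next_legal_divisor(requested_baud):
    """Assumed rule: take the ideal divisor and round up to the next legal one."""
    ideal = -(-BAUD_CLOCK_HZ // requested_baud)   # ceiling division
    n = max(ideal, 8)
    while not is_legal_divisor(n):
        n += 1
    return n


def actual_baud(requested_baud):
    """Baud rate the part would really run at under the assumed rule."""
    return BAUD_CLOCK_HZ / next_legal_divisor(requested_baud)
```

Under that assumption a 2.4MBd request wants a divisor of 10, which is not legal, so it lands on 12 and runs at 2MBd, the same as the 1000.018kHz reading in the table.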
