I think Chip meant the earlier, simpler code to allocate pins and manage SE0 and T into the flags?
The code in #107 is not quite 'mission-ready', and pin mapping and the couple of FFs & XORs to do SE0_SE1 and T should be common to any extended code.
A <= (non-blocking) assign in a clocked block is the Verilog that ensures you do get a clocked result (ie usually a D-FF).
A = (blocking) assign within a clocked block sometimes gives a clocked result, but not always. Best to be careful.
(another reason I suggested you run something like Lattice ispLEVER)
There is now no point in producing code for the previous GETXP examples.
I am certain Chip understands what I am trying to do regarding extracting the correct pin pair. If not, a fixed P0 & P1 would work for now.
I think the code is close enough for testing - it is the time it will take Chip to do this with appropriate fixes and fitting into his Verilog regime.
As for testing, initially I propose to just output on those pin pairs various conditions, and each time calling the new RxUSB instruction (whatever we call it - and we don't need pnut support as I can code it as a long). This way, I can control what is output and therefore test what the instruction receives. So I can verify the instruction is working as designed in a controlled environment.
Once I have that running, I can snoop a real FS USB and ensure I can read the tokens and packets, and verify the crc5 and crc16usb, the SE0 at the EOP and of course the initial sync sequence.
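A minimal sketch of that controlled stimulus, written in Python as a host-side model rather than P2 code. It assumes standard USB NRZI coding (a 0 bit changes the line state, a 1 bit holds it), an idle J state before the packet, and SE0 SE0 J for the EOP. The function names and the '0' symbol for SE0 are made up for illustration.

```python
def nrzi_encode(bits, start="J"):
    """Return the J/K line-state string for a list of data bits, sent in order."""
    state = start
    out = []
    for bit in bits:
        if bit == 0:                       # a 0 bit toggles the line state
            state = "K" if state == "J" else "J"
        out.append(state)                  # a 1 bit holds the previous state
    return "".join(out)


def byte_lsb_first(value):
    """Split a byte into 8 bits, LSB first, the order USB sends them."""
    return [(value >> i) & 1 for i in range(8)]


SYNC = nrzi_encode(byte_lsb_first(0x80))   # sync field as line states
EOP = "00J"                                # '0' stands for SE0 (illustrative notation)
print(SYNC + " ... " + EOP)
```

Driving the pin pair with a sequence like this, one state per RxUSB call, would let each returned D value be checked against what was sent.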
That testing approach sounds like a good idea, Chip then just needs to make as much SW-readable as is practical. ie 32b each way.
It probably does not need to mesh into the register-array, just as long as it can R/W in SW. (ie like the counter setups)
I've been looking at Cluso99's proposed RXUSB instruction.
I think for the final P2 (not the FPGA) there is scope to be able to use this simply with a byte processing loop in another COG task if the P2 clock is a multiple of 12MHz and >=96MHz.
EDIT: Sorry, I accidentally hit tab + space while typing, which clicked submit, and this posted too soon. I'm still formulating and thinking about this idea. I want to come up with a 6 (or 7?) instruction bit loop and use SYNCTRA, which will wait until the right time to sample without stopping the other hardware task from running. I think I will need to subtract from PHSA somewhere as well, making this 7 instructions. We would still do the 1:8 task allocation to give the byte processor task its time for the packet.
Thinking about the 4 phase Digital PLL equivalent I posted in #118.
This can help snoop on a 1.5MHz USB, and so open that testing domain, when this gets to real data flows.
Taking an 80MHz FPGA clock, we can get to 1.5MHz on average, with modest jitter.
80/1.5 = 53.3333333333333333
2^32/(80/1.5) = 80530636.8 round(2^32/(80/1.5)) = 80530637
2^32/(round(2^32/(80/1.5))) = 53.3333332008785675
80M/(2^32/(round(2^32/(80/1.5)))) = 1500000.0037252903
1/(ans-1.5M) = 268.435456s, the beat period of the numeric rounding error.
Will add 80530637 every clock (1.5/80)
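The arithmetic above can be re-run exactly as a quick check. This is only a sketch: the variable names are mine, and it assumes a 32-bit phase accumulator that adds the step on every 80MHz clock.

```python
from fractions import Fraction

SYSCLK_HZ = 80_000_000     # FPGA clock
TARGET_HZ = 1_500_000      # low-speed USB bit rate

# Phase step added to the 32-bit accumulator on every system clock.
step = round(Fraction(2**32 * TARGET_HZ, SYSCLK_HZ))

actual_hz = Fraction(SYSCLK_HZ * step, 2**32)   # rate the accumulator really wraps at
error_hz = actual_hz - TARGET_HZ                # offset caused by rounding the step
beat_s = 1 / error_hz                           # time for that offset to slip one full bit

print(step)
print(float(actual_hz))
print(float(beat_s))
```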
The upper 2 bits of the accumulator then split each bit time into 4 quadrants, and the DPLL rule from #118 is along the lines of:
If edge occurs @ MSBs = 3 -> No change (add as usual) (1.5/80)
If edge occurs @ MSBs = 2 -> Need to advance 1 quadrant, so add 2^32/4 ONCE (or 2^32/8 twice)
If edge occurs @ MSBs = 0 -> Need to retard 1 quadrant, so add 2^32*3/4 ONCE (or 2^32*3/8 twice)
An edge @ MSBs = 1 should never happen during a data stream.
That case could be flagged, and it can add 2^32*2/4 ONCE (or 2^32*2/8 twice) to match the Verilog action.
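The quadrant rule can be modelled in a few lines. This is a sketch of the rule as listed here, not of the Verilog in #118; the function names are hypothetical. Every correction works out to "add enough quadrants to land the edge in quadrant 3".

```python
PHASE_BITS = 32
QUADRANT = 1 << (PHASE_BITS - 2)       # 2^32 / 4
MASK = (1 << PHASE_BITS) - 1


def edge_correction(phase):
    """One-off phase add for an edge seen at `phase` (0 when already in quadrant 3)."""
    q = phase >> (PHASE_BITS - 2)      # upper 2 bits = quadrant 0..3
    return ((3 - q) % 4) * QUADRANT


def quadrant_after(phase):
    """Quadrant the edge position maps to once the correction has been added."""
    return ((phase + edge_correction(phase)) & MASK) >> (PHASE_BITS - 2)


for q in range(4):
    phase = q * QUADRANT + 12345       # arbitrary offset inside the quadrant
    print(q, edge_correction(phase) // QUADRANT, quadrant_after(phase))
```

The middle column is the number of quarter turns added: 3 for quadrant 0 (retard), 2 for quadrant 1 (the flagged case), 1 for quadrant 2 (advance), 0 for quadrant 3.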
If a COG is set for 2 threads 50%, we have 2 x 40MHz flows to manage 1.5MBd data.
Code then does
FRQx = Default_2e32_1p5d80
and the adjust for advance is two lines
FRQx = Advance_S2_2e32d8
FRQx = Default_2e32_1p5d80
and the adjust for retard is two lines
FRQx = retard_S0_2e32_3d8
FRQx = Default_2e32_1p5d80
At a 50% slot, I think FRQx will apply twice before being restored to the default locked value.
Or there may be time to read PHSx, for all edges not in quadrant 3, and calculate a (double) add to give quadrant 0 next.
Just one adjust code block is then needed.
If quadrant 3 is the edge value, then quadrant 1 is the sample point (and Q0, Q2 are the guard bands).
Each thread has 26 thread cycles per bit.
PHSx.MSB can map to a pin, and be used as a sample and scope trigger.
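For reference, here is a sketch of the values those constant names suggest, plus the per-bit cycle budget. The numeric values are inferred from the names (2^32 x 1.5/80, 2^32/8, 2^32 x 3/8) and are assumptions; whether the adjust values replace the locked step or are added on top of it is not settled above.

```python
SYSCLK_HZ = 80_000_000
BIT_RATE_HZ = 1_500_000
THREADS = 2                                  # COG split 50/50 between two threads

Default_2e32_1p5d80 = round(2**32 * BIT_RATE_HZ / SYSCLK_HZ)   # locked value
Advance_S2_2e32d8 = 2**32 // 8               # applied twice = +1 quadrant
retard_S0_2e32_3d8 = 3 * 2**32 // 8          # applied twice = +3 quadrants, ie -1 quadrant

# Whole thread cycles available per 1.5MHz bit when each thread gets 40MHz.
thread_cycles_per_bit = (SYSCLK_HZ // THREADS) // BIT_RATE_HZ

print(Default_2e32_1p5d80, Advance_S2_2e32d8, retard_S0_2e32_3d8, thread_cycles_per_bit)
```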
I think for the final P2 (not the FPGA) there is scope to be able to use this simply with a byte processing loop in another COG task if the P2 clock is a multiple of 12MHz and >=96MHz.
The problem with that >=96MHz, is you cannot fully test this in the FPGA, which is close to drop-dead.
Better, I think, to include 48MHz (& 60MHz & 72MHz & maybe 84MHz) in the clock targets.
The Auto-pacing (DPLL) code for BYTE level handler, is in #118, and is not large.
Yeah, I know 96MHz is just too fast for the FPGA. Assuming we were to stick to bit processing loops in software, I see a lot of merit in the 1:8 approach with a byte processing task running as well, as it simplifies the software design and decouples the timing-critical bit work from the other (slower) byte-oriented protocol processing work. The problem is that transferring the data and doing the error checking takes time, and I don't see a way to get it down much more given what Cluso has proposed. Now if we come up with more USB extensions beyond RXUSB that provide bytes for us and deal with timing, that could work out nicely as well. I don't think we are there yet, but hopefully we are heading in that direction...it's worth continuing that discussion too.
....Now if we come up with more USB extensions beyond RXUSB that provides bytes for us and deals with timing, that could work out nicely as well. I don't think we are there yet but hopefully we are heading in that direction...
See Chip's comment in #118 - we are closer than you think - once you have a bit-counter inside the Verilog, then you just need to 'fire' the per-bit Verilog, on a DPLL timer, (#118) and buffer the DataTX.
I'm not sure if CRC needs buffering, or just preservation-care over a SE0 event.
I think Chip meant the earlier, simpler code to allocate pins and manage SE0 and T into the flags?
The code in #107 is not quite 'mission-ready', and pin mapping and the couple of FFs & XORs to do SE0_SE1 and T should be common to any extended code.
A <= (non-blocking) assign in a clocked block is the Verilog that ensures you do get a clocked result (ie usually a D-FF).
A = (blocking) assign within a clocked block sometimes gives a clocked result, but not always. Best to be careful.
(another reason I suggested you run something like Lattice ispLEVER)
Hi jmg.
I posted that code's schematic on the Prop II blog thread -- no one even commented on it.
It was too low resolution for me to see clearly, and besides, I can see the fitter equations, which are easier to follow...
It does pay with Verilog (as with most high level languages) to check you got what you expected, and not something else, or logic-bloat.
That testing approach sounds like a good idea, Chip then just needs to make as much SW-readable as is practical. ie 32b each way.
It probably does not need to mesh into the register-array, just as long as it can R/W in SW. (ie like the counter setups)
??? The instruction uses the D register, so we can preset it, and we can read it.
Chip said there will be instruction space, so no need for a fixed D address.
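Presetting and reading D from software comes down to packing and unpacking bit fields. A small host-side sketch, assuming the D layout documented with the RxUSB Verilog in this thread (crc in [31:16], K in [15], J in [14], stuff counter in [13:11], bit counter in [10:8], data in [7:0]); the helper names are hypothetical.

```python
def pack_d(crc=0, k=0, j=0, stuffcnt=0, bitcnt=0, data=0):
    """Build the 32-bit D value from its fields."""
    return ((crc & 0xFFFF) << 16 | (k & 1) << 15 | (j & 1) << 14 |
            (stuffcnt & 7) << 11 | (bitcnt & 7) << 8 | (data & 0xFF))


def unpack_d(d):
    """Split a 32-bit D value back into its fields."""
    return {
        "crc": (d >> 16) & 0xFFFF,
        "k": (d >> 15) & 1,
        "j": (d >> 14) & 1,
        "stuffcnt": (d >> 11) & 7,
        "bitcnt": (d >> 8) & 7,
        "data": d & 0xFF,
    }


# Example preset: CRC all ones, line idle at J, counters and data cleared.
preset = pack_d(crc=0xFFFF, j=1)
print(hex(preset), unpack_d(preset))
```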
The exact implementation is up to Chip, I'm just observing that this is more like a Counter or SerDes in operation, than a memory/register.
The Counters use SETxx and GETxx opcodes , which also have a D address.
I do quite like the sound of this adaptive clocking. I need to get my head around it more. What HW changes or other further instructions would be required to support it? How much is done in HW vs software? I see we would use one of the counters, does it need modifications or do all the clocking tweaks adjusting FRQA happen in software?
I think you meant 26 thread cycles per byte, not per bit above. But that is still nice and already gives us at least 3 hub cycles per byte @80MHz in the byte processing task. At 48MHz this drops down to 16 clocks per thread or 2 hub cycles which I think should still be fairly generous.
So what crystal frequency limitations would this overall approach entail? Anything >=48MHz, or do you still need discrete 12MHz multiples above this? I imagine for receiving, if we get aggressive, we could potentially adjust timing after every byte, which probably means we don't want to slip any more than say 1/4 bit per 8 bits, right? That is ~3% tolerance. But the transmit adds its own complexity, and I expect we want to be able to transmit accurately at the right bit rate, which then means a 12MHz multiple. How does your design deal with that?
I do quite like the sound of this adaptive clocking. I need to get my head around it more. What HW changes or other further instructions would be required to support it? How much is done in HW vs software? I see we would use one of the counters, does it need modifications or do all the clocking tweaks adjusting FRQA happen in software?
The counter form above is just a skeleton of SW-emulation ideas for the Verilog code in #118, and a means to smart-sample a USB stream at 1.5MHz.
The SW-emulation is looking at ways to test/emulate Verilog ideas, using the FPGA in SW, but at modest USB speeds. Luckily there is the low speed mode.
In #118 you can see Chip's comment suggests he may fit a custom Baud controller more easily than adding modes to a Timer.
Either way is fine, it's whatever is easiest and smallest to include. A separate 8-bit Baud Div frees the 32b timers for other tasks.
The code in #118 is pretty much all you need, just a few lines of Verilog -> Silicon.
There are not many changes once the USB code block includes a BitCtr and is called once per bit; it's just a matter of whether you call it in SW, or use a DPLL as in #118.
I think you meant 26 thread cycles per byte, not per bit above.
No, that is per-bit, - but notice that is for a test version, running at 1.5MHz LO-speed USB, where things will be easier to probe, and hook-into.
At 1.5MHz I think FPGA-P2 can fit one DPLL + Diagnostics thread sampling and checking, and one Thread running the USB-Verilog tests.
Once that looks good, the Verilog DPLL would be added to pace the USB engine, instead of SW calls.
In this form, it's not quite as easy to probe or test, so a mixed SW version (aka Verilog emulation) gives a way to bring this up, and get higher level code working.
So what crystal frequency limitations would this overall approach entail? Anything >=48Mhz or do you still need discrete 12MHz multiples above this?
Yes, the Baud-DPLL assumes N x 12MHz with N >= 4, and Ok up to 200MHz, and can do low-Speed USB at >200MHz SysClk.
The Code in #118 auto-syncs, so tolerance is not so critical, but you would need a crystal or resonator for timing.
( ie RC osc's are probably off the table)
Chip's Xtal PLL is now any-integer, so that gives a few choices of how to get to 12MHz x N.
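As a quick check of which of the clock targets mentioned in this thread meet the N x 12MHz, N >= 4 assumption, here is a small sketch. The function name is hypothetical, and the upper limit (around 200MHz for full speed) is not modelled.

```python
FS_BIT_RATE_HZ = 12_000_000


def usb_clock_multiple(sysclk_hz):
    """Return N when sysclk is N x 12MHz with N >= 4, else None."""
    n, rem = divmod(sysclk_hz, FS_BIT_RATE_HZ)
    return n if rem == 0 and n >= 4 else None


for mhz in (48, 60, 72, 80, 84, 96):
    print(mhz, usb_clock_multiple(mhz * 1_000_000))
```

The 80MHz FPGA clock falls outside this rule, which is why the 1.5MHz test version above leans on the 32-bit accumulator instead of an integer divide.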
nOEi ---> selects Input/Output to USB
TXD_di, RXD_do are the real bits Output/Input to this circuitry
It gives real bits directly on receive and uses real bits on send
Need only one instruction, whose D field can control the special signals --->
some of them are read only and some write/read (nOEi, SOE, SEI, SUSPEND, reset_i)
And field S is the port number.
and send/receive the bit value to flag C (maybe directly shifted IN/OUT from the register specified by the RESD instruction).
Verilog from above, after edits/fixes to get it to compile, and some cleanups on stuff and Data.
Included all logic in ONE place (KISS) and made the stuff counter more self-contained.
For simplicity, the CLK here is considered as the USB sample point.
Code that is merged with the DPLL Baud further up would add TSW as a CE gate, to get that sample point aligned correctly.
////////////////////////////////////////////////////////////////////////////////
// RR20140310-12 P2 RxUSB instruction
////////////////////////////////////////////////////////////////////////////////
/*---------------------------------------------------------------------------------------------------------------------
RxUSB D, S/# WZ,WC ' Receive single NRZI bit pair, accum CRC and byte, unstuff bits
where
S/# is the PinPair# and Poly bits
S[31..9] = unused
S[8..7] = 00= CRC5 USB (0 2 5)
01= CRC16 USB (0 2 15 16)
10= CRC16 CCITT (0 5 12 16)
11= undefined
S[6..0] = D-/D+ Pin Pair #0..127
The pin pair is always a pair of pins mod 2. ie nnnnnnx where x=0 and x=1 for the pair.
If the pin pair is even (S[0]=0) then J is the lowest pin and K is the higher pin of the consecutive pair
If the pin pair is odd (S[0]=1) then K is the lowest pin and J is the higher pin of the consecutive pair.
This arrangement allows for simple LS and FS by making the pin pair even or odd.
D is the cog register storing a 32 bit field...
D[31..16] = crc16
D[15] = K new pin value
D[14] = J new pin value
D[13..11] = unstuff counter 3 bits
D[10..8] = bit counter 3 bits
D[7..0] = data byte accumulation
Z = data byte ready (8 bits)
C = SE0/SE1
It would be acceptable for D to be at a fixed location eg $1F0.
---------------------------------------------------------------------------------------------------------------------*/
// inputs: D, S, PINS
// outputs: D, Z, C
////////////////////////////////////////////////////////////////////////////////
module RxUSB
(
input CLK,
input Load_d,
input jI, // new J value
input kI, // new K value
input [31:0] s, // S operand
input [31:0] d, // D operand
input wz, // WZ operand
input wc, // WC operand
input [127:0] p, // input pins
output reg [31:0] r, // D result
output reg zz, // Z flag
output reg cy // C flag
);
reg [15:0] crc; // original CRC (accumulated)
reg [2:0] bitcnt; // data bit counter 3 bits
reg k; // K new pin value
reg j; // J new pin value
reg [2:0] stuffcnt; // stuff counter 3 bits
reg [7:0] data; // data byte (accumulated)
reg [1:0] poly; // crc05usb/crc16usb/crc16ccitt/undef polynomial selection
//reg [6:0] pinno; // pin pair numbers 0-127
reg kP; // K previous pin value
reg jP; // J previous pin value
// flags/conditions...
reg crc05usb; // 00= CRC5 USB
reg crc16usb; // 01= CRC16 USB
reg crc16itt; // 10= CRC16 CCITT
reg crc16ndef; // 11= undefined
reg toggle; // data bit 0 or 1
reg BitStuff; // unstuff this bit
reg SE0_SE1; // SE0/SE1 condition
///////////////////////////////////////////////////////////////////////////////
// set crc options
always @(poly) begin
crc05usb = (poly == 2'b00); // CRC5usb =(0 2 5)
crc16usb = (poly == 2'b01); // CRC16usb =(0 2 15 16)
crc16itt = (poly == 2'b10); // CRC16ccitt=(0 5 12 16)
crc16ndef = (poly == 2'b11); // undefined
end
// decode the NRZI data bit, and check SE0/SE1 conditions, and BitStuff condition
always @(*) begin
toggle = !(kI ^ kP); // data bit: USB NRZI sends 1 as no change and 0 as a change (new pin value vs previous pin value)
SE0_SE1 = (kI == jI); // detect SE0/SE1 (j==k)
BitStuff = (!toggle & (stuffcnt == 3'b110) & (crc05usb | crc16usb)); // unstuff this bit - USB only ?
// BitStuff = ( (stuffcnt == 3'b110) & (crc05usb or crc16usb)); // unstuff this bit - USB only ?
end // Counter alone is enough, once have 6, will get a 0, unless we want to preserve 1111111, not used in USB?
///////////////////////////////////////////////////////////////////////////////
// Set Initial conditions
always @(posedge CLK) begin
if (Load_d) begin // write initial values to registers
kP <= d[15]; // previous K
jP <= d[14]; // previous J
stuffcnt <= d[13:11]; // original stuff counter value
bitcnt <= d[10:8]; // original bit counter value
data <= d[7:0]; // original data value (accum)
poly <= s[8:7]; // 00=crc05usb, 01=crc16usb, 10=crc16ccitt, 11=undefined
k <= kI; // new pin value
j <= jI; // new pin value
end
else begin // !Load_d = normal RUN (compiler wants in one block)
// ??? is this correct way around etc ???
k <= kI; // new pin value
j <= jI; // new pin value
kP <= kI; // previous pin value
jP <= jI; // previous pin value
// check for bit unstuff
if (!BitStuff & !SE0_SE1) begin // Collect only valid data bits
bitcnt <= bitcnt+1;
data[6:0] <= data[7:1]; // LSB first - shift right
data[7] <= toggle;
end
if (!toggle | (stuffcnt == 3'b110) ) begin // reset if Din = 0, OR reaches (USB) Threshold.
stuffcnt<= 3'b000;
end
else begin
stuffcnt <= stuffcnt+1;
end
end // Load_d
end
///////////////////////////////////////////////////////////////////////////////
// CRC routine
reg kr0;
reg kr2;
reg kr5;
reg kr12;
reg kr15;
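// calculate the new crc... (decoded values so no overlaps in if)
always @(*) begin
kr0 = 1'b0; kr2 = 1'b0; kr5 = 1'b0; kr12 = 1'b0; kr15 = 1'b0; // defaults, so the fitter cannot infer latches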
if (crc05usb) begin
kr0 = toggle ^ crc[4];
kr2 = toggle ^ crc[4];
kr5 = 1'b0;
kr12 = 1'b0;
kr15 = 1'b0;
end
if (crc16usb) begin
kr0 = toggle ^ crc[15];
kr2 = toggle ^ crc[15];
kr5 = 1'b0;
kr12 = 1'b0;
kr15 = toggle ^ crc[15];
end
if (crc16itt) begin
kr0 = toggle ^ crc[15];
kr2 = 1'b0;
kr5 = toggle ^ crc[15];
kr12 = toggle ^ crc[15];
kr15 = 1'b0;
end
if (crc16ndef) begin
kr0 = 1'b0;
kr2 = 1'b0;
kr5 = 1'b0;
kr12 = 1'b0;
kr15 = 1'b0;
end
end
always @(posedge CLK) begin
if (Load_d) begin // write to reg initial value
crc <= d[31:16]; // original crc value (accum)
end
else if (!SE0_SE1 & !BitStuff) begin // Only valid data
crc[0] <= kr0;
crc[1] <= crc[0];
crc[2] <= crc[1] ^ kr2;
crc[3] <= crc[2];
crc[4] <= crc[3];
crc[5] <= crc[4] ^ kr5;
crc[6] <= crc[5];
crc[7] <= crc[6];
crc[8] <= crc[7];
crc[9] <= crc[8];
crc[10] <= crc[9];
crc[11] <= crc[10];
crc[12] <= crc[11] ^ kr12;
crc[13] <= crc[12];
crc[14] <= crc[13];
crc[15] <= crc[14] ^ kr15;
end
end
///////////////////////////////////////////////////////////////////////////////
// set D results - optional 32 bit pick-off.
always @(*) begin // ??? or @(posedge CLK)
r[31:16] = crc;
r[15] = k;
r[14] = j;
r[13:11] = stuffcnt;
r[10:8] = bitcnt;
r[7:0] = data;
end
// set Z and C flags
always @(posedge CLK) begin
if (wz) begin
if (!BitStuff & (bitcnt == 3'b111)) begin // About to load last bit.. so
zz <= 1'b1; // byte ready
end
else begin
zz <= 1'b0; // byte not ready
end
end
if (wc) begin
cy <= SE0_SE1; // c = SE0/SE1
end
end
endmodule
// Pre/ Post/ Post loaded
// 000 001 1
// 001 010 2
// 010 011 3
// 011 100 4
// 100 101 5
// 101 110 6
// 110 111 7
// 111 000 8
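A bit-serial software model is handy for cross-checking what the CRC register should hold after a known message. The sketch below follows the same shift structure as the CRC routine (feedback = data bit XOR top CRC bit, XORed in at the tap positions). The all-ones preset, the helper names and the test message are assumptions for illustration; in the Verilog the preset is whatever software loads into D[31..16].

```python
def crc_shift(crc, bit, poly, width):
    """One clock of the shift register: feedback = data bit XOR top CRC bit."""
    fb = bit ^ ((crc >> (width - 1)) & 1)
    crc = (crc << 1) & ((1 << width) - 1)
    return crc ^ poly if fb else crc


def crc_bits(bits, poly, width):
    """Run a bit list through the register, starting from an all-ones preset."""
    crc = (1 << width) - 1
    for bit in bits:
        crc = crc_shift(crc, bit, poly, width)
    return crc


def lsb_first(data):
    return [(byte >> i) & 1 for byte in data for i in range(8)]


def msb_first(data):
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]


def reflect(value, width):
    """Reverse the bit order of a width-bit value."""
    return int(format(value, f"0{width}b")[::-1], 2)


POLY_CRC5_USB = 0x05       # taps 0, 2 (+ x^5)
POLY_CRC16_USB = 0x8005    # taps 0, 2, 15 (+ x^16)
POLY_CRC16_CCITT = 0x1021  # taps 0, 5, 12 (+ x^16)

msg = b"123456789"
ccitt = crc_bits(msb_first(msg), POLY_CRC16_CCITT, 16)
# USB sends LSB first; reversing and inverting the register gives the usual CRC-16/USB value.
usb16 = reflect(crc_bits(lsb_first(msg), POLY_CRC16_USB, 16), 16) ^ 0xFFFF
print(hex(ccitt), hex(usb16))
```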
Thanks Sapieha. Now I understand what you mean.
I am writing the instruction using Verilog now. It does show the circuitry required.
BUT it is only Receive part ---> Still need Send part and Hardware drivers for ( j, k ) IN/OUT
Correct, no Tx yet - I think P2 has differential out support now, and the Serdes may/(should?) support packed sends.
If the CRC above can be shared (it could snoop on a Tx stream?), that just leaves bit-stuff to do in SW before starting to send a block.
Receive is a tougher nut to crack, so the focus was on that.
Even doing TxStuff in Verilog is not many gates ( similar to the Rx Side )
Roughly :
// ~~~~~~~~~~~ Stuff counter, INC when sending ones, else clear ~~~~~~~~~~~~~~~~~~~
if (!DataBY[0] | (StuffCtr == 3'b110) ) begin // reset when DSend = 0, OR reaches Threshold.
StuffCtr <= 3'b000;
end
else begin
StuffCtr <= StuffCtr+1;
end
// ~~~~~~~~~~~ Insert 0, or send/Shift data ~~~~~~~~~~~~~~~~~~~
if (StuffCtr == 3'b110) begin
TxT <= !TxT; // toggle = insert send 0, skip TxCount, skip shift DataBY
end
else begin // No insert, normal data send, so INC and Do Shift
if ( !DataBY[0] ) begin // send 0 = toggle, send 1 = hold value on TxT
TxT <= !TxT;
end
BitCtr <= BitCtr + 1;
DataBY <= {Din,DataBY[7:1]}; // LSB first, so shift right (Din enters at the MSB)
end
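If the stuffing is done in SW before sending a block, it amounts to the following. This is a behavioural sketch of the USB rule (insert a 0 after six consecutive 1s), not a translation of the Verilog fragment above; the function names are hypothetical.

```python
def stuff_bits(bits):
    """Insert a 0 after every run of six consecutive 1 bits."""
    out, ones = [], 0
    for bit in bits:
        out.append(bit)
        ones = ones + 1 if bit else 0
        if ones == 6:
            out.append(0)          # stuffed bit forces a line transition
            ones = 0
    return out


def unstuff_bits(bits):
    """Drop the bit that follows every run of six consecutive 1 bits."""
    out, ones, skip = [], 0, False
    for bit in bits:
        if skip:                   # this is the stuffed bit, discard it
            skip, ones = False, 0
            continue
        out.append(bit)
        ones = ones + 1 if bit else 0
        if ones == 6:
            skip = True
    return out


payload = [1] * 8
print(stuff_bits(payload))
print(unstuff_bits(stuff_bits(payload)) == payload)
```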
Sapieha,
As jmg said, we are concentrating on the harder part - the receive end first. But we can also use the same instruction to do the CRC calcs after outputting each bit. So the instruction is quite powerful as it is.
Currently it is also capable of doing CRC16, but it needs a couple of fixes because BiSync/SDLC uses a single bit, not complementary pairs, and it can be NRZ or NRZI.
I'll add in here some test results from another discussion, as this gives a performance reference point of existing USB devices,
and also shows some issues in the details of settings, and sustained speeds, for when testing USB flows on P2.
Testing on more PCs and USB ports, shows some subtle differences on the test PCs
* USB3 ports (blue) seem to sustain higher baud traffic than a 'standard' port (even though both run at 12MHz - maybe larger buffers?)
* Windows Device Settings defaulted to 16ms, and changing to 1ms did help 2MBd on std USB
* 3MBd on USB3 HW was very close to managing reliable streamed duplex.
* Adding a 2nd stop bit seemed to help a little.
* Some failure modes looked a little brutal: at 3MBd giving errors, sometimes the USB VCP vanished from Win8, and did fully restore on unplug/replug. Moving to another PC then back seemed to clear things.
(ie maybe more than just dropped data was going on here)
* TX seemed to never drop, but the receive side seemed to have the issues.
As another reference point, the Silabs CP2130 specs 3.9 and 2.6MBps on read/write, so that does look to be about the duplex limit.
They also give 5.8MBd(W) and 6.6MBd(R) as one-way limits.
Loopback streaming tests, 100000 blocks, with a Frequency counter and Char counter Terminal.
( This terminal has been crafted to have low overhead, and quiet modes, so the PC SW side does not set the ceiling.)
Propeller Project Board Tests FT231X (20p) Loopback
FT231X File of [U......U] Shift-Ctr-V. Right-click Paste.
Block Size  Baud    Set    TxSend  RxBack   FreqAv                          FreqAv
100000      3MBd    n,8,1  100000  99128!*  1.49989M Qm, Overrun errors     1.49985MHz, Overrun errors
100000      2.4MBd  n,8,1  100000  100000   1.00001M Qm                     1000.018kHz Solid  << 2MBd alias
100000      2MBd    n,8,1  100000  100000   1.00001M Qm                     1000.018kHz Solid
100000      1.5MBd  n,8,1  100000  100000   750k quiet mode, less in hex    750.007kHz Solid
100000      1MBd    n,8,1  100000  100000   380~500kHz variable (hex)       500.0062kHz Solid
100000      500kBd  n,8,1  100000  100000   243kHz, sometimes 250kHz (hex)  250.0045kHz Solid
100000      3MBd    n,8,2  100000  99577!*  fewer Overrun errors            1.49985MHz, Overrun errors
100000      3MBd    m,8,2  100000  99949!*  Better Rx Yield, still < 100%
* in 3MBd case, external edge TX count is correct, so it is RX side which is dropping chars
Added:
Same tests, SiLabs CP2105 (ENH) channel, Duplex, Shift-Ctr-V, QuietMode:
(kHz values under 0.5*Baud mean added stop bits)
Block Size | Baud   | Set   | TxSend | RxBack | FreqAv
100000     | 1.2MBd | n,8,1 | 100000 | 100000 | 441.194kHz
100000     | 2MBd   | n,8,1 | 100000 | 100000 | 525.516kHz
100000     | 3MBd   | n,8,1 | 100000 | 100000 | 624.674kHz
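As a rough cross-check of the "kHz under 0.5*Baud means added stop bits" note, the measured FreqAv can be turned into a character rate. This is only a sketch: it assumes the test file is all 'U' (0x55), so every n,8,1 frame is 0101010101 on the wire and contributes exactly 5 cycles to the frequency counter. The function names are illustrative, not from any tool used in the tests.

```python
def chars_per_second(freq_hz):
    """Each 'U' (0x55) frame at n,8,1 is 0101010101 on the wire: 5 full cycles."""
    return freq_hz / 5.0


def line_utilisation(freq_hz, baud, bits_per_frame=10):
    """Fraction of the ideal back-to-back character rate actually achieved."""
    ideal_chars = baud / bits_per_frame
    return chars_per_second(freq_hz) / ideal_chars


# FreqAv values taken from the CP2105 table above.
for baud, freq in [(1.2e6, 441_194), (2e6, 525_516), (3e6, 624_674)]:
    print(f"{baud / 1e6:.1f}MBd: {chars_per_second(freq):9.0f} chars/s, "
          f"{100 * line_utilisation(freq, baud):5.1f}% of ideal")
```

A FreqAv equal to Baud/2 gives 100%, which matches the "Solid" FT231X rows at 2MBd and below.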
It seems the FT231X can sustain 2MBd duplex (with good PC SW), and at 3MBd it can send with no added stop bits, but it stutters a little on 3MBd duplex, on the receive side.
Expanding to 2 stop bits and mark parity both help, but are not quite enough to run duplex without overrun.
(SW works to well above this on a FT232H, but that uses different frame speed and drivers)
I think FTDI have somewhat mangled their baud formula in my data sheet; tests show a more correct description is:
FT231X Virtual Baud Clock of 24MHz, with legal divisors of 8,12,16,17,18,19,20,21...
i.e. from 16 upward, single-integer steps are supported; below 16 it is only 8 and 12.
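The divisor rule above can be sketched in a few lines. The 24MHz clock and the legal divisor set come from the tests; how the driver maps a requested baud onto a legal divisor is NOT known here, so the "snap up to the next legal divisor" step is an assumption chosen only because it reproduces the 2.4MBd -> 2MBd alias seen in the table.

```python
BASE_HZ = 24_000_000  # virtual baud clock observed in the tests above


def legal_divisor(d):
    """Divisors seen to work: 8, 12, and every integer from 16 upward."""
    return d in (8, 12) or d >= 16


def actual_baud(requested):
    """Assumed behaviour: round the ideal divisor, then snap UP to a legal one.

    Snapping up is a guess that matches the 2.4MBd -> 2MBd alias; the real
    driver rounding has not been verified.
    """
    d = max(8, round(BASE_HZ / requested))
    while not legal_divisor(d):
        d += 1
    return BASE_HZ / d


for req in (3_000_000, 2_400_000, 2_000_000, 1_500_000, 1_000_000):
    print(f"requested {req:>9} -> actual {actual_baud(req):>11.1f}")
```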
I think for the final P2 (not the FPGA) there is scope to be able to use this simply with a byte processing loop in another COG task if the P2 clock is a multiple of 12MHz and >=96MHz.
This could be the bitloop.
EDIT: Sorry, I accidentally hit tab + space while typing, which clicked submit, so this posted too soon. I'm still formulating and thinking about this idea. I want to come up with a 6 (or 7?) instruction bit loop and use SYNCTRA, which will wait until the right time to sample without stopping the other hardware task from running. I think I will need to subtract from PHSA somewhere as well, making this 7 instructions. We would still do the 1:8 task allocation to give the byte processor task its time for the packet.
This can help snoop on a 1.5MHz USB, and so open that testing domain, when this gets to real data flows.
Taking a 80MHz FPGA clock, we can get to 1.5MHz on average, with modest jitter.
80/1.5 = 53.3333333333333333
2^32/(80/1.5) = 80530636.8 round(2^32/(80/1.5)) = 80530637
2^32/(round(2^32/(80/1.5))) = 53.3333332008785675
80M/(2^32/(round(2^32/(80/1.5)))) = 1500000.0037252903
1/(ans-1.5M) = 268.435456s of numeric beat error.
Will add 80530637 every clock (1.5/80)
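The arithmetic above can be reproduced with a short Python sketch. It models a 32-bit phase accumulator clocked at 80MHz; the names (SYS_HZ, frq, run_nco) are illustrative only and do not correspond to P2 register names.

```python
SYS_HZ = 80_000_000   # FPGA system clock
USB_HZ = 1_500_000    # low-speed USB bit rate

# Value added to the 32-bit phase accumulator every system clock.
frq = round((1 << 32) * USB_HZ / SYS_HZ)      # 80530637
actual_hz = SYS_HZ * frq / (1 << 32)          # slightly above 1.5MHz
beat_s = 1.0 / (actual_hz - USB_HZ)           # time to slip one full bit


def run_nco(frq, clocks):
    """Count accumulator overflows (one per bit time) over a number of clocks."""
    phase, wraps = 0, 0
    for _ in range(clocks):
        phase += frq
        if phase >> 32:          # carry out of bit 31 marks a bit boundary
            wraps += 1
            phase &= (1 << 32) - 1
    return wraps


print(frq, actual_hz, beat_s)
print(run_nco(frq, 8000))        # 8000 clocks at 80MHz is 100us
```

The rounding error only matters for a free-running counter; once edges re-align the phase, as in the DPLL rule that follows, the slow beat never accumulates.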
Then the upper 2 bits are the fastest /4 case, and the DPLL rule from #118 is along the lines of:
If edge occurs @ MSB = 3 -> No change (add as usual) (1.5/80)
If edge occurs @ MSB = 2 -> Need to advance 1 quadrant, so add 2^32/4 ONCE ( or 2^32/8 twice )
If edge occurs @ MSB = 0 -> Need to retard 1 quadrant, so add 2^32*3/4 ONCE ( or 2^32*3/8 twice )
Edge @ MSB = 1 should never happen during a data stream.
That case could be flagged, and it can add 2^32*2/4 ONCE ( or 2^32*2/8 twice ) to match the Verilog action.
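A minimal software model of that quadrant rule, for illustration only. It assumes the correction is applied in the same clock as the normal add and uses the "ONCE" form of each adjustment; the table and function names are made up for this sketch and are not the Verilog from #118.

```python
MASK = (1 << 32) - 1
QUADRANT = 1 << 30          # 2^32 / 4
FRQ = 80530637              # normal per-clock add for 1.5MHz at 80MHz

# Extra phase added once when an edge is seen, keyed by the top 2 phase bits.
EDGE_CORRECTION = {
    3: 0,                   # edge where expected: no change
    2: 1 * QUADRANT,        # advance one quadrant
    0: 3 * QUADRANT,        # retard one quadrant (+3/4 equals -1/4 mod 2^32)
    1: 2 * QUADRANT,        # should not happen in a data stream; worth flagging
}


def dpll_step(phase, edge_seen):
    """Advance one system clock; return (new_phase, quadrant_at_edge or None)."""
    quadrant = None
    if edge_seen:
        quadrant = phase >> 30
        phase += EDGE_CORRECTION[quadrant]
    phase = (phase + FRQ) & MASK
    return phase, quadrant


# An edge landing in any quadrant pulls the phase back into quadrant 3.
for start in range(4):
    new_phase, q = dpll_step(start * QUADRANT, True)
    print(f"edge in Q{q} -> phase now in Q{new_phase >> 30}")
```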
If a COG is set for 2 threads 50%, we have 2 x 40MHz flows to manage 1.5MBd data.
Code then does
Or there may be time for a way to read PHSx, for all edges not in quadrant 3, and calculate a (double) add to give quadrant 0 next.
just one adjust code block is then needed.
If quadrant3 is the edge value, then quadrant 1 is the sample point ( and Q0,Q2 are the guard bands)
Each thread has 26 thread cycles per bit.
PHSx.MSB can map to a pin, and be used as a sample-and-scope trigger.
The problem with that >=96MHz is that you cannot fully test it in the FPGA, which is close to drop-dead.
Better, I think, to include 48MHz ( & 60MHz & 72MHz & maybe 84MHz ) in the clock targets.
The Auto-pacing (DPLL) code for BYTE level handler, is in #118, and is not large.
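For a feel of the budget at the clock targets listed above, the per-thread cycle count per low-speed bit can be tabulated. This sketch assumes the 2-thread, 50% split and the 1.5MHz bit rate discussed above; the function name is illustrative.

```python
USB_LS_HZ = 1_500_000   # low-speed USB bit rate used for FPGA testing


def thread_cycles_per_bit(sys_hz, threads=2, bit_hz=USB_LS_HZ):
    """System clocks available to each thread during one USB bit time."""
    return sys_hz / threads / bit_hz


for mhz in (48, 60, 72, 80, 84, 96):
    cycles = thread_cycles_per_bit(mhz * 1_000_000)
    print(f"{mhz:3d}MHz: {cycles:5.2f} thread cycles per bit")
```

At 80MHz this gives 26 whole cycles per bit, matching the figure quoted above; only the 12MHz multiples give an integer count.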
Yeah, I know 96MHz is just too fast for the FPGA. Assuming we were to stick to bit processing loops in software, I see a lot of merit in the 1:8 approach with a byte processing task running as well, as it simplifies the software design and decouples the timing-critical bit work from the other (slower) byte-oriented protocol processing work. The problem is that transferring the data and doing the error checking takes time, and I don't see a way to get it down much more given what Cluso has proposed. Now if we come up with more USB extensions beyond RXUSB that provide bytes for us and deal with timing, that could work out nicely as well. I don't think we are there yet, but hopefully we are heading in that direction... it's worth continuing that discussion too.
See Chip's comment in #118 - we are closer than you think - once you have a bit-counter inside the Verilog, then you just need to 'fire' the per-bit Verilog, on a DPLL timer, (#118) and buffer the DataTX.
I'm not sure if CRC needs buffering, or just preservation-care over a SE0 event.
I posted that code's SCH on prop ii blog thread -- none even commented it.
It was too low resolution for me to see clearly, and besides, I can see the fitter equations, which are easier to follow...
It does pay with Verilog (like with most high level languages) to check you got what you expected, and not something else, or logic-bloat.
Chip said there will be instruction space, so no need for a fixed D address.
The exact implementation is up to Chip, I'm just observing that this is more like a Counter or SerDes in operation, than a memory/register.
The Counters use SETxx and GETxx opcodes , which also have a D address.
I do quite like the sound of this adaptive clocking. I need to get my head around it more. What HW changes or other further instructions would be required to support it? How much is done in HW vs software? I see we would use one of the counters, does it need modifications or do all the clocking tweaks adjusting FRQA happen in software?
I think you meant 26 thread cycles per byte, not per bit above. But that is still nice and already gives us at least 3 hub cycles per byte @80MHz in the byte processing task. At 48MHz this drops down to 16 clocks per thread or 2 hub cycles which I think should still be fairly generous.
So what crystal frequency limitations would this overall approach entail? Anything >=48MHz, or do you still need discrete 12MHz multiples above this? I imagine for receiving, if we get aggressive, we could potentially adjust timing after every byte, which probably means we don't want to slip any more than, say, 1/4 bit per 8 bits, right? That is ~3% tolerance. But the transmit adds its own complexity, and I expect we want to be able to transmit accurately at the right bit rate, which then means a 12MHz multiple. How does your design deal with that?
The counter form above is just a skeleton of SW-emulation ideas, of the verilog code in #118, and a means to smart-sample a USB stream at 1.5MHz.
The SW-emulation is looking at ways to test/emulate Verilog ideas, using the FPGA in SW, but at modest USB speeds. Lucky there is the low-speed mode.
In #118 you can see Chip's comment suggests he may fit a custom Baud controller more easily than adding modes to a Timer.
Either way fine, it's whatever is easiest and smallest to include. Separate 8bit Baud Div, frees 32b timers for other tasks.
The Code in #118 is pretty much all you need, just a few lines of Verilog -> Silicon.
There are not many changes once the USB code block includes a BitCtr, and is called once per bit, it's just a matter of do you call it in SW, or use a DPLL as in #118 ?.
No, that is per-bit, - but notice that is for a test version, running at 1.5MHz LO-speed USB, where things will be easier to probe, and hook-into.
At 1.5MHz I think FPGA-P2 can fit one DPLL + Diagnostics thread sampling and checking, and one Thread running the USB-Verilog tests.
Once that looks good, the Verilog DPLL would be added to pace the USB engine, instead of SW calls.
In this form, it's not quite as easy to probe or test, so a mixed SW version (aka Verilog emulation) gives a way to bring this up, and get higher level code working.
Yes, the Baud-DPLL assumes N x 12MHz with N >= 4, and Ok up to 200MHz, and can do low-Speed USB at >200MHz SysClk.
The Code in #118 auto-syncs, so tolerance is not so critical, but you would need a crystal or resonator for timing.
( ie RC osc's are probably off the table)
Chip's Xtal PLL is now any-integer, so that gives a few choices of how to get to 12MHz x N.
I think this is all the hardware that bit-banged send/receive needs.
Look at the attachment.
I think only one more signal is needed: MODE 0/1, which inverts TXD/RXD.
I even have one version that includes NRZI IN/OUT.
Sorry, I don't understand what you are saying/showing.
Hardware between the 2 pins to read/send bit-banged USB --->
Most of it is needed even by other functions that connect to the USB differential pins.
You can't omit this part of the hardware in any type of USB communication.
But your circuit did not show both inputs.
D_n, D_p are lines that Input/Output to D-, D+
nOEi --->select Input/Output to USB
TXD, RXD are the NRZI Output/Input to this circuitry
Here is the circuitry with built-in NRZI.
D_n, D_p are lines that Input/Output to D-, D+
nOEi --->select Input/Output to USB
TXD_di, RXD_do are real bits Output/Input to this circuity
It gives real bits directly on receive, and uses real bits on send.
It needs only one instruction, whose D field can control the special signals --->
some of them are read-only and some are read/write (nOEi, SOE, SEI, SUSPEND, reset_i),
with the port number in field S,
and the bit value sent/received through flag C (maybe shifted IN/OUT directly from the register specified by the RESD instruction).
Included all logic in ONE place (KISS) and made stuff counter more self contained.
For simplicity, the CLK here is considered as USB sample point.
Code that merged with the DPLL Baud further up would add TSW as a CE gate,to get that sample point aligned correctly.
I am writing the instruction using Verilog now. It does show the circuitry required.
Nice code ---> it compiles nicely in my Quartus.
BUT it is only the Receive part ---> it still needs the Send part and hardware drivers for ( j, k ) IN/OUT.
For the part shown in my SCH, I already have Verilog code.
Correct, no Tx yet - I think P2 has differential out support now, and the Serdes may/(should?) support packed sends.
If the CRC above can be shared (it could snoop on a Tx stream?), that just leaves bit-stuff to do in SW before starting to send a block.
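A minimal Python sketch of that software bit-stuff step, assuming the standard USB rule (a 0 is inserted after six consecutive 1s) followed by NRZI encoding (a 0 toggles the line, a 1 holds it). The mapping J = 1, K = 0 and all function names are assumptions for illustration, not P2 code.

```python
def bytes_to_bits(data):
    """USB sends each byte LSB first."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]


def bit_stuff(bits):
    """Insert a 0 after every run of six consecutive 1s."""
    out, ones = [], 0
    for b in bits:
        out.append(b)
        ones = ones + 1 if b else 0
        if ones == 6:
            out.append(0)    # stuffed bit forces a transition on the wire
            ones = 0
    return out


def nrzi_encode(bits, level=1):
    """NRZI: a 0 toggles the line level, a 1 leaves it unchanged.

    level=1 stands for the idle J state (assumed mapping J=1, K=0).
    """
    out = []
    for b in bits:
        if b == 0:
            level ^= 1
        out.append(level)
    return out


# The SYNC byte 0x80 should come out as K J K J K J K K.
sync = nrzi_encode(bit_stuff(bytes_to_bits(b"\x80")))
print("".join("J" if lv else "K" for lv in sync))
```

Since stuffing depends only on the data, a whole block can be stuffed into a buffer before the send starts, which is what makes the SW approach workable.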
Receive is a tougher nut to crack, so the focus was on that.
Even doing TxStuff in Verilog is not many gates ( similar to the Rx Side )
Roughly :
As jmg said, we are concentrating on the harder part - the receive end first. But we can also use the same instruction to do the CRC calcs after outputting each bit. So the instruction is quite powerful as it is.
Currently it is also capable of doing CRC16, but it needs a couple of fixes because BiSync/SDLC uses a single bit, not complementary pairs, and it can be NRZ or NRZI.
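To illustrate the "CRC calc after each bit" idea, here is a bit-serial CRC sketch in Python. It assumes the usual USB parameters (CRC16 polynomial x^16+x^15+x^2+1, CRC5 polynomial x^5+x^2+1, all-ones start value, inverted result, bits processed LSB first); it is a software model only, not the instruction's Verilog, and the names are illustrative.

```python
CRC16_POLY = 0xA001   # x^16 + x^15 + x^2 + 1, bit-reversed for LSB-first use
CRC5_POLY = 0x14      # x^5 + x^2 + 1, bit-reversed for LSB-first use


def crc_update(crc, bit, poly):
    """One step of the CRC register, called once per received or sent bit."""
    if (crc ^ bit) & 1:
        return (crc >> 1) ^ poly
    return crc >> 1


def crc_raw(bits, width, poly):
    """Run the register from the all-ones start value over a bit sequence."""
    crc = (1 << width) - 1
    for b in bits:
        crc = crc_update(crc, b, poly)
    return crc


def crc_bits(bits, width, poly):
    """CRC as transmitted: the final register value, inverted."""
    return crc_raw(bits, width, poly) ^ ((1 << width) - 1)


def residual(bits, width, poly):
    """Register value after the receiver has also clocked in the CRC field."""
    crc = crc_bits(bits, width, poly)
    crc_field = [(crc >> i) & 1 for i in range(width)]
    return crc_raw(list(bits) + crc_field, width, poly)


def to_bits(data):
    return [(byte >> i) & 1 for byte in data for i in range(8)]


print(hex(crc_bits(to_bits(b"123456789"), 16, CRC16_POLY)))
```

The residual is what makes the per-bit approach convenient on receive: after the last CRC bit the register holds the same constant for every error-free packet, so the check is a single compare.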