P2 and full speed USB slave requirements/ideas
rogloh
I've recently been contemplating just what we might be able to achieve regarding a full speed USB slave (12Mbps) on the P2 even if we don't happen to get any new CRC capabilities in hardware or the GETXP pin pair instruction already requested earlier by Cluso. Some of this information might be useful for others, so I decided to start a new thread discussing USB on P2. I really would like to see us achieving all the low level USB functionality in a single COG if possible - that should be the target given we have a ~200MHz processor here. The application actually making use of the USB would have to run in another COG to feed the host data and also consume data sent by the USB host. It may also have to do the control endpoint transfers to set up the USB and report device descriptors etc if that can't be done automatically in the USB COG as well (that part is TBD, though it would be really great if the USB COG could do all the reporting for you automatically, simplifying the client application significantly).
After reviewing both the USB spec and other information online, I've tried to imagine what might be possible using an approach where we have a couple of HW tasks in a COG processing the incoming data at 12Mbps. The first task does all the bit processing and byte/packet framing work and feeds byte wide USB data to the second byte processing task using an internal INDA/INDB wrapping fifo arrangement between tasks. The byte wide processing task is given 1/8th of the CPU timeslots and is aligned to always run on the hub window. It can therefore get access to the hub RAM at any time and if WIDE reads are always used, it will only take a single cycle to get the result. All writes can access the hub in a single cycle.
I've come up with some rudimentary bit processing code in the critical inner loop to see if we have enough cycles to keep up with the incoming USB data and feed it to the other task. From what I have so far it seems it might just be doable. The inner loop code detects the SE0 condition, illegal SE1 and destuffing errors, and does NRZI bit decoding and destuffing as well as writing the resulting bytes to the fifo.
I'm assuming a P2 clock rate of 192MHz (16 x 12M). This is the sweet spot for full speed USB at 12Mbps because it gives 16 instructions between bits and this is also 2 hub windows between bits. As a result we have 14 instructions (7+7) every 16 CPU timeslots to run the bit processing task per USB bit, and in parallel to this 2 instructions every 16 CPU timeslots for the byte processing task (both timeslots are hub cycles). That gives us 24MIPS total to do all the byte processing work. Amongst other things it would need to do the CRC16, PID, endpoint and address checks as well as CRC5 check and hub RAM read/write, ACK/NAK/STALL handshaking etc. Sounds like quite a lot but remember we can read a whole WIDE of 32 bytes in one clock cycle for our endpoint transfers. This fact can help enormously as we can request using 32 byte USB data transfers in our endpoints, so we don't ever have to transfer more than a single WIDE of data per packet. Using 32 byte transfers and double buffering of endpoints should give very good full speed USB throughput. I'd expect performance potentially peaking near 1MB/s for bulk transfers if they get implemented. We can also use the Stack RAM as a CRC16 look up table, a PID checker jump table, and a CRC5 checker with any assigned USB device address and two endpoints all fitting in the 256x32 bit RAM simultaneously.
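The timeslot budget above is easy to check numerically. Here is a small Python sketch, assuming the figures quoted in the post (192MHz clock, 12Mbps bit rate, one hub window every 8 clocks, with the byte task owning the hub slot). The function and key names are made up purely for illustration.

```python
def cycle_budget(sysclk_hz=192_000_000, bit_rate=12_000_000, hub_window=8):
    """Split the timeslots available per USB bit between the two tasks."""
    slots_per_bit = sysclk_hz // bit_rate             # clocks between USB bits
    byte_task_slots = slots_per_bit // hub_window     # one slot per hub window
    bit_task_slots = slots_per_bit - byte_task_slots  # everything else
    return {
        "slots_per_bit": slots_per_bit,
        "bit_task_slots": bit_task_slots,
        "byte_task_slots": byte_task_slots,
        # byte task instruction rate in MIPS
        "byte_task_mips": byte_task_slots * bit_rate / 1e6,
        # byte task instructions available per received byte (8 bits)
        "byte_task_slots_per_byte": byte_task_slots * 8,
    }

if __name__ == "__main__":
    for name, value in cycle_budget().items():
        print(name, value)
```

With the default arguments this reproduces the 14 + 2 split and the 24MIPS figure. Calling it with 96MHz gives 7 + 1 slots per bit, which is the other alignment mentioned later in the thread.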
Below is my example of the inner loop I hacked up for the standard part of the bit processing task just to see what might fit in our budget of 14 cycles per bit. It has to use delayed jumps everywhere so as to always consume a fixed number of clocks per jump to keep the loop at 14 cycles and not add any jitter between bit sampling. It gets a bit hairy requiring reuse of flags and conditionals everywhere and I hope I got the logic correct in the end. The byte processing task is quite different and has to use regular jumps which will always be seen by this task to take a single clock cycle instead of 4 as there will always be 7 more instructions sent to the pipeline from the bit processing task before the byte processing task is run again. This fact simplifies the code in the byte processing task enormously and is a real win. Without that capability I doubt it really could be workable with delayed jumps, and it would burn too many instruction cycles.
bitloop:        MOV   usbpins, pina                ' sample pin data from input port (patched as pina in this case)
                AND   usbpins, USB_PIN_MASK WC WZ  ' mask USB D+, D- pins, Z=1 if SE0, C=0 if both pins high or low, Z=0, C=1 if normal J/K data
  if_z_or_nc    JMPD  #eop_or_se1_error            ' if SE0 or illegal SE1 condition exit loop in 3 more cycles to go handle SE1 error or EOP
  if_c          SUB   lastval, usbpins WZ          ' else if data pins were good, compare against previous value, Z=1 if same, Z=0 if different
                MOV   lastval, usbpins             ' save the current usb pin values for data bit comparison in next iteration of this loop
                MUXZ  data, #$100                  ' copy new data bit into bit8 of our combined data accumulator and 8 bit byte counter
                SHR   destuffing, #1 WC            ' shift destuffing counter down to see if we need to remove this bit (once C=0)
  if_nz         MOV   destuffing, #$3F             ' if we just had a pin transition, reset destuffing counter to allow 6 more ones in a row
  if_z_and_nc   MOV   destuff_error, #1            ' if no transition and we are also destuffing this bit it's a bitstuff error, remember it
  if_c          SHR   data, #1 WC                  ' if we inserted a valid bit to the accumulator, move data down. C=1 when all 8 bits are in
                JZD   destuff_error, #bitloop      ' continue processing bits in the byte unless there is an error, in which case fall through
  if_c          MOV   inda++, data                 ' if we have all 8 bits, write data byte to INDA/INDB fifo for byte processing task to read
  if_c          MOV   data, #$80                   ' and reset data byte to new value, ready for next 8 bits to be loaded
  if_c          ADD   fifo_counter, #1             ' and increment fifo counter to notify other task we have a new byte of data ready in the fifo
bitstuffing_err:
                ' we fall out of inner loop to handle bitstuff error here, look for next EOP etc
eop_or_se1_error:
                ' we check Z=1 for SE0 or Z=0/C=0 for SE1 error and act accordingly once outside the inner bitloop
                ' 1) we need to handle errors accordingly, do EOP processing, resume, suspend, timeouts etc
                ' 2) detect center of first bit transition, detect sync pattern then reinitialize our destuffing counter
                '    back to $3F, clear any destuff error, set lastval of usbpins to the "K" pin value from the sync,
                '    and data accumulator back to $80 before reentering the inner loop above for the next frame
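To sanity check the logic of the inner loop without hardware, it can be modelled in a few lines of Python. This is only a behavioural sketch of the loop as posted, not cycle accurate. The pin encoding (bit0 = D+, bit1 = D-, so J = 0b01 and K = 0b10 at full speed), the one-sample-per-bit input list, and the returned status strings are my assumptions for illustration.

```python
J, K, SE0, SE1 = 0b01, 0b10, 0b00, 0b11   # assumed encoding: bit0 = D+, bit1 = D-

def decode_bits(samples, lastval=K):
    """Model of the bit loop: NRZI decode, destuff, pack bytes LSB first.

    samples holds one pin-pair sample per USB bit time, starting just
    after the sync pattern (so the previous line state is K).
    Returns (received_bytes, status).
    """
    out = bytearray()
    data = 0x80          # marker bit doubles as the 8-bit counter
    destuffing = 0x3F    # six ones allowed before a stuffed bit is due
    for pins in samples:
        if pins == SE0:
            return bytes(out), "eop"
        if pins == SE1:
            return bytes(out), "se1_error"
        same = pins == lastval                          # no transition = logical 1
        lastval = pins
        data = (data & 0xFF) | (0x100 if same else 0)   # MUXZ data, #$100
        keep = destuffing & 1                           # SHR destuffing, #1 WC
        destuffing >>= 1
        if not same:
            destuffing = 0x3F                           # transition resets the run
        elif not keep:
            return bytes(out), "bitstuff_error"         # seventh 1 in a row
        if keep:
            byte_done = data & 1                        # SHR data, #1 WC
            data >>= 1
            if byte_done:
                out.append(data & 0xFF)                 # MOV inda++, data
                data = 0x80
    return bytes(out), "no_eop"

if __name__ == "__main__":
    # 0xFF on the wire: six 1s, a stuffed 0 (transition), two more 1s, then EOP
    print(decode_bits([K] * 6 + [J] * 3 + [SE0]))
```

Unlike the PASM version, the model returns as soon as an error is seen rather than falling through after the delayed jump, which does not change what gets written to the FIFO.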
The byte processing loop needs to read data from the internal FIFO in COG RAM and process it. If you don't keep up with the incoming data the (elastic) COG FIFO can absorb a slight burst of extra bytes, so as long as you can catch up later in the interpacket delay you might be able to get behind temporarily. That being said you still already have 12 more instructions left per byte in the loop to process it, if you can unroll the loop you might get 13 in some places.
wait_data:      JZ    fifo_counter, #wait_data
                MOV   byte, indb++ WC              ' C=1 could indicate an error or the end of the packet etc
                SUB   fifo_counter, #1
                '... process data here - we have up to 12 more instruction slots per byte in this loop for hub accesses, jumping to other parts, CRC validation etc
                JMP   #wait_data
During data transfers I know that you can do a stack RAM table driven CRC16 accumulation in 5 instruction cycles (6 if you share the table with other data). We only need a single cycle for the hub memory write transfer using PTRA++, this hub transfer can be done at the end of the packet if we copy into a wide, and the end of packet exit/error condition could be detected in one more instruction using the Carry bit as a flag passed in bit31. That tally seems to fit in the cycles available so far with some to spare.
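For reference, this is what a byte-at-a-time table-driven CRC16 looks like in Python, using a 256 entry table of the kind that would sit in the stack RAM. It is a sketch of the general technique with the standard USB CRC16 parameters (polynomial x^16 + x^15 + x^2 + 1, processed LSB first, initial value all ones, result inverted), not a translation of any P2 code. The function names are mine.

```python
def make_crc16_table():
    """Build the 256 entry lookup table for the reflected polynomial 0xA001."""
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
        table.append(crc)
    return table

CRC16_TABLE = make_crc16_table()

def crc16_update(crc, byte):
    """One table lookup per received byte."""
    return (crc >> 8) ^ CRC16_TABLE[(crc ^ byte) & 0xFF]

def crc16_usb(data):
    """CRC16 as appended to a USB data packet (sent low byte first)."""
    crc = 0xFFFF
    for b in data:
        crc = crc16_update(crc, b)
    return crc ^ 0xFFFF

def crc16_check(data_and_crc):
    """Receiver side: run the CRC over payload plus the two CRC bytes.

    An error free packet always leaves the fixed residual 0xB001.
    """
    crc = 0xFFFF
    for b in data_and_crc:
        crc = crc16_update(crc, b)
    return crc == 0xB001

if __name__ == "__main__":
    payload = b"123456789"
    crc = crc16_usb(payload)
    packet = payload + bytes([crc & 0xFF, crc >> 8])
    print(hex(crc), crc16_check(packet))
```

The residual check is the useful part for a receive loop: the byte task can accumulate over every byte as it arrives, including the CRC bytes, and do a single compare at end of packet.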
Once an endpoint is identified the total endpoint state information could be read in very quickly from hub RAM. An eight long WIDE per endpoint could easily hold two pointers and data lengths for enabling double buffering, as well as endpoint flags such as enabled state so we know when to ACK/NAK/STALL etc. Some of this endpoint buffer information could be quickly updated back to hub RAM with individual long writes in single cycles when more data has been read from or written to endpoints which can indicate to the application COGs that new data is available, or the data has been consumed etc. Pointers to buffers in hub RAM can be alternated to get double buffering working. There's lots of scope there.
USB gets complicated and there's plenty of other things beyond this simple idea to consider like responding to errors, timeouts, bus turnaround, USB protocol ACK/NAK exchange etc etc etc, and none of this stuff is fully proven, but at least now in my mind I am beginning to feel like there is some hope here for getting a FS USB slave on P2 that can keep up with the incoming data rate assuming we can run the P2 at 192MHz, even without any extra CRC or GETXP stuff.
Your thoughts? Does this type of approach have potential? Is there a fatal flaw that kills it?
Roger.
PS. I think the USB slave on P2 is one very useful thing, but actually having a full speed USB host software implementation for P2 would be great too. We could then attach and use external FS USB devices and operate them on the P2. That's quite a bit of extra protocol work needed there, especially if you wanted to do multiple devices on the bus with hubs etc but one day we might have it working for P2 with any luck. That's the desire anyway.
Update: This link gives a simplified introduction/description of the kinds of things that are involved for USB : http://www.usbmadesimple.co.uk/ums_3.htm
Comments
Don't worry, any inner loop code on tight designs is always a bit hairy
Any Sw approach has potential, but I can see a couple of fishhooks :
a) 192MHz requirement means this cannot be tested on an FPGA and also forces a rather high SysClk on a real device.
The need to FPGA test, is pretty important, and that will likely give added opcodes. Chip is going to look at this.
b) I think USB ideally also needs edge-re-sync handling, (which real devices do) otherwise you risk sampling creep on longer messages. You could specify a close spec crystal, but PCs are pretty slack in their timing precision.
Some RC-Osc USB micros measure the 1ms frames, and resync their clocks to that, but I don't think the P2 has that level of fine-tune on the PLL ?
It may be possible to find a path with some slack time and do a jog-fix in that ?
My thought about the crystal timing was that we would already be resyncing to the center of the bit at the start of each frame, and restricting ourselves to 32 byte transfers so we wouldn't need to remain in sync for very long. This allows a worst case slip of less than 1/2 bit over about (32+4)*8 = 288 bits. That is 1 part in 576 or ~1736 ppm. Our P2 crystal should hopefully be within that tolerance. I know that will still cause bit slippage on any longer data frames, but if we set our endpoint's buffer sizes to be 32 byte frames only these longer packets would be for other devices on the bus (so we ignore them anyway) and we can allow the CRC16 to protect us for that. So we may still be okay there I hope. Also if there are no hubs involved we won't see long frames. We need to ensure that EOP and sync detection is robust enough to eventually recover in case we sampled something right in the middle of the signal transition and falsely interpreted it as an EOP for example.
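The tolerance arithmetic above, written out as a small Python sketch. The 4 byte overhead and the half bit slip limit are the figures from the post; the function name and parameters are just for illustration, and the result is the combined mismatch between the two ends, not a per-crystal spec.

```python
def max_clock_mismatch_ppm(payload_bytes=32, overhead_bytes=4, max_slip_bits=0.5):
    """Largest clock mismatch (ppm) that keeps sampling within max_slip_bits
    of the bit centre over one packet, resyncing only at packet start."""
    packet_bits = (payload_bytes + overhead_bytes) * 8
    ppm = max_slip_bits / packet_bits * 1e6
    return packet_bits, ppm

if __name__ == "__main__":
    bits, ppm = max_clock_mismatch_ppm()
    print(bits, "bits,", round(ppm), "ppm")
```

Bit stuffing can add extra bits on the wire, so the real margin is somewhat tighter than this simple figure.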
As to the design limitation of running at 192MHz I agree that will prevent FPGA testing at lower speeds. The approach above is basically a software only implementation which would target the final P2 and doesn't use additional USB specific hardware or new instructions to be tested. It only makes use of the existing features, that was the intent. Whether it could potentially work or not is mostly what I am interested in understanding.
Thanks for the feedback jmg.
I think we need to get these USB issues understood well before we jump into the SERDES, because this could affect the SERDES, and it would be a shame to get the cart before the horse there.
I'm glad you guys are thinking about this.
Cluso99, apart from the GETXP you've proposed and possible extra CRC blocks which might help validate CRC-5 and/or generate CRC16 on incoming and outgoing data, what else are you interested in for USB? How much makes sense to do in software vs hardware? Remember anything extra dedicated in the COG is going to get replicated 8 times in the device, so it will burn more gates.
A SERDES (hate the word as I keep thinking 8b10b Vitesse SerDes that I once had to work with, prefer programmable USART or some other better generic name for a universal Serial Interface) could help us out if it can get in sync and recover USB clock and put all the destuffed NRZI decoded bytes into a queue for us to read at our leisure and detect EOP and error conditions. But most of that is specific to USB requirements and starts to get quite detailed. If this SERDES block has to do other things like be useful for sustained SPI transfers and other traditional synchronous serial port uses those would probably not require NRZI. Perhaps a 2 or 4 bit clocked transfer mode could be useful for RMII or MII as well if Ethernet PHYs are to be added one day.
IIRC Cluso wanted a bit-pair opcode, that would swallow roughly 5 lines of the above code, and thus speed any SW loops.
That may still not be quite enough to make the FPGA speed threshold, and you can see another roughly 5 lines handling bit-destuff.
If still more speed is needed, choices are then
a) to manage destuff in HW, in which case FPGA speed should be a breeze up to the CRC
b) Perhaps a destuff opcode - which would work like a SHL, but instead of moving all bits left, it would test to see if instead it should skip the added-bit. This opcode would need a 'hidden counter' as checking the shifted-bit-pattern alone would not be enough. Another method would be a dual shift, and use top 7 bits as a FIFO stack, ignored when the lower 8 bits are read.
(ie there is a bit counter, but now it is using upper spare bits, and not hidden)
If b) is used, it still does not need SERDES, but does need 2 new opcodes.
One thing I noticed about Cluso's proposed GETXP is that it doesn't save the raw state of the pins, only accumulates the XOR delta from C. This is fine for normal operation and it can detect the SE0. It can't however detect the (illegal) SE1 error bus condition right away if both data lines lock up and go high. For checking for that you'd need to poll the pins a second time, as GETXP won't give this information. The problem with that is that the pins potentially may have changed from the earlier GETXP if you are getting close to a bit transition. I feel it's better to latch them together at the same point in time as well as write C/Z flags and it also saves another instruction. So instead of the proposed GETXP D WZ WC instruction where D references a pin pair, it might be better to allow a GETXP D, S/# WZ WC version of the instruction where the raw two pins identified by S/# are also recorded in D and can be referenced later if required for checking SE1.
A DeStuffShifter opcode can read CY and write to Z and CY, as you would interleave this with other opcodes.
That may give enough conditions to test ? -If an exception-detect case was important, this could be a SHIFTandJUMP opcode,
as the #param of SHL is not needed, and could be an exit-case address
Is the 7 ones in a row case needed on USB ?
This link says only for High Speed USB ?
http://www.usbmadesimple.co.uk/ums_6.htm
A single sample point sounds good, especially as that gives best jitter tolerance.
I found these notes on AVR SW USB
http://www.obdev.at/articles/implementing-usb-1.1-in-firmware.html
http://en.wikipedia.org/wiki/Bit_stuffing
an option for a bit-destuff opcode would be the 2nd param as count-length, which would cover both USB and HDLC ?
That would exit with CY and Z as the flags for overflow.
Can the loop then become something like
Loop:
PairSampleOpcode
JumpIf_SE0
DS_Shift
jumpif_ERR
Packers
DJNZ BitCtr,Loop
WrByte
WrBitCtr
PairSampleOpcode
JumpIf_SE0
DS_Shift
jumpif_ERR
DJNZ BitCtr,Loop
7 cycles, ~ 84MHz ?
Also with such code working flat out taking in bits when are we going to be able to process the packet? Is another COG going to do this, if so, you still require the hub writes which are not always aligned to the arriving bytes. We'd need HW buffering the incoming bitstream as an elastic fifo to compensate for the hub access jitter. That means sampling at 12Mbps using a recovered clock if we don't do it at each 7 cycles in software at 84MHz (or every 6 at 72MHz). Eight clocks per bit is a nicer number then you can potentially stay synchronized with the hub if HW buffers the bitstream but that is 96MHz.
Another option is to unroll the bit-gathering, and use conditional dec-jump ?
We could make instructions that use a register and C to go between stuffed and unstuffed data. We just need to know what it must do to achieve that. I think making a few instructions for this and CRC would be worthwhile, maybe even rolled together.
I can read the FS USB now using the DE0 but because of being 1 instruction too many in the bit accumulation, I cannot do anything worthwhile. I need the GETXP because that reduces IIRC 3 instructions to 1 and allows a following conditional jump on SE0. SE1 can be tested differently. Bit unstuffing can also be done quite simply. As far as CRC goes, CRC5 is not a real problem as most of it can be precalculated once. For CRC16 I proposed a single CRCBIT instruction. I did ask for the poly to be able to be prespecified in a register such as ACCA but the number of gates is too large to make the CRC this flexible. Therefore I thought for now that the CRC16-IBM be implemented, with perhaps later the CRC16-CCITT to be an alternative option. This covers almost all cases of the CRC16 usage because other variants depend on the initial and final CRC values.
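To illustrate the point about precalculating CRC5: once the device address is assigned, the complete 16 bit token field (address, endpoint, CRC5) for each endpoint is a constant, so the receiver can compare against a stored word instead of computing anything per packet. A Python sketch, using the standard USB CRC5 parameters (polynomial x^5 + x^2 + 1, LSB first, initial value all ones, result inverted). The function names are made up.

```python
def crc5_usb(value, nbits=11):
    """CRC5 over the low nbits of value, taken LSB first as sent on the wire."""
    crc = 0x1F
    for i in range(nbits):
        bit = (value >> i) & 1
        if (crc ^ bit) & 1:
            crc = (crc >> 1) ^ 0x14   # 0x14 is the polynomial 0b00101 bit-reversed
        else:
            crc >>= 1
    return crc ^ 0x1F

def token_word(addr, endp):
    """The 16 bits following the PID of a token packet, as a little-endian word:
    bits 0-6 address, bits 7-10 endpoint, bits 11-15 CRC5."""
    value = (addr & 0x7F) | ((endp & 0x0F) << 7)
    return value | (crc5_usb(value) << 11)

def expected_tokens(addr, endpoints):
    """Precalculate the token words this device should respond to."""
    return {token_word(addr, ep): ep for ep in endpoints}

if __name__ == "__main__":
    # Address 0, endpoint 0 gives the familiar bytes 00 10 after the PID
    print(hex(token_word(0, 0)))
    print(expected_tokens(5, [0, 1, 2]))
```

A received token that does not match one of the stored words is either for another device or corrupted, and in both cases it gets ignored, so no separate CRC5 pass is needed on receive.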
While I work on this, I thought that the SERDES could be done. I want to participate in this because I hope that it can be quite simple so that we may use it to do other things including just sending bit data, output clocks as an option, input bits with/without external clocks, and also be able to chain the 2 SERDES circuits (presuming there will be 2 in a cog). I would like to even be able to chain multiple cogs SERDES if possible. What I am thinking here is the 74LVC595 type circuits to make an intelligent P2 peripheral, etc.
We did achieve some interesting things with the VGA generator as a serial bit stream transmitter, but unfortunately we couldn't get it to receive in a similar fashion - on the P1. Some want to be able to do this on P2 and the new video generator cannot do this.
IMHO I think that the SERDES circuits to properly implement the FS USB is too complex, and too restrictive, to be put into the P2.
DESTUFF D, S/#, WZ
It takes D,S/# and Carry flag as its inputs and first checks if it is allowed to shift D right and write the inverted C bit into D[7] by comparing S/# with an internal counter we maintain. It outputs Z indicating if we inserted the bit or not.
If the inverted C bit value was written to D it sets Z, if it is not allowed and not written it clears Z (or the reverse if easier).
If C = 0 (which meant no change to input pin value, data is logical 1)
If S/# == Counter, we are not allowed to write, Clear Z, don't modify counter, (keep destuffing logical 1's until we get the next 0)
If S/# != Counter, we are allowed to write, Set Z, right shift D 1 bit and write ~C to D[7] and increment counter
If C = 1 (which meant a change to input pin data value, data logical 0)
If S/# == Counter, we are not allowed to write, Clear Z, clear counter
If S/# != Counter, we are allowed to write, Set Z, right shift D 1 bit and write ~C to D[7] and clear counter
We also need a way to ensure the internal counter can be cleared initially. I suggest passing S/#=0 for that purpose or alternatively encode the clear operation in a high bit of S somewhere to avoid calling this instruction twice. S is the number of bits allowed before we have to destuff any logical 1 bit (which is indicated by C=0 if we use GETXP as defined by Cluso). The counter doesn't need to be more than a say 4 or 5 bits wide I imagine. We can probably just use the constant form of S most of the time.
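The four cases above can be written as a small Python model to check that they behave as intended. This is only a sketch of the proposed instruction as described here (a byte wide D, a hidden counter, Z=1 meaning the bit was written); the class and method names are mine, and the clear-by-S=0 idea is replaced by simply constructing a fresh object.

```python
class Destuff:
    """Behavioural model of the proposed DESTUFF D, S/# WZ instruction."""

    def __init__(self):
        self.counter = 0          # hidden count of consecutive logical 1s

    def step(self, d, s, c):
        """Apply one bit. c is the GETXP style flag: 1 = pin change (logical 0).

        Returns (new_d, z) where z = 1 if the bit was shifted into d.
        """
        blocked = (s == self.counter)     # limit reached: this bit is dropped
        if c:
            self.counter = 0              # a logical 0 always ends the run
        elif not blocked:
            self.counter += 1             # another logical 1 accepted
        if blocked:
            return d, 0
        bit = 0 if c else 1               # ~C is the data bit
        return ((d >> 1) | (bit << 7)) & 0xFF, 1

def run(c_flags, s=6):
    """Feed a list of C flags through a fresh DESTUFF and collect Z flags."""
    unit, d, zs = Destuff(), 0, []
    for c in c_flags:
        d, z = unit.step(d, s, c)
        zs.append(z)
    return d, zs

if __name__ == "__main__":
    # 0xFF on the wire: six 1s, a stuffed 0, then two more 1s
    print(run([0, 0, 0, 0, 0, 0, 1, 0, 0]))
```

Note that in the error case (a seventh 1 in a row) the model just keeps reporting Z=0 until a 0 arrives, exactly as the description says, so the caller would need its own way to tell a dropped stuff bit from a stuffing error.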
If D[0] rotates back into D[31] during the shift I guess you could always rotate it back up again at the end and reuse this byte oriented HW for both 16/32 bit variants.
If Chip has half a dozen lines of tested code, to turn into a single opcode, that has a higher chance working first time.
Probably easiest to stick with bytes, as a known std entity ?
Re DESTUFF D, S/#, WZ
A compact option for DeStuff is to have 3 working sub-fields within D, and have the # param as a Jump Address.
Lower 8 bits are for the collected output byte, another field works as a 1's counter (can be a SR) and a final field for #-bits-collected, which incs on each valid bit, and does not inc on skip.
D is loaded with 0000 (or some init value) on loop start, which primes counters and shifters.
The bitcounter is checked each loop, and if it is about to inc to 8, then next clk does the jump, and shifts in the last bit.
Code packs to :
Destuff_d = 0 ' Init all 3 fields and Z
Loop:
JNZ Destuff_ERR ' check if last DeStuff had an error
PairSampleOpcode
JumpIf_SE0
DS_Shift_JN8, Loop 'updates 3 fields, and jumps if Field_N_Bits <> 8 bits, Sets Z on Error.
' exits here on 8 valid data bits
WrByte
- etc
& maybe that opcode can have a Delayed version ?
or, an unrolled version, of opposite logic ? - jumps are now rarely taken (once per byte)
Destuff_d = 0 ' Init all 3 fields and Z
'Start Loop:
PairSampleOpcode
JumpIf_SE0
DS_Shift_JEQ8, ByteDone 'updates 3 fields, and jumps if Field_N_Bits = 8 bits, Sets Z on Error.
JNZ Destuff_ERR ' check if last DeStuff had an error
PairSampleOpcode
JumpIf_SE0
DS_Shift_JEQ8, ByteDone 'updates 3 fields, and jumps if Field_N_Bits = 8 bits, Sets Z on Error.
JNZ Destuff_ERR ' check if last DeStuff had an error
PairSampleOpcode
JumpIf_SE0
DS_Shift_JEQ8, ByteDone 'updates 3 fields, and jumps if Field_N_Bits = 8 bits, Sets Z on Error.
JNZ Destuff_ERR ' check if last DeStuff had an error
.... unroll 10 ? = maximum destuff for 8 bits out
PairSampleOpcode
JumpIf_SE0
DS_Shift_JEQ8, ByteDone 'updates 3 fields, and jumps if Field_N_Bits = 8 bits, Sets Z on Error.
JNZ Destuff_ERR ' check if last DeStuff had an error
PairSampleOpcode
JumpIf_SE0
DS_Shift_JEQ8, ByteDone 'updates 3 fields, and jumps if Field_N_Bits = 8 bits, Sets Z on Error.
JNZ Destuff_ERR ' check if last DeStuff had an error
ByteDone: '8 pin samples with no-destuff, 9 or 10 pin samples with 1.2 Skips
WrByte
- etc
RLC stuffcnt,#6 WZ
You see we already have the bit in C from the previous GETXP, and by shifting this into the stuffcounter, if we get 6 0's in a row, we end up with 36 x 0's less 4 discarded, in the stuffcounter. If the stuff counter is 0 then we need to unstuff, which is all performed by the one RLC and the following IF_Z JMP instruction to unstuff the next bit.
Quite simple really.
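A quick Python model of that trick, assuming the instruction shifts the 32 bit register left by 6 and fills the vacated bits with C (as RCL does on the P1), and that Z reports a zero result. The function name is made up.

```python
MASK32 = 0xFFFFFFFF

def rlc6(stuffcnt, c):
    """Shift stuffcnt left by 6, filling with C. Returns (new_value, z)."""
    fill = 0x3F if c else 0x00
    stuffcnt = ((stuffcnt << 6) | fill) & MASK32
    return stuffcnt, stuffcnt == 0

if __name__ == "__main__":
    cnt = MASK32                      # start from a register full of 1s
    for n in range(1, 7):
        cnt, z = rlc6(cnt, 0)         # C=0 means no pin change, i.e. logical 1
        print(n, hex(cnt), z)
```

After five C=0 bits the top two bits still hold old state, so Z stays clear; the sixth pushes the last of them out and Z is set, flagging that the next bit is the stuffed one. Any C=1 in between puts 1s back into the register and restarts the count.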
It is good that we try to think how to get the best from the HW and It would be nice if we could have a SW USB Host controller.
Hey, but we are in 2014 ! Do we really need to set the CPU to its highest speed (192 MHz) for a 12 Mbps USB 1.1 slave?
USB 1.1 was released in September 1998. That was 16 years ago. USB is now in version 3.0.
Has Parallax ever considered buying ASIC IP for some key protocols (like USB)? USB 3.0 and USB 2.0 maybe would be expensive and cost prohibitive? But USB 1.1? How much does a USB 1.1 IP cost compared to the $5 million investment in this 7 (?) years?
There are free USB 1.1 Slave IP cores (and silicon tested !!!) out there : http://www.asics.ws/fip_sub.html
USB 1.1 Phy -> 111 LUTs in Xilinx Spartan 2
USB 1.1 Device -> 885 LUTs in Xilinx Spartan 2e
It would be a killer feature if we can have a USB 1.1 Slave with the same ease of use as the current Serial transceivers (three instructions: configure, send and receive).
Sounds intriguing, I kinda get what you are doing but it would be good to see the rest of the loop in order to follow it completely. By the way, at say 72MHz how many COGs do you think you need to use to do the FS USB, with Tx and Rx? Is it a single COG or multiple COGs? Do you rely on idle bus time to process the packets?
Roger.
So I don't see any real problems other than whether it requires 1 or 2 cogs. With the additional instructions I am hopeful that I might get it to 1 cog with some restrictions and/or caveats.
The hw should just be 3 resistors and 3 pins. No need for an external hw driver.
Agree it does seem like a highly clocked CPU to do it, but you need a 12MHz+ AVR for doing a bare bones USB at 1.5Mbps with limitations and a single P1 COG at 20MHz can't really keep up with low speed USB either. FS USB is 8 times faster so that brings us into the ballpark of 160MHz on a Prop if we want to do it all in software. Software implementations have been the typical Prop way to do things in the past. Part of the issue is you really need to write the data to the hub eventually and don't want any cycle holdups if you have to sample the USB data bits at precise intervals. 192MHz gives us a sweet spot (96MHz is another one if things can be crunched down more). Also if we can rely on lots of idle time to do the packet processing work there are probably more ways to cut down the number of cycles required, but if you want to work at line rate and be able to sustain it no matter what and you don't use extra hardware or new instructions to accelerate things you will likely need a high clock speed P2 IMO.
Cool. Yeah the actual HW interface should be pretty simple. Why 3 pins, don't we have the ability to enable the pullup on the I/O pin internally or was that left out or the wrong resistance?
It is next all that needs.
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1249284&viewfull=1#post1249284
I mean I don't know/recall the final pullup/pulldown values available in the P2.
I don't know what the comparator outputs when both pins are Low. Can we detect this state reliably? We need that to detect the SE0 (end of packet) state. But I never have seen that a software USB solution detects an SE1 state as an error case inside the bit receive loop.
For differential output on two pins it is as simple as:
XOR OUTx,MaskDmp
where MaskDmp is the pinmask for D- and D+ pins.
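Here is how that toggle maps onto the transmit side encoding, as a Python sketch: a logical 0 flips both pins (J to K or K to J), a logical 1 leaves them alone, and a stuffed 0 is forced after six 1s in a row. The pin encoding (bit0 = D+, bit1 = D-) and the function name are assumptions for illustration; sync, PID and EOP generation are left out.

```python
J, K = 0b01, 0b10          # assumed encoding: bit0 = D+, bit1 = D-
MASK_DMP = 0b11            # both data pins, as in XOR OUTx,MaskDmp

def nrzi_encode(data, state=K):
    """Return the line state for each bit time, LSB of each byte first.

    state is the line state before the first bit (K right after sync).
    """
    out = []
    ones = 0
    for byte in data:
        for i in range(8):
            if (byte >> i) & 1:
                ones += 1               # logical 1: no change on the pins
            else:
                state ^= MASK_DMP       # logical 0: toggle both pins
                ones = 0
            out.append(state)
            if ones == 6:               # six 1s sent: insert a stuffed 0
                state ^= MASK_DMP
                out.append(state)
                ones = 0
    return out

if __name__ == "__main__":
    print(nrzi_encode(b"\xff"))   # six K, then the stuffed transition to J
```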
Andy