Simplify the state machines considerably. Only operate on A input (or B input, if selectable), with the 4 possible combinations directly selecting one of the 4 possible opcodes. No need for X[15:0] or Y[15:0]. No need for state change. This also means, no need for state change bit in the opcodes or state change configuration fields. Repurpose all of that to make the remaining stuff more robust. You might even be able to fully (and easily) reproduce some of the other pin cell modes at that point.
Actually, if you kept the state feature, you could do 8 directly-mapped opcodes and still have 12 bits left over for additional configuration:
X[4:0] opcode for input %00, state 0
X[9:5] opcode for input %01, state 0
X[14:10] opcode for input %10, state 0
X[19:15] opcode for input %11, state 0
X[25:20] UNUSED
X[31:26] existing options
Y[4:0] opcode for input %00, state 1
Y[9:5] opcode for input %01, state 1
Y[14:10] opcode for input %10, state 1
Y[19:15] opcode for input %11, state 1
Y[25:20] UNUSED
Y[31:26] existing options
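To make the layout above concrete, here is a minimal Python sketch of packing and reading those fields. The names `pack_word`, `unpack_word` and `select_opcode` are made up for illustration, the opcode numbers are arbitrary, and the 6-bit "existing options" field is treated as an opaque value.

```python
OPCODE_BITS = 5      # each directly-mapped opcode is 5 bits wide
OPTIONS_SHIFT = 26   # [31:26] holds the existing options

def pack_word(opcodes, options=0):
    """Pack four 5-bit opcodes (inputs %00..%11) plus a 6-bit options field."""
    if len(opcodes) != 4:
        raise ValueError("need exactly four opcodes")
    word = (options & 0x3F) << OPTIONS_SHIFT
    for sel, op in enumerate(opcodes):
        if not 0 <= op < 32:
            raise ValueError("opcode must fit in 5 bits")
        word |= op << (sel * OPCODE_BITS)
    return word          # bits [25:20] stay zero (UNUSED)

def unpack_word(word):
    """Return ([op00, op01, op10, op11], options) from a 32-bit word."""
    opcodes = [(word >> (sel * OPCODE_BITS)) & 0x1F for sel in range(4)]
    return opcodes, (word >> OPTIONS_SHIFT) & 0x3F

def select_opcode(x, y, state, inputs):
    """X serves state 0, Y serves state 1; the 2-bit input picks the field."""
    word = y if state else x
    return (word >> (inputs * OPCODE_BITS)) & 0x1F
```

For example, `select_opcode(x, y, 1, 0b10)` reads Y[14:10].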
Maybe you could have the last state, when it advances, be an implied 'update'. That would enable the opcodes to be only four bits, instead of five.
What to make is interesting to think about. FPGAs are built from elemental blocks, but in a case like this you could have 16-bit up/down counters and 16-bit adders, treated as single elements.
Another thought on the simplified state machine version. Instead of using [A-previous, A-current] for a selector, use [B-current, A-current]. Then, add an option (or pin mode) where B-current is muxed to A-previous! That way, you can have your state machine either operate on a single pin, where the pin's state change is important. Or you can operate on two pins, where the current pin combination is important.
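A rough Python model of that selector idea, just to show the two cases side by side. `selector`, `run` and the `single_pin` flag are hypothetical names, not anything from the actual design.

```python
def selector(a_now, b_now, a_prev, single_pin):
    """2-bit selector [B-current, A-current]; in single-pin mode the
    B-current slot is fed from A-previous instead."""
    high = a_prev if single_pin else b_now
    return (high << 1) | a_now

def run(a_samples, b_samples, single_pin):
    """Feed sample streams through the selector, one value per clock."""
    out = []
    a_prev = a_samples[0]
    for a, b in zip(a_samples, b_samples):
        out.append(selector(a, b, a_prev, single_pin))
        a_prev = a
    return out
```

In single-pin mode the selector values %01 and %10 mark rising and falling edges on A; in two-pin mode the selector is simply the current B:A combination.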
I agree. I started making new 6-bit mode codes, with room for word-size control for the serial modes, and I removed the programmable mode, already.
If you were to go with a single simplified state machine mode and move the "with output" to a configuration bit in X or Y, that would free up three pin modes. Would that allow you to stay at 5 bits?
No, because four serial codes needed eight variations each for word length, making 32 codes just for serial.
Also, I got rid of the two bits at 30 and 29. Now, when a smart mode is selected, the pin-level DIR is always high. If you want an input, set HHHLLL to 111111 for float (no output).
Do you think that USB analyzer would be particularly helpful, or would a scope be sufficient? Maybe no need to spend $400.
I would use both. You can start with a scope, but scopes are only good if you know where to look, and with unexpected issues you do not know where to look.
A multi-channel digital storage scope with some trigger help done in a 'clipped on COG' should be a (minimal) starting point.
This should be enough to allow checking a NCO-lock sampling clock in Sync-Mode.
If a function is expected to provide a response to a host transmission, the maximum inter-packet delay for a function or hub with a detachable (TRSPIPD1) cable is 6.5 bit times
The bold text, I believe, defines the software pinch point for full-speed USB, does it not?
We have up to 7.5 bit periods (625ns, or 50 clocks at 80MHz) to get a response headed back to the host.
This may be tight, especially if CRC checking is involved. Is this realistic to do?
Well found ! I figured that Turn-Around figure was going to be pivotal (but hard to nail down).
I would base the calcs on 48MHz and 6.5 gives 26 cycles.
Sounds like CRC needs to be running in hardware ?
This pivotal number also means Pins will have to manage the bit-level stuff.
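The two budgets quoted above are easy to reproduce. A small Python sketch (the function name is made up) that converts bit times at the 12 Mbps full-speed rate into nanoseconds and system clocks:

```python
def turnaround_budget(bit_times, bit_rate_hz, clk_hz):
    """Return (nanoseconds, system clocks) available for the turnaround."""
    nanoseconds = bit_times * 1e9 / bit_rate_hz
    clocks = bit_times * clk_hz / bit_rate_hz
    return nanoseconds, clocks

# 7.5 bit times at 80 MHz, then the more conservative 6.5 bit times at 48 MHz
for bit_times, clk_hz in ((7.5, 80e6), (6.5, 48e6)):
    ns, clocks = turnaround_budget(bit_times, 12e6, clk_hz)
    print(f"{bit_times} bit times @ {clk_hz / 1e6:.0f} MHz: {ns:.1f} ns, {clocks:.0f} clocks")
```

This gives 625 ns and 50 clocks for the first case, and about 541.7 ns and 26 clocks for the second.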
"If the endpoint has no data to send or is not yet ready to send data, the device can send a NAK handshake packet. The host retries the IN transfer until it receives an ACK packet from the device. That ACK packet implies that the device has accepted the data. "
You may be able to buy some time with a NAK, but that looks to force a repeat, which you need to at least parse to know when to do the eventual 'I'm-finally-ready' ACK.
ie that need to parse the retry, in parallel with any slower SW, means you buy little real time.
This NAK kludge also halves the USB bandwidth, and may also limit packet sizes, & will certainly make debug much harder.
I see now what needs to be done. It's pretty straightforward.
Is it appropriate to leave all CRC computation to the cog? I think someone said that they had that working in 5 clocks per byte on Prop2 Hot, which means 10 clocks on this architecture.
It looks like sending can be done by writing 9-bit values via PINSETY, where if bit 8 is cleared, it means data, and if bit 8 is set it means just do the SE0 for two clocks, then J, then quit driving the two lines (EOP, or end-of-packet). These 9-bit transmit values will be double-buffered, so that they transmit back-to-back without delays. IN will signal when it can accept another byte.
Receiving is a little more complex, because some status must be conveyed at times when there's no data.
Maybe receive-data/report-line-status can be the default state and when you want to transmit, you just do two initial PINSETY's to get things started and double-buffered, and you give another byte on every IN high. Then, when you give it a bit-8-high value it will do the EOP signaling. After that, it returns to receive-data/report-line-status mode. This smart pin mode will need to control two pins' pin-level DIR's.
Does this sound viable? Did I miss anything?
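As a sanity check on the 9-bit transmit idea, here is a minimal Python model. The helper names and the event strings are invented for illustration; the example bytes are a full-speed SYNC (0x80) followed by an ACK PID (0xD2).

```python
EOP_FLAG = 1 << 8   # bit 8 set: no data, do the end-of-packet signaling

def to_tx_values(packet_bytes):
    """Turn packet bytes into 9-bit values, ending with one EOP marker."""
    return [b & 0xFF for b in packet_bytes] + [EOP_FLAG]

def line_events(tx_values):
    """Expand 9-bit values into what the pin pair would do, in order."""
    events = []
    for value in tx_values:
        if value & EOP_FLAG:
            # SE0 for two bit clocks, then J, then stop driving both lines
            events += ["SE0", "SE0", "J", "RELEASE"]
            break
        events.append(("DATA", value & 0xFF))
    return events

print(line_events(to_tx_values([0x80, 0xD2])))
```

The cog would hand these values over one at a time, each time IN signals that the double buffer has room.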
Hard to say, some things you may have inferred ?
I cannot see mention of Bit-Stuff/UnStuff, or mention of edge DPLL for nominally centre-locked sampling.
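For reference, the bit-stuff rule is: after six consecutive 1 bits the transmitter inserts a 0, and the receiver drops it again. A short Python sketch of both directions, working on the logical bit stream before NRZI encoding (function names are illustrative):

```python
def stuff(bits):
    """Insert a 0 after every run of six consecutive 1s."""
    out, ones = [], 0
    for b in bits:
        out.append(b)
        ones = ones + 1 if b else 0
        if ones == 6:
            out.append(0)
            ones = 0
    return out

def unstuff(bits):
    """Remove stuffed 0s; a 1 where a stuffed 0 is due is a bit-stuff error."""
    out, ones, expect_stuffed = [], 0, False
    for b in bits:
        if expect_stuffed:
            if b:
                raise ValueError("bit-stuff error: seven 1s in a row")
            expect_stuffed = False
            continue
        out.append(b)
        ones = ones + 1 if b else 0
        if ones == 6:
            expect_stuffed = True
            ones = 0
    return out
```

A hardware version would be a 3-bit counter in the shift path, but the rule is the same.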
I agree. I started making new 6-bit mode codes, with room for word-size control for the serial modes, and I removed the programmable mode, already.
Sounds good.
One Comment: Size control packed into 3 bits drops one standard UART size (drops 2 if you expect SW to do Parity and Stop bit control ), so Length really needs to be 5 bit field.
When talking to MCUs, you can expect systems to have any mix of Parity and stop bit options.
There is a discussion on an Atmel forum about a 14b SPI device, and if BitBang is the only solution. Someone mentions the 32b parts have 8-16b length choices. Infineon has 1-63b.
this mentions The SOF packet consisting of an 11-bit frame number is sent by the host every 1ms ± 500ns on a full speed bus
so they guarantee MAX of one part in 2000, but should be much better than that.
Some USB-locked MCUs claim ~0.1% when sync'd to this (usually their trim LSB)
A P2 COG should be able to Sync & check this periodic time to < 1ppm precision.
IIRC I measured around 230ppm of absolute error, which is probably PC clock related.
Addit: Just did a quick Freq Ctr check on a PC connected device but inactive USB with LPF triggering.
1-1k/(250.04574*4) = 182.926 ppm high on this PC (or ~one part in 5467)
Seems stable - some of the wider variances are
1-250.04587/250.04568 = 0.759ppm
1-250.04596/250.04568 = 1.119ppm
I think that gives an NCO field of 644362944, to give an average 12M*(1+182.926u) sample clock from 80.000M.
With that, a P2 COG and modest code should be able to snoop & calibrate, then capture the raw SOF packets, and see the incrementing 11-bit frame number & CRC changes.
Calibrate over 10 1ms frames gives 1.25ppm LSB on capture.
Capture of just the raw SOF packets needs less precision, as they are so short, but the idea of calibrate and capture is to allow larger message receives too, & have the sniffer-COG as a useful Locked Frame Capture instrument.
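The numbers above can be checked with a few lines of Python. This assumes a 32-bit NCO accumulator that adds the field once per 80 MHz system clock; the function name is made up.

```python
def nco_field(target_hz, sysclk_hz, bits=32):
    """Phase increment so the accumulator rolls over at target_hz on average."""
    return round(target_hz * (1 << bits) / sysclk_hz)

# Offset from the frequency counter reading (250.04574 Hz x 4 vs nominal 1 kHz)
ppm = (1 - 1000 / (250.04574 * 4)) * 1e6

# Field for a sample clock that tracks this particular PC
field = nco_field(12e6 * (1 + ppm * 1e-6), 80e6)

# Resolution of a calibration that counts 80 MHz clocks over ten 1 ms frames
lsb_ppm = 1e6 / (80e6 * 10e-3)

print(f"{ppm:.3f} ppm, field {field}, LSB {lsb_ppm} ppm")
```

That reproduces about 182.93 ppm, a field of 644362944, and a 1.25 ppm LSB.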
Can't we just reset the phase of our receiver's bit-period NCO on each transition that comes in? When no transitions come in, we just use NCO rollover for our sample clock (plus one half period for bit center).
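A quick Python simulation of that resync scheme, as a sketch only. It assumes a 32-bit NCO clocked at 80 MHz, and it realizes "plus one half period" by restarting the phase at the half-way point on every edge, so each rollover lands near bit centre. It recovers raw line levels; NRZI decoding and unstuffing would come after. All names are illustrative.

```python
MASK32 = (1 << 32) - 1

def oversample(levels, bit_hz, clk_hz):
    """One line level per bit in, one sample per system clock out."""
    n = len(levels) * clk_hz // bit_hz
    return [levels[i * bit_hz // clk_hz] for i in range(n)]

def recover(samples, nco_field):
    """Sample the line at NCO rollover, resyncing the phase on every edge."""
    phase, prev, locked, bits = 0, samples[0], False, []
    for s in samples:
        if s != prev:            # transition: restart half a bit before centre
            prev, locked, phase = s, True, 1 << 31
            continue
        if not locked:           # ignore the idle line before the first edge
            continue
        phase += nco_field
        if phase > MASK32:       # rollover marks the sample point
            phase &= MASK32
            bits.append(s)
    return bits

levels = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
field = (12_000_000 << 32) // 80_000_000
print(recover(oversample(levels, 12_000_000, 80_000_000), field))
```

With no transitions (the run of three 1s), the free-running rollover still lands inside each bit, which is the point of the question.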
Yes, of course, when you have that Verilog written, you sync on every available USB edge.
I'm more thinking about ways the present Verilog code can be quickly used in a Logic Analyzer Cog, and still get useful operation, and cross-checking of upcoming Verilog.
A COG example that did DPLL calibrate and then capture of USB, would also make a good reference example, for NCO and Sync-Shifters working together.
This bus turnaround time is 625ns, or 25 two-clock instructions at 80MHz. Do you think that is insufficient time to formulate a response?
A cog would really have to babysit a USB connection.
Biting off too much at such a late stage, imho. Cluso has wanted a couple of helper instructions for bit-bashing USB, that should be enough for this attempt.
@Chip,
Yes the Rx COG would need to babysit the USB for sure, remember though, we can try to parse the USB packet and do the CRC on the fly as it arrives so by the end of the arriving packet much of the knowledge needed is likely to be there and we won't need much longer to be able to reply back with a streamed response. If it is just an ACK or NAK of an endpoint channel to be sent that is simple and the COG can be aware of the readiness of Hub data by polling hub memory periodically between arriving bytes. We can also have access to precomputed USB device/configuration descriptors ready to be sent back on command from hub memory. The first few response bytes of the frame (like sync/pid etc) can also be available in internal COGRAM before we even need to worry about starting the CRC16 on any available hub data. The time seems tight but hopefully, and without coding up a full PASM implementation in new P2 opcodes it will be hard to know for sure, we should have some good chance with a 100MIP processor (or two if we decouple TX&RX) to throw at the problem.
If we at least get clock recovery, packet delineation/error conditions and hopefully bit unstuffing done for us in hardware my best guess is we would have some chance to achieve full speed 12Mbps in a P2 COG or two. We may need to consider the CRC5 portion in HW also if we find that is required to meet the response time, however once you know your address and endpoint, the CRC5 is static for most packet types and could potentially be precalculated for checking against on a device. Table driven CRC16 algorithms can make good use of the 256 entry stack RAM. Remember the incoming/outgoing half-duplex byte rate is only 1.5MB/s so byte-by-byte table indexed lookups for CRC16 in internal stack memory won't be too challenging for a COG.
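To illustrate the table-driven CRC16 mentioned above, here is a Python sketch using the standard USB CRC16 parameters (polynomial 0x8005 in bit-reflected form 0xA001, all-ones start value, result inverted). The 256-entry table stands in for what would live in the stack/LUT RAM; this is not the PASM code, just the same algorithm.

```python
def make_crc16_table():
    """Build the 256-entry table for the reflected polynomial 0xA001."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
        table.append(crc)
    return table

CRC16_TABLE = make_crc16_table()

def crc16_usb(data):
    """One table lookup per byte; only the low 8 bits index the table."""
    crc = 0xFFFF
    for byte in data:
        crc = (crc >> 8) ^ CRC16_TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFF

print(hex(crc16_usb(b"123456789")))
```

The per-byte work is one XOR to form the index, one lookup, one shift and one XOR, which is why a handful of instructions per byte is plausible.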
Of course we would also like the P2 to be able to act as a host and a device but much of that is just software implementation detail apart from the clocking needs. The ability to both host USB keyboards/mice/USB sticks and other common devices etc, and have the P2 behave as a serial port (like FTDI) or mass storage adapters etc to another host would be really useful for the P2, especially if we don't need to add additional host processors or extra FTDI chips etc to designs.
Roger.
PS. I am glad you began to read the USB spec Chip. I know when I dug into this all a couple of years back I was turned off it too and avoided it like the plague initially, but like I found, once you just start with that usbmadesimple site link and figured it out and then move into the intricate low level details in the standard it slowly starts to make some sense. Just don't worry so much about all the descriptor software handshaking etc - that is all handled by the application code side. I also found that understanding the low speed USB software implementation on the AVR micro controllers helped see what was needed to come up with a hybrid type of approach amenable to the prop. But you do need to be in a very receptive state of mind after some strong coffee etc to not get too bogged down by this standard, and it includes a bunch of high speed stuff too that can be ignored.
I will try to implement the transceiver in hardware. The cog will be able to get status, set bus states, and send and receive bytes. I don't know if the CRC5 should be done in hardware, but if it's small, we could do it. Like you said, software has to pick it up from there. Thanks for your encouraging words, and everyone else's.
This bus turnaround time is 625ns, or 25 two-clock instructions at 80MHz. Do you think that is insufficient time to formulate a response?
A cog would really have to babysit a USB connection.
Yes, it certainly needs quite close watching, but the response is really just ACK or NACK to the (Checksum==OK), on RX, provided the next packet has a clean place to land.
If there is a NACK, the current buffer has to be re-used for the retry.
ie the 'formulate a response' is less of the formulate, and more of the 'load the pipe'
.... Table driven CRC16 algorithms can make good use of the 256 entry stack RAM. Remember the incoming/outgoing half-duplex byte rate is only 1.5MB/s so byte-by-byte table indexed lookups for CRC16 in internal stack memory won't be too challenging for a COG.
I was wondering if the CRC can be a parallel/serial adjunct to the COG-Pin serial link.
That needs just 16 copies, not 64, and CRC is naturally serial anyway.
Not sure of the Logic cost, or if this can be read/checked within the tight time-budget.
@jmg, I will try to dig up the code for table CRC16 I believe I had identified in an old P2 hot related post. If we can make use of that it would alleviate extra HW burden of needing any CRC engines for USB, as I think even the CRC5 could really be table driven (though perhaps from deeper HUB RAM) once you know your endpoint/address etc. It's only the SOF packet every 1ms where the CRC5 varies the most - this could be precomputed in a table and indexed by the 11 bit frame number for both host/device operations. For the other packet types, the CRC5 is static per address and endpoint for each PID type - as a device we already know our endpoints, and our address is allocated by the host. So at this point we may be able to generate the expected CRC5's in software or from some fixed table in hub RAM based on the 127 possible combinations of addresses. As a host, before communicating with the endpoints of devices we know the address we will allocate and the new device's endpoints we are initializing, so we can precompute all the CRC5's we have to send out to a device before we begin regular operation communicating to it. Easy... I hope.
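A sketch of the precompute idea in Python. The CRC5 here uses the USB token parameters (polynomial x^5 + x^2 + 1, all-ones start, result inverted), with the 11 data bits fed in least significant bit first as they appear on the wire. The returned value is in shift-register orientation, which goes out most significant bit first. Function and table names are made up.

```python
def crc5_usb(value11):
    """CRC5 over 11 bits (frame number, or ADDR in bits 6:0 with ENDP in bits 10:7)."""
    reg = 0x1F
    for i in range(11):
        bit = (value11 >> i) & 1
        feedback = bit ^ (reg >> 4)
        reg = (reg << 1) & 0x1F
        if feedback:
            reg ^= 0x05
    return reg ^ 0x1F

# Table for SOF packets, indexed by the 11-bit frame number
SOF_CRC5 = [crc5_usb(frame) for frame in range(2048)]

def token_crc5(addr, endp):
    """Static CRC5 for a given address/endpoint pair."""
    return crc5_usb((addr & 0x7F) | ((endp & 0x0F) << 7))

print(bin(token_crc5(0x15, 0xE)), bin(SOF_CRC5[0x710]))
```

A 2048-entry table of 5-bit values is small enough for hub RAM, and a device only needs one `token_crc5` result per endpoint once its address is assigned.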
It's only the SOF packet every 1ms where the CRC5 varies the most - this could be precomputed in a table and indexed by the 11 bit frame number for both host/device operations.
I don't think a response is needed to this SOF, so is a CRC5 check even strictly needed ?
Something else to consider:
I just saw in a USB document that the pullups must be 1.5 kOhm +- 1%. I'm pretty sure that this is exaggerated. But if I remember correctly the pullups are now 1k and 10k, while they were 1.5k and 15k on the P2hot.
Why has this changed? Just to get nicer values or is 1.5 kOhm not possible?
I fear 1k will not work reliably for USB; 10k is no problem.
Andy
I was thinking about that today. I did change those values to be in decades. I wonder if it's worth forcing the 1k to 1.5k, instead of just having an external resistor, just for USB. I could add a special 1.5k resistor to the pad. Also, I should increase the drive strength of the I/O pads in order to get the impedance down to the required 40, or so, ohms.
@jmg I finally found this old post I did that had the sample code for stack table driven CRC on P2 hot. It was actually for CRC32 on Ethernet but I believe the same general approach can apply for CRC16 accumulation if required.
Yes you are right about the SOF - no response from the device. It would only be the host that generates it that would really need the CRC5 for each frame number. The device would only have to check it if it really wanted a reliable last SOF frame number, which in most applications it probably doesn't care about too much anyway. So CRC5 is probably not a big issue to deal with IMHO.
If you have an external 1.5k resistor then you often need to switch it with an additional pin. If the pullup is already present when the 3.3V supply is on, but booting needs a second or two until the USB firmware is ready to respond, then the PC thinks the USB device does not work correctly and you always get a Messagebox.
The pullup really needs to be made switchable; with an internal one this will not use any additional pins.
Andy
Then, we should add one to each low-level pin. That layout is coming together right now, so it's not too late. It would be a special resistor that is controlled exclusively by the USB mode in the smart pin.
I was looking at the new P2 instructions and the new RDLUT actually saves us an instruction compared to P2 hot, so my CRC table lookup accumulation code reduces down to 4 instructions per byte (within the existing USB byte processing loop). So I suspect this adds up to 8 clocks per byte if each instruction takes 2 clock cycles on the new P2. If true, then for full speed USB the COG doing software CRC needs 12MHz of the CPU or 6MIPs which for a P2 clocked at (say) 96MHz would then consume 12.5% of the COG's instruction bandwidth per streamed USB byte. This overhead is not too bad and should leave plenty for the remaining processing. If it can be interleaved during hub transfers, it should work out very well.
PS. This code uses a trick where it assumes address foldover, with RDLUT using only the 8 LSBs of the source address register. I suspect that is still the case on P2, but if not it will require additional instructions.
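The overhead estimate above works out as follows. This is just the arithmetic in Python, with the instruction count, two clocks per instruction and the 96 MHz clock taken as the assumptions stated in the post:

```python
def crc_cpu_share(instr_per_byte, clocks_per_instr, bytes_per_sec, sysclk_hz):
    """Return (clocks per byte, clocks per second, fraction of the cog used)."""
    clocks_per_byte = instr_per_byte * clocks_per_instr
    clocks_per_sec = clocks_per_byte * bytes_per_sec
    return clocks_per_byte, clocks_per_sec, clocks_per_sec / sysclk_hz

per_byte, per_sec, share = crc_cpu_share(4, 2, 1_500_000, 96_000_000)
print(per_byte, per_sec, f"{share:.1%}")
```

So 8 clocks per byte at 1.5 MB/s is 12 MHz worth of clocks, or 6 MIPS, which is 12.5% of a 96 MHz cog.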
I found more info here (the source of the NAK handshake quote above): https://msdn.microsoft.com/en-us/library/windows/hardware/ff539199(v=vs.85).aspx
ie Strange sizes pop up on a regular basis (re the serial word-length field).
SOF timing reference (1ms ± 500ns on a full speed bus): http://www.beyondlogic.org/usbnutshell/usb3.shtml
I'm assuming this 80MHz reference is for proving out USB on the FPGA. Do you have an estimate of what the final fMax will be?
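For reference, the turnaround budget scales directly with the clock. The sketch below just restates the arithmetic from the question above; the 160 MHz figure is an illustrative value for a faster clock, and the function name is hypothetical.

```python
def instruction_budget(turnaround_ns, f_sys_mhz, clocks_per_instr=2):
    """Whole instructions that fit in the turnaround window."""
    clocks = turnaround_ns * f_sys_mhz // 1000
    return clocks // clocks_per_instr


budget_80 = instruction_budget(625, 80)     # the case asked about above
budget_160 = instruction_budget(625, 160)   # illustrative faster clock
bit_times = 625 * 12 / 1000                 # same window in 12 Mbit/s bit times

print(budget_80, budget_160, bit_times)
```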
Yes, the Rx COG would need to babysit the USB for sure. Remember, though, that we can parse the USB packet and do the CRC on the fly as it arrives, so by the end of the packet much of the knowledge needed is likely to be there and we won't need much longer to reply with a streamed response. If only an ACK or NAK of an endpoint channel has to be sent, that is simple, and the COG can stay aware of the readiness of hub data by polling hub memory periodically between arriving bytes. We can also keep precomputed USB device/configuration descriptors in hub memory, ready to be sent back on command. The first few response bytes of the frame (sync, PID, etc.) can also be held in internal COGRAM before we even need to worry about starting the CRC16 on any available hub data. The time seems tight, and without coding up a full PASM implementation in the new P2 opcodes it will be hard to know for sure, but we should have a good chance with a 100 MIPS processor (or two, if we decouple TX and RX) to throw at the problem.
If we at least get clock recovery, packet delineation/error conditions and hopefully bit unstuffing done for us in hardware my best guess is we would have some chance to achieve full speed 12Mbps in a P2 COG or two. We may need to consider the CRC5 portion in HW also if we find that is required to meet the response time, however once you know your address and endpoint, the CRC5 is static for most packet types and could potentially be precalculated for checking against on a device. Table driven CRC16 algorithms can make good use of the 256 entry stack RAM. Remember the incoming/outgoing half-duplex byte rate is only 1.5MB/s so byte-by-byte table indexed lookups for CRC16 in internal stack memory won't be too challenging for a COG.
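As a reference model for the table-driven approach, here is a sketch of the USB CRC16 (generator x^16 + x^15 + x^2 + 1, register preset to all ones, result complemented) using a 256-entry table and one lookup per byte. It is plain Python, not PASM, and says nothing about how the table would actually be laid out in a COG; the function names are illustrative.

```python
def make_crc16_table(poly_reflected=0xA001):
    """256-entry table for the LSB-first form of x^16 + x^15 + x^2 + 1."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly_reflected if crc & 1 else crc >> 1
        table.append(crc)
    return table


CRC16_TABLE = make_crc16_table()


def crc16_usb(data, table=CRC16_TABLE):
    """One table lookup per byte; preset to all ones, complemented at the end."""
    crc = 0xFFFF
    for byte in data:
        crc = (crc >> 8) ^ table[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFF


def crc16_usb_bitwise(data):
    """Bit-serial version, for cross-checking the table."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc ^ 0xFFFF


check = crc16_usb(b"123456789")
print(hex(check))
```

The per-byte work is one XOR, one 8-bit index, one table read and one shift-and-XOR, which is the part that would have to fit between bytes arriving at 1.5 MB/s.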
Of course we would also like the P2 to be able to act as a host and a device but much of that is just software implementation detail apart from the clocking needs. The ability to both host USB keyboards/mice/USB sticks and other common devices etc, and have the P2 behave as a serial port (like FTDI) or mass storage adapters etc to another host would be really useful for the P2, especially if we don't need to add additional host processors or extra FTDI chips etc to designs.
Roger.
PS. I am glad you began to read the USB spec, Chip. When I dug into all this a couple of years back I was turned off by it too, and initially avoided it like the plague. But as I found, once you start with that usbmadesimple site link, figure that out, and then move into the intricate low-level details in the standard, it slowly starts to make some sense. Just don't worry so much about all the descriptor software handshaking etc.; that is all handled on the application code side. I also found that understanding the low-speed USB software implementation on the AVR microcontrollers helped me see what was needed to come up with a hybrid type of approach amenable to the Prop. But you do need to be in a very receptive state of mind, after some strong coffee, to not get too bogged down by this standard, and it includes a bunch of high-speed stuff that can be ignored.
The chip will be specified at 160MHz, at least.
I will try to implement the transceiver in hardware. The cog will be able to get status, set bus states, and send and receive bytes. I don't know if the CRC5 should be done in hardware, but if it's small, we could do it. Like you said, software has to pick it up from there. Thanks for your encouraging words, and everyone else's.
Yes, it certainly needs quite close watching, but the response is really just ACK or NACK to the (Checksum==OK), on RX, provided the next packet has a clean place to land.
If there is a NACK, the current buffer has to be re-used for the retry.
ie the 'formulate a response' is less of the formulate, and more of the 'load the pipe'
That needs just 16 copies, not 64, and CRC is naturally serial anyway.
Not sure of the Logic cost, or if this can be read/checked within the tight time-budget.
Well, then that is really easy!
I don't think a response is needed to this SOF, so is a CRC5 check even strictly needed?
I just saw in a USB document that the pullups must be 1.5 kOhm +- 1%. I'm pretty sure that this is exaggerated. But if I remember correctly the pullups are now 1k and 10k, while they were 1.5k and 15k on the P2hot.
Why has this changed? Just to get nicer values or is 1.5 kOhm not possible?
I fear 1k will not work reliably for USB; 10k is no problem.
Andy
I was thinking about that today. I did change those values to be in decades. I wonder if it's worth forcing the 1k to 1.5k, instead of just having an external resistor, just for USB. I could add a special 1.5k resistor to the pad. Also, I should increase the drive strength of the I/O pads in order to get the impedance down to the required 40, or so, ohms.
http://forums.parallax.com/discussion/comment/1168959/#Comment_1168959
Yes you are right about the SOF - no response from the device. It would only be the host that generates it that would really need the CRC5 for each frame number. The device would only have to check it if it really wanted a reliable last SOF frame number, which in most applications it probably doesn't care about too much anyway. So CRC5 is probably not a big issue to deal with IMHO.
ps. this link is useful to understand which fields go into the crc and the generator polynomials etc.
http://www.usb.org/developers/whitepapers/crcdes.pdf
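A bit-serial sketch of the token CRC5 (generator x^5 + x^2 + 1, register preset to all ones, result complemented), for anyone who wants to check values in software. It covers the 11 bits that precede the CRC in a token: either address plus endpoint, or the SOF frame number. The function names are illustrative, and the returned value is shown with its most significant bit being the first CRC bit on the wire.

```python
def crc5_usb(field, nbits=11):
    """CRC5 over an 11-bit token field, fed in LSB first as USB transmits it."""
    reg = 0x1F
    for i in range(nbits):
        bit = (field >> i) & 1
        top = (reg >> 4) & 1
        reg = (reg << 1) & 0x1F
        if top ^ bit:
            reg ^= 0x05                   # low bits of x^5 + x^2 + 1
    return reg ^ 0x1F                     # MSB of the result is sent first


def token_crc5(addr, endp):
    """CRC5 for a 7-bit address followed by a 4-bit endpoint number."""
    return crc5_usb((addr & 0x7F) | ((endp & 0xF) << 7))


def sof_crc5(frame_number):
    """CRC5 for an 11-bit SOF frame number."""
    return crc5_usb(frame_number & 0x7FF)


fixed = token_crc5(0x15, 0xE)   # static once address and endpoint are known
sof1 = sof_crc5(0x001)
print(hex(fixed), hex(sof1))
```

Since a device's address and endpoint numbers are fixed after enumeration, `token_crc5` only needs to be evaluated once per endpoint and the result compared against incoming tokens, while a host generating SOFs would need `sof_crc5` every frame.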
If you have an external 1.5k resistor then you often need to switch it with an additional pin. If the pullup is already present as soon as the 3.3V supply is on, but booting takes a second or two until the USB firmware is ready to respond, then the PC thinks the USB device is not working correctly and you always get a message box.
The pullup really needs to be made switchable, with an internal one this will not use any additional pins.
Andy
Then, we should add one to each low-level pin. That layout is coming together right now, so it's not too late. It would be a special resistor that is controlled exclusively by the USB mode in the smart pin.
PS. this code uses a trick where it assumes address foldover and the RDLUT only using the 8 LSBs of the source address register. I suspect that is still the case on P2, but if not it will require additional instructions.
That's fantastic, Rogloh!
I won't worry about that in hardware, then.