SmartPin USB (was: SmartPin state machine modes) - Page 7 — Parallax Forums



Comments

  • roglohrogloh Posts: 2,957
    edited 2016-02-20 06:20
    Yep, in general I think the things we mostly need in HW to support processing full-speed USB are clock recovery, frame delineation, bus states/control, and bit stuffing/unstuffing; then we can work at the byte level and there is a reasonable chance all the PASM software gurus could get the rest done in a COG or two. Andy also mentioned the pull-up resistor control too.
  • jmgjmg Posts: 14,595
    edited 2016-02-20 06:21
    cgracey wrote: »
    Then, we should add one to each low-level pin. That layout is coming together right now, so it's not too late. It would be a special resistor that is controlled exclusively by the USB mode in the smart pin.
    If you make the user pullup 1.5k, is that not going to achieve the same thing? Or was that what you meant?

    It should be possible to select either of LS & FS USB modes.


  • Cluso99Cluso99 Posts: 17,466
    We don't need CRC5 because almost all of it is precalculated once.

    But depending on how much can be done in the smart pin hw, the bit loop might become simpler. Otherwise, the bit loop is required to accumulate the bit stream and perform bit unstuffing, which is also required before CRC16 can be calculated.

    Some implementations of USB use lookup tables for the CRC16, so it's certainly doable. In fact, one of the P1 implementations does this.
    But beware, the loop is sufficiently tight that 96MHz is required for FS on a P1 and EVERY clock is used. Turnaround is handled by a separate cog for FS.

    There is a trick for stuffing/unstuffing. But if the speed is not 96MHz then the bits need to be handled in groups so the variation in the bit-times averages out.
  • roglohrogloh Posts: 2,957
    edited 2016-02-20 06:53
    I think bit stuffing/unstuffing starts to add a lot of overhead to the COG processing loop for full speed USB. Maybe it is doable but it means we have to respond much faster to arriving data bits at a rate of 12Mbps instead of 1.5Mbps, do some shifting/checking etc and still do all the regular operations in between including hub reads and CRC. With stuff/unstuff dealt with in HW we can always work with bytes to parse and there will be more time for everything. I suspect it might even be possible then to combine TX/RX into a single COG which would be awesome.
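For anyone following along, the stuff/unstuff rule under discussion (insert a 0 after six consecutive 1's on transmit, drop it again on receive) is easy to model in software. A minimal Python sketch, purely illustrative of what the proposed smart-pin hardware would do:

```python
def bit_stuff(bits):
    """Insert a 0 after every run of six consecutive 1 bits (USB TX rule)."""
    out, ones = [], 0
    for b in bits:
        out.append(b)
        if b == 1:
            ones += 1
            if ones == 6:       # six 1's in a row: force a transition
                out.append(0)
                ones = 0
        else:
            ones = 0
    return out

def bit_unstuff(bits):
    """Drop the stuffed 0 that follows every run of six 1 bits (RX rule)."""
    out, ones, skip = [], 0, False
    for b in bits:
        if skip:                # this is the stuffed bit: discard it
            skip, ones = False, 0
            continue
        out.append(b)
        if b == 1:
            ones += 1
            if ones == 6:
                skip = True
        else:
            ones = 0
    return out

assert bit_stuff([1] * 7) == [1, 1, 1, 1, 1, 1, 0, 1]
assert bit_unstuff(bit_stuff([1] * 7)) == [1] * 7
```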
  • cgraceycgracey Posts: 13,403
    edited 2016-02-20 07:03
    rogloh wrote: »
    I was looking at the new P2 instructions and the new RDLUT actually saves us an instruction compared to P2 hot, so my CRC table lookup accumulation code reduces down to 4 instructions per byte (within the existing USB byte processing loop). So I suspect this adds up to 8 clocks per byte if each instruction takes 2 clock cycles on the new P2. If true, then for full speed USB the COG doing software CRC needs 12MHz of the CPU or 6MIPs which for a P2 clocked at (say) 96MHz would then consume 12.5% of the COG's instruction bandwidth per streamed USB byte. This overhead is not too bad and should leave plenty for the remaining processing. If it can be interleaved during hub transfers, it should work out very well.
    XOR CRC, USBBYTE
    RDLUT TEMP, CRC
    SHR CRC, #8
    XOR CRC, TEMP
    

    PS. this code uses a trick where it assumes address foldover and the RDLUT only using the 8 LSBs of the source address register. I suspect that is still the case on P2 but if not it will require additional instructions.

    The LUT address is 9 bits, so it will require an instruction. No big deal, though.
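rogloh's four-instruction sequence above is the classic reflected table-driven CRC byte update. A Python model of the same data flow (using reflected CRC-32, poly 0xEDB88320, as the example polynomial - an assumption for illustration, since the thread doesn't spell out the table contents):

```python
# Build the 256-entry LUT for a reflected CRC-32 (poly 0xEDB88320, assumed).
TABLE = []
for i in range(256):
    c = i
    for _ in range(8):
        c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
    TABLE.append(c)

def crc_update(crc, byte):
    crc ^= byte              # XOR   CRC, USBBYTE
    t = TABLE[crc & 0xFF]    # RDLUT TEMP, CRC  (only low bits index the LUT)
    crc >>= 8                # SHR   CRC, #8
    return crc ^ t           # XOR   CRC, TEMP

crc = 0xFFFFFFFF
for b in b"123456789":
    crc = crc_update(crc, b)
# crc ^ 0xFFFFFFFF is 0xCBF43926, the standard CRC-32 check value
```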
  • cgraceycgracey Posts: 13,403
    edited 2016-02-20 07:04
    rogloh wrote: »
    I think bit stuffing/unstuffing starts to add a lot of overhead to the COG processing loop for full speed USB. Maybe it is doable but it means we have to respond much faster to arriving data bits at a rate of 12Mbps instead of 1.5Mbps, do some shifting/checking etc and still do all the regular operations in between including hub reads and CRC. With stuff/unstuff dealt with in HW we can always work with bytes to parse and there will be more time for everything. I suspect it might even be possible then to combine TX/RX into a single COG which would be awesome.

    Bit stuffing/unstuffing should be done in hardware, as it's really tedious and needs to be fast.
  • jmgjmg Posts: 14,595
    rogloh wrote: »
    I think bit stuffing/unstuffing starts to add a lot of overhead to the COG processing loop for full speed USB. Maybe it is doable but it means we have to respond much faster to arriving data bits at a rate of 12Mbps instead of 1.5Mbps, do some shifting/checking etc and still do all the regular operations in between including hub reads and CRC. With stuff/unstuff dealt with in HW we can always work with bytes to parse and there will be more time for everything. I suspect it might even be possible then to combine TX/RX into a single COG which would be awesome.
    Yes, ideally the SW works on Bytes, and bit-level handlers are in the Pin Cells.
    Following from UART and SPI modes, I'd think one COG and (hopefully just) 2 Pin Cells are needed.
    Chip has said Paired Pins can be Mapped, which means similar logic, but hard-compiled for Tx in one pin, and Rx in the other.
    The sample NCO (DPLL) would be fed from Rx into Tx cell.

  • roglohrogloh Posts: 2,957
    edited 2016-02-20 07:37
    cgracey wrote: »

    The LUT address is 9 bits, so it will require an instruction. No big deal, though.

    It looks like the table-driven CRC-16 is different from CRC-32 anyhow, so it might not be the exact same sequence of instructions above unless there are simple mathematical translations between them. But I believe a similar 256-entry lookup-table-based approach can apply. Even if it takes 2-3x the instructions shown above, I think once we can work at the byte level things should be safe as far as processing bandwidth goes.

    eg. some sample C code for it was this...
    while (len--){
      crc = (crc << 8)  ^ crctable[((crc >> 8)  ^ *c++)];
    }
    

    This inner loop takes 7 instructions in PASM, plus one further instruction after the end of the loop to clear out the upper 16 bits of CRC, and maybe another to invert the result before checking, if that is required (sometimes CRCs are inverted). Though we could use either CMP or XOR to test the result, so probably no need to worry about that inversion too much.
    MOV TEMP, CRC
    SHR TEMP, #8
    AND TEMP, #$FF ' for using lower 256 entries of LUT RAM
    XOR TEMP, USBDATA
    RDLUT TEMP, TEMP
    SHL CRC, #8
    XOR CRC, TEMP
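The C loop above is the standard MSB-first table-driven form. A Python sketch that builds the table and cross-checks it against a plain bit-at-a-time implementation (poly 0x8005 is USB's CRC-16 generator, though USB actually clocks the CRC LSB-first, so a real implementation would likely use the reflected variant; the MSB-first form is used here only to mirror the C code shown):

```python
POLY = 0x8005  # USB CRC-16 generator: x^16 + x^15 + x^2 + 1

def make_table(poly=POLY):
    """256-entry table for the MSB-first form used in the C loop above."""
    table = []
    for i in range(256):
        c = i << 8
        for _ in range(8):
            c = ((c << 1) ^ poly) & 0xFFFF if c & 0x8000 else (c << 1) & 0xFFFF
        table.append(c)
    return table

TABLE16 = make_table()

def crc16_table(data, crc=0xFFFF):
    # Direct translation of: crc = (crc << 8) ^ crctable[(crc >> 8) ^ *c++]
    for b in data:
        crc = ((crc << 8) & 0xFFFF) ^ TABLE16[((crc >> 8) ^ b) & 0xFF]
    return crc

def crc16_bitwise(data, crc=0xFFFF):
    """Reference bit-at-a-time implementation for cross-checking."""
    for b in data:
        crc ^= b << 8
        for _ in range(8):
            crc = ((crc << 1) ^ POLY) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

assert crc16_table(b"123456789") == crc16_bitwise(b"123456789")
```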
    
  • cgraceycgracey Posts: 13,403
    edited 2016-02-20 07:34
    jmg wrote: »
    rogloh wrote: »
    I think bit stuffing/unstuffing starts to add a lot of overhead to the COG processing loop for full speed USB. Maybe it is doable but it means we have to respond much faster to arriving data bits at a rate of 12Mbps instead of 1.5Mbps, do some shifting/checking etc and still do all the regular operations in between including hub reads and CRC. With stuff/unstuff dealt with in HW we can always work with bytes to parse and there will be more time for everything. I suspect it might even be possible then to combine TX/RX into a single COG which would be awesome.
    Yes, ideally the SW works on Bytes, and bit-level handlers are in the Pin Cells.
    Following from UART and SPI modes, I'd think one COG and (hopefully just) 2 Pin Cells are needed.
    Chip has said Paired Pins can be Mapped, which means similar logic, but hard-compiled for Tx in one pin, and Rx in the other.
    The sample NCO (DPLL) would be fed from Rx into Tx cell.

    I just got stumped.

    Will I have to make the receiver notice the KKKKKKJ preamble to start receiving data, and then read in some number of bytes back-to-back?

    How do I know how many bytes to receive, back to back? Is it implied for tokens and embedded for data packets?

    I'm thinking that there's no immediate indication from DP/DM that the data stream finished.

    Wait... I look for the EOP, right?
  • roglohrogloh Posts: 2,957
    edited 2016-02-20 07:57
    Yes I believe you would look for the EOP to know when you are done.

    Ideally it is dealt with in HW, but there might be some scope for COG assistance in detecting the start-of-packet condition, as the COG won't be doing a whole lot while it is waiting for something to initially arrive from the bus. So there might be a way for it to help out at the bit level, detecting the sync pattern and starting things rolling. Maybe we can alternate between seeing bit transitions during IDLE/EOP changes and byte processing in between, with a bit8 or bit31 flag set by the Smartpin notifying us what is arriving: either unstuffed packet bytes or raw pin states?
  • cgraceycgracey Posts: 13,403
    So, the receiver logic waits for seven K's (transitions) on DP/DM at the bit period, and then knows that bytes will stream until it sees an EOP right after what was the last byte.

    Is that it, aside from the bit unstuffing?

    If the receiver-alert pattern of KKKKKKKJ is recognized, it goes into receive mode until the EOP arrives. When it's not in this mode, or sending mode, it could alert when DP/DM change between IDLE and other states.
  • AribaAriba Posts: 2,440
    Chip

    Yes, EOP signals the end.
    The "USB in a Nutshell" document describes the initialization traffic quite well (better than I can).
  • @Chip, From the USB2.0 spec (see section 7.1.10)

    "The SYNC pattern used for low-/full-speed transmission is required to be 3 KJ pairs followed by 2 K’s for a total of eight symbols. Figure 7-35 shows the NRZI bit pattern, which is prefixed to each low-/full-speed packet."
  • Cluso99Cluso99 Posts: 17,466
    The code I have seen only looks for 4x K states and then waits for the next J state. Any less than 4x K and the sequence resets again. The EOP is signalled by SE0 IIRC. An SE1 is an abort code. After six consecutive ones (no change in bus state for six bit times) the following bit must be a transition (a stuffed bit) and should be ignored for both data and CRC.
  • roglohrogloh Posts: 2,957
    edited 2016-02-20 09:37
    I think you might be talking about decoded 0s and 1s instead of J's and K's, Cluso. According to Figure 7-32 in the spec, the sync pattern is actually 00000001 on the wire in that order, which equates to KJKJKJKK I believe.

    "The USB employs NRZI data encoding when transmitting packets. In NRZI encoding, a “1” is represented by no change in level and a “0” is represented by a change in level. Figure 7-31 shows a data stream and the NRZI equivalent. The high level represents the J state on the data lines in this and subsequent figures showing NRZI encoding. A string of zeros causes the NRZI data to toggle each bit time. A string of ones causes long periods with no transitions in the data."

    This is why we need bit stuffing if we send a whole bunch of 1's: it ensures that we still get transitions on the bus for clock recovery even if the long packet payload is all 1's.
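That mapping is easy to confirm with a few lines of Python (illustrative only): NRZI sends a transition for a 0 and holds the line for a 1, so 00000001 starting from idle J comes out as KJKJKJKK.

```python
def nrzi_encode(bits, idle='J'):
    """USB NRZI: 0 -> toggle the line, 1 -> hold it."""
    level, out = idle, []
    for b in bits:
        if b == 0:
            level = 'K' if level == 'J' else 'J'
        out.append(level)
    return ''.join(out)

def nrzi_decode(symbols, prev='J'):
    """Inverse mapping: same level as before -> 1, changed level -> 0."""
    out = []
    for s in symbols:
        out.append(1 if s == prev else 0)
        prev = s
    return out

# SYNC is 00000001 on the wire; from idle J that is KJKJKJKK
assert nrzi_encode([0, 0, 0, 0, 0, 0, 0, 1]) == 'KJKJKJKK'
assert nrzi_decode('KJKJKJKK') == [0, 0, 0, 0, 0, 0, 0, 1]
```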
  • roglohrogloh Posts: 2,957
    edited 2016-02-20 11:04
    One more thing Chip, and you would probably know this already - if you find there is too much logic going into the Smartpins for USB use such as clocking/state detection/bitstuffing/unstuffing etc, perhaps it might be possible to sit some of the additional USB logic somewhere in between a Smartpin and the COG using it (but on the COG side), as we would only ever need 16 instances of any USB functionality (at most one per COG) - and no sane COG would ever think to run multiple full speed USB instances.

    What I am saying is there may not be a need to add full USB functionality to every single Smartpin, it is more a per COG function using a nominated pin pair or pair of Smartpins it can control and receive pin state data from. Maybe that approach makes it harder to design in and control, I don't know, but it could help save logic resources if that becomes a problem.

    This idea may also benefit from a WAITUSB and POLLUSB type of instruction where the COG can check for incoming USB bytes/events from the nominated Smartpins, perhaps indicated by some prior SETUSB instruction (which could also possibly select full or low speed operation, some clocking source and whether to enable the pullup in order to configure full/low speed operation and connect/disconnect from the bus as a device, or the pull downs for the host.)
  • evanhevanh Posts: 10,423
    cgracey wrote: »
    Ariba wrote: »
    Something else to consider:
    I just saw in a USB document that the pullups must be 1.5 kOhm +- 1%. I'm pretty sure that this is exaggerated. But if I remember correctly the pullups are now 1k and 10k, while they were 1.5k and 15k on the P2hot.

    Why has this changed? Just to get nicer values or is 1.5 kOhm not possible?
    I fear 1k will not work reliably for USB; 10k is no problem.

    Andy

    I was thinking about that today. I did change those values to be in decades. I wonder if it's worth forcing the 1k to 1.5k, instead of just having an external resistor, just for USB. I could add a special 1.5k resistor to the pad. Also, I should increase the drive strength of the I/O pads in order to get the impedance down to the required 40, or so, ohms.

    I doubt 1k will be a problem functionally. Its purpose is to indicate the speed mode, which won't be affected. The only other concerns then become power consumption and driver strength. Neither should be a biggie; it's not a major variation.
  • roglohrogloh Posts: 2,957
    edited 2016-02-20 10:15
    rogloh wrote: »

    This inner loop takes 7 instructions in PASM, plus one further instruction after the end of the loop to clear out the upper 16 bits of CRC, and maybe another to invert the result before checking, if that is required (sometimes CRCs are inverted). Though we could use either CMP or XOR to test the result, so probably no need to worry about that inversion too much.
    MOV TEMP, CRC
    SHR TEMP, #8
    AND TEMP, #$FF ' for using lower 256 entries of LUT RAM
    XOR TEMP, USBDATA
    RDLUT TEMP, TEMP
    SHL CRC, #8
    XOR CRC, TEMP
    

    Using some fancy new P2 instructions we may be able to merge the first 3 instructions above by just using GETBYTE TEMP, CRC, #1 and if so, we are down to 5 instructions again from 7. Nice.
  • cgracey wrote: »
    Ah, I see now what she meant. In our case, the shift register IS the delay line, if you think about it.

    Yes the result is the same.

    An 8-bit (16-bit, or 32-bit) shift register clocked at 8x (16x, or 32x) the desired sample frequency (F) can be used the same as an 8-tap (16-tap, or 32-tap) delay line with equally spaced propagation delays of period T/8 (T/16, or T/32), sampled at the same frequency (F).

    We actually don't care whether it is implemented with a shift register or a delay line. The only concern is to have this clock synchronized to the input signal.

    And this leads to these questions:

    - Would it be possible to have clock recovery on smart pins?

    - And it would be possible too to use this recovered clock as input clock for the system (for a crystal-less device)?

    No need to have 16x clock recovery circuits for all smart pins. With only 1 clock recovery circuit near the XI input pair, it would be possible to have a cheap crystal-less system. Though, if you are able to fit a clock recovery circuit in every smart pin, that will be awesome!!
  • Cluso99Cluso99 Posts: 17,466
    Rogloh,
    I need some free time to re-read my notes. I dug deep into it a few years ago, first for the P1, and subsequently for the Old P2.

    What I came up with was a way to combine a number of the instructions in the bit read loop.
    The same could be done to calculate the crc16. A couple of simple instructions plus a tiny bit of logic would make a significant improvement to the bit loop. The instructions could be simple (no D field) using the C or Z flag as the receive bit, and accumulate to one of the internal registers such as the X or Y register. This way logic might be minimal.

    I agree, some logic could be in the cog which would reduce to only 16 copies.

    I will draw it out in the next few days and see what helps. Just by being able to invert one of the pin pairs helps.
  • cgraceycgracey Posts: 13,403
    rogloh wrote: »
    One more thing Chip, and you would probably know this already - if you find there is too much logic going into the Smartpins for USB use such as clocking/state detection/bitstuffing/unstuffing etc, perhaps it might be possible to sit some of the additional USB logic somewhere in between a Smartpin and the COG using it (but on the COG side), as we would only ever need 16 instances of any USB functionality (at most one per COG) - and no sane COG would ever think to run multiple full speed USB instances.

    What I am saying is there may not be a need to add full USB functionality to every single Smartpin, it is more a per COG function using a nominated pin pair or pair of Smartpins it can control and receive pin state data from. Maybe that approach makes it harder to design in and control, I don't know, but it could help save logic resources if that becomes a problem.

    This idea may also benefit from a WAITUSB and POLLUSB type of instruction where the COG can check for incoming USB bytes/events from the nominated Smartpins, perhaps indicated by some prior SETUSB instruction (which could also possibly select full or low speed operation, some clocking source and whether to enable the pullup in order to configure full/low speed operation and connect/disconnect from the bus as a device, or the pull downs for the host.)

    Certainly, we only need USB on every odd pin. Maybe the even pins could have 10Base-T.
  • cgraceycgracey Posts: 13,403
    This USB smart pin mode is not very complex or big. I almost have the receiver worked out.
  • cgraceycgracey Posts: 13,403
    rogloh wrote: »
    rogloh wrote: »

    This inner loop takes 7 instructions in PASM, plus one further instruction after the end of the loop to clear out the upper 16 bits of CRC, and maybe another to invert the result before checking, if that is required (sometimes CRCs are inverted). Though we could use either CMP or XOR to test the result, so probably no need to worry about that inversion too much.
    MOV TEMP, CRC
    SHR TEMP, #8
    AND TEMP, #$FF ' for using lower 256 entries of LUT RAM
    XOR TEMP, USBDATA
    RDLUT TEMP, TEMP
    SHL CRC, #8
    XOR CRC, TEMP
    

    Using some fancy new P2 instructions we may be able to merge the first 3 instructions above by just using GETBYTE TEMP, CRC, #1 and if so, we are down to 5 instructions again from 7. Nice.

    Super!
  • cgraceycgracey Posts: 13,403
    Ramon wrote: »
    ...And this leads to these questions:

    - Would it be possible to have clock recovery on smart pins?

    - And it would be possible too to use this recovered clock as input clock for the system (for a crystal-less device)?

    No need to have 16x clock recovery circuits for all smart pins. With only 1 clock recovery circuit near the XI input pair, it would be possible to have a cheap crystal-less system. Though, if you are able to fit a clock recovery circuit in every smart pin, that will be awesome!!

    This presents a chicken-and-egg dilemma, as there needs to be a clock before the smart pins can operate. Remember that everything that goes into or comes out of a smart pin is clocked by the system clock.
  • Bob Lawrence (VE1RLL)Bob Lawrence (VE1RLL) Posts: 1,718
    edited 2016-02-20 17:49
    Comment moved to : USB helper instruction - P2 Possible additional Instructions ???
  • cgracey wrote: »
    This presents a chicken-and-egg dilemma, as there needs to be a clock before the smart pins can operate. Remember that everything that goes into or comes out of a smart pin is clocked by the system clock.

    Yes, an internal system clock will always be needed, and it will always dictate when input/output is triggered (started), but the sampled input data (or the output data frequency) does not necessarily need to be at the internal system clock.

    I am considering 2 scenarios: one that is truly related to smart pins, and another that has more to do with the internal oscillator. Let's start with the latter:

    1) Someone wants to use the P2 internal RCFAST or RCSLOW modes but wants higher accuracy and/or temperature stability. It so happens that the device has pins connected to USB, ethernet, S/PDIF, a GPS 1pps signal (sounds weird?)... or any other external interface with an accurate or stable clock.

    Obviously the system will always start in internal RC mode, but eventually the system clock will be able to synchronize to the external frequency, if available (then an 'External_PLL_LOCKED' status can be used to know whether we are actually locked to that frequency).

    This scenario only uses the smart pins to route the input signal from one pin to somewhere near the internal RC oscillator. It is in this area where the clock recovery circuit will 'lock' (synchronize) the RCFAST (or RCSLOW) internal oscillator to that external signal.

    2) The second scenario is intended for high-speed communication. It is here that the smart pins actually need a clock recovery circuit. But don't worry, we don't need 64 clock recovery circuits.

    Consider a typical scenario where there are two differential pins for input and two other differential pins for output. Then no more than 16 circuits are needed if four pins share a clock recovery circuit, which is a pretty common case. That number also matches the number of COGs, but that is just a coincidence; we could live with only one or two circuits.

    In this scenario we do clock recovery on the input signal (input pins) to feed the smart pin input shift register clock (where the data is being received). This shift register is buffered to parallelize the data into a BYTE, WORD, or LONG. And this buffer is clocked with the internal system clock so we can transfer the shift register values into a HUB/COG (?) BYTE, WORD or LONG.

    This way the input data can be sampled at a different frequency from our system clock. The shift register buffer will always send the data towards the HUB/COG at the system clock rate.

    Transmission (TX) can work the same way. The COG sends data (BYTE, WORD, LONG) to a smart pin TX register buffer at the internal system clock rate, and then the smart pin can send this data via a TX shift register clocked either with the internal system clock or with the recovered clock from the input pins.

    This way, data sent from the COG/HUB or received into the COG/HUB is always synchronized to the internal system clock, while still allowing data to be received or sent at a different external clock rate.
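The RX scheme described here (oversample, re-align on edges, sample mid-bit) is the standard digital clock recovery loop. A minimal Python model, assuming a fixed integer oversampling ratio; real smart-pin logic would track an NCO phase instead:

```python
def recover_bits(samples, osr=4):
    """Minimal digital clock recovery: a free-running phase counter is
    reset on every input edge, and each bit is sampled mid-cell."""
    out, prev, phase = [], samples[0], 0
    for s in samples:
        if s != prev:          # edge seen: re-align the sampling phase
            phase = 0
        if phase == osr // 2:  # middle of the bit cell: take the sample
            out.append(s)
        phase = (phase + 1) % osr
        prev = s
    return out

bits = [1, 0, 1, 1, 0, 0, 1]
samples = [b for b in bits for _ in range(4)]  # 4x oversampled waveform
assert recover_bits(samples) == bits
```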
  • jmgjmg Posts: 14,595
    Ramon wrote: »
    2) The second scenario is intended for high-speed communication. It is here that the smart pins actually need a clock recovery circuit. But don't worry, we don't need 64 clock recovery circuits.
    The USB modes need an edge-locked NCO, and there is benefit in using the same edge-locked NCO for baud generation (it locks on the start-bit edge). I think Chip was looking at including that, so that level of clock-sync should be supported in every pin.

    I would avoid calling that clock recovery, as it is really just clock-sync, and it requires a SysCLK rather higher than the data rate.

    A different case is Slave-Synchronous (Slave SPI), and most parts spec a limit on Slave CLK-in well below SysCLK, for sampling reasons.


    Ramon wrote: »
    1) Someone wants to use the P2 internal RCFAST or RCSLOW modes but wants higher accuracy and/or temperature stability. It so happens that the device has pins connected to USB, ethernet, S/PDIF, a GPS 1pps signal (sounds weird?)... or any other external interface with an accurate or stable clock.

    Obviously the system will always start in internal RC mode, but eventually the system clock will be able to synchronize to the external frequency, if available (then an 'External_PLL_LOCKED' status can be used to know whether we are actually locked to that frequency).

    This scenario only uses the smart pins to route the input signal from one pin to somewhere near the internal RC oscillator. It is in this area where the clock recovery circuit will 'lock' (synchronize) the RCFAST (or RCSLOW) internal oscillator to that external signal.

    This quickly gets much more complex, and terminology details matter.

    Present P2 PLL is an Analog PLL, which is low jitter and linear in nature.

    This cannot lock to a 1pps signal, and even if it could, those locks are only nominal averages. RCFAST has way too much drift & jitter to be of any precision use.

    Someone wanting 1pps GPS lock will likely be using a VCTCXO at 19.2MHz or 26MHz, and doing a capture of the 1pps to drive a DAC that conditions the VCTCXO from the 2ppm post-reflow default down towards 1ppb.

    Some MCUs have a calibrated and controlled RCFAST that has an internal DAC with discrete steps. Those can DPLL to, say, USB to around 0.2%, but with ongoing jitter.
    That is a whole different animal, and Parallax do not have that IP.

    What could be practical on a P2, is a couple of simple improvements

    i) Expand the PLL from a single /N to have a few-MHz PFD, and use PFD = Xtal/M = VCO/N. The PFD runs at a few MHz, so the numbers are not large.

    ii) Expand the CL choices on Crystal, to be more granular.
    Some temperature-compensated crystal oscillators use this, and it gives a simple means to shift the crystal a few ppm and centre on an average.
    eg. You could use a low-cost crystal or resonator, and use the 1pps to time-modulate the CL and so fine-tune the oscillator frequency.

    Crystal retrim is not as precise as the VCTCXO mentioned above, but slightly cheaper, and it needs no fundamental changes to P2 logic.
    VCTCXOs are now under $1.
  • cgraceycgracey Posts: 13,403
    edited 2016-02-21 13:14
    cgracey wrote: »
    I recompiled without the programmable state mode to see what the difference was.

    Here is the Cyclone IV with the programmable state mode:

    SmartPinLogicUsage.png

    Here it is without:

    SmartPinLogicUsage_without_programmable_state.png

    Looks like a net difference of ~210 LE's, or ~20%. So, the programmable state mode takes about 1/5th of the smart pin logic.

    I've just about got the USB receiver done. I still need to add bit-unstuffing, but that won't be much. Then, I'll add transmit to complete it.

    So far, the USB receiver is taking 59 LE's, whereas the programmable state mode was taking 167 LE's.

    The way the receiver works is pretty simple:

    - It raises IN on receiving a byte, an end of packet, or a DP/DM line state held constant spanning seven bit periods.

    - The top 5 bits of the 32-bit Z register hold a counter that increments on every update (IN gets raised).

    - The bottom three 9-bit fields hold the latest three messages, with the bottom field holding the most recent. The older messages scroll left on each update.

    - The 9-bit message values are:
    * %1xxxxxxxx = byte received
    * %000000000 = SE0
    * %000000001 = IDLE
    * %000000010 = RESUME
    * %000000011 = SE1

    - As soon as the last byte is received (SE0 sensed), an IDLE event is immediately posted so that the cog can hurry up and respond to the packet.

    This has been challenging. There's got to be a better way to compose hardware than all this tedious thinking and typing. The brain would be so much more satisfied with some kind of 3D image of a machine, or something.
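As a sanity check of the Z-register layout described above, here is a small Python sketch that unpacks it. The field positions (counter in bits 31..27, messages in bits 26..18, 17..9 and 8..0 with the newest at the bottom) are inferred from the post, not from a datasheet:

```python
def decode_usb_z(z):
    """Unpack the receiver status long: a 5-bit update counter on top,
    and the last three 9-bit messages below it (newest in bits 8..0)."""
    count = (z >> 27) & 0x1F
    names = {0: 'SE0', 1: 'IDLE', 2: 'RESUME', 3: 'SE1'}
    msgs = []
    for shift in (18, 9, 0):                 # oldest -> newest
        m = (z >> shift) & 0x1FF
        if m & 0x100:
            msgs.append(('BYTE', m & 0xFF))  # %1xxxxxxxx = byte received
        else:
            msgs.append((names.get(m, '?'), None))
    return count, msgs

# Example: 5 updates so far; history is IDLE, then byte $2D, then SE0
z = (5 << 27) | (0b000000001 << 18) | ((0x100 | 0x2D) << 9) | 0b000000000
assert decode_usb_z(z) == (5, [('IDLE', None), ('BYTE', 0x2D), ('SE0', None)])
```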
  • jmgjmg Posts: 14,595
    Sounds like great progress!
    The Sync case in bit stuff systems is a stuff violation, so you may be able to merge a sync flag out of the destuff to save logic.
    TX for host USB will also need a HW option to generate sync somehow (which the TX stuffer would usually prevent)
  • RaymanRayman Posts: 11,841
    edited 2016-02-21 16:40
    I saw a P1 thread recently where Phil was able to get a 32kHz crystal to run using two Prop1 pins. Maybe that would help USB if it worked with a 12 MHz crystal and 2 P2 pins?

    BTW: Is Chip saying that he's dropping state machine stuff in order to add USB support?