USB Transmit

jmg · 2016-03-21 03:55

rogloh wrote: »

For a USB transfer COG in general I suspect we will want to have FIFOs per endpoint stored in hub RAM and used by client COGs, which are then read/written by the USB COG on demand by the host (after getting a SETUP/IN/OUT token). As the USB data is streamed to/from hub, PASM code could compute the CRC for you during these USB transfers just using the look up table in stack RAM..

I thought there was enough LUT / Stack RAM to allow both Byte Tables and FIFO Buffers too, all COG local ?

rogloh · 2016-03-21 12:19

Yes there is enough space for that if you want to use it for that purpose. I believe the new P2 now has 512x32 bits of stack RAM space and the CRC-16 LUT will only need 256x16 bits of that. If the CRC-5 is also done in a lookup table, you could even re-use the otherwise unused upper byte(s) of the first 256 32 bit entries for that purpose. So after both CRCs there still should be another 256x32 bits available in the Stack RAM which could be allocated to hold some number of endpoint FIFOs.
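The table approach described here can be sketched in C. This is only an illustration under assumptions: the names `crc_lut`, `crc16_build_lut`, `crc16_update` and `usb_crc16` are hypothetical, the table is held in 32-bit words with just the low 16 bits used (mirroring the idea of leaving the upper bytes free), and the reflected polynomial 0xA001 with an all-ones preset and final inversion is the convention discussed later in the thread.

```c
#include <stdint.h>
#include <stddef.h>

/* 256 x 32-bit table; only the low 16 bits of each word hold CRC16 data,
   so the upper bytes stay free for something else (hypothetical layout). */
static uint32_t crc_lut[256];

/* Fill the table for the reflected USB CRC16 polynomial 0xA001. */
static void crc16_build_lut(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t r = n;
        for (int bit = 0; bit < 8; bit++)
            r = (r & 1u) ? (r >> 1) ^ 0xA001u : r >> 1;
        crc_lut[n] = r & 0xFFFFu;
    }
}

/* One lookup per byte: the step a packet streaming loop would run. */
static uint16_t crc16_update(uint16_t crc, uint8_t byte)
{
    return (uint16_t)((crc >> 8) ^ crc_lut[(crc ^ byte) & 0xFFu]);
}

/* CRC16 for a data payload: preset to all ones, inverted at the end. */
static uint16_t usb_crc16(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc16_update(crc, buf[i]);
    return (uint16_t)(crc ^ 0xFFFFu);
}
```

Call `crc16_build_lut()` once before the first `usb_crc16()`; the low byte of the result is the first CRC byte on the wire.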

However, I suspect that the applications making use of USB will themselves be running in other COGs, and to transfer data back and forth between the USB COG(s) and the client they could just read/write FIFOs in hub RAM for each endpoint. Another small data structure per endpoint FIFO could exist (also in hub RAM) to hold the current head/tail pointers. The USB COG would access these FIFOs on demand by the host (with IN/OUT/SETUP tokens): it either responds by sending the requested data on the bus or streams data in from the bus into hub RAM, and updates the FIFO pointers accordingly after a successful transfer. The client COG can poll for new data by monitoring these pointers or perhaps be notified via some other mechanism.
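A per-endpoint FIFO with head/tail pointers could look roughly like the C sketch below. Everything in it is an assumption for illustration: the type `ep_fifo_t`, the size `EP_FIFO_SIZE` and the function names are hypothetical, and it models one producer and one consumer sharing memory, not any actual P2 hub access code.

```c
#include <stdint.h>
#include <stdbool.h>

#define EP_FIFO_SIZE 64u   /* hypothetical size; must be a power of two */

typedef struct {
    volatile uint32_t head;        /* advanced only by the producer */
    volatile uint32_t tail;        /* advanced only by the consumer */
    uint8_t data[EP_FIFO_SIZE];
} ep_fifo_t;

/* Number of bytes currently queued (unsigned wrap-around is intended). */
static uint32_t ep_fifo_count(const ep_fifo_t *f)
{
    return f->head - f->tail;
}

/* Producer side: returns false when the FIFO is full. */
static bool ep_fifo_put(ep_fifo_t *f, uint8_t b)
{
    if (ep_fifo_count(f) == EP_FIFO_SIZE)
        return false;
    f->data[f->head & (EP_FIFO_SIZE - 1u)] = b;
    f->head = f->head + 1u;        /* publish after the byte is stored */
    return true;
}

/* Consumer side: returns false when the FIFO is empty. */
static bool ep_fifo_get(ep_fifo_t *f, uint8_t *b)
{
    if (ep_fifo_count(f) == 0u)
        return false;
    *b = f->data[f->tail & (EP_FIFO_SIZE - 1u)];
    f->tail = f->tail + 1u;
    return true;
}
```

Because each pointer has a single writer, the client can poll `head` (or `tail`) without a lock. A USB-side writer that only wants to commit after a good CRC could fill `data` first and advance `head` once at the end of the transfer.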

For a simple dedicated function like supporting a USB mouse with known fixed endpoint buffer size etc, I suppose you could combine the entire USB and mouse functionality into a single COG, keeping its FIFO local to the USB COG and devise other means to be able to pull or push the mouse state from other COGs, but I was thinking it might be nice to ultimately have a general purpose P2 USB COG in the OBEX that is basically application independent that you could spawn with some given endpoint config(s) and any appropriate USB descriptors and is generic enough to just work at the level of transferring raw byte data in and out of endpoint FIFOs in hub RAM. Then you could easily integrate this into the client COG applications that would work by reading/writing to these FIFOs.... there will be many ways you could do it I guess.

lonesock · 2016-03-21 20:56

Random thought: if there is still a configurable LFSR available, maybe the CRC calculations can leverage that functionality?

(I apologize for not having kept up with the current status of the chip, but it looks like a ton of progress has been made!!)

Jonathan

Cluso99 · 2016-03-21 21:06

lonesock wrote: »

Random thought: if there is still a configurable LFSR available, maybe the CRC calculations can leverage that functionality?

(I apologize for not having kept up with the current status of the chip, but it looks like a ton of progress has been made!!)

Jonathan

Good question. We would only need one per cog.

Other uses besides CRC ?

jmg · 2016-03-21 21:26

rogloh wrote: »

Yes there is enough space for that if you want to use it for that purpose. I believe the new P2 now has 512x32 bits of stack RAM space and the CRC-16 LUT will only need 256x16 bits of that. If the CRC-5 is also done in a lookup table, you could even re-use the otherwise unused upper byte(s) of the first 256 32 bit entries for that purpose. So after both CRCs there still should be another 256x32 bits available in the Stack RAM which could be allocated to hold some number of endpoint FIFOs.

Good.

rogloh wrote: »

However, I suspect that the applications making use of USB will themselves be running in other COGs, and to transfer data back and forth between the USB COG(s) and the client they could just read/write FIFOs in hub RAM for each endpoint. Another small data structure per endpoint FIFO could exist (also in hub RAM) to hold the current head/tail pointers. The USB COG would access these FIFOs on demand by the host (with IN/OUT/SETUP tokens): it either responds by sending the requested data on the bus or streams data in from the bus into hub RAM, and updates the FIFO pointers accordingly after a successful transfer. The client COG can poll for new data by monitoring these pointers or perhaps be notified via some other mechanism.

Yes, I was thinking of also local buffers, so at the intra-message level, the COG avoids HUB slot issues.
Up to 1024 Bytes of local storage should 'keep up' with FS USB, per direction.

As you say, the main data massaging will be done in another COG.

It would be nice if any OBEX code can run from 48MHz SysCLK, up.

lonesock · 2016-03-21 22:37

Cluso99 wrote: »

Good question. We would only need one per cog.

Other uses besides CRC ?

I was assuming it was there for PRNG applications. Not sure of any other uses?

Jonathan

Cluso99 · 2016-03-21 22:59

Thinking about the LFSR in the cog is not going to work with the rx and tx being in the smart pins because the LFSR will not have access to the serial stream

A lookup table is simple enough.

Not sure what Chip has planned for the 16KB?? ROM and how he is using it. Perhaps there could be the two CRC16 tables (IBM/USB & CCITT) stored there such that they could be loaded into LUT for lookup use???

msrobots · 2016-03-21 23:14

other uses for HW-CRC would be 4 bit spi to sd-card.

needs crc for each line, so not nice in SW

Enjoy!

Mike

cgracey · 2016-03-22 01:30

Cluso99 wrote: »

lonesock wrote: »

Random thought: if there is still a configurable LFSR available, maybe the CRC calculations can leverage that functionality?

(I apologize for not having kept up with the current status of the chip, but it looks like a ton of progress has been made!!)

Jonathan

Good question. We would only need one per cog.

Other uses besides CRC ?

I could add this into the ALU, but to do what we need, it would take 8 iterations per byte. Not a big deal, really, just 8 clocks. This is pretty trivial to add. It would save LUT use for other things. After we get this USB working, we can look at that. I think it would be best to have two hardwired modes: CRC5 and CRC16 per USB. Making it fully configurable would be way too expensive.

pjv · 2016-03-22 01:33

Cluso,

An LFSR is used to generate the maximal pseudo random sequences for direct sequence spreading codes in spread spectrum radios. Presently I do this in software, but a hardware register would run at much higher speeds, and that is an advantage.

Cheers,

Peter (pjv)
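For readers unfamiliar with maximal-length sequences, here is a small C sketch of a 16-bit Galois LFSR. The tap mask 0xB400 is a commonly cited maximal-length choice and is an assumption here, not something taken from the thread; the function names are hypothetical.

```c
#include <stdint.h>

/* Advance a 16-bit Galois LFSR by one step (tap mask 0xB400). */
static uint16_t lfsr16_step(uint16_t state)
{
    uint16_t lsb = state & 1u;     /* this bit is the output chip */
    state >>= 1;
    if (lsb)
        state ^= 0xB400u;
    return state;
}

/* Count the steps until the start state recurs (start must be non-zero). */
static uint32_t lfsr16_period(uint16_t start)
{
    uint16_t s = start;
    uint32_t n = 0;
    do {
        s = lfsr16_step(s);
        n++;
    } while (s != start);
    return n;
}
```

A maximal-length 16-bit register visits every non-zero state before repeating, so `lfsr16_period()` returns 65535 for any non-zero start value.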

Cluso99 · 2016-03-22 05:33

Chip, please forget CRC5. Most of the sequence can be precalculated so it's a breeze.

A CRC16 for both USB and CCITT is just a couple of Ors extra and a flop to select either one.
Probably you could use Something like this...

 SETX D ' nxxxxxxx_xxxxxxxx_iiiiiiii_iiiiiiii n=0=USB, n=1=CCITT CRC16 bit poly
 CRC  D ' 8 bit data byte to calculate the crc16
 ...    ' more CRC instructions for the remainder of the packet
 GETX D ' returns the CRC into D

CRC instructions take 8 clocks. Other instructions can be used in between the CRC instructions provided SETX is not used.
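What such an instruction sequence would compute can be modelled in C. This is a hedged sketch only: `crc_unit_t`, `crc_setx`, `crc_byte` and `crc_getx` are hypothetical names that mirror the mnemonics above, the `i` bits are assumed to carry the initial value, and the reflected polynomials 0xA001 (USB) and 0x8408 (CCITT) are assumed for the two modes.

```c
#include <stdint.h>

typedef struct {
    uint16_t crc;    /* running CRC value */
    uint16_t poly;   /* reflected polynomial chosen by the mode bit */
} crc_unit_t;

/* Model of SETX: n = 0 selects USB, n = 1 selects CCITT; init preloads the CRC. */
static void crc_setx(crc_unit_t *u, int n, uint16_t init)
{
    u->poly = n ? 0x8408u : 0xA001u;
    u->crc  = init;
}

/* Model of CRC: eight shift iterations, one per data bit, LSB first. */
static void crc_byte(crc_unit_t *u, uint8_t d)
{
    u->crc ^= d;
    for (int i = 0; i < 8; i++) {
        if (u->crc & 1u)
            u->crc = (uint16_t)((u->crc >> 1) ^ u->poly);
        else
            u->crc = (uint16_t)(u->crc >> 1);
    }
}

/* Model of GETX: return the accumulated CRC (no final inversion applied). */
static uint16_t crc_getx(const crc_unit_t *u)
{
    return u->crc;
}
```

The eight loop iterations correspond to the 8 clocks per byte mentioned above. Any final inversion before sending would be left to software in this model.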

rogloh · 2016-03-22 12:50

jmg wrote: »

Yes, I was thinking of also local buffers, so at the intra-message level, the COG avoids HUB slot issues.
Up to 1024 Bytes of local storage should 'keep up' with FS USB, per direction.

As you say, the main data massaging will be done in another COG.

It would be nice if any OBEX code can run from 48MHz SysCLK, up.

Yes 48MHz would be a nice target to aim for. With that don't you then get 2 hub slot opportunities per arriving USB byte or alternatively the equivalent of 16 P2 instructions? It may be tight but depending on worst case latencies that might almost be enough in the critical transfer loop to read USB bytes until a maximum length is reached, detect/exit on error, accumulate CRC16 and write USB data to hub with an auto incrementing pointer. If all that work is not possible in the cycles available, or for applications needing multiple longer streaming ISOCHRONOUS endpoint buffers of up to 1kB that won't all fit in the stack RAM we might have to operate the P2 faster than 48MHz which no doubt would be possible, or divide up the instruction workload over 4 bytes at a time and use less hub transfers of 32 bits at a time or do some similar balancing act.
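The budget quoted here follows from simple arithmetic, sketched below in C. The constants `CLOCKS_PER_INSTR` and `CLOCKS_PER_HUB_SLOT` are assumptions inferred from the figures in this post (16 instructions and 2 hub slots per byte at 48MHz), not confirmed hardware numbers, and the function names are hypothetical.

```c
#include <stdint.h>

#define USB_FS_BITRATE_HZ   12000000u  /* Full Speed USB bit rate */
#define CLOCKS_PER_INSTR    2u         /* assumption implied by the post */
#define CLOCKS_PER_HUB_SLOT 16u        /* assumption implied by the post */

/* System clocks available between consecutive USB bytes. */
static uint32_t clocks_per_usb_byte(uint32_t sysclk_hz)
{
    return sysclk_hz / (USB_FS_BITRATE_HZ / 8u);
}

/* Instructions that fit in one byte time. */
static uint32_t instrs_per_usb_byte(uint32_t sysclk_hz)
{
    return clocks_per_usb_byte(sysclk_hz) / CLOCKS_PER_INSTR;
}

/* Hub access opportunities in one byte time. */
static uint32_t hub_slots_per_usb_byte(uint32_t sysclk_hz)
{
    return clocks_per_usb_byte(sysclk_hz) / CLOCKS_PER_HUB_SLOT;
}
```

At 48MHz this gives 32 clocks, 16 instructions and 2 hub slots per byte. Bit stuffing only adds bits on the wire, so bytes never arrive faster than this.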

I imagine many HID applications would use smaller endpoint buffers that could fit in the stack RAM - but you will still need a way (and spare cycles) to get it from there to the client COG, that's why streaming it directly into hub RAM as it arrives is useful. Actually I think it is only the ISOCHRONOUS streaming endpoints that need the large 1kB buffers, though BULK data transfers can reach up to 512 bytes too. If you are a device I guess you get to nominate these buffer sizes, but a host would need to be able to support the published endpoint transfer sizes for any attached USB devices it manages. Good thing is we have plenty of hub RAM for that.

Cluso99 · 2016-03-22 21:07

I think if you are a USB host you can send shorter length blocks anyway. You don't have to use the whole length that the device is capable of receiving.

I previously did a lot of work and some decoding for p2hot and was sure at 160+MHz FS would be fine as p2hot was 8x P1 instruction speed. P2 is now only 4x instruction speed but since we are working with 80MHz we are only 2x. But I run P1 (for USB testing) at 96MHz although I can run P1 at 104+MHz. 96MHz is optimal for USB bit banging.

rogloh · 2016-03-23 10:20

@Chip,
I wanted to ask - about the USB COG functional partitioning. Do you envisage with your USB HW design as it is currently evolving that we would be able to do both transmit and receive functions from the same COG (separately in time obviously) or is there some hardware design limit with reading/writing Smartpins and/or bus/pin turnaround that may preclude this?

The reason I ask is that USB operation itself is inherently half duplex and if the main USB software functionality can be kept in a single COG I expect it would be likely simpler to build up a software implementation that more easily co-ordinates between transmit and receive, tracking transaction sequences, timeouts etc, not to mention the potential benefit of a single COG USB implementation saving on COG usage (assuming it can fit in the overall instruction memory that is, and my gut feeling says it may, perhaps with a few clever tricks here and there).

Rayman · 2016-03-23 11:34

If it works, I think I'd rather just use the LUT than add more cog instructions

Either you need USB or you don't
If you do then using one cog for it seems very reasonable

jmg · 2016-03-23 23:34

rogloh wrote: »

@Chip,
I wanted to ask - about the USB COG functional partitioning. Do you envisage with your USB HW design as it is currently evolving that we would be able to do both transmit and receive functions from the same COG (separately in time obviously) or is there some hardware design limit with reading/writing Smartpins and/or bus/pin turnaround that may preclude this?

The reason I ask is that USB operation itself is inherently half duplex and if the main USB software functionality can be kept in a single COG I expect it would be likely simpler to build up a software implementation that more easily co-ordinates between transmit and receive, tracking transaction sequences, timeouts etc, not to mention the potential benefit of a single COG USB implementation saving on COG usage (assuming it can fit in the overall instruction memory that is, and my gut feeling says it may, perhaps with a few clever tricks here and there).

? I think the whole focus is on a single COG, doing Rx/Tx half duplex.
Note the discussions around Turn Around Time.

The only open question seems to be around CRC, and that will self-answer with code.

To me, if the SW CRC is too Slow, or uses memory the USB Code turns out to need, then that dictates HW CRC.
If SW CRC can keep up with Full Speed, largest packets, with minimal resource impact, then SW CRC will be ok.

Rayman · 2016-03-24 00:36

Just that there's so much out there on the web on how to use look up tables to do USB CRC, I have to think it can be done in software...
Guess it's just a MIPS question...

The other thing is that P2 was very general purpose, it seemed, until USB, which is very specific. Adding USB only CRC would make it even more so.
Seems against the design philosophy...

Cluso99 · 2016-03-24 00:41

I am right in the middle of doing the crc in software now. Will update you once I have results.

jmg · 2016-03-24 01:43

Rayman wrote: »

Just that there's so much out there on the web on how to use look up tables to do USB CRC, I have to think it can be done in software...
Guess it's just a MIPS question...

Agreed

Rayman wrote: »

The other thing is that P2 was very general purpose, it seemed, until USB, which is very specific. Adding USB only CRC would make it even more so.
Seems against the design philosophy...

Not really. If the P2 can manage SW CRC on 1024 byte payloads, at 48MHz SysCLK, and with spare code space, SW is fine.
If not, then HW is needed to avoid compromising what is already a modest speed bus.
CRC HW is not large, and you can use the 16b CRC for many things, it is not USB only.

The P2 alternatives like XMOS have not only USB, but HS USB (480Mbit/s) - so the 12Mbit/s P2 USB is relatively modest.

rogloh · 2016-03-24 04:38

Cluso99 wrote: »

I am right in the middle of doing the crc in software now. Will update you once I have results.

Great, let us know if you can manage it in less than 5 extra P2 instructions per USB data byte read/written in some existing packet streaming loop. That should be the benchmark. I had some sample code for that in prior posts.

Cluso99 · 2016-03-24 05:30

Spent all day on this

I have the 5bit working without a lookup because it can mostly be precalculated.
16 bits make up the token field that carries the CRC5: ADDR(7) + ENDP(4) + CRC5(5).
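A bit-serial CRC5 over that 16-bit field can be sketched in C as follows. This is an illustration under assumptions, not the code from this post: it uses the reflected polynomial $14 with preset $1F and final inversion, feeds the 11 ADDR+ENDP bits LSB first, and the names `usb_crc5_raw` and `usb_token_field` are hypothetical.

```c
#include <stdint.h>

/* Shift the low 'nbits' bits of 'data' (LSB first) through the CRC5
   register, starting from the all-ones preset. No final inversion. */
static uint8_t usb_crc5_raw(uint16_t data, int nbits)
{
    uint8_t crc = 0x1F;
    for (int i = 0; i < nbits; i++) {
        uint8_t c = (uint8_t)((data ^ crc) & 1u);
        data >>= 1;
        crc >>= 1;
        if (c)
            crc ^= 0x14;
    }
    return crc;
}

/* Build the 16-bit field: ADDR in bits 0-6, ENDP in bits 7-10, CRC5 in bits 11-15. */
static uint16_t usb_token_field(uint8_t addr, uint8_t endp)
{
    uint16_t d = (uint16_t)((addr & 0x7Fu) | ((endp & 0x0Fu) << 7));
    uint8_t crc = (uint8_t)(usb_crc5_raw(d, 11) ^ 0x1Fu);
    return (uint16_t)(d | ((uint16_t)crc << 11));
}
```

For address 0, endpoint 0 this gives $1000, i.e. the bytes 00 10 on the wire. Running all 16 bits of a good field through `usb_crc5_raw()` leaves $06 in this right-shifting register, which is the $0C residual with its five bits reversed.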

But the CRC16 is troublesome. Lots of info and calculators on the web, but most give different answers for the same data

And while I think my code is correct, it does not fit the CRC16 in my snooping, nor does it fit with any of the online code generators either (my code or the snooped code)

Add to this, from what I understand, the CRC16 is only calculated on the data bytes (payload) ignoring the PID and SYN. However, I have also seen reference to including the PID and SYN and EOP. To make things worse, they say SYN is $01 whereas I believe it should be %01111100 or $7C.

jmg · 2016-03-24 05:39

Cluso99 wrote: »

Add to this, from what I understand, the CRC16 is only calculated on the data bytes (payload) ignoring the PID and SYN.

Yes, PID has this
"The first byte in every packet is a Packet Identifier (PID) byte. This byte needs to be recognised quickly by the SIE and so is not included in any CRC checks. It therefore has its own validity check. The PID itself is 4 bits long, and the 4 bits are repeated in an complemented form."
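The PID self-check described in that quote is a one-liner; a minimal C sketch follows, with the hypothetical name `usb_pid_valid` and assuming the PID sits in the low nibble and its complement in the high nibble of the received byte.

```c
#include <stdint.h>
#include <stdbool.h>

/* True when the high nibble is the bitwise complement of the low nibble. */
static bool usb_pid_valid(uint8_t pid_byte)
{
    uint8_t pid   = pid_byte & 0x0Fu;
    uint8_t check = (uint8_t)((pid_byte >> 4) & 0x0Fu);
    return check == (uint8_t)(~pid & 0x0Fu);
}
```

For example, 0xC3 (DATA0) and 0x2D (SETUP) pass, while a single flipped bit such as 0xC2 fails.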

CRC merge with data gets tricky, as you also need to know what to preload it with (0,0FFFF or??) - and I'm vague on USB, but some systems send the complement CRC so that the total CRC sum (including the CRC) totals Zero.
Makes sense, as Zero is easier/faster to check than A==B.

Can you find a tested FPGA USB cell, that should have docs around the CRC ?

Addit: I find this
http://www.usb.org/developers/docs/whitepapers/crcdes.pdf

Says

The implementation starts off with the shift register loaded with all 1s.

and

The remainder is bit-wise inverted before being appended as the checksum to the input
pattern. Without this modification, trailing zeros at the end of a packet could not be
detected as data transmission errors.

I think that is saying preload with 0xffff, and invert the CRC, to send.

and I find posts that say that, with the appended CRC bytes included, the resulting CRC should be 0x0000
and a search for that, points to code here
https://gist.github.com/anthonywebb/1a0bd749d3861db0752d
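The receive-side idea (run the CRC over data plus the appended CRC and compare against a constant) can be sketched in C. This is a hedged illustration with hypothetical names. It assumes the 0xFFFF preload and inverted CRC quoted above, and a right-shifting register with polynomial 0xA001. Under those assumptions the register ends at the fixed value 0xB001 rather than zero, which is 0x800D with the bit order reversed.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define USB_CRC16_RESIDUAL 0xB001u   /* 0x800D with bit order reversed */

/* Shift one byte, LSB first, through the reflected CRC16 register. */
static uint16_t crc16_shift_byte(uint16_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int i = 0; i < 8; i++)
        crc = (crc & 1u) ? (uint16_t)((crc >> 1) ^ 0xA001u)
                         : (uint16_t)(crc >> 1);
    return crc;
}

/* Sender: append the inverted CRC, low byte first, after data_len bytes. */
static void usb_crc16_append(uint8_t *buf, size_t data_len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < data_len; i++)
        crc = crc16_shift_byte(crc, buf[i]);
    crc ^= 0xFFFFu;
    buf[data_len]     = (uint8_t)(crc & 0xFFu);
    buf[data_len + 1] = (uint8_t)(crc >> 8);
}

/* Receiver: run over data and CRC bytes, then compare with the constant. */
static bool usb_crc16_check(const uint8_t *buf, size_t total_len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < total_len; i++)
        crc = crc16_shift_byte(crc, buf[i]);
    return crc == USB_CRC16_RESIDUAL;
}
```

Comparing against a constant costs the same as comparing against zero, and the receiver never needs to know where the payload ends and the CRC begins.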

Ariba · 2016-03-24 10:25

Why all this discussion about CRC generation in hardware?
Rogloh has shown that you can do it in 5 instructions. One of them is RDLUT, so these 5 instructions take 11 clock cycles. Hardware will do it bitwise, so it takes a minimum of 8 cycles, not counting the transfer of the byte and the CRC result. In the end hardware CRC may even be slower.

If every cycle matters I've also found a solution with 4 Instructions and 9 clock cycles.

Andy

Cluso99 · 2016-03-24 11:17

Andy,
I am just trying to get the CRC16 to work in software. No tables yet as I need to have code to verify its working. Then we can decide what needs to be done. In the meantime Chip is working on the smart pins so nothing being wasted here.

Here is a Data Packet...

SYNC + PID + DATA + CRC16 + EOP
7C 3C < FE 9F FF 7F FF FF FD FF > 44 D6 SE0 SE0 J

This is a packet and data is LSB first.
The CRC16 is only supposed to be on the DATA section of the packet (between < and > above).
The Polynomial is supposed to be $A001, initial value of $FFFF. I have tried $8005.
I have tried the online converters and they give differing results. I have tried XOR my CRC and then reversing it. Nothing gives the right result. But my program works fine for CRC5.

cgracey · 2016-03-24 11:49

Okay! I've got the USB all working now.

I had to do some unanticipated things to get it to work synergistically with software. Polling status via PINGETZ was way too slow for full-duplex testing in a single cog at 80MHz (not a real-world concern, but how I set up my initial testing).

Since USB goes into every even-numbered pin, but works with its upper odd companion pin, both smart pins' IN signals are now used to signal back to the cogs:

Even-pin IN goes high when the transmit buffer can receive another byte via PINSETY.
Odd-pin IN goes high when a byte is received and can be read via PINGETZ. PINGETZ can also be used at any time to read the latest byte and status bits.

This makes the polling really fast, so that during packet sending or receiving, only one PINSETY or PINGETZ needs to be done per byte, aside from watching the specific IN signal. Just sending or receiving with IN polling can be done at 24MHz signaling (twice what USB uses/requires). Also, this USB smart pin mode is almost exactly the same size as the serial modes (32 ALM's on Cyclone V), but is only in half the pins, so it takes half the silicon.

I also changed the regular serial transmit modes to raise IN when the transmit buffer can receive another byte via PINSETY. This will simplify keeping the transmitter busy. I left in the polling mode for baud-mode-transmit-done, but left out the transmit-buffer-full flag, because IN now handles that.

I'll prepare a new intermediate release tomorrow to cover all these changes.

Cluso99 · 2016-03-24 12:04

fantastic news Chip. Looking forward to having a play with P2 Emulation again.

Dave Hein · 2016-03-24 17:05

cluso, this C program gives the correct CRC for the sample messages given at https://superjameszou.wordpress.com/2010/09/06/a-real-example-for-usb-packets-transferring/. If I try it with your data I don't get the same CRC.

#include <stdio.h>

#define CRC16 0xa001

int bytecrc(int val, int crc)
{
    int i;

    crc ^= val;

    for (i = 0; i < 8; i++)
    {
        if (crc & 1)
        {
            crc >>= 1;
            crc ^= CRC16;
        }
        else
            crc >>= 1;
    }

    return crc;
}

int usbcrc(unsigned char *buf, int num)
{
    int i;
    int crc = 0xffff;

    for (i = 0; i < num; i++)
        crc = bytecrc(buf[i], crc);

    return crc ^ 0xffff;
}

int main(void)
{
    int i, crc, num;
    //unsigned char data[] = {0xfe, 0x9f, 0xff, 0x7f, 0xff, 0xff, 0xfd, 0xff, 0x44, 0xd6};
    unsigned char data[] = {0x80, 0x06, 0x00, 0x02, 0x00, 0x00, 0x20, 0x00, 0xb1, 0x94};

    num = sizeof(data) - 2;

    crc = usbcrc(data, num);
    printf("Computed CRC is %2.2x %2.2x.  Correct CRC is %2.2x %2.2x\n", crc&255, crc>>8, data[num], data[num+1]);

    return 0;
}

jmg · 2016-03-24 19:31

Dave Hein wrote: »

cluso, this C program gives the correct CRC for the sample messages given at

If you run the CRC algorithm to the end of the Packet, including CRC (as received), does that sum to 0x00 ?

Dave Hein wrote: »

If I try it with your data I don't get the same CRC.

Does that mean a capture mistake ?
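One way to probe the capture question, offered as a hypothesis rather than something established in the thread: if the snooped bytes were logged with inverted polarity and MSB first, undoing both should give a packet whose CRC checks out. The C sketch below applies that transform; all helper names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Reverse the bit order within one byte. */
static uint8_t reverse_bits(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u));
        b >>= 1;
    }
    return r;
}

/* Hypothesised fix-up: complement the byte, then reverse its bits. */
static uint8_t unmangle(uint8_t b)
{
    return reverse_bits((uint8_t)~b);
}

/* Bitwise USB CRC16: poly 0xA001, preset 0xFFFF, final inversion. */
static uint16_t usb_crc16_bitwise(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 1u) ? (uint16_t)((crc >> 1) ^ 0xA001u)
                             : (uint16_t)(crc >> 1);
    }
    return (uint16_t)(crc ^ 0xFFFFu);
}
```

Applied to the bytes FE 9F FF 7F FF FF FD FF 44 D6 from the earlier post, `unmangle()` yields 80 06 00 01 00 00 40 00 DD 94, and `usb_crc16_bitwise()` over the first eight bytes gives 0x94DD, matching the last two bytes (low byte first).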

jmg · 2016-03-24 19:40

cgracey wrote: »

Okay! I've got the USB all working now.

Great. What does working mean ?
You have USB traffic code working, and a P2 replying ?

cgracey wrote: »

Just sending or receiving with IN polling can be done at 24MHz signaling (twice what USB uses/requires).

Nice.
Does that mean, with SW overhead, a 48MHz SysCLK will be able to fully feed FS USB streams ?

cgracey wrote: »

I also changed the regular serial transmit modes to raise IN when the transmit buffer can receive another byte via PINSETY. This will simplify keeping the transmitter busy. I left in the polling mode for baud-mode-transmit-done, but left out the transmit-buffer-full flag, because IN now handles that.

Sounds very good, should be able to sustain higher MBd, with lower SysCLKs

Cluso99 · 2016-03-24 21:12

Smile! Somewhere yesterday I did a simplification and didn't realise I had introduced an error:
data := data >> 1 does not simplify to data >= 1; it should be data >>= 1
Now to retest all over again for crc16

FWIW I don't like using these simplifications but I needed the speed and I believe they are faster.

Here is my code. It works for CRC5 but does not generate the correct result in CRC16 as per the snooped traffic in LS (LS & FS should be the same)

PRI CalculateCRC(poly, crc, data, bits) | i, c
'   POLY5 := $14    POLY16 := $A001 / $8005             ' polynomial
'   BITS5 := 5      BITS16 := 16                        ' no. of bits to process
'   INIT5 := $1F    INIT16 := $FFFF                     ' initial value
'   XOR5  := $1F    XOR16  := $FFFF                     ' final XOR with

    repeat i from 0 to bits-1
      c := (data ^ crc) & $01                           ' data bit 0 XOR crc bit 0
      data >>= 1                                         ' data >> 1
      crc  >>= 1                                         ' crc  >> 1
      if c
        crc ^= poly                                     ' if c==1: crc xor poly
    return crc

The problem I found is that the online generators are giving different results, and none give the result snooped on the bus. I have tried xoring the resultant crc and reversing. In fact I have a routine that tries all possibilities as a 16 bit input to my generator to find the correct result obtained from the snooping crc.

If you include the crc in the generator, for CRC5 the residual should be $0C (I verified this too), and CRC16 should be $800D (cannot get this). I have tried including the PID, then PID plus SYNC. I also tried the CCITT CRC16 just in case - definitely not this.

I also tried both poly's $A001 and $8005 (and $8408 from CCITT).

I have yet to try varying initial values.
