Sapieha,
I didn't post USB implementations. There are lots, and I have seen some obvious errors. I cannot tell the good from the bad.
I presume your post is only to show the various blocks used in a whole USB implementation.
There is way too much risk to implement the whole USB in the P2. A serdes will be a big help.
I am sure Chip can find (or ask) for the NRZI implementation and the CRC implementation if he needs to.
Hope you understand I am just trying to get the basics agreed upon. I don't want to overcomplicate it.
I took a quick peek - hope to elaborate later, but my soldering iron is hot.
Without at least a single level of buffering, the cog cannot reliably react within 2 clocks, never mind refill the shifter. It needs a minimum of 3-5 clocks to react to DREAD/DREQ, which limits the max bit rate to clkfreq/5.
A single-level latch decouples the timing, giving the cog/task (n bits)*(clocks_per_bit) clocks to react. All of a sudden a task can handle a clkfreq/2 sync stream, where a cog would otherwise have trouble with clkfreq/4 - a HUGE win.
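To put numbers on the latch argument, here is a rough Python sketch (a model of the arithmetic only, not P2 code). The function names are made up for illustration, the 5-clock reaction time is the upper end of the 3-5 clock estimate above, and the unbuffered case is modelled as a 1-bit "frame".

```python
def reaction_budget(n_bits, clocks_per_bit):
    """Clocks the cog/task has to service the latch before the next frame lands."""
    return n_bits * clocks_per_bit

def can_keep_up(n_bits, clocks_per_bit, reaction_clocks=5):
    """True if the assumed software reaction time fits inside one latched frame."""
    return reaction_budget(n_bits, clocks_per_bit) >= reaction_clocks

# No latch: software has to react within a single bit time.
unbuffered_at_div2 = can_keep_up(1, 2)   # clkfreq/2 stream
unbuffered_at_div5 = can_keep_up(1, 5)   # clkfreq/5 stream

# Single-level latch: a 32-bit frame at clkfreq/2 leaves 32*2 clocks.
latched_budget = reaction_budget(32, 2)
```

With these assumptions the unbuffered path only works at clkfreq/5 or slower, while a 32-bit latch at clkfreq/2 leaves 64 clocks to react.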
You would have noticed I have not said much about this. Many of you have said buffering is mandatory.
I disagree. I think latching (single latch) is nice to have but not mandatory and here are my reasons. Please comment...
I was thinking that the serdes (at least in sync mode) would operate more like the old video/waitvid in the P2 (as I understand it, anyway). That is, you load up a value, set a timer (waitcnt/passcnt) and wait for tx to be done. Similarly, for rx, sw would set up the serdes to wait for a change in input level. Sw would also have to look for this change independently (waitPXX or polling). Once it is detected, sw would read the CNT value, add n bit-times, and use the result later to wait for the completed frame. So sw is ultimately waiting for the frame to complete and can therefore read the serdes register directly. This avoids the latching/buffering issue.
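The rx timing described above (read CNT on the first edge, add n bit-times, wait for that count) can be sketched in Python as follows. This is only a model: the names are hypothetical, and it assumes a free-running 32-bit counter that wraps, as CNT does on the P1.

```python
CNT_MASK = 0xFFFFFFFF  # assumption: 32-bit free-running system counter

def frame_done_cnt(edge_cnt, n_bits, clocks_per_bit):
    """Counter value at which an n-bit frame that started at edge_cnt is complete."""
    return (edge_cnt + n_bits * clocks_per_bit) & CNT_MASK

def has_passed(now, target):
    """Wrap-safe 'now >= target' test, valid while the two are < 2^31 clocks apart."""
    return ((now - target) & CNT_MASK) < 0x80000000

# Example: edge seen just before the counter wraps, 10-bit frame, 8 clocks per bit.
target = frame_done_cnt(0xFFFFFFF0, 10, 8)
```

Once `has_passed(now, target)` is true, software would read the shifter directly, so no latch is involved.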
The reason that I am not pushing for latching is silicon and Chip's time. Latching is preferred, but not mandatory IMHO: I would rather have a simple serdes than see the whole thing become too complex for the time/silicon we have available and not go in at all.
Also, Chip has done latching before so he will know how to do it if he has the time/silicon. He has changed the instruction set and that is a huge improvement but we don't yet know what impact that has on the available silicon.
To sum up...
I would rather define the serdes better, and leave latching to Chip. If it's in then great, if not I am not going to cry. I can see so many possibilities with a generalised serdes, way beyond just UART/USB/Ethernet.
I was not sure how significant the additional logic would be for bi-directional shifting... however I must note, the single-level latching will give plenty of time to do REV, even as a task.
Bill,
Sorry, I forgot. It is not simple to just reverse the serialiser to do MSB first. Either lots of silicon is required, or else the read/write needs to do the reversal. That is why I see the REV instruction being used for this.
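For anyone following along, this is what the software-side reversal amounts to. It is a plain Python model of reversing the bit order of a word; the exact operand semantics of the P2 REV instruction are not assumed here, and `rev_bits` is just an illustrative name.

```python
def rev_bits(value, n_bits=32):
    """Reverse the order of the low n_bits bits of value (bit 0 <-> bit n_bits-1)."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (value & 1)  # move the current LSB to the output
        value >>= 1
    return result

# An LSB-first shifter can send MSB-first data if software reverses the word first.
msb_first_word = rev_bits(0x00000001)
```

The shifter hardware stays one-directional; the cost moves to one extra instruction per word on the read/write side.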
Thanks Bill. An example was just what I was after. My work is running at 6-7 clocks, which will be doubled in the real P2, so I didn't see this timing problem (I had blinkers on). Chip had of course resolved this in his UART spec. So we just need to enhance it (downgrade it by making start/stop/address configurable, and add in the NRZI, input and output options, inversion options, and clock options) and maybe we are good to go?
I still need to be able to drive the SERDES directly for some things, so we need to be able to access this register too. I will make the appropriate changes and post a block diagram shortly.
We will require an n-bits (to latch) configuration option.
What is your opinion on keeping the internal data representation just like a UART, i.e. start=0, stop=1? If the transmission is inverted, this would be done either with configuration bit(s) or using the I/O pin inversion functions.
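As a sketch of what "just like a UART" would mean for the bits leaving the shifter, here is a small Python model. The function name and the `invert` flag are assumptions for illustration; the flag stands in for either a configuration bit or the pin inversion.

```python
def uart_frame(data, n_bits=8, invert=False):
    """Line levels in transmit order: start (0), data LSB first, stop (1)."""
    bits = [0] + [(data >> i) & 1 for i in range(n_bits)] + [1]
    if invert:
        bits = [b ^ 1 for b in bits]  # same internal frame, inverted on the wire
    return bits

normal = uart_frame(0x55)
inverted = uart_frame(0x55, invert=True)
```

The internal representation never changes; inversion is applied only at the output stage, which is the point of the question above.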
Bill said...
A BIG THANK YOU to Ray for working on low level support for full speed USB on the P2!!!!!!!!
Yes, but it would need 4 SERDES as the %!%$#@$! 4 bit standard requires a separate CRC16 for each of the four data lines.
I also seem to remember that it is patent encumbered (I could be wrong)
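To illustrate why four CRC engines come into it, here is a Python sketch of a per-line CRC16 for a 4-bit bus. It assumes the polynomial x^16 + x^12 + x^5 + 1 (0x1021) with a zero initial value, and it assumes each byte goes out as two nibbles, high nibble first, with DAT3 carrying the nibble's MSB. Treat the bit-to-line mapping as an assumption and check it against the SD spec before relying on it.

```python
def crc16_ccitt_bit(crc, bit):
    """Advance a CRC16 (poly 0x1021, MSB first) by one input bit."""
    feedback = ((crc >> 15) & 1) ^ bit
    crc = (crc << 1) & 0xFFFF
    if feedback:
        crc ^= 0x1021
    return crc

def sd_4bit_line_crcs(data):
    """Return [crc_DAT0, crc_DAT1, crc_DAT2, crc_DAT3] for a block of bytes."""
    crcs = [0, 0, 0, 0]  # index = DAT line number, init value assumed to be 0
    for byte in data:
        for nibble in (byte >> 4, byte & 0xF):   # assumed order: high nibble first
            for line in range(4):
                crcs[line] = crc16_ccitt_bit(crcs[line], (nibble >> line) & 1)
    return crcs
```

Each line only ever sees its own 1-bit stream, so the four CRCs cannot share one engine unless it runs four times per clock.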
There was earlier suggestion of 4 SERDES. If this is the case with final implementation, it might be possible. Not sure what the transistor count for CRC16 would be....
Patent encumbered, yes. IIRC someone was working on 4 bit mode for Prop1... But without the CRC16 hardware it wouldn't be much faster than SPI mode. Since CRC16 could be used for many purposes, I'm not sure of the legality. It would be really interesting to know where the line is drawn!
Is there a short description somewhere of exactly what USB FS expects for bit stuffing / unstuffing - its NRZI implementation? I don't have time to read the full spec, but I'd love to see the on-the-wire signalling shown simply (I know, I could google for a while, but I bet you have it at hand)?
Regarding latch transparency - if the P2 had a "transparent" mode for them where it clocked data into the xmit shifter on every clock from the latch, and into the latch from the recv buffer on every clock, you would get your transparent access without additional muxes - I think.
Given USB being half duplex, with buffers, we are in serious danger of one cog potentially handling FOUR USB FS streams (using one task each), while another cog can handle the protocol layer...
I am not concerned about patents for crc because if there were any they would have expired long ago. I implemented the crc16 (known as the IBM version) in the early 80's, both for bisync and sdlc. They were in use much earlier than this. They have been described by using polynomials, so there isn't anything that could be patentable since then. Various mainframe companies implemented the protocols so they could exchange data.
I don't really see the need for multiple USB ports on the P2 but of course someone will prove me wrong.
I am not sure the transparency latch method will work. Would be easier just to have r/w access to each serdes and each latch. We have enough instruction bits and these could be combined with similar instructions like ACCx etc.
I am still interested to see if any of this could be combined with the counters - could make for some really interesting future possibilities.
The bit stuffing seems simple and should not take too many transistors.
I think the P2 has a 1's count instruction, that would help (if not done in hardware) ... maybe (like you suggest) the counters could help with it.
USB:
My diagram post #75 shows (basics) what is required to unstuff. NRZI is there also.
My post #77 gives an example of the instructions to bit-bang it (rx).
I don't think the 1's count instruction will help.
Those links are for LS but the concept is for the most part identical. That is where I started. BradC and scanlime both posted USB code for P1.
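For readers who want the on-the-wire rules in one place, here is a Python sketch of the two steps: bit stuffing (insert a 0 after six consecutive 1s) followed by NRZI (a 0 toggles the line, a 1 leaves it unchanged). Bits are listed in wire order, i.e. LSB first. Levels are modelled as 1 = J (idle) and 0 = K; the function names are illustrative and not taken from the linked code.

```python
def bit_stuff(bits):
    """Insert a 0 after every run of six consecutive 1s."""
    out, ones = [], 0
    for b in bits:
        out.append(b)
        ones = ones + 1 if b else 0
        if ones == 6:
            out.append(0)   # stuffed bit forces a transition after NRZI
            ones = 0
    return out

def bit_unstuff(bits):
    """Drop the bit that follows six consecutive 1s (no error checking)."""
    out, ones, skip = [], 0, False
    for b in bits:
        if skip:            # a real receiver would flag an error if b == 1 here
            skip, ones = False, 0
            continue
        out.append(b)
        ones = ones + 1 if b else 0
        skip = ones == 6
    return out

def nrzi_encode(bits, level=1):
    """0 -> toggle the line, 1 -> hold it. Starts from idle (J = 1)."""
    out = []
    for b in bits:
        if b == 0:
            level ^= 1
        out.append(level)
    return out

def nrzi_decode(levels, level=1):
    """No change -> 1, change -> 0."""
    out = []
    for lv in levels:
        out.append(1 if lv == level else 0)
        level = lv
    return out

# SYNC is 0x80 sent LSB first, which comes out as K J K J K J K K.
sync_levels = nrzi_encode([0, 0, 0, 0, 0, 0, 0, 1])
```

On receive the order is reversed: NRZI decode first, then unstuff.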
SD:
Yes, the SD group have a patent restricting 4bit use and charge $$$.
With the few people wondering what USB2 would take and saying it would be too hard, could we use an external transceiver like the SMSC USB3450 to implement host or device mode? How hard would that be? Can any serdes design read 4-8 bit words off of adjacent I/O pins and shift by the said amount in order to get reasonable-speed I/O?
You would likely be better off with something a little newer, from the same vendor. Probably a charger spec'd part, to get critical mass.
USB3343 looks to need a byte-wide, 60MHz serializer (15MHz at a 32-bit load rate) with CLK and handshakes.
Not 'too hard', but not trivial either. Testing would be time consuming.
Price is lower than a full FT232H, and looks to have good ESD specs and some interesting carkit modes.
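The 60MHz vs 15MHz figures are just the ratio of bus width to word width. A small Python sketch of that arithmetic, plus the packing a 32-bit load would imply, is below. The names are hypothetical and the little-endian byte order is an assumption, not something taken from the USB3343 datasheet.

```python
def load_rate_mhz(byte_clock_mhz, word_bits=32, bus_bits=8):
    """How often a word_bits-wide register must be reloaded for a bus_bits-wide bus."""
    return byte_clock_mhz * bus_bits / word_bits

def pack_bytes_le(b0, b1, b2, b3):
    """Pack four bus bytes into one 32-bit word, first byte in the low bits (assumed)."""
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)

rate = load_rate_mhz(60)                        # 60MHz byte clock, 32-bit loads
word = pack_bytes_le(0x11, 0x22, 0x33, 0x44)
```

So the cog only has to service the serializer every fourth byte clock, which is where the 15MHz load rate comes from.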
Just thought I would "push" this thread again as Chip is almost ready to hopefully digest it.
Lots of interesting discussion. Here are a couple of my posts that might pique your interest:
#78 Possible SERDES block diag including unstuffing as used by USB (only 1 shifter shown)
Also a brief description of USB here (and possible problem with above diagram) http://forums.parallax.com/showthread.php/150946-No-Simple-P2-SERDES?p=1215469&viewfull=1#post1215469
#73 & #78 CRC16 and a block diag
#77 Proposed instruction for C_XOR_PIN -> C
And for an external reality check on serial interfaces already in silicon, here are some numbers from the new 200MHz Microchip PIC32MZEC.
For link speeds they claim:
* 50 MHz Serial Quad Interface (SQI)
* Six UART modules (25 Mbps): Supports LIN 1.2 and IrDA protocols
* Six 4-wire SPI modules (50 Mbps) [these include i2s Audio option]
* SQI configurable as an additional SPI module (50 MHz)
* 50 MHz External Bus Interface (EBI)
As discussed before, that makes 50MHz(+) HW support (where HW manages bit-level stuff, and SW manages bytes/words) a sensible target.
Lest timers be overlooked, this is the timer mix:
* Nine 16-bit Timers/Counters (four 16-bit pairs combine to create four 32-bit timers)
* Nine Capture inputs (32 bit, numerous edge modes, small FIFO in each) and nine Compare/PWM outputs (16/32)
However, Microchip seem to have missed atomic control of multiple Captures, in spite of having many spare bits for a simple alias.
Something the P2 can get right?
Comments
With buffering, SPI master should be able to run at clkfreq/2, so at 160MHz we'd get 10MB/sec - MUCH better than what we currently get on P1.
Which is patent encumbered? CRC16? or 4-bit SD? If CRC16, maybe you were thinking of US Patent 6,282,691? If so, it expired in August (see http://www.uspto.gov/web/patents/patog/week42/OG/TOC.htm).
http://www.beyondlogic.org/usbnutshell/usb3.shtml
there are bits from various sections.
I thought the SD Card Consortium had a patent on the 4 bit mode ... and charged large licensing fees.
Here are some I found:
http://www.usbmadesimple.co.uk/ums_3.htm
http://www.obdev.at/articles/implementing-usb-1.1-in-firmware.html
http://hackaday.com/2012/12/03/msp430-bit-banged-usb-1-1/
Would you consider posting the Verilog/vhdl code for the UART implementation you did (just the UART section, no instructions)?
I proposed another instruction (crc16 generation for 1 bit in C) - looking for it now (check back with this post for the link)
#2853+ In the Propeller II update thread, pedward made some proposals for CRC generation. Also see the following few posts for more discussion.
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1216792&viewfull=1#post1216792
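For reference, a one-bit-at-a-time CRC16 update looks like this in Python. It models only the arithmetic for the IBM polynomial (0x8005, written as 0xA001 in the reflected, LSB-first form); it is not the proposed instruction, whose operands are not reproduced here. The USB data CRC is taken to be the same polynomial with an all-ones initial value and an inverted result, which matches the published CRC-16/USB parameters.

```python
def crc16_ibm_bit(crc, bit):
    """Advance the reflected CRC16 (poly 0x8005 -> 0xA001) by one input bit."""
    feedback = (crc ^ bit) & 1
    crc >>= 1
    if feedback:
        crc ^= 0xA001
    return crc

def crc16_stream(data, init=0x0000, xorout=0x0000):
    """Feed bytes LSB first, one bit per step, as a serial shifter would."""
    crc = init
    for byte in data:
        for i in range(8):
            crc = crc16_ibm_bit(crc, (byte >> i) & 1)
    return crc ^ xorout

ibm = crc16_stream(b"123456789")                      # plain IBM/ARC variant
usb = crc16_stream(b"123456789", 0xFFFF, 0xFFFF)      # USB data-packet variant
```

Each step is one XOR on the low bit, a shift and a conditional XOR, which is why it maps naturally onto a single instruction working on a bit held in C.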