Two pins should be able to handle it, especially if odd and even pins had different USB smarts.
Something I've always wanted to see, but have never found anywhere, is a diagram of USB protocol from the wire level up. If there was something definitive to look at, this could be easy. Diving deep into the huge USB specification to try to construct such information by making lots of inferences has put me off. If I wanted someone to know how something worked, I would explain it in very direct terms. For whatever reason, these protocol standards are never written like that.
Chip,
LS and FS effectively reverse the functions of the pin pairs. But I guess we could swap the pin pair using the smart pins, so it's not really an issue here.
I did a lot of work a long time ago on understanding the bottom-level protocol. It is not that bad, although the CRC16 could do with some help. Before smart pins I had worked out how a couple of instructions would make software life much easier. I posted the info a couple of years ago. Note that I think there is an error in the way I worked the instructions out.
Even if there is no HW support, CRC-16 can potentially be done with the COG's 256-entry LUT. I think the old hot P2 could do the CRC accumulation on each byte in 5 clocks, within an existing byte-processing loop for example. It might be faster on the new P2; I haven't checked.
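To make the 256-entry LUT idea concrete, here is a minimal Python sketch of a table-driven CRC-16 as used on USB data packets (reflected polynomial 0xA001, initial value 0xFFFF, result inverted). The function and table names are made up for illustration, and this is a model of the arithmetic only, not P2 code or the routine referred to above.

```python
def make_crc16_table(poly=0xA001):
    """Build the 256-entry lookup table for the reflected CRC-16 polynomial."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


CRC16_TABLE = make_crc16_table()


def crc16_usb(data):
    """One table lookup per byte: the per-byte step a cog loop would perform."""
    crc = 0xFFFF                      # USB seeds the register with all ones
    for byte in data:
        crc = (crc >> 8) ^ CRC16_TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFF               # USB sends the inverted remainder


print(hex(crc16_usb(b"123456789")))   # 0xb4c8
```

The per-byte work is one XOR, one lookup, one shift and one more XOR, which is why it fits inside an existing byte-processing loop.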
There were lots of older USB-related discussions here... but some of them assumed extra instructions, and others pure software methods, etc.
I'm kind of thinking that the programmable mode, as someone pointed out, doesn't do a whole lot, outside of some basic counting and shifting operations. I think to get real FPGA-imagined flexibility from a programmable mode, it would take a lot finer granularity in functions and much more silicon.
All those measuring and timing modes are basically made redundant by the programmable state mode, and vice-versa. I like the fixed modes, because they don't require any setup. You just use them.
All those measuring and timing modes are basically made redundant by the programmable state mode, and vice-versa.
I had been meaning to ask you that very question. The obvious next question then becomes, what is the ALM count with only the programmable mode existing?
PS: I don't see why a bunch of macros, or similar, are out of the question when it comes to setting the programmable mode for each of the regular uses. And the documentation for configured data flow would be no different to your existing docs.
Those dedicated modes are actually a little better, but I didn't realize when I typed that. They can count time, in addition to states and events. The programmable mode uses up X and Y just for configuration, so there are no registers left for a reloadable 32-bit counter to track time.
If we could identify what the macros need to be, we could do that, but that's another month of development, probably.
Doh! X&Y consumed is obvious now. So the equivalent programmable cell would need more registers right off the bat, and more logic to match. As you said first, FPGA like reconfigurability always has a resource cost.
Nothing specific. We're going to need verification that the lowest-level signaling is correct. I think we'll need a hardware analyzer to see that. I found one for $400 that looks pretty decent. It shows all non-transactional signalling, which is below the radar of a software analyzer.
Sounds useful.
It should be possible/simplest to get USB RX working first, sniffing in parallel with a connected USB device.
Even just a WAIT SE0 to toggle a pin and resync a 12MHz NCO (or restart a SPI Sync Rx) should give some Frame and data connections.
SE0 => Resync is simple in existing SW/HW as a starting base, but keeping the data aligned to within 25% of a bit over a larger packet needs around 20ppm. That's not hard if you control both clocks, but a little tight for a PC oscillator.
Smaller packets would be OK, e.g. 128 bytes will stay within 25% at ~250ppm, which is a more typical figure.
This could be a good test for a NCO-Sync'd BAUD Clock.
The WAIT SE0 can capture the time per frame, and I think the PC clock is 12000 x the FrameSync frequency.
The SysCLKs counted in that time can be used to adjust the NCO to give 12000 overflows per frame, which gives a SW-DPLL before edge-resync is implemented.
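The drift and NCO numbers above can be checked with a few lines of arithmetic. This is a rough sketch with assumed names: `max_ppm_for_packet` ignores bit stuffing and packet overhead, and `nco_increment` assumes a 32-bit phase accumulator clocked once per system clock.

```python
BITS_PER_FRAME = 12_000   # 12 MHz full-speed bit clock over a 1 ms frame


def max_ppm_for_packet(n_bytes, slip_fraction=0.25):
    """Clock mismatch (ppm) at which the sample point slips by slip_fraction
    of a bit by the end of an n_bytes packet, with no resync along the way."""
    bits = n_bytes * 8
    return slip_fraction / bits * 1e6


def nco_increment(sysclks_per_frame, acc_bits=32):
    """Phase increment that makes the accumulator overflow about
    BITS_PER_FRAME times in the measured number of system clocks."""
    return round(BITS_PER_FRAME * (1 << acc_bits) / sysclks_per_frame)


print(max_ppm_for_packet(128))    # about 244 ppm
print(max_ppm_for_packet(1023))   # about 31 ppm
print(nco_increment(80_000))      # increment for 80 MHz, 0.15 * 2**32
```

For 128 bytes this reproduces the ~250ppm figure. A 1023-byte packet comes out near 30ppm before stuffing is counted, the same order as the 20ppm quoted for larger packets.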
I'm kind of thinking that the programmable mode, as someone pointed out, doesn't do a whole lot, outside of some basic counting and shifting operations. I think to get real FPGA-imagined flexibility from a programmable mode, it would take a lot finer granularity in functions and much more silicon.
All those measuring and timing modes are basically made redundant by the programmable state mode, and vice-versa. I like the fixed modes, because they don't require any setup. You just use them.
They may not do fancy things, but the simple things they can do will be very useful.
Some applications that come to mind:
- Fault state shut down of a PWM signal
- make a larger pulse out of a very small one
- divide a clock
- make a phase comparator for a PLL
- Add a deadband to a PWM
- add a complementary output
- sample an input until a trigger on another pin
- decimate a clocked SigmaDelta signal
- make a shifter input with 1..32 bits
and so on
These are all things that will be hard to do in software and are not covered by the dedicated modes.
Yes, there are some features missing. A second counter, for example, to count the shifted-in bits if you need the StateCounter for something else. Or a way to reset the StateCounter to make a retriggerable monoflop.
A bit counter could be done by counting SIGNAL commands just like the StateCounter counts NEXT-State commands.
Maybe just 1 and 8 will be enough: Raise the INx at every SIGNAL command or raise it after 8 SIGNAL commands. This needs only 1 additional config bit.
By combining the State- and the Signal counter you can raise the INx after 32*8 = 256 events.
Andy
Those are good ideas.
One thing to consider: inputs are clocked, so no asynchronous phenomena can occur.
I'm redoing the M register layout now to accommodate serial modes with a bit-count field:
USB timing is tight to do FS by software on the P1 at 96MHz. But it is doable.
I worked out the bit receiving loop and a bit of hw help with 2 instructions would go a long way to making it easy to do.
I don't believe a fully compliant USB is required. There are a lot of non-compliant USB implementations that work reliably.
I posted all the info including a sample receive routine in the old P2 threads years ago. I have been meaning to get it running properly on the P1 but so far I haven't found the time. My latest P1 boards have the hw interface built in ready to go.
Currently I am working on my P1 PropOS to complete the SD Driver as stay-resident and also complete the conversion of the final part of Michael Park's Sphinx propeller compiler (LEX and LINK works, but CODEGEN still has problems). Sphinx is FAT16 only and requires conversion to use Kye's FAT16/32 Driver (which is what I use in PropOS).
Once this is complete, I can have another look at USB FS.
I wasn't planning on using anything special for debugging USB FS. I can use my RamBlade to log up to 512K USB pin samples (up to 8bits/pins) to the external SRAM.
Cluso, do you remember how many discrete signaling states there are in USB full-speed and slow-speed? I'm allocating mode codes and I'm wondering how many discrete things we need to be accommodating. Is this a sensible list?:
- byte transmit, or packet transmit?
- byte receive, or packet receive?
- low, low
- high, high
- generate ACK
- generate NAK
Some configuration packets have a 5-bit CRC, and data packets have a 16-bit CRC, right? Should any of those CRCs be computed in hardware? Is there any quick-turnaround response required that would necessitate hardware?
I see now what needs to be done. It's pretty straightforward.
Is it appropriate to leave all CRC computation to the cog? I think someone said that they had that working in 5 clocks per byte on Prop2 Hot, which means 10 clocks on this architecture.
It looks like sending can be done by writing 9-bit values via PINSETY, where if bit 8 is cleared, it means data, and if bit 8 is set it means just do the SE0 for two clocks, then J, then quit driving the two lines (EOP, or end-of-packet). These 9-bit transmit values will be double-buffered, so that they transmit back-to-back without delays. IN will signal when it can accept another byte.
Receiving is a little more complex, because some status must be conveyed at times when there's no data.
Maybe receive-data/report-line-status can be the default state and when you want to transmit, you just do two initial PINSETY's to get things started and double-buffered, and you give another byte on every IN high. Then, when you give it a bit-8-high value it will do the EOP signaling. After that, it returns to receive-data/report-line-status mode. This smart pin mode will need to control two pins' pin-level DIR's.
Does this sound viable? Did I miss anything?
P.S. GETPINZ will return any received byte in bits 7..0, while the upper bits will contain status information, like the current state and whether the byte is new since the last IN rise.
States:
J detected (idle state) - raises IN
K detected (wake-up) - raises IN
SE0 detected (unplugged if host or periodic keep-alive signal if non-host) - raises IN
SE1 detected (illegal condition) - raises IN
byte received (available in Z[7:0] via GETPINZ)- raises IN
byte sent (ready for another, cancels transmit if timeout) - raises IN
Maybe those 9-bit PINSETY codes need to be expanded:
$000..$0FF = transmit byte
$100 = enter idle state
$101 = transmit EOP, then enter idle state
$102 = output SE0 (issue keep-alive)
$103 = output K (issue wake-up)
(no need for generate J, as it's same as idle state, which is undriven)
I think we can just do an EOP whenever the transmit buffer runs dry. Maybe no need to command it.
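As a way of reading the proposed 9-bit PINSETY values, here is a small Python sketch that classifies a code according to the list above. It only models the proposal as written in this post. The function name and the "unassigned" label for codes above $103 are assumptions, and none of this describes final hardware.

```python
# Control codes from the proposal above (bit 8 set means "not data").
CONTROL_CODES = {
    0x100: "enter idle state",
    0x101: "transmit EOP, then enter idle state",
    0x102: "output SE0 (issue keep-alive)",
    0x103: "output K (issue wake-up)",
}


def describe_tx_code(code):
    """Return (action, data_byte) for a proposed 9-bit transmit value."""
    if not 0 <= code <= 0x1FF:
        raise ValueError("transmit values are 9 bits wide")
    if code <= 0xFF:
        return ("transmit byte", code)
    # Codes the post does not list are reported as unassigned (an assumption).
    return (CONTROL_CODES.get(code, "unassigned"), None)


print(describe_tx_code(0x0A5))   # ('transmit byte', 165)
print(describe_tx_code(0x101))   # ('transmit EOP, then enter idle state', None)
```

If EOP ends up being sent automatically when the transmit buffer runs dry, the $101 entry would simply drop out of the table.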
All those measuring and timing modes are basically made redundant by the programmable state mode, and vice-versa.
I had been meaning to ask you that very question. The obvious next question then becomes, what is the ALM count with only the programmable mode existing?
PS: I don't see why a bunch of macros, or similar, are out of the question when it comes to setting the programmable mode for each of the regular uses. And the documentation for configured data flow would be no different to your existing docs.
Those dedicated modes are actually a little better, but I didn't realize when I typed that. They can count time, in addition to states and events. The programmable mode uses up X and Y just for configuration, so there are no registers left for a reloadable 32-bit counter to track time.
If we could identify what the macros need to be, we could do that, but that's another month of development, probably.
It's funny you should bring this up. Just yesterday, I was seeing what it would take to implement a couple of the other modes using the state machine modes. As you note, in many cases, the dedicated modes are better. On the other hand, there are some places where the state machine can effectively do the same work. One example is mode %10010 (time A-input high states). This can be implemented with either state mode (I found the 2-bit 1-pattern mode easier for this example). However, there are a few differences:
* In the dedicated mode, timeout is indicated by a value of zero. In state machine mode, timeout is indicated by saturation.
* In the dedicated mode, count starts at one. In state machine mode, count starts at zero.
As you can see, the differences are relatively minor. One thing that was not minor, however, was getting the state machine configurations figured out! Maybe this would get easier with time/practice. However, I see a simple GUI tool in my future for creating these.
7.1.18 Bus Turn-around Time and Inter-packet Delay
Inter-packet delays are measured from the SE0-to-J transition at the end of the EOP to the J-to-K transition that starts the next packet.
A device is required to allow two bit times of inter-packet delay. The delay is measured at the responding device with a bit time defined in terms of the response. This provides adequate time for the device sending the EOP to drive J for one bit time and then turn off its output buffers.
The host must provide at least two bit times of J after the SE0 of an EOP and the start of a new packet (TIPD). If a function is expected to provide a response to a host transmission, the maximum inter-packet delay for a function or hub with a detachable (TRSPIPD1) cable is 6.5 bit times measured at the Series B receptacle. If the device has a captive cable, the inter-packet delay (TRSPIPD2) must be less than 7.5 bit times as measured at the Series-A plug. These timings apply to both full-speed and low-speed devices and the bit times are referenced to the data rate of the packet.
The maximum inter-packet delay for a host response is 7.5 bit times, measured at the host’s port pins. There is no maximum inter-packet delay between packets in unrelated transactions.
The bold text, I believe, defines the software pinch point for full-speed USB, does it not?
We have up to 7.5 bit periods (625ns, or 50 clocks at 80MHz) to get a response headed back to the host.
This may be tight, especially if CRC checking is involved. Is this realistic to do? Back in the early days of USB 1.1 when MCU's ran at only a few MHz, you can imagine how this response mechanism HAD TO BE in hardware. What a pain! I think we could skate around this.
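The response budget is easy to re-derive for other clock speeds. A minimal sketch follows. The function name and default parameters are assumptions, and the 6.5 and 7.5 bit-time limits are the ones quoted from section 7.1.18 above.

```python
def response_budget(bit_times, bit_rate_hz=12_000_000, sysclk_hz=80_000_000):
    """Convert an inter-packet delay in bit times into (nanoseconds, system clocks)."""
    seconds = bit_times / bit_rate_hz
    return seconds * 1e9, seconds * sysclk_hz


for limit in (6.5, 7.5):
    ns, clocks = response_budget(limit)
    print(f"{limit} bit times = {ns:.1f} ns = {clocks:.1f} clocks at 80 MHz")
```

At 7.5 bit times this gives the 625ns and 50 clocks mentioned above. The 6.5 bit-time limit for a detachable cable leaves a little over 43 clocks.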
Chip,
For LS, J=10 and K=01, and it's the reverse for FS.
IIRC SE0=00 and SE1=11.
It does not matter which way around the two pins are, as they get reversed between LS & FS.
What would make this nice is the ability to read two adjacent pins that are fed into a 2x2 LUT with the two outputs set as follows
D-  D+ | X0  X1
-------|--------
0   0  | 0   0   = SE0
0   1  | 1   1   = J or K
1   0  | 1   0   = K or J
1   1  | 0   1   = SE1
If this could be done by an instruction, then the Z could be set for SE0/SE1 and C set for D+=1.
This would permit sw RCL/RCR to accumulate the C bit into a long if NZ. A JMP can be done on Z for SE0/SE1 to be tested.
What was an issue is the time to read the bits, determine their polarity (either J or K) or otherwise jmp out to test for SE0/SE1. Also unstuffing needs to be done per bit time. Due to bit unstuffing, I have always thought that this would need to be done in sw.
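The per-bit work described above (classify the pin pair, turn the line level into a data bit, drop stuffed bits) can be modelled in a few lines of Python. This is only a sketch of the logic, not the proposed instruction: Z and C are modelled as booleans taken from the LUT table above, one sample per bit time is assumed, and the full-speed convention that J is D+ high is assumed for the test data.

```python
# (D-, D+) -> (X0, X1), copied from the table above.
LINE_LUT = {(0, 0): (0, 0), (0, 1): (1, 1), (1, 0): (1, 0), (1, 1): (0, 1)}


def decode_pins(d_minus, d_plus):
    """Return (Z, C): Z is set for SE0/SE1, C is set when D+ = 1."""
    x0, x1 = LINE_LUT[(d_minus, d_plus)]
    return x0 == 0, x1 == 1


def receive_bits(samples, idle_level=True):
    """NRZI-decode and unstuff one sample per bit time until SE0/SE1 is seen."""
    bits, prev, ones = [], idle_level, 0
    for d_minus, d_plus in samples:
        z, c = decode_pins(d_minus, d_plus)
        if z:
            break                      # SE0/SE1: jump out of the bit loop
        bit = 1 if c == prev else 0    # NRZI: no transition = 1, transition = 0
        prev = c
        if ones == 6:                  # the bit after six ones is a stuffed 0
            if bit:
                raise ValueError("bit-stuffing violation")
            ones = 0
            continue
        bits.append(bit)
        ones = ones + 1 if bit else 0
    return bits


J, K, SE0 = (0, 1), (1, 0), (0, 0)     # full-speed levels as (D-, D+)
print(receive_bits([K, J, K, J, K, J, K, K, SE0]))   # SYNC: [0, 0, 0, 0, 0, 0, 0, 1]
```

The list append plays the role of the RCL/RCR accumulate step, and the early exit on Z corresponds to the JMP used to test for SE0/SE1.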
The initial frames use CRC5 and can be calculated easily as most of the string can be precalculated. IIRC these are not byte multiples.
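For reference, a bit-serial CRC5 over the 11-bit token field (7-bit address plus 4-bit endpoint, sent LSB first) can be sketched as below. The names are made up for illustration, and no particular CRC value is claimed here. The sketch only checks itself through the residual property: running the data plus its own CRC through the register always leaves the same value.

```python
def _crc5_register(value, nbits):
    """Shift nbits of value (LSB first) through the CRC5 register.
    Polynomial x^5 + x^2 + 1 in reflected form (0x14), register seeded with ones."""
    reg = 0x1F
    for i in range(nbits):
        bit = (value >> i) & 1
        reg = (reg >> 1) ^ 0x14 if (reg ^ bit) & 1 else reg >> 1
    return reg


def crc5_usb(value, nbits=11):
    """CRC5 of an 11-bit token field; the inverted remainder is what gets sent."""
    return _crc5_register(value, nbits) ^ 0x1F


def token_field(addr, endp):
    """Pack a 7-bit address and 4-bit endpoint in transmission order."""
    return (addr & 0x7F) | ((endp & 0x0F) << 7)


def crc5_residual(value, nbits=11):
    """Register contents after receiving the field followed by its CRC."""
    full = value | (crc5_usb(value, nbits) << nbits)
    return _crc5_register(full, nbits + 5)


field = token_field(0x15, 0xE)
print(crc5_residual(field) == crc5_residual(token_field(0x3A, 0xA)))   # True
```

Because the field is only 11 bits, a device that knows its own address can precalculate the CRC for each endpoint it supports, which fits the remark above that most of the string can be precalculated.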
The data frames use CRC16 (IBM not CCITT). While a lookup table can be used, there was some timing issue that caused problems. Perhaps the LUT will overcome this.
I found an excellent article that describes the frames a long time ago. I will look for it over the w/e.
The higher level code can be done in sw. The reply needs to take place IIRC in 16 12MHz clocks.
One other thing I noticed while going through that last exercise was that it feels like there is a mismatch between the inputs and the opcodes. Here's what I mean by that:
In the example above, I was only concerned with A-current and A-previous (B-current and B-previous were "don't care"). As a result, I ended up with 4 possible input conditions. As it turned out, this perfectly coincided with the 4 opcode slots. However, X[15:0] and Y[15:0] were 75% redundant (the same 4-bit pattern repeated 4 times). In other words, I could have done the same mapping with X[3:0] and Y[3:0] and a single bit indicating whether I was using A or B input.
Note: In the above example, I used the 2-bit, 1-pattern mode and did not switch states. I was also able to implement the example using the 1-bit, 2-pattern mode (with state switching). Either way, I did not need any more than X[3:0] and Y[3:0].
On the other hand, it's reasonable to assume that there are examples where you would use both A and B inputs, resulting in up to 16 possible conditions (32 if you include the state bit). However, you are still limited to only 4 opcodes (8, if you are using the 2-bit, 1-pattern mode with state changes). All of these combinations must result in the selection of only 1 of 4 possible opcodes. And if a number of those combinations are NOPs, then you are left with only three available opcodes for all of the other combinations. Actually, it could be as little as two opcodes if you are using the 1-bit, 2-pattern mode and need a NOP for both states. Ideally, there would be a way to have the NOP be an implied opcode. Or, put another way, it would be nice if there were a way of stating input combinations as NOP or "don't care" without using an opcode slot.
And if you just happen to need to react to every possible input combination, it will only work if they can all be mapped to the same 4 opcodes. Now, it might be possible for some scenarios to judiciously reprogram the state machine on the fly, thereby extending the number of available opcodes. But I suspect this would be a very complicated affair.
In the end, though, no matter which way you go, you are going to end up either under-utilizing part of the state machine modes, or over-utilizing them. I suspect that there are very few, if any, cases where you will be able to ideally utilize them.
Unbelievable! So that code has been lying around for SIX YEARS?
I think this proves that USB is not just a two-week project. Certainly USB is more than just getting the physical layer working. It seems a nightmare of protocol messages and complexity.
I have just read her résumé, and not only is she an expert on the USB protocol, she is also skilled with the Propeller and FPGAs. I wonder why on earth Parallax didn't ask Micah (scanlime) earlier to help improve USB on the P1v or P2? Just send her a free Prop123 FPGA board!
PS: she also made a high level USB protocol analyzer: vusb-analyzer.sourceforge.net
I have also read the whole thread, and there is one thing that caught my attention: DELAY LINES.
Micah said that this was an option to speed up signal receiving.
Chip, is it possible to implement programmable delay lines on smart pins?
The smart pin would monitor an IO pin. With 8, 16, or 32 taps it would write to a BYTE, WORD or LONG. A register would be used to program the DELAY LINE (e.g. from 1 to 10 ns in 0.2ns steps, or whatever is needed).
We could have taps, but they would have to be at the clock period, which is currently 12.5ns at 80MHz. A USB bit period is 83.33ns, so 6.66 taps would be equal to one USB bit period. I don't know what it would help, for USB, anyway. Do you see where Micah was talking about this?
The more I think about the state machine modes, the more I come to the following conclusion: there are three possible ways forward, none of which include leaving the modes as they are.
Option 1:
Simplify the state machines considerably. Only operate on A input (or B input, if selectable), with the 4 possible combinations directly selecting one of the 4 possible opcodes. No need for X[15:0] or Y[15:0]. No need for state change. This also means, no need for state change bit in the opcodes or state change configuration fields. Repurpose all of that to make the remaining stuff more robust. You might even be able to fully (and easily) reproduce some of the other pin cell modes at that point.
Option 2:
Go big! Make the state machine even more capable, essentially making it possible to do most or all of the other pin cell modes entirely with a state machine. This would require, for instance, the ability to fully map all A/B combinations to unique opcodes, plus several other changes. While this will certainly require more work to do any of the existing pin modes, it will also be much more capable of handling new modes that we haven't even thought of yet. In essence, this requires a rewrite of the smart pin/cell to be nothing but a configurable state machine. Unfortunately, there is a great deal of risk in this, both in terms of time and gate cost.
Option 3:
Get rid of the state machines altogether. My concern is that, as they currently stand, they will be a minor niche player in the overall smart pin/cell story. However, because they look like they should be capable of playing a much stronger role, people are going to unsuccessfully try to make them do more than they can actually do. This will end up frustrating users and giving an overall bad impression of the smart pin/cell capability. In other words, it may be better to cut the state machines altogether than to have them negatively impact perceived capabilities of the P2.
We could have taps, but they would have to be at the clock period, which is currently 12.5ns at 80MHz. A USB bit period is 83.33ns, so 6.66 taps would be equal to one USB bit period. I don't know what it would help, for USB, anyway. Do you see where Micah was talking about this?
scanlime, April 2010:
Hanno said...
Good progress Micah!
I almost forgot about a thought I had to reduce cog usage (I looked at your code before bedtime, had the thought, and forgot about it the next couple days- see what you think)
You currently use 2 cogs to receive one bit at a time every 2 instructions. After receiving 16 bits with one cog that cog has some time to write the data to hub.
Using the "mov x,ina" instruction, you can read multiple bits at the same time- provided that they're waiting for you on the Propeller's IO pins. Using delay lines with multiple pins, you can use this trick to read multiple bits at the same time. I would start with reading data into the cog's ram and spooling it back to hub ram when the receive is finished.
Good luck!
Hanno
Thanks!
Delay lines would definitely help trade cogs for pins. But part of the fun IMHO is to do this with no external active components. If I'm going to buy a delay line chip, might as well make it a USB host controller chip :)
I have also read the whole thread, and there is one thing that caught my attention: DELAY LINES.
Micah said that this was an option to speed up signal receiving.
Chip, is it possible to implement programmable delay lines on smart pins?
The smart pin would monitor an IO pin. With 8, 16, or 32 taps it would write to a BYTE, WORD or LONG. A register would be used to program the DELAY LINE (e.g. from 1 to 10 ns in 0.2ns steps, or whatever is needed).
We could have taps, but they would have to be at the clock period, which is currently 12.5ns at 80MHz. A USB bit period is 83.33ns, so 6.66 taps would be equal to one USB bit period. I don't know what it would help, for USB, anyway. Do you see where Micah was talking about this?
I asked for it to be programmable because delay lines are also basic building blocks for high-speed signal decoding (CMI, Coded Mark Inversion, used in SDH/SONET). This would open the doors to 'telco heaven' for the P2. Dreaming of a P2 decoding an STM-16 signal.
But don't put too much effort on this. It actually might not be so easy. The currently available high speed delay line ICs (nanosecond range) are sold in big TQFP-32 packages at US $12 each. Only PECL or LVPECL.
Going back to the USB topic:
Post by Hanno that started the DELAY LINE discussion:
http://forums.parallax.com/discussion/comment/896924/#Comment_896924
"You currently use 2 cogs to receive one bit at a time every 2 instructions. After receiving 16 bits with one cog that cog has some time to write the data to hub.
Using the "mov x,ina" instruction, you can read multiple bits at the same time- provided that they're waiting for you on the Propeller's IO pins. Using delay lines with multiple pins, you can use this trick to read multiple bits at the same time"
http://forums.parallax.com/discussion/comment/896937/#Comment_896937
"Delay lines would definitely help trade cogs for pins. But part of the fun IMHO is to do this with no external active components. If I'm going to buy a delay line chip, might as well make it a USB host controller chip"
Hanno actually asked to use delay lines on multiple input pins on the propeller.
The idea we can actually implement is to have one single pin with multiple internal delay-line taps that feed a register with a BYTE, WORD, or LONG. This would have the same effect as having a system clock running x8, x16 or x32 faster!! Will this work? Amazing.
The more I think about the state machine modes, the more I come to the following conclusion: there are three possible ways forward, none of which include leaving the modes as they are.
Option 1:
Simplify the state machines considerably. Only operate on A input (or B input, if selectable), with the 4 possible combinations directly selecting one of the 4 possible opcodes. No need for X[15:0] or Y[15:0]. No need for state change. This also means, no need for state change bit in the opcodes or state change configuration fields. Repurpose all of that to make the remaining stuff more robust. You might even be able to fully (and easily) reproduce some of the other pin cell modes at that point.
Option 2:
Go big! Make the state machine even more capable, essentially making it possible to do most or all of the other pin cell modes entirely with a state machine. This would require, for instance, the ability to fully map all A/B combinations to unique opcodes, plus several other changes. While this will certainly require more work to do any of the existing pin modes, it will also be much more capable of handling new modes that we haven't even thought of yet. In essence, this requires a rewrite of the smart pin/cell to be nothing but a configurable state machine. Unfortunately, there is a great deal of risk in this, both in terms of time and gate cost.
Option 3:
Get rid of the state machines altogether. My concern is that, as they currently stand, they will be a minor niche player in the overall smart pin/cell story. However, because they look like they should be capable of playing a much stronger role, people are going to unsuccessfully try to make them do more than they can actually do. This will end up frustrating users and giving an overall bad impression of the smart pin/cell capability. In other words, it may be better to cut the state machines altogether than to have them negatively impact perceived capabilities of the P2.
I agree. I started making new 6-bit mode codes, with room for word-size control for the serial modes, and I removed the programmable mode, already.
You know, what you said about only using a-current and b-current is right on, because you can then use STATES to make up for the loss. Then, you're only dealing with 4-bit rule patterns, instead of 16-bit rule patterns. Or, 8-bit 4-way patterns. It's true that the current programmable mode would just frustrate people. The fixed modes are great, because there's almost nothing to learn. You just use them and they pleasantly work.
Simplify the state machines considerably. Only operate on A input (or B input, if selectable), with the 4 possible combinations directly selecting one of the 4 possible opcodes. No need for X[15:0] or Y[15:0]. No need for state change. This also means, no need for state change bit in the opcodes or state change configuration fields. Repurpose all of that to make the remaining stuff more robust. You might even be able to fully (and easily) reproduce some of the other pin cell modes at that point.
Actually, if you kept the state feature, you could do 8 directly-mapped opcodes, and still have 12 bits left over for additional configuration:
X[4:0] opcode for input %00, state 0
X[9:5] opcode for input %01, state 0
X[14:10] opcode for input %10, state 0
X[19:15] opcode for input %11, state 0
X[25:20] UNUSED
X[31:26] existing options
Y[4:0] opcode for input %00, state 1
Y[9:5] opcode for input %01, state 1
Y[14:10] opcode for input %10, state 1
Y[19:15] opcode for input %11, state 1
Y[25:20] UNUSED
Y[31:26] existing options
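The field layout above can be sketched as a small packing helper. This is only an illustration in Python, assuming 5-bit opcodes at the bit positions listed; `pack_opcodes`, `select_opcode` and the opcode numbers are hypothetical names and values, and the meaning of bits 31:20 is left to the caller.

```python
def pack_opcodes(opcodes, upper_bits=0):
    """Pack four 5-bit opcodes (for inputs %00, %01, %10, %11) into bits 19:0.

    upper_bits lands in bits 31:20 (the unused field plus the existing
    options); its meaning is not defined in this sketch.
    """
    if len(opcodes) != 4:
        raise ValueError("expected one opcode per input combination")
    if not 0 <= upper_bits < (1 << 12):
        raise ValueError("upper_bits must fit in 12 bits")
    word = upper_bits << 20
    for index, opcode in enumerate(opcodes):
        if not 0 <= opcode < 32:
            raise ValueError("opcode must fit in 5 bits")
        word |= opcode << (5 * index)
    return word


def select_opcode(x, y, state, inputs):
    """Return the opcode chosen by the state bit (X or Y) and the 2-bit input."""
    word = y if state else x
    return (word >> (5 * inputs)) & 0x1F


# Hypothetical opcode numbers, purely for illustration.
X = pack_opcodes([1, 2, 3, 4])   # state 0
Y = pack_opcodes([5, 6, 7, 8])   # state 1
```

With this layout the state bit simply picks X or Y, so all 8 state/input combinations get their own opcode slot.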
Even if there is no HW support, CRC-16 can potentially be achieved with the COG's 256-entry LUT. I think the old hot P2 could do the CRC accumulation on each byte in 5 clocks, within an existing byte processing loop for example. Might be faster on the new P2, haven't checked.
There were lots of older USB-related discussions here, but some of them assumed extra instructions and others used pure software methods, etc.
http://forums.parallax.com/discussion/154509/p2-and-full-speed-usb-slave-requirements-ideas
All those measuring and timing modes are basically made redundant by the programmable state mode, and vice-versa. I like the fixed modes, because they don't require any setup. You just use them.
I had been meaning to ask you that very question. The obvious next question then becomes, what is the ALM count with only the programmable mode existing?
PS: I don't see why a bunch of macros, or similar, would be out of the question when it comes to setting the programmable mode for each of the regular uses. And the documentation for configured data flow would be no different to your existing docs.
Those dedicated modes are actually a little better, but I didn't realize when I typed that. They can count time, in addition to states and events. The programmable mode uses up X and Y just for configuration, so there are no registers left for a reloadable 32-bit counter to track time.
If we could identify what the macros need to be, we could do that, but that's another month of development, probably.
Sounds useful.
It should be possible/simplest to get USB RX working first, sniffing in parallel with a connected USB device.
Even just a WAIT SE0 to toggle a pin and resync a 12MHz NCO (or restart a SPI Sync Rx) should give some Frame and data connections.
SE0 => resync is simple in existing SW/HW as a starting base, but keeping the data within 25% of a bit period over a larger packet needs around 20ppm. That's not hard if you control both clocks, but a little tight for a PC oscillator.
Smaller packets would be OK, e.g. 128 bytes stays within 25% of a bit at ~250ppm, which is a more typical figure.
This could be a good test for an NCO-sync'd BAUD clock.
The WAIT SE0 can capture the time for a frame, and I think the PC clock is 12000 x the FrameSync frequency.
The SysCLKs counted in that time can be used to adjust the NCO to give 12000 overflows per frame, which gives a SW-DPLL before edge-resync is implemented.
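The arithmetic behind those figures can be sketched in a few lines of Python. This is a rough check, not a design: it assumes a 1 ms frame, a 32-bit NCO accumulator and an 80MHz system clock, ignores bit stuffing, and the function names are made up for illustration.

```python
BITS_PER_FRAME = 12_000       # 1 ms frame at 12 Mbit/s (full speed)


def max_clock_error_ppm(packet_bytes, max_slip_bits=0.25):
    """Largest clock mismatch (ppm) that keeps total slip under max_slip_bits.

    Ignores bit stuffing, sync and EOP, so the result is slightly optimistic.
    """
    bits = packet_bytes * 8
    return max_slip_bits / bits * 1e6


def nco_increment(sysclks_per_frame, accumulator_bits=32):
    """NCO addend that overflows BITS_PER_FRAME times in one measured frame."""
    return round(BITS_PER_FRAME * (1 << accumulator_bits) / sysclks_per_frame)


# 128-byte packet: about 244 ppm, in line with the ~250ppm quoted above.
tolerance = max_clock_error_ppm(128)

# If exactly 80_000 SysCLKs were counted between two frame markers:
increment = nco_increment(80_000)
```

Measuring `sysclks_per_frame` on every frame and recomputing the addend is the software DPLL idea; resyncing on edges would then only have to correct the residual phase error.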
The hardware side is not too overwrought, so you can focus on that and just let SW be SW.
Do you think that USB analyzer would be particularly helpful, or would a scope be sufficient? Maybe no need to spend $400.
They may not do fancy things, but the simple things they can do will be very useful.
Some applications that come to mind:
- Fault state shut down of a PWM signal
- make a larger pulse out of a very small one
- divide a clock
- make a phase comparator for a PLL
- Add a deadband to a PWM
- add a complementary output
- sample an input until a trigger on another pin
- decimate a clocked SigmaDelta signal
- make a shifter input with 1..32 bits
and so on
These are all things that will be hard to do in software and are not covered by the dedicated modes.
Yes, there are some features missing. A second counter, for example, to count the shifted-in bits if you need the StateCounter for something else. Or a way to reset the StateCounter to make a retriggerable monoflop.
A bit counter could be done by counting SIGNAL commands just like the StateCounter counts NEXT-State commands.
Maybe just 1 and 8 will be enough: Raise the INx at every SIGNAL command or raise it after 8 SIGNAL commands. This needs only 1 additional config bit.
By combining the State- and the Signal counter you can raise the INx after 32*8 = 256 events.
Andy
Those are good ideas.
One thing to consider: inputs are clocked, so no asynchronous phenomena can occur.
I'm redoing the M register layout now to accommodate serial modes with a bit-count field:
0 = 6 bits
1 = 7 bits
2 = 8 bits
3 = 9 bits
4 = 10 bits
5 = 16 bits
6 = 24 bits
7 = 32 bits
Having five bits for 1..32 is just too costly in terms of mode bits.
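As a quick illustration of that 3-bit field, here is a lookup sketch in Python. The table values come straight from the list above; the names `WORD_SIZES`, `word_size` and `field_for` are hypothetical.

```python
# 3-bit field value -> serial word size in bits, as listed above.
WORD_SIZES = (6, 7, 8, 9, 10, 16, 24, 32)


def word_size(field):
    """Decode the 3-bit bit-count field into a word size in bits."""
    if not 0 <= field <= 7:
        raise ValueError("bit-count field is 3 bits wide")
    return WORD_SIZES[field]


def field_for(bits):
    """Encode a word size; raises ValueError for sizes the field cannot express."""
    return WORD_SIZES.index(bits)
```

Sizes such as 12 bits have no encoding here, which is the trade-off against spending five mode bits on a full 1..32 range.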
I worked out the bit receiving loop and a bit of hw help with 2 instructions would go a long way to making it easy to do.
I don't believe a fully compliant USB is required. There are a lot of non-compliant USB implementations that work reliably.
I posted all the info including a sample receive routine in the old P2 threads years ago. I have been meaning to get it running properly on the P1 but so far I haven't found the time. My latest P1 boards have the hw interface built in ready to go.
Currently I am working on my P1 PropOS to complete the SD Driver as stay-resident and also complete the conversion of the final part of Michael Park's Sphinx propeller compiler (LEX and LINK work, but CODEGEN still has problems). Sphinx is FAT16 only and requires conversion to use Kye's FAT16/32 Driver (which is what I use in PropOS).
Once this is complete, I can have another look at USB FS.
I wasn't planning on using anything special for debugging USB FS. I can use my RamBlade to log up to 512K USB pin samples (up to 8bits/pins) to the external SRAM.
Cluso, do you remember how many discrete signaling states there are in USB full-speed and low-speed? I'm allocating mode codes and I'm wondering how many discrete things we need to be accommodating. Is this a sensible list?
- byte transmit, or packet transmit?
- byte receive, or packet receive?
- low, low
- high, high
- generate ACK
- generate NAK
Some configuration packets have a 5-bit CRC, and data packets have a 16-bit CRC, right? Should any of those CRCs be computed in hardware? Is there any quick-turnaround response required that would necessitate hardware?
I wonder where Michael Park is these days.
http://www.usbmadesimple.co.uk/ums_3.htm
I see now what needs to be done. It's pretty straightforward.
Is it appropriate to leave all CRC computation to the cog? I think someone said that they had that working in 5 clocks per byte on Prop2 Hot, which means 10 clocks on this architecture.
It looks like sending can be done by writing 9-bit values via PINSETY, where if bit 8 is cleared, it means data, and if bit 8 is set it means just do the SE0 for two clocks, then J, then quit driving the two lines (EOP, or end-of-packet). These 9-bit transmit values will be double-buffered, so that they transmit back-to-back without delays. IN will signal when it can accept another byte.
Receiving is a little more complex, because some status must be conveyed at times when there's no data.
Maybe receive-data/report-line-status can be the default state and when you want to transmit, you just do two initial PINSETY's to get things started and double-buffered, and you give another byte on every IN high. Then, when you give it a bit-8-high value it will do the EOP signaling. After that, it returns to receive-data/report-line-status mode. This smart pin mode will need to control two pins' pin-level DIR's.
Does this sound viable? Did I miss anything?
P.S. GETPINZ will return any received byte in bits 7..0, while the upper bits will contain status information, like the current state and whether the byte is new since the last IN rise.
States:
J detected (idle state) - raises IN
K detected (wake-up) - raises IN
SE0 detected (unplugged if host or periodic keep-alive signal if non-host) - raises IN
SE1 detected (illegal condition) - raises IN
byte received (available in Z[7:0] via GETPINZ)- raises IN
byte sent (ready for another, cancels transmit if timeout) - raises IN
Maybe those 9-bit PINSETY codes need to be expanded:
$000..$0FF = transmit byte
$100 = enter idle state
$101 = transmit EOP, then enter idle state
$102 = output SE0 (issue keep-alive)
$103 = output K (issue wake-up)
(no need for generate J, as it's same as idle state, which is undriven)
I think we can just do an EOP whenever the transmit buffer runs dry. Maybe no need to command it.
This means there's only one smart pin mode code.
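To make the proposed 9-bit code space concrete, here is a small Python sketch of how a driver might classify the values. It assumes exactly the assignments listed above and nothing more; `describe_tx_code` and `packet_to_codes` are hypothetical helpers, not part of any existing tool.

```python
# Command codes as proposed above; values $000..$0FF carry a data byte.
TX_COMMANDS = {
    0x100: "enter idle state",
    0x101: "transmit EOP, then enter idle state",
    0x102: "output SE0 (issue keep-alive)",
    0x103: "output K (issue wake-up)",
}


def describe_tx_code(code):
    """Explain what a 9-bit PINSETY value would ask the smart pin to do."""
    if not 0 <= code <= 0x1FF:
        raise ValueError("value does not fit in 9 bits")
    if code <= 0xFF:
        return "transmit byte $%02X" % code
    if code in TX_COMMANDS:
        return TX_COMMANDS[code]
    raise ValueError("unassigned command code $%03X" % code)


def packet_to_codes(packet_bytes):
    """Data bytes followed by an explicit EOP command."""
    return list(packet_bytes) + [0x101]
```

If EOP is generated automatically when the transmit buffer runs dry, as suggested above, the trailing `$101` in `packet_to_codes` would simply be dropped.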
http://akbar.marlboro.edu/~mahoney/support/alg/alg/node186.html
USB 2.0 Protocol Analyzer
http://www.ellisys.com/products/usbex200/
http://www.ellisys.com/company/press.php
http://www.ellisys.com/archive/images/usbex200.gif
It's funny you should bring this up. Just yesterday, I was seeing what it would take to implement a couple of the other modes using the state machine modes. As you note, in many cases, the dedicated modes are better. On the other hand, there are some places where the state machine can effectively do the same work. One example is mode %10010 (time A-input high states). This can be implemented with either state mode (I found the 2-bit 1-pattern mode easier for this example). However, there are a few differences:
* In the dedicated mode, timeout is indicated by a value of zero. In state machine mode, timeout is indicated by saturation.
* In the dedicated mode, count starts at one. In state machine mode, count starts at zero.
As you can see, the differences are relatively minor. One thing that was not minor, however, was getting the state machine configurations figured out! Maybe this would get easier with time/practice. However, I see a simple GUI tool in my future for creating these.
The bold text, I believe, defines the software pinch point for full-speed USB, does it not?
We have up to 7.5 bit periods (625ns, or 50 clocks at 80MHz) to get a response headed back to the host.
This may be tight, especially if CRC checking is involved. Is this realistic to do? Back in the early days of USB 1.1 when MCU's ran at only a few MHz, you can imagine how this response mechanism HAD TO BE in hardware. What a pain! I think we could skate around this.
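The budget quoted above is easy to re-derive. A minimal Python sketch, assuming the 12MHz full-speed bit rate and the 80MHz system clock mentioned in this thread (`turnaround_budget` is a made-up name):

```python
def turnaround_budget(bit_times=7.5, bit_rate_hz=12_000_000, sysclk_hz=80_000_000):
    """Return (nanoseconds, system clocks) available within the given bit times."""
    nanoseconds = bit_times / bit_rate_hz * 1e9
    clocks = bit_times * sysclk_hz / bit_rate_hz
    return nanoseconds, clocks


ns, clocks = turnaround_budget()   # about 625 ns, 50 clocks
```

At roughly 10 clocks per byte for CRC accumulation, only the final byte's update has to fit inside this window if the CRC is accumulated while the packet is still arriving.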
For LS J=10 and K=01 and it's the reverse for FS.
IIRC SE0=00 and SE1=11.
It does not matter which way around the two pins are, as they get reversed between LS & FS.
What would make this nice is the ability to read two adjacent pins that are fed into a 2x2 LUT with the two outputs set as follows: if this could be done by an instruction, then Z could be set for SE0/SE1 and C set for D+=1.
This would permit sw RCL/RCR to accumulate the C bit into a long if NZ. A JMP can be done on Z for SE0/SE1 to be tested.
What was an issue is the time to read the bits, determine their polarity (either J or K) or otherwise jmp out to test for SE0/SE1. Also unstuffing needs to be done per bit time. Due to bit unstuffing, I have always thought that this would need to be done in sw.
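The per-bit work described above (classify the pin pair, NRZI-decode, drop stuffed bits) can be sketched in Python. This is an illustration only, assuming one clean sample per bit time and a line that idles at J. Per the USB specification, full-speed J has D+ high and D- low and low-speed is the opposite, so the 2-bit values quoted above depend on which pin is written first. The function names are hypothetical.

```python
def line_state(dp, dm, full_speed=True):
    """Classify one sample of the D+/D- pair as J, K, SE0 or SE1."""
    if dp == dm:
        return "SE1" if dp else "SE0"
    # Full speed: J is D+ high. Low speed: J is D- high.
    return "J" if bool(dp) == full_speed else "K"


def nrzi_decode_unstuff(states):
    """Turn per-bit 'J'/'K' samples into data bits.

    NRZI: no transition means 1, a transition means 0. After six
    consecutive 1s the transmitter inserts a 0, which is dropped here.
    Decoding stops at the first SE0/SE1 sample.
    """
    bits = []
    previous = "J"          # assumed idle level before the first sample
    ones = 0
    for state in states:
        if state not in ("J", "K"):
            break
        bit = 1 if state == previous else 0
        previous = state
        if ones == 6:
            if bit:
                raise ValueError("bit-stuffing violation")
            ones = 0        # stuffed 0: discard it
            continue
        bits.append(bit)
        ones = ones + 1 if bit else 0
    return bits


sync = nrzi_decode_unstuff(list("KJKJKJKK"))   # the SYNC field
```

In a cog, `line_state` is the part the proposed 2x2 LUT instruction would replace, leaving only the shift and the stuffing counter in the bit loop.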
The initial frames use CRC5 and can be calculated easily as most of the string can be precalculated. IIRC these are not byte multiples.
The data frames use CRC16 (IBM not CCITT). While a lookup table can be used, there was some timing issue that caused problems. Perhaps the LUT will overcome this.
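For reference, both CRCs can be sketched in Python as below. This assumes the usual USB parameters (CRC5 polynomial x^5+x^2+1, CRC16 polynomial x^16+x^15+x^2+1, all-ones seed, inverted result, bits taken LSB first); the 256-entry table mirrors what a cog LUT would hold. Function names are illustrative only.

```python
def crc5_usb(value, nbits=11):
    """CRC5 over the low nbits of value, taken LSB first (token bit order)."""
    crc = 0x1F
    for i in range(nbits):
        if (crc ^ (value >> i)) & 1:
            crc = (crc >> 1) ^ 0x14      # x^5 + x^2 + 1, bit-reversed
        else:
            crc >>= 1
    return crc ^ 0x1F


def token_field(address, endpoint):
    """The 16 bits after the PID in a token: ADDR[6:0], ENDP[3:0], CRC5."""
    payload = (address & 0x7F) | ((endpoint & 0x0F) << 7)
    return payload | (crc5_usb(payload) << 11)


def _make_crc16_table():
    """Build the 256-entry table for the bit-reversed polynomial $A001."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
        table.append(crc)
    return table


CRC16_TABLE = _make_crc16_table()


def crc16_usb(data):
    """Table-driven CRC16 over a data payload, one lookup per byte."""
    crc = 0xFFFF
    for byte in data:
        crc = (crc >> 8) ^ CRC16_TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFF
```

`token_field(0, 0)` gives `$1000`, i.e. the bytes `00 10` seen after the PID in a token to address 0, endpoint 0, and an empty payload gives a CRC16 of `$0000`. Since address and endpoint rarely change, the CRC5 can be computed once and stored, as noted above.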
I found an excellent article that describes the frames a long time ago. I will look for it over the w/e.
The higher level code can be done in sw. The reply needs to take place IIRC in 16 12MHz clocks.
In the example above, I was only concerned with A-current and A-previous (B-current and B-previous were "don't care"). As a result, I ended up with 4 possible input conditions. As it turned out, this perfectly coincided with the 4 opcode slots. However, X[15:0] and Y[15:0] were 75% redundant (the same 4-bit pattern repeated 4 times). In other words, I could have done the same mapping with X[3:0] and Y[3:0] and a single bit indicating whether I was using A or B input.
Note: In the above example, I used the 2-bit, 1-pattern mode and did not switch states. I was also able to implement the example using the 1-bit, 2-pattern mode (with state switching). Either way, I did not need any more than X[3:0] and Y[3:0].
On the other hand, it's reasonable to assume that there are examples where you would use both A and B inputs, resulting in up to 16 possible conditions (32 if you include the state bit). However, you are still limited to only 4 opcodes (8, if you are using the 2-bit, 1-pattern mode with state changes). All of these combinations must result in the selection of only 1 of 4 possible opcodes. And if a number of those combinations are NOPs, then you are left with only three available opcodes for all of the other combinations. Actually, it could be as little as two opcodes if you are using the 1-bit, 2-pattern mode and need a NOP for both states. Ideally, there would be a way to have the NOP be an implied opcode. Or, put another way, it would be nice if there were a way of stating input combinations as NOP or "don't care" without using an opcode slot.
And if you just happen to need to react to every possible input combination, it will only work if they can all be mapped to the same 4 opcodes. Now, it might be possible for some scenarios to judiciously reprogram the state machine on the fly, thereby extending the number of available opcodes. But I suspect this would be a very complicated affair.
In the end, though, no matter which way you go, you are going to end up either under-utilizing part of the state machine modes, or over-utilizing them. I suspect that there are very few, if any, cases where you will be able to ideally utilize them.
Unbelievable! So that code has been lying around for SIX YEARS?
I think this proves that USB is not just a two weeks project. Certainly USB is not just getting the physical layer working. It seems a nightmare of protocol messages and complexity.
I have just read her résumé: not only is she an expert on the USB protocol, she is also skilled with the Propeller and FPGAs. I wonder why on earth Parallax didn't ask Micah (scanlime) earlier to help improve USB on P1v or P2. Just send her a free Prop123 FPGA board!
PS: she also made a high level USB protocol analyzer: vusb-analyzer.sourceforge.net
Micah said that this was an option to speed up signal receiving.
Chip, is it possible to implement programmable delay lines on smart pins?
The smart pin will monitor an I/O pin. With 8, 16, or 32 taps it will write to a BYTE, WORD or LONG. A register will be used to program the DELAY LINE (e.g. from 1 to 10 ns in 0.2 ns steps, or whatever is needed).
Universal Serial Bus Specification (Compaq, Intel, Microsoft, NEC)
We could have taps, but they would have to be at the clock period, which is currently 12.5ns at 80MHz. A USB bit period is 83.33ns, so 6.66 taps would be equal to one USB bit period. I don't know what it would help, for USB, anyway. Do you see where Micah was talking about this?
scanlime Posts: 106
April 2010 edited April 2010
Hanno said...
Good progress Micah!
I almost forgot about a thought I had to reduce cog usage (I looked at your code before bedtime, had the thought, and forgot about it the next couple days- see what you think)
You currently use 2 cogs to receive one bit at a time every 2 instructions. After receiving 16 bits with one cog that cog has some time to write the data to hub.
Using the "mov x,ina" instruction, you can read multiple bits at the same time- provided that they're waiting for you on the Propeller's IO pins. Using delay lines with multiple pins, you can use this trick to read multiple bits at the same time. I would start with reading data into the cog's ram and spooling it back to hub ram when the receive is finished.
Good luck!
Hanno
Thanks!
Delay lines would definitely help trade cogs for pins. But part of the fun IMHO is to do this with no external active components. If I'm going to buy a delay line chip, might as well make it a USB host controller chip :)
From Here:
http://forums.parallax.com/discussion/121321/working-full-speed-12-mb-s-bit-banging-usb-host-controller/p2
I asked for it to be programmable because delay lines are also basic building blocks for high-speed signal decoding (CMI, Coded Mark Inversion, used in SDH/SONET). This would open the doors for P2 to 'telco heaven'. I'm dreaming of a P2 decoding an STM-16 signal.
But don't put too much effort on this. It actually might not be so easy. The currently available high speed delay line ICs (nanosecond range) are sold in big TQFP-32 packages at US $12 each. Only PECL or LVPECL.
Going back to the USB topic:
Hanno actually asked about using delay lines on multiple input pins on the Propeller.
The idea we could actually implement is to have a single pin with multiple internal delay-line taps that feed a register with a BYTE, WORD, or LONG. This would have the same effect as having a system clock running 8, 16, or 32 times faster! Will this work? Amazing.
Other posts worth checking:
http://forums.parallax.com/discussion/comment/894864/#Comment_894864
http://forums.parallax.com/discussion/comment/894877/#Comment_894877
http://forums.parallax.com/discussion/comment/894888/#Comment_894888
http://forums.parallax.com/discussion/comment/894968/#Comment_894968
http://forums.parallax.com/discussion/comment/895258/#Comment_895258
http://forums.parallax.com/discussion/comment/895900/#Comment_895900
http://forums.parallax.com/discussion/comment/896518/#Comment_896518
Ah, I see now what she meant. In our case, the shift register IS the delay line, if you think about it.
Actually, if you kept the state feature, you could do 8 directly-mapped opcodes. and still have 12 bits left over for additional configuration: