High speed prop-to-prop communication, WAITPEQ timing
ManAtWork
Posts: 2,176
in Propeller 1
Hello,
I'm designing an industrial machine controller that has to deal with multiple servo drives and lots of IO channels. As I'd run out of pins with only one prop anyway, I decided to use one slave prop per servo axis and one master prop doing the trajectory planning. And BTW, as I'd also run out of memory, I decided to use an Intel NUC for the GUI and as file/network server.
Each slave prop sits on its own PCB. All slave PCBs are stacked on top of the master PCB with daisy-chained pin headers. The maximum board-to-board distance is <8" with good grounding and no high currents nearby, so I don't expect noise problems. I have fewer than 8 pins left over, so I don't think that parallel/multi-bit transmission would give any advantage.
I've seen a trick somewhere where the PHSA register is used to output a serial data stream at 20Mbit/s. But the highest data rate a propeller could receive is 10Mbit/s, because shifting in one bit always takes two instructions (TEST and RCL), 4 clocks each. If all propellers run from the same 5MHz crystal, there is no need to transmit an extra serial bus clock. As all slaves have to be served exactly once per time slice, there is also no need to transmit any address information. Each slave knows its time slot. So I think I'll only need two IO pins per prop, one "start of timeslice" strobe and a (bidirectional) data line. If I can afford two cogs for the communication, I could do it full-duplex and transmit downstream and upstream data at the same time over two bus lines. But I don't think that'll be necessary.
So the (peak) data rate is 10Mbit/s or 8 clocks per bit. I send one start bit followed by 32 data bits. Now the question:
How do I know the exact timing of the WAITPEQ command? Given...
WAITPEQ dataInPin,dataInPin ' wait for start bit
TEST dataInPin,INA wc
RCL data,#1
... how many clocks pass from the triggering edge of WAITPEQ until the input pin is sampled by the TEST instruction? For the transmitter I can easily check the timing with a scope. But for the receiver it's difficult to tell the exact sampling time. For optimum reliability it would be best to sample in the middle of a data bit. If the WAITPEQ and the TEST instruction sample data at the same clock (out of 4 per instruction), then it would be best to insert two NOPs between the two instructions to skip the start bit and the first half of the first data bit.
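To illustrate, here's the receive sequence I have in mind; the two NOPs and the mid-bit landing are exactly the assumption I'd like to confirm (rxMask is a placeholder for the data pin mask):

        waitpeq rxMask, rxMask          ' wait for the start bit (rxMask = |< rxPin)
        nop                             ' 4 clocks: skip the rest of the start bit?
        nop                             ' 4 clocks: skip the first half of data bit 31?
        test    rxMask, ina     wc      ' sample (hopefully) mid-bit into carry
        rcl     data, #1
        test    rxMask, ina     wc      ' unrolled, 8 clocks per bit
        rcl     data, #1
        ' ... 30 more TEST/RCL pairs ...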
Is this correct? Anyone a better idea?
Comments
Just curious about why you would need such high rates?
Pretty excited about this development. Any more specifics?
I'm a servo motor fan, myself. :cool:
Edit: BTW, I use an Android tablet for my front-end, via a Bluetooth link.
You can check the Rx by toggling a pin right after TEST; that has a known delay.
I'd probably experiment by shifting the Start edge, and confirming the TEST sample point is centre-bit.
(ie move it both ways until it breaks, then choose the mid-value)
The WAIT opcodes are 1 SysCLK granular, so they will align to 12.5ns, i.e. 1/8 of your bit time.
If you are pushing highest speeds, you might want to look at sending more than 32b - eg a HUB mailbox could be serviced by 32b plus 4/6/8b of address, for example.
You probably want two delays, WAIT for SyncEdge, and WAIT for MySlaveSlot, and ensure the last slave has time to finish housekeeping before the next SyncEdge.
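ie something along these lines, where syncMask and slotTicks are placeholder names, slotTicks precomputed per slave:

        waitpeq syncMask, syncMask      ' WAIT for SyncEdge (start-of-timeslice strobe)
        mov     slotCnt, cnt
        add     slotCnt, slotTicks      ' this slave's offset into the frame
        waitcnt slotCnt, #0             ' WAIT for MySlaveSlot
        ' ... clock this slave's longs in/out, then housekeeping ...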
I'd suggest a modest anti-ringing resistor on the BUS'd lines.
It could be worth testing this link thru the newest CAN transceivers - the MCP254x are specced to 8MHz with a 100ns min Tx bit time, so they could just fit here.
Those buy you some ESD, differential signaling, and should work better with typical unstacked board spacings.
Using CAN, a 4 pin Prop interface would cover your PropTimedProtocol, as well as any duplex use that might come later. (eg P2 could change to a more conventional 10MBd 32b async)
(strictly, you could get by with 3, but that makes master/slave choice hard-wired, not software.)
I'm using RS485 networks (or single-wire I/O) at around 2Mb or so, and the reason for this speed is latency: just how quickly do I need responses and results? But my network also supports full-duplex over half-duplex, so the speed is necessary to support smooth operation and multiple simultaneous channels, as sometimes I want to connect a virtual terminal to a node over remote connections and talk to the Forth shell.
So the question is, what latency and update rate can you tolerate and how many channels are you supporting?
Throughput would be close to 1.25MB/s (10Mbit/s / 8), as there is no async framing per byte.
Of course, any single slave cannot sustain a continual 10Mb, but the OP mentions one per axis, so say half a dozen or so.
If you send data to every slave, in every burst, as I think the OP means to do, the others have not-my-message time to copy COG-HUB.
Packet size is another question.
Simplest to code & time is an unrolled loop, but that limits payloads to maybe 6-8 longs per record.
A TCXO would allow larger total frames
(just a shame the standard, low cost GPS ones are not P1-aligned, but maybe GPS + 2G80 Logic is a solution)
eg at 8 longs per slave and 8 slaves, one SysCLK in that maps to ~61ppm
Latency to refresh all slaves is a nice short 204us
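(Working that out: 8 slaves x 8 longs = 2048 data bits; at 8 SysCLKs per bit that is 16384 clocks per frame, so one SysCLK is 1/16384 ~ 61ppm, and 2048 bits at 10Mbit/s take ~204.8us.)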
Hence my suggestion to test this, and iterate both ways experimentally, to find the centre.
Given TEST has to sample the pin, then update Carry, which quadrant do you think it actually samples in?
I don't think it's too hard to manage setting a carry flag as an outcome. Presumably during writeback stage, just the same as any other instruction.
Given the app, I still suggest to treat the device as the final word, and measure it.
Pencils, and theory, can only get you so far...
A 4-axis open-loop (step/dir output, no feedback) version of the controller is already available (see here and here). But now customers ask for more axes and closed-loop operation, which requires ADCs for glass scale inputs and more cogs for all the PID loops.
I don't really need such high data rates. But doing it slower doesn't mean less hardware and code complexity. So I'd like to do it as simply and robustly as possible, with enough reserves for later extensions. I have to transmit and receive at least 3 longs per axis and timeslice (1ms). For a maximum of 8 axes this is 24 longs per ms. But it's always a good idea to have some bandwidth left over for diagnosis and debugging, so I could mirror all important variables in nearly realtime.
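That works out to 24 x 4 = 96 bytes per ms, i.e. under 100kB/s, less than a tenth of what the 10Mbit/s link can carry.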
Jmg,
yes, if the bus lines become longer a transceiver would be a good idea to improve noise immunity. I'd prefer RS485 over CAN. Proper termination is a must to avoid ringing, of course.
As the number of connector pins is not an issue (number of prop IO pins IS) I could do it this way:
2 pins differential master clock (5MHz)
2 pins differential data
2 pins differential start of frame strobe (SOF)
1 pin daisy chained slave select (SSI, SSO)
This requires 5 IO pins per slave: data, OE, SOF, SSI, SSO (master clock goes to XI). The slave select daisy chain is only to assign a virtual number to each slave once at boot time. The master signals SSI to the first slave and configures it. When done, the first slave passes SSO to the second slave in the chain and so on. So each slave knows its position in the chain without the need for any addressing (programmed or with DIP switches).
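In PASM, the slave's end of that enumeration could look roughly like this (ssiMask/ssoMask are placeholder pin masks; the config exchange itself is omitted):

        andn    outa, ssoMask           ' keep SSO low: downstream neighbour waits
        or      dira, ssoMask
        waitpeq ssiMask, ssiMask        ' wait until the upstream neighbour raises SSI
        ' ... receive slot number / configuration from the master ...
        or      outa, ssoMask           ' raise SSO: next slave in the chain is up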
There's no need for a TCXO. All props run perfectly synchronised from the same crystal clock.
Peter,
I plan for at most 8 servo axes and maybe a few IO extensions. As I said, I need to transmit 6 longs of time-critical data per axis, and synchronisation has to be exact to a microsecond. So I can write the data to hub ram when all slaves are updated, exactly synchronized, when all (time critical) communication is done. This takes up to say 100µs of a 1ms time slice. The rest of the time is left over for monitoring/debugging data transfer.
Evanh,
thanks for answering the original question. If WAITPEQ and TEST both sample at the same "quadrant" (one out of the 4 instruction clocks), then two NOPs should center the bit sampling. I could probably insert a WAITCNT with increasing delays to find out the critical transitions.
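Something like this, with sweepDelay (a placeholder) increased between test runs; the transmitter would have to stretch its bits for this test, as the MOV/ADD/WAITCNT overhead alone is longer than an 8-clock bit:

        waitpeq rxMask, rxMask          ' trigger on the start bit
        mov     t, cnt
        add     t, sweepDelay           ' sweep this up/down between runs
        waitcnt t, #0
        test    rxMask, ina     wc      ' note at which delays reception breaks
        rcl     data, #1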
Obviously no one here knows for sure.
I'd be really interested in how you implement all this on a P1: time-critical data to 8 servos, exact to the microsecond, over serial buses. That is, after you have worked out that WAITPEQ thing.
Okay, I'm being cheeky, your NetBob looks good, but I still look forward to the implementation then Bene.
Instruction counting is cool!
Although, the fastest rate would be achieved by having the Cog at each end hold the entire packet for the duration of the burst transfer. But this obviously incurs significant stop-start behaviour, as the next packet can't be initiated until the previous one is disposed of and a new one is completely formed.
Maybe whole packet in the Cog is safest approach after all.
I'd agree, the packets are modest in size.
The fastest code is wholly unrolled, but that does limit total packet size.
A more compact hybrid may be possible, along the lines of:
        waitpeq dataInPin, dataInPin    ' wait for start bit
        waitcnt slotCnt, #0             ' WAITForMySlot (slotCnt preloaded)
wordLoop
        add     rclIns, dstOne          ' UpdateAddressInRCL: next buffer long
        mov     bitCtr, #32
bitLoop
        test    dataInPin, ina  wc      ' sample the data pin into carry
rclIns  rcl     buffer-1, #1            ' shift into current long (self-modified)
        djnz    bitCtr, #bitLoop
        djnz    wordCtr, #wordLoop

dstOne  long    1 << 9                  ' +1 in an instruction's destination field
Each bit is now 12 SysCLKs, and there is a slight pause per word for the DJNZs & address update, guess +12 SysCLKs.
Txmit is then matched timing-wise, and every 32nd bit is a little longer. ~3% peak impact.
Wait values simply adjust as needed.
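The matching Tx inner loop can be as short as this (txMask is a placeholder pin mask, data goes out MSB first, bitCtr preloaded with 32):

bitLp   rcl     txData, #1      wc      ' shift the MSB out into carry
        muxc    outa, txMask            ' drive the data pin from carry
        djnz    bitCtr, #bitLp          ' 12 SysCLKs per bit, matching the Rx loop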
I have decided to do the transmission at 5Mb/s, because a higher burst rate is not of much use. A relatively low but continuous speed is better than a high burst clock with long pauses in between to read and store the data from and to buffers.
The hardware looks like this:
Contrary to the original plan, I switched to an asynchronous (UART-like) protocol. This way I don't need to transmit the clock, and I can process the data during the start/stop bit intervals. It is also more portable, because other processors are not as flexible as the propeller and are not capable of bitbanging any sophisticated serial protocol. So I could use an ARM CPU in the main module, which has an FPU and more memory and computing power.
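At 80MHz, 5Mb/s works out to 16 clocks per bit, so the receiver can stay unrolled with time to spare. A minimal sketch for one incoming byte, assuming standard LSB-first UART framing (pin masks and the initial delay are placeholders to be tuned on real hardware):

        waitpne rxMask, rxMask          ' falling edge of the start bit (line idles high)
        mov     t, cnt
        add     t, #26                  ' roughly 1.5 bit times, tune on the scope
        waitcnt t, #0                   ' land near the centre of data bit 0
        test    rxMask, ina     wc      ' data bit 0
        rcr     data, #1
        nop
        nop                             ' pad to 16 clocks per bit
        test    rxMask, ina     wc      ' data bit 1
        rcr     data, #1
        ' ... unrolled for the remaining bits, then SHR data, #24 to right-justify ...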
The propeller is still the best choice for the axis and IO modules. Precise and deterministic timing is required to generate step/dir signals or to read data from encoders or glass scales.
The Main and axis modules can be stacked like this:
ie, that makes 4MBd, 4.8MBd, 6MBd, 8MBd, 12MBd possible candidates. The FT232H / FT2232H parts make nice fast UART testers.
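(ie 12M divided by 1, 1.5, 2, 2.5 and 3 - and those also land on whole bit times at 96MHz: 8, 12, 16, 20 and 24 SysCLKs per bit.)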
Both have their uses.
Higher burst speeds mean you can finish sending a short packet sooner, which gives more processor time for other tasks, even if the sustained average is not the maximum.
eg in working on the P2D2 EFM8UB3 interface I tuned to get a peak Rx speed of 8MBd 8.n.2, where UB3 can read up to 1792 bytes into a FIFO, before the slower USB average becomes an issue.
USB side sends 64 byte packets, so UB3.TX can send 8M.8.n.2 to MCU gapless for 64 bytes, then slight gaps can occur.
Yup, that's why I like the pairing idea of a RaspPi and a P2 (or P1) - so I took Peter's P2D2-40pin, and swapped a few pins to match the RaspPi-40pin, to make them plug compatible, in host or slave mode.
How are you planning to achieve high asynchronous baud rates AND high throughput?
BTW, the ping-pong network normally operates over RS485 (half-duplex) but simulates a full-duplex point to point connection that can act as a remote serial console.
Here's a snippet of the receive code:
A 96MHz prop can align better with the speeds of USB-UARTs, where 4MBd (general form) and 6MBd (a single baud choice) may be possible?
If there was some way to synchronize the two then we'd have a chance, but asynchronous is just that. I have used 2 stop bits as well at times but there are still limits.
Hub ram read/write throughput is also no problem. There's only one big packet transferred from the master to all slaves, and one big packet back to the master, once every millisecond. Each slave picks out only its own "slot", the part of the data which is meant for that slave. One slot is currently only 12 bytes. I store the big packet in cog ram and only transfer the slots really needed to hub ram.
I could post some sample code as soon as it's tested.
We did a rather similar topology a while ago, only done as a 'twisted ring' link.
Async UARTs run in a circle, and the 9th-bit edge is sensed: (eg) the 12 bytes after a _/= edge are extracted and replaced by that slave's info, with the 9th bit low. The result is that each slave moves the 9th bit one packet to the right and removes/replaces the payload of ONLY its own packet; all others are simply echoed.
No address needed, and not super timing critical, and supports any number of slaves with no code changes.
It does need full-duplex UARTs, but in small MCUs those are in silicon.
google finds this about BAUD rates
https://www.avrfreaks.net/forum/same70-uart-maximum-baudrate-maybe-bug-start
suggests 18.75M/N is the supported equation?
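If so, nothing lands on 5MBd exactly: the nearest candidates are 18.75M/3 = 6.25MBd and 18.75M/4 = 4.6875MBd.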
I have to restart the cog for each packet anyway as any lost start bits could freeze the communication cog. So a second cog has to watch for timeouts.
BTW, the MISO packet is seamlessly composed of multiple data slices from different slaves. Each slave knows when its slot begins and ends. Output enable for the MISO bus line is switched on and off during the stop bits, so no glitch or collision occurs.
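On the slave side this boils down to something like (oeMask standing in for the transceiver's output-enable pin):

        or      outa, oeMask            ' enable our driver during the preceding stop bit
        ' ... shift out this slave's 12-byte slot ...
        andn    outa, oeMask            ' release the bus again during our own stop bit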
If you are doing that, you may need to pre-bias the terminations, so any gap in hand-over does not drop to 0V.
CAN transceivers have a fixed offset, and drive one way only, so they have no hand-over timing issues.
The outer loop is unrolled because there are only a few instructions left for load/store, loop jumps, slot handling and other tasks not directly related to bitbanging.