High speed prop-to-prop communication, WAITPEQ timing

ManAtWork · 2016-06-30 15:25

Hello,

I'm designing an industrial machine controller that has to deal with multiple servo drives and lots of IO channels. As I'd run out of pins with only one prop, anyway, I decided to use one slave prop per servo axis and one master prop doing the trajectory planning. And BTW, as I'd also run out of memory I decided to use an Intel NUC for the GUI and as file/network server.

Each slave prop sits on its own PCB. All slave PCBs are stacked on top of the master PCB with daisy chained pin headers. The maximum board-to-board distance is <8" with good grounding and no high currents nearby so I don't expect noise problems. I have less than 8 pins left over so I don't think that parallel/multi-bit transmission would give any advantage.

I've seen a trick somewhere where the PHSA register is used to output a serial data stream at 20Mbit/s. But the highest data rate a propeller could receive is 10Mbit/s because shifting in one bit always takes two instructions (TEST and RCL), 4 clocks each. If all propellers run from the same 5MHz crystal, there is no need to transmit an extra serial bus clock. As all slaves have to be served exactly once per time slice there is also no need to transmit any address information. Each slave knows its time slot. So I think I'll only need two IO pins per prop, one "start of timeslice" strobe and a (bidirectional) data line. If I can afford two cogs for the communication I could do it full-duplex and transmit downstream and upstream data at the same time over two bus lines. But I don't think that'll be necessary.
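The two rates above follow from instruction counting. A minimal sketch of that arithmetic, assuming an 80 MHz system clock (5 MHz crystal with a 16x PLL, which the post does not state explicitly) and 4 clocks per instruction:

```python
SYS_CLK_HZ = 80_000_000   # assumption: 5 MHz crystal x 16 PLL
CLOCKS_PER_INSTR = 4      # as stated in the post for TEST and RCL

def bit_rate(instr_per_bit):
    """Bits per second when each bit costs instr_per_bit instructions."""
    return SYS_CLK_HZ / (instr_per_bit * CLOCKS_PER_INSTR)

tx_rate = bit_rate(1)   # transmitter: one instruction per bit (PHSA trick)
rx_rate = bit_rate(2)   # receiver: TEST + RCL per bit

print(tx_rate / 1e6, rx_rate / 1e6)   # 20.0 10.0
```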

So the (peak) data rate is 10Mbit/s or 8 clocks per bit. I send one start bit followed by 32 data bits. Now the question:

How do I know the exact timing of the WAITPEQ command? Given...

WAITPEQ dataInPin,dataInPin ' wait for start bit
TEST dataInPin,INA wc
RCL data,#1

... how many clocks pass between the triggering edge of WAITPEQ until the sampling of the input pin of the TEST instruction? For the transmitter I can easily check timing with the scope. But for the receiver it's difficult to tell the exact sampling time. For optimum reliability it would be best to sample in the middle of a data bit. If the WAITPEQ and the TEST instruction sample data at the same clock (out of 4 per instruction) then it would be best to insert two NOPs between the two instructions to skip the start bit and the first half of the first data bit.

Is this correct? Does anyone have a better idea?
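The two-NOP reasoning can be checked with a few lines of arithmetic. This is only a sketch under the assumption named in the question, that WAITPEQ and TEST read the pin at the same clock of their 4-clock cycle; the function names are made up for illustration.

```python
CLOCKS_PER_INSTR = 4
CLOCKS_PER_BIT = 8        # 10 Mbit/s at 12.5 ns per clock

def sample_offset_clocks(nops):
    """Clocks from the WAITPEQ trigger to the TEST sample.
    ASSUMES both instructions sample at the same phase of their cycle."""
    return (1 + nops) * CLOCKS_PER_INSTR

def sample_position_bits(nops):
    """Sample position in bit times, measured from the start-bit edge."""
    return sample_offset_clocks(nops) / CLOCKS_PER_BIT

for nops in range(4):
    print(nops, sample_position_bits(nops))
```

A position of 1.5 means the middle of the first data bit (one full start bit plus half a data bit), which is what two NOPs give under this assumption.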

Mickster · 2016-06-30 18:36

Hi Nicolas,

Just curious about why you would need such high rates?

Pretty excited about this development. Any more specifics?
I'm a servo motor fan, myself. :cool:

Edit: BTW, I use an Android tablet for my front-end, via a Bluetooth link.

jmg · 2016-06-30 22:05

ManAtWork wrote: »

How do I know the exact timing of the WAITPEQ command? Given...

WAITPEQ dataInPin,dataInPin ' wait for start bit
TEST dataInPin,INA wc
RCL data,#1

... how many clocks pass between the triggering edge of WAITPEQ until the sampling of the input pin of the TEST instruction? For the transmitter I can easily check timing with the scope. But for the receiver it's difficult to tell the exact sampling time.

You can check the Rx, with a pin-change after TEST. That has a known delay.
I'd probably experiment by shifting the Start edge, and confirming the TEST sample point is centre-bit.
(ie move it both ways until it breaks, then choose the mid-value)
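The "move it both ways until it breaks" procedure boils down to finding the centre of the passing window. A minimal sketch, assuming the offsets are consecutive integers in system clocks; the `measured` values are hypothetical, not real measurements:

```python
def centre_of_passing_window(results):
    """results maps a start-edge offset (system clocks) to True if the data
    was received correctly at that offset. Returns the midpoint of the
    longest run of passing offsets, or None if nothing passed."""
    best_start, best_len = None, 0
    run_start, run_len = None, 0
    for offset in sorted(results):
        if results[offset]:
            if run_len == 0:
                run_start = offset
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    if best_len == 0:
        return None
    return best_start + (best_len - 1) / 2

# Hypothetical sweep: reception works for offsets -3..+3 only.
measured = {off: -3 <= off <= 3 for off in range(-5, 6)}
print(centre_of_passing_window(measured))
```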

The WAIT opcodes are 1 SysCLK granular, so they will align to 12.5ns, or 1/8 of your bit time.

If you are pushing highest speeds, you might want to look at sending more than 32b - eg a HUB mailbox could be serviced by 32b+4-6-8b of address for example.

You probably want two delays, WAIT for SyncEdge, and WAIT for MySlaveSlot, and ensure the last slave has time to finish housekeeping before the next SyncEdge.

I'd suggest a modest anti-ringing resistor on the BUS'd lines.

It could be worth testing this link thru the newest CAN Transceivers - the MCP254x spec to 8MHz with a 100ns min Tx bit time, so they could just fit here.

Those buy you some ESD, differential signaling, and should work better with typical unstacked board spacings.

Using CAN, a 4 pin Prop interface would cover your PropTimedProtocol, as well as any duplex use that might come later. (eg P2 could change to a more conventional 10MBd 32b async)
(strictly, you could get by with 3, but that makes master/slave choice hard-wired, not software.)

Peter Jakacki · 2016-07-01 01:14

So if you send and receive at 10Mb does that mean your throughput can be ~1Mb/sec? Will your code be able to process a byte every microsecond? I don't think I've seen the high speed serial method using the counter being used effectively, although it looks great. The receive cog can't receive and process this unless it takes time out and passing it through the hub is slow etc.

I'm using RS485 networks (or single wire I/O) at around 2Mb or so and the reason for this speed is due to latency, just how quickly do I need responses and results? But my network also supports full-duplex over half-duplex so the speed is necessary to support smooth operation and multiple simultaneous channels as sometimes I want to connect a virtual terminal to a node over remote connections and talk to the Forth shell.

So the question is, what latency and update rate can you tolerate and how many channels are you supporting?

jmg · 2016-07-01 02:08

Peter Jakacki wrote: »

So if you send and receive at 10Mb does that mean your throughput can be ~1Mb/sec? Will your code be able to process a byte every microsecond? I don't think I've seen the high speed serial method using the counter being used effectively, although it looks great. The receive cog can't receive and process this unless it takes time out and passing it through the hub is slow etc.
...
So the question is, what latency and update rate can you tolerate and how many channels are you supporting?

Throughput would be close to 1.25MB/s as there is no async framing per byte.

Of course, any single slave cannot sustain a continual 10Mb, but the OP mentions one per axis, so say half a dozen or so.
If you send data to every slave, in every burst, as I think the OP means to do, the others have not-my-message time to copy COG-HUB.
Packet size is another question.
Simplest to code & time, is an unrolled loop, but that limits payloads to maybe 6-8 longs, per record.

A TCXO would allow larger total frames
(just a shame the standard, low cost GPS ones are not P1-aligned, but maybe GPS + 2G80 Logic is a solution)

eg at 8 longs per slave and 8 slaves, one SysCLK in that maps to ~61ppm
Latency to refresh all slaves is a nice short 204us
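Both figures can be reproduced from the frame length. A rough sketch, assuming 80 MHz, 8 clocks per bit, and 32 bits per long with start bits ignored (that is the combination that matches the quoted numbers):

```python
SYS_CLK_HZ = 80_000_000
CLOCKS_PER_BIT = 8
BITS_PER_LONG = 32            # start bits ignored here

longs_per_slave = 8
slaves = 8

frame_clocks = longs_per_slave * slaves * BITS_PER_LONG * CLOCKS_PER_BIT
one_clock_ppm = 1e6 / frame_clocks          # drift that costs one SysCLK per frame
refresh_us = frame_clocks * 1e6 / SYS_CLK_HZ

print(frame_clocks, round(one_clock_ppm, 1), round(refresh_us, 1))
# 16384 61.0 204.8
```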

Peter Jakacki · 2016-07-01 02:13

I'm more concerned about what is actually required in a practical sense than what could possibly be achieved. I have used continuous 10Mb style comms (20Mhz stream) on the Prop for a totally different use altogether but that was mainly to provide virtual wires and didn't require any real processing.

evanh · 2016-07-01 02:16

ManAtWork wrote: »

...
WAITPEQ dataInPin,dataInPin ' wait for start bit
TEST dataInPin,INA wc
RCL data,#1

... how many clocks pass between the triggering edge of WAITPEQ until the sampling of the input pin of the TEST instruction?

I believe it's a simple 4 clocks (ie: The TEST instruction length).

jmg · 2016-07-01 02:41

evanh wrote: »

ManAtWork wrote: »

... how many clocks pass between the triggering edge of WAITPEQ until the sampling of the input pin of the TEST instruction?

I believe it's a simple 4 clocks (ie: The TEST instruction length).

That presumes the wait-release and test-sample are in the same sub-fraction quadrant of the opcode.
Hence my suggestion to test this, and iterate both ways experimentally, to find the centre.

evanh · 2016-07-01 02:46

The WAIT executes on every clock. The Cog execute can align to any single sys-clock.

jmg · 2016-07-01 03:12

evanh wrote: »

The WAIT executes on every clock. The Cog execute can align to any single sys-clock.

Yup, as I said above.
Given test has to sample the pin, then update Carry, which quadrant do you think it actually samples in ?

evanh · 2016-07-01 04:54

TEST instruction does its sample four clocks after the WAIT detects an edge. Edge detection jitter is limited to one sys-clock period.

evanh · 2016-07-01 04:57

There are no variations.

evanh · 2016-07-01 05:21

Ah, I think I understand what you're asking. No, there's no reason for the IN port to be read at different stages of the instruction cycle for those two instructions.

I don't think it's too hard to manage setting a carry flag as an outcome. Presumably during writeback stage, just the same as any other instruction.

jmg · 2016-07-01 06:41

evanh wrote: »

Ah, I think I understand what you're asking. No, there's no reason for the IN port to be read at different stages of the instruction cycle for those two instructions.

Perhaps, but WAIT also takes a couple of SysCLKs to 'get going' - basically flip over from opcode mode, to sample-pin-state-engine, which is not the same logic as the opcode sample point.
Given the app, I still suggest to treat the device as the final word, and measure it.
Pencils, and theory, can only get you so far...

ManAtWork · 2016-07-01 07:47

Mickster,

a 4 axis open-loop (step/dir output, no feedback) version of the controller is already available (see here and here). But now customers ask for more axes and closed loop operation which requires ADCs for glass scale inputs and more cogs for all the PID loops.

I don't really need such high data rates. But doing it slower doesn't mean less hardware and code complexity. So I'd like to do it as simple and robust as possible with enough reserves for later extensions. I have to transmit and receive at least 3 longs per axis and timeslice (1ms). For a maximum of 8 axes this is 24 longs per ms. But it's always a good idea to have some bandwidth left over for diagnosis and debugging so I could mirror all important variables in nearly realtime.
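A quick budget for the time critical part, as a sketch only. It assumes the 10Mbit/s framing from the first post (one start bit plus 32 data bits per long, 8 clocks per bit at 80 MHz) and counts one direction:

```python
SYS_CLK_HZ = 80_000_000
CLOCKS_PER_BIT = 8
BITS_PER_LONG_FRAME = 33      # 1 start bit + 32 data bits
TIMESLICE_US = 1000.0

longs_per_slice = 3 * 8       # 3 longs per axis, 8 axes

busy_clocks = longs_per_slice * BITS_PER_LONG_FRAME * CLOCKS_PER_BIT
busy_us = busy_clocks * 1e6 / SYS_CLK_HZ
utilisation = busy_us / TIMESLICE_US

print(busy_clocks, round(busy_us, 1), round(utilisation * 100, 1))
```

Under these assumptions the critical data occupies about 79µs, roughly 8% of the 1ms slice, which leaves most of the slice for diagnostic traffic.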

Jmg,

yes, if the bus lines become longer a transceiver would be a good idea to improve noise immunity. I'd prefer RS485 over CAN. Proper termination is a must to avoid ringing, of course.

As the number of connector pins is not an issue (number of prop IO pins IS) I could do it this way:
2 pins differential master clock (5MHz)
2 pins differential data
2 pins differential start of frame strobe (SOF)
1 pin daisy chained slave select (SSI, SSO)
This requires 5 IO pins per slave: data, OE, SOF, SSI, SSO (master clock goes to XI). The slave select daisy chain is only to assign a virtual number to each slave once at boot time. The master signals SSI to the first slave and configures it. When done, the first slave passes SSO to the second slave in the chain and so on. So each slave knows its position in the chain without the need for any addressing (programmed or with DIP switches).
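The boot-time numbering can be modelled in a few lines. This is a simulation sketch of the SSI/SSO idea only; the `Slave` class and `enumerate_chain` function are hypothetical names, and the real configuration exchange between master and slave is not modelled.

```python
class Slave:
    def __init__(self):
        self.index = None   # virtual number, unknown at power-up
        self.sso = False    # select output to the next slave in the chain

    def step(self, ssi, next_index):
        """Accept a number if selected and not yet configured."""
        if ssi and self.index is None:
            self.index = next_index
            self.sso = True          # pass the select on once configured
            return True
        return False

def enumerate_chain(slaves):
    """Master side: keep SSI of the first slave asserted and hand out
    numbers until no slave accepts one any more."""
    next_index = 0
    progress = True
    while progress:
        progress = False
        ssi = True                   # master drives the first SSI
        for slave in slaves:
            if slave.step(ssi, next_index):
                next_index += 1
                progress = True
                break
            ssi = slave.sso          # select ripples down the chain
    return [slave.index for slave in slaves]

print(enumerate_chain([Slave() for _ in range(4)]))   # [0, 1, 2, 3]
```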

There's no need for a TCXO. All props run perfectly synchronised from the same crystal clock.

Peter,

I plan for at most 8 servo axes and a few IO extensions maybe. As I said I need to transmit 6 longs of time critical data per axis and synchronisation has to be exact to a microsecond. So I can write the data to hub ram when all slaves are updated, exactly synchronized when all (time critical) communication is done. This takes up to say 100µs of a 1ms time slice. The rest of the time is left over for monitoring/debugging data transfer.

Evanh,

thanks for answering the original question. If WAITPEQ and TEST both sample at the same "quadrant" (1-out-of-4 instruction clocks) then two NOPs should center the bit sampling. I could probably insert a WAITCNT with increased delays to find out the critical transitions.

evanh · 2016-07-01 10:01

There's unlikely to be extra post-detection execution stages for a WAIT instruction. There looks to be some setup (WAITPEQ is listed as 6+ clocks total) but that's only relevant if you are wanting to use WAITs within back-to-back frames.

Obviously no one here knows for sure.

Peter Jakacki · 2016-07-01 12:39

ManAtWork wrote: »

I plan for at most 8 servo axes and a few IO extensions maybe. As I said I need to transmit 6 longs of time critical data per axis and synchronisation has to be exact to a microsecond. So I can write the data to hub ram when all slaves are updated, exactly synchronized when all (time critical) communication is done. This takes up to say 100µs of a 1ms time slice. The rest of the time is left over for monitoring/debugging data transfer.

I'd be really interested in how you implement all this on a P1, time critical data to 8 servos exact to the microsecond over serial buses, that is, after you have worked out that WAITPEQ thing

Okay, I'm being cheeky, your NetBob looks good, but I still look forward to the implementation then Bene.

evanh · 2016-07-01 23:22

The trick is to match the serial data framing to suit Hub cycle times. This way one can minimise the Hub access time (a common trick in Prop1 display drivers) and thereby sustain an optimal burst rate.

Instruction counting is cool!

Although, the fastest rate would be achieved by having the Cog at each end hold the entire packet for duration of burst transfer. But this obviously incurs significant stop-start behaviour as next packet can't be initiated until the previous one is disposed of and a new one is completely formed.

evanh · 2016-07-01 23:24

Hmm, on second thoughts, that first idea of aligning to the Hub cycle is difficult when dealing with two Props.

Maybe whole packet in the Cog is safest approach after all.

jmg · 2016-07-02 01:39

evanh wrote: »

Maybe whole packet in the Cog is safest approach after all.

I'd agree, the packets are modest in size.

The fastest code is wholly unrolled, but that does limit total packet size.
A possibility may be a more compact hybrid along the lines of

in pseudo code
WAITPEQ dataInPin,dataInPin ' wait for start bit
WAITForMySlot
WordLoop
UpdateAddressInRCL
BitLoop
TEST dataInPin,INA wc
RCL data,#1
DJNZ BitCtr,BitLoop
DJNZ WordCtr,WordLoop

Each bit is now 12 SysClks, and there is a slight pause for DJNZ & UpdateA, guess +12 SysCLKs

Txmit is then matched timing-wise, and every 32nd bit is a little longer. ~3% peak impact.
Wait values simply adjust as needed.
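The numbers for the hybrid loop work out as follows. A sketch assuming 80 MHz, 4 clocks per instruction, and the +12 SysCLK per-word overhead guessed above:

```python
SYS_CLK_HZ = 80_000_000
BIT_CLOCKS = 3 * 4            # TEST + RCL + DJNZ
WORD_OVERHEAD_CLOCKS = 12     # the guess from the post
BITS_PER_WORD = 32

word_clocks = BITS_PER_WORD * BIT_CLOCKS + WORD_OVERHEAD_CLOCKS
overhead = WORD_OVERHEAD_CLOCKS / (BITS_PER_WORD * BIT_CLOCKS)
peak_mbit = SYS_CLK_HZ / BIT_CLOCKS / 1e6
effective_mbit = BITS_PER_WORD * SYS_CLK_HZ / word_clocks / 1e6

print(word_clocks, round(overhead * 100, 1),
      round(peak_mbit, 2), round(effective_mbit, 2))
```

So the loop form trades the 10Mbit/s of the unrolled version for roughly 6.5Mbit/s, in exchange for a much smaller code footprint.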

ManAtWork · 2019-09-25 08:35

Update: the project has been re-activated after a long time...

I have decided to do the transmission with 5Mb/s because a higher burst rate is not of much use. A relatively low but continuous speed is better than a high burst clock with long pauses in between to read and store the data from and to buffers.

The hardware looks like this:

Other than originally planned I switched to an asynchronous (UART like) protocol. This way I don't need to transmit the clock and I can process the data during the start/stop bit intervals. It is also more portable because other processors are not as flexible as the propeller and are not capable of bitbanging any sophisticated serial protocol. So I could use an ARM CPU in the main module which has an FPU and more memory and computing power.

The propeller is still the best choice for the axis and IO modules. Precise and deterministic timing is required to generate step/dir signals or to read data from encoders or glass scales.

The Main and axis modules can be stacked like this:

jmg · 2019-09-25 22:32

ManAtWork wrote: »

Other than originally planned I switched to an asynchronous (UART like) protocol. This way I don't need to transmit the clock and I can process the data during the start/stop bit intervals.

If using Async, you might like to look at 24M/N baud rates, as that aligns with what USB-UARTS can deliver.
ie, that makes 4MBd, 4.8MBd, 6MBd, 8MBd, 12MBd possible candidates. The FT232H / FT2232H parts make nice fast UART testers.
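To see which of those 24M/N rates land on a whole number of Prop clocks per bit, a small check helps. A sketch only: 80 MHz is the assumed Prop clock here, and 96 MHz is included purely for comparison.

```python
BASE_HZ = 24_000_000

def baud(n):
    """Baud rate for divisor n in the 24M/N family."""
    return BASE_HZ / n

def whole_clocks_per_bit(sys_clk_hz, n):
    """True if one bit at 24M/n is an integer number of system clocks."""
    return (sys_clk_hz * n) % BASE_HZ == 0

for n in range(2, 7):
    print(n, baud(n) / 1e6,
          whole_clocks_per_bit(80_000_000, n),
          whole_clocks_per_bit(96_000_000, n))
```

At 80 MHz only N=3 (8MBd, 10 clocks) and N=6 (4MBd, 20 clocks) divide evenly; at 96 MHz every rate in the list does.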

ManAtWork wrote: »

I have decided to do the transmission with 5Mb/s because a higher burst rate is not of much use. A relatively low but continuous speed is better than a high burst clock with long pauses in between to read and store the data from and to buffers.

Both have their uses.
Higher burst speeds mean you can finish sending a short packet sooner, which gives more processor time for other tasks, even if the sustained average is not the maximum.
eg in working on the P2D2 EFM8UB3 interface I tuned to get a peak Rx speed of 8MBd 8.n.2, where UB3 can read up to 1792 bytes into a FIFO, before the slower USB average becomes an issue.
USB side sends 64 byte packets, so UB3.TX can send 8M.8.n.2 to MCU gapless for 64 bytes, then slight gaps can occur.

ManAtWork wrote: »

.... It is also more portable because other processors are not as flexible as the propeller and are not capable of bitbanging any sophisticated serial protocol. So I could use an ARM CPU in the main module which has an FPU and more memory and computing power.

Yup, that's why I like the pairing idea of a RaspPi and a P2 (or P1) - so I took Peter's P2D2-40pin, and swapped a few pins to match the RaspPi-40pin, to make them plug compatible, in host or slave mode.

Peter Jakacki · 2019-09-25 23:09

I've written dozens of high-speed serial drivers but the limiting factor with the P1 is the slow hub access. I've interleaved calculating the write index into receiving bits and then wrote the result to hub before the stop bit is validated, and then if it is validated, updated the hub write index etc. But this is still limited to about 3MBd with 100% throughput, even as an unrolled loop. In fact the console driver in Tachyon accepts two non-simultaneous inputs, one from the console or one from the ping-pong networking which defaults to 2MBd. The transmit is unbuffered and at high baud rates is more efficient than using buffers anyway.

How are you planning to achieve high asynchronous baud rates AND high throughput?

BTW, the ping-pong network normally operates over RS485 (half-duplex) but simulates a full-duplex point to point connection that can act as a remote serial console.

Here's a snippet of the receive code:

                        waitcnt rxcnt,rxticks           ' wait until middle of first data bit
                        test    rxpin,ina wc            ' sample bit 0
                        muxc    rxdat,#01               ' and assemble rxdata
                        waitcnt rxcnt,rxticks
                        test    rxpin,ina wc            ' sample bit 1
                        muxc    rxdat,#02
                        waitcnt rxcnt,rxticks
                        test    rxpin,ina wc            ' sample bit 2
                        muxc    rxdat,#04
                        waitcnt rxcnt,rxticks
                        test    rxpin,ina wc            ' sample bit 3
                        muxc    rxdat,#08
                        waitcnt rxcnt,rxticks
                        test    rxpin,ina wc            ' sample bit 4
                        muxc    rxdat,#$10
 ' data bit 5
                        waitcnt rxcnt,rxticks
                        test    rxpin,ina wc            ' sample bit 5
                        muxc    rxdat,#$20
                        mov     X1,rxbuf                ' squeeze in some overheads, calc write pointer
                        add     X1,rxwr                 ' X1 points to buffer location to store
' data bit 6
                        waitcnt rxcnt,rxticks
                        test    rxpin,ina wc            ' sample bit 6
                        muxc    rxdat,#$40
                        add     rxwr,#1                 ' update write index
                        and     rxwr,rxwrap
' last data bit 7
                        waitcnt rxcnt,rxticks
                        test    rxpin,ina wc            ' sample bit 7
                        muxc    rxdat,#$80
                        wrbyte  rxdat,X1               ' save data in buffer - could take 287.5ns worst case
' stop bit
stopins                 waitcnt rxcnt,rxticks           ' check stop bit early (need to detect errors and breaks)
                        test    rxpin,ina wc            ' sample stop bit
                if_c    wrword  rxwr,rxwrP              ' update hub index for code reading the buffer if all good
                if_c    mov     breakcnt,#100           ' reset break count
                if_c    jmp     #receive

jmg · 2019-09-25 23:34

Peter Jakacki wrote: »

I've written dozens of high-speed serial drivers but the limiting factor with the P1 is the slow hub access. I've interleaved calculating the write index into receiving bits and then wrote the result to hub before the stop bit is validated, and then if it is validated, updated the hub write index etc. But this is still limited to about 3MBd with 100% throughput, even as an unrolled loop.

You can buy a little more time using 2 stop bits (8.n.2), and even 8.m.2 can add another bit time if needed.
A 96MHz prop can align better with the speeds of USB-UARTS, where 4MBd general form, and 6MBd (single baud choice) may be possible ?

Peter Jakacki · 2019-09-25 23:39

It's the asynchronous nature of the data and the variable latency of accessing the hub that is the problem. Sometimes it hits a sweet spot, other times it falls off the cliff.
If there was some way to synchronize the two then we'd have a chance, but asynchronous is just that. I have used 2 stop bits as well at times but there are still limits.
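To put the hub latency in proportion, here is a rough sketch. It assumes an 80 MHz clock and takes 23 clocks as the worst-case hub write, which is what the 287.5ns comment in the snippet above corresponds to at 12.5ns per clock:

```python
SYS_CLK_HZ = 80_000_000
HUB_WORST_CLOCKS = 23         # 287.5 ns at 12.5 ns per clock

hub_worst_ns = HUB_WORST_CLOCKS * 1e9 / SYS_CLK_HZ

def hub_worst_in_bit_times(baud):
    """Worst-case hub write expressed as a fraction of one bit time."""
    return HUB_WORST_CLOCKS * baud / SYS_CLK_HZ

for baud in (2_000_000, 3_000_000, 5_000_000):
    print(baud, round(hub_worst_in_bit_times(baud), 2))
```

At 2MBd the worst-case write takes a bit more than half a bit time, at 3MBd it already takes most of one, and at 5MBd it no longer fits inside a single bit.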

ManAtWork · 2019-09-26 09:11

The master is an ATSAME70 running with 300MHz main clock so I think 5Mb/s = 300/60 should be no problem. If it turns out that the prescaler has to contain some 2^n factor then I'd take 300/64=4.69Mb/s and insert some NOPs at the propeller side to adjust the timing. I don't use FTDI chips here, no chance to meet the timing requirements with a USB device. The high speed bus is used only internally to the stack shown in the picture. Each module has female pin header on the right and a male one at the left side. No external connections or cables.
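The two divider options look like this from the propeller side. A sketch assuming an 80 MHz Prop clock and that the master baud rate is simply main clock divided by the divisor:

```python
MASTER_CLK_HZ = 300_000_000
PROP_CLK_HZ = 80_000_000      # assumption

def master_baud(divisor):
    return MASTER_CLK_HZ / divisor

def prop_clocks_per_bit(divisor):
    """Propeller system clocks per bit at the master's baud rate."""
    return PROP_CLK_HZ / master_baud(divisor)

for divisor in (60, 64):
    print(divisor, master_baud(divisor), round(prop_clocks_per_bit(divisor), 3))
```

With /60 a bit is exactly 16 Prop clocks. With /64 it is about 17.07 clocks, so a fixed 17-clock bit loop would slip by roughly 0.07 clock per bit, well under one clock over a 10-bit frame if the receiver resynchronises on every start bit.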

Hub ram read/write throughput is also no problem. There's only one big packet transferred from the master to all slaves and one big packet back to the master once every millisecond. Each slave picks only its own "slot", the part of the data which is meant especially for that slave. One slot is currently only 12 bytes. I store the big packet in cog ram and only transfer the slots really needed to hub ram.
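The slot picking can be illustrated like this. A sketch only: it assumes the slots are packed back to back in slave order with no header, which the post does not specify, and `slot_of` is a made-up helper name.

```python
SLOT_BYTES = 12

def slot_of(packet: bytes, slave_index: int) -> bytes:
    """Return the 12-byte slot belonging to the given slave.
    ASSUMES contiguous slots in slave order, no packet header."""
    start = slave_index * SLOT_BYTES
    end = start + SLOT_BYTES
    if slave_index < 0 or end > len(packet):
        raise ValueError("packet too short for this slave index")
    return packet[start:end]

# Hypothetical packet for 8 slaves, filled with a counting pattern.
packet = bytes(range(8 * SLOT_BYTES))
print(slot_of(packet, 2).hex())
```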

I could post some sample code as soon as it's tested.

jmg · 2019-09-26 09:27

ManAtWork wrote: »

The master is an ATSAME70 running with 300MHz main clock so I think 5Mb/s = 300/60 should be no problem. If it turns out that the prescaler has to contain some 2^n factor then I'd take 300/64=4.69Mb/s and insert some NOPs at the propeller side to adjust the timing. I don't use FTDI chips here, no chance to meet the timing requirements with a USB device. The high speed bus is used only internally to the stack shown in the picture. Each module has female pin header on the right and a male one at the left side. No external connections or cables.

Hub ram read/write throughput is also no problem. There's only one big packet transferred from the master to all slaves and one big packet back to the master once every millisecond. Each slave picks only its own "slot", the part of the data which is meant especially for that slave. One slot is currently only 12 bytes. I store the big packet in cog ram and only transfer the slots really needed to hub ram.

We did a rather similar topology a while ago, only done as a 'twisted ring' link.
Async UARTs run in a circle, and the 9th bit edge is sensed, and (eg) 12 bytes after a _/= edge are extracted and replaced by that slave's info, with 9th bit low. The result is that each slave moves the 9th bit one packet to the right, and removes/replaces the payload of ONLY its own packet; all others are simply echoed.
No address needed, and not super timing critical, and supports any number of slaves with no code changes.
It does need fully duplex uarts, but in small MCUs those are in silicon.
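One way to read that ring scheme is sketched below as a simulation. This is an interpretation, not the original implementation: each word is modelled as a (byte, ninth_bit) pair, the line is assumed to idle with the 9th bit low, and the word carrying the rising edge is counted as the first of the 12 payload bytes.

```python
PAYLOAD = 12

def slave_pass(stream, own_payload):
    """Process one pass of the ring through a slave.
    Returns (outgoing stream, bytes this slave extracted)."""
    out, taken = [], []
    prev_ninth = 0          # assumed idle level of the 9th bit
    seen_edge = False
    remaining = 0           # payload words still to capture
    for byte, ninth in stream:
        if not seen_edge and prev_ninth == 0 and ninth == 1:
            seen_edge = True
            remaining = PAYLOAD
        prev_ninth = ninth
        if remaining > 0:
            taken.append(byte)
            # replace with own data, 9th bit low
            out.append((own_payload[PAYLOAD - remaining], 0))
            remaining -= 1
        else:
            out.append((byte, ninth))   # everything else is echoed
    return out, bytes(taken)

# Master sends three packets with the 9th bit high on every word.
stream = [(b, 1) for b in range(3 * PAYLOAD)]
received = []
for reply in (b"A" * PAYLOAD, b"B" * PAYLOAD, b"C" * PAYLOAD):
    stream, got = slave_pass(stream, reply)
    received.append(got)
returned = bytes(b for b, _ in stream)
print(returned)
```

Each slave runs identical code and still ends up with its own packet, because the slaves upstream have already pulled the 9th bit low on theirs.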

google finds this about BAUD rates
https://www.avrfreaks.net/forum/same70-uart-maximum-baudrate-maybe-bug-start
suggests 18.75M/N is the supported equation ?

ManAtWork · 2019-09-26 09:29

Another idea to push throughput is to load the transmit buffer to cog ram automatically by restarting the cog for each packet. coginit loads the whole 2kB from hub to cog ram with the fastest possible throughput of one long every 16 clocks. This isn't really needed for the time critical data which is only 12 bytes per slave. But it's useful for debugging. So sending a chunk of memory dump takes no extra time.
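For scale, the reload time works out as below. A sketch assuming an 80 MHz clock and taking the 2kB and 16 clocks per long from the paragraph above:

```python
SYS_CLK_HZ = 80_000_000
COG_BYTES = 2048
CLOCKS_PER_LONG = 16

longs = COG_BYTES // 4
load_clocks = longs * CLOCKS_PER_LONG
load_us = load_clocks * 1e6 / SYS_CLK_HZ
bytes_per_second = 4 * SYS_CLK_HZ / CLOCKS_PER_LONG

print(longs, load_clocks, round(load_us, 1), bytes_per_second / 1e6)
```

That is about 102µs per reload, or roughly a tenth of the 1ms cycle, at an effective 20MB/s.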

I have to restart the cog for each packet anyway as any lost start bits could freeze the communication cog. So a second cog has to watch for timeouts.

BTW, the MISO packet is seamlessly composed out of multiple data-slices from different slaves. Each slave knows when its slot begins and ends. Output enable for the MISO bus line is switched on and off during the stop bits so no glitch or collision occurs.

jmg · 2019-09-26 09:51

ManAtWork wrote: »

.... Each slave knows when its slot begins and ends. Output enable for the MISO bus line is switched on and off during the stop bits so no glitch or collision occurs.

If you are doing that, you may need to pre-bias the terminations, so any gap in hand-over does not drop to 0V.
CAN transceivers have a fixed offset, and drive one way only, so they have no hand-over timing issues.

ManAtWork · 2019-09-26 10:40

Yes, pre-bias to the stop bit level is a good idea. But more to prevent false start bits being generated from noise during the gap between packets when no slave sends. The hand-over gap between slave slots is only a few clock cycles (one prop instruction). The parasitic bus line capacity should help keeping the stop bit level in this case.

ManAtWork · 2019-09-26 11:01

here is a snippet of the (preliminary) code

              or      OUTA,maskMISO             ' stop bit out, MISO=1
              mov     CTRA,#0                   ' prevent PHSA from disturbing MISO
              ' one instruction gap before waitpne to allow for
              ' asynchronous arrival of start bit

byte2         waitpne maskMOSI,maskMOSI         ' wait for start bit in
              andn    OUTA,maskMISO             ' start bit out, MISO=0
              rcr     shiftIn,#1                ' shift in last MOSI bit
              sub     slotCnt,#1 wz             ' 2 instructions left for outer loop housekeeping
              
        if_z  shr     slotBits,#1               ' ... next slot transmit enable?
              mov     CTRA,modeNCO              ' 1st data bit of 2nd byte PHSA MSB -> MISO
              test    maskMOSI,INA wc
              mov     b,#7

bitLoop2      rcr     shiftIn,#1                ' shift in last MOSI bit
              shl     PHSA,#1                   ' next TX bit -> MISO
              test    maskMOSI,INA wc           ' get RX bit <- MOSI
              djnz    b,#bitLoop2

              ' last djnz takes 8 cycles
              or      OUTA,maskMISO             ' stop bit out, MISO=1
              mov     CTRA,#0                   ' prevent PHSA from disturbing MISO

The outer loop is unrolled because there are only few instructions left for load/store, loop jumps, slot handling and other tasks not directly related to bitbanging.
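Counting instructions in bitLoop2 gives the bit rate of the snippet. A sketch assuming 80 MHz, 4 clocks per instruction, and a 10-bit frame (start, 8 data, stop) with no idle gap between bytes, so the byte rate is an upper bound:

```python
SYS_CLK_HZ = 80_000_000
CLOCKS_PER_INSTR = 4
LOOP_INSTRUCTIONS = 4         # rcr, shl, test, djnz
BITS_PER_FRAME = 10           # start + 8 data + stop (assumed gapless)

clocks_per_bit = LOOP_INSTRUCTIONS * CLOCKS_PER_INSTR
bit_rate = SYS_CLK_HZ / clocks_per_bit
byte_time_us = BITS_PER_FRAME * clocks_per_bit * 1e6 / SYS_CLK_HZ
max_bytes_per_ms = 1000 / byte_time_us

print(clocks_per_bit, bit_rate / 1e6, byte_time_us, max_bytes_per_ms)
# 16 5.0 2.0 500.0
```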
