Serial I/O Throughput -- A Comparison.

greybeard · 2014-07-06 22:47

I have a project that uses serial communication to pass data back and forth between a Windows application and a Propeller-based board. Currently it uses the JDCogSerial module and runs well at 230400 baud. I recently wanted to increase the throughput and started examining the use of SimpleIDE to compare throughput performance. I have always held the opinion that C code would increase the throughput performance.

I built a very simple C project that echoes a string, i.e. it receives a character, adds the character to a string, and retransmits that string back to the host when a carriage return/line feed is received. I used both a C3 unit and a PropStick for the target systems, and the Parallax Serial Terminal that comes with the Propeller Tool because it handles baud rates in multiples of 115200 baud. I first used the SimpleIDE built-in FDSERIAL driver. The results were disappointing: the echoed string had many unprintable characters at 230400 baud. Additional testing indicates that the target can send strings reliably at much higher baud rates, indicating that the problem with the echo is related to the receipt of the data by the target.

I next used Spin2Cpp to convert JDCOGSERIAL and JDCOGSERIALDEMO and used these instead of FDSERIAL. This code started dropping bytes and introducing spurious characters in the echo at 460800 baud. Additional testing indicated the target could repeatedly send a fixed string without dropping characters, again suggesting that the receive part of the code is the culprit.

Finally, I used the JDCOGSERIAL Spin code as the driver for a Spin program to serve as the echo engine. Surprisingly, it would echo strings without dropping or introducing spurious characters up to 921600 baud.

Why did I do this??? It was my intent to convert the Spin Code, in the propeller based board, to C to increase performance. These results suggest that performance may actually be degraded.

Heater. · 2014-07-06 23:40

greybeard,

So far your tests say nothing about the relative performance of C and Spin. Or at least very little. You are using different drivers in each case.

I imagine all the drivers you mention use PASM for the actual bit banging of the serial protocol. That PASM should be the same no matter if it is used from C or Spin. I also imagine the PASM part sets the limit on speed therefore things should perform the same no matter if used from C or Spin.

I'm going to ignore the Spin2Cpp converted code as I have no idea what kind of code that actually produces and it's not something I would want to use seriously.

Now. JDCogSerial and FDSERIAL are very different. From this thread http://forums.parallax.com/showthread.php/131646-Difference-between-FullDuplexSerial-and-JDCogSerial I read that:

"JDCogSerial keeps its serial buffer in the cog's memory while FullDuplexSerial keeps its buffers in the hub memory...it claims speeds > 750KBps."

Clearly keeping buffers in COG creates a much faster serial driver than FDSERIAL.

I guess what you need is a hand written C wrapper around the JDCogSerial in much the same way as the FDSerial.

Heater. · 2014-07-06 23:53

Having written that I'm starting to wonder...

We are told that JDCogSerial can achieve much higher baud rates than FDSerial. The guess is that this is because it keeps its buffers in COG rather than HUB RAM.

That makes sense. At least when dealing with bursts of data that are small enough to fit in those COG buffers.

But, what happens when the incoming data is a continuous stream? That data in the COG buffers has to be written out from COG to HUB at some point for use by other COGs.

It seems to me that at that point using COG buffers no longer helps you; you are having to continuously write to HUB anyway. In fact I start to think COG buffers should be worse. Why? Because you have more work to do: first keep the received data in a COG buffer, then read it from that buffer and write it to HUB.

I would have guessed that for continuous data streams HUB buffers would sustain higher baud rates.

Has anybody tested and compared this sort of thing?

Bill Henning · 2014-07-07 10:47

The hub does not limit the speed - even with one byte reads/writes, the hub can do 5M ops per sec, which would correspond to 10Mbps - which is not feasible even in pasm (unless two props share a clock, then maybe). Buffering to reading/writing 4 longs when possible would theoretically allow for almost 40Mbps.

At 80Mhz, the maximum half-duplex async I could see a cog do is 10Mbps, and I would not want to do that for critical systems (ie start bit, 8 data, stop bit) ... 10Mbps async should be OK at 100MHz.

5Mbps half duplex would be fine at 80Mhz with unrolled hand tuned code.

5Mbps full duplex may be (barely) possible at 80Mhz.

With good enough pasm code, 2.5Mbps full duplex should work (one port per cog) @ 80Mhz, and maybe 3Mbps @ 100Mhz.

Note, Beau has code for effective 14Mbps @ 80MHz, however that relies on 34 bit packets (start bit, 32 data bits, stop bit)

FYI, I think the latest FTDI usb/ser chips could do 2.5Mbps

Heater. · 2014-07-07 11:12

Bill,

The claim seems to be that JDCogSerial can achieve much higher baud rates than FDSERIAL. Where I assume FDSERIAL is Chip's standard issue FullDuplexSerial. I have no idea what JDCogSerial is.

We are talking regular UART here not some weird 32 bit packet thing.

Now, I can believe that stashing bytes in a COG buffer is quicker than using a HUB based FIFO. For short bursts. But I find it hard to believe it works out for sustained data rates.

Bill Henning · 2014-07-07 12:35

I was talking about theoretical fastest start-8data-stop bit that PASM could handle, as the OP indicated a need for speed

I have no idea about this JDCogSerial either.

greybeard · 2014-07-07 13:44

I really didn't want to start a firestorm. I posted this as a point of interest only. The test results suggest that FDSERIAL and JDCOGSERIAL both exhibit much poorer performance as a receiver at high baud rates than as a transmitter, and that JDCOGSERIAL was slightly better as a receiver. The limits for reliable two-way exchange of data seem to be about 230400 baud for JDCOGSERIAL and 115200 for FDSERIAL.
Speculative reasons may be:
1. the FTDI buffer fills up and doesn't have room for transmission.
2. the text window (screen) is too slow and loses characters.
3. a problem in the receive routine of both.
4. PASM code gives higher priority to incoming data and starves out the transmission.
5. A problem with Spin2Cpp (doubtful but still possible)

I'm going to let the pros handle this one. I have the data I need for now. I'll revisit the problem later when I have the time and patience for it.

Thanks guys. Have fun with this one.

DavidZemon · 2014-07-10 07:32

I've written an unbuffered UART driver in C++ that requires one cog and handles very high speeds. Here are my test results:

All tests performed with XTAL @ 80 MHz
Max speed [baud]:
- Send
  - CMM: 1,428,550
  - LMM: 1,428,550
- Receive
  - CMM: 740,720
  - LMM: 740,720
Max transmit speed [average bitrate of PropWare::UART::puts() w/ 8N1 config]:
- CMM: 339,227
- LMM: 673,230
TODO: Determine maximum baudrate that receive_array() and receive() can read data in 8N1 configuration with minimum stop-bits between each word

Do take note of the fact that "max speed [baud]" is not the same as "max transmit speed". There's a delay between each character as it reads in the next byte and does its necessary prep work, so for high speeds like these you end up with many more than 1 stop bit in 8N1 configuration. So, taking into account those pauses, you can achieve a maximum effective baud of 673,230 using LMM or 339,000 baud with CMM.

Also note, I haven't tested max receive speed - only max receive baud.

Lots more information and full docs can be found here.

DavidZemon · 2014-07-10 07:38

PropWare certainly can't compete with 921,000 receiving baud (yet), but it does offer parity. I'm not sure how critical your application is, but maybe you could deal with a decrease in speed if it improved reliability.

Kye · 2014-07-10 09:43

Zero jitter TX and RX in ASM with one cog is limited to a max standard baud rate of 115200. The actual maximum rate is higher but non-standard...

The full duplex driver Parallax provides has a lot of jitter on the TX line. If you want to get zero jitter you have to use a waitcnt approach. This involves a single task in a cog running at 4X the baud rate, running the RX code 3/4ths of the time and the TX code 1/4th of the time. http://obex.parallax.com/object/246
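
To put a rough number on that 4X scheme, here is a small Python sketch of the clock budget per waitcnt-paced slot. It assumes an 80 MHz system clock and 4 clocks per typical PASM instruction. The function name is made up for illustration and is not part of the linked driver.

```python
def slot_budget(baud, sysclk_hz=80_000_000, slots_per_bit=4, clocks_per_instr=4):
    """Return (clocks, whole instructions) available in each time slot.

    Assumption: the cog is paced at slots_per_bit slots per bit time,
    as in the 4X approach described above.
    """
    clocks = sysclk_hz / (baud * slots_per_bit)
    return clocks, int(clocks // clocks_per_instr)


if __name__ == "__main__":
    for baud in (115_200, 230_400, 460_800):
        clocks, instrs = slot_budget(baud)
        print(f"{baud:>7} baud: {clocks:6.1f} clocks/slot, about {instrs} instructions")
```

The waitcnt itself and any hub access come out of that budget, so the usable instruction count per slot is lower than the figure printed here.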

greybeard · 2014-10-24 14:52

I started the thread and appreciate all the replies. They have all been on the mark and my subsequent work has verified all of them.

The application uses small packets in somewhat periodic fashion. Since there is a lot of traffic, I wanted to increase transfer rates so the app would have more time to attend to other matters. I found I could transmit from the PROP to HOST at a very high rate. I could transmit from HOST to PROP at a very high rate. But my main interest is to exchange data between HOST and PROP at a high speed. When I set up a test for round trip transfer, I found the reliable baud rates went down appreciably for both JDCogSerial and FDSERIAL.

When I set up a test for the prop to echo packets (50 bytes or so) immediately as they are received, I found the round trip throughput was reduced to about 292000 using the Windows Serial API and the FTDI DLL before I started losing bytes. Note that the app does not use any handshaking, and that could also be a major limiting factor.

I take this to be the upper limit for my application. Obtaining higher baud rates will require major design changes in software and hardware.

Again. Thanks for the feedback

MJB · 2014-10-29 03:27

Peter writes that his Tachyon Forth TX can go up to 3Mbit/s with the capable FTDI-USB chip.
RX is handled in a separate COG.
I have no numbers on a sustained data rate for TX alone,
and not for echo either.

But 3 Mbit/s TX is at least impressive, even for burst writes.

DavidZemon · 2015-04-18 21:24

An update to PropWare's (unbuffered) serial routines yields the following:

All tests performed with XTAL @ 80 MHz

Max burst speed:
- Send: 4,410,000 baud
- Receive: 740,720 baud
Max transmit speed (average bitrate of puts/send_array w/ 8N1 config): 2,680,144 bps
transmit delay between words for single- and multi-byte routines
- CMM:
  - send: 63.0 us
  - puts/send_array: 1.0 us
- LMM:
  - send: 15.6 us
  - puts/send_array: 1.0 us

I'll be updating the receive routines soon

jmg · 2015-04-18 22:47

I get 80/18 = 4.4444444' MBd ? 4.410 seems to be fractional cycle ?

Is the 2,680,144 bps the average sustained Tx speed - Indicates Byte times of 13.266 Bit times (=4.266 stop bits, on average)
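
The 13.266 figure can be reproduced with a few lines of Python. This is only a sketch under two assumptions: the bit rate on the wire is 80 MHz / 18, and the 2,680,144 bps figure counts data bits only (8 per character). The function name is hypothetical.

```python
def byte_time_in_bits(line_baud, data_bps, data_bits=8):
    """Average time per character, expressed in bit times on the wire."""
    chars_per_sec = data_bps / data_bits
    return line_baud / chars_per_sec


line_baud = 80_000_000 / 18             # about 4.444 MBd
frame_bits = byte_time_in_bits(line_baud, 2_680_144)
stop_bits = frame_bits - (1 + 8)        # subtract the start bit and 8 data bits

if __name__ == "__main__":
    print(f"bit times per character: {frame_bits:.3f}")
    print(f"average stop/idle bits:  {stop_bits:.3f}")
```

Everything beyond the 9 start and data bits is idle line time, which is where the average of a little over 4 stop bits comes from.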

Peter Jakacki · 2015-04-19 00:52

MJB wrote: »

Peter writes that his Tachyon Forth TX can go up to 3Mbit/s with the capable FTDI-USB chip.
RX is handled in a separate COG.
I have no numbers on a sustained data rate for TX alone,
and not for echo either.

But 3 Mbit/s TX is at least impressive, even for burst writes.

FYI

Transmit is fairly easy as it is unbuffered and at the higher baud rates you can go over 4Mbaud but don't expect it to be a standard baud rate. It makes far more sense at high baud rates to not buffer as reading and writing to a buffer can take far longer than actually bit-banging the character. For sustained high-speed operation without additional stop bits between characters it would be necessary to run the transmit routine in its own cog possibly in half-duplex fashion with the receive routine and use a single character buffer to minimize overhead. Of course I'm talking about running app code from hub and not from the same cog as the transmit.

For receive the maximum baud rate for buffered receive with only a single stop bit between characters (no delays) is close to 3Mbaud although 2M is a bit more of a standard baud rate.

The FT232R chip is limited to 3Mbaud so it wasn't possible to test this interface beyond that.

jmg · 2015-04-19 01:28

Peter Jakacki wrote: »

For receive the maximum baud rate for buffered receive with only a single stop bit between characters (no delays) is close to 3Mbaud although 2M is a bit more of a standard baud rate.

Is there a simple table of Baud : StopBits needed ?

Small MCUs can add 2 stop bits reasonably easily, some can do 3 stop bits.
3.0625MBd is a 'standard value' for a EFM8BB1 for example, but it can also do 4.08333MBd and 2.45MBd, 2.04166MBd, 1.75MBd etc... (24.5M/2N)
In Prop terms, @ 80MHz 3.0625MBd is 26 SysClocks bit time. (0.47% Baud error)
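
As a quick check of those numbers, here is a hedged Python sketch. It assumes the 24.5 MHz clock and the divide-by-2N rule quoted above for the EFM8BB1, and an 80 MHz Propeller that can only produce bit times of a whole number of system clocks. Both function names are invented for this example.

```python
def efm8_bauds(clock_hz=24_500_000, divisors=range(3, 8)):
    """Baud rates of the form clock / (2 * N) for a few values of N."""
    return {n: clock_hz / (2 * n) for n in divisors}


def prop_baud_error(target_baud, sysclk_hz=80_000_000):
    """Nearest whole-clock bit time and the resulting baud error in percent."""
    clocks = round(sysclk_hz / target_baud)
    actual = sysclk_hz / clocks
    error_pct = (actual - target_baud) / target_baud * 100
    return clocks, actual, error_pct


if __name__ == "__main__":
    for n, baud in efm8_bauds().items():
        print(f"N={n}: {baud / 1e6:.5f} MBd")
    clocks, actual, err = prop_baud_error(3_062_500)
    print(f"{clocks} clocks/bit -> {actual / 1e6:.4f} MBd, error {err:.2f}%")
```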

DavidZemon · 2015-04-19 07:20

jmg wrote: »

I get 80/18 = 4.4444444' MBd ? 4.410 seems to be fractional cycle ?

Thanks, I'll trust you're right. I just measured it with my logic analyzer. I probably shouldn't have reported such precision when only using a 24 Msps LA.

jmg wrote: »

Is the 2,680,144 bps the average sustained Tx speed - Indicates Byte times of 13.266 Bit times (=4.266 stop bits, on average)

That's an average over 58 characters. The extra stop bits are due to the read from hub RAM and other overhead for each word. That's where the minimum delay of ~1us comes in.

DavidZemon · 2015-04-19 19:50

PropWare's receive routines are now capable of 2.75 Mbaud bursts. I am not sure what length of time is required between words.

jmg · 2015-04-19 20:09

SwimDude0614 wrote: »

PropWare's receive routines are now capable of 2.75 Mbaud bursts. I am not sure what length of time is required between words.

Do you mean bytes or 32b words ? - or does it receive UART bytes into 32b words, so naturally handles multiples of 4 bytes, with a smaller time needed between each quad-element, than between following quads ?
Small MCUs are all pretty much byte based UARTS (with the exception of Infineon XMC1000 series)
Some do have small FIFOs, and if more than 2 stop bits are needed, they all can run a timer-paced Send.

DavidZemon · 2015-04-19 20:12

It is configurable from 1- to 16-bit words.

jmg · 2015-04-19 20:26

SwimDude0614 wrote: »

It is configurable from 1- to 16-bit words.

Why stop at 16 ? With crystals, 32b is another natural choice for Prop-Prop links ?

DavidZemon · 2015-04-19 20:32

jmg wrote: »

Why stop at 16 ? With crystals, 32b is another natural choice for Prop-Prop links ?

Because it has configurable stop bit width and parity, and all start, stop, parity and data bits have to fit in a single register. A bump to 24 bit max would probably make sense though

jmg · 2015-04-19 20:42

SwimDude0614 wrote: »

Because it has configurable stop bit width and parity, and all start, stop parity and data bits have to fit in a single register. A bump to 24 bit max would probably make sense though

That limits you to 29-30 bits, and the Prop has a natural node at 3 x 9 bits = 27 bits, with a couple of flag bit options to use all 29-30b of data field.

davidsaunders · 2015-04-19 20:51

OK, interesting. What would be possible with a minimal bitbanger, using a fixed 8-N-1 little endian configuration (almost universal)?

If shifting out from a buffer of multiple 32 bit words in cog, you could figure on 5 instruction loop at 4 clocks per instruction, so that is 20 clocks per bit, which makes for 160 clocks for the first byte, plus another 20 clocks (to keep consistency) for the stop bit, that is 180 clocks for each byte, 180/8 = 22.5 clocks per bit, at 80MHz that is 80,000,000/22.5 = 3555555.556bps. So I would like to see a loop that can do over 4Mbps while maintaining 8-N-1 and allowing a multi word in cog buffer.
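
The estimate above can be written out as a short Python sketch. It assumes the 5-instruction, 20-clock loop from the post and an 80 MHz clock; the function name is hypothetical. The first call counts 8 data bits plus the stop bit, as the post does. The second call also counts the start bit, since an 8-N-1 frame occupies 10 bit times on the wire.

```python
def loop_throughput(bits_per_frame, clocks_per_bit=20, data_bits=8,
                    sysclk_hz=80_000_000):
    """Return (clocks per byte, line rate in baud, data throughput in bps)."""
    clocks_per_byte = clocks_per_bit * bits_per_frame
    line_baud = sysclk_hz / clocks_per_bit
    data_bps = sysclk_hz * data_bits / clocks_per_byte
    return clocks_per_byte, line_baud, data_bps


if __name__ == "__main__":
    # As in the post: 8 data bits + 1 stop bit.
    print(loop_throughput(bits_per_frame=9))
    # With the start bit included: 1 start + 8 data + 1 stop.
    print(loop_throughput(bits_per_frame=10))
```

In both cases the bit rate on the wire is 80 MHz / 20 = 4 Mbaud. The difference is in how many of those bit times carry data, so the data throughput is about 3.56 Mbps by the post's count and 3.2 Mbps with the start bit included.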

DavidZemon · 2015-04-19 20:55

The set_data_width function doesn't have the logic to safely handle that yet. If I opened it up that high it would leave the possibility of overflow. It's definitely something that should be done though.

DavidZemon · 2015-04-19 20:59

davidsaunders wrote: »

OK, interesting. What would be possible with a minimal bitbanger, using a fixed 8-N-1 little endian configuration (almost universal)?

If shifting out from a buffer of multiple 32 bit words in cog, you could figure on 5 instruction loop at 4 clocks per instruction, so that is 20 clocks per bit, which makes for 160 clocks for the first byte, plus another 20 clocks (to keep consistency) for the stop bit, that is 180 clocks for each byte, 180/8 = 22.5 clocks per bit, at 80MHz that is 80,000,000/22.5 = 3555555.556bps. So I would like to see a loop that can do over 4Mbps while maintaining 8-N-1 and allowing a multi word in cog buffer.

Do you really care if it maintains only one stop bit? If not, you should be asking about max throughput in bits per second, not baud. If your receiver relies on no more than 1 stop bit, it's definitely a hindrance

jmg · 2015-04-19 21:07

SwimDude0614 wrote: »

Do you really care if it maintains only one stop bit? If not, you should be asking about max throughput in bits per second, not baud. If your receiver relies on no more than 1 stop bit, it's definitely a hindrance

I would rephrase that slightly, and say relying on more than 2 stop bits is a hindrance.
Sometimes you want 2 Stop bits for data-creep reasons, and most MCUs can manage 2.
Going above 2 however, is certainly not so easy.
