Reduced Latency 1X and 4X Serial
ksltd
The objects below are generally smaller and faster than the OBEX version from which they were derived. They include the following changes:
Transmit and receive buffers and their associated sizes are provided to the driver at the time a port is defined. The buffers may be any size greater than one byte (see the usage sketch following the attachments)
Context switch latency has been reduced for both the transmitter and receiver and, as a result, sustainable aggregate baud rates are improved
Put_Bytes has been added, which allows substantial improvements in aggregate transmit throughput
Get_Bytes_Timed has been added, which allows substantial improvements in aggregate receive throughput
Hardware flow control and support for inverted data have been removed along with all self-modifying code
cognew is not used; the ID of the core on which the driver is to be run must be explicitly provided
Documentation is substantial
Sustainable full duplex throughput of the 1x driver should be approximately 900000 baud; I use it reliably at 460800
Sustainable full duplex throughput of the 4x driver should range from approximately 4 ports @ 200000 baud up to 1 port @ 350000 baud
For reference, the OBEX fullduplexserial4port with 400 bytes of statically allocated buffers is 619 longs. The Serial_4X provided here with no statically allocated buffers is 433 longs. Allocating the same 400 bytes brings that to 533 longs, for a savings of 86 longs, better performance and added functionality.
There remains a little opportunity to reduce latency at the expense of code size, but I don't believe that tradeoff makes much sense as it won't push performance to the next industry standard baud rate in a given configuration.
Serial_1X.spin
Serial_4X.spin
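To make the calling convention concrete, here's a minimal usage sketch. The method names Put_Bytes and Get_Bytes_Timed come from the list above, but the parameter order, pin numbers and Start signature are illustrative guesses; consult the attached objects for the real API.

CON
  _clkmode    = xtal1 + pll16x
  _xinfreq    = 5_000_000

  SERIAL_CORE = 1                       ' core explicitly reserved for the driver

OBJ
  ser : "Serial_1X"

VAR
  byte  rxbuf[64]                       ' caller-owned buffers, any size greater than one byte
  byte  txbuf[64]
  byte  msg[16]

PUB Demo | got
  ' Hypothetical calls, for illustration only
  ser.Start(SERIAL_CORE, 31, 30, 115_200, @rxbuf, 64, @txbuf, 64)
  ser.Put_Bytes(string("hello"), 5)               ' burst transmit into the ring buffer
  got := ser.Get_Bytes_Timed(@msg, 16, 1_000)     ' burst receive with a timeout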
Comments
In case any of you missed it, this was discussed in a recent serial driver thread.
http://forums.parallax.com/showthread.php/160155-Four-Port-Serial-with-152-Longs*-(with-lots-of-caveats)
In many serial apps there is a need for 9 bit mode, and 2 stop bits can be useful to manage baud-skew in continual streaming situations.
For Prop-Prop apps, word lengths right up to 32 bits would be useful.
I'm testing small slave MCUs (very cheap, sub-50c) as chainable slave IO; they use 9 bit mode, 2 stop bits on the 'master' (PC or Prop) and they can sustain 3MBd or 4MBd continuously in a TTL-232 ring topology.
The advantage of a ring is that there is no address housekeeping and no pins are lost.
The position in the Ring sets what data each gets.
And industry standard baud rates continue to dominate asynchronous serial interfacing, as reflected by nearly every terminal emulator and most 3rd party serial interface devices, even when those devices are built from hardware that is more flexible.
Perhaps you'd meant to post to some other thread?
Thanks here too. I've been busy at work/home and haven't had much time on the forums. I'm interested to look at your optimizations, and indeed the pasm is well documented. Overall, I like the fact that the buffers are external, defined at run time. And I like the fast transfer to/from the buffers using bytemoves. Those are useful, out-of-the-ordinary features compared to existing serial objects. "Added functionality" is in the eye of the beholder. I'm sure that before you can blink an eye someone will ask you for a feature that is not included, say, open baudmode on one port, inverted on another, 9 bits, pacing, yada yada...
I do wonder why you chose not to use cognew?
This is the correct thread; perhaps I misunderstood what I thought was your desire for general use, with changes like a run-time selection of buffer size.
To me, it seems a natural step to add data length (for example) as a user parameter, even if just as a single-edit header define.
Feel free to ignore any suggestions that go outside your comfort zone.
I have given you real baud rates, and I'm thus unsure what you actually mean by 'industry standard baud rates'.
- to me, a baud rate that can be requested, and delivered, by industry standard equipment (FT2xx, CP21xx) is, rather by definition, an industry standard baud rate.
The terminals I use, allow me to select all the baud rates I gave above. Most modern MCUs are similar.
Legacy values like 9600 and 115200 could be considered (very old) subsets of ~2015 industry standard baud rates
It has nothing to do with comfort zone - it has to do with viability of implementation. There is no possible way to implement dynamic bit-width symbol sizes with anything approaching reasonable performance or code size. I have drivers that implement 16 and 32-bit symbol sizes, but they cannot be of interest to anyone because the equipment with which they communicate requires a proprietary protocol.
While you may remain steadfast in your denial of industry standards for baud rate, the fact remains that the vast majority of asynchronous communications equipment is limited in choice to a subset of 110, 134.5, 300, 600, 1200, 2400, 4800, 9600, 19200, 38400, 57600, 115200, 230400, 460800 and 921600 baud.
And, running at 80 MHz, it's similarly not possible to receive within the framework of this driver at rates beyond about 900K baud.
No, there's no requirement to use industry standard rates in the implementation; you may use any rate up to the recommended limits. My only reference to the standard rates was about the potential for additional performance and my framing of those tradeoffs vis-a-vis code size. Yes, I'm aware of other communication rates. Yes, I'm aware of techniques for driving serial lines faster at 80MHz. But your input is not pertinent to this topic, which is about the derivative works to the existing drivers posted above. Your suggestions are as meaningful as suggesting the entire thing be implemented in 10 words of code or support SDLC or bake cookies - or all three.
And there was no solicitation of suggestions anyway. I invested 3ish hours cleaning up some code, testing it thoroughly and meticulously documenting latency through it, and what I get are your inane comments. These drivers are a result of comments in another thread, made at the request of that thread's participants. If you'd like to post your own drivers that implement something else, or if you'd like to further polish the turd that is what's posted above - by all means, go right ahead. Please.
You wonder why I don't use cognew, I wonder why cognew exists! locknew & lockret, too. Here's my perspective:
The device has very limited resources. That's true in terms of cores, locks and memory. We all know that.
The language does not support dynamic instantiation of objects. The object hierarchy is static.
If you look at the mess that results from using cognew - storing the result, returning errors, managing (or usually not) those errors - it's all just wasted code that either mucks up your implementation at worst or consumes precious memory at best. And it's all for naught!
Any application has to have some mechanism for not running out of these resources. Since the entire framework is static, these resources can be statically allocated. And, the statically allocated core IDs and lock IDs can be passed into the implementations that require them.
The result is less code, none of those silly error paths that never fail and a cleaner understanding of what's going on.
I mean, have you ever constructed an application in which running out of cores or locks is handled and recovered from gracefully? Do you ever shoot down one thread in order to start another? Don't you conceptually allocate cores and locks when you're architecting your solution in the first place?
I can imagine why cognew and locknew may have felt like good ideas when the architecture was being defined, but from my view of the world, they're just warts without merit.
Do you see it differently?
edit - Note that if there were 256 or maybe even 16 cores and/or locks, I'd likely view it all differently. There's certainly a crossover point where dynamic allocation brings benefits - but not when the resource pool size is but 8.
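To illustrate the tradeoff being argued here, a minimal sketch of the two launch styles; the constant and method names are invented for illustration and are not from the posted drivers.

CON
  SERIAL_CORE = 1                       ' static resource map: the application picks the core up front

VAR
  long  driver_stack[32]

PUB Start_Static(rx_pin, tx_pin, baud)
  ' Static style: the caller passes the core ID in; there is no result
  ' to check and no failure path to carry around.
  coginit(SERIAL_CORE, Driver_Entry(rx_pin, tx_pin, baud), @driver_stack)

PUB Start_Dynamic(rx_pin, tx_pin, baud) : ok | id
  ' cognew style: the launch can "fail", so the caller inherits an error
  ' path that a fully static design never exercises.
  id := cognew(Driver_Entry(rx_pin, tx_pin, baud), @driver_stack)
  ok := id <> -1

PRI Driver_Entry(rx_pin, tx_pin, baud)
  ' Placeholder body; a real driver would launch its PASM here.
  repeat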
btw, if you toil and labor long for peanuts then peanuts are what you get, welcome to the "peanut gallery"
Re: "support for inverted data have been removed along with all self-modifying code". The FDS4port object does not use dynamic self-modifying code, apart from the jmpret vectoring mechanism. It's initialization section does patch the code in situ to support things like inverted and open modes. Supporting them is "free" in terms of speed of execution, faster than the original FDS, which has to take time to test flags in the main loop to support those capabilities. I'm not suggesting that you need to go beyond your 31ish hours. It does what you want. Great! It's just that other folks need modes other than non-inverted/driven.
You've missed most of my points, but I'll address this one mainly for the benefit of others:
Yes, I admit there are legacy values, often quoted, but they are NOT 2015 industry standard baud rates - they are merely legacy values, and in some cases relying on them can be worse than understanding how the 2015 hardware actually works.
I consider industry standard baud rates to be what I can get from the UARTs on my desk, using the terminals & comms libraries I have here on my PC, in 2015 (& for the last half decade).
Those are most commonly derived from a 12MHz Virtual Baud Clock (in some cases, like FT232H 24MHz)
In fact, I have been testing systems in the last few days at 3MBd and 4MBd. (FYI, the Rasp Pi can go up to 4MBd.)
To clarify why this matters: you mentioned the code cannot go "at rates beyond about 900K baud", and that's fine,
yet you also seem to consider 460800 and 921600 as somehow the only valid values in this region.
To give some 2015 figures, I can operate at any of these baud rates: 857142, 923076, 1M, 1.0909M, 1.2M, etc. These values could be important to anyone using the libraries, but are still within your provided spec.
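For readers wondering where those figures come from: assuming the 12 MHz virtual baud clock mentioned above and simple integer divisors, they fall straight out.

CON
  ' Baud rates from a 12 MHz virtual baud clock with integer divisors
  BAUD_N14 = 12_000_000 / 14            ' 857_142
  BAUD_N13 = 12_000_000 / 13            ' 923_076
  BAUD_N12 = 12_000_000 / 12            ' 1_000_000
  BAUD_N11 = 12_000_000 / 11            ' 1_090_909  (the "1.0909M")
  BAUD_N10 = 12_000_000 / 10            ' 1_200_000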
Sounds good, I believe there is a place for both.
Multi-port of course has lower ceilings, and in many cases can do the job, but the 1-4MB region is growing quite rapidly.
I'd suggest giving users control over message length, which would open up 9 bit modes used by many small MCUs - it can be a static/compile time setting.
I can also see 18 or 32 bit transmissions as being useful for Prop-Prop applications.
( - and I'm sure you'll spec 2015 baud figures, alongside some legacy numbers from the distant past )
Yes, I have many uses for 9-bit mode and multi-drop addressing, which of course includes half-duplex RS485, so the driver will be patched to handle this. I don't have a problem with variable bit-length serial; it's just that when we are talking async, 32 bits starts to stretch the timing tolerances at higher speeds, as it is the edge of the start bit that synchronizes the reception. Anyway, I like the fact that I can grab a general-purpose multi-port object and shape it to my requirements. As for higher speeds, I can take my high-speed receive driver and use it anytime, as the transmission is bit-banged on the spot rather than buffered; this saves valuable time and ensures timing precision (as does the receiver).
But all the Prop serial drivers accept any baud-rate up to their maximum anyway, the problem is that many PC serial drivers and terminal emulators are stuck in "standard" baud-rates and they rarely seem to go above 115.2k baud which is understandable if the serial port was actually RS232, which it rarely ever is anymore.
The benefit I see is that larger serial data can free code space that was otherwise consumed by packet parsing.
Which Terminal Emulators in 2015 are stuck with only legacy value choices ?
Windows drivers expect a 32 bit value for Baud, and that value often gets passed to the device end to manage.
Most systems I use will accept any baud value and deliver their best approximation (ideally, this is the rounded-nearest value, but it's surprising how many vendors do not quite get that detail right).
From a management viewpoint I prefer to not ask for any value, but to restrict myself to the rounded(standard) values I calculate from the Virtual Baud clock.
It's usually quickly obvious what ability your serial ports have, as the more primitive/ancient fail to configure - and you then fall back to the legacy value. As you say, that is very rare.
You seem to have a reading comprehension problem. First, as others have pointed out, these drivers support any baud rate up to some limit at which the implementation's latency prevents it from working. Again, arbitrary baud rates are supported. Do you follow?
Second, whether you can program any given device such as an FTDI USB-to-serial device to an arbitrary baud rate is rather more an implementation detail of that device than a practical consideration governing its usability. It's a side effect of the fact that the device itself is organized with a pre-scaler that's some number of bits wide.
However, in many instances, the API available in Windows, Linux and other operating systems for device independent serial communication uses a simple enumeration of baud rates and does not provide access to the lower level arbitrary baud rate capability of the physical device. End user software usually has these limitations as well. Look no further than the Parallax terminal emulator or Microsoft's HyperTerminal on Windows for evidence of this.
Finally, it is beyond question that most end-point devices with serial interfaces have very simplistic approaches to selecting baud rate. Whether it's a GPS, an embedded touch screen or a VNC1L chip, one uses an enumeration to select a baud rate from a limited palette.
The result is that while many devices have a wide range of capability, most implementations built atop those offer few choices.
This is not the exception, this is the norm.
Your comments regarding baud rate conversion and larger symbol sizes further show that you fail to comprehend the inherent limitations of asynchronous communications. There's nothing about symbol size that is associated with the code size of packet parsing. The amount of code associated with packet parsing is a fundamental characteristic of the packet organization, not the underlying symbol size. It takes as much code to parse TCP packets irrespective of whether the underlying transport moves bits, bytes or larger symbols.
Further, the reason for 8-bit symbols in asynchronous communications has everything to do with baud rate conversion and frequency drift between the domains of the transmitter and receiver. As symbol size grows, asynchronous communications becomes inherently less tolerant of frequency drift combined with the approximations of baud rate that result from using simple prescalers to generate baud clocks. These effects are multiplicative, not linear.
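As a rough rule of thumb (assuming mid-bit sampling and ignoring edge-detection granularity), the stop bit of an N-bit payload is sampled about N + 1.5 bit times after the start edge, so the combined transmit/receive clock error has to stay under roughly 0.5 / (N + 1.5): about 5% for an 8-bit payload, but only about 1.5% for a 32-bit one.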
But none of this is material to this thread and in fact does a massive disservice to those who simply wanted a lower latency implementation. Not one of your remarks, questions or statements has any bearing on the content. In addition to being misleading, you're also being disruptive. At the risk of repeating myself, perhaps a different thread would be a better place for your contributions.
I'm not sure my approach is dissimilar to yours. My "main program" has an initialization routine that generally calls all the initialize and start primitives. It makes use of an object that has a list of "keys" that contain global declarations for things like debug flags and other build-time options. Also within that list of "keys" are the lock and core assignments and the buffer sizes for use with objects like these serial drivers. I tend to have 25ish such declarations. The buffers themselves are declared in the main program and passed through into the drivers at initialization time.
When I start a new project, I simply clone this boilerplate, clean out the project specifics and leave behind the stuff that's germane to my reusable libraries. Then as the new project evolves, I add keys as necessary. I spend close to zero time thinking about it but bear none of the overhead associated with the dynamic allocation mess. I also discover early that I'm close to exhausting resources, as I always have a good handle on what's already allocated. Frankly, I'd be pretty miffed if I spent a week implementing a new pile-o-code only to discover that its implementation was non-workable at runtime because there were no available cores or locks. And because I strive for 100% path coverage in testing, all the unreachable code in the there's-no-available-core path sticks out like a sore thumb.
For me, it's the right tradeoff.
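As a sketch of that boilerplate, the "keys" object is nothing more than a constants-only Spin object; the names below are invented, just to show the shape.

CON
  ' Project-wide static resource map: every core, lock and buffer size
  ' that the application hands out lives in one place.
  CORE_SERIAL_4X  = 1
  CORE_DISPLAY    = 2
  LOCK_BUFPOOL    = 0

  SER_RX_BYTES    = 64
  SER_TX_BYTES    = 64

  DEBUG_ENABLED   = 1                   ' build-time option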
There's nothing about cognew that improves abstraction. All it does is necessitate error checking from which recovery is impossible and defers to runtime the detection of design errors. You'll notice that my serial implementation does handle the starting, it's just the core id that is passed in rather than a success boolean being passed out.
I'd guess it would take two hours to go through the OBEX and rip out all usage of cognew, locknew and lockret. The result would be smaller and more deterministic and, to my way of thinking, that would be an improvement for a body of work that is largely focused on embedded systems.
On top of that, it frees up the locknew/lockret instructions for use in algorithms that need to manage a pool of 8 things. I use those instructions to provide high performance buffer pool management as they do so atomically. So it stones two birds with one kill ... so to speak.
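A minimal sketch of that buffer-pool idea, using the hardware locks directly; the one-to-one mapping of lock index to buffer index is my own illustration, not the actual implementation.

PUB Claim_Buffer : idx
  ' Treat the 8 hardware locks as ownership flags for an 8-entry buffer
  ' pool. lockset returns the lock's previous state, so FALSE means the
  ' lock (and therefore the buffer) was just acquired atomically.
  repeat idx from 0 to 7
    if not lockset(idx)
      return idx                        ' buffer idx is now ours
  return -1                             ' pool exhausted

PUB Release_Buffer(idx)
  lockclr(idx)                          ' atomically mark buffer idx free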
The most important latency consideration for general-purpose use within the implementation is the time required to shift in the last data bit, buffer the byte and get back to the start bit detector. With the stop bit changes I'd previously made, the time allowed to do that is 1.5 bit times. It turns out it is this path, not the bit shifting path, that is the worst case, by far.
The second most important consideration is minimizing the time from shifting out the stop bit in the transmitter back through extraction of the next byte in the buffer and starting the start bit. All of this latency is dead time on the wire and, again, it's a far, far longer path than through the bit shift code.
The crazy thing is that it does relatively little good to crank up the receiver beyond the means of the transmitter - or vice versa. Balance is key. It also means that the latency math in the 4X driver is wrong. Because the latency of the 4X driver is really largely controlled by the worst case paths of the other 7 threads, the calculation won't change by much, but it is wrong and should be corrected.
With all that, I revisited the thread context switch points and put together a pretty large spreadsheet to help balance yin and yang. The result is a driver that's capable of receive rates at 980K and transmit rates at 960K. However, in order to hit those baud rates, the transmitter's per-character latency is about 265 clocks - that's 3.2 bit times (meaning 32%) at 960K baud and really slows down transmitted throughput. But I don't believe it's possible to do this appreciably faster with a single core.
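For anyone checking the arithmetic, those figures assume an 80 MHz clock and a 10-bit character frame:

CON
  ' 80 MHz / 960 kbaud gives roughly 83 clocks per bit
  CLOCKS_PER_BIT = 80_000_000 / 960_000
  ' 265 latency clocks / ~83 clocks per bit is about 3.2 bit times
  ' 3.2 extra bit times on a 10-bit frame adds about 32% per character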
Next up, a two core implementation that should reach 2M baud or better.
Serial_1X.spin
edit: wrong file was previously attached
With regard to latency and the other 7 threads... Your Serial_4x allows ports to be disabled, but, correct me if I am wrong, it does not patch out the execution of the first co-routine associated with each disabled task. For receive, that is not much, but for transmit it involves a hub access, which increases jitter.
The simplest way to avoid that is to patch the jmpret to a jmp at the time of initialization (in your notation, for the P0 transmit). That is the way I left it in FDS4port, version 1.01 in the OBEX.
However, the jumps that were previously jmprets are still in the scanning loop. More recently I've eradicated those jumps from the loop so that it only executes the jmprets associated with enabled port tasks, the minimal limit cycle. In my notation, the initialization patches each enabled task's yield target in turn; the last movs requires that the patching be done in the correct order and wrap around the whole loop.
I couldn't disagree more. This has nothing to do with what one is trying to do. The "burden" of static allocation as compared to the burden of constructing the return paths and the failure paths to check for the errors is demonstrably less.
And the cognew/locknew approach brings precisely zero value for its costs. If Spin supported dynamic object instantiation, or if the addressing model between the cores and the hub memory were different, I might come to a different conclusion. But as it stands, the concept simply encourages one to have code paths that can't be tested, fail to plan ahead for resource requirements and/or ignore errors that are returned with no ability to resolve them.
Sure, I know it's perverse to recommend that people think about how to solve their problem at the outset, but why on earth would anyone recommend a dynamic allocation strategy for a resource when there's no dynamic behavior?
If one were going to thoughtfully choose one solution over the other, there's no possible rationale for choosing cognew/locknew other than precedent, which is dogma not a rationale. Seriously, other than precedent, where is the value?
Your observations about stop and initialization are different - these actually are design considerations. They're a functional capability, not a "style" thing, and one that I require.
FYI - I don't store the core ID for stop, I pass it again to stop.
Yet you have nicely proven that education about 2015 baud rates certainly is relevant, and needed
My Windows APIs, and terminals, have no problems handling proper baud parameters, nor do most of the MCUs I work with.
Many readers may be surprised to find theirs do too. It is those readers I mainly address.
Is the RX path able to receive packed data at ~980k or does it need extended STOP bits ?
You can test this if you have a FT232H, as that can do sustained transmit at 923.077k or 857.143k
RX Testing with other USB-Bridge devices needs care at these high baud rates, as they often insert extra stop bits themselves. - That means a system tested with CP2105, may fail if a user connects a FT232H instead. (the CP2105 can also do 923.077k or 857.143k, but not sustained)
2 stop bits is a common specification at high speeds, and it is not just to gain more handling time; if you want to pack sustained duplex data, the extra stop bit buys some baud-skew tolerance.
And when you say that cores and locks being in short supply "is not the case for library routines" that's because the locus for management isn't the library routine, it's the application. The library routine has no concept of the global resource allocation situation and it shouldn't! Attempts to implement dynamic allocation of sparse resources outside of the locus of management is a path to ruin and destruction. The result is that when one architects their application, they have to dig through the implementation of each library to determine its resource requirements because those aren't made explicit by the API. It's precisely backwards.
It's exactly like having to change a constant in the OBEX quad serial driver to increase the buffer size for port 2's receiver. The goal of modular, reusable software is to parameterize it, not require that one root around within it to understand how to make it work. But you're arguing for the opposite here.
My project isn't really that special, and it's not my project that ought to drive decisions such as these. And, especially in the area of education, teaching best practices is always appropriate. And the cognew/locknew approach simply isn't a best practice, it's a kludge. It requires one to understand the complexities of allocation in a baroque fashion. And it allows one to spend a lot of time creating implementations that cannot possibly work and not learn of that until runtime. There are always things that aren't discovered until runtime, but the goal of architecture and design (and education) is to preempt those things a priori, not encourage them.
You're right, I ripped out the code to eliminate the unused threads when I ripped out the code that implemented hardware flow control and data inversion. For me, the added code size isn't worth the increase in baud rate and it adds to the non-causality of the whole mess that when one starts using a new port, the old ones might now work differently or stop working.
As you may have noticed, I avoid non-causality like the plague.
Plus, in my make believe world of standard baud rates, it only gets one from 115200 to 230400 when running with one or two ports.
I'm curious what your own experience is with port count and baud rates - how do you use the thing in your application? Is faster baud rate desirable or is your configuration static in that regard? I ask because after working through a two core single port implementation, I'm thinking that a two core quad port should see substantial performance improvement. Do you have spare cores and want faster quad port performance?
When using 4 ports I might have, say, debug to terminal screen (enabled only when USB connected), GPS at 9600 baud, cellular modem at 9600 or higher, XBee at 9600. The system may include external instruments that deliver data via uart, or an RS485 or SPI connection and may involve an open or inverted mode. Optimization for low power may be crucial and may depend on baud rate along with the other concurrent tasks. I use my evolved version of FDS4port almost exclusively, part of my template, with ports enabled as needed and special versions with expanded buffers occasionally added to projects that need them. It is pretty humdrum stuff for a uart, no fancy high speed protocols. I do like your run-time configurable buffers. If I get around to it, I may adopt that too in some backward-compatible manner.
From what I've seen at higher baud rates, minimizing latency from start bit detection to cnt capture had the highest impact on correctness as the baud rate went up. I was surprised by that at first, but it's obvious now. The second biggest impact was rounding correctly in the calculation of ticks-per-bit - a classic case of garbage in, garbage out.
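For anyone reproducing that rounding fix, the idea is just this (a sketch, not the driver's exact code):

PUB Ticks_Per_Bit(baud) : ticks
  ' Round to the nearest tick instead of truncating; at 80 MHz and
  ' 921600 baud this yields 87 rather than 86, which matters at the edge.
  ticks := (clkfreq + (baud >> 1)) / baud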
But when I think of eliminating jitter, as usual, I consider the absurd end point. It would require that the code path length between every pair of jmpret instructions be equal, meaning add a few more jmprets in long paths and add a few nops to short paths. It's not possible to fully normalize because of variable length hub execution times. But I've convinced myself that path length normalization won't really "fix" anything. I'd love to hear a counter example.
What I do know to be the case is that there's tremendous variability in best case/worst case code path lengths and those get multiplied by ~8. And the result is that it's entirely possible to see something work a lot between failures because the worst cases line up very infrequently. Things might not collide for weeks of testing - or ever.
I found the two optimizations I mentioned, above, as the result of running the driver at high baud rates with four transmitters on one board shoveling bytes to four receivers on another and the second board shoveling data back to the first - all running flat out at the highest possible rates with random burst lengths. In that setup, things were either reliable or not - it was pretty damned binary. That also made me believe that the jitter concerns aren't the problem, they're just a symptom of being oversubscribed in the worst case, but not seeing the worst case happen very often.
That whole adventure is what's caused me to conclude that at $3 a pop, real uarts are a pretty sound investment! Less code, fewer IOs, lower latency and higher throughput; win, win, win, win.
Give the jitter thing a think and let me know where you come out on it.
http://forums.parallax.com/showthread.php?129776-Anybody-aware-of-high-accuracy-(0.7-or-less)-serial-full-duplex-driver
It has a lot to say about jitter. There are my own measurements, PhiPi's high accuracy PBJ driver and lonesock's high speed driver. Clever coding and insight.
The real jitterbugs are the hub accesses, which are randomized by the asynchronous nature of the beast. No amount of padding could compensate. The longest segments in terms of number of instructions have two hub accesses each. Once in a blue moon, or unluckily maybe once a fortnight, all of the tasks will hit that slowest point at the same time. That will not be a problem at low baud rates, but it will be an error at some limiting high rate. The better accuracy is achieved by syncing to cnt or to phsa, but that is limited to relatively lower speeds.
Working at 9600 baud there is no problem, unless I want to run at clkfreq=5MHz for example to achieve lower power consumption.
Which brings me back to the question of dynamically removing those port drivers that aren't in use from the linked list of active threads - the jmpret list.
Since the transmitter's check for characters in the output buffers isn't anywhere near the worst case path, I don't see the benefit of adding code to trim those code paths. While it might allow a 2 port configuration to work with some combination of baud rates, that configuration would fail if a 3rd port became active at any baud rate, because that 3rd port would occasionally introduce substantially more latency than the queue checking.
And so, to me, it seems the question at hand is what the implementation goal is. Is the goal to build a driver that provides maximum throughput in minimal configurations, or is it to provide maximum throughput in the maximum configuration? Should it be optimized for best throughput when driving one port or four ports?
Since there's a much faster one port driver, I'd say the goal should be the latter. And if that's the goal, adding the code to prune the threading list doesn't help, it just increases code size.
No?