Asynchronous programming
Seairth
Posts: 2,474
in Propeller 2
As I've been getting back up to speed on JavaScript, I have also been getting up to speed on the Node.js environment. This has brought me face-to-face with single-threaded asynchronous programming. This, in turn, has made me go back and reevaluate how I was doing asynchronous programming in C# (the language I'm currently doing my daily work in), as well as reexamine asynchronous programming in a few other languages that I use less often. The concept isn't new (e.g. the ubiquitous Windows GUI thread uses this model), but I think it had become less known and less used due to the availability of multi-threading (and the significant coverage that multi-threading has gotten over the years). There is very clearly a resurgence in the popularity and adoption of single-threaded asynchronous programming and I think it's something that every programmer should be comfortable with (even before multi-threading, imho).
The reason I bring this up is that I've come to the realization that this could be an "answer" for doing effective multitasking on the P2. Now, I acknowledge that not everyone thinks this is a question that's actually being asked and that, with 16 cogs available, it's just not a concern. Except, I think it is a concern. "Asynchronous" does not mean "parallel". Using additional cogs to perform asynchronous tasks has all sorts of issues that you have to take into account (much like using threads does on a modern multi-core processor). One of the advantages of using a single thread (or, in this case, cog) is that every asynchronous task is guaranteed to run serially relative to each other, significantly reducing concurrency issues. Further, when working in an I/O bound environment (which I think the P2 clearly qualifies as), appropriate use of the asynchronous model allows for more effective use of the cog (e.g. still doing other meaningful work while waiting for I/O to complete). And, if your language of choice supports the asynchronous model (e.g. "await"), you can still write your code in a mostly sequential format (the way we tend to think about our code anyhow). I'm not saying that a single-cog asynchronous model should be used instead of multiple cogs, but that each approach has its place. As it stands now, the Propeller only really encourages one approach (use multiple cogs). This, in turn, reinforces the mindset of "just use another cog", which ends up translating to "just use another thread" on other platforms.
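To make that concrete, here is a minimal Node.js sketch (readSensor and its delay are made up, standing in for some slow I/O operation): the code reads top to bottom, yet the single thread is free to service other events while the "I/O" is pending.

    const { setTimeout: delay } = require('node:timers/promises');

    // Hypothetical slow I/O operation; the delay stands in for a bus
    // transfer, serial read, etc.
    async function readSensor() {
      await delay(100);
      return 42;
    }

    async function main() {
      // Reads sequentially, but the thread is not blocked here; other
      // queued events can run while the "I/O" is outstanding.
      const value = await readSensor();
      console.log('sensor:', value);
    }

    main();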
While I hope to see a flavor of JS running on the P2 that encourages the asynchronous model, what I'd really like to see is support for asynchronous programming in Spin. I realize this is not a simple undertaking and I am not asking for it to be added now. This post is more about getting others to also think about the subject. Not only do I think this is important from an educational perspective, I think we will start to see more people coming to the P2 who are already comfortable with that model (and might even expect it). As a result, I think it's a worthwhile subject to keep in mind over the next few years.
Comments
Async coding is useful. Apache falls down after a few hundred clients connect because the thread overhead starts to overwhelm the system, whereas nginx, using a single main thread, pulls requests from a queue and services them as fast as possible. It will slow down under load, but it can handle an order of magnitude more clients.
But it requires more than that. Though I am sure there are other implementations, it seems the most common approach is to have a message pump. In a case like Node.js, the message pump is baked into the infrastructure. In the case of C#, C++, and so on, you are responsible for setting up the message pump. In an environment like Spin, I'm not quite sure what the best approach is. Because it's interpreted, I think it would be more efficient if the message pump was integrated (written in PASM). On the other hand, not all applications require the asynchronous model and shouldn't be subjected to the additional overhead of an unused message pump. On the third hand, writing a message pump that you must manually run and maintain is bound to be slower than having it integrated.
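To illustrate, a hand-rolled message pump can be as small as this rough JavaScript sketch (not any particular runtime's implementation):

    // A queue of pending events and a pump that drains it. Each handler
    // runs to completion, serially, before the next one starts.
    const queue = [];

    function post(message, handler) {
      queue.push({ message, handler });
    }

    function pump() {
      while (queue.length > 0) {
        const { message, handler } = queue.shift();
        handler(message);
      }
    }

    post('hello', (m) => console.log('got', m));
    post('world', (m) => console.log('got', m));
    pump();

A real pump would wait for new events instead of exiting when the queue is empty, which is exactly the part that benefits from being integrated (and written in PASM) rather than hand-run.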
One possible scheme:
1. Have one cog that cycles through a dynamic event queue, looking for event triggers.
2. When a trigger occurs, spawn a new cog to handle it.
3. When the handler completes, stop the cog, freeing it for other handlers.
-Phil
The current design, or the way the majority of programmers "think", is that each thread is like a car going down the road, and we have 8 or more lanes to choose from in which other simultaneous threads can run. This is still, for the most part, two-dimensional thinking; what programmers fail to realize is that the car going down the road is really a 40-passenger bus. So the way I see it, most people tend to program as if they are driving an empty bus. A waste of gas and programmatically inefficient.
With the async model, you don't start a separate cog to run the handler. Otherwise, the "event queue" cog will start executing tasks in parallel or have to block until the "handler" cog completes. In the first case, parallel execution defeats the point of the async model. In the second case, the blocked "event queue" cog is a wasted resource.
In order for async to work with a single cog, the important thing is that the handler(s) never block. Instead, they should always set up a callback (another handler) to execute when the blocking operation (usually I/O) is complete. Ideally, all blocking code would be handled by drivers on separate cogs that are designed to work within the async model.
But that alone is not enough. Typically, to be able to write the handler code in any semblance of a sequential pattern, the language needs to provide some support. For instance, in Node.js, it is common to use continuation-style callbacks and closures. For JS in general, there are all of those Promise-based APIs. In newer iterations of JavaScript/C#/etc., there is even syntax sugar like "await" to hide much of the dirty detail. The point is, to make async programming "user friendly", the language needs to be somewhat involved.
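For example, here is the same read-then-read sequence in Node.js, first continuation-style and then with the sugar (the file names a.txt and b.txt are made up; both forms do the same thing):

    const fs = require('node:fs');

    // Continuation style: each step passes a callback for the next.
    fs.readFile('a.txt', 'utf8', (err, a) => {
      if (err) throw err;
      fs.readFile('b.txt', 'utf8', (err, b) => {
        if (err) throw err;
        console.log(a + b);
      });
    });

    // The same logic with "await" hiding the callback plumbing.
    async function concat() {
      const a = await fs.promises.readFile('a.txt', 'utf8');
      const b = await fs.promises.readFile('b.txt', 'utf8');
      console.log(a + b);
    }
    concat();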
Closures, though, can be expensive on a resource-limited device (note: it can be done, obviously). And I'm not sure whether closures are strictly necessary to enable effective async programming. Mostly, closures are concerned with capturing relevant stack data. I suspect there is a more explicit way to "capture" that a language could support without having to add full closure support. (Mind you, closures have all sorts of other uses that still make them very worthwhile.) In a language like Spin, I think it actually makes more sense to make the capturing mechanism explicit, as it better teaches what other languages are implicitly doing on your behalf.
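Roughly what I have in mind, sketched in JavaScript (onPinChange and firePin are hypothetical stand-ins for an event source): the first handler captures state implicitly in a closure, while the second is handed an explicit context object.

    // Hypothetical event registry, stubbed so the sketch is self-contained.
    const registry = [];
    function onPinChange(pin, handler, ctx) {
      registry.push({ pin, handler, ctx });
    }
    function firePin(pin) {
      for (const r of registry) if (r.pin === pin) r.handler(r.ctx);
    }

    // Closure version: the handler implicitly captures `pin` and `count`.
    function watchClosure(pin) {
      let count = 0;
      onPinChange(pin, () => console.log(`pin ${pin}: ${++count}`));
    }

    // Explicit version: the "captured" state is a visible context object
    // that the event source hands back to the handler.
    function handlePin(ctx) {
      console.log(`pin ${ctx.pin}: ${++ctx.count}`);
    }
    function watchExplicit(pin) {
      onPinChange(pin, handlePin, { pin, count: 0 });
    }

    watchClosure(3);
    watchExplicit(4);
    firePin(3);
    firePin(4);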
Nice analogy!
The way I see it, with our modern multi-core, super-fast processors and memory hierarchies, it's like this:
I have a 4, 8, 16 or more lane highway down which I can run 4, 8, or 16 Ferraris at the same time. Not only that, I can fill up the lanes with hundreds of Ferraris all following each other. This is fine until my Ferraris want to change lanes. Then everything crashes and burns (race conditions) or they do it in an orderly manner and everything slows down (Amdahl's Law). I won't use 40-seat buses because then I outgrow my cache lines/COG space and everything slows down dramatically.
And which begat which? I suspect those rather complicated architectures came about because of the heavy reliance on threading by most of the popular programming languages to achieve asynchronous programming. CPU manufacturers then started adding hyper-threading and multiple cores precisely because there was so much multi-threaded software (whether in the form of process, pthreads, etc). That, in turn, reinforced the idea that the future of programming was in writing increasingly parallel code. It was a feedback loop that led most everyone in one direction: parallel processing equals faster code. So, of course you want to run your Ferraris on the 16 lanes of the i7 highway. The architecture designers want you to do exactly that.
As I said earlier, though, there is definitely a right time and place to use parallel processing. The Propeller architecture is predicated on using parallel where parallel is appropriate. And that's great. It's one of the things I genuinely love about the architecture. But parallel is not the same as asynchronous. I admit that I have regularly conflated the two over the years, as I suspect many others have. This is particularly likely to happen if your programming language only encourages one way of doing things.
We are forced into adopting multiple cores by physics.
For decades we have had the luxury of processors getting smaller and faster by leaps and bounds. They grew from 8 to 16 to 32 to 64 bits. They sprouted pipelines and caches and all kinds of go-faster tricks that eat tons of transistors. And of course their clock speed went up dramatically over that time. In that situation it's mostly not worth the bother to develop a multi-core processor, because you could get the performance boost you crave by other means using a single core. Besides, a single core is simpler to reason about and program.
Recently though we have hit a limit on the maximum clock speed we can achieve. And I suspect we have run out of ideas for increasing single core performance at the same clock speed.
The result is that if you need more processing performance then you have to adopt more cores. Which is great if you have a lot of independent tasks to do. Or you can divide your problem into a lot of mostly independent tasks.
Yeah. To my mind the parallel nature of the Propeller was not about raw compute performance. If you wanted that, an ARM or something would be better. Rather, the Prop is about timing integrity, and maintaining it when you have multiple things to do.

Also yeah. To my mind the "asynchronous programming model" is not about threads or multiple cores or interrupts etc. It's a name for a style of programming at a high, abstract level. In the asynchronous model you don't have multiple lines of your source code potentially executing at the same time (on different cores or different threads); you don't have execution of your source suddenly suspended in the middle of a sequence to go and do something else in your code; you don't have a loop running around polling things. Rather, you simply set up a chunk of code that will get run when some event happens, some other chunk that will run on some other event, and so on. Each of those chunks starts at the top and runs to the end before anything else can run. Those chunks had better be short and sweet; a while(1) loop in there will block the entire program.
Of course under the hood of the programming model there may well be threads and interrupts going on, there may well be multiple cores in use. But you don't see any of that in the abstraction of your asynchronous model source code.
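A two-line Node.js illustration of that run-to-completion rule (the commented-out line is the while(1) hazard described above):

    // Each handler runs to completion before anything else gets a turn.
    setInterval(() => console.log('tick'), 100);

    // Uncomment this and 'tick' never prints again: one spinning handler
    // starves every other event on the single thread.
    // setTimeout(() => { while (true) { } }, 50);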
-Phil
In general bits of your code do interact with each other and share data. Else why are they in the same program? If those bits of code are running at random times and potentially at the same time, whether interleaved by threading/interrupts or actually on separate cores, then they will trip over each other with race conditions.
The way to prevent that in the threaded model is with mutual exclusion mechanisms: mutexes, disabling interrupts, etc. This has a performance hit, as now contexts have to be swapped and code is halted during shared access. It also has a complexity hit, as now your program is a lot harder to reason about.
The asynchronous programming model takes a different approach. It simply says: "this piece of code will run from top to bottom; when it is done, something else can be run if need be; nothing else will run until it is done."
This makes life much easier for the programmer, for example when building GUIs that must respond to keyboard, mouse and lots of other events.
It can also be a huge performance win. As noted above, the nginx web server uses the async model rather than spawning lots of threads or processes as Apache does. nginx can far outperform Apache when handling millions of concurrent connections because it has far less context switching going on and uses a lot less memory.
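For a feel of the shape of that, here is about the smallest single-threaded server one can write in Node.js (the port number is arbitrary). No thread or process per client; every connection is an event handled on one thread:

    const http = require('node:http');

    // One thread serves every connection; each request handler runs to
    // completion. Slow I/O inside it would be awaited, never blocked on.
    http.createServer((req, res) => {
      res.end('hello\n');
    }).listen(8080);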
Again, I'm not arguing against doing asynchronous things in parallel. What I'm arguing is that sometimes you don't want to do asynchronous things in parallel (e.g. overhead of starting a cog, dealing with concurrency issues, etc) and that a language which makes non-parallel asynchronous programming both possible and easy is a good thing.
To achieve deterministic timing over the entire process, the LOW-priority interrupt is set to 1/4 of the bit period of the baud rate I need to communicate externally at: in this case 19.53kBaud, so a bit period of about 51.2uS and an interrupt period of 12.8uS.
Below is just a basic design flow that I usually set up to start a new project with. Code is generally quick and to the point. In a current project I am running a precision X,Y,Z cutting machine with a 4D Systems 7" touch screen, which includes 3 stepper motors (X, Y, Z), a spindle motor with 0V-10V speed control, a PWM H-bridge, an X,Y,Z joystick to control positioning, a handful of proximity sensors and relay I/Os, the touch screen, and a few other items, all orchestrated from a single-threaded processor approach.
My point is that 1 COG of the Propeller could indeed handle all of the above, but with the mindset that "we" have been programming the Propeller with to date, most of us would be on about our 6th COG by now, with that feeling in the back of our minds that we were running out of room. The infamous words from Back to the Future, "You're not thinking fourth dimensionally", apply to how most of us program a multi-core processor.
The way I see it is that the Prop and Spin have been designed with simplicity in mind. Simplicity for the novice programmer, or even the experienced programmer who wants to get something up and running "now". Just grab a UART object, PWM object, or whatever you need from OBEX. Write your top-level object and away you go. No time wasted head scratching, wondering if I actually have enough time for my COG to run this code I have borrowed from the net, or how to integrate it into my interrupt system, or what priorities it should have, etc., etc.
Sure, that can be very wasteful of COGs. Whole 32-bit CPUs can be underutilized on some simple task provided by some object. A lot of the time, who cares? It works. It was easy. Next job...
Of course if your application is getting too big to fit then it's time to rethink the approach.
In fact I find the idea of just running in one execution unit quite false, given that we are at the end of faster processors and going multicore is, at the moment, the only available way forward. Apache (or IIS) on large multicore systems can be scaled up as long as you can afford it.
JasonDorie has a point with Apache and nginx, but at some point the single execution line of nginx is saturated and you can't scale up with better/different HW.
Simulating parallel execution with threads is like being half pregnant. And, as Jason and Heater. remarked, full of overhead.
Real parallel execution needs another way of thinking than just throwing a thread at it. I guess it will need some different approach in the language used. Occam allows you to specify serial or parallel execution, but that doesn't go far enough to relieve the programmer of the burden of thinking in parallel while planning the project.
Most multi-core systems in use run under some OS that tries to hide the cores from the programs, since the programs are not written to use the cores.
But on the P1 and P2 this is conceptually different: the program can, and has to, take care by itself of how and whether to use multiple cores.
So especially on a P1/P2 the idea of running in just one execution unit is - hmm - misplaced?
Sure, I came to like JavaScript after a while, and I am sure that some small JavaScript engine will exist on the P2 as soon as Heater. gets a real P2, but as a language it is not really a good fit for the P2.
Maybe the languages used on FPGAs can give some hints on how a language can describe parallel things in some sane manner.
Enjoy!
Mike
Oh goody, lots of juicy morsels to disagree with there, if I may... It can take some readjustment to get one's head around closures, first-class functions, event handling, etc., especially coming from traditional languages that don't have such things. But to be clear, none of this is anything like GOTO. That comparison is very misleading.
Yeah, "nothing new there". Event driven programming dates back to at least SmallTalk in the 1970's. JavaScript was inspired by Simula from the 1960's and Self in the 1980's. What goes around comes around. Yes but...you are going to run out of steam on one core with Apache long before you do with nginx. All that context switching of Apache is part of the problem. Having to allocate gobs of memory per thread is another. When adding more cores nginx will use them, and use them efficiently with it's event driven model on each core https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/
Eventually, if you want to scale more, you need a whole other box, or perhaps hundreds of them. Life is much cheaper if the thing you are trying to scale is efficient to start with. For this reason nginx is used by huge operations such as Netflix, Uber, etc.

Sure you can. See above. For both Apache and nginx, at some point you need more boxes. You will just need them sooner with Apache.

Hmm... wait a minute. Occam is inspired by the concepts of Communicating Sequential Processes (CSP), as laid down by Tony Hoare in 1977. CSP is a rigorous way of thinking about parallel processes.

Yes. However, the underlying mechanics, whether operating system or hardware, should be abstracted away in the high-level languages we use.

How so? As Beau points out above, if a lot of functionality can be squeezed into one core, then why waste its capacity by not doing so? This logic is the same for the Prop as it is for a multi-core Intel server.

I'll see what I can do.
I'm not suggesting a language like JS be all there is. But I think it can make a good, simple "glue" for all that stuff happening on multiple cores below it. I also wondered about that. If you look at Verilog or VHDL you find they are event driven. Only, for every event handler, they instantiate a whole new pile of hardware. Which you can keep doing until you run out of gates.
Hmmm...damn...I did not disagree with so much after all.
All the best.
+1, particularly on the change of mindset needed to take full advantage of multiple cores, and on HLL support for the program and programmer to deal with parallel processing. Better, imo, to have access to that functionality than to hide it and have the compiler try to deal with it.
On the one hand we have:
"OCCAM allows to specify serial or parallel execution, but that goes not far enough to release the burden of a programmer to think in parallel while planning the project."
On the other hand we have:
"P1 and P2 this is conceptual different, the program can and has to take care by itself how and if to use multiple cores."
The first statement wants to release the burden of parallel thinking, the second wants the programmer to think about it.
Which would you like?
The description of Occam is just wrong anyway. Occam demands that the programmer think about what functionality runs in parallel and what does not. And how they communicate. And where they run, perhaps on different chips altogether.
It is only a waste when:
1. there is a need to scale (a great many use cases have no such need), or
2. there is a potential to run it on cost-reduced hardware, say a 2-COG Prop;
otherwise, there are strong arguments for rapid development and robust operation.
Scope really matters, and I find discussions like these similar to those involving language portability. While it's nice to have, steering all the tools and/or people that way carries a lot of unnecessary costs.
The thing here is whether those investments will pay off. They often don't.
Millions of 6502 micros were shipped last year. They are all over the place providing low cost, robust basic control, or are in things like toys. Not portable, won't need to scale, etc...
I am not suggesting P2, and its tools, ignore either of these. The work should be done, just not at the expense of more direct, rapid, lean means and methods.
Frankly, those have big payoffs too, just in a different direction.
20 years ago, I made very similar arguments related to parametric, associative CAD assemblies. People saw it was possible to link everything together. Move a hole, and the whole airplane can adjust, if needed, to resolve all the design intent implications of that move.
Today, it's interesting. We topped out on single-threaded execution, and it turns out humans also top out in their ability to understand and, more importantly, troubleshoot big constructs like that. Worse, when the core intent needs to change, the work required to refactor is an order more difficult than it would otherwise be.
Where the investments were made anyway, the payoff never did happen in the vast majority of cases. The reasons were lack of changes, and/or changes not accounted for in the design intent, as well as the management of all that complexity diluting the value of the initial work.
Hardware limits took a big bite too. At that time, people expected 5GHz and faster compute devices would become available. We know today that didn't happen. Single-threaded performance does continue to increase, but it's glacial.
Back then, I took a lot of heat for teaching a mixed model: use the complexity in some places where it's actually needed, say for things being simulated and iterated to optimize performance, cost, weight, or where there are known changes to process, say variations and catalog items.
Avoid it otherwise, which means adding human checks and/or more software checks to help highlight change implications so people can identify and remedy them.
Today, almost all large-scale product design does not use a full model. We simply cannot understand and/or compute something like that for, say, an aeroplane.
Interestingly, there was a very happy artifact of doing it that way, and that is that models which do not have so many linear dependencies are much easier to process in multi-core fashion. This applies to assemblies as well as parts.
The other artifact has to do with changes. Turns out, having the software analyze a change, find its scope, and present a user with all of that plus simple tools to modify the geometry model directly, not by modifying the history that generated it, is also something that can be handled multi-core and/or is fast enough to very easily beat any sort of history/intent-based change methods.
These CAD problems look a lot like software problems. There are conditionals, math, data, dependencies, a hierarchy of related things, structures, and interface points. The history, input parameters, and math all equate to source code. The CAD tool looks like a compiler, and the product, the geometry representation, looks like a binary. Assembly tools can appear like linkers and loaders, working with binaries, and/or source where that can all actually be computed.
What proved most effective over time was to compartmentalize the model, limit complexity overall, invest in interface points, and limit dynamic parameters.
That combination, at just about any scale, known catalog part type use cases aside, is the most flexible and efficient overall, despite the fact that it does take more work, more often than other more tightly integrated means and methods require.
The root factors are hardware compute limits and, it used to be, storage limits (though we are in good shape there now), along with human limits when it comes to understanding just what is likely to change and why.
I feel these software problems will map to a similar space, with similar dynamics.
Best case is likely to be a few sets of tools. Some simple, lean, and direct, like Spin with inline PASM, or PASM alone. Others traditional and known to be used and portable-ish, like C. And maybe a specialized set as discussed here. Tachyon is specialized, BTW, and really nails its use cases.
There won't really be a one size fits all.
Reuse, in terms of portability will be high when the core isn't written for the hardware. It will be near useless where it is written to the hardware.
Reuse, in terms of being able to run binaries in multiple tool / contexts could be very high, regardless of how portable they are at a source code level.
IMHO, the fact that the P2 can offer both a hardware environment where compartmentalization is free or at least easy to do, and one where parallel processing works just as well, is a big advantage, given we can promote reuse in both portable-source and binary-blob fashion. Often, assembly code can stand in where I mention binary, depending. Assemblers vary, and they should (syntax, workflow, lots of things in play here).
Filters and parsers can smooth all that out, much like we found real-time, in-the-moment software analysis on big models worked very well in CAD land.
Both are needed, and binaries will see more use, given the specialized hardware and the need to apply it to various problems.
An OBEX rooted in this reuse idea, and tools to facilitate making it happen are where the real returns are, and will multiply specific tool chains and their operating philosophy accordingly.
People should be able to attack their problems, applying the right high level tools, supporting those with useful bits written in a variety of ways, IMHO.
In CAD land, the outcome of this kind of thinking ended up looking nothing like the original visions presented early on.
Northrop Grumman, for example, actually uses a neutral, parameter/history-free model representation at the top, or product, level. It's all positioned within a hierarchy of coordinate systems, many often representing key interface points. The many different components and subsystems are generated in a variety of ways, with a diverse, sometimes custom, purpose-built array of tools, ranging from the various familiar CAD packages to custom software engineered to design based on input data of various kinds.
It's all nicely compartmentalized, efficient and reuse is high, despite initial create and maintain tool diversity.
FWIW, which is exactly what you paid.
The holy grail in software has been high-level languages and such that abstract away all that messy detail involved in building a program whilst not introducing overheads in performance and/or size. C++, for example, has done a pretty good job at that.
But guess what?
When programming in such high level languages you still need to be aware of the reality of the architecture you are building for. The classic case today is use of cache memory. Traverse a large multi-dimensional array in the wrong order and you can increase your run time by a factor of 10 or 100 or more. Use a linked list of objects, spread throughout dynamically allocated RAM and you are doomed with cache misses. The language and compilers cannot help you with these realities.
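That effect is visible even from JavaScript once the data is one contiguous typed array. A rough sketch (sizes arbitrary; the exact ratio depends on the machine):

    const N = 2048;
    const a = new Float64Array(N * N);

    console.time('row-major (sequential memory walk)');
    for (let i = 0; i < N; i++)
      for (let j = 0; j < N; j++)
        a[i * N + j] += 1;
    console.timeEnd('row-major (sequential memory walk)');

    console.time('column-major (N*8 byte stride, cache hostile)');
    for (let j = 0; j < N; j++)
      for (let i = 0; i < N; i++)
        a[i * N + j] += 1;
    console.timeEnd('column-major (N*8 byte stride, cache hostile)');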
Or there was CORBA. Brilliant: with CORBA you can make a function call and CORBA will turn that into a request over the network to some service and return the result. Well, you had better be aware that what looks like a normal, harmless function call is actually a network transaction that will take ages and perhaps fail in interesting ways. Perhaps it's better not to try to make a network transaction look like just another simple function.
When it comes to parallel processing, perhaps it's better not to abstract things away too much. It should be clear whether this bit of functionality I have written will run in my thread, on another thread, on another core, or on a totally different computer.
Agreeing with "Real parallel execution needs another way of thinking as throw a thread at it." and the other hand. A programmer needs the mindset and the tools to deal with multiple cores and true parallel execution. What we have now is a kludge of patches to languages and compilers that were designed for single core systems.
It is said that all of programming can be boiled down to 3 things after you have basic atomic statements:
1) Sequence - Do this, then that, then the other....
2) Selection - Do this or that depending on some condition.
3) Iteration - Do this until some condition is true.
Arguably what is missing is:
4) Simultaneous - Do this and that at the same time.
The assumption being you do actually have processors around to do this and that in parallel to get more performance.
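In JavaScript terms, the first and the proposed fourth look something like the sketch below. Note that on a single thread Promise.all only buys concurrency; a true "par" could put the work on separate processors:

    const { setTimeout: delay } = require('node:timers/promises');

    async function task(name) { await delay(100); return name; }

    async function demo() {
      // Sequence: one after the other, roughly 200 ms.
      await task('a');
      await task('b');

      // "Simultaneous" intent: both in flight at once, roughly 100 ms.
      await Promise.all([task('a'), task('b')]);
    }
    demo();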
Decades ago we had languages like Occam which had a "seq" construct. You had to explicitly state that you wanted statements executed one after the other. And a "par" statement which would allow for parallel execution of statements.
This puts all of the organization of a program's parallelism into the hands of the programmer. For whatever reason this did not catch on much. A shame, because it's a great way to program multi-core systems like the Propeller or the XMOS devices (those Occam concepts are recreated in the XC language for XMOS).
Well, except for Google's Go language, which continues the CSP tradition with its channels and "goroutines".
Then there is another school of thought. For example, when I want to add two large arrays I don't really want to write a loop to do it. I don't want to have to write anything about spawning threads to get the job done faster on multiple processors. No, I just want to write "arrayA + arrayB" and have the compiler figure out the best way to parallelize it to get the job done as quickly as possible.
This kind of auto-parallelization of programs seems to be hard and still elusive.
Things like OpenMP go some way toward helping auto-parallelize things. But then it turns out that writing non-trivial algorithms in parallel-friendly ways is not so easy.
kwinn, the XMOS devices are all about parallel execution. David May, a founder of XMOS, once said that the best XMOS programmers he had found were all hardware engineers. Doing things in parallel was natural for them. The software guys had their minds blinkered by their software education.
Why not use Verilog (or VHDL) as a multiprocessor language?
Verilog allows for parallel processing.
Verilog is a bit more low-level. More to my taste. But it's like working in assembler.
Speaking of which. Ada has constructs for parallel processing. https://en.wikibooks.org/wiki/Ada_Programming/Tasking Nobody liked it.