Parallelism?
msrobots
Posts: 3,709
I am reading through the current P2/FPGA documentation and getting more and more excited about the Beast.
On the P1 we have:
- 8 cogs running in parallel. Without tricks like LMM, each cog has its own memory for code and shared access to the hub.
-- each cog has 2 counters running in parallel. And waitvid is (sort of) parallel too.
that's about it.
On the P2 we have:
- 8 cogs running in parallel. Now able to run code out of cog AND hub ram.
-- each cog can run 4 tasks in parallel.
-- each cog has 2 counters and video running in parallel mode.
so far almost as above, but now things get funny:
-- the WIDE stuff - RDBYTEC/RDWORDC/RDLONGC/RDWIDEC - will speed up HUB access quite a bit and (sort of) runs in parallel
-- the XFR stuff - transfers pins to memory or vice versa, running in parallel once activated. wow.
-- big mul, big div, square root, cordic - running in parallel.
-- SERA/SERB - really fast serial transfer between cogs/chips; the bit-banging runs in parallel.
wow.
and it goes on:
4 DACs per cog - that's 32 DACs at the same time (more if multiplexed?).
all pins have an ADC mode? (delta-sigma ADC)? pincfg? any docs to read?
what the hell is Goertzel?
simply wow.
Chip and all you good people at Parallax - what a wonderful product you are producing here.
Now I am getting interested in that new Cyclone Board Parallax is developing.
Enjoy!
Mike
Comments
Parallax could end up making more money from selling FPGA boards than they ever would from selling P2 boards!
Ross.
I am using the P1 in Forth to stop and start cogs on a task demand basis. That seems quite powerful to me as you can have a couple of cogs just doing communications services, maybe a couple overseeing status, and the other four morphing through tasks as required.
The main attractions to me of P2 are the built-in ADC/DACs, and the additional Hubram space. ... a more powerful, more compact Forth platform.
Forth is not a religion. It is a cult.
Some consider it a disorder!!
Back to Mike's comment: yes, "the Beast" will be amazing and the FPGA is just fun to play with!
Well if Forth is a cult.... it is a cult for Thinkers.
And C is another cult, but it is for porters that prefer to borrow code.
Assembler is the one true religion.
Assembler is the old testament of the software religious world. The basis on which all following religions are built. Ignored by most. :)
Forth is such a radically 4G language that the interpreters are able to reproduce and mutate without human intervention.
Forth usually implements a software round-robin, where each task gets one pass and then surrenders execution to the next task. For the most part each task appears to have full control, until we get really tight on time-critical functions.
The P2's 8 cogs each have 4 threads, which works out to 32 parallel execution opportunities. PF6 spreads the software multitasker's round-robin over all 32 execution opportunities. Now we can have way more than 32 tasks, and all will look like they have full control until we start getting really tight on timing. Even if we lose a thread that gets stuck doing something time-consuming, we only reduce the capacity by 1/32.
So it behaves as if it has time slicing and load balancing, which it doesn't, but it turns out that it does the time slicing and load balancing really well. For free, without any extra overhead (aside from the multitasker round-robin itself). Forth is fun.
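Not PF6 itself, but a minimal sketch of that round-robin idea in TypeScript-flavoured pseudo code (the task names and the use of generators as "surrender points" are my own illustration):

```typescript
// Cooperative round-robin: each task is a generator that yields wherever
// it surrenders execution (end of a loop pass, before slow I/O, etc.).
type Task = Generator<void>;

function* blinker(): Task {
  for (;;) {
    // ...toggle a pin here...
    yield; // surrender at the end of each loop pass
  }
}

function* logger(): Task {
  for (;;) {
    // ...buffer a line, start a (slow) write...
    yield; // surrender before waiting on I/O
  }
}

// The multitasker: one pass per task, round and round.
// Limited to a fixed number of sweeps here just so the sketch terminates.
function roundRobin(tasks: Task[], sweeps: number): void {
  for (let i = 0; i < sweeps; i++) {
    for (const t of tasks) {
      t.next(); // runs the task up to its next yield
    }
  }
}

roundRobin([blinker(), logger()], 100);
```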
I haven't looked at the guts yet, but it's preemptive in the sense that the inner interpreter loop is aware of time slots and gives each execution thread a time slot in round-robin fashion. Time-slot expiration is the only preemptive event, as far as I recall.
As for true parallelism, it can't go beyond the 8 provided by the cores (plus whatever hardware parallelism the Px offers).
Mike's original comment is interesting in that it enumerates all the things that may be going on within the P2 in a truly productive, parallel manner... truly cool!!
I'm not sure if I have the terminology right, but here goes.
Propforth does not do preemption; each task executes until completion, and it's up to us not to wander off into the weeds (cooperative multitasking: release control at the end of a loop iteration, or when asking for I/O or slow resources, etc., as per normal Forth technique).
If a thread gets stuck and never comes back, it's no big deal, because in THIS scheme:
-->> each task always executes on a different thread <<--- (unless there are exactly 32 tasks that execute in exactly the same time, which can't normally happen in a real application).
All the threads are in an execution pool. The next available thread takes the next task in the queue. There is always another free thread if one thread is occupied.
It has IMPLICIT load balancing, in that it occurs passively, simply as a result of the implementation.
This is why I think it's kind of cool: if we follow our regular rules properly (surrender execution at the logical times - end of loop, at I/O, etc.) we passively get a result that LOOKS like we have fancy, expensive load balancing, but we get it for free. Or at least for no additional cost aside from following our regular guidelines as usual.
re: rick: "As for true parallelism, it can't go beyond the 8 provided by the cores" - On P1, this is 8 cores. On P2, this is 32 threads. PF6 will have 32 parallel threads executing at any given time, if we understand the 32-threads idea correctly. Although I did not ask about the case when there is only one task, I imagine the other threads would sleep.
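A rough sketch of that pool scheme (TypeScript-flavoured pseudo code again; the pool size, the names, and the single-queue layout are my assumptions, not PropForth internals):

```typescript
// Sketch: a pool of "execution opportunities" pulling task passes
// from a shared queue. One pass = run a task until it surrenders.
type Pass = () => void;            // one cooperative pass of a task

const queue: Pass[] = [];          // passes waiting to run
const POOL_SIZE = 32;              // 8 cogs x 4 threads on the P2

function enqueue(task: Pass): void {
  queue.push(task);
}

// Each scheduling step, every free slot in the pool grabs the next
// queued pass. A slot stuck in a long pass just sits out; the other
// slots keep draining the queue, so the load spreads out passively.
function step(freeSlots: number): void {
  for (let slot = 0; slot < freeSlots && queue.length > 0; slot++) {
    const pass = queue.shift()!;
    pass();                        // run one pass to its surrender point
    enqueue(pass);                 // its next pass goes to the back
  }
}

// Example: more tasks than slots still works; they just take turns.
for (let i = 0; i < 40; i++) {
  enqueue(() => { /* do one slice of this task's work */ });
}
step(POOL_SIZE);
```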
I guess in an interpreted language system the same can be done at the byte-code level. The byte-code virtual machine gets the interrupts and such and then decides which thread's byte codes it wants to run next. Or there could be one VM for each thread and pre-emption goes on as above.
"Cooperative" scheduling usually refers to the fact that a thread will run as long as it likes, potentially holding up all other threads forever. Threads are expected to call some I/O functions or something like suspend() in order to give the OS a chance to run the other threads.
I have seen yet another form of threading that was a bit odd: basically, a thread was not allowed to have loops. The language did not support them. Each thread would be called at regular intervals, 10 ms or 100 ms say, from a timer interrupt, and it would be expected to run from top to bottom and exit. In fact it had no choice, as there were no loops! This is sort of interrupt driven, like preemptive, but also cooperative. This had the great advantage that when you compiled a bunch of modules into a program the compiler could tell you exactly how much of those 10 ms or 100 ms time slots you had consumed. There was never any danger of missing a real-time deadline. Which is just as well, as it was used to control Rolls-Royce jet engines.
Heater, you have seen a lot of very interesting things indeed. Thanks for sharing that.
That language was called LUCOL. LUcas COntrol Language. And created by Lucas Aerospace. It's the only system I have ever worked on where the compiler would end by giving you a summary of the execution time your code will be using when it runs. It could do that because there were no loops in the language so analysing all the possible code paths through a module and reporting the execution time of the critical path was easy.
Programmers universally hated it. "It's old fashioned", "it's crude", "it's not structured programming" "bla bla bla."
However one team I worked with decided to do their next big control project in Ada. Oh yeah, that's all the rage now, even mandated by the yanks. It's modern, it's structured, it has type safety (as if that helps at all), it even has objects!
Guess what? When I came to test their creation it was using between 50 and 90% of each available time slot. Control algorithms have to be accurately timed. That is 50 to 90%, seemingly at random, depending on what was going on at that instant. I had no way to tell whether that code would not at some time exceed the time budget and cause a system watchdog reset. A very bad situation.
I asked the dev team, "Can you prove to me that this program will never hit a situation where it will exceed its time budget?" Of course they could not!
Luckily the turbo-prop aircraft project that this was supposed to be the fly-by-wire system for was cancelled. :)
Like that, except no interrupts and no need to waste timers or call some suspend(). Everything inherently surrenders execution at I/O or at the end of a loop. Load balancing just happens, like pouring water from a glass.
Yes, potentially one CAN hold up a given thread, but that's just like any other bug, to be found and corrected in testing.
The cool thing here is that if anything DOES knock out a thread, the rest of the system keeps running, and it will barely have an impact on the rest of the system.
Not sure what you mean by "knock out". If it means stuck in an endless loop, that would normally hang up the entire cooperative system. On a single CPU, that is. Unless of course the VM, Forth in this case, is executing words from different threads in some "round robin" or other fashion.
Of course a "stuck" thread will not impact work going on in other CPUs.
Now, what's the deal with "load balancing"? Load balancing generally means there is a bunch of work queued up to do, perhaps coming in all the time, like requests to a web server, and there is more than one processor available to do the work. The load balancer's job is to take those packets of work and distribute them among the available processors so that they are all kept optimally busy and the packets of work therefore get done quicker.
I have seen no evidence of load balancing in any description of a Forth system yet.
Interesting idea. The threads themselves may not have internal loops, but each thread could be considered an implicit loop. Since the OS would loop through and execute each thread perhaps it should have been called Loopy instead of Lucol ;-)
Yes.
yes
Yes, for example 30 tasks.
In my case, the queue consists of one pass of each task. When one pass of a task is executed, the next pass gets put at the back of the queue.
Yes. 8 on P1; 32 on P2 (each of the 8 cogs has 4 threads, and these are our "execution opportunities").
yes.
Me too. There is no load balancing in any Forth that I know of. BUT the PF6 kernel behaves as if it has load balancing. The next pass of the next task in the queue is assigned to the next free execution opportunity (thread on P2, cog on P1). If the tasks are constructed according to the regular guidelines, the system behaves as if it has load balancing, in that the load gets distributed to the available processors. If one execution opportunity becomes unavailable, it has little impact on the remaining tasks. So it behaves as if it has very effective load balancing, even though there really is none.
We could imagine though that the OS (if we can call it that) has no loops either. Using JavaScript as pseudo code, the entire thing looks something like the sketch below. That is to say, the program actually halts until the next timer tick interrupt. If that does not come, nothing happens.
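A minimal sketch of that shape, in TypeScript-flavoured pseudo code (the thread names and the 10 ms tick are placeholders):

```typescript
// Loopless "OS": a timer tick calls every thread once, top to bottom.
// Each thread runs straight through and returns; the only thing that
// persists between ticks is state held in variables.

let count = 0;                       // example of persistent state

function thread1(): void { /* read inputs, set flags */ }
function thread2(): void { count = count + 1; /* no loop needed */ }
function thread3(): void { /* drive outputs from the flags */ }

function onTimerTick(): void {       // fired every 10 ms, say
  thread1();
  thread2();
  thread3();
  // ...then nothing happens until the next tick.
}

setInterval(onTimerTick, 10);        // stand-in for the timer interrupt
```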
There are no loops anywhere in the code!
Honestly, it looks a lot like a software emulation of what a real silicon state machine would do. Iterate through all of it, no matter what. That's a clock tick. Then iterate through again, another clock tick. Flags keep the state and that controls whatever is controlled, counts what needs to be counted, changes what needs to be changed, responds, etc...
A similar thing could be done with a Prop easy enough. Instead of compiling to some standard linear code, a compiler similar to what the FPGA compiler does would be needed. Maybe even actually use Verilog.
Once it's broken down into the states and flags, those act like gates and signals. All the "strips" of code get written, and each strip has a finite number of "logic blocks" based on its time to execute and memory size.
From there, fire it up and it would act just like hardware, one step at a time...
Heck, I could be way off the mark, but that's how I see such a system working optimally.
Re: loops
There could be loops in the states. A counter, for example, getting iterated each cycle through the body of code. It could count up, trigger something, count a bit more, trigger something else, count a bit more, trigger the reset to zero. There's your loop right there.
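A tiny sketch of that counter-as-loop idea (the thresholds are arbitrary):

```typescript
// The "loop" lives in the data: a counter stepped once per pass
// through the (loopless) body, triggering actions at thresholds.
let phase = 0;

function tickBody(): void {          // called once per cycle, no loops inside
  phase = phase + 1;
  if (phase === 10) { /* trigger the first thing */ }
  if (phase === 20) { /* trigger the second thing */ }
  if (phase >= 30) { phase = 0; }    // the counter resets: there's your loop
}
```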
Seems to me a lot of it could be done in parallel too. Really, it depends on what depends on what...
In any case, "the code" is more like a substrate or mere vehicle for the real program, which gets encoded in the data processed by the code.
Of course the "state" is not just flags used by the tasks but also any variables they might have. And variables they use to communicate with each other.
As a simple example, a PID controller needs to be iterated at a fixed rate. Every iteration will accumulate the integral term in a variable that persists from one call to the next. That accumulator is part of its state. The PID itself needs no loops.
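For instance, one loopless PID iteration might look something like this sketch (the gains, period, and names are made up):

```typescript
// One loopless PID step, called at a fixed rate. The integral
// accumulator and previous error persist between calls: that's the state.
const Kp = 1.0, Ki = 0.1, Kd = 0.05; // made-up gains
const dt = 0.01;                     // fixed iteration period, seconds

let integral = 0;                    // persistent state
let prevError = 0;                   // persistent state

function pidStep(setpoint: number, measured: number): number {
  const error = setpoint - measured;
  integral += error * dt;            // accumulate the integral term
  const derivative = (error - prevError) / dt;
  prevError = error;
  return Kp * error + Ki * integral + Kd * derivative;
}
```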
Interesting that you should mention FPGA. Reminds me that I was wrong in saying that Lucol was the only programming system I have seen where the compiler could tell you how much time you have used. Verilog and VHDL systems also do that. They know exactly how many gates they have to string together to do the work of a single clock tick and produce timing reports.
To do this on the Prop you don't need a compiler or language as complex as Verilog or VHDL. You could do it just by writing loopless code objects in Spin. Following compilation an analyser could:
a) Report an error if your code contains "repeat" (or recursive calls)
b) Calculate from the resulting byte codes and their known implementation in the interpreter how long your code takes to run.
This could be done from the listing output of BSTC, for example.
Spin makes this much easier as it does not have GOTO statements.
Might be possible to do it in PASM as well. That's a bit harder as we have jumps and self modifying code !
Jumps would have to never be allowed to go "backwards" up the code and self modifying code would have to be disallowed.
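A toy sketch of such an analyser for straight-line code (the opcode names and cycle counts are hypothetical, not real Spin bytecode timings):

```typescript
// Timing analysis of loopless code: given a listing and a table of
// per-opcode execution times, just add them up. Anything that could
// loop is rejected outright.
const cyclesPerOp: Record<string, number> = {
  PUSH_VAR: 8,
  ADD: 6,
  POP_VAR: 8,
  CALL: 16,
};

function totalCycles(listing: string[]): number {
  let total = 0;
  for (const op of listing) {
    if (op === "REPEAT" || op === "JMP_BACK") {
      throw new Error(`loop construct '${op}' not allowed`);
    }
    total += cyclesPerOp[op] ?? 0;   // unknown ops counted as 0 here
  }
  return total;
}

// e.g. totalCycles(["PUSH_VAR", "PUSH_VAR", "ADD", "POP_VAR"]) -> 30
```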
Point is that on every iteration the system knows you are not going to bust that sacred time budget you have before the next timer tick and thus cause chaos. Yes, that was one of the prime motivations for Lucol: to be able to easily and predictably spread the work over many processors in parallel. They just did not have enough CPU power back in the 1980s/90s for the control systems they wanted to build.
That "predictably" being critical in avionics work. Both in timing and in the "provability" of the correctness of your code. You can look at it that way. There are those that say good code is "data driven" anyway.
One could use a consistent set of PASM, just do it over and over and over, same block or blocks of instructions. If this were boiled down to the nubs, it looks a lot like game of life type stuff in an abstract sense. The program just runs. It's the data and how it interacts that gets things done. Or there could be a mix too, so long as the rules were followed.
Quite the mental exercise! Thanks again. Fun thoughts! I get the importance of provability and predictability. Aerospace is such a detail-oriented discipline anyway. Very experienced people often phrase the entire thing as being in the business of getting the details right. Every time. That airplanes, rockets, engines and all manner of things happen is a mere artifact of just getting the details right.
Edit: (again) You know, a study of these kinds of things would be valuable. It's not so much the possibility of employing them, though somebody somewhere is likely servicing odd things like this. It's more about having some robust set of concepts to ideate with. We think we've seen the scope of how to solve problems, but the truth is highly likely otherwise.
You have an interesting analogy with the game of life. In the game of life every iteration is constant time. Totally predictable. What happens in the data structure though evolves over eons of iterations. I do hope avionics systems are a bit more predictable than that though?
Details, details. The emphasis of a lot of development in computing is increased performance, or at least perceived performance and throughput. To that end we have processors with pipelines, which already makes knowing execution timing difficult. Then add caches, in multiple levels. Now an instruction can take 1 ns or thousands depending on where its operands are. Now add virtual memory and paging and multi-tasking. Boom - all bets are off as to knowing when anything happens.
On top of that layer some software conveniences for programmers, interpreted languages and garbage collection...
I'm guessing this is why Lucol was universally hated by software engineers. I'm sure hardware guys, who are used to timing issues and love Verilog and VHDL, would have appreciated it for what it is far more.
Totally agree. To many young'uns such things are old fashioned and pointless to think about. Which kind of misses the point that just because an idea is old does not make it stupid or bad or useless. Maybe it would be wise to be aware of the concepts, they might be just what you need one day.
This:
Agreed. Often in a general purpose sense, not just computing. Old ideas, means, methods, often don't take much. Sometimes we don't have much.
I can totally see the hardware guys grooving on this for sure.
We are gonna have to SKYPE one of these days. I'll send a PM. Now, I'm off to Friday night movie time. Mrs has picked something. Hope it's good.
That would depend on how you interpret the data.
Are you counting people or software?
Whatever PB is claiming that PF6.0 will do on the P2 is certainly going to force me to up my game and learn more ... each cog with 4 threads, and so on. Forth is for thinkers. I rest my case.
A rough estimate could be:
#Forth_interpreters = #existing_processors * #possible_threading_models
Historically, there appears to be at least one forth for every processor that has existed so far. I did not check this thoroughly.
The number of Forth users decreases when one dies, but they say there's a new one born every minute. The unknown is how long before a given individual tires of doing it the hard way and tries Forth.
Back in the late 70's I sold and serviced 8080/Z80 computers (S100 bus), and one of my customers purchased a system with an 8 port serial board to use as a Key to Disk system for data entry. I found it hard to believe that a 64KB 2MHz 8080 could possibly collect data from 8 terminals and write it to floppy disk at any reasonable speed, but he managed to write the software to do it.
There were 8 serial I/O routines, each with a memory buffer. Each I/O routine would be called in turn, and each one checked to see if there was a character in the UART. If a character was available it would be echoed to the terminal, loaded into a register, and the main program (re-entrant single copy) would be called to process it and write data to a floppy. It was absolutely amazing to see eight people typing away at terminals and realize that a single little 8 bitter could do all this without any noticeable delays.
While I am sure there were multiple loops in the main program, the individual I/O routines were very similar to the LUCOL ideal of no loops.
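Something in the spirit of that polling scheme, sketched in TypeScript (the UART access is a stub and the buffer handling is simplified):

```typescript
// Eight serial routines, each with its own buffer, polled in turn;
// a single shared handler processes whatever character arrived.
const PORTS = 8;
const buffers: string[] = Array.from({ length: PORTS }, () => "");

function uartGetChar(port: number): string | null {
  return null; // stub: return a char if one is waiting on this port, else null
}

function handleChar(port: number, ch: string): void {
  buffers[port] += ch;               // shared handler, one copy for all ports
  // ...echo to the terminal, write a record to floppy when complete...
}

function pollOnce(): void {          // one pass over all eight ports
  for (let port = 0; port < PORTS; port++) {
    const ch = uartGetChar(port);
    if (ch !== null) {
      handleChar(port, ch);
    }
  }
}
```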