Certainly my buggy program is not a soft error. I presume by "soft error" you mean those random errors caused by cosmic rays flipping bits, or EMI, or that "glitchy processor" mentioned above, or other such random failure. Such faults go away on reset or power cycle. On the other hand my bug is permanently there.
Sometimes it can be hard to tell the difference though. Surely you have had bugs that are triggered by some weird combination of inputs, or a specific sequence of events or timing oddities. These bugs may be rare and randomly occurring.
An MMU could help debug such "hostile programs". As can array bounds checking, divide by zero traps and so on.
The question is still there: what do you do when such a fault occurs? Stop forever and light a fault LED, reset and rerun everything, restart just part of the system, something else? Your application requirements will dictate that. For example, I have worked on systems where any error is grounds for halting and triggering an emergency stop of the machine it is controlling before things get dangerously out of control; no time to be restarting anything.
I'm inclined to go with the view that an MMU on a micro-controller is not really of much use in general.
Of course I take the view that an MMU is just a kludgy way to get apparently private memory spaces for processes when you don't actually have physically separate memory systems for each processor. In the same way that interrupts are a kludgy way to appear to have a processor per asynchronous event when you don't actually have enough processors.
My dream machine would have separate 512K RAM areas for each of its 16 cores and have hardware channels for pumping data between them (and chip to chip communication). Code runs from RAM at full speed like any other processor. Real process isolation.
In the world I work in, management don't like me trying to fix other people's bugs. I have done so, but more often they'll opt for the quickest workaround, which may mean fitting extra hardware to compensate. But the workaround most commonly chosen involves the machine operators following a set procedure that avoids triggering the bug.
PS: I'm not employed in software engineering so it's frowned on when I start doing such work. I do technician/maintenance work, i.e. keep the machines running.
Arguably an MMU can protect you from your own buggy program. Program steps out of bounds, it gets aborted. This at least saves you from continuing with incorrect data and producing wrong results.
Question is then what? Halt the machine? Reboot it automatically? This rather depends on your application requirements. A reset is often acceptable, that is why we have watchdogs that do that.
Thinking about flight control assistance and self driving cars/big rigs, I would prefer a reboot, not just halting the process.
Better just a reboot of the failing process. Not all of them.
Basically, if your program does not write into the lower 32K of your EEPROM you should be able to recover a P1 system without problem.
But you lose any state information; not nice if the quad-copter (or Boeing 777) is trying to land on your dog/wife/car/house at the same time.
So I think using three independent systems (say IBM/ARM/Motorola) and doing some voting seems quite reasonable to me to provide the redundancy.
Sure, an MMU can give some protection, but somehow processes/tasks/cores have to communicate and need shared memory, which is then able to be clobbered.
No.
The prop2 does not really need a MMU.
First 512 longs are protected memory in the COG. No other COG can access those, except restarting that COG.
Hubexec and rdlong/wrlong can access a shared resource called hub RAM. If you really need to you can mask the address before each rdlong/wrlong with a simple 'and' to restrict it to some memory area of the hub. It's software.
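In C-ish terms the same masking trick looks something like this (just a sketch; the window base, size and function names are made up for illustration):

#define HUB_WINDOW_BASE  0x2000u            // assumed start of the region this code may touch
#define HUB_WINDOW_SIZE  0x1000u            // assumed power-of-two window size (4 KB)

// Every hub access goes through these helpers. The AND folds any offset,
// sane or insane, back into the allowed window, so a wild offset cannot
// stray outside it. Purely software, no MMU involved.
unsigned char hub_read(const unsigned char *hub, unsigned offset)
{
    return hub[HUB_WINDOW_BASE + (offset & (HUB_WINDOW_SIZE - 1u))];
}

void hub_write(unsigned char *hub, unsigned offset, unsigned char value)
{
    hub[HUB_WINDOW_BASE + (offset & (HUB_WINDOW_SIZE - 1u))] = value;
}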
I think the COG-memory protection given by 8/16 COGs is way better than anything an MMU programmed by an OS can do. No rogue process can access any other COG and change something. It can just stop and/or restart it with its own code. Can't even read the COG memory. Not even an OS. It is good as it is; no need for an MMU at all. Enjoy! Mike
Fundamentally, an MMU doesn't fix any bugs by itself; the protection it provides is to the other, supposedly non-buggy, programs. When there is only the one program running, then there is nothing left to protect.
PS: In the case of a many-cored design like the Prop it can still be considered a single program when everything is in the one purpose-customised application.
Quite so, loopy, multiple redundant systems is the way to go for ultimate reliability. I worked on the Primary Flight Computers of the Boeing 777. There are three PFC boxes, each of which contains Intel 486, Motorola 680xx and AMD 29K processor cards. Having three different architectures allows for there being different bugs in the CPU implementations or the compilers.
Turns out though that if one PFC should ever fail it does not reboot, it halts; that leaves you in a far less fault-tolerant state unless the pilot takes some action. There is a fourth PFC, waiting in standby, that can be switched in if I remember correctly.
If the whole shebang fails they all shut down! No worries, the 777 can be flown using the remaining analog control system, albeit not so easily.
It also turns out that there are single faults that can defeat a 3-way voting system, causing it to be unsure as to which node has actually failed and hence failing in total. See Leslie Lamport's paper "The Byzantine Generals Problem": http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf It's a nice read if you ever find yourself riding a 777.
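For what it's worth, the vote itself is only a few lines; here is a minimal sketch of a 2-out-of-3 vote on a single computed value (my own illustration, nothing to do with the actual PFC code):

// Minimal 2-out-of-3 majority vote on one computed value.
// Returns the agreed value and flags any disagreement for the monitor.
int vote3(int a, int b, int c, int *disagree)
{
    *disagree = !(a == b && b == c);
    if (a == b || a == c)
        return a;               // a agrees with at least one other node
    if (b == c)
        return b;               // a is the odd one out
    return a;                   // all three differ: no majority, we cannot
                                // tell which node failed (the Byzantine case)
}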
Processes/tasks/cores do not need shared memory to communicate. It can be done over dedicated communication channels, XMOS chips have such channel hardware between cores. Google servers cooperate over IP networks.
I think the COG-memory protection given by 8/16 COGs is way better than anything an MMU programmed by an OS can do. No rogue process can access any other COG and change something.
This is true in that you cannot directly read/write someone else's COG space. But it does not help. Firstly, I can corrupt the code before it is loaded to another COG. Secondly, I can corrupt whatever HUB space a COG is using for communication. But, yeah, just say no to an MMU.
An MMU cannot fix bugs in your program. It can however add some robustness in the face of rarely triggered bugs. An out of bounds access halts the process, a COG, and, provided something in the system notices, that COG can be restarted.
It gets complex though. If COGs maintain significant state they won't be coming up in the state the users of that COG expect it to be in. Easy for a simple UART driver. Not so easy for a video driver: what was on the screen when it died?
All in all it's a kludgy, messy, hit-and-miss way to be detecting faults.
A halted/resetting controller is no use to the operator in the field. And it's just going to do the same thing again when in the same condition, so until it's fixed the MMU is still not helping. And an operator can manually reset it anyway.
Protection from bugs is only useful if there are other programs that need protection. When it's just the one purpose-customised application then bug protection doesn't do anything.
I don't disagree. As I said, it all depends on your application.
There are plenty of software bugs that only show up rarely. If your application maintains little state information and losing the results of a particular iteration is not critical, a reset can get it running again and service is maintained. A remote status monitoring system might be such an example.
If your application has significant state it may be better to just halt and light that fault LED. An example might be the controller of a 3D printer. A reset halfway through a job is not going to help.
Yep, as the guy from Google responsible for architecting their fault tolerance said, "Checksum everything". However ECC does not save you from those elusive software bugs. Or oddities like the Intel Pentium F00F bug that corrupted some floating point calculations sometimes. PS: Soft errors, rather than bugs, are often the cause of a watchdog trip.
I was thinking of the FDIV bug which occasionally produced incorrect floating point divide results. Intel calculated it was so rare that no one would notice and they could ship the chip anyway. Of course people did notice. It was a scandal at the time. https://en.wikipedia.org/wiki/Pentium_FDIV_bug
...explain how javascript can be used to violate its memory allocations?
Those examples will all be hitting the bounds check, not the MMU.
True. If your JS engine is working correctly. In the context of this little sub-debate we cannot assume that.
Let's talk C.
It's a trivial exercise to convert those JS examples to C and blow up your memory bounds triggering an MMU fault.
I'm not sure what we are disagreeing about any more. I have no desire for an MMU on the Prop and part of my argument, way back with mOOtykins here, was that with a "safe" programming environment it was not necessary.
As you point out JS makes for safety in that respect. No MMU required.
This part started from me claiming that that particular FDIV FPU bug wouldn't cause memory violations. Your examples in JS only use the FPU because it's typeless. Those examples in C wouldn't be using the FPU and thereby wouldn't be affected by the FPU bug.
Yep, agreed, not worth the nit-picking. A MMU does indeed help with debugging.
Just to answer that question about a FDIV bug leading to a memory violation:
Imagine measuring/calculating a series of values of something, and that operation uses floating point, specifically FDIV in this case. Say we want to create a histogram of those values: we separate them out into range bins and, depending on the bin number, increment a counter for that bin. Thus creating a histogram.
Of course we keep those counters in an array and simply increment the array element indexed by the bin number, in sort of C pseudo code we have:
int histogram[256];               // one counter per range bin
int binNumber = 0;
float measurement;
int i;

for (i = 0; i < whatever; i++)    // whatever = however many readings we take
{
    measurement = measure();          // take a reading (floating point inside)
    binNumber = binIt(measurement);   // classify it into a bin number
    histogram[binNumber]++;           // no bounds check on the computed index
}
There we have it. A floating point bug, hardware or software, can cause an incorrect integer value to be calculated, which is used to index an array, which may be well out of bounds and cause a memory violation.
That's dopey coding practice if there is no bounds check on a data-generated index. Even JS will bail on that one if the results are insane enough.
Not to mention you're convoluting things by converting from int to float and back again. I say this because you've allocated 256 bins. That's pretty hostile coding really.
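The check is only a couple of lines. Here is a sketch of that same loop with the index tested before use (the faultCount counter is just made up for illustration):

int faultCount = 0;                           // made-up counter for out-of-range results

for (i = 0; i < whatever; i++)
{
    measurement = measure();
    binNumber = binIt(measurement);
    if (binNumber >= 0 && binNumber < 256)    // bounds check the data-generated index
        histogram[binNumber]++;
    else
        faultCount++;                         // note the insane result rather than index wildly
}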
We cannot say that is "dopey" unless we know what is in measure() and binIt(). They could of course be combined into the same function. There is no convoluting going on; there is only one conversion, from float to int, done here.
Certainly we hope any program is checking its inputs. But it is common practice to not check the range of every parameter to every function. Why would you? You know your code is correct for the allowed inputs.
Let's say measure() is doing something complex, for example reading a 9 DOF motion sensor and "fusing" those readings into a heading angle returned as a float value in range 0.0 to 1.0. Then binIt() does a simple scaling calculation to turn the range 0.0:1.0 into an integer 0:255. We will assume sensor inputs have been checked for sanity and the code reviewed, tested and correct.
Then strikes the FDIV bug, triggered by some specific floating point values, BOOM! measure() produces a huge float number, binIt() produces a huge int value, the array bounds are exceeded, the memory fault is triggered.
In general this is pretty much the same as the idea of the randomly "glitchy" CPU logic introduced some posts back.
You have not read what my code example and narrative say. That 256 is nothing to do with the raw input number range. It is simply the number of "range buckets" I want some value "measurement" classified into so that I can count the values in each bucket and form a histogram. That 256 could be 10 or 100 or 1000. Whatever. My raw input ranges can be anything.
In my hypothetical example I said nothing about the type or range of the raw sensor reading. It's a 9 DOF motion sensor, i.e. 3 gyro, 3 accelerometer and 3 magnetometer. I indicated that all raw inputs would be range checked.
measure() gets those 9 raw inputs, perhaps does a quaternion sensor fusion algorithm on them and produces a single float 0.0:1.0
binIt() takes that float and converts it to a "range bucket" or bin number, perhaps by simply multiplying by the number of bins we want and doing a floor operation to round it to an int. In this case I have 256 bins but it could be any arbitrary number, perhaps chosen only to make my histogram plot look nice.
In the last line we increment a counter in the chosen range bucket, to form our histogram.
All this code could be perfectly OK for all possible values of input.
But a faulty FDIV or other floating point operation can trigger an MMU fault as I describe.
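To make it concrete, binIt() need be nothing more than this (a sketch only; the 256 bins and the 0.0:1.0 range are just the numbers from my example above):

#include <math.h>

#define NUM_BINS 256

// Map a fused reading, assumed to lie in 0.0 <= measurement < 1.0, onto a
// bin number 0..255: multiply by the bin count and floor it down to an int.
// Deliberately no range check here: for sane inputs none is needed, which is
// exactly why a wild floating point result produces a wild index.
int binIt(float measurement)
{
    return (int)floorf(measurement * (float)NUM_BINS);
}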
Okay, gotcha now. Not something I've tried to do for sure. Looking up the Pentium 1's FDIV bug it wouldn't have produced any error that could generate out of bounds on that array. I don't think a worst case would even round down the binning index. Would need a much bigger array range for that to have a chance.
There is a littered history of slight rounding errors in FPUs but they're slight. I guess a slight rounding up error has the potential of maybe reaching one index beyond the max. Funny how arrays are often allocated +1 for safety.
If the amplification was so severe that a small error could throw wild indexes then it clearly needed bounds checking post-scaling as well as pre-scaling. I'd hate to think of what dataset would need that.
Still the point stands, random wrong results, whether due to hard faults like the FDIV bug or those "glitchy" CPU errors can give rise to MMU faults. Not to mention any number of other surprises. I believe the point way back was if you were lucky enough to get an MMU fault that was better than continuing with wrong data.
Having said all of that, I did actually work on a cryptographic system decades ago where there was such paranoia about leaking plain text data that there were indeed physically switched banks of RAM holding "red" and "black" data. A sort of crude MMU. There were software assertions all over the code to ensure that all function parameters and return values were sane. And all kinds of other checks, continuously running EPROM and RAM checks for example. They even had a team of guys checking the assembler produced by the compiler we were using!
In the case of a dedicated controller it's six of one, half a dozen of the other. A recurring reset from an MMU fault, or continuing without a reset until maybe a watchdog trips.
What's red and black data? Is that like repeating the exercise over again to double check?
It's that thing about not leaking plain text messages or crypto keys out over the network. Plain text and keys were "red" data, encrypted data was "black". There was hardware switching added to the RAM circuitry to ensure that tasks that did not need to see "red" data could not, like the wireless protocol driver for example. Meanwhile a task dealing with plain text could not drop data into the protocol driver's areas where it may get transmitted. All this on an Intel 8088 based system. This crude MMU did actually allow access to 8MB of RAM as well. It was a PC compatible hand held machine.