Can I remove the BSOD?

Solution 1:

A BSOD is simply the visual error message for what Windows calls a bugcheck. Bugchecks (or, as *nix calls them, kernel panics) happen because the OS can't "deal with it".

For example, an instruction tries to read from an address that just doesn't exist. Or it raises a page fault while trying to read memory under circumstances in which a page fault can't be resolved.

Exactly what data would you provide to the code that needs the result of the read operation, so that it can "go on"?

These errors usually occur because the instruction in question is accessing the wrong address. What address would you change it to when trying again? How do you know what it should have been?

Of course these errors can be raised when writing to memory, too. Ok, you could just not do the write and go on. But code doesn't just write to memory because it feels like it. Some other code later is going to need that information that was supposed to be written to memory, and then it will have the same issues with "just go on." More on that in a moment.

A possible workaround would be greater isolation of components. Suppose this error happened within your sound card driver - a fairly nonessential device for many things we do on our machines. You might say "let's just report the error and stop using the sound card until the next reboot, maybe the user will update the driver or replace the sound card."

The trouble is that all code and data in kernel mode is equally trusted. An error raised in kernel mode means that all the code and data in kernel mode is, by the evidence just seen, not trustable. And there's no way to know that it hasn't done more subtle damage before raising the error the OS noticed. (This results in the mantra "the victim is not always the culprit." It's very common to find that the code that raised the error did so because some other component in kernel mode corrupted memory contents. It's often quite difficult to find the other component.)
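To make "the victim is not always the culprit" concrete, here is a toy sketch in Python. It is not kernel code: the shared `bytearray`, the two "components", and all the names are made up for illustration. The only point is that when nothing separates one component's memory from another's, the component that notices the problem may not be the one that caused it.

```python
# One flat region standing in for kernel memory: nothing separates A's bytes from B's.
shared = bytearray(8)
A_BUF_START, A_BUF_SIZE = 0, 4   # hypothetical component A owns a 4-byte buffer
B_LEN_OFFSET = 4                 # hypothetical component B keeps a length field right after it
B_MAX_LEN = 3


def b_init():
    """Component B stores a valid length."""
    shared[B_LEN_OFFSET] = 3


def a_copy(data):
    """Component A copies data into its buffer.

    Bug: it never checks len(data) against A_BUF_SIZE, so extra bytes
    spill over into B's length field.
    """
    shared[A_BUF_START:A_BUF_START + len(data)] = data


def b_use():
    """Component B trusts its own field and is the one that notices the damage."""
    length = shared[B_LEN_OFFSET]
    if length > B_MAX_LEN:
        raise RuntimeError(f"component B: corrupt length {length}")
    return length


def demo():
    b_init()
    a_copy(b"\xff" * 6)   # A overruns its buffer by two bytes, silently
    try:
        b_use()
    except RuntimeError as exc:
        return str(exc)
    return "no error noticed"
```

Calling `demo()` reports the failure in component B, even though the bug is in A's copy routine. Nothing in the error points back at A, which is why these are so hard to track down.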

Another aspect of the bugcheck approach: These errors are "reported" (by crashing) as soon as they're noticed (usually because they induce an unhandleable exception). For a few bugcheck reasons it actually would be possible to do something seemingly reasonable and go on. However, this obscures the fact that an error occurred. Errors in kernel mode will likely eventually lead to an "I really can't go on" condition.

In the example of the failed memory write given above, yes, you could say "so just don't do the write". But eventually something is going to need the data that wasn't written, and then that code will fall over.
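Here is a small Python sketch of that difference. Again this is only an analogy, not how any real kernel is written; the `Memory` class, the addresses, and the "producer"/"consumer" names are invented for the example. With `fail_fast=True` the error is raised at the bad write; with `fail_fast=False` the write is quietly dropped and some other code falls over later.

```python
class FaultyWrite(Exception):
    """Raised when a write targets an address that is not mapped."""


class Memory:
    def __init__(self, valid_addresses, fail_fast):
        # None marks a cell that exists but has never been written.
        self.cells = {addr: None for addr in valid_addresses}
        self.fail_fast = fail_fast

    def write(self, addr, value, who):
        if addr not in self.cells:
            if self.fail_fast:
                raise FaultyWrite(f"{who} wrote to unmapped address {addr:#x}")
            return  # "just go on": the write is silently dropped
        self.cells[addr] = value

    def read(self, addr, who):
        value = self.cells[addr]
        if value is None:
            raise RuntimeError(f"{who} read uninitialized data at {addr:#x}")
        return value


def run(fail_fast):
    mem = Memory(valid_addresses=[0x1000], fail_fast=fail_fast)
    try:
        # The buggy producer computes the wrong address (0x2000 instead of 0x1000).
        mem.write(0x2000, 42, who="producer")
        # Later, unrelated code needs the value that should have been written.
        mem.read(0x1000, who="consumer")
    except Exception as exc:
        return str(exc)
    return "ok"
```

`run(True)` blames the producer at the moment of the bad write. `run(False)` gets past the write and then blames the consumer, which did nothing wrong. The second report is the one you'd be stuck debugging.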

Crashing at the first sign of an error in kernel mode, even a sort-of-recoverable one, gets you a memory dump as close to the problem as possible. Long experience (with OSs that go back decades before NT first shipped; both BSD and VMS date to the late 70s/early 80s) has shown that even if you do manage to sort-of fix things up and go on, when the system crashes later it's going to be that much more difficult to find the problem.

The problem is not likely that you'd "wreck good hardware". It's that there just isn't any way to shrug and keep going that is likely to have a happy result. It's like reaching a fork in the road but your GPS is telling you to go straight ahead, a path that doesn't exist. There's no way to follow that direction and there is no hint as to which fork to take. You could pick a path at random... this isn't likely to result in getting to where you want to go. (Or a stable OS.)

But if, let's say, each driver was in some sort of a sandbox or memory partition of its own, then we could assume that it had done no damage outside of that partition, right?

Actually, yes! The trouble now is that you're not describing x86/x64 architecture or the way OSs generally use it.

But, hmmm... what if you could run each driver in its own process? Processes are isolated from each other, right? If one of them makes a mistake like this we could just shut down that process and the rest of the OS keeps going.

Turns out that for devices for which speed is not too important, you can! This is what Windows' "User-Mode Driver Framework" (UMDF) is about.
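The general idea of "put it in its own process and survive its crash" can be sketched in a few lines of Python. To be clear, this is not the UMDF API and says nothing about how Windows actually hosts user-mode drivers; the `DRIVERS` table and the function names are made up, and the "drivers" are just tiny Python programs run as child processes.

```python
import subprocess
import sys

# Hypothetical "drivers": each is a tiny program run in its own process.
DRIVERS = {
    "sound": "raise RuntimeError('bad read')",  # this one hits a fatal error
    "serial": "print('ok')",                    # this one behaves
}


def run_driver(code):
    """Run one driver in a separate process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    return result.returncode


def host():
    """Start every driver; a crash only takes down that driver's process."""
    status = {}
    for name, code in DRIVERS.items():
        if run_driver(code) == 0:
            status[name] = "ok"
        else:
            status[name] = "disabled until restart"
    return status
```

`host()` keeps running after the "sound" driver dies and simply marks it as disabled. That only works because the process boundary guarantees the failed driver couldn't have scribbled on anyone else's memory, which is exactly the guarantee kernel mode doesn't give you.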

But invoking those user-mode drivers involves a very long code path with many ring transitions and process-to-process context switches. (Well, long compared to invoking a kernel mode driver.) You wouldn't want to use UMDF for your disks or video card. You could move your HID drivers there, but the HID drivers are pretty much solved problems.

Still, there are some devices that can tolerate the latencies (especially on today's fast CPUs). For a device that really needs USB 3 speeds a UMDF driver would likely be too slow, but a device that could work ok on USB 1.1 (like a serial port adapter) probably wouldn't mind using a UMDF function driver. With the improvements in UMDF architecture that came with Win 8.1 you're going to see more and more devices using UMDF in the future.

At least part of the "bus driver" that handles the USB host controller interface for all USB devices would still be in kernel mode, because it has to touch I/O ports and registers, and that stuff is only accessible in kernel mode; changing that would throw all security out the window.

Which leads to this: The core OS and kernel mode drivers are going to have to stay in kernel mode to be able to do some of the things they have to do - using privileged instructions, responding to interrupts and handleable exceptions, accessing I/O hardware... all the stuff that's in the "System Programming" parts of the x86/x64 instruction set references. And for unhandleable exceptions in kernel mode code I'm afraid the answer to "just go on" is still "and do what, exactly?"

Sorry to put it bluntly, but to say that an undefined operation attempted in kernel mode should just produce an error message, and that the OS should then "deal with it", is to misunderstand what kernel mode is supposed to mean. You can't just make up a result and have any reason to think that the subsequent code will be happy with it.