Why would certain network switches stop working while others are fine?
I am sure many of us are used to, or have at least occasionally experienced, the routine of having to reboot or power cycle a cable modem, DSL modem, router, or hub/switch.
However, I decided to post here in response to a recent widespread issue I just experienced. We have a number of network closets across several buildings.
Most of them have managed, high-quality switches: gigabit, with fiber between buildings and sometimes between closets.
Over the weekend we had some sort of power glitch. However, the glitch only hit one building, not all the locations.
After that, there were lots of network issues across all the buildings: problems with printers, loss of connectivity, and more.
It seems like all of the 'high quality' managed network equipment is fine. However, in some areas we have some consumer grade, non-managed switches - for example, a large office that has only 1 network drop but requires several connections. We have now been gradually making our way around to all of these switches (as users call in with issues) and power cycling them, which fixes the issue for the user. The switch usually looks normal; on some of them, all the lights are on (when they shouldn't be).
So why would all these switches start malfunctioning? Some kind of bogus routing data being pushed out from a switch hit with the power glitch?
I'm going to invoke Occam's Razor on this. While I suppose it's possible that some specific malformed packet(s) could cause your lower cost switches to fall into the failure mode you're describing, I'd consider that a very unlikely cause. The switches that you're describing as having problems (small, unmanaged switches) aren't likely to have spanning tree implementations, let alone support for layer 3 switching and dynamic routing protocols. That type of switch should be "blind" to the actual content of the frames it's switching, beyond using the source and destination MAC addresses to make switching decisions.
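To make the "blind to content" point concrete, here's a rough sketch in Python of the only per-frame logic an unmanaged switch really runs. The class and field names are made up for illustration; real switches do this in silicon (a CAM table), not software:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    src_mac: str    # source MAC address
    dst_mac: str    # destination MAC address
    payload: bytes  # the switch never inspects this

class UnmanagedSwitch:
    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.mac_table = {}  # MAC address -> port number (the "CAM table")

    def forward(self, frame: Frame, ingress_port: int) -> list[int]:
        # Learn: remember which port the source MAC was seen on.
        self.mac_table[frame.src_mac] = ingress_port
        # Forward: a known destination goes out one port; an unknown one
        # gets flooded out every port except the one it arrived on.
        if frame.dst_mac in self.mac_table:
            return [self.mac_table[frame.dst_mac]]
        return [p for p in range(self.num_ports) if p != ingress_port]
```

The payload is never parsed, so "bogus routing data" has nothing to poison here; it would take corruption of the MAC table itself (or the hardware driving it) to produce the behavior you're seeing.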
This makes me believe that you had a power issue more widely than you realize.
Going with a power issue assumption, I'd say you're having problems with the low cost switches because they're likely low quality switches. I know this sounds trite, but that's been my experience with networking gear over my entire career (with very few exceptions). You generally get what you pay for (and, though something may be priced incorrectly, the market sorts that out pretty quickly).
A higher cost switch is typically going to have a better power supply that is more likely to run within tolerances when exposed to "glitchy" utility power. I suspect that the power supplies in your lower cost switches probably started putting out bad power when the utility power went out of spec. At that point, some part of the "brains" of the switch ended up in a "this should never happen" scenario because one or more of the power rails drifted too far out of tolerance.
An Ethernet switch isn't typically a single ASIC running the whole show, but rather a group of ASICs that do different jobs, connected to each other. Without knowing the architecture of the switch in question it's hard to say anything definite. I've had experience with a model of switch, many years ago, that used a single ASIC to run a group of 4 ports. Certain types of failures would cause groups of 4 ports on the switch to "flake out" while the rest of the switch kept running fine. A partial failure of a switch isn't abnormal in my experience.
In the case of your failure, the parts of the switch that handled keeping the lights on, for example, kept running fine. The physical interface hardware (the PHYs) probably kept running just fine (since you were probably seeing "lights" out on the far ends of the connections). Something else, however, didn't keep working right and you ended up seeing a lack of connectivity. In the cases where I've been "fortunate" enough to catch a switch "in the act" of failing like this I've plugged my laptop into a "problem" port and observed (using Wireshark) a totally "dark" network without any broadcast packets or the other "noise" commonly associated with a typical "working network". Packets transmitted into these ports never showed up elsewhere in the network-- they just fell into a "black hole". I bet you'd see something similar in your situation.
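If you want to catch one "in the act" the same way, a quick and dirty check for that broadcast "noise" might look like the sketch below. It assumes Python with the scapy package installed and that "eth0" is your wired interface; adjust both for your environment, and run it with root privileges:

```python
from scapy.all import sniff

# Plug the laptop into the suspect port, then count broadcast frames for 30
# seconds. On a healthy access port you'd normally see ARP and other
# broadcast chatter within a few seconds.
frames = sniff(
    iface="eth0",
    filter="ether dst ff:ff:ff:ff:ff:ff",  # BPF filter: broadcast frames only
    timeout=30,
)

if frames:
    print(f"Saw {len(frames)} broadcast frames -- the port looks alive.")
else:
    print("No broadcast traffic in 30 seconds -- the port may be black-holing frames.")
```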
Cisco 1900 series switches were notorious for this some years ago.
These switches used two power feeds internally: 5 volts for the CPU/backplane, 12 volts for the CAM memory. On a short power spike the 5 volt rail remained stable enough for the switch to continue running, but the 12 volt rail dropped enough for the CAM memory tables to get corrupted. Unfortunately there was no way for the switch CPU to detect the memory corruption, which caused all sorts of havoc with L2 switching and ARP.
That's why we put a small UPS in each patch-cabinet. That was a lot cheaper than resetting every switch by hand. (And dealing with the pissed users.)
Cisco fixed this in later models. I have heard of the same issues with older HP switches as well.
I'm certain there is a lot of hardware around, especially in the consumer/SOHO segment, that has similar issues. A good-quality power supply is still one of the more expensive components of the device, so it's usually the first item that gets downgraded when the PHBs decide that the margins on the product are too low.