What's the toughest bug you ever found and fixed? [closed]

Solution 1:

A jpeg parser, running on a surveillance camera, which crashed every time the company's CEO came into the room.

100% reproducible error.

I kid you not!

This is why:

For those of you who don't know much about JPEG compression: the image is broken down into a matrix of small blocks, which are then encoded using magic, etc.

The parser choked whenever the CEO came into the room, because he always wore a shirt with a square pattern on it, which triggered some special case of the contrast and block boundary algorithms.

Truly classic.

Solution 2:

This didn't happen to me, but a friend told me about it.

He had to debug an app which would crash very rarely. It would only fail on Wednesdays, in September, after the 9th. Yes, 362 days out of the year it was fine, and three days out of the year it would crash immediately.

It would format a date as "Wednesday, September 22 2008", but the buffer was one character too short, so it only caused a problem when a two-digit day of the month fell on the day with the longest name, in the month with the longest name.
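A minimal sketch of the sort of code that bites you this way (the function name, buffer size, and format string are my assumptions, not the original source):

    /* Hypothetical reconstruction -- the real code is unknown. */
    #include <stdio.h>

    void print_date(const char *weekday, const char *month, int day, int year)
    {
        char buf[28];  /* one byte too short for the worst case */

        /* "Wednesday, September 22 2008" is 28 characters plus the
         * terminating '\0', i.e. 29 bytes. Every shorter weekday name,
         * month name, or one-digit day fits; only the worst case
         * overflows the buffer by a single byte. */
        sprintf(buf, "%s, %s %d %d", weekday, month, day, year);
        puts(buf);
    }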

Solution 3:

This requires knowing a bit of Z-8000 assembler, which I'll explain as we go.

I was working on an embedded system (in Z-8000 assembler). A different division of the company was building a different system on the same platform, and had written a library of functions, which I was also using on my project. The bug was that every time I called one function, the program crashed. I checked all my inputs; they were fine. It had to be a bug in the library -- except that the library had been used (and was working fine) in thousands of POS sites across the country.

Now, Z-8000 CPUs have 16 16-bit registers, R0, R1, R2 ... R15, which can also be addressed as 8 32-bit registers, named RR0, RR2, RR4 ... RR14. The library was written from scratch, refactoring a bunch of older libraries. It was very clean and followed strict programming standards. At the start of each function, every register that would be used in the function was pushed onto the stack to preserve its value. Everything was neat and tidy; the functions were perfect.

Nevertheless, I studied the assembler listing for the library, and I noticed something odd about that function: at the start of the function it had PUSH RR0 / PUSH RR2, and at the end it had POP RR2 / POP R0. Now, if you didn't follow that, it pushed 4 values onto the stack at the start, but only removed 3 of them at the end. That's a recipe for disaster. There was an unknown value on the top of the stack where the return address needed to be. The function couldn't possibly work.

Except, may I remind you, that it WAS working. It was being called thousands of times a day on thousands of machines. It couldn't possibly NOT work.

After some time debugging (which wasn't easy in assembler on an embedded system with the tools of the mid-1980s), I confirmed that it always crashed on the return, because the bad value sent execution to a random address. Evidently I had to debug the working app to figure out why it didn't fail.

Well, remember that the library was very good about preserving the values in the registers, so once you put a value into a register, it stayed there. R1 had 0000 in it, and it would always have 0000 in it when that function was called. The bug therefore left 0000 on the stack. So when the function returned, it would jump to address 0000, which just so happened to contain a RET instruction, which popped the next value (the correct return address) off the stack and jumped to that. The data perfectly masked the bug.

Of course, in my app, I had a different value in R1, so it just crashed....

Solution 4:

This was on Linux but could have happened on virtually any OS. Now most of you are probably familiar with the BSD socket API. We happily use it year after year, and it works.

We were working on a massively parallel application that would have many sockets open. To test its operation we had a testing team that would open hundreds, and sometimes over a thousand, connections for data transfer. With the highest channel numbers our application would begin to show weird behavior. Sometimes it just crashed. Other times we got errors that simply could not be true (e.g. accept() returning the same file descriptor on subsequent calls, which of course resulted in chaos).

We could see in the log files that something went wrong, but it was insanely hard to pinpoint. Tests with Rational Purify said nothing was wrong. But something WAS wrong. We worked on this for days and got increasingly frustrated. It was a showstopper, because the already negotiated test would cause havoc in the app.

As the error only occurred in high-load situations, I double-checked everything we did with sockets. We had never tested high-load cases in Purify because it was not feasible in such a memory-intensive situation.

Finally (and luckily) I remembered that the massive number of sockets might be a problem with select(), which waits for state changes on sockets (may read / may write / error). Sure enough, our application began to wreak havoc exactly the moment it reached the socket with descriptor 1025. The problem is that select() works with bit field parameters. The bit fields are filled by the FD_SET() macro and friends, which DON'T CHECK THEIR PARAMETERS FOR VALIDITY.

So every time we got over 1024 descriptors (each OS has its own limit; vanilla Linux kernels have 1024, and the actual value is defined as FD_SETSIZE), the FD_SET macro would happily write past its bit field and scribble garbage into the next structure in memory.
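A minimal sketch of the hazard, assuming nothing about the real application code; the guard shown here is exactly the check the macro itself never does:

    #include <sys/select.h>

    void watch_fd(fd_set *readfds, int fd)
    {
        /* FD_SET() does no bounds checking: with fd >= FD_SETSIZE
         * (1024 on a vanilla Linux kernel) it sets a bit past the end
         * of the fd_set and silently corrupts whatever lies next in
         * memory. */
        if (fd >= 0 && fd < FD_SETSIZE)
            FD_SET(fd, readfds);
        /* else: fall back to poll(), or refuse the connection */
    }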

I replaced all select() calls with poll(), which is a well-designed alternative to the arcane select() call, and high-load situations have never been a problem since. We were lucky because all socket handling was in one framework class, where 15 minutes of work solved the problem. It would have been a lot worse if select() calls had been sprinkled all over the code.
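Roughly, the replacement looks like this (the function name and structure are my assumptions; the real framework class isn't shown):

    #include <poll.h>
    #include <stdlib.h>

    /* Block until at least one of the given descriptors is readable.
     * poll() takes an array of descriptors instead of a fixed-size bit
     * field, so there is no FD_SETSIZE ceiling to fall over. */
    int wait_for_readable(const int *fds, nfds_t count)
    {
        struct pollfd *pfds = calloc(count, sizeof *pfds);
        if (pfds == NULL)
            return -1;
        for (nfds_t i = 0; i < count; i++) {
            pfds[i].fd = fds[i];
            pfds[i].events = POLLIN;
        }
        int ready = poll(pfds, count, -1);  /* -1 = wait indefinitely */
        free(pfds);                         /* simplified: per-fd results discarded */
        return ready;
    }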

Lessons learned:

  • even if an API function is 25 years old and everybody uses it, it can have dark corners you don't know about yet

  • unchecked memory writes in API macros are EVIL

  • a debugging tool like Purify can't help with all situations, especially when a lot of memory is used

  • Always have a framework for your application if possible. Using it not only increases portability but also helps in case of API bugs

  • many applications use select() without thinking about the socket limit. So I'm pretty sure you can cause bugs in a LOT of popular software by simply using many many sockets. Thankfully, most applications will never have more than 1024 sockets.

  • Instead of having a secure API, OS developers like to put the blame on the developer. The Linux select() man page says

"The behavior of these macros is undefined if a descriptor value is less than zero or greater than or equal to FD_SETSIZE, which is normally at least equal to the maximum number of descriptors supported by the system."

That's misleading. Linux can open more than 1024 sockets. And the behavior is absolutely well defined: using unexpected values will ruin the running application. Instead of making the macros resilient to illegal values, the developers simply let them overwrite other structures. FD_SET is implemented as inline assembly(!) in the Linux headers and evaluates to a single assembler write instruction. Not the slightest bit of bounds checking happens anywhere.

To test your own application, you can artificially inflate the number of descriptors in use by programmatically opening FD_SETSIZE files or sockets right at the start of main(), and then running your application.
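One way to do that inflation, sketched under the assumption that opening /dev/null repeatedly is acceptable in a test build:

    #include <fcntl.h>
    #include <sys/select.h>

    /* Call this first thing in main() in a test build. Afterwards every
     * descriptor the application opens is >= FD_SETSIZE, so any unchecked
     * FD_SET() call misbehaves immediately instead of only under load. */
    static void burn_low_descriptors(void)
    {
        for (int i = 0; i < FD_SETSIZE; i++)
            open("/dev/null", O_RDONLY);
    }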

Thorsten79

Solution 5:

Mine was a hardware problem...

Back in the day, I used a DEC VaxStation with a big 21" CRT monitor. We moved to a lab in our new building, and installed two VaxStations in opposite corners of the room. Upon power-up, my monitor flickered like a disco (yeah, it was the '80s), but the other monitor didn't.

Okay, swap the monitors. The other monitor (now connected to my VaxStation) flickered, and my former monitor (moved across the room) didn't.

I remembered that CRT-based monitors were susceptible to magnetic fields. In fact, they were -very- susceptible to 60 Hz alternating magnetic fields. I immediately suspected that something in my work area was generating a 60 Hz alternating magnetic field.

At first, I suspected something in my work area. Unfortunately, the monitor still flickered, even when all other equipment was turned off and unplugged. At that point, I began to suspect something in the building.

To test this theory, we converted the VaxStation and its 85 lb monitor into a portable system. We placed the entire system on a rollaround cart, and connected it to a 100-foot orange construction extension cord. The plan was to use this setup as a portable field strength meter, in order to locate the offending piece of equipment.

Rolling the monitor around confused us totally. The monitor flickered in exactly one half of the room, but not the other. The room was in the shape of a square, with doors in opposite corners, and the monitor flickered on one side of a diagonal line connecting the doors, but not on the other side. The room was surrounded on all four sides by hallways. We pushed the monitor out into the hallways, and the flickering stopped. In fact, we discovered that the flicker only occurred in one triangular half of the room, and nowhere else.

After a period of total confusion, I remembered that the room had a two-way ceiling lighting system, with light switches at each door. At that moment, I realized what was wrong.

I moved the monitor into the half of the room with the problem, and turned the ceiling lights off. The flicker stopped. When I turned the lights on, the flicker resumed. Turning the lights on or off from either light switch turned the flicker on or off within that half of the room.

The problem was caused by somebody cutting corners when they wired the ceiling lights. When wiring up a two-way switch on a lighting circuit, you run a pair of wires between the SPDT switch contacts, and a single wire from the common on one switch, through the lights, and over to the common on the other switch.

Normally, these wires are bundled together. They leave as a group from one switchbox, run to the overhead ceiling fixture, and go on to the other box. The key idea is that all of the current-carrying wires are bundled together.

When the building was wired, the single wire between the switches and the light was routed through the ceiling, but the wires travelling between the switches were routed through the walls.

If all of the wires had run close and parallel to each other, the magnetic field generated by the current in one wire would have been cancelled out by the magnetic field generated by the equal and opposite current in a nearby wire. Unfortunately, the way the lights were actually wired meant that one half of the room was effectively inside a large, single-turn transformer primary. When the lights were on, the current flowed in a loop around that half of the room, and the poor monitor was basically sitting inside a large electromagnet.

Moral of the story: The hot and neutral lines in your AC power wiring are next to each other for a good reason.

Now, all I had to do was to explain to management why they had to rewire part of their new building...