Two days ago I felt very uncomfortable with Big O notation. I've already asked two questions:

  1. Why to calculate "Big O" if we can just calculate number of steps?
  2. The main idea behind Big O notation

And now almost everything has become clear. But there are a few questions that I still can't understand:

  1. Suppose we have an algorithm that runs in $1000n$ steps. Why do people say that the $1000$ coefficient becomes insignificant when $n$ gets really large (and that's why we can throw it away)? This really confuses me, because no matter how large $n$ is, $1000n$ is still going to be $1000$ times bigger than $n$. And that is very significant (in my head). Any examples of why it is considered insignificant as $n$ tends to infinity would be appreciated.

  2. Why is Big O said to estimate the running time in the worst case? Given a running time of $O(n)$, how is that worst-case behavior? I mean, in this case we think that our algorithm is not slower than $n$, right? But in reality the actual running time could be $1000n$, and that is indeed slower than $n$.

  3. According to the definition, Big O gives us a scaled upper bound of $f$ as $n \to +\infty$, where $f$ is our function of time. But how do we interpret it? I mean, given an algorithm running in $O(n)$, we will never be able to calculate the actual number of steps this algorithm takes. We just know that if we double the size of the input, we double the computation time as well, right? But if that $O(n)$ algorithm really takes $1000n$ steps, then we also need to multiply the size of the input by $1000$ to be able to visualise how it grows, because $1000n$ and $n$ have very different slopes. Thus, in this case, if you just double the computation time for the doubled size of the input, you're going to get the wrong idea about how the running time grows, right? So how then do you visualise its growth rate?

I want to add that I know the definition of Big O and how to calculate it, so there is no need to explain it. Also, I've already googled all these questions tons of times with no luck. I'm learning calculus at the moment, so I hope I asked this question in the right place. Thank you in advance!


Reading between the lines, I think you may be misunderstanding Big O analysis as being a replacement for benchmarking. It's not. An engineer still needs to benchmark their code if they want to know how fast it is. And indeed in the real world, an $\mathcal{O}(n)$ algorithm might be slower than an $\mathcal{O}(n^2)$ algorithm on real-world data sets.

But, as $n$ approaches infinity, the $\mathcal{O}(n^2)$ algorithm will ultimately be slower. For the sake of example, if we allow constant factors in our Big-O notation, then an $\mathcal{O}(10n)$ algorithm will take more "steps" than an $\mathcal{O}(n^2)$ algorithm, if $n$ is less than $10$ ($10\cdot 10 = 10^2$).

But if $n$ is $100$, then the $\mathcal{O}(n^2)$ algorithm takes ten times as long. If $n$ is $1000$, it takes a hundred times as long. As $n$ grows, so too does this difference. The manner in which the two algorithms differ is what we are analyzing when we use Big O analysis.

Hopefully that example also makes it clear why the constant factor is irrelevant and can be ignored. Whether it's ten, a hundred, a thousand, a million, or a quadrillion ultimately does not matter, because as $n$ approaches infinity, the $\mathcal{O}(n^2)$ algorithm is eventually going to be slower anyway. That's the power of quadratic growth over linear growth.
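To make that crossover concrete, here is a small Python sketch (my own illustration, not part of the original argument) that finds the point where $n^2$ overtakes $c \cdot n$ for a few constants $c$; it always happens just past $n = c$:

def crossover(c):
    # Smallest n at which n*n finally exceeds c*n.
    n = 1
    while c * n >= n * n:
        n += 1
    return n

for c in (10, 1000, 1_000_000):
    print(c, crossover(c))   # prints 11, 1001, 1000001 -- just past n = c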

The crux of it is that Big O analysis is a mathematical concept which does NOT tell an engineer how fast something is going to run or how many steps it's going to take. That's what benchmarking is for. Big O analysis is still a very helpful tool in algorithm design, but if you're interested in exactly how long something takes to run, you'll need to look elsewhere.


Algorithms do not have a "speed". How fast a given algorithm runs in practice depends on the way it is implemented and on the hardware and support software it runs on. So the speed of an algorithm does not depend just on the algorithm itself, and thus it is extremely difficult to analyze mathematically. For a given input, Algorithm A may run ten times faster than Algorithm B on one kind of computer, but only half as fast as B on another type.

So the "speed" of an algorithm is an ill-defined theoretical concept - in virtually all circumstances it can only be determined experimentally. Instead, the Big-O notation analyzes a performance-related property of algorithms that can be expressed mathematically: not its speed, but how well its performance scales. That means, Big-O does not tell you how fast an algorithm is in absolute terms, but only how its "speed" changes when you change the problem size (e.g. the number of elements to sort) . For example, for an O(n²) algorithm, doubling the problem size will increase the runtime four-fold while doubling the problem size for an O(n) algorithm will only double the runtime.

As a tool that is only supposed to answer questions about scalability, not absolute performance, Big-O can ignore constant factors, because they do not change the scaling behavior: an O(n²) algorithm will quadruple its runtime when you double its problem size, no matter what the original runtime was. For example, for a problem size of n=1000, one O(n²) algorithm may take one microsecond while a different O(n²) algorithm may take two hours. Big-O notation is not concerned with these absolute differences. It does tell you, however, that both will approximately quadruple their runtime when you double the problem size to n=2000 - the faster of these two example algorithms from one microsecond to four microseconds, and the slower one from two hours to eight hours.
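A rough Python sketch of that point (my own, with made-up constant factors): two quadratic step-count functions with wildly different constants both quadruple when the problem size doubles.

fast_steps = lambda n: 1 * n * n            # hypothetical "fast" O(n^2) algorithm
slow_steps = lambda n: 7_000_000 * n * n    # hypothetical "slow" O(n^2) algorithm

# Absolute step counts differ enormously...
print(fast_steps(1000), slow_steps(1000))
# ...but doubling the problem size scales both by the same factor of 4.
print(fast_steps(2000) / fast_steps(1000), slow_steps(2000) / slow_steps(1000))   # 4.0 4.0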

With that limitation in mind, Big-O is not a replacement for speed measurements, but it is easier to obtain and useful enough in its own right to base certain decisions on without requiring experimental measurements.


You are right: the $O$ notation only tells you how big the running time is to within a constant factor. If you want something more accurate than that, do not use $O$. The reason that $O$ notation is widely known and widely used is that it has been found useful over the years. Of course, in some settings other estimates may be more useful than $O$.


I think this has all been addressed before (on mse even!), but perhaps not in one answer, so I'll say a̶ ̶f̶e̶w̶ ̶w̶o̶r̶d̶s̶ (so much for "a few" :P) here.

For (1), I think you're misunderstanding the relevance of "dropping" constant factors and lower order terms. When we analyze the runtime of an algorithm, we're doing it in some model of how computers work, and there's a variety of choices. We typically work with some version of the Word-RAM model, but there's a smattering of low-level choices you can make. The notion of computation is extremely robust, but the notion of runtime isn't. We want to know that our notion of runtime doesn't care about which precise model we use, because different models model different facets of "real world" computers. Depending on which model you use, the precise constants in your runtime might change by constant factors, or might pick up lower order terms. This is analogous to not being able to tell exactly how many steps a computation will take to run, since that depends on the precise hardware you're running it on, as well as the specifics of how the algorithm is implemented (rather than the algorithm itself). What we can say, however, is that "up to constant factors and lower order terms", the runtime is consistent. So we throw the constants and lower order terms away, because they can be influenced by hardware or implementation details.

It still requires some justification, though, to know that throwing away constant factors "doesn't really matter", and this is why people often talk about how $1000n$ "isn't that much bigger" than $n$ whenever $n$ is extremely large. And I do mean extremely. If we take $n \approx 10^{10,000}$, then $1000n \approx 10^{10,003}$. The difference only shows up in the exponent, nudging it from $10,000$ to $10,003$, which really is tiny compared to the rest of the numbers involved.

Also, despite what they tell you in introductory programming classes, the constants really do matter. We can't tell exactly what they are (for the reasons I mentioned above), but you can tell if your constants are $\approx 1000$ or if they're $\approx 2$, and you're entirely right in thinking those are different numbers!

In the real world, if you have a quadratic algorithm with a small constant and a linear algorithm with a huge constant, it's often better to use the quadratic algorithm, because we'll almost never be running our code on inputs of size $10^{10,000}$. In fact, we're already doing this: there are a bunch of algorithms whose theoretical big-O is smaller than that of the algorithm used in practice, yet the theoretical algorithm is never used because the constants involved are so massive. These are called galactic algorithms, and maybe knowing they exist will help you feel like you're not going crazy worrying about these things.
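To put some (made-up) numbers on that trade-off, here's a small Python sketch comparing a hypothetical quadratic algorithm with constant $2$ against a hypothetical linear algorithm with constant $1,000,000$:

quadratic_steps = lambda n: 2 * n * n      # small constant, quadratic growth
linear_steps = lambda n: 1_000_000 * n     # huge constant, linear growth

for n in (1_000, 100_000, 1_000_000):
    winner = "quadratic" if quadratic_steps(n) < linear_steps(n) else "linear"
    print(n, winner)
# The quadratic algorithm wins until 2*n*n == 1_000_000*n, i.e. n = 500_000;
# past that point the linear algorithm wins forever, whatever the constants.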


For (2), "worst case" refers to the worst of all possible inputs. For instance, say we want to write code that checks if a list has an even number in it. There's an obvious algorithm to do this

def has_even(numbers):
    for x in numbers:
        if x % 2 == 0:      # found an even entry; stop early
            return True
    return False

On the list $[2,3,4,5,6]$ this algorithm runs super quickly, because the first element is already even. On the list $[1,3,5,7,9,11,13,14]$ it runs comparatively slowly, because it has to wait until the very end to see whether an even number exists or not.
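If you want to see the difference in work directly, here is a hypothetical instrumented variant of the function above (my own sketch) that also reports how many loop iterations it performed:

def has_even_counted(numbers):
    steps = 0
    for x in numbers:
        steps += 1
        if x % 2 == 0:
            return True, steps
    return False, steps

print(has_even_counted([2, 3, 4, 5, 6]))              # (True, 1): best case, stops immediately
print(has_even_counted([1, 3, 5, 7, 9, 11, 13, 14]))  # (True, 8): has to scan the whole list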

This does not have to do with $1000n$ being bigger than $n$. Remember that what we really care about is the runtime up to constants, because if you use a really crummy model of computation, or if you're really unlucky with which computer you use, you might pick up extra constant factors in your runtime without actually changing your algorithm at all.

This is in contrast to "average case" runtime, for instance. Here, instead of looking at the max of all the possible runtimes, we look at the average of the possible runtimes. As an example where this matters, you can show that quicksort takes $\approx n^2$ many steps in the worst case, but $\approx n \log n$ steps on average.
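You can observe that gap yourself with a small experiment. The Python sketch below (my own, using a deliberately naive first-element pivot) counts how many elements get compared against a pivot when sorting an already-sorted list versus a shuffled one:

import random

def quicksort_comparisons(a):
    # Naive quicksort with a first-element pivot; returns the number of
    # comparisons against pivots rather than the sorted list.
    if len(a) <= 1:
        return 0
    pivot = a[0]
    left = [x for x in a[1:] if x < pivot]
    right = [x for x in a[1:] if x >= pivot]
    return (len(a) - 1) + quicksort_comparisons(left) + quicksort_comparisons(right)

n = 500
print(quicksort_comparisons(list(range(n))))              # sorted input: ~n^2/2, about 125,000
print(quicksort_comparisons(random.sample(range(n), n)))  # shuffled input: roughly n log n, a few thousand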


For (3), there's a substantial difference between, say, $n^2$ and $n \log n$ and $n$. Knowing how the runtime scales as your input scales is still a major factor in deciding which algorithm to use. For instance, to use the kinds of numbers you're worried about, say we multiply our input size by $10,000$. Does the runtime go up by a factor of $100,000,000$ (the $n^2$ case), by a factor of $10,000$ plus an extra additive term of $40,000n$ (the $n \log n$ case, taking logs base $10$), or by a factor of exactly $10,000$ (the $n$ case)? Obviously this matters a lot, and you can see that for inputs of size $1,000,000$ (which is tiny compared to some input sizes that real companies like Google work with), an $n^2$ algorithm is going to be slower than a $1000n$ algorithm.
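As a quick numerical check (my own sketch, using base-10 logs to match the numbers above), here is how those three growth rates respond when the input size is multiplied by $10,000$:

import math

def growth_factor(f, n, scale=10_000):
    # How much the runtime function f grows when the input is scaled up.
    return f(scale * n) / f(n)

n = 1_000_000
print(growth_factor(lambda m: m * m, n))              # 100,000,000  (the n^2 case)
print(growth_factor(lambda m: m * math.log10(m), n))  # ~16,667: 10,000 times an extra 10/6  (the n log n case)
print(growth_factor(lambda m: m, n))                  # 10,000  (the n case)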


I hope this helps ^_^


The usefulness of the comparison is this: no matter how you change the "parameters", you still arrive at the same general growth behavior. Yes, $1000n$ is $1000$ times larger than $n$. That is obvious. But what happens if you upgrade your CPU or run it on two threads? Then what? Well, REGARDLESS, the algorithms will STILL behave in the same way: $n$ and $3n$ and $9n$ and $0.3432n$ all behave in the same way, regardless of where the process runs, because the process depends linearly on $n$.

What $O(n)$ tells us is how the time the algorithm needs grows as it has to gobble up more and more data.

E.g., suppose you had an algorithm A that takes in a data set S and gives out T. We can represent that as T = SA, where we can use reverse Polish notation to express any combination of algorithms, e.g. T = SAB+RBAAA, where you can interpret the operations however you like (just read it from left to right).

Then that algorithm's absolute performance might change with different CPUs, say, but if the runs are all in the same class under $O(n)$ equivalence, they all run "the same" (the same under the equivalence, not absolute equality).

Think about it even more delicately: let $A_k$ be the algorithm at its "$k$th step". Then another algorithm processing the exact same data would be at its own $k$th step, $B_k$.

Then $O(n)$ tells us that $A_{k+1}/A_k$ and $B_{k+1}/B_k$ are in the same class. $O(n)$ is really an equivalence class for a whole family of functions that all grow at the same functional rate.

E.g., suppose that $A_k$ takes 20 seconds to go to state $A_{k+1}$ and 15 seconds to go from state $A_{k+1}$ to $A_{k+2}$. Suppose we have a faster computer that runs A, which we will call B. Now $B_k$ takes 15 seconds to go to state $B_{k+1}$ and 11.25 seconds to go from state $B_{k+1}$ to $B_{k+2}$.

Then $15/20 = 0.75 = 11.25/15$.

That is, the same relative proportion of work is required by both machines. We compare just the algorithms, not the hardware they are running on.

Now, in reality, $O(n)$ is not exactly what I have described above, but it follows the same idea: we want to be able to say two things are the same even when they are not.

Why? Because in many problems we do not need to know the exact behavior of a function, only how it generally behaves. E.g., in convergence problems we may only want to prove that something converges; in that case we don't need to know about local behavior. If we have a dominating function, and we know that substituting the dominating function does not change the limit, then we can compute limits much more easily. Quick! What is the limit of $\frac{1}{x}\left(3\cos(x) + \frac{\cos(3x)}{x}\right)$ as $x \to \infty$? An application of mental $O(n)$-nastics quickly tells you that it is $0$. Rather than worry about all the junk sitting next to the $1/x$, we can just use $1/x$, because as far as the limit is concerned they give the same result (the limit sends them to the same equivalence class).
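Spelled out (a short added step of my own), the bound behind that mental shortcut is just the squeeze: since the factor in parentheses stays bounded,

$$\left|\frac{1}{x}\left(3\cos(x) + \frac{\cos(3x)}{x}\right)\right| \;\le\; \frac{3}{x} + \frac{1}{x^2} \;\longrightarrow\; 0 \quad \text{as } x \to \infty,$$

so the limit is $0$ no matter what the bounded "junk" does along the way.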

$\lim_{x \to \infty} 3/x$ is exactly the same as $\lim_{x \to \infty} 2/x$. Just like $3x + 4$ is the "exact same" expression as $2x + \cos(30.43\pi)$: they are both linear expressions. They are EXACT under that level of abstraction.

Here, we are doing the exact same thing, but with function growth rates. Not their exact growth rates, but their basic "shape". E.g., we can think of all lines as growing at the same rate, and the same for all parabolas. Even though $3x^2$ technically grows faster than $x^2$, we consider them the same because the ratio between them is constant, just as in the linear case. Of course we can then combine classes, such as linear and quadratic terms: $x^2 + 3x$. Since we are concerned with large $x$, and $x^2$ dominates $3x$ (we can substitute any more dominant form and still get a dominant form), we can ignore the "submissive" parts. E.g., $x^2 + 3x \equiv x^2 \equiv 3x^2 \equiv x^2 + \cos(3x) \equiv \dots$
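To make that concrete with one added inequality (my own step): for every $x \ge 1$,

$$x^2 \;\le\; x^2 + 3x \;\le\; x^2 + 3x^2 \;=\; 4x^2,$$

so $x^2 + 3x$ is squeezed between two constant multiples of $x^2$, which is exactly what it means for it to land in the same class as $x^2$.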

To recap: if you take our algorithm A, it has a certain exact running time that one could plot on a graph. But if you upgrade the hardware, that graph will change. It will probably just look like a scaled version of the same graph, though.

But say you now run it on a different OS, and that OS does a different type of scheduling, so your graph doesn't look quite the same. BUT the same general pattern exists as before... at least for long run times!

So $O(n)$ notation lets us talk about these issues more precisely. It's a way to classify the set of functions into a useful partition/equivalence relation and then to use that classification to find more useful consequences.

In your mind it is "Why are we making things so complex just to talk about run-time performance? It makes no sense! Algorithm A takes 10 seconds to compute S and algorithm B takes 34.432 seconds to compute S! Algorithm A is better!". But that is not a useful metric. Why is A better? Just because it ran faster in your example? Maybe it is not really faster, but just has better performance on small run times (small data sets) because it uses caching, and then it becomes the worst algorithm on the planet. Then what? Knowing that A actually dominates B in the $O(n)$ sense tells us that B performs better in the long run.

By thinking in terms of $O(n)$ you are thinking about more general ways that functions relate (it is sort of how, in topology, you can think about how different functions are equivalent under homeomorphic transformations). You are also thinking more "futuristically", in that your "hardware" will upgrade automatically. E.g., if you design algorithms based on $O(n)$ fundamentals, then you can know whether your algorithms are terrible or not no matter what machine they run on. As machines become faster, the algorithms will still have the same "rank" relative to where they were before. So you are comparing algorithms on their overall design, not on specifics that can change at any moment.

It's sort of like the inverse of a Taylor series. With Taylor series we add terms to get more and more precision. With $O(n)$ we do the opposite: we drop terms and keep only the most dominant one. E.g., $1 \to 1 + 2n \to 1 + 2n + n^2$ versus $1 + 2n + n^2 \to 2n + n^2 \to n^2$.

While the technical definition and usage have more "legalese", the key is getting down the idea that this is something different and more useful. Unfortunately, the definitions are usually given without explaining why it was useful to create them, and most people are then taught the definitions RATHER than the experiences that led up to them. [E.g., you are given little or no context, which means you are almost surely not going to understand it unless you happen to have some experiences that let you.]

A large part of mathematics involves the need to compare things in a "vague" way. That is, there is some overlap that we want to understand, but the way we currently represent it makes it difficult to discuss. Machinery is then created (transformations are built) to turn the problem into a different way of thinking about it.

I.e., once you realize that comparing the exact performance of algorithms (which is just long-term behavior, mathematically) is futile, you still need some way to compare them. The need to compare never went away, only the realization arrived that the representation was inadequate. Hence a new representation is required, and therefore a new language, and that means definitions and their terms. When that is done for algorithms, we eventually arrive at the concept of simply representing (talking about) algorithms in terms of their dominant "long-term behavior". We do not care about initial fluctuations or minor aberrations... just the general long-term behavior.

The definitions go about defining precisely what they mean by that.