Mathematically, why was the Enigma machine so hard to crack?
Solution 1:
I strongly disagree with the other answer, which trivializes the contributions of Alan Turing and his group, as well as those of the Polish mathematicians who first worked on the problem. Of course increasing the numbers makes the problem harder, but even with a modern computer a brute-force attack on the often-cited 150 million million combinations (this number actually varied throughout the war and across configurations of the Enigma machine) would be a tall order. With computers not even existing at the start of the war, there was no chance at all of attacking the problem by brute force.
Therefore new methods to reduce the possible number of combinations had to be developed. One example is Banburismus, a statistical method developed by Alan Turing. This was a method which allowed excluding many possibilities even without any "cribs", i.e., known plaintext parts of the message.
The basic idea is that, given two natural language strings, they will share many more letters in the same positions than two random strings would. This is due to the fact that certain letters are much more probable than others. For instance, an example from the linked page:
HereisthefirstmessageofapairNothingspecialjustordinaryEnglishtext
BelowitanotherAgainyoucanseethatitismerelyarandomexamplemessage
- - - - - - - - -
Here we have nine matching letters or overlaps. With two random strings, we would expect an average of two or three matches for a message of this length.
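This overlap count is easy to verify in a couple of lines (a minimal sketch; the comparison is case-insensitive and runs over the length of the shorter message):

```python
# Count position-by-position letter matches between the two example
# messages above. zip() stops at the shorter string automatically.
def count_matches(s1: str, s2: str) -> int:
    return sum(a.lower() == b.lower() for a, b in zip(s1, s2))

m1 = "HereisthefirstmessageofapairNothingspecialjustordinaryEnglishtext"
m2 = "BelowitanotherAgainyoucanseethatitismerelyarandomexamplemessage"

print(count_matches(m1, m2))  # -> 9
```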
A crucial insight then was that this property is preserved even if both messages are enciphered through the Enigma machine. This allowed the development of an attack which could discard many combinations on statistical grounds.
The description of the method makes it clear that the beginnings of information theory, formally established by Claude Shannon in his paper "A Mathematical Theory of Communication" in 1948, are already present in these ideas. Saying that it was just a matter of building a fast enough machine to try all combinations is an insult to these theoretical achievements.
Just for fun, here's a bit of math on the letter frequency problem. Let $X_1 \in \{ 1, 2, \ldots, 26 \}$ be a random variable encoding a letter of the first message (1 for A, 2 for B, ..., 26 for Z), and similarly $X_2$ a letter of the second message. Since the messages are independent, so are these random variables. Let $p \in \mathbb R^{26}$ denote the vector $(P(X_1 = 1), \ldots, P(X_1 = 26))$, i.e., the letter frequencies of our language.
Then the probability that one position contains the same letter in both messages is $$ P(X_1=X_2) = \sum_{i=1}^{26} P(X_1 = i \land X_2=i) = \sum_{i=1}^{26} P(X_1 = i) P(X_2=i) = \| p \|_{\ell_2}^2. $$ (Here we implicitly assumed that each letter is independent from the next, which is not quite true. Using digram or trigram frequencies would be more accurate.)
If our "language" is random, i.e., $p=(\frac 1{26}, \ldots, \frac 1{26})$, we get that the above probability for a matching letter is $\frac 1{26}$. For our 63-letter message, we would thus expect around $63/26 \approx 2.4$ matching letters.
In fact this is the lowest possible probability: since the probabilities in $p$ must sum up to 1, we obtain from the Cauchy-Schwarz inequality that $$ 1 = \sum_{i=1}^{26} p_i \le \sqrt{26} \|p\|_{\ell_2}, $$ and hence $$ \|p\|_{\ell_2}^2 \ge \frac 1 {26}. $$
Using the letter frequencies for English given on Wikipedia, we can compute that for the English language, we have $$ \|p\|_{\ell_2}^2 = 0.0655 > 0.0385 = \frac 1 {26}. $$ I assume that by taking di- and trigrams into account, this number would rise even more.
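The number 0.0655 can be reproduced in a few lines. This sketch uses the approximate percentage frequencies from the Wikipedia letter-frequency table (rounded values, so the last digit may differ slightly):

```python
# Compute ||p||^2 from approximate English letter frequencies
# (values in percent, from the Wikipedia table; rough estimates).
freq_percent = {
    'a': 8.2, 'b': 1.5, 'c': 2.8, 'd': 4.3, 'e': 12.7, 'f': 2.2,
    'g': 2.0, 'h': 6.1, 'i': 7.0, 'j': 0.15, 'k': 0.77, 'l': 4.0,
    'm': 2.4, 'n': 6.7, 'o': 7.5, 'p': 1.9, 'q': 0.095, 'r': 6.0,
    's': 6.3, 't': 9.1, 'u': 2.8, 'v': 0.98, 'w': 2.4, 'x': 0.15,
    'y': 2.0, 'z': 0.074,
}
total = sum(freq_percent.values())
p = [v / total for v in freq_percent.values()]  # normalize to a probability vector

p_norm_sq = sum(x * x for x in p)  # ||p||^2 = per-position match probability
print(round(p_norm_sq, 4))         # roughly 0.065, well above 1/26 = 0.0385
```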
Update: By downloading a text version of "War and Peace" from Project Gutenberg and massaging it in IPython, I found very similar letter distributions to the ones given on Wikipedia, resulting in $\|p\|_{\ell_2}^2 \approx 0.0659$.
Then I extracted pairs of random 60-letter strings from the text and counted the number of matches between them. Repeating this experiment 1,000,000 times, I found an average probability for a match of 0.0659 per letter. So it seems that the simple first order estimate $\|p\|_{\ell_2}^2$ is actually a very good approximation to the probability of matching letters.
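The same experiment can be reproduced with synthetic text: draw pairs of 60-letter strings i.i.d. from a letter distribution $p$ and compare the empirical match rate with $\|p\|_{\ell_2}^2$. (A sketch using the approximate Wikipedia frequencies rather than the War and Peace data, so the numbers come out near 0.0655 rather than 0.0659.)

```python
# Monte Carlo check: for i.i.d. letters, the expected per-position
# match rate between two random strings is exactly ||p||^2.
import random

random.seed(0)

freq = {'a': 8.2, 'b': 1.5, 'c': 2.8, 'd': 4.3, 'e': 12.7, 'f': 2.2,
        'g': 2.0, 'h': 6.1, 'i': 7.0, 'j': 0.15, 'k': 0.77, 'l': 4.0,
        'm': 2.4, 'n': 6.7, 'o': 7.5, 'p': 1.9, 'q': 0.095, 'r': 6.0,
        's': 6.3, 't': 9.1, 'u': 2.8, 'v': 0.98, 'w': 2.4, 'x': 0.15,
        'y': 2.0, 'z': 0.074}
letters = list(freq)
weights = list(freq.values())
total = sum(weights)
p_norm_sq = sum((w / total) ** 2 for w in weights)

trials, length, matches = 20000, 60, 0
for _ in range(trials):
    s1 = random.choices(letters, weights=weights, k=length)
    s2 = random.choices(letters, weights=weights, k=length)
    matches += sum(a == b for a, b in zip(s1, s2))

rate = matches / (trials * length)
print(rate, p_norm_sq)  # the empirical rate lands very close to ||p||^2
```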
Solution 2:
In modern computer cryptography, large numbers are one of the most important factors. That's why 40-bit encryption was considered adequate security back in 1995, but today (with 512-bit encryption available on almost any security device) it would be considered a joke.
My reading of the history of cracking Enigma is that failures in the use and implementation of Enigma by the Germans, combined with effective intelligence gathering, were the most critical factors in enabling the codebreakers to succeed. There wasn't much to the algorithm itself; it just had a huge number of combinations.
It's like the Rubik's Cube: it's just a finite non-abelian group, but the huge order of that group is what makes it impossible to simply write down the entire Cayley table and read off the minimum number of moves to a solution.
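For concreteness, the order of the Rubik's Cube group follows from the standard counting argument: 8! corner arrangements times 3^8 orientations, 12! edge arrangements times 2^12 flips, divided by 12 for the twist, flip, and parity constraints.

```python
# Order of the Rubik's Cube group via the standard counting argument.
from math import factorial

# Raw count: corner positions * corner twists * edge positions * edge flips,
# divided by 3 (total twist) * 2 (total flip) * 2 (permutation parity) = 12.
order = factorial(8) * 3**8 * factorial(12) * 2**12 // 12
print(order)  # -> 43252003274489856000, about 4.3 * 10**19
```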
The reason the large "bombas" were constructed in the cracking of Enigma was to speed up the process, because there were so many possible combinations. Which again brings us back to modern computer cryptography, where the computing power available determines the length of time it would take to brute force decryption of a given key size, so using the longest practical key is critical.
This quote from Marian Rejewski, one of the Polish codebreakers who worked on Enigma, basically says (to me), "the Germans increased the number of combinations which makes our job a lot harder":
we quickly found the [wirings] within the [new rotors], but [their] introduction [...] raised the number of possible sequences of drums from 6 to 60 [...] and hence also raised tenfold the work of finding the keys. Thus the change was not qualitative but quantitative. We would have had to markedly increase the personnel to operate the bombs, to produce the perforated sheets (60 series of 26 sheets each were now needed, whereas up to the meeting on July 25, 1939, we had only two such series ready) and to manipulate the sheets.
Solution 3:
The answer to the question "Mathematically, why was the Enigma machine so easy to crack?":
The first major weakness was that the same settings were used for a whole day. Some messages are always easier to crack than others; with a full day on the same settings, breaking a single exceptionally easy (and likely quite unimportant) message was enough to read every message of that day.
The second major weakness was that in every state, the Enigma machine produced an Enigma permutation (a product of 13 disjoint cycles of length 2), which made it accessible to mathematical attacks.
The third major weakness was that the huge number of possible settings could be split into separate, easier problems: it was possible to crack the rotor settings first and then the plugboard settings on their own. (The same mistake was made in DVD encryption, where a 40-bit key could be split into a 25-bit and a 16-bit part that could be cracked separately.)
Solution 4:
In every possible state, an Enigma machine produces a permutation of the letters A to Z. And not just any permutation, but one that exchanges pairs of letters, for example in a certain setting it might exchange A and Q, B and F, C and R and so on. Such a permutation, which consists entirely of cycles of length 2, is called an "Enigma Permutation".
After each letter was transmitted, the machine state changed in a deterministic way, so a different Enigma permutation was used for the next letter. Importantly, the codebreaker can be assumed to know all the encrypted messages, because they were simply broadcast over radio.
Rejewski's theorem says: "The composite of any two Enigma permutations consists of disjoint cycles in pairs of equal lengths". Since the cycle lengths sum to 26, you might have, for example, two cycles of length 1, two of length 5, and two of length 7. Another quite simple theorem says that a permutation M and its conjugate S M S^-1 have the same cycle structure.
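Both theorems are easy to check experimentally. This sketch (not Rejewski's proof) builds random Enigma permutations as 13 disjoint transpositions, composes two of them, and confirms that the cycle lengths come in pairs and that conjugating by a random "plugboard" permutation S leaves the cycle structure unchanged:

```python
# Empirical check of Rejewski's theorem and the conjugation theorem.
import random

random.seed(1)

def random_involution(n=26):
    """13 random disjoint transpositions, i.e. an Enigma permutation."""
    items = list(range(n))
    random.shuffle(items)
    perm = [0] * n
    for a, b in zip(items[::2], items[1::2]):
        perm[a], perm[b] = b, a
    return perm

def cycle_lengths(perm):
    """Sorted list of cycle lengths of a permutation given as a list."""
    seen, lengths = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        length, x = 0, start
        while x not in seen:
            seen.add(x)
            x = perm[x]
            length += 1
        lengths.append(length)
    return sorted(lengths)

A, B = random_involution(), random_involution()
M = [B[A[i]] for i in range(26)]        # composite: apply A, then B
lengths = cycle_lengths(M)
print(lengths)                          # each length occurs an even number of times

S = list(range(26))
random.shuffle(S)                       # a random "plugboard" permutation
S_inv = [0] * 26
for i, s in enumerate(S):
    S_inv[s] = i
conj = [S[M[S_inv[i]]] for i in range(26)]  # the conjugate S M S^-1
print(cycle_lengths(conj) == lengths)   # -> True: same cycle structure
```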
For the first few years, every transmission started by setting the machine into a fixed start state (known to sender and receiver, but not to the codebreaker); then the sender would pick a random three-letter code and transmit it twice, and sender and receiver would use that three-letter code to change the machine settings. After that, each message was sent with different machine settings.
All machines sent the first six letters with the same permutations P1, P2, P3, P4, P5, P6. If the sender transmitted ABC ABC and the receiver received RST XYZ, then permutation P1 exchanges A and R, and P4 exchanges A and X. The cracker knows R and X, but not A. However, the cracker does know that the permutation P1 P4 maps R to X, because P1 (being an involution) maps the known R back to the unknown A, and P4 maps the unknown A to the known X. Given enough messages with different random keys ABC, you gather enough information to reconstruct the complete permutation P1 P4, and likewise P2 P5 and P3 P6.
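This reconstruction of P1 P4 from intercepted indicators can be simulated directly (a sketch with randomly chosen involutions standing in for the real, secret P1 and P4):

```python
# Simulate the indicator attack: fix secret Enigma permutations P1 and P4,
# encrypt many random message keys, and reconstruct the composite P1 P4
# purely from the intercepted first and fourth ciphertext letters.
import random

random.seed(2)

def random_involution(n=26):
    """13 random disjoint transpositions, i.e. an Enigma permutation."""
    items = list(range(n))
    random.shuffle(items)
    perm = [0] * n
    for a, b in zip(items[::2], items[1::2]):
        perm[a], perm[b] = b, a
    return perm

P1, P4 = random_involution(), random_involution()

composite = {}
while len(composite) < 26:
    a = random.randrange(26)    # first letter of a random message key
    c1, c4 = P1[a], P4[a]       # ciphertext letters at positions 1 and 4
    composite[c1] = c4          # the cracker observes only c1 and c4

# The reconstructed map agrees with P4(P1(.)), since P1 is an involution.
print(all(composite[i] == P4[P1[i]] for i in range(26)))  # -> True
```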
By Rejewski's theorem, each of these permutations consists of cycles in pairs of equal lengths, with all lengths adding up to 26, i.e., the lengths of one cycle from each pair adding up to 13. That gives a pattern; in the example above the pattern would be (1, 5, 7). And this pattern depends only on the initial rotor settings; it is independent of the plugboard (the second of the two theorems mentioned).
With the initial 3 rotors, which could be installed in 3! = 6 orders, and 26 x 26 x 26 = 17,576 initial rotor rotations, there were 6 x 17,576 = 105,456 possible initial settings, each of which produced 3 patterns for the permutations P1 P4, P2 P5, and P3 P6.
What the Polish mathematicians did was create an index: for each of the 105,456 initial positions, over months of work, they found the 3 patterns associated with that position. With that index it was easy to find the initial rotor settings for a day after intercepting about 100 messages. Not all settings created unique patterns, but usually a pattern would be produced by only one or very few initial settings. Some rotor settings were bad news; for example, there are 313 rotor settings producing three pairs of cycles of length 13.
The plugboard settings could also be discovered: Knowing the initial rotor settings, we can determine the permutation P1' P4' that would have happened without the plugboard. We also have the permutation P1 P4 that was produced with the plugboard. These can be compared, and we have the same information for P2' P5' vs. P2 P5 and P3' P6' vs P3 P6. That was usually enough to determine the plugboard settings.
So initially the Poles were able to decipher messages by hand. This stopped working when the transmission method changed (the three-letter key was no longer transmitted twice) and when the 3 rotors were replaced by 5, giving 60 possible rotor orders.
Later methods relied substantially on guessing messages or parts of messages. Since the same initial settings were used for a whole day, cracking just one message meant every message of the day was cracked; if none was, the whole day's traffic stayed unreadable. A good source of guessable messages were weather reports: since they were not very secret, they were transmitted across the country with an easily cracked code, so the exact weather reports could be recovered. The exact same texts were then sent to submarines under the Enigma cipher, which made them a good source of known plaintext.
Later, a fourth rotor was introduced for top-secret messages. That would have been uncrackable at the time, except that the settings for the four-rotor machines were the same as for the three-rotor machines, plus one extra ring in 26 possible positions. So after cracking the three-rotor code, just 26 more attempts were needed to crack the four-rotor machine.