Reducing download time using prime numbers
An issue has cropped up recently in my programming work where I could greatly benefit from the expertise of proper mathematicians. The real-world problem is that apps often need to download huge chunks of data from a server, like videos and images, and users might not have great connectivity (say 3G) or they might be on an expensive data plan.
Instead of downloading the whole file, I've been trying to show that it's possible to download only a kind of 'reflection' of it, and then use the smartphone's computing power to accurately reconstruct the file locally using probability.
The way it works is like this:
- A file is broken into its bits (1,0,0,1 etc) and laid out in a predetermined pattern in the shape of a cube. Like going from front-to-back, left-to-right and then down a row, until complete. The pattern doesn't matter, as long as it can be reversed afterwards.
- To reduce file size, instead of requesting the whole cube of data (the equivalent of downloading the whole file), we only download 3 x 2D sides instead. These sides I'm calling the reflection for want of a better term. They have the contents of the cube mapped onto them.
- The reflection is then used to reconstruct the 3D cube of data using probability - kind of like asking a computer to do a huge three-dimensional Sudoku. Creating the reflection and reconstructing the data from it are computationally heavy, and as much as computers love doing math, I'd like to lighten their load a bit.
The way I'm picturing it is like a 10x10x10 transparent Rubik's cube. On three of the sides, light is shone through each row. Each cell it travels through has a predetermined value and is either on or off (binary 1 or 0). If it's on, it magnifies the light by its value; if it's off, it does nothing. All we see is the final amount of light exiting each row after travelling through the cube from the other side. Using the three sides, the computer needs to determine what values (either 1 or 0) are inside the cube.
At the moment I'm using ordinary prime numbers as the cell values to reduce computing time, but I'm curious to know if there is another type of prime (or another type of number entirely) that might be more effective. I'm looking for a series of values in which any observed sum can be formed in as few ways as possible from members of that series.
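To make the encoding step concrete, here is a minimal sketch of how I picture it, assuming the weight of a cell is the $k$-th prime, where $k$ is the cell's position along the axis the light travels (Python, with purely illustrative names; not my production code):

```python
# Sketch of the encoding step: collapse an n x n x n bit cube into three
# n x n "reflections" of prime sums, one per axis. The weight of a cell is
# the k-th prime, where k is its index along the axis being traversed.

def first_primes(n):
    """Return the first n primes by trial division (fine for small n)."""
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def reflections(bit_cube):
    """bit_cube[x][y][z] is 0 or 1; returns three n x n grids of prime sums."""
    n = len(bit_cube)
    primes = first_primes(n)
    along_z = [[sum(primes[z] for z in range(n) if bit_cube[x][y][z])
                for y in range(n)] for x in range(n)]
    along_y = [[sum(primes[y] for y in range(n) if bit_cube[x][y][z])
                for z in range(n)] for x in range(n)]
    along_x = [[sum(primes[x] for x in range(n) if bit_cube[x][y][z])
                for z in range(n)] for y in range(n)]
    return along_x, along_y, along_z

# Example: a tiny 3x3x3 cube with a single 'on' cell at (0, 0, 1)
cube = [[[0] * 3 for _ in range(3)] for _ in range(3)]
cube[0][0][1] = 1
print(reflections(cube)[2][0][0])   # 3, the second prime, seen on the z-facing side
```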
Here is a rough diagram:
It might help to imagine that light shines in at the green arrows, and exits with some value at the red arrows. This happens for each row, in each of the three directions. We're left with only three 2D sides of numbers, which are used to reconstruct what's inside the cube.
If you look where the 14 exits on the left, it can have two possible combinations (3 + 11 and 2 + 5 + 7). If for argument's sake we were to assume it were 3 and 11 (coloured green), then we could say that at the coordinates where the 3 and 11 sit there are active cells (magnifying the light by their value). In terms of data, we would say these are on (binary 1).
In practice we can rarely say for certain (for 2 and 3 we could) what value an inside cell has based on its reflection on the surface, so a probability for each possibility is assigned to that coordinate or cell. Some numbers will never appear on the surface, like 1, 4 or 6, since they can't be composed as a sum of distinct primes.
The same happens in the vertical direction, where the output is 30. That sum has multiple possible combinations, two of which are consistent with the possibility shown in the horizontal direction with an exit of 14; these are coloured blue and pink (since they pass through the 23, the same cell as the 3 in the horizontal direction).
This probability is also added to that coordinate and we repeat in the front-to-back direction, doing the same a final time. After this is done for each cell in the whole cube, we have a set of three probabilities that a cell is either on or off. That is then used as the starting point to see if we can complete the cube. If it doesn't 'unlock' we try a different combination of probabilities and so forth until we have solved the 3D Sudoku.
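To make the 'possible combinations' step concrete, here is a small sketch (illustrative Python, not my actual implementation) that lists every subset of a row's primes adding up to an observed exit value; these lists are what the probabilities get built from:

```python
from itertools import combinations

def decompositions(target, primes):
    """All subsets of `primes` (one per cell in the row) summing to `target`."""
    results = []
    for r in range(len(primes) + 1):
        for combo in combinations(primes, r):
            if sum(combo) == target:
                results.append(combo)
    return results

# For the example row with primes 2..29 and an exit value of 14:
row_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
print(decompositions(14, row_primes))   # [(3, 11), (2, 5, 7)]
```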
The final step of the method, once the cube is solved, is to pull the binary information out of it and arrange it in the reverse of the pattern used to lay it out on the server. This can then be cut up (say for multiple images) or used to create a video or something. You could *cough* in theory *cough* download something like Avatar 3D (280GB) in around 3 minutes on decent wifi. Solving it (never mind building the pixel output) would take a while though, and this is where I'm curious about using an alternative to prime numbers.
You might have guessed that my maths ability goes off a very steep cliff beyond routine programming stuff. There are three areas of concern / drawbacks to this method:
- it is rubbish at low levels of data transfer. A 10 x 10 x 10 cube for instance has a larger 'surface area' than volume. That's because while each cell can hold one bit (either 1 or 0), each surface cell needs a minimum of 8 bits (one character is one byte, or 8 bits). We can't even have 'nothing', since we need null to act as a placeholder to keep the structure intact. This also accounts for why, in the above diagram, a 1000x1000x1000-cell cube has its surface areas multiplied by 4 characters (the thousandth prime is 7919 - 4 characters) and the 10'000(cubed)-cell cube has its surface areas multiplied by 6 characters (the 10'000th prime is 104729 - six characters). The aim is to keep the total character length on the 2D sides to a minimum. Using letters could work, as we could go from a-Z with 52 symbols before paying double bubble for the next character (the equivalent of "10"). There are 256 possible byte values (extended ASCII), so that's the upper limit there. A rough sanity check of this surface-versus-volume arithmetic is sketched after this list.
- the number of possible combinations is still too high using prime numbers. Is there a series of numbers that are both short in character length (to avoid the problem above) and whose sums have very few possible decompositions? I'm leaning towards some subset of primes, but lack the maths to know which - some sort of inverted Fibonacci? The fewer the possible combinations, the faster the computer will solve the cube.
- I haven't tested yet whether it's possible to use a third, fourth or nth side to increase either the capacity of the cube or the accuracy of the reflection. Using, say, an octahedron (yellow below) instead of a cube might be better, it just hurts the brain a little to picture how it might work. I'm guessing it would tend towards a sphere, but that's beyond my ability.
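Here is the promised sanity check of the surface-versus-volume arithmetic (a rough sketch in Python; it assumes the worst-case row sum is the sum of the first n primes and that every face cell is stored as decimal characters at 8 bits each):

```python
# Back-of-the-envelope check of the "surface vs volume" concern above.

def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

for n in (10, 100, 1000):
    payload_bits = n ** 3                      # one bit per cell in the cube
    max_sum = sum(first_primes(n))             # largest possible row sum
    digits = len(str(max_sum))                 # characters per face cell
    surface_bits = 3 * n * n * digits * 8      # three faces, 8 bits per character
    print(n, payload_bits, surface_bits, surface_bits / payload_bits)
```

For n = 10 the three faces cost several times more bits than the cube itself holds, and only around n = 1000 does the ratio drop well below 1.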
EDIT:
Thank you for your helpful input. Many answers have referred to the pigeonhole principle and the issue of too many possible combinations. I should also correct that a 1000³ cube would require 7 digits, not 4 as I stated above, since the sum of the first 1000 primes is 3682913. I should also emphasise that the idea isn't to compress in the common sense of the word, as in taking pixels out of an image, but more like sending blueprints on how to build something, relying only on the computation and knowledge at the receiving end to fill in the blanks. There is more than one correct answer, and I will mark as correct the one with the most votes. Many thanks for the detailed and patient explanations.
Solution 1:
Universal lossless compression is impossible, so this cannot work. There is no injection from the set of binary strings of length $n$ into the set of binary strings of length at most $n-1$.
There are $2^{n}$ binary strings of length $n$. Let $S_{k}$ denote the set of all binary strings of length at most $k$. When $k=n-1$, we have $|S_{n-1}| = 1+2+2^{2}+\cdots+2^{n-1}=2^{n}-1$.
Since $2^{n} > |S_{n-1}|$, the pigeonhole principle implies that there is no injective function from the set of all binary strings of length $n$ to any $S_{k}$ with $k<n$. This means that a perfect lossless compression scheme such as the one described is impossible, since it would constitute such a function.
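As a toy illustration of the counting (a quick enumeration in Python, not part of the original argument):

```python
# For n = 4 there are more 4-bit strings than strings of length 0..3 combined,
# so no injective "compressor" into strictly shorter strings can exist.
from itertools import product

n = 4
inputs = [''.join(bits) for bits in product('01', repeat=n)]
shorter = [''.join(bits) for k in range(n) for bits in product('01', repeat=k)]
print(len(inputs), len(shorter))   # 16 vs 15: at least two inputs must collide
```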
Furthermore, for any number $n$ there exists a string of length $n$ whose Kolmogorov-Chaitin complexity is at least $n$.
Suppose this were not the case. Then you would have some function $g : P \to X$, where $p_{x} \in P$ is a program encoding a string $x \in X$ with complexity $C(x) < |x|$, such that $g(p_{x}) = x$ and $|p_{x}| < |x|$. We also know that $x \neq y \implies p_{x} \neq p_{y}$. Again by the pigeonhole principle, if all $2^{n}$ strings of length $n$ had a program of length less than $n$ (and there can only be $2^{n}-1$ such programs), then some program would have to produce two different strings; otherwise the larger set isn't covered. This is absurd, so at least one of these strings has complexity at least its own length.
Edit: In response to the questions about lossy compression, this scheme seems unlikely to be ideal for the vast majority of cases. You have no way of controlling where and how it introduces errors. Compression schemes such as H.264 leverage the inherent visual structure of an image to achieve better fidelity at lower sizes. The proposed method would indiscriminately choose a solution to the system of equations it generates, which would likely introduce noticeable artifacts. I believe it's also possible for these systems to be overdetermined, using unnecessary space, whereas a DCT is an orthogonal decomposition. The solution of these equations would also probably have to be found using a SAT/SMT solver, whose runtime would make decompression very costly if not intractable, while DCT-based techniques are well studied and optimized for performance, though I am not an expert on any of these subjects by any means.
Solution 2:
Full credit for an excellent question on which you have spent a lot of creative energy. This is a comment rather than an answer because it is too long for the usual comment box.
You have an interesting idea and please do continue to investigate it, but did you know data compression has a very wide research base already? You may want to investigate in case your idea, or something similar, is already available. There is a Wikipedia article on compression as an entry point to the subject.
A few observations are:
- Sound, movie and picture files in use by most applications are all highly compressed, usually (but not always) using imperfect algorithms with an associated loss of data. Formats like zip, jpeg, mp3, mov and even png are therefore not 'ready to go on arrival' but sometimes need extensive processing before use. Algorithms have developed a lot over the last 50+ years and in some cases are highly sophisticated.
- In general, compression exploits regularity in the original data (like a photo, where large parts of the image may be similar, or text, where the message may only use a small number of different words) and works less well or not at all when file data are truly random.
- In some applications, data loss can be acceptable, for example in a picture or in music where the eye or ear cannot really see or hear small imperfections, and in these cases very high levels of compression can be obtained (e.g. 90% reduction in size). But for many purposes you need guaranteed perfect results, as loss of accuracy simply won't be acceptable, for instance in a computer program file or in text. The data can often still be compressed, exploiting existing regularity, but a saving is not guaranteed in every case.
A final comment on these observations: applying compression twice is rarely effective, because regular data, once compressed, becomes much less regular and therefore less susceptible to further compression. Trying to compress a jpg or mp3 file will not usually save anything and may even cost space.
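This last point is easy to try for yourself (a small Python experiment, with zlib standing in for any general-purpose lossless compressor; exact sizes will vary):

```python
import os
import zlib

# Compressing regular data helps; compressing the result again does not.
text = b"the quick brown fox jumps over the lazy dog " * 1000
once = zlib.compress(text)
twice = zlib.compress(once)
print(len(text), len(once), len(twice))   # second pass saves little or nothing

# Truly random data barely compresses at all (it often grows slightly).
random_data = os.urandom(50000)
print(len(random_data), len(zlib.compress(random_data)))
```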
Solution 3:
I'll start off by confessing that I didn't read the entire description of your compression scheme in detail. I don't really need to, since what you're trying to accomplish is an input-agnostic lossless data compression scheme, and that's impossible.
Theorem 1: There is no lossless compression scheme that maps every file of length $n$ bits to a file of length less than $n$ bits.
The proof follows easily from the pigeonhole principle and the fact that there are $2^n$ distinct files of $n$ bits but only $2^n-1$ distinct files of less than $n$ bits. This implies that any compression scheme mapping $n$-bit files to files of less than $n$ bits must map at least two input files to the same output file, and thus cannot be lossless.
Theorem 2: There is no lossless compression scheme for which the average length of the output files is shorter than the average length of the input files (with averages taken over all possible input files at most $n$ bits long, for any non-negative integer $n$).
This stronger theorem is slightly trickier to prove. To keep it short, I'll assume you're familiar with basic set-theoretic notation, such as $R \cup S$ for the union, $R \cap S$ for the intersection and $R \setminus S = \{x \in R : x \notin S\}$ for the difference of the sets $R$ and $S$.
We do need a bit of extra notation. Specifically, let $\operatorname{len}(x)$ denote the length of the file $x$, and let $\operatorname{len}(S) = \sum_{x \in S} \operatorname{len}(x)$ denote the summed total length of all files in the set $S$. Thus the average length of files in $S$ is $\frac{\operatorname{len}(S)}{|S|}$, where $|S|$ is the number of files in $S$. Also $\operatorname{len}(R \cup S) = \operatorname{len}(R) + \operatorname{len}(S)$ for any two disjoint sets of files $R$ and $S$.
Now let $f$ be a lossless compression scheme, i.e. a function mapping files to files, let $A$ be the set of all files of length at most $n$ bits, and let $B = \{f(x) : x \in A\}$ be the set of files obtained by applying $f$ to each file in $A$. Since $f$ is lossless, it must map each input file in $A$ to a distinct output file; thus $|B| = |A|$, and thus also $|A \setminus B| = |B \setminus A|$.
Since $A$ is the set of all files at most $n$ bits long, it follows that $\operatorname{len}(x) ≤ n$ if and only if $x \in A$. Thus $$\operatorname{len}(A \setminus B) ≤ n \cdot |A \setminus B| = n \cdot |B \setminus A| ≤ \operatorname{len}(B \setminus A)$$ (where the second inequality is strict unless $A = B$, implying that $A \setminus B = B \setminus A = \emptyset$).
Furthermore, since $A \cap B$ is disjoint from both $A \setminus B$ and $B \setminus A$, $$\begin{aligned}\operatorname{len}(A) &= \operatorname{len}(A \cap B) + \operatorname{len}(A \setminus B) \\ &≤ \operatorname{len}(A \cap B) + \operatorname{len}(B \setminus A) \\ &= \operatorname{len}(B).\end{aligned}$$ And since $|A| = |B|$, as noted above, we thus have $\frac{\operatorname{len}(A)}{|A|} ≤ \frac{\operatorname{len}(B)}{|B|}$ (where, again, the inequality is strict unless $A = B$).
Ps. So how do real-world lossless compression schemes work, then? Simply put, they rely on the fact that most "natural" files (containing uncompressed image or sound data, program code, text, etc.) tend to be highly repetitive and quite unlike a random file of similar length. Basically, those "natural" files form only a small subset of all files of their length, and clever compression software can identify such repetitive files and shorten them, while (slightly) increasing the length of other, more random-looking input files.
It's also possible to do this in a way that is guaranteed to never increase the length of any file by more than a small fixed number of bits even in the worst case. In fact, it's possible to take any lossless compression scheme $f$ and turn it into a scheme $f'$ that is nearly as efficient and only ever increases the length of any input file by at most one bit: simply define $$f'(x) = \begin{cases} (0, f(x)) & \text{if } \operatorname{len}(f(x)) < \operatorname{len}(x) \\ (1, x) & \text{otherwise,} \end{cases}$$ where $(b, x)$ denotes prepending the bit $b$ to the file $x$.
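In code, the $f'$ construction looks something like the following sketch (using zlib as the underlying scheme $f$; the flag is stored as a whole byte rather than a single bit purely to keep the example byte-aligned, which is slightly wasteful):

```python
import zlib

# Sketch of the f' construction: emit a flag plus either the compressed or
# the original data, whichever is shorter, so the output is never much
# longer than the input.

def compress_never_much_worse(data: bytes) -> bytes:
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return b'\x00' + compressed   # flag 0: payload is compressed
    return b'\x01' + data             # flag 1: payload is stored verbatim

def decompress(blob: bytes) -> bytes:
    flag, payload = blob[:1], blob[1:]
    return zlib.decompress(payload) if flag == b'\x00' else payload

sample = b"abracadabra " * 100
assert decompress(compress_never_much_worse(sample)) == sample
```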
Solution 4:
As others have pointed out, this cannot work for general reasons.
However, you may be more interested in understanding where your analysis for this specific scheme breaks down.
Consider a cube of $n^3$ bits. In each row there are $2^n$ different combinations, so if you can store $k$ values for each entry on one of the three sides, you can at best reduce the number of possibilities to $2^n/k$.
In your example with $n=1000$ and using primes, you have to be able to store values not just up to 7919, but up to the sum of the first 1000 primes, which requires 7 decimal digits. In any case, that gives at most 10000000 different values to encode $2^{1000}\approx 10^{300}$ different combinations, so in the best case one value on a side corresponds to about $10^{293}$ different combinations of bits in that row. Even projecting in the different directions won't make a significant difference.
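A quick way to see the size of the gap (a back-of-the-envelope check in Python, added for illustration):

```python
import math

# One 7-digit value on a face carries at most log2(10^7) bits of information,
# while each row of a 1000-cube contains 1000 bits.
bits_per_face_value = math.log2(10 ** 7)
print(bits_per_face_value)            # about 23.3 bits per face value
print(math.log10(2 ** 1000) - 7)      # about 294: combinations per value, as a power of 10
```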
If you want to make this work, you need many more digits to store your sums, or you have to go to a very high dimension rather than 3. In both cases the required extra data to transmit will equal or exceed the original amount of data.
Solution 5:
Extrapolating from what you said, using prime numbers will hurt data compression, not help it.
Suppose we have a sequence of 8 bits: ($b_0, b_1, b_2, b_3, b_4, b_5, b_6, b_7$).
We could make a product of prime numbers to encode that sequence of bits: $x = 2^{b_0} × 3^{b_1} × 5^{b_2} × 7^{b_3} × 11^{b_4} × 13^{b_5} × 17^{b_6} × 19^{b_7}$.
But now the range of possible values of $x$ is $[1, 9699690]$, which would take 24 bits to represent in standard notation. This is far worse than our original 8 bits.
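Concretely (a small Python illustration of the point above):

```python
# Encoding 8 bits as a product of the first 8 primes, as described above.
primes = [2, 3, 5, 7, 11, 13, 17, 19]

def encode(bits):
    x = 1
    for bit, p in zip(bits, primes):
        if bit:
            x *= p
    return x

worst = encode([1] * 8)
print(worst, worst.bit_length())   # 9699690 and 24 bits, versus the original 8
```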
Products of primes can be used for theoretical math proofs though: https://en.wikipedia.org/wiki/G%C3%B6del_numbering