Huffman optimal code
I can solve similar problem with 4 symbols. But I cannot solve it when the number of symbols is more than 4. However, here is the probability of the symbols:
I need to develop optimal code with the Huffman method.
The tree which I got looks like this:
Here is how I reduced the symbols:
Just follow the algorithm: combine the least frequent pair, and keep doing so. Here the least frequent are $X_1$ and $X_{11}$; they combine to make a new node $\{X_1,X_{11}\}$ with weight $0,01+0,02=0,03$. Now your list of nodes looks like this, after I reorder it by weight:
$$\begin{array}{rcc} X_i:&X_8&\{X_1,X_{11}\}&X_7&X_5&X_3&X_{10}&X_{12}&X_2&X_6&X_9&X_4\\ p_i:&0,03&0,03&0,04&0,05&0,06&0,07&0,08&0,1&0,11&0,2&0,23 \end{array}$$
The smallest two are $X_8$ and $\{X_1,X_{11}\}$, so we combine them to make a new node $\{X_1,X_8,X_{11}\}$ of weight $0,03+0,03=0,06$, and we now have the following list:
$$\begin{array}{rcc} X_i:&X_7&X_5&X_3&\{X_1,X_8,X_{11}\}&X_{10}&X_{12}&X_2&X_6&X_9&X_4\\ p_i:&0,04&0,05&0,06&0,06&0,07&0,08&0,1&0,11&0,2&0,23 \end{array}$$
Now combine $X_7$ and $X_5$ to make a node $\{X_5,X_7\}$ of weight $0,09$, reducing the list to this:
$$\begin{array}{rcc} X_i:&X_3&\{X_1,X_8,X_{11}\}&X_{10}&X_{12}&\{X_5,X_7\}&X_2&X_6&X_9&X_4\\ p_i:&0,06&0,06&0,07&0,08&0,09&0,1&0,11&0,2&0,23 \end{array}$$
The next step combines $X_3$ with $\{X_1,X_8,X_{11}\}$, and the one after that combines $X_{10}$ and $X_{12}$ to produce the list
$$\begin{array}{rcc} X_i:&\{X_5,X_7\}&X_2&X_6&\{X_1,X_3,X_8,X_{11}\}&\{X_{10},X_{12}\}&X_9&X_4\\ p_i:&0,09&0,1&0,11&0,12&0,15&0,2&0,23 \end{array}$$
At this stage your your graph looks like this:
* * * * * * *
/ \ X2 X6 / \ / \ X9 X4
/ \ / \ / \
X5 X7 * * X10 X12
X3 / \
/ \
* *
/ \ X8
/ \
* *
X1 X11
Can you finish it now? Just keep on doing the same thing until it becomes a single tree. There are seven separate pieces now, so that will take six more steps. Once you have the tree, we can worry about how to get the actual Huffman code from it.
Added: There are two possible trees; here’s a rough sketch of one of them.
This algorithm is very simple, as the other posts point out. Doing your example on paper takes almost as long as writing a program, so here is the program.
First, some sample runs including your example.
$ ./huffman.pl 0.1 0.1 0.1 0.4 0.3 X01 0.100000 1110 X02 0.100000 1111 X03 0.100000 110 X04 0.400000 0 X05 0.300000 10 $ ./huffman.pl 0.01 0.1 0.06 0.23 0.05 0.11 0.04 0.03 0.2 0.07 0.02 0.08 X01 0.010000 011110 X02 0.100000 1111 X03 0.060000 0110 X04 0.230000 10 X05 0.050000 11101 X06 0.110000 010 X07 0.040000 11100 X08 0.030000 01110 X09 0.200000 00 X10 0.070000 1100 X11 0.020000 011111 X12 0.080000 1101 $ ./huffman.pl 15 7 6 6 5 scaling by a factor of 39 at ./huffman.pl line 35. X01 0.384615 0 X02 0.179487 111 X03 0.153846 101 X04 0.153846 110 X05 0.128205 100
The algorithm follows, implemented in Perl. There is not much to say about it: start with a forest of singleton trees and iteratively keep merging the two with the smallest sum value, recording the cumulative sums. Traverse the resulting tree with the path to a leaf giving the Huffman code of that leaf.
#! /usr/bin/perl -w # sub buildcode { my ($cref, $pref, $t) = @_; if(exists($t->{label})){ $cref->{$t->{label}} = $pref; return; } buildcode($cref, $pref . '0', $t->{left}); buildcode($cref, $pref . '1', $t->{right}); } MAIN: { my @freq = @ARGV; die "need at least one symbol " if scalar(@freq) == 0; my $n = scalar(@freq); my $total = 0; for(my $pos=0; $pos<$n; $pos++){ my $val = $freq[$pos]; die "not a decimal number: $val" if $val !~ /^\d+(\.\d*)?$/; $total += $freq[$pos]; } if(abs(1-$total) > 1e-12){ warn "scaling by a factor of $total"; for(my $pos=0; $pos<$n; $pos++){ $freq[$pos] /= $total; } } my @pool; for(my $pos=0; $pos<$n; $pos++){ push @pool, { sum => $freq[$pos], label => "X" . ($pos+1), }; } @pool = sort { $a->{sum} <=> $b->{sum} } @pool; while(scalar(@pool) >= 2){ my ($ma, $mb); $ma = shift @pool; $mb = shift @pool; my $node = { sum => $ma->{sum} + $mb->{sum}, left => $ma, right => $mb }; my $pos; for($pos = 0; $pos<scalar(@pool); $pos++){ last if $node->{sum} < $pool[$pos]->{sum}; } splice @pool, $pos, 0, $node; } my $code = {}; buildcode $code, '', $pool[0]; for(my $pos=0; $pos<$n; $pos++){ printf "X%02d %05f %s\n", $pos+1, $freq[$pos], $code->{'X' . ($pos+1)}; } 1; }