Huffman optimal code

I can solve similar problem with 4 symbols. But I cannot solve it when the number of symbols is more than 4. However, here is the probability of the symbols:

enter image description here

I need to develop optimal code with the Huffman method.


The tree which I got looks like this: enter image description here


Here is how I reduced the symbols: enter image description here


Just follow the algorithm: combine the least frequent pair, and keep doing so. Here the least frequent are $X_1$ and $X_{11}$; they combine to make a new node $\{X_1,X_{11}\}$ with weight $0,01+0,02=0,03$. Now your list of nodes looks like this, after I reorder it by weight:

$$\begin{array}{rcc} X_i:&X_8&\{X_1,X_{11}\}&X_7&X_5&X_3&X_{10}&X_{12}&X_2&X_6&X_9&X_4\\ p_i:&0,03&0,03&0,04&0,05&0,06&0,07&0,08&0,1&0,11&0,2&0,23 \end{array}$$

The smallest two are $X_8$ and $\{X_1,X_{11}\}$, so we combine them to make a new node $\{X_1,X_8,X_{11}\}$ of weight $0,03+0,03=0,06$, and we now have the following list:

$$\begin{array}{rcc} X_i:&X_7&X_5&X_3&\{X_1,X_8,X_{11}\}&X_{10}&X_{12}&X_2&X_6&X_9&X_4\\ p_i:&0,04&0,05&0,06&0,06&0,07&0,08&0,1&0,11&0,2&0,23 \end{array}$$

Now combine $X_7$ and $X_5$ to make a node $\{X_5,X_7\}$ of weight $0,09$, reducing the list to this:

$$\begin{array}{rcc} X_i:&X_3&\{X_1,X_8,X_{11}\}&X_{10}&X_{12}&\{X_5,X_7\}&X_2&X_6&X_9&X_4\\ p_i:&0,06&0,06&0,07&0,08&0,09&0,1&0,11&0,2&0,23 \end{array}$$

The next step combines $X_3$ with $\{X_1,X_8,X_{11}\}$, and the one after that combines $X_{10}$ and $X_{12}$ to produce the list

$$\begin{array}{rcc} X_i:&\{X_5,X_7\}&X_2&X_6&\{X_1,X_3,X_8,X_{11}\}&\{X_{10},X_{12}\}&X_9&X_4\\ p_i:&0,09&0,1&0,11&0,12&0,15&0,2&0,23 \end{array}$$

At this stage your your graph looks like this:

       *         *         *         *            *         *        *
      / \       X2        X6        / \          / \       X9       X4
     /   \                         /   \        /   \  
    X5   X7                       *     *      X10  X12
                                 X3    / \        
                                      /   \  
                                     *     *  
                                    / \   X8  
                                   /   \  
                                  *     *  
                                 X1    X11

Can you finish it now? Just keep on doing the same thing until it becomes a single tree. There are seven separate pieces now, so that will take six more steps. Once you have the tree, we can worry about how to get the actual Huffman code from it.

Added: There are two possible trees; here’s a rough sketch of one of them.

enter image description here


This algorithm is very simple, as the other posts point out. Doing your example on paper takes almost as long as writing a program, so here is the program.

First, some sample runs including your example.

$ ./huffman.pl 0.1 0.1 0.1 0.4 0.3
X01 0.100000 1110
X02 0.100000 1111
X03 0.100000 110
X04 0.400000 0
X05 0.300000 10

$ ./huffman.pl 0.01 0.1 0.06 0.23 0.05 0.11 0.04 0.03 0.2 0.07 0.02 0.08
X01 0.010000 011110
X02 0.100000 1111
X03 0.060000 0110
X04 0.230000 10
X05 0.050000 11101
X06 0.110000 010
X07 0.040000 11100
X08 0.030000 01110
X09 0.200000 00
X10 0.070000 1100
X11 0.020000 011111
X12 0.080000 1101

$ ./huffman.pl 15 7 6 6 5
scaling by a factor of 39 at ./huffman.pl line 35.
X01 0.384615 0
X02 0.179487 111
X03 0.153846 101
X04 0.153846 110
X05 0.128205 100

The algorithm follows, implemented in Perl. There is not much to say about it: start with a forest of singleton trees and iteratively keep merging the two with the smallest sum value, recording the cumulative sums. Traverse the resulting tree with the path to a leaf giving the Huffman code of that leaf.

#! /usr/bin/perl -w
#

sub buildcode {
    my ($cref, $pref, $t) = @_;

    if(exists($t->{label})){
      $cref->{$t->{label}} = $pref;
      return;
    }

    buildcode($cref, $pref . '0', $t->{left});
    buildcode($cref, $pref . '1', $t->{right});
}


MAIN: {
    my @freq = @ARGV;

    die "need at least one symbol "
      if scalar(@freq) == 0;

    my $n = scalar(@freq);

    my $total = 0;
    for(my $pos=0; $pos<$n; $pos++){
      my $val = $freq[$pos];
      die "not a decimal number: $val"
          if $val !~ /^\d+(\.\d*)?$/;

      $total += $freq[$pos];
    }

    if(abs(1-$total) > 1e-12){
      warn "scaling by a factor of $total";

      for(my $pos=0; $pos<$n; $pos++){
          $freq[$pos] /= $total;
      }
    }

    my @pool;

    for(my $pos=0; $pos<$n; $pos++){
      push @pool, 
      { sum => $freq[$pos], label => "X" . ($pos+1), };
    }

    @pool = sort { $a->{sum} <=> $b->{sum} } @pool;

    while(scalar(@pool) >= 2){
      my ($ma, $mb);

      $ma = shift @pool; $mb = shift @pool;

      my $node = {
          sum => $ma->{sum} + $mb->{sum},
          left => $ma, right => $mb
      };

      my $pos;
      for($pos = 0; $pos<scalar(@pool); $pos++){
          last if $node->{sum} < $pool[$pos]->{sum};
      }
      splice @pool, $pos, 0, $node;
    }

    my $code = {};
    buildcode $code, '', $pool[0];

    for(my $pos=0; $pos<$n; $pos++){
      printf "X%02d %05f %s\n", $pos+1,
      $freq[$pos], $code->{'X' . ($pos+1)};
    }

    1;
}