The idea behind Huffman encoding is to translate text or other data to a tree that will serve as a compression algorithm for its data. It reduces the memory usage of characters by assigning the most frequent characters to use a smaller number of bits, and less frequent characters longer bitstrings. This idea of using different bitlengths for different characters of the same set is called variable length encoding.
A prefix in an encoding is the idea where some character’s bitstring exists at the beginning of another characters bitstring. If we have prefixes in our encoding, we would need to add spaces or some way to split characters such that they do not mix with each other. However, since we are trying to find a binary representation, we want only two representative states, so we want our encoding to be prefix-free. A function that maps a set S to distinct bitstrings such that no two elements are prefixes of one other is called a prefix code. Clearly, this type of encoding will make our string unambiguous.
Huffman coding creates a prefix code based on frequencies. If $S$ is our alphabet, then for any element $x \in S$, we define $f_x$ to be the frequency of that letter in our string. That is, if our string has $n$ characters, we expect our string to have $nf_x$ elements that are equal to $x$.
An optimal prefix code is as efficient as possible, or, it minimizes the average number of bits per letter.
Tree Approach
In order to approach this using a binary tree and making sure our encoding is prefix free, we need to make sure that every one of our letters is a leaf. That way, we can encode our characters as each time they go left(0) or right(1). This way, we can also say that the length of our encoding for a certain character $x$ is just the distance from the root note to $x$(denoted in a binary tree as $depth_T(v)$).
Another thing that is important about our encoding is that it is crucial that our binary tree be full. If our tree is not full, then we are leaving out potential optimization capability in our encoding by assigning a longer encoding to a character for which it is not necessary.
Huffman’s Approach
We begin by assuming we know the structure of the tree, and we only need to figure out how to label the leaves accordingly. We can use the following claim.
Suppose that $u$ and $v$ are leaves of $T$ such that $depth(u) < depth(v)$. Then, $f_u \geq f_v$, assuming $u’s$ letter is $u$ and $v’s$ letter is $v$
Another important thing to note is that the leaves assigned at the same depth do not matter. Therefore, if two leaves exist at the same depth, we could essentially switch their characters/encodings and it would be the same, our algorithm would still be optimal.
To actually get the structure of the tree, we need another lemma.
There is an optimal prefix code corresponding to $T$ that has the two lowest frequency letters assigned to sibling leaves
Using this idea, what we can do is take the two lowest frequency leaf nodes and combine them to make their parent a pseudo-leaf node, that has a frequency of the sum of its children’s frequencies.
Huffman(S)
If |S| = 2 then
Encode one letter using 0 and the other using 1
Else
Let y, z be the two lowest frequency letters
Replace y and z with a new letter w of frequency fw = fy+fz
Recursively construct a prefix code for this set with T
Define a prefix code as such:
Start with T
Take the leaf w and add y, z, as its two children
Running Time
Since we are combining at least two nodes into a single node every iteration, we have at most $n-1$ iterations. Since identifying the lowest frequency elements of a set can be done in $O(n)$ time, the total runtime is $\Theta(n^2)$. However, clearly this implementation follows a minheap very closely. Therefore, if we use a minheap where the keys are the frequencies, we can extract the min in $O(lgn)$ time. Then, we insert a new node that has both these minnodes as children into the minheap with a frequency of their sum. Using a heap, we get a total runtime of $\Theta(nlgn)$.