Huffman coding and Huffman trees
Huffman coding is a form of lossless compression: it makes files smaller by using the frequency with which characters appear in a message, and the original data can be fully restored. This works particularly well when characters appear multiple times in a string, as these can then be represented using fewer bits. This reduces the overall size of a file.
Calculating compressed data bit requirements
The string BBC BITESIZE contains 12 characters, including the space. This could be saved using the 7-bit ASCII representation of each character:
| B | B | C | [SPACE] | B | I | T | E | S | I | Z | E |
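As a rough illustration, a few lines of Python can print the 7-bit ASCII bit pattern for each character in the string; the patterns come straight from each character's ASCII code:

```python
# Print the 7-bit ASCII bit pattern for every character in the string.
text = "BBC BITESIZE"

for ch in text:
    # ord() gives the ASCII code of the character; format it as 7 binary digits.
    print(ch, format(ord(ch), "07b"))
```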
To compress this, a programmer could create a reduced character set that lists all of the characters that have been used, each with a unique bit pattern:
| Character | Bit pattern |
|---|---|
| B | 1 |
| C | 01 |
| [SPACE] | 11 |
| I | 100 |
| T | 101 |
| E | 110 |
| S | 111 |
| Z | 1000 |
The problem with this occurs when the patterns are put together to form a longer bit pattern, as this creates ambiguous strings, for example:
101 could mean: BC or T
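This ambiguity can be demonstrated with a small Python sketch that uses the reduced character set above (with the space written as a space character) and finds every possible way of reading a bit string:

```python
# The reduced character set above is not prefix-free, so some bit strings
# can be split into code words in more than one way.
codes = {"B": "1", "C": "01", " ": "11", "I": "100",
         "T": "101", "E": "110", "S": "111", "Z": "1000"}

def decodings(bits):
    """Return every possible way of splitting `bits` into characters."""
    if bits == "":
        return [""]
    results = []
    for char, code in codes.items():
        if bits.startswith(code):
            results += [char + rest for rest in decodings(bits[len(code):])]
    return results

print(decodings("101"))   # ['BC', 'T'] -- the same bits give two readings
```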
Huffman coding makes it impossible to have a bit pattern that could be interpreted in more than one way. Using the BBC BITESIZE string, the algorithm would be as follows:
1. Calculate how many times each character appears in the string.
| Character | Frequency |
|---|---|
| B | 3 |
| C | 1 |
| [SPACE] | 1 |
| I | 2 |
| T | 1 |
| E | 2 |
| S | 1 |
| Z | 1 |
2. Put these in order of least to most frequent.
| Character | Frequency |
|---|---|
| C | 1 |
| [SPACE] | 1 |
| T | 1 |
| S | 1 |
| Z | 1 |
| I | 2 |
| E | 2 |
| B | 3 |
3. Place the two least common characters, along with their frequencies, in a binary tree. This is done by combining them into a new node and adding their individual frequencies together.
4. Update the table showing the combined characters.
| Character | Frequency |
|---|---|
| T | 1 |
| S | 1 |
| Z | 1 |
| I | 2 |
| E | 2 |
| C [SPACE] | 2 |
| B | 3 |
5. Take the next two least frequent characters and repeat steps 3 and 4, combining single characters before combined nodes of equal frequency, until all characters have been combined into one tree.
6. Next, follow each branch until the bottom is reached, recording a 0 for branches on the left and a 1 for branches on the right.
7. This creates a unique code for each character, which can be identified by following the path from the top of the tree down to that character. For example, the code for C is found by following the branches that lead down to C.
8. A set of optimal and completely unique bit patterns will have been created by following the tree until each individual character is reached.
| Character (frequency) | Bit pattern |
|---|---|
| B (3) | 10 |
| E (2) | 00 |
| I (2) | 110 |
| Z (1) | 010 |
| S (1) | 1111 |
| T (1) | 1110 |
| [SPACE] (1) | 0111 |
| C (1) | 0110 |
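The whole procedure can be sketched in a few lines of Python using a frequency counter and a priority queue (heapq). This is only an illustrative sketch: because several characters share the same frequency, the exact bit patterns it produces may differ from those in the table above, but the code lengths balance out and the compressed total is still 35 bits.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Step 1: count how many times each character appears.
    freq = Counter(text)

    # Steps 2-5: repeatedly combine the two least frequent nodes. Each heap
    # entry is (frequency, tie_breaker, tree); a tree is either a single
    # character or a [left, right] pair. The tie_breaker keeps the tuples
    # comparable when frequencies are equal.
    heap = [(count, i, ch) for i, (ch, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # least frequent node
        f2, _, right = heapq.heappop(heap)   # next least frequent node
        heapq.heappush(heap, (f1 + f2, next_id, [left, right]))
        next_id += 1

    # Steps 6-8: walk the tree, recording 0 for a left branch and 1 for a
    # right branch, to build the bit pattern for each character.
    codes = {}
    def walk(node, path):
        if isinstance(node, str):        # a leaf holding a single character
            codes[node] = path or "0"    # guard for a one-character string
        else:
            walk(node[0], path + "0")
            walk(node[1], path + "1")
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("BBC BITESIZE")
print(codes)
print("Total bits:", sum(len(codes[ch]) for ch in "BBC BITESIZE"))   # 35
```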
This particular technique is efficient because the letters that appear most frequently are assigned the shortest bit patterns. This keeps the memory used by the compressed file to a minimum while preventing any ambiguity between the bit patterns.
This means that we can calculate the memory needed as below:
| Character | Bit pattern | Times used | Total bits |
|---|---|---|---|
| B | 10 | 3 | 2*3=6 |
| E | 00 | 2 | 2*2=4 |
| I | 110 | 2 | 3*2=6 |
| Z | 010 | 1 | 3*1=3 |
| S | 1111 | 1 | 4*1=4 |
| T | 1110 | 1 | 4*1=4 |
| [SPACE] | 0111 | 1 | 4*1=4 |
| C | 0110 | 1 | 4*1=4 |
| Total number of bits | | | 35 |
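As a quick check, the total in the table can be reproduced by adding up the length of the bit pattern used for each character of the string. A small Python sketch, using the patterns from the table above:

```python
# Compressed size: add up the length of the bit pattern for each character.
patterns = {"B": "10", "E": "00", "I": "110", "Z": "010",
            "S": "1111", "T": "1110", " ": "0111", "C": "0110"}
text = "BBC BITESIZE"

total_bits = sum(len(patterns[ch]) for ch in text)
print("Total bits:", total_bits)   # 35
```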
How to calculate bits for uncompressed ASCII data
To determine the number of bits needed to store uncompressed data in ASCII, multiply the number of characters by 7 (the number of bits used for each ASCII character).
Example
The original string BBC BITESIZE would use 7-bit ASCII, ie 7 bits per character. There are 12 characters in the string BBC BITESIZE, which would result in a total of 12 x 7 = 84 bits. Huffman coding reduced the size to 35 bits, which means Huffman coding resulted in a saving of 84 – 35 = 49 bits.
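The same comparison can be written as a couple of lines of Python, assuming the Huffman total of 35 bits calculated above:

```python
text = "BBC BITESIZE"

ascii_bits = len(text) * 7    # 12 characters x 7 bits each = 84 bits
huffman_bits = 35             # total from the Huffman coding table above
print("Saving:", ascii_bits - huffman_bits, "bits")   # 84 - 35 = 49
```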