Fundamentals of data representation - AQAHuffman coding and Huffman trees

All data is represented as binary digits, whether it is numbers, text, images or sound. Calculations are also made in binary.

Part ofComputer ScienceComputational thinking and problem solving

Huffman coding and Huffman trees

Huffman coding is a form of which makes files smaller using the frequency with which characters appear in a message. This works particularly well when characters appear multiple times in a as these can then be represented using fewer . This reduces the overall size of a file.

Calculating compressed data bit requirements

The string BBC BITESIZE uses 12 different characters - including the space. This could be saved using the 7 bit representation of each character:

BBCBITESIZE
B
B
C
B
I
T
E
S
I
Z
E

To compress this, a programmer could create a reduced character set that lists all of the characters that have been used, each with a unique :

B1
C01
[SPACE]11
I100
T101
E110
S111
Z1000
B
1
C
01
[SPACE]
11
I
100
T
101
E
110
S
111
Z
1000

The problem with this occurs when these are put together to form a longer bit pattern as it creates ambiguous strings, for example:

101 could mean: BC or T

Huffman coding makes it impossible to have a bit pattern that could be interpreted in more than one way. Using the BBC BITESIZE string, the would be as follows:

1. Calculate how many times each character appears in the string.

B3
C1
[SPACE]1
I2
T1
E2
S1
Z1
B
3
C
1
[SPACE]
1
I
2
T
1
E
2
S
1
Z
1

2. Put these in order of least to most frequent.

C1
[SPACE]1
T1
S1
Z1
I2
E2
B3
C
1
[SPACE]
1
T
1
S
1
Z
1
I
2
E
2
B
3

3. Place the two least common characters along with their frequency in a . This is done by combining these in a new node and adding the individual frequencies of each character together.

Binary tree with two characters combined

4. Update the table showing the combined characters.

T1
S1
Z1
I2
E2
C _2
B3
T
1
S
1
Z
1
I
2
E
2
C _
2
B
3

5. Take the next two frequent characters and repeat steps 3 and 4 using single letters first until all characters are combined.

Binary tree with all characters combined

6. Next, follow each branch until the bottom is reached, recording a 0 for branches on the left and a 1 for branches on the right.

Binary tree with all branches on the left labelled with a 0 and all branches on the right labelled with a 1

7. This creates a unique code for each character which can be identified by following various paths down the tree. For example, the code for C can be found by moving down the path below:

Binary tree with path which creates the code 0 1 1 0

8. A set of optimal and completely unique bit patterns will have been created by following the tree until each individual character is reached.

B (3)10
E (2)00
I (2)110
Z (1)010
S (1)1111
T (1)1110
[SPACE] (1)0111
C (1)0110
B (3)
10
E (2)
00
I (2)
110
Z (1)
010
S (1)
1111
T (1)
1110
[SPACE] (1)
0111
C (1)
0110

This particular technique is efficient because the letters that appear most frequently are assigned the shortest bit patterns. This makes the memory used for the compressed file optimal while preventing any confusion between the binary patterns.

This means that we can calculate the memory needed as below:

CharacterBit patternTimes usedTotal bits
B1032*3=6
E0022*2=4
I11023*2=6
Z01013*1=3
S111114*1=4
T111014*1=4
[SPACE]011114*1=4
C011014*1=4
Total number of bits35
CharacterB
Bit pattern10
Times used3
Total bits2*3=6
CharacterE
Bit pattern00
Times used2
Total bits2*2=4
CharacterI
Bit pattern110
Times used2
Total bits3*2=6
CharacterZ
Bit pattern010
Times used1
Total bits3*1=3
CharacterS
Bit pattern1111
Times used1
Total bits4*1=4
CharacterT
Bit pattern1110
Times used1
Total bits4*1=4
Character[SPACE]
Bit pattern0111
Times used1
Total bits4*1=4
CharacterC
Bit pattern0110
Times used1
Total bits4*1=4
Character
Bit pattern
Times usedTotal number of bits
Total bits35

How to calculate bits for uncompressed ASCII data

To determine the number of needed to store data in , the following calculation is required.

Example

The original string of BBC BITESIZE would use ASCII 7 bits per character. There are 12 characters in the string BBC BITESIZE, which would result in a total of 12 x 7 = 84 bits. Huffman coding reduced the size to 35 bits, which means Huffman coding resulted in a saving of 84 – 35 = 49 bits.