Fundamentals of data representation - AQA: Character encoding

All data is represented as binary digits, whether it is numbers, text, images or sound. Calculations are also made in binary.

Character encoding

Computers work in binary. As a result, all characters, whether they are letters, punctuation or digits, are stored as binary numbers. All of the characters that a computer can use are called a character set.

Two standard character sets in common use are:

  • ASCII
  • Unicode

ASCII code

ASCII uses seven bits, giving a character set of 128 characters. The characters are represented in a table, called the ASCII table. The 128 characters include:

  • 33 control codes (codes 0 to 31 and code 127) - mainly to do with printing
  • 33 punctuation marks and symbols, including the space
  • 26 upper case letters
  • 26 lower case letters
  • 10 numeric digits (0-9)
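
The link between the number of bits and the size of the character set can be checked with a short calculation. The following is a minimal Python sketch, not part of the original material; the variable names are illustrative. It assumes the delete code (127) is counted as a control code and the space is counted alongside the 32 punctuation marks and symbols.

```python
import string

BITS = 7
total_codes = 2 ** BITS  # each extra bit doubles the number of patterns

# Group every 7-bit code by the kind of character it represents.
all_chars = [chr(code) for code in range(total_codes)]
upper = [c for c in all_chars if c in string.ascii_uppercase]
lower = [c for c in all_chars if c in string.ascii_lowercase]
digits = [c for c in all_chars if c in string.digits]
symbols = [c for c in all_chars if c in string.punctuation or c == " "]
control = [c for c in all_chars if ord(c) < 32 or ord(c) == 127]

print("Total codes:", total_codes)
print("Upper:", len(upper), "Lower:", len(lower), "Digits:", len(digits))
print("Symbols and space:", len(symbols), "Control:", len(control))
```

Adding the five group sizes together gives the full set of 128 codes, which is a quick way to confirm that no code has been missed or counted twice.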

We tend to say that the letter ‘A’ is the first letter of the alphabet, ‘B’ is the second and so on, all the way up to ‘Z’, which is the 26th letter. In ASCII, each character has its own assigned number. For example:

Character   Decimal   Binary    Hexadecimal
A           65        1000001   41
Z           90        1011010   5A
a           97        1100001   61
z           122       1111010   7A
0           48        0110000   30
9           57        0111001   39
Space       32        0100000   20
!           33        0100001   21

‘A’ is represented by the decimal number 65 (binary 1000001, hex 41), ‘B’ by 66 (binary 1000010, hex 42) and so on up to ‘Z’, which is represented by the decimal number 90 (binary 1011010, hex 5A).

Similarly, lowercase letters start at decimal 97 (binary 1100001, hex 61) and end at decimal 122 (binary 1111010, hex 7A).
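
These values can be looked up in most programming languages. As a rough illustration in Python, the built-in `ord()` function returns a character's code number and `format()` converts it to binary or hexadecimal. The helper name `describe` is made up for this example.

```python
def describe(char):
    """Return the decimal, 7-bit binary and hexadecimal codes for one character."""
    code = ord(char)  # ord() gives the character's code number
    binary = format(code, "07b")  # pad to seven binary digits
    hexadecimal = format(code, "02X")  # two upper case hex digits
    return code, binary, hexadecimal


for char in ["A", "Z", "a", "z", "0", "9", " ", "!"]:
    print(repr(char), *describe(char))
```

Because the letters are numbered in order, `ord("B") - ord("A")` is 1, and the gap between an upper case letter and its lower case form is always 32.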

When a character is stored or transmitted, its ASCII or Unicode number is used, not the character itself.

For example, in binary, the word "Computer" would be represented as:

1000011 1101111 1101101 1110000 1110101 1110100 1100101 1110010
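
Converting a word into its binary codes, and back again, can be sketched in a few lines of Python. The function names below are hypothetical, and the sketch assumes the text contains only ASCII characters, so every code fits in seven bits.

```python
def to_ascii_binary(text):
    """Encode text as space-separated 7-bit ASCII codes."""
    return " ".join(format(ord(char), "07b") for char in text)


def from_ascii_binary(bits):
    """Decode space-separated binary codes back into text."""
    return "".join(chr(int(group, 2)) for group in bits.split())


encoded = to_ascii_binary("Computer")
print(encoded)
print(from_ascii_binary(encoded))
```

Each letter is handled on its own: the third letter, 'm', has the decimal code 109, which is 1101101 in binary.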

Unicode

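While ASCII is suitable for representing English characters, its 128 characters (or the 256 available in 8-bit extended versions) are far too few to hold every character in other languages, such as Chinese or Arabic. Unicode was originally designed around 16 bits, giving a range of over 65,000 characters, and has since been extended to allow more than a million. This makes it more suitable for those situations.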

Unicode also allows us to represent additional characters that are more visual such as emojis and emoticons.
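
To see why seven bits are not enough for these characters, the sketch below checks how many bits each character's Unicode number needs. It is only an illustration: the sample characters and the helper name `bits_needed` are choices made for this example, and the characters are written as escape codes so the program itself contains only ASCII. Note that current versions of Unicode go beyond 16 bits, which is why the emoji here needs 17.

```python
import unicodedata


def bits_needed(char):
    """Return the number of bits needed to store the character's code number."""
    return ord(char).bit_length()


samples = [
    "A",           # Latin capital letter A
    "\u00e9",      # e with an acute accent
    "\u0639",      # an Arabic letter
    "\u4e2d",      # a Chinese character
    "\U0001F600",  # grinning face emoji
]

for char in samples:
    name = unicodedata.name(char)  # official Unicode name of the character
    print(name, ord(char), bits_needed(char))
```

Only the first character fits within ASCII's seven bits. The others need between 8 and 17 bits, so a larger character set is required to represent them.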