How many characters are there? Is it 128, 256, 65536, or 4294967296?
By the end of these slides, you will know and understand the answer
Warning: historical summaries are always inaccurate, misleading and biased
These slides are no exception - they provide a summary of the history of characters, inaccurate because they are simplified and biased towards explaining the overall current state of affairs
At first, there were 128 characters, which we now know as the ASCII characters or plain text characters
They are the ones you can type on a USA keyboard
They began as encodings on 8-hole paper tape
One of the 8 holes was used as a parity bit (check bit), chosen to make the number of holes for each character even, so 7 bits were available
There were other conventions such as EBCDIC (still used on some IBM computers) but ASCII became the standard
For convenience, e.g. for skipping spare blank paper tape at the start of a roll, the character with code zero, i.e. no holes (called NUL, written '\0' in C), is a character meaning 'ignore me', and is now used in C to mean 'end of string'
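For example, here is a minimal C sketch (not from the slides) showing the terminator in action: the literal "One" occupies four bytes, and strlen counts up to, but not including, the '\0':

#include <stdio.h>
#include <string.h>

int main(void) {
    char s[] = "One";   /* stored as 'O' 'n' 'e' '\0': four bytes */
    printf("%zu characters in %zu bytes\n", strlen(s), sizeof(s));
    /* prints: 3 characters in 4 bytes */
    return 0;
}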
When tape was occasionally patched by hand, unwanted characters were deleted by punching all the holes, so the character with code 127 (DEL) also meant 'ignore me', but is now used in editing to mean 'forward delete'
Most ASCII characters (95 out of 128) can be typed and printed
The other 33 (codes 0 to 31 and 127) were used to control devices such as printers, or to act as markers
They are called control characters, and codes 1 to 26 can be generated on a keyboard using the CTRL key (CTRL+A to CTRL+Z)
The most important is newline (code 10, or hex 0a, or CTRL+J, or '\n') which marks the end of a line
Suppose you create a file eg.txt on Linux containing:
One
Two
The result of using od is:

> od -c -t x1 eg.txt
0000000   O   n   e  \n   T   w   o  \n
         4f  6e  65  0a  54  77  6f  0a
0000010
You can see the newlines (hex 0a = 10)
On Linux, lines end with 0a = \n = linefeed, and this is still the standard way that newlines are stored in programs running in memory
On macOS, 0d = \r = return = the enter key used to be used, and this is still the standard way that newlines are typed on a keyboard (but since Mac OS X, macOS has become more Linux-compatible, and 0a is now used)
On Windows, the pair 0d 0a is used, which is still the standard for network transmission of text on the Web
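As an illustration, here is a minimal C filter, a sketch rather than production code, that normalises all three conventions to the Linux form:

#include <stdio.h>

/* Normalise \r\n (Windows) and lone \r (classic macOS) to \n (Linux).
   Reads stdin, writes stdout. */
int main(void) {
    int c;
    while ((c = getchar()) != EOF) {
        if (c == '\r') {
            int next = getchar();
            if (next != '\n' && next != EOF) ungetc(next, stdin);
            putchar('\n');   /* \r and \r\n both become \n */
        } else {
            putchar(c);
        }
    }
    return 0;
}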
Why is there this disagreement?
Because on original printers, \r meant 'carriage return', i.e. return the print head to the left edge of the paper, and \n meant 'line feed', i.e. move the paper up by one line
After a while, there were many situations where only one of the two characters was needed, and the three resulting conventions (\n alone, \r alone, or the \r \n pair) were adopted in different systems by historical accident
Are newlines separators or terminators?
So, should the last line of a text file end in a newline?
If not, tools misbehave: concatenating two files runs the last line of the first into the first line of the second, and utilities such as wc -l under-count the lines
So the answer is terminators, and you should configure your editor to add a final newline, if necessary
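One way to see why: many tools count lines by counting newline terminators, as in this sketch of a wc -l style counter, so a final line without its newline goes uncounted:

#include <stdio.h>

/* Count lines the way wc -l does: one per '\n' terminator.
   A last line with no final newline is not counted. */
int main(void) {
    int c;
    long lines = 0;
    while ((c = getchar()) != EOF) {
        if (c == '\n') lines++;
    }
    printf("%ld\n", lines);
    return 0;
}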
Each character has a code from 0 to 127
The codes for 'a' to 'z' are contiguous
The codes for 'A' to 'Z' are contiguous, so to convert a letter to upper case, subtract the code for 'a' and add the code for 'A'
The codes for '0' to '9' are contiguous, but they are not 0 to 9, so to convert a digit character to a number, subtract the code for '0'
Otherwise, you almost never need to know what the codes are
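Here is a small C sketch of both conversions, valid because of the contiguity just described:

#include <stdio.h>

int main(void) {
    char letter = 'g', digit = '7';
    char upper = letter - 'a' + 'A';   /* works because 'a'..'z' and 'A'..'Z' are contiguous */
    int value = digit - '0';           /* works because '0'..'9' are contiguous */
    printf("%c -> %c, '%c' -> %d\n", letter, upper, digit, value);
    /* prints: g -> G, '7' -> 7 */
    return 0;
}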
The ASCII characters support the English language only
In fact, they have a USA bias, because they include the $ sign, but not any other currency sign
So what was done for other human languages?
For English variants, and a few other languages, ASCII was OK
Many languages with small numbers of characters used 256 characters
The first 128 were ASCII (for programs) and the remainder were for the local language alphabet
This allowed a character to fit in one byte (parity bits no longer being needed)
Ideograph languages (Chinese, Japanese, Korean) used their own multi-byte schemes
The problem with these encoding schemes was that they were incompatible with each other
No convention was ever established inside the text, or in file names, to say which language was being used
In any case, support was needed for multi-language documents, because lots of documents need to contain occasional quotes from other languages
To deal with this, Unicode was created
Unicode is a collection of all the characters in all human languages
This collection excludes font, size, italic, bold, colour... issues, which are regarded as graphical
Each character has a unique code, and the first 128 are the ASCII characters
It makes all previous character sets obsolete, but it also evolves (e.g. it now includes emoji!)
At first, the number of characters in Unicode was limited to 65536 (i.e. each had a 16-bit = 2-byte code)
So, at this point in Computer Science history, many people thought characters would be two bytes in size for ever more
However, Unicode claimed to be a character set (allocating codes) not an encoding (specifying how codes are represented in bytes)
After a while, it became clear that 65536 characters would not be enough
Unicode continued to expand beyond that limit, but there was an understanding that no more than 4 bytes would ever be needed for a character
People expected, eventually, 4 bytes per character with a maximum of 4294967296 characters
When Unicode was extended beyond 65536 characters, some extension mechanism was needed in 2-byte systems to deal with the new extra characters
Two blocks of characters within the 65536 were reserved and used as "first half of extended four-byte character" and "second half of extended four-byte character"
This allowed some of the codes beyond 65536 to be represented
The extension mechanism was incorporated within Unicode
The two blocks of codes used for the extension mechanism were permanently reserved within the Unicode collection
The particular encoding based on 2-byte units, using the reserved blocks for extended characters, was standardised as the UTF-16 encoding
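Here is a sketch of the arithmetic: the two reserved blocks are U+D800 to U+DBFF and U+DC00 to U+DFFF, and subtracting 65536 (hex 10000) from an extended code leaves 20 bits, split 10 and 10 across the pair:

#include <stdio.h>

/* Encode one code point beyond 65535 as a UTF-16 surrogate pair. */
void to_surrogates(unsigned code, unsigned *hi, unsigned *lo) {
    code = code - 0x10000;          /* leaves a 20-bit number */
    *hi = 0xD800 + (code >> 10);    /* first reserved block + top 10 bits */
    *lo = 0xDC00 + (code & 0x3FF);  /* second reserved block + bottom 10 bits */
}

int main(void) {
    unsigned hi, lo;
    to_surrogates(0x1F600, &hi, &lo);   /* an emoji, well beyond 65535 */
    printf("%04X %04X\n", hi, lo);      /* prints: D83D DE00 */
    return 0;
}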
Another encoding was created and has become very popular, and should eventually make all others obsolete
It is called UTF-8
Instead of using a fixed number of bytes per character, it uses a variable number of bytes per character
It uses 1 byte for the first 128 characters, then 2 bytes, then 3 or 4 bytes, e.g. for ideograph languages
It has been argued that UTF-8 is unfair, because it favours letter-based languages over ideograph-based languages, making ideographs take more space than necessary
The most common situation where this might be a problem is with web pages written in one of the CJK languages
But such web pages have so many ASCII characters in their HTML skeletons that they are shorter in UTF-8 than in any 'fairer' alternative
In UTF-8, the first 128 characters, i.e. the original ASCII characters, are stored in one byte
The zero in the most significant bit is reserved to mean "this is a one byte character"
A big advantage of this is that all 'legacy' ASCII files, including almost all program source files, are valid UTF-8 files already, with no conversion needed
The prefix 1 in the most significant bit of a byte is reserved to mean "part of a multi-byte character"
The prefix 10 in the most significant two bits means "continuation byte in a multi-byte character", so 6 bits are left for data
The prefixes 110, 1110, 11110 mean "first byte of a 2-, 3-, or 4-byte character"
Starting at a random point in some UTF-8 text, you can sort out where the character boundaries are, because only continuation bytes begin with the bits 10
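Here is a minimal C encoder following these prefixes; it is a sketch which assumes its input is a valid code and does no error checking:

#include <stdio.h>

/* Encode one code point (up to 0x10FFFF) into buf; return the byte count. */
int utf8_encode(unsigned code, unsigned char *buf) {
    if (code < 0x80) {              /* 0xxxxxxx */
        buf[0] = code;
        return 1;
    }
    if (code < 0x800) {             /* 110xxxxx 10xxxxxx */
        buf[0] = 0xC0 | (code >> 6);
        buf[1] = 0x80 | (code & 0x3F);
        return 2;
    }
    if (code < 0x10000) {           /* 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = 0xE0 | (code >> 12);
        buf[1] = 0x80 | ((code >> 6) & 0x3F);
        buf[2] = 0x80 | (code & 0x3F);
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    buf[0] = 0xF0 | (code >> 18);
    buf[1] = 0x80 | ((code >> 12) & 0x3F);
    buf[2] = 0x80 | ((code >> 6) & 0x3F);
    buf[3] = 0x80 | (code & 0x3F);
    return 4;
}

int main(void) {
    unsigned char buf[4];
    int n = utf8_encode(0x20AC, buf);   /* the euro sign */
    for (int i = 0; i < n; i++) printf("%02x ", buf[i]);
    printf("\n");                       /* prints: e2 82 ac */
    return 0;
}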
At this point, a new convention on the limit of the number of characters was established
It was decided that there should never be more than 4 bytes in a UTF-8 character, and there should never be codes beyond what UTF-16 can cope with
The new limit is 17 * 65536 = 1114112: the original 65536 codes, plus the 16 * 65536 extra codes that the 20 data bits of a surrogate pair can reach
So the answer to the original question is that, at various times in history, the number of characters (or more accurately character slots) has been regarded as 128, 256, 65536, 4294967296
But now it is 1114112
The effects of Unicode are seen in the evolution of programming languages
The C language is based on the 128 ASCII characters
Some messy 'wide character' library facilities were added later for handling 65536 characters
But it is now recommended to use arrays of the ordinary char type to hold text encoded using UTF-8
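For example, here is a minimal sketch of that recommendation: UTF-8 text held in an ordinary char array, with characters counted by skipping the 10xxxxxx continuation bytes:

#include <stdio.h>
#include <string.h>

/* Count code points in UTF-8 text held in a char array: every byte
   except the 10xxxxxx continuation bytes starts a character. */
size_t utf8_length(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) count++;
    }
    return count;
}

int main(void) {
    const char *s = "caf\xC3\xA9";   /* "cafe" with an acute e, which takes two bytes */
    printf("%zu bytes, %zu characters\n", strlen(s), utf8_length(s));
    /* prints: 5 bytes, 4 characters */
    return 0;
}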
Languages such as Java and JavaScript were designed with two bytes per character, allowing 65536, with UTF-16 surrogate pairs available as the extension mechanism
Newer languages like Go use UTF-8 directly
UTF-8 has remarkable properties (e.g. sorting and searching work byte-by-byte) and UTF-16 is demonstrably the worst choice you can make