Characters

How many?

How many characters are there? Is it:

  1. 128
  2. 256
  3. 65536
  4. 1114112
  5. 4294967296

By the end of these slides, you will know and understand the answer

History

Warning: historical summaries are always inaccurate, misleading and biased

These slides are no exception - they provide a summary of the history of characters, inaccurate because they are simplified and biased towards explaining the overall current state of affairs

Paper tape

At first, there were 128 characters, which we now know as the ASCII characters or plain text characters

They are the ones you can type on a USA keyboard

They began as encodings on 8-hole paper tape

One of the 8 holes was used as a parity bit (check bit), chosen to make the number of holes for each character even, so 7 bits were available

There were other conventions such as EBCDIC (still used on some IBM computers) but ASCII became the standard

Nul and Del

For convenience, e.g. for skipping spare blank paper tape at the start of a roll, the character with code zero, i.e. no holes (called NUL, written '\0' in C) is a character meaning 'ignore me', and is now used in C to mean 'end of string'
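As a small C illustration of the '\0' convention (just standard C, with the byte layout shown in the comments):

#include <stdio.h>
#include <string.h>

int main(void) {
    char s[] = "One";              // stored in memory as 'O', 'n', 'e', '\0'
    printf("%zu\n", strlen(s));    // prints 3: strlen counts bytes up to, not including, the '\0'
    return 0;
}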

When tape was occasionally patched by hand, unwanted characters were deleted by punching all the holes, so the character with code 127 (DEL) also meant 'ignore me', but is now used in editing to mean 'forward delete'

Control characters

Most ASCII characters (95 out of 128) can be typed and printed

The other 33 (codes 0 to 31 and 127) were used to control devices such as printers, or to act as markers

They are called control characters, and the ones with codes 1 to 26 can be generated on a keyboard by holding down CTRL and typing a letter (CTRL+A gives code 1, up to CTRL+Z giving code 26)

The most important is newline (code 10, or hex 0a, or CTRL+J, or '\n') which marks the end of a line
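This can be checked with a minimal C sketch using only the standard library (iscntrl tests for control characters in the default locale):

#include <ctype.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    for (int c = 0; c < 128; c++)
        if (iscntrl(c)) count++;                     // true for codes 0 to 31 and 127
    printf("%d control characters\n", count);        // prints: 33 control characters
    printf("newline is %d (hex %x)\n", '\n', '\n');  // prints: newline is 10 (hex a)
    return 0;
}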

Newlines

Suppose you create a file eg.txt on Linux containing:

One
Two

The result of using od is:

> od -c -t x1 eg.txt
0000000   O   n   e  \n   T   w   o  \n
         4f  6e  65  0a  54  77  6f  0a
0000010

You can see the newlines (hex 0a = 10)

Platform newlines

On Linux, lines end with 0a = \n = linefeed, and this is still the standard way that newlines are stored in programs running in memory

On classic Mac OS, 0d = \r = return = enter was used, and this is still the standard way that newlines are typed on a keyboard (but since Mac OS X, macOS has become more Linux-compatible, and 0a is now used)

On Windows, the pair 0d 0a is used, which is still the standard for network transmission of text on the Web
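As a sketch of coping with all three conventions when reading text, the C function below (the name normalise is just illustrative) copies input to output, converting \r\n pairs and bare \r characters into \n:

#include <stdio.h>

// Copy text from in to out, converting \r\n pairs and bare \r characters to \n
void normalise(FILE *in, FILE *out) {
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (c == '\r') {
            int next = fgetc(in);
            if (next != '\n' && next != EOF) ungetc(next, in);
            fputc('\n', out);
        }
        else fputc(c, out);
    }
}

int main(void) {
    normalise(stdin, stdout);
    return 0;
}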

Why?

Why is there this disagreement?

Because on original printers, \r meant 'carriage return', i.e. return the print head to the left edge of the paper, and \n meant 'line feed', move the paper up by one line

After a while, there were many situations where it was unnecessary to use both characters, but which single convention to keep (\n, \r, or the pair \r\n) was settled differently on different systems, by historical accident

End of file

Are newlines separators (placed between lines) or terminators (one at the end of every line)?
So, should the last line of a text file end in a newline?
If not: concatenating two files runs the last line of one into the first line of the next, tools such as wc -l do not count the unterminated last line, and some compilers warn about a missing final newline

So the answer is terminators and you should configure your editor to add a final newline, if necessary

Facts about ASCII

Each character has a code from 0 to 127

The codes for 'a' to 'z' are contiguous

The codes for 'A' to 'Z' are contiguous, so to convert a letter to upper case, subtract the code for 'a' and add the code for 'A'

The codes for '0' to '9' are contiguous, but they are not 0 to 9, so to convert a digit character to a number, subtract the code for '0'
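Here is that character arithmetic as a minimal C sketch (it assumes the inputs really are a lower case letter and a digit):

#include <stdio.h>

int main(void) {
    char letter = 'g';
    char upper = letter - 'a' + 'A';    // convert lower case to upper case: 'G'
    char digit = '7';
    int value = digit - '0';            // convert a digit character to a number: 7
    printf("%c %d\n", upper, value);    // prints: G 7
    return 0;
}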

Otherwise, you almost never need to know what the codes are

Not enough

The ASCII characters support the English language only

In fact, they have a USA bias, because they include the $ sign, but not any other currency sign

So what was done for other human languages?

For English variants, and a few other languages, ASCII was OK

Extensions

Many languages with small numbers of characters used 256 characters

The first 128 were ASCII (for programs) and the remainder were for the local language alphabet

This allowed a character to fit in one byte (parity bits no longer being needed)

Ideograph languages (Chinese, Japanese, Korean) used their own multi-byte schemes

Incompatibility

The problem with these encoding schemes was that they were incompatible with each other

No convention was ever established inside the text, or in file names, to say which language, and hence which encoding, was being used

In any case, support was needed for multi-language documents, because lots of documents need to contain occasional quotes from other languages

Unicode

To deal with this, Unicode was created

Unicode is a collection of all the characters in all human languages

This collection excludes font, size, italic, bold, colour... issues, which are regarded as graphical

Each character has a unique code, and the first 128 are the ASCII characters

It makes all previous character sets obsolete, but it also evolves (e.g. it now includes emoji, and ancient scripts such as Egyptian hieroglyphs!)

Original limit

At first, the number of characters in Unicode was limited to 65536 (i.e. each had a 16-bit = 2-byte code)

So, at this point in Computer Science history, many people thought characters would be two bytes in size for ever more

However, Unicode claimed to be a character set (allocating codes) not an encoding (specifying how codes are represented in bytes)

Running out

After a while, it became clear that 65536 characters would not be enough

Unicode continued to expand beyond that limit, but there was an understanding that no more than 4 bytes would ever be needed for a character

People expected, eventually, 4 bytes per character with a maximum of 4294967296 characters

An extension mechanism

When Unicode was extended beyond 65536 characters, some extension mechanism was needed in 2-byte systems to deal with the new extra characters

Two blocks of characters within the 65536 were reserved and used as "first half of extended four-byte character" and "second half of extended four-byte character"

This allowed some of the codes beyond 65536 to be represented

UTF-16

The extension mechanism was incorporated within Unicode

The two blocks of codes used for the extension mechanism were permanently reserved within the Unicode collection

The particular encoding with two bytes per character, using the reserved blocks for extended characters, was standardised as the UTF-16 encoding
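Here is a sketch of the mechanism in C (toSurrogates is just an illustrative name; it assumes a valid code point above 65535, and the two reserved blocks are d800..dbff and dc00..dfff):

#include <stdio.h>

// Split a code point above 0xFFFF into a UTF-16 surrogate pair (no checking)
void toSurrogates(unsigned int code, unsigned int *high, unsigned int *low) {
    unsigned int v = code - 0x10000;   // 20 bits remain
    *high = 0xD800 + (v >> 10);        // top 10 bits, placed in the first reserved block
    *low  = 0xDC00 + (v & 0x3FF);      // bottom 10 bits, placed in the second reserved block
}

int main(void) {
    unsigned int high, low;
    toSurrogates(0x1F600, &high, &low);  // a code point well beyond 65535
    printf("%x %x\n", high, low);        // prints: d83d de00
    return 0;
}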

UTF-8

Another encoding was created and has become very popular, and should eventually make all others obsolete

It is called UTF-8

Instead of using a fixed number of bytes per character, it uses a variable number of bytes per character

It uses 1 byte for the first 128 characters (the ASCII characters), then 2 bytes, then 3 or 4 bytes, e.g. for ideograph languages

Unfairness of UTF-8

It has been argued that UTF-8 is unfair, because it favours letter-based languages over ideograph-based languages, making ideographs take more space than necessary

The most common situation where this might be a problem is with web pages written in one of the CJK languages

But such web pages have so many ASCII characters in their HTML skeletons that they are shorter in UTF-8 than in any 'fairer' alternative

One byte characters

In UTF-8, the first 128 characters, i.e. the original ASCII characters, are stored in one byte

The zero in the most significant bit is reserved to mean "this is a one byte character"

A big advantage of this is that all 'legacy' ASCII files, including almost all program source files, are valid UTF-8 files already, with no conversion needed

Multi-byte characters

The prefix 1 in the most significant bit of a byte is reserved to mean "part of a multi-byte character"

The prefix 10 in the most significant two bits means "continuation byte in a multi-byte character", so 6 bits are left for data

The prefixes 110, 1110, 11110 mean "first byte of 2-, 3-, or 4-byte character"

Starting at a random point in some UTF-8 text, you can sort out where the character boundaries are
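The rules above can be captured in a short C sketch (utf8encode is just an illustrative name; it assumes a valid code point and does no error checking):

#include <stdio.h>

// Encode one code point as UTF-8, returning the number of bytes used (1 to 4)
int utf8encode(unsigned int code, unsigned char bytes[4]) {
    if (code < 0x80) {                            // prefix 0: a one-byte character
        bytes[0] = code;
        return 1;
    }
    else if (code < 0x800) {                      // prefix 110, plus one continuation byte
        bytes[0] = 0xC0 | (code >> 6);
        bytes[1] = 0x80 | (code & 0x3F);
        return 2;
    }
    else if (code < 0x10000) {                    // prefix 1110, plus two continuation bytes
        bytes[0] = 0xE0 | (code >> 12);
        bytes[1] = 0x80 | ((code >> 6) & 0x3F);
        bytes[2] = 0x80 | (code & 0x3F);
        return 3;
    }
    else {                                        // prefix 11110, plus three continuation bytes
        bytes[0] = 0xF0 | (code >> 18);
        bytes[1] = 0x80 | ((code >> 12) & 0x3F);
        bytes[2] = 0x80 | ((code >> 6) & 0x3F);
        bytes[3] = 0x80 | (code & 0x3F);
        return 4;
    }
}

int main(void) {
    unsigned char b[4];
    int n = utf8encode(0x20AC, b);                // the euro sign
    for (int i = 0; i < n; i++) printf("%x ", b[i]);
    printf("\n");                                 // prints: e2 82 ac
    return 0;
}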

A new limit

At this point, a new convention on the limit of the number of characters was established

It was decided that there should never be more than 4 bytes in a UTF-8 character, and there should never be codes beyond what UTF-16 can cope with

The new limit is 17 * 65536 = 1114112

The answer

So the answer to the original question is that, at various times in history, the number of characters (or more accurately character slots) has been regarded as 128, 256, 65536, 4294967296

But now it is 1114112

Programming languages

The effects of Unicode are seen in the evolution of programming languages

The C language is based on the 128 ASCII characters

Some messy 'wide character' library facilities were added later for handling 65536 characters

But it is now recommended to use arrays of the ordinary char type to hold text encoded using UTF-8
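For example (a small sketch; the two hex escapes are the UTF-8 bytes of the character é, written out explicitly so that nothing depends on the encoding of the source file):

#include <stdio.h>
#include <string.h>

int main(void) {
    char *s = "caf\xc3\xa9";            // "café": the é occupies two UTF-8 bytes
    printf("%zu bytes\n", strlen(s));   // prints: 5 bytes (but only 4 characters)
    return 0;
}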

Other languages

Languages such as Java and JavaScript were designed with two bytes per character, allowing 65536, with UTF-16 usually available as an extension mechanism (Haskell, in contrast, gives its Char type a full Unicode code point)

Newer languages like Go use UTF-8 directly

UTF-8 has remarkable properties (e.g. sorting and searching work byte-by-byte), whereas UTF-16 is demonstrably the worst choice you can make: it is still variable-length because of the surrogate mechanism, yet it is not compatible with ASCII or UTF-8, and it raises byte-order issues
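For instance, ordinary byte-by-byte C string functions keep working on UTF-8 text (a small sketch; the hex escapes are the UTF-8 bytes of ü, ß and é):

#include <stdio.h>
#include <string.h>

int main(void) {
    char *text = "gr\xc3\xbc\xc3\x9f Gott";           // "grüß Gott" in UTF-8
    printf("%s\n", strstr(text, "Gott"));             // byte-by-byte search still finds the word
    printf("%d\n", strcmp("a", "\xc3\xa9") < 0);      // byte-by-byte comparison matches code point order: prints 1
    return 0;
}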