ASCII, the history
The most used computer standard is without doubt ASCII, the American Standard Code for Information Interchange. When people started to develop computers, they had to define a way to represent certain types of information in a digital format. For numbers this was relatively easy, but text representation was far more difficult. Morse code was developed in the 19th century, but could not be easily adapted to the binary system in computers because the codes used for characters have different lengths and there is no obvious sorting method.
In the 1960s IBM came up with its own solution, EBCDIC (Extended Binary Coded Decimal Interchange Code), used on its mainframes and AS/400 systems. But this system had some drawbacks. The letters of the alphabet were placed in non-contiguous blocks, which is not very useful for sorting. While IBM was developing EBCDIC, other computer developers were creating their own codes.
It became evident that exchanging data between various computer systems would be a huge problem if this diversity did not stop. Bob Bemer, now often called the Father of ASCII, therefore compiled all the different coding methods in a huge list. It was this list that made computer manufacturers realize that something had to be done about the situation quickly. Bob Bemer started standardization committees, and the first version of ASCII was introduced in 1963. Extensions for foreign languages were added to ASCII in 1967, and in 1968 it finally became an official government standard.
Nowadays nearly all computers use ASCII, or an encoding built on it, as their primary coding system. Extensions for foreign languages are almost all coded as supersets of ASCII. Therefore we can say that ASCII is the most used computer standard in the world.
ASCII character set table
The ASCII character set has been adopted as the standard in information exchange. The first 32 characters and the last one are control codes, the others are printable characters. The control codes DC1 (XON) and DC3 (XOFF) are used in software flow control applications. The following table shows the ASCII character set.
ASCII control codes in detail
- 0 – NUL – Null character
- The NUL character in the ASCII character set was originally meant to be treated as a NOP, a character to be ignored. This was useful on paper tapes where additional information had to be added in between existing information. However, some printing devices implemented NUL as a white space instead. Later on, the importance of the null character increased significantly when it was defined as the string terminator in the C programming language. This made it possible to define strings of arbitrary length, limited only by available memory. Until then many languages, like Pascal, defined a string as a length indicator followed by an array that contained the characters.
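The difference between the two string layouts can be sketched in C. This is a minimal illustration, not a definitive implementation: the function names are made up for this example, and the one-byte length prefix is an assumed, Pascal-like layout.

```c
#include <stddef.h>

/* Length of a C string: count bytes up to, but not including, the NUL. */
size_t c_string_length(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Length-prefixed layout (assumed for illustration): byte 0 holds the
   length and the characters follow it. */
size_t prefixed_string_length(const unsigned char *s)
{
    return s[0];
}
```

With a one-byte prefix the length can never exceed 255, whereas the NUL-terminated form has no such built-in limit. The price is that the length has to be found by scanning the whole string.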
- 1 – SOH – Start of heading
- If the communication primarily consists of commands and messages, the SOH can be used to mark the beginning of each message header. In the original 1963 definition of the ASCII standard the name start of message was used, which was renamed to start of heading in the final release. Nowadays we often see the SOH used in serial RS-232 communications where there is a master-slave configuration. Each command from the master starts with the SOH. This makes it possible for the slave or slaves to resynchronize on the next command when data errors have occurred. Without a clear marking of the start of each command, a resync might be problematic to implement.
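Resynchronizing on SOH can be sketched as a simple scan of the receive buffer. The function name and the buffer-based interface are assumptions made for this example. A real protocol would also have to make sure the value 0x01 cannot occur inside a command.

```c
#include <stddef.h>

#define SOH 0x01  /* ASCII start of heading */

/* Returns the index of the first SOH at or after `start`,
   or `len` if the rest of the buffer contains none. */
size_t find_next_soh(const unsigned char *buf, size_t len, size_t start)
{
    size_t i = start;
    while (i < len && buf[i] != SOH)
        i++;
    return i;
}
```

After a data error the receiver can discard everything up to the returned index and start decoding the next command from there.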
- 2 – STX – Start of text
- 3 – ETX – End of text
- A message-based communication protocol will probably use messages with a header containing addressing information, followed by the actual content. The ASCII STX indicates the start of the content part of such a message. This control code automatically ends a preceding header, i.e. there is no control code to close a header started by SOH. The end of the message content is signalled with the control character ETX. The actual contents of a message are not defined by the ASCII standard and are protocol dependent. It is interesting to note that naming conventions differed in the 1963 draft of the standard. In this draft STX was called EOA, end of address, and ETX started its life as EOM, end of message. This is because in the original draft a message always contained a start and a stop control character. The new definition made it possible to send a fixed-length command using only the SOH, without the need to end the command with a trailing control code. In fact, this is common in current serial protocols, where fixed-length messages are sent without a distinction between header and content.
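As an illustration of the SOH / STX / ETX layout, the sketch below splits one message into header and content. The struct, the function name and the exact framing (SOH header STX content ETX, with no checksum) are assumptions for this example. ASCII itself does not prescribe them.

```c
#include <stdbool.h>
#include <stddef.h>

#define SOH 0x01  /* start of heading */
#define STX 0x02  /* start of text    */
#define ETX 0x03  /* end of text      */

struct frame {
    const unsigned char *header;
    size_t header_len;
    const unsigned char *body;
    size_t body_len;
};

/* Splits "SOH header STX body ETX" into its parts. Returns false if the
   buffer does not start with SOH or if STX or ETX is missing. */
bool parse_frame(const unsigned char *buf, size_t len, struct frame *out)
{
    if (len == 0 || buf[0] != SOH)
        return false;

    size_t stx = 1;                      /* STX closes the header */
    while (stx < len && buf[stx] != STX)
        stx++;
    if (stx == len)
        return false;

    size_t etx = stx + 1;                /* ETX closes the content */
    while (etx < len && buf[etx] != ETX)
        etx++;
    if (etx == len)
        return false;

    out->header = buf + 1;
    out->header_len = stx - 1;
    out->body = buf + stx + 1;
    out->body_len = etx - stx - 1;
    return true;
}
```

The pointers in the result refer into the original buffer, so nothing is copied.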
- 4 – EOT – End of transmission
- 5 – ENQ – Enquiry
- 6 – ACK – Acknowledgment
- 7 – BEL – Audible bell
- The BEL code is an interesting one in the ASCII set, as it is not primarily used for data coding or device control. Instead it is used to attract human attention with an audible sound. It was intended to be used on both computers and devices like printers. In the C programming language the escape sequence \a produces the bell signal.
- 8 – BS – Backspace
- The functionality of the backspace has changed over time. In the beginning it was primarily meant to move the cursor one character backwards on printers and teletypes to make accents on characters possible. For example, to generate the character â one could send the sequence a BS ^ to the printer. This method was a practical copy of the way characters with accents were handled on mechanical typewriters, but when CRTs were introduced it was no longer supported in that way. Therefore the backspace is now most often used not only to reposition the cursor, but also to delete the contents at that position. You can use this control character as \b in the C programming language.
- 9 – HT – Horizontal tab
- The HT control character in the ASCII character set is defined for layout purposes. It instructs the output device to proceed to the next table column. Table column width is flexible, but on many devices the distance between table columns defaults to 8 character positions. The use of the horizontal tab not only reduced the work for data typists, but also introduced a method to reduce the amount of storage space necessary for formatted texts. We may laugh about it now, but keep in mind that the ASCII standard was developed in the 1960s, when every byte of storage was valuable and compression methods like ZIP didn't exist. The control character HT is available as \t in the C programming language.
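How a device finds the next table column can be sketched with a little integer arithmetic. The function name is made up for this example, and it assumes evenly spaced tab stops and column numbers that start at 0.

```c
/* Column reached after a horizontal tab, assuming a tab stop every
   `width` columns and 0-based column numbers. */
int next_tab_stop(int column, int width)
{
    return (column / width + 1) * width;
}
```

With the default width of 8, a tab at column 3 moves to column 8 and a tab at column 8 moves to column 16, so a single HT byte can stand in for up to eight spaces.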
- 10 – LF – Line feed
- The line feed character is one of the characters in the ASCII character set that has been misused. Originally, the LF character was meant to move the head of a printer one line down. A second control character, CR, would then be used to move the printing head to the left margin. This is the way it was implemented in many serial protocols and in operating systems like MS-DOS and Windows. The C programming language and the Unix operating system, on the other hand, redefined this character as newline, which means a combination of line feed and carriage return. You can argue about which use is wrong. The way C and Unix handle it is certainly more natural from a programming point of view. The MS-DOS implementation, on the other hand, is closer to the original definition. It would have been better if both line feed and newline had been part of the original ASCII definition, because the first is a typical device control function whereas the latter is a logical text separator. But the standard makes no such distinction. Nowadays people tend to use the LF character mainly as a newline, and most software that handles plain ASCII text files can handle both single LF and CR/LF combinations. In the C programming language this control character is available as \n.
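Software that accepts both conventions often normalizes line endings first. The following is a minimal sketch of one way to do that, with a function name chosen for this example. It converts CR/LF pairs to a single LF and leaves everything else alone.

```c
#include <stddef.h>

/* Rewrites every CR LF pair in buf[0..len) as a single LF, in place.
   Lone CR or LF bytes are left untouched. Returns the new length. */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t out = 0;
    for (size_t in = 0; in < len; in++) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n')
            continue;  /* drop the CR; the LF is copied on the next pass */
        buf[out++] = buf[in];
    }
    return out;
}
```

The function works on a byte count rather than a NUL terminator, so the caller decides whether to terminate the shortened text.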
- 11 – VT – Vertical tab
- The vertical tab, like the horizontal tab, was defined to reduce the amount of work for creating layouts, and also to reduce the amount of storage space for formatted text pages. The VT control code is used to jump to the next marked line. To be honest, I have never seen a situation or application where this functionality was implemented. In most situations a sequence of LF codes is used instead.
- 12 – FF – Form feed
- The form feed code FF was designed to control the behaviour of printers. When receiving this code the printer moves to the next sheet of paper. The behaviour of the control code on terminals depends on the implementation. Some clear the screen, whereas others only display the ^L characters or perform a line feed instead. The shell environments Bash and Tcsh have implemented the ASCII form feed as a clear screen command. The form feed is available as \f in the C programming language.
- 13 – CR – Carriage return
- The carriage return in the ASCII character set was, in its original form, meant to move the printing head back to the left margin without moving to the next line. Over time this code has also been assigned to the Enter key on keyboards to signal that the input of text is finished. With screen-oriented representation of data, people wanted entering data to also move the cursor to the next line. Therefore, in the C programming language and the Unix operating system, the LF control code was redefined as newline. Software now often silently translates an entered CR to the LF ASCII code when the data is stored.
- 14 – SO – Shift out
- 15 – SI – Shift in
- As early as the sixties, the people who defined the ASCII character set understood that it would be valuable to make the character set available not only for the English alphabet, but also for foreign ones. Shift in and shift out were defined for this purpose. Originally they were meant to switch between the Cyrillic and Latin alphabets. The Cyrillic ASCII definition which uses the shift characters is KOI-7. Later on these control codes were also used to change the typeface on printers. In this use SO produced double-wide characters, whereas condensed printing was selected with SI.
- 16 – DLE – Data link escape
- It is sometimes necessary to send control characters in an ongoing data communication. There are situations where those control characters might be understood as part of the normal data stream. The DLE has been defined in the ASCII standard for these situations. If this character is detected in a data stream, the receiving party knows that one or more of the following characters must be interpreted in a different way than the other characters in the stream. The exact interpretation of the following characters is not part of the ASCII definition, just the possibility to break out of a communication stream with the data link escape. In the Hayes communication protocol for modems, the data link escape has been defined as silence+++silence. In my opinion it would have been a better idea if the Hayes protocol had used the DLE instead, as it does not need to be surrounded by communication silence, and it would fit within an existing standard. However, the developers of Hayes decided otherwise, and now the +++ sequence is used far more often than the original DLE.
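Since ASCII leaves the meaning of the characters after a DLE to the protocol, any example has to pick a convention. The sketch below assumes one that is common in byte-oriented protocols: a DLE that occurs in the data itself is sent twice, so the receiver can tell it apart from a DLE that introduces a control sequence. The function name is made up for this example.

```c
#include <stddef.h>

#define DLE 0x10  /* ASCII data link escape */

/* Copies `len` bytes from `in` to `out`, sending each DLE twice.
   `out` must have room for 2 * len bytes. Returns the bytes written. */
size_t dle_stuff(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] == DLE)
            out[n++] = DLE;  /* escape the data byte with an extra DLE */
        out[n++] = in[i];
    }
    return n;
}
```

The receiver reverses the step by collapsing every DLE DLE pair back into a single data byte.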
- 17 – DC1 – Device control 1 / XON – Transmission on
- Although originally defined as DC1, this ASCII control code is now better known as the XON code used for software flow control in serial communications. Its main use is restarting the transmission after the communication has been stopped by the XOFF control code. People who used to work with serial terminals probably remember that when data errors occurred, it sometimes helped to hit Ctrl-Q. This is because this key sequence generates the XON control code, which unlocks a blocked communication when the terminal or host computer has accidentally interpreted an erroneous character as XOFF.
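Both halves of this story fit in a few lines of C. The sketch below is illustrative only and its function names are made up. It assumes the traditional terminal behaviour in which Ctrl keeps only the lower five bits of a letter's code, and it models the sender's side of XON/XOFF flow control as a single flag.

```c
#include <stdbool.h>

#define XON  0x11  /* DC1, generated by Ctrl-Q */
#define XOFF 0x13  /* DC3, generated by Ctrl-S */

/* Code produced by holding Ctrl with a letter key, assuming the
   terminal keeps only the lower five bits of the letter's code. */
int ctrl_key(char letter)
{
    return letter & 0x1F;
}

/* Updates the sender's "may transmit" flag for one received byte. */
bool update_flow_state(bool may_send, unsigned char received)
{
    if (received == XOFF)
        return false;
    if (received == XON)
        return true;
    return may_send;  /* ordinary data does not change the state */
}
```

A sender that wrongly sees an XOFF stays blocked until an XON arrives, which is why typing Ctrl-Q can free a stuck terminal session.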
- 18 – DC2 – Device control 2
- 19 – DC3 – Device control 3 / XOFF – Transmission off
- 20 – DC4 – Device control 4
- 21 – NAK – Negative acknowledgment
- 22 – SYN – Synchronous idle
- 23 – ETB – End of transmission block
- 24 – CAN – Cancel
- 25 – EM – End of medium
- The EM is used at the end of a serial storage medium like paper tape or magnetic reels. It indicates the logical end of the data. It is not necessary that this is also the physical end of the data carrier.
- 26 – SUB – Substitute character
- 27 – ESC – Escape
- The escape character is one of the inventions in the ASCII standard that was proposed by Bob Bemer. It is used to start an extended sequence of control codes. In this way it was not necessary to put every conceivable control code in the ASCII standard. As new technologies would need new control commands, the ESC would be available as the starting character of these multi-character commands. Escape codes are widely used in printers and terminals to control device settings like fonts, text positioning and colors. If ESC had been absent from the original ASCII definition, the standard would likely have been superseded by some other standard long ago. The escape possibility allowed developers to literally escape from the standard where necessary, but use it whenever possible.
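As one concrete case of such a multi-character command, the sketch below builds the cursor positioning sequence ESC [ row ; column H. This sequence comes from ECMA-48 (often called ANSI escape codes), not from ASCII itself, and the example assumes a terminal that understands it. The function name is made up for this illustration.

```c
#include <stdio.h>

#define ESC 0x1B  /* ASCII escape */

/* Formats the ECMA-48 "cursor position" sequence ESC [ row ; col H
   into buf. Returns the result of snprintf, i.e. the sequence length
   when the buffer is large enough. */
int format_cursor_position(char *buf, size_t size, int row, int col)
{
    return snprintf(buf, size, "%c[%d;%dH", ESC, row, col);
}
```

Only the first byte is a control code. The rest of the command consists of ordinary printable characters, which is what keeps the scheme extensible.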
- 28 – FS – File separator
- The file separator FS is an interesting control code, as it gives us insight into the way computer technology was organized in the sixties. We are now used to random access media like RAM and magnetic disks, but when the ASCII standard was defined, most data was serial. I am not only talking about serial communications, but also about serial storage like punch cards, paper tape and magnetic tapes. In such a situation it is clearly efficient to have a single control code to signal the separation of two files. The FS was defined for this purpose.
- 29 – GS – Group separator
- Data storage was one of the main reasons for some control codes to be included in the ASCII definition. Databases are usually set up with tables, containing records. All records in one table have the same type, but records of different tables can be different. The group separator GS is defined to separate tables in a serial data storage system. Note that the word table wasn't used at the time, and the ASCII people called it a group.
- 30 – RS – Record separator
- Within a group (or table) the records are separated with RS or record separator.
- 31 – US – Unit separator
- The smallest data items to be stored in a database are called units in the ASCII definition. We would call them fields now. The unit separator separates these fields in a serial data storage environment. Most current database implementations require that fields of most types have a fixed length. Enough space in the record is allocated to store the largest possible member of each field, even if this is not necessary in most cases. This costs a large amount of space in many situations. The US control code allows all fields to have a variable length. If data storage space is limited, as it was in the sixties, this is a good way to preserve valuable space. Serial storage, on the other hand, is far less efficient than the table-driven RAM and disk implementations of modern times. I can't imagine a situation where modern SQL databases are run with the data stored on paper tape or magnetic reels…
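Reading variable-length fields from such a record can be sketched as follows. The function name and interface are assumptions for this example, and the sample data in the usage note is invented purely for illustration.

```c
#include <stddef.h>

#define US 0x1F  /* ASCII unit separator */

/* Finds field number `index` (0-based) in a record whose fields are
   separated by US. On success stores the field's start and length and
   returns 1; returns 0 if the record has fewer fields. */
int get_unit(const char *rec, size_t len, size_t index,
             const char **start, size_t *unit_len)
{
    size_t begin = 0;
    for (size_t i = 0; i <= len; i++) {
        if (i == len || rec[i] == US) {  /* end of record or of a field */
            if (index == 0) {
                *start = rec + begin;
                *unit_len = i - begin;
                return 1;
            }
            index--;
            begin = i + 1;
        }
    }
    return 0;
}
```

For a 17-byte record holding the three fields Ada, Lovelace and 1815, field 1 is found at offset 4 with length 8. No padding is stored, but reaching a field means scanning past all the fields before it, which is the serial access cost described above.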
- 32 – SP – White space
- You can argue whether the space character is a real control character, as it is so widely used in normal texts. But as the horizontal tab and backspace are also called control characters in the ASCII set, I think it is most natural to call the white space or forward space a control character as well. After all, it doesn't represent a character by itself, but merely a command to the output device to proceed one position forward, clearing the information in the current field. In many applications like word processors the white space is also a character that can cause lines to wrap, and web browsers combine multiple spaces into just one output character. This strengthens my belief that it does not just represent a unique character, but is an information carrier for devices and applications.
- 127 – DEL – Delete
- One might question why all control codes in the ASCII character set have low values, but the DEL control code has value 127. This is because this specific character was defined for deleting data on paper tapes. Most paper tapes at that time used 7 holes to code the data. The value 127 represents a binary pattern where all seven bits are high, so when the DEL character is punched over existing data on a paper tape, all holes are punched and the existing data is erased.
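The paper tape argument can be checked with a bitwise OR. Punching can only add holes, never remove them, so overpunching a position behaves like OR-ing the two bit patterns. The function name below is made up for this sketch, which models a 7-hole tape.

```c
#define DEL 0x7F  /* binary 1111111: all seven holes punched */

/* Result of punching the pattern `punched` over a tape position that
   already holds `existing`. Holes can only be added, so this is an OR,
   limited to the seven data bits. */
unsigned char overpunch(unsigned char existing, unsigned char punched)
{
    return (existing | punched) & 0x7F;
}
```

Whatever code was on the tape, overpunching it with DEL always yields DEL. NUL (no holes at all) is the opposite extreme and can be overpunched with any other code.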