UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, UTF-8 takes its name from Unicode Transformation Format – 8-bit. As of July 2025, almost every webpage is transmitted as UTF-8.
Some character encoding systems represent each character using a fixed number of bits, whereas other systems use varying sizes. Various fixed-length sizes were used for now-obsolete systems such as the six-bit character code, the five-bit Baudot code, and even 4-bit systems (with only 16 possible values). The more modern ASCII system is a seven-bit code, conventionally stored in an 8-bit byte per character. Today, the Unicode-based UTF-8 encoding uses a varying number of byte-sized code units (from one to four) to represent each code point, and one or more code points combine to encode a character.
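As a concrete illustration of the variable-length scheme, the short Python sketch below (the sample characters are chosen purely for illustration) encodes a few characters and shows how many byte-sized code units UTF-8 uses for each.

```python
# Illustrative sketch: UTF-8 uses one to four byte-sized code units per code point.
samples = ["A", "é", "€", "𝄞"]  # expected to take 1, 2, 3 and 4 bytes respectively

for ch in samples:
    units = ch.encode("utf-8")                      # the UTF-8 code units as bytes
    hex_units = " ".join(f"{b:02x}" for b in units)
    print(f"U+{ord(ch):04X} {ch} -> {hex_units} ({len(units)} byte(s))")
```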
Character encoding is the convention of using a numeric value to represent each character of a writing script. Not only can a character set include natural language symbols, but it can also include codes that have meanings or functions outside of language, such as control characters and whitespace. Character encodings have also been defined for some constructed languages. When encoded, character data can be stored, transmitted, and transformed by a computer. The numerical values that make up a character encoding are known as code points and collectively comprise a code space or a code page.
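To make the notion of a code point concrete, here is a minimal Python sketch (the characters are arbitrary examples) that looks up the numeric value assigned to a few characters, including whitespace and a control character.

```python
# Illustrative: every character, including control characters and whitespace,
# maps to a numeric code point within the code space.
for ch in ["A", "я", "€", "\t", "\x07"]:   # letter, Cyrillic letter, symbol, tab, BEL control
    print(f"{ch!r} -> code point U+{ord(ch):04X} (decimal {ord(ch)})")

# The mapping runs both ways: a code point identifies exactly one character.
assert chr(0x042F) == "Я"
```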
Early character encodings that originated with optical or electrical telegraphy and in early computers could only represent a subset of the characters used in languages, sometimes restricted to upper case letters, numerals and limited punctuation. Over time, encodings capable of representing more characters were created, such as ASCII, ISO/IEC 8859, and Unicode encodings such as UTF-8 and UTF-16.
The term "code page" originated from IBM's EBCDIC-based mainframe systems, but Microsoft, SAP, and Oracle Corporation are among the vendors that use this term. The majority of vendors identify their own character sets by a name. In the case when there is a plethora of character sets (like in IBM), identifying character sets through a number is a convenient way to distinguish them. Originally, the code page numbers referred to the page numbers in the IBM standard character set manual, a condition which has not held for a long time. Vendors that use a code page system allocate their own code page number to a character encoding, even if it is better known by another name; for example, UTF-8 has been assigned page numbers 1208 at IBM, 65001 at Microsoft, and 4110 at SAP.
Transcoding is the direct digital-to-digital conversion of one encoding to another, such as for video data files, audio files (e.g., MP3, WAV), or character encoding (e.g., UTF-8, ISO/IEC 8859). This is usually done in cases where a target device (or workflow) does not support the format or has limited storage capacity that mandates a reduced file size, or to convert incompatible or obsolete data to a better-supported or modern format.
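In the character-encoding case, transcoding amounts to decoding bytes with the source encoding and re-encoding them with the target one; a minimal Python sketch, assuming input bytes in ISO/IEC 8859-1:

```python
# Minimal sketch of character-set transcoding: ISO/IEC 8859-1 bytes -> UTF-8 bytes.
legacy_bytes = b"na\xefve caf\xe9"            # "naïve café" encoded as ISO/IEC 8859-1

text = legacy_bytes.decode("iso-8859-1")      # interpret bytes using the source encoding
utf8_bytes = text.encode("utf-8")             # re-encode using the target encoding

print(text)         # naïve café
print(utf8_bytes)   # b'na\xc3\xafve caf\xc3\xa9'
```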
In the video world, transcoding can be performed while files are being searched, as well as for presentation. For example, Cineon and DPX files have been widely used as a common format for digital cinema, but the data size of a two-hour movie is about 8 terabytes (TB). That large size can increase the cost and difficulty of handling movie files. Transcoding into the lossless JPEG 2000 format mitigates this, because JPEG 2000 has better data compression performance than other lossless coding technologies; in many cases, JPEG 2000 can compress images to half their original size.
Code page 866 (CCSID 866) (CP 866, "DOS Cyrillic Russian") is a code page used under DOS and OS/2 in Russia to write Cyrillic script. It is based on the "alternative code page" (Russian: Альтернативная кодировка) developed in 1984 at IHNA AS USSR and published in 1986 by a research group at the Academy of Sciences of the USSR. The code page was widely used during the DOS era because it preserves all of the pseudographic symbols of code page 437 (unlike the "Main code page" or Code page 855) and maintains the alphabetical order (although non-contiguously) of the Cyrillic letters (unlike KOI8-R). Initially this encoding was available only in the Russian version of MS-DOS 4.01 (1990), but with MS-DOS 6.22 it became available in every language version.
The WHATWG Encoding Standard, which specifies the character encodings that compliant browsers must support for HTML5, includes Code page 866. It is the only single-byte encoding listed that is not an ISO 8859 part, a Mac OS-specific encoding, a Microsoft Windows-specific encoding (Windows-874 or Windows-125x), or a KOI-8 variant. Authors of new pages and the designers of new protocols are instructed to use UTF-8 instead.
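As a hedged illustration of what the code page preserves, the Python sketch below (byte values picked purely for illustration) decodes a few CP866 bytes: the high half of the table carries both Cyrillic letters and the box-drawing (pseudographic) symbols inherited from code page 437.

```python
# Illustrative: CP866 adds Cyrillic letters while keeping code page 437's box-drawing symbols.
cyrillic = bytes(range(0x80, 0x90)).decode("cp866")   # АБВГДЕЖЗИЙКЛМНОП, start of the Cyrillic block
boxdraw = bytes([0xC9, 0xCD, 0xBB]).decode("cp866")   # ╔═╗, pseudographics shared with CP437

print(cyrillic)
print(boxdraw)

# A DOS-era file stores Cyrillic text as one byte per character:
print("Привет".encode("cp866"))                       # b'\x8f\xe0\xa8\xa2\xa5\xe2'
```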
Kenneth Lane Thompson (born February 4, 1943) is an American pioneer of computer science. Thompson worked at Bell Labs for most of his career where he designed and implemented the original Unix operating system. He also invented the B programming language, the direct predecessor to the C language, and was one of the creators and early developers of the Plan 9 operating system. Other notable contributions included his work on regular expressions and early computer text editors QED and ed, the definition of the UTF-8 encoding, and his work on computer chess that included the creation of endgame tablebases and the chess machine Belle.
Since 2006, Thompson has worked at Google, where he co-developed the Go language. In 1983, he won the Turing Award with his long-term colleague Dennis Ritchie. He is considered one of the greatest computer programmers of all time.
A telegraph code is one of the character encodings used to transmit information by telegraphy. Morse code is the best-known such code. Telegraphy usually refers to the electrical telegraph, but telegraph systems using the optical telegraph were in use before that. A code consists of a number of code points, each corresponding to a letter of the alphabet, a numeral, or some other character. In codes intended for machines rather than humans, code points for control characters, such as carriage return, are required to control the operation of the mechanism. Each code point is made up of a number of elements arranged in a unique way for that character. There are usually two types of element (a binary code), but more element types were employed in some codes not intended for machines. For instance, American Morse code had about five elements, rather than the two (dot and dash) of International Morse Code.
Codes meant for human interpretation were designed so that the characters that occurred most often had the fewest elements in the corresponding code point. For instance, the Morse code for E, the most common letter in English, is a single dot ( ▄ ), whereas Q is ▄▄▄ ▄▄▄ ▄ ▄▄▄ . These arrangements meant that messages could be sent more quickly and that operators took longer to fatigue. Telegraphs were always operated by humans until late in the 19th century. When automated telegraphy arrived, codes with variable-length code points were inconvenient for the machine designs of the period; fixed-length codes were used instead. The first of these was the Baudot code, a five-bit code. Baudot has only enough code points to print in upper case. Later codes had more bits (ASCII has seven) so that both upper and lower case could be printed. Beyond the telegraph age, modern computers require a very large number of code points (Unicode has 21 bits) so that multiple languages and alphabets (character sets) can be handled without having to change the character encoding. Modern computers can easily handle variable-length codes such as UTF-8 and UTF-16, which have now become ubiquitous.
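The frequency-based design is easy to demonstrate; the hedged Python sketch below (letter set chosen purely for illustration) compares element counts for a common and a rare letter, and then shows the analogous variation in code-unit counts for the modern variable-length encodings mentioned above.

```python
# Illustrative subset of International Morse code: frequent letters got fewer elements.
MORSE = {"E": ".", "T": "-", "A": ".-", "Q": "--.-", "Z": "--.."}

for letter in ("E", "Q"):
    print(f"{letter}: {MORSE[letter]} ({len(MORSE[letter])} elements)")

# UTF-8 is also variable-length: low code points (including all of ASCII) take fewer bytes.
for ch in ("e", "я", "€"):
    print(f"{ch}: {len(ch.encode('utf-8'))} byte(s) in UTF-8, "
          f"{len(ch.encode('utf-16-le'))} byte(s) in UTF-16")
```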