UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, UTF-8 takes its name from Unicode Transformation Format – 8-bit. As of July 2025, almost every webpage is transmitted as UTF-8.
Some character encoding systems represent each character using a fixed number of bits, whereas other systems use varying sizes. Various fixed-length sizes were used for now-obsolete systems such as the six-bit character code, the five-bit Baudot code, and even 4-bit systems (with only 16 possible values). The more modern ASCII system is a seven-bit code, conventionally stored in an 8-bit byte per character. Today, the Unicode-based UTF-8 encoding uses a varying number of byte-sized code units (from one to four) to represent each code point, and one or more code points combine to encode a character.
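As a concrete illustration of the variable-length scheme, the short Python sketch below (the sample characters are chosen purely for illustration) encodes a few characters and shows how many byte-sized code units UTF-8 uses for each.

```python
# Illustrative sketch: UTF-8 uses one to four byte-sized code units per code point.
samples = ["A", "é", "€", "𝄞"]  # expected to take 1, 2, 3 and 4 bytes respectively

for ch in samples:
    units = ch.encode("utf-8")                      # the UTF-8 code units as bytes
    hex_units = " ".join(f"{b:02x}" for b in units)
    print(f"U+{ord(ch):04X} {ch} -> {hex_units} ({len(units)} byte(s))")
```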
Character encoding is the convention of using a numeric value to represent each character of a writing script. Not only can a character set include natural language symbols, but it can also include codes that have meanings or functions outside of language, such as control characters and whitespace. Character encodings have also been defined for some constructed languages. When encoded, character data can be stored, transmitted, and transformed by a computer. The numerical values that make up a character encoding are known as code points and collectively comprise a code space or a code page.
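To make the notion of a code point concrete, here is a minimal Python sketch (the characters are arbitrary examples) that looks up the numeric value assigned to a few characters, including whitespace and a control character.

```python
# Illustrative: every character, including control characters and whitespace,
# maps to a numeric code point within the code space.
for ch in ["A", "я", "€", "\t", "\x07"]:   # letter, Cyrillic letter, symbol, tab, BEL control
    print(f"{ch!r} -> code point U+{ord(ch):04X} (decimal {ord(ch)})")

# The mapping runs both ways: a code point identifies exactly one character.
assert chr(0x042F) == "Я"
```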
Early character encodings that originated with optical or electrical telegraphy and in early computers could only represent a subset of the characters used in languages, sometimes restricted to upper case letters, numerals and limited punctuation. Over time, encodings capable of representing more characters were created, such as ASCII, ISO/IEC 8859, and Unicode encodings such as UTF-8 and UTF-16.
The term "code page" originated from IBM's EBCDIC-based mainframe systems, but Microsoft, SAP, and Oracle Corporation are among the vendors that use this term. The majority of vendors identify their own character sets by a name. In the case when there is a plethora of character sets (like in IBM), identifying character sets through a number is a convenient way to distinguish them. Originally, the code page numbers referred to the page numbers in the IBM standard character set manual, a condition which has not held for a long time. Vendors that use a code page system allocate their own code page number to a character encoding, even if it is better known by another name; for example, UTF-8 has been assigned page numbers 1208 at IBM, 65001 at Microsoft, and 4110 at SAP.
Transcoding is the direct digital-to-digital conversion of one encoding to another, such as for video data files, audio files (e.g., MP3, WAV), or character encoding (e.g., UTF-8, ISO/IEC 8859). This is usually done in cases where a target device (or workflow) does not support the format or has limited storage capacity that mandates a reduced file size, or to convert incompatible or obsolete data to a better-supported or modern format.
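In the character-encoding case, transcoding amounts to decoding bytes with the source encoding and re-encoding them with the target one; a minimal Python sketch, assuming input bytes in ISO/IEC 8859-1:

```python
# Minimal sketch of character-set transcoding: ISO/IEC 8859-1 bytes -> UTF-8 bytes.
legacy_bytes = b"na\xefve caf\xe9"            # "naïve café" encoded as ISO/IEC 8859-1

text = legacy_bytes.decode("iso-8859-1")      # interpret bytes using the source encoding
utf8_bytes = text.encode("utf-8")             # re-encode using the target encoding

print(text)         # naïve café
print(utf8_bytes)   # b'na\xc3\xafve caf\xc3\xa9'
```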
In the video world, transcoding can be performed while files are being searched, as well as for presentation. For example, Cineon and DPX files have been widely used as a common format for digital cinema, but the data size of a two-hour movie is about 8 terabytes (TB). That large size can increase the cost and difficulty of handling movie files. Transcoding into the lossless JPEG 2000 format mitigates this, because JPEG 2000 has better data compression performance than other lossless coding technologies; in many cases, JPEG 2000 can compress images to half their original size.
Code page 866 (CCSID 866) (CP 866, "DOS Cyrillic Russian") is a code page used under DOS and OS/2 in Russia to write Cyrillic script. It is based on the "alternative code page" (Russian: Альтернативная кодировка) developed in 1984 at IHNA AS USSR and published in 1986 by a research group at the Academy of Sciences of the USSR. The code page was widely used during the DOS era because it preserves all of the pseudographic symbols of code page 437 (unlike the "Main code page" or Code page 855) and maintains the alphabetical order (although non-contiguously) of the Cyrillic letters (unlike KOI8-R). Initially this encoding was available only in the Russian version of MS-DOS 4.01 (1990), but with MS-DOS 6.22 it became available in every language version.
The WHATWG Encoding Standard, which specifies the character encodings that compliant browsers must support for HTML5, includes Code page 866. It is the only single-byte encoding listed that is not an ISO 8859 part, a Mac OS-specific encoding, a Microsoft Windows-specific encoding (Windows-874 or Windows-125x), or a KOI-8 variant. Authors of new pages and the designers of new protocols are instructed to use UTF-8 instead.
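As a hedged illustration of what the code page preserves, the Python sketch below (byte values picked purely for illustration) decodes a few CP866 bytes: the high half of the table carries both Cyrillic letters and the box-drawing (pseudographic) symbols inherited from code page 437.

```python
# Illustrative: CP866 adds Cyrillic letters while keeping code page 437's box-drawing symbols.
cyrillic = bytes(range(0x80, 0x90)).decode("cp866")   # АБВГДЕЖЗИЙКЛМНОП, start of the Cyrillic block
boxdraw = bytes([0xC9, 0xCD, 0xBB]).decode("cp866")   # ╔═╗, pseudographics shared with CP437

print(cyrillic)
print(boxdraw)

# A DOS-era file stores Cyrillic text as one byte per character:
print("Привет".encode("cp866"))                       # b'\x8f\xe0\xa8\xa2\xa5\xe2'
```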
Kenneth Lane Thompson (born February 4, 1943) is an American pioneer of computer science. Thompson worked at Bell Labs for most of his career where he designed and implemented the original Unix operating system. He also invented the B programming language, the direct predecessor to the C language, and was one of the creators and early developers of the Plan 9 operating system. Other notable contributions included his work on regular expressions and early computer text editors QED and ed, the definition of the UTF-8 encoding, and his work on computer chess that included the creation of endgame tablebases and the chess machine Belle.
Since 2006, Thompson has worked at Google, where he co-developed the Go language. In 1983, he won the Turing Award with his long-term colleague Dennis Ritchie. He is considered one of the greatest computer programmers of all time.
A telegraph code is one of the character encodings used to transmit information by telegraphy. Morse code is the best-known such code. Telegraphy usually refers to the electrical telegraph, but telegraph systems using the optical telegraph were in use before that. A code consists of a number of code points, each corresponding to a letter of the alphabet, a numeral, or some other character. In codes intended for machines rather than humans, code points for control characters, such as carriage return, are required to control the operation of the mechanism. Each code point is made up of a number of elements arranged in a unique way for that character. There are usually two types of element (a binary code), but more element types were employed in some codes not intended for machines. For instance, American Morse code had about five elements, rather than the two (dot and dash) of International Morse Code.
Codes meant for human interpretation were designed so that the characters that occurred most often had the fewest elements in the corresponding code point. For instance, the Morse code for E, the most common letter in English, is a single dot ( ▄ ), whereas Q is ▄▄▄ ▄▄▄ ▄ ▄▄▄ . These arrangements meant that messages could be sent more quickly and that operators took longer to fatigue. Telegraphs were always operated by humans until late in the 19th century. When automated telegraphy arrived, codes with variable-length code points were inconvenient for the machine designs of the period; fixed-length codes were used instead. The first of these was the Baudot code, a five-bit code. Baudot has only enough code points to print in upper case. Later codes had more bits (ASCII has seven) so that both upper and lower case could be printed. Beyond the telegraph age, modern computers require a very large number of code points (Unicode has 21 bits) so that multiple languages and alphabets (character sets) can be handled without having to change the character encoding. Modern computers can easily handle variable-length codes such as UTF-8 and UTF-16, which have now become ubiquitous.
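The frequency-based design is easy to demonstrate; the hedged Python sketch below (letter set chosen purely for illustration) compares element counts for a common and a rare letter, and then shows the analogous variation in code-unit counts for the modern variable-length encodings mentioned above.

```python
# Illustrative subset of International Morse code: frequent letters got fewer elements.
MORSE = {"E": ".", "T": "-", "A": ".-", "Q": "--.-", "Z": "--.."}

for letter in ("E", "Q"):
    print(f"{letter}: {MORSE[letter]} ({len(MORSE[letter])} elements)")

# UTF-8 is also variable-length: low code points (including all of ASCII) take fewer bytes.
for ch in ("e", "я", "€"):
    print(f"{ch}: {len(ch.encode('utf-8'))} byte(s) in UTF-8, "
          f"{len(ch.encode('utf-16-le'))} byte(s) in UTF-16")
```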