Variable-width encoding in the context of Information entropy




⭐ Core Definition: Variable-width encoding

In coding theory, a variable-width (or variable-length) encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set (a repertoire of symbols) for representation in a computer. This contrasts with fixed-width encodings, in which every symbol occupies the same number of code units.

Variable-length codes can allow sources to be compressed and decompressed with zero error (lossless data compression) and still be read back symbol by symbol. An independent and identically distributed source may be compressed almost arbitrarily close to its entropy. This is in contrast to fixed-length coding methods, for which data compression is only possible for large blocks of data, and any compression beyond the logarithm of the total number of possibilities comes with a finite (though perhaps arbitrarily small) probability of failure.
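The entropy bound can be made concrete with a Huffman code, the classic variable-length code. The following is an illustrative Python sketch (not from the source); it uses a distribution with power-of-two probabilities so the average code length meets the entropy exactly:

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return a dict mapping each symbol to its Huffman code length in bits."""
    # Each heap entry: (probability, unique tiebreaker, symbols in the subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol inside them.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * lengths[s] for s in probs)
print(lengths)           # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(entropy, avg_len)  # 1.75 1.75 — dyadic probabilities meet entropy exactly
```

For non-dyadic distributions the average length exceeds the entropy by less than one bit per symbol, which is why encoding blocks of symbols lets the rate approach the entropy arbitrarily closely.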


Variable-width encoding in the context of UTF-8

UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format – 8-bit. As of July 2025, almost every webpage is transmitted as UTF-8.

UTF-8 supports all 1,112,064 valid Unicode code points using a variable-width encoding of one to four one-byte (8-bit) code units.
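As a quick illustration of those one-to-four-byte lengths (a Python sketch not from the source; the example characters are chosen here):

```python
# Each Unicode code point occupies one to four bytes in UTF-8,
# depending on its numeric value.
for ch in ["A", "é", "€", "🎉"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
# A (U+0041) -> 1 byte, é (U+00E9) -> 2, € (U+20AC) -> 3, 🎉 (U+1F389) -> 4
```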

View the full Wikipedia page for UTF-8

Variable-width encoding in the context of UTF-16

UTF-16 (16-bit Unicode Transformation Format) is a character encoding that supports all 1,112,064 valid code points of Unicode. The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier, now-obsolete fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed, including most emoji and important CJK characters such as those used in personal and place names.

UTF-16 is used by the Windows API and by many programming environments such as Java and Qt. The variable-length nature of UTF-16, combined with the fact that most common characters fit in a single code unit (so the variable-length path is rarely tested), has led to many bugs in software, including in Windows itself.
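The one-or-two-unit behavior can be demonstrated by counting 16-bit code units directly (a Python sketch; the helper `utf16_units` is defined here for illustration, not taken from the source):

```python
# Characters in the Basic Multilingual Plane take one 16-bit code unit;
# code points above U+FFFF (such as most emoji) take a surrogate pair.
def utf16_units(s: str) -> int:
    # Length of the UTF-16-LE encoding, measured in 16-bit code units.
    return len(s.encode("utf-16-le")) // 2

for ch in ["A", "中", "😀"]:
    print(f"U+{ord(ch):04X}: {utf16_units(ch)} code unit(s)")
# "A" and "中" -> 1 unit each; "😀" (U+1F600) -> 2 units (a surrogate pair)
```

Code that assumes one unit per character works for the first two examples and silently breaks on the third, which is exactly the rarely-tested case the paragraph above describes.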

View the full Wikipedia page for UTF-16

Variable-width encoding in the context of Telegraph code

A telegraph code is one of the character encodings used to transmit information by telegraphy. Morse code is the best-known such code. Telegraphy usually refers to the electrical telegraph, but telegraph systems using the optical telegraph were in use before that. A code consists of a number of code points, each corresponding to a letter of the alphabet, a numeral, or some other character. In codes intended for machines rather than humans, code points for control characters, such as carriage return, are required to control the operation of the mechanism. Each code point is made up of a number of elements arranged in a unique way for that character. There are usually two types of element (a binary code), but more element types were employed in some codes not intended for machines. For instance, American Morse code had about five elements, rather than the two (dot and dash) of International Morse Code.

Codes meant for human interpretation were designed so that the characters that occurred most often had the fewest elements in the corresponding code point. For instance, Morse code for E, the most common letter in English, is a single dot ( ▄ ), whereas Q is  ▄▄▄ ▄▄▄ ▄ ▄▄▄ . These arrangements meant the message could be sent more quickly and it would take longer for the operator to become fatigued.

Telegraphs were always operated by humans until late in the 19th century. When automated telegraph messages came in, codes with variable-length code points were inconvenient for machine design of the period. Instead, codes with a fixed length were used. The first of these was the Baudot code, a five-bit code. Baudot has only enough code points to print in upper case. Later codes had more bits (ASCII has seven) so that both upper and lower case could be printed.

Beyond the telegraph age, modern computers require a very large number of code points (Unicode has 21 bits) so that multiple languages and alphabets (character sets) can be handled without having to change the character encoding. Modern computers can easily handle variable-length codes such as UTF-8 and UTF-16, which have now become ubiquitous.
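The frequency-based design principle shows up directly in the element counts of International Morse code (an illustrative Python snippet; the subset of letters shown is chosen here, not in the source):

```python
# International Morse code assigns the fewest elements to the most
# frequent English letters (a small illustrative subset).
morse = {"E": ".", "T": "-", "A": ".-", "Q": "--.-", "J": ".---"}
for letter, code in morse.items():
    print(f"{letter}: {code} ({len(code)} element(s))")
# Frequent letters like E and T need one element; rare Q and J need four.
```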

View the full Wikipedia page for Telegraph code