Text corpus in the context of Frequency list


Text corpus in the context of Frequency list

Text corpus Study page number 1 of 3

Play TriviaQuestions Online!

or

Skip to study material about Text corpus in the context of "Frequency list"


⭐ Core Definition: Text corpus

In linguistics and natural language processing, a corpus (pl.: corpora) or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.

Annotated, they have been used in corpus linguistics for statistical hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

↓ Menu
HINT:

In this Dossier

Text corpus in the context of Cuneiform

Cuneiform is a logo-syllabic writing system that was used to write several languages of the ancient Near East. The script was in active use from the early Bronze Age until the beginning of the Common Era. Cuneiform scripts are marked by and named for the characteristic wedge-shaped impressions (Latin: cuneus) which form their signs. Cuneiform is the earliest known writing system and was originally developed to write the Sumerian language of southern Mesopotamia (modern Iraq).

Over the course of its history, cuneiform was adapted to write a number of languages in addition to Sumerian. Akkadian texts are attested from the 24th century BC onward and make up the bulk of the cuneiform record. Akkadian cuneiform was itself adapted to write the Hittite language in the early 2nd millennium BC. The other languages with significant cuneiform corpora are Eblaite, Elamite, Hurrian, Luwian, and Urartian. The Old Persian and Ugaritic alphabets feature cuneiform-style signs; however, they are unrelated to the cuneiform logo-syllabary proper. The latest known cuneiform tablet, an astronomical almanac from Uruk, dates to AD 79/80.

View the full Wikipedia page for Cuneiform
↑ Return to Menu

Text corpus in the context of Linguistic prescription

Linguistic prescription is the establishment of rules defining publicly preferred usage of language, including rules of spelling, pronunciation, vocabulary, grammar, etc. Linguistic prescriptivism may aim to establish a standard language, teach what a particular society or sector of a society perceives as a correct or proper form, or advise on effective and stylistically apt communication. If usage preferences are conservative, prescription might appear resistant to language change; if radical, it may produce neologisms. Such prescriptions may be motivated by consistency (making a language simpler or more logical); rhetorical effectiveness; tradition; aesthetics or personal preferences; linguistic purism or nationalism (i.e. removing foreign influences); or to avoid causing offense (etiquette or political correctness).

Prescriptive approaches to language are often contrasted with the descriptive approach of academic linguistics, which observes and records how language is actually used (while avoiding passing judgment). The basis of linguistic research is text (corpus) analysis and field study, both of which are descriptive activities. Description may also include researchers' observations of their own language usage. In the Eastern European linguistic tradition, the discipline dealing with standard language cultivation and prescription is known as "language culture" or "speech culture".

View the full Wikipedia page for Linguistic prescription
↑ Return to Menu

Text corpus in the context of Name of Italy

The etymology of the name of Italy has been the subject of reconstructions by linguists and historians. Considerations extraneous to the specifically linguistic reconstruction of the name have formed a rich corpus of solutions that are either associated with legend (the existence of a king named Italus) or in any case strongly problematic (such as the connection of the name with the grape vine, vitis in Latin).

One theory is that the name derives from the word Italói, a term with which the ancient Greeks designated a tribe of Sicels who had crossed the Strait of Messina and who inhabited the extreme tip of the Italic Peninsula, near today's Catanzaro. This is attested by the fact that the ancient Greek peoples who colonized present-day Calabria, referred to themselves as Italiotes, that is, inhabitants of Italy. This group of Italian people had worshiped the simulacrum of a calf (vitulus, in Latin), and the name would therefore mean "inhabitants of the land of calves (young bulls)". In any case, it is known that in archaic times the name indicated the part located in the extreme south of the Italian Peninsula.

View the full Wikipedia page for Name of Italy
↑ Return to Menu

Text corpus in the context of Egyptian language

The Egyptian language, or Ancient Egyptian (r n kmt; 'speech of Egypt'), is an extinct branch of the Afro-Asiatic language family that was spoken in ancient Egypt. It is known today from a large corpus of surviving texts, which were made accessible to the modern world following the decipherment of the ancient Egyptian scripts in the early 19th century.

Egyptian is one of the earliest known written languages, first recorded in the hieroglyphic script in the late 4th millennium BC. It is also the longest-attested human language, with a written record spanning over 4,000 years. Its classical form, known as "Middle Egyptian," served as the vernacular of the Middle Kingdom of Egypt and remained the literary language of Egypt until the Roman period.

View the full Wikipedia page for Egyptian language
↑ Return to Menu

Text corpus in the context of Hippocratic Corpus

The Hippocratic Corpus (Latin: Corpus Hippocraticum), or Hippocratic Collection, is a collection of around 60 early Ancient Greek medical works closely associated with the physician Hippocrates and his teachings. The Hippocratic Corpus covers many diverse aspects of medicine, from Hippocrates' medical theories to what he devised to be ethical means of medical practice, to addressing various illnesses. Even though it is considered a singular corpus that represents Hippocratic medicine, they vary (sometimes significantly) in content, age, style, methods, and views practiced; therefore, authorship is largely unknown. The ancient commentaries on this corpus, from writers such as Attalion and Oribasius, are myriad. Hippocrates began Western society's development of medicine, through a delicate blending of the art of healing and scientific observations. What Hippocrates was sharing from within his collection of works was not only how to identify symptoms of disease and proper diagnostic practices, but more essentially, he was alluding to his personable form of art, "The art of true living and the art of fine medicine combined." The Hippocratic Corpus became the foundation upon which Western medical practice was built.

View the full Wikipedia page for Hippocratic Corpus
↑ Return to Menu

Text corpus in the context of Indus script

The Indus script, also known as the Harappan script and the Indus Valley script, is a corpus of symbols produced by the Indus Valley Civilisation. Most inscriptions containing these symbols are extremely short, making it difficult to judge whether or not they constituted a writing system used to record a Harappan language, any of which are yet to be identified. Despite many attempts, the "script" has not yet been deciphered. There is no known bilingual inscription to help decipher the script, which shows no significant changes over time. However, some of the syntax (if that is what it may be termed) varies depending upon location.

The first publication of a seal with Harappan symbols dates to 1875, in a drawing by Alexander Cunningham. By 1992, an estimated 4,000 inscribed objects had been discovered, some as far afield as Mesopotamia due to existing Indus–Mesopotamia relations, with over 400 distinct signs represented across known inscriptions.

View the full Wikipedia page for Indus script
↑ Return to Menu

Text corpus in the context of Ancient Egyptian literature

Ancient Egyptian literature was written with the Egyptian language from ancient Egypt's pharaonic period until the end of Roman domination. It represents the oldest corpus of Egyptian literature. Along with Sumerian literature, it is considered the world's earliest literature.

Writing in ancient Egypt—both hieroglyphic and hieratic—first appeared in the late 4th millennium BC during the late phase of predynastic Egypt. By the Old Kingdom (26th century BC to 22nd century BC), literary works included funerary texts, epistles and letters, hymns and poems, and commemorative autobiographical texts recounting the careers of prominent administrative officials. It was not until the early Middle Kingdom (21st century BC to 17th century BC) that a narrative Egyptian literature was created. This was a "media revolution" which, according to Richard B. Parkinson, was the result of the rise of an intellectual class of scribes, new cultural sensibilities about individuality, unprecedented levels of literacy, and mainstream access to written materials. The creation of literature was thus an elite exercise, monopolized by a scribal class attached to government offices and the royal court of the ruling pharaoh. However, there is no full consensus among modern scholars concerning the dependence of ancient Egyptian literature on the sociopolitical order of the royal courts.

View the full Wikipedia page for Ancient Egyptian literature
↑ Return to Menu

Text corpus in the context of Gothic language

Gothic is an extinct East Germanic language that was spoken by the Goths. It is known primarily from the Codex Argenteus, a 6th-century copy of a 4th-century Bible translation, and is the only East Germanic language with a sizeable text corpus. All others, including Burgundian and Vandalic, are known, if at all, only from proper names that survived in historical accounts, and from loanwords in other, mainly Romance languages.

As a Germanic language, Gothic is a part of the Indo-European language family. It is the earliest Germanic language that is attested in any sizable texts, but it lacks any modern descendants. The oldest documents in Gothic date back to the fourth century. The language was in decline by the mid-sixth century, partly because of the military defeat of the Goths at the hands of the Franks, the elimination of the Goths in Italy, and geographic isolation (in Spain, the Gothic language lost its last and probably already declining function as a church language when the Visigoths converted from Arianism to Nicene Christianity in 589). The language survived as a domestic language in the Iberian Peninsula (modern-day Spain and Portugal) as late as the eighth century. Gothic-seeming terms are found in manuscripts subsequent to this date, but these may or may not belong to the same language.

View the full Wikipedia page for Gothic language
↑ Return to Menu

Text corpus in the context of The Complete Works

The complete works of an artist, writer, musician, group, etc., is a collection of all of their cultural works. For example, Complete Works of Shakespeare is an edition containing all the plays and poems of William Shakespeare. A Complete Works published edition of a text corpus is normally accompanied with additional information and critical apparatus. It may include notes, introduction, a biographical sketch, and may pay attention to textual variants.

Similarly, the term body of work may be used to describe the entirety of the creative or academic output produced by a particular individual or unit.

View the full Wikipedia page for The Complete Works
↑ Return to Menu

Text corpus in the context of Bilingual inscription

In epigraphy, a multilingual inscription is an inscription that includes the same text in two or more languages. A bilingual is an inscription that includes the same text in two languages (or trilingual in the case of three languages, etc.). Multilingual inscriptions are important for the decipherment of ancient writing systems, and for the study of ancient languages with small or repetitive corpora.

View the full Wikipedia page for Bilingual inscription
↑ Return to Menu

Text corpus in the context of Corpus linguistics

Corpus linguistics is an empirical method for the study of language by way of a text corpus (plural corpora). Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.

Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference. Large collections of text, though corpora may also be small in terms of running words, allow linguists to run quantitative analyses on linguistic concepts that may be difficult to test in a qualitative manner.

View the full Wikipedia page for Corpus linguistics
↑ Return to Menu

Text corpus in the context of Parallel text

A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla (Greek for "sixfold") placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery allowed the Ancient Egyptian language to begin being deciphered.

Large collections of parallel texts are called parallel corpora (see text corpus). Alignments of parallel corpora at sentence level are prerequisite for many areas of linguistic research. During translation, sentences can be split, merged, deleted, inserted or reordered by the translator. This makes alignment a non-trivial task.

View the full Wikipedia page for Parallel text
↑ Return to Menu