Speech recognition in the context of Natural-language processing


⭐ Core Definition: Speech recognition

Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT), is a sub-field of computational linguistics concerned with methods and technologies that translate spoken language into text or other interpretable forms.

Speech recognition applications include voice user interfaces, where the user speaks to a device, which "listens" and processes the audio. Common applications include voice dialing, call routing, home automation, and aircraft control; such command-and-control uses are known as direct voice input. Productivity applications include searching audio recordings, creating transcripts, and dictation.
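
As a concrete illustration of speech-to-text, the following minimal sketch transcribes a short recording with the open-source SpeechRecognition package for Python. The file name and the choice of the free Google Web Speech backend are assumptions made for the example, not part of the definition above.

    # Minimal speech-to-text sketch (pip install SpeechRecognition).
    # "meeting.wav" is a hypothetical audio file.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("meeting.wav") as source:      # load a WAV/AIFF/FLAC file
        audio = recognizer.record(source)            # read the whole file into memory

    try:
        text = recognizer.recognize_google(audio)    # send the audio to a hosted recognizer
        print("Transcript:", text)
    except sr.UnknownValueError:
        print("The recognizer could not understand the audio")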


In this Dossier

Speech recognition in the context of Natural language processing

Natural language processing (NLP) is the processing of natural language information by a computer. The study of NLP, a subfield of computer science, is generally associated with artificial intelligence. NLP is related to information retrieval, knowledge representation, computational linguistics, and more broadly with linguistics.

Major processing tasks in an NLP system include speech recognition, text classification, natural language understanding, and natural language generation.
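
To make the text-side tasks tangible, the sketch below runs a short utterance through spaCy, a common NLP library, and prints tokens, part-of-speech tags, and named entities. The model name is the standard small English model; the sentence is invented for illustration.

    # Sketch of a text-side NLP pipeline with spaCy
    # (pip install spacy; python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Call me a taxi to Heathrow at 7 am tomorrow.")

    for token in doc:
        print(token.text, token.pos_)      # tokenization + part-of-speech tagging
    for ent in doc.ents:
        print(ent.text, ent.label_)        # named-entity recognition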

View the full Wikipedia page for Natural language processing

Speech recognition in the context of Machine learning

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions. Within deep learning, a subdiscipline of machine learning, advances have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance.

ML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. The application of ML to business problems is known as predictive analytics.
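
To make "learning from data and generalising to unseen data" concrete, the sketch below fits a classifier on one portion of a dataset and evaluates it on a held-out portion with scikit-learn. The dataset and model are arbitrary choices for illustration.

    # Train on some data, evaluate on unseen data, with scikit-learn.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)                          # learn from the training split
    print("accuracy on unseen data:", model.score(X_test, y_test))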

View the full Wikipedia page for Machine learning

Speech recognition in the context of Whispering

Whispering is an unvoiced mode of phonation in which the vocal cords are abducted so that they do not vibrate; air passes between the arytenoid cartilages to create audible turbulence during speech. Supralaryngeal articulation remains the same as in normal speech.

In normal speech, the vocal cords alternate between states of voice and voicelessness. In whispering, only the voiced segments change, so that the vocal cords alternate between whisper and voicelessness (though the acoustic difference between the two states is minimal). Because of this, speech recognition is harder to implement for whispered speech: with no voiced tone, the characteristic spectral cues used to detect syllables and words are largely missing. More advanced techniques such as neural networks may nevertheless be used, as is done by Amazon Alexa.
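
One way to see why whispered speech is harder for a recognizer is to check for voicing directly. The sketch below estimates a fundamental-frequency track with librosa's pYIN pitch tracker; for a whispered recording the voiced flag stays false for almost every frame. The file name is hypothetical, and this is only a diagnostic illustration, not a recognition method.

    # Voicing check with librosa's pYIN pitch tracker (pip install librosa).
    # "whisper.wav" is a hypothetical recording.
    import librosa
    import numpy as np

    y, sr = librosa.load("whisper.wav", sr=None)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    print("fraction of voiced frames:", np.mean(voiced_flag))   # near 0 for whispered speech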

View the full Wikipedia page for Whispering

Speech recognition in the context of Typing

Typing is the process of entering text by pressing keys on a typewriter, computer keyboard, mobile phone, or calculator. It can be distinguished from other means of text input, such as handwriting and speech recognition; text can be in the form of letters, numbers, and other symbols. The world's first typist was Lillian Sholes from Wisconsin in the United States, the daughter of Christopher Latham Sholes, who invented the first practical typewriter.

User interface features such as spell checker and autocomplete serve to facilitate and speed up typing and to prevent or correct errors the typist may make.
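
As a toy illustration of the autocomplete feature mentioned above, the snippet below suggests completions by simple prefix matching against a small word list. The vocabulary is a stand-in; real systems rank candidates using language models and usage frequency.

    # Naive prefix-based autocomplete sketch; the vocabulary is a stand-in.
    def autocomplete(prefix, vocabulary, limit=3):
        prefix = prefix.lower()
        return [word for word in vocabulary if word.startswith(prefix)][:limit]

    words = ["speech", "speaker", "spell", "keyboard", "recognition"]
    print(autocomplete("spe", words))   # ['speech', 'speaker', 'spell']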

View the full Wikipedia page for Typing

Speech recognition in the context of Speech synthesizer

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.
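
A minimal sketch of the concatenative idea described above: pre-recorded unit waveforms are looked up and joined end to end. The unit files and the sample rate are hypothetical assumptions; production systems also smooth the joins and select among many candidate units.

    # Toy concatenative synthesis: join pre-recorded word clips end to end.
    # The WAV files are hypothetical and assumed to share one sample rate.
    import numpy as np
    import soundfile as sf        # pip install soundfile

    units = ["units/good.wav", "units/morning.wav"]   # utterance: "good morning"
    waveform = np.concatenate([sf.read(path)[0] for path in units])
    sf.write("synthesized.wav", waveform, samplerate=16000)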

View the full Wikipedia page for Speech synthesizer

Speech recognition in the context of Deep learning

In machine learning, deep learning focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience and revolves around stacking artificial neurons into layers and "training" them to process data. The adjective "deep" refers to the use of multiple layers (ranging from three to several hundred or thousands) in the network. Methods used can be supervised, semi-supervised or unsupervised.

Some common deep learning network architectures include fully connected networks, deep belief networks, recurrent neural networks, convolutional neural networks, generative adversarial networks, transformers, and neural radiance fields. These architectures have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.
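
To make "stacking artificial neurons into layers" concrete, the sketch below defines a small fully connected network in PyTorch and runs one forward pass. The layer sizes and input dimensionality are arbitrary placeholders, not taken from any particular system.

    # A small multilayer ("deep") fully connected network in PyTorch.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(40, 128), nn.ReLU(),    # input layer -> hidden layer 1
        nn.Linear(128, 128), nn.ReLU(),   # hidden layer 2
        nn.Linear(128, 10),               # output layer (e.g. 10 classes)
    )
    scores = model(torch.randn(1, 40))    # forward pass on one dummy example
    print(scores.shape)                   # torch.Size([1, 10])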

View the full Wikipedia page for Deep learning

Speech recognition in the context of Apple Vision Pro

The Apple Vision Pro is a mixed-reality headset developed by Apple. It was announced on June 5, 2023, at Apple's Worldwide Developers Conference (WWDC) and was released first in the US, then in global territories throughout 2024. Apple Vision Pro is Apple's first new major product category since the release of the Apple Watch in 2015.

Apple markets Apple Vision Pro as a spatial computer where digital media is integrated with the real world. Physical inputs—such as motion gestures, eye tracking, and speech recognition—can be used to interact with the system. Apple has avoided marketing the device as a virtual reality headset when discussing the product in presentations and marketing.

View the full Wikipedia page for Apple Vision Pro

Speech recognition in the context of Conversational commerce

Conversational commerce is e-commerce conducted via various means of conversation (live support on e-commerce websites, online chat using messaging apps, chatbots on messaging apps or websites, voice assistants) and using technologies such as speech recognition, speaker recognition (voice biometrics), natural language processing, and artificial intelligence.

View the full Wikipedia page for Conversational commerce

Speech recognition in the context of Digital signal processor

A digital signal processor (DSP) is a specialized microprocessor chip, with its architecture optimized for the operational needs of digital signal processing. DSPs are fabricated on metal–oxide–semiconductor (MOS) integrated circuit chips. They are widely used in audio signal processing, telecommunications, digital image processing, radar, sonar and speech recognition systems, and in common consumer electronic devices such as mobile phones, disk drives and high-definition television (HDTV) products.

The goal of a DSP is usually to measure, filter, or compress continuous real-world analog signals. Most general-purpose microprocessors can also execute digital signal processing algorithms successfully, but may not be able to keep up with such processing continuously in real time. Dedicated DSPs also tend to have better power efficiency, which makes them more suitable for portable devices such as mobile phones, where power consumption is constrained. DSPs often use special memory architectures that can fetch multiple data items or instructions at the same time.
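
As an example of the filtering task mentioned above, the sketch below designs a low-pass Butterworth filter with SciPy and applies it to a synthetic signal. The cut-off frequency, sampling rate, and signal are arbitrary values chosen for illustration.

    # Low-pass filtering sketch with SciPy; frequencies are arbitrary examples.
    import numpy as np
    from scipy.signal import butter, lfilter

    fs = 8000                                    # sampling rate in Hz
    t = np.arange(0, 1.0, 1.0 / fs)
    signal = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)

    b, a = butter(4, 1000, btype="low", fs=fs)   # 4th-order filter, 1 kHz cut-off
    filtered = lfilter(b, a, signal)             # keeps the 300 Hz tone, attenuates 3 kHz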

View the full Wikipedia page for Digital signal processor

Speech recognition in the context of Speech processing

Speech processing is the study of speech signals and of methods for processing them. The signals are usually handled in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer, and output of speech signals. Speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement, and speaker recognition.
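
A common first step in several of these tasks is converting a digitized speech signal into acoustic features. The sketch below computes mel-frequency cepstral coefficients (MFCCs) with librosa; the file name is hypothetical.

    # Feature extraction sketch: MFCCs from a digitized speech signal.
    # "utterance.wav" is a hypothetical recording (pip install librosa).
    import librosa

    y, sr = librosa.load("utterance.wav", sr=16000)        # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # 13 coefficients per frame
    print(mfcc.shape)                                       # (13, number_of_frames)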

View the full Wikipedia page for Speech processing

Speech recognition in the context of Subtitle

Subtitles are texts representing the contents of the audio in a film, television show, opera, or other audiovisual media. Subtitles might provide a transcription or translation of spoken dialogue. Although naming conventions can vary, captions are subtitles that also include written descriptions of other elements of the audio, like music or sound effects, and are thus especially helpful to deaf or hard-of-hearing people. Subtitles may also add information that is not present in the audio. Localized subtitles provide cultural context to viewers; for example, a subtitle could be used to explain to an audience unfamiliar with sake that it is a type of Japanese wine. Lastly, subtitles are sometimes used for humor, as in Annie Hall, where subtitles show the characters' inner thoughts, which contradict what they are saying in the audio.

Creating, delivering, and displaying subtitles is a complicated, multi-step endeavor. First, the text of the subtitles needs to be written. When there is plenty of time to prepare, this can be done by hand. However, for media produced in real time, like live television, it may be done by stenographers or using automated speech recognition. Subtitles written by fans, rather than more official sources, are referred to as fansubs. Regardless of who does the writing, the subtitles must include information on when each line of text should be displayed.
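
That timing information typically ends up in a plain-text format such as SubRip (.srt), in which each cue has an index, a start and end time, and the text to display. The snippet below writes two made-up cues in that format.

    # Writing two subtitle cues in SubRip (.srt) format;
    # the cue text and timestamps are made-up examples.
    cues = [
        (1, "00:00:01,000", "00:00:03,500", "Welcome back to the show."),
        (2, "00:00:04,000", "00:00:06,200", "Tonight we talk about speech recognition."),
    ]
    with open("episode.srt", "w", encoding="utf-8") as f:
        for index, start, end, text in cues:
            f.write(f"{index}\n{start} --> {end}\n{text}\n\n")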

View the full Wikipedia page for Subtitle

Speech recognition in the context of Siri

Siri (/ˈsɪri/ SEER-ee) is a virtual assistant and chatbot acquired, developed, and popularized by Apple Inc., and included in the iOS, iPadOS, watchOS, macOS, tvOS, audioOS, and visionOS operating systems. It uses voice queries, gesture-based control, focus tracking, and a natural-language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Internet services. With continued use, it adapts to users' individual language usage, searches, and preferences, returning individualized results.

Siri is a spin-off from a project developed by the SRI International Artificial Intelligence Center. Its speech recognition engine was provided by Nuance Communications, and it uses advanced machine learning technologies to function. Its original American, British, and Australian voice actors recorded their respective voices around 2005, unaware of the recordings' eventual usage. Siri was released as an app for iOS in February 2010. Two months later, Apple acquired it and integrated it into the iPhone 4s at that device's release in October 2011, removing the separate app from the iOS App Store. Siri has since been an integral part of Apple's products, having been adapted into other hardware devices including newer iPhone models, iPad, iPod Touch, Mac, AirPods, Apple TV, HomePod, and Apple Vision Pro.

View the full Wikipedia page for Siri

Speech recognition in the context of Language model

A language model is a probabilistic model of natural language. Language models are useful for a variety of tasks, including speech recognition, machine translation, natural language generation (generating more human-like text), optical character recognition, route optimization, handwriting recognition, grammar induction, and information retrieval.

Large language models (LLMs), currently their most advanced form, are predominantly based on transformers trained on larger datasets (frequently using text scraped from the public internet). They have superseded recurrent neural network-based models, which had previously superseded purely statistical models such as the word n-gram language model.
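
For a sense of what the word n-gram models mentioned above compute, the sketch below builds a bigram model from a toy corpus and estimates the probability of the next word given the previous one. Real models smooth these counts and are trained on far larger corpora.

    # Toy word-bigram language model; the corpus is a made-up example.
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat slept".split()
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1

    def p_next(prev, nxt):
        counts = bigrams[prev]
        return counts[nxt] / sum(counts.values()) if counts else 0.0

    print(p_next("the", "cat"))   # 2/3: "the" is followed by "cat" twice, "mat" once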

View the full Wikipedia page for Language model

Speech recognition in the context of Voice user interface

A voice-user interface (VUI) enables spoken human interaction with computers, using speech recognition to understand spoken commands and answer questions, and typically text-to-speech to play a reply. A voice command device is a device controlled with a voice user interface.

Voice user interfaces have been added to automobiles, home automation systems, computer operating systems, home appliances like washing machines and microwave ovens, and television remote controls. They are the primary way of interacting with virtual assistants on smartphones and smart speakers. Older automated attendants (which route phone calls to the correct extension) and interactive voice response systems (which conduct more complicated transactions over the phone) can respond to the pressing of keypad buttons via DTMF tones, but those with a full voice user interface allow callers to speak requests and responses without having to press any buttons.
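
A voice user interface ultimately maps a recognized utterance to an action. The toy sketch below routes a transcribed command to a handler and returns a reply that a text-to-speech engine would normally speak; the commands and handlers are invented for illustration.

    # Toy command routing for a voice user interface. In a real VUI the
    # transcript would come from a speech recognizer and the reply would be
    # passed to a text-to-speech engine instead of print().
    def lights_on():
        return "Turning the lights on."

    def play_music():
        return "Playing your playlist."

    handlers = {"turn on the lights": lights_on, "play some music": play_music}

    def handle(transcript):
        action = handlers.get(transcript.lower().strip())
        return action() if action else "Sorry, I did not understand that."

    print(handle("Turn on the lights"))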

View the full Wikipedia page for Voice user interface

Speech recognition in the context of Direct voice input

Direct voice input (DVI), sometimes called voice input control (VIC), is a style of human–machine interaction (HMI) in which the user issues voice commands to a machine, which interprets them through speech recognition.

In the field of military aviation, DVI has been introduced into the cockpits of several modern military aircraft, such as the Eurofighter Typhoon, the Lockheed Martin F-35 Lightning II, the Dassault Rafale, the KF-21 Boramae and the Saab JAS 39 Gripen. Such systems have also been used for various other purposes, including industry control systems and speech recognition assistance for impaired individuals.

View the full Wikipedia page for Direct voice input

Speech recognition in the context of Facial recognition system

A facial recognition system is a technology potentially capable of matching a human face from a digital image or a video frame against a database of faces. Such a system is typically employed to authenticate users through ID verification services, and works by pinpointing and measuring facial features from a given image.

Development of such systems began in the 1960s as an early computer application. Since their inception, facial recognition systems have seen wider use in recent times on smartphones and in other forms of technology, such as robotics. Because computerized facial recognition involves measuring a human's physiological characteristics, facial recognition systems are classified as biometrics. Although the accuracy of facial recognition as a biometric technology is lower than that of iris recognition, fingerprint image acquisition, palm recognition, or voice recognition, it is widely adopted due to its contactless process. Facial recognition systems have been deployed in advanced human–computer interaction, video surveillance, law enforcement, passenger screening, decisions on employment and housing, and automatic indexing of images.

View the full Wikipedia page for Facial recognition system

Speech recognition in the context of Speaker recognition

Speaker recognition is the identification of a person from characteristics of their voice. It is used to answer the question "Who is speaking?" The term voice recognition can refer to either speaker recognition or speech recognition. Speaker verification (also called speaker authentication) contrasts with identification, and speaker recognition differs from speaker diarisation (recognizing when the same speaker is speaking).

Recognizing the speaker can simplify the task of translating speech in systems that have been trained on specific voices, or it can be used to authenticate or verify the identity of a speaker as part of a security process. Speaker recognition has a history dating back some four decades as of 2019 and uses the acoustic features of speech that have been found to differ between individuals. These acoustic patterns reflect both anatomy and learned behavioral patterns.
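
A naive way to illustrate comparing such acoustic patterns is to average spectral features over each recording and measure how similar the averages are. The sketch below does this with MFCCs and cosine similarity; the file names are hypothetical, and modern systems instead compare learned speaker embeddings (for example, x-vectors).

    # Naive speaker-similarity sketch: average MFCC vectors + cosine similarity.
    # File names are hypothetical; real systems use learned speaker embeddings.
    import librosa
    import numpy as np

    def voiceprint(path):
        y, sr = librosa.load(path, sr=16000)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

    a, b = voiceprint("enrolled_speaker.wav"), voiceprint("unknown_speaker.wav")
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print("cosine similarity:", similarity)   # closer to 1 suggests the same speaker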

View the full Wikipedia page for Speaker recognition

Speech recognition in the context of Journal of the International Phonetic Association

The Journal of the International Phonetic Association (JIPA; /ˈdʒaɪpə/) is a peer-reviewed academic journal that appears three times a year. It is published by Cambridge University Press on behalf of the International Phonetic Association. It was established as Dhi Fonètik Tîtcer ("The Phonetic Teacher") in 1886. In 1889, it was renamed Le Maître Phonétique and French was designated as the Association's official language. It was written entirely in the IPA, with its name accordingly written "lə mɛːtrə fɔnetik" and hence abbreviated "mf", until 1971, when it obtained its current name and began to be written in the Latin script. It covers topics in phonetics and applied phonetics such as speech therapy and voice recognition, as well as "Illustrations of the IPA" that describe individual languages using the IPA. The journal is abstracted and indexed in the MLA Bibliography.

View the full Wikipedia page for Journal of the International Phonetic Association