Speech is a versatile means of communication. It conveys linguistic, speaker, and environmental information. Even though such information is encoded in a complex form, humans can decode most of it with relative ease. Among all speech tasks, automatic speech recognition (ASR) has been the focus of many researchers for several decades. In this task, the linguistic message is the information of interest. Speech recognition applications range from dictating a text to generating subtitles in real time for a television broadcast. Despite this human ability, researchers have learned that extracting information from speech automatically is not a straightforward process.
Definition of Automatic Speech Recognition
In recent years, speech recognition has become a practical technology that is now being implemented in different languages around the world. It has been used in real-world human language applications, such as information retrieval. Speech is the most common means of human communication, because exchanging information is the basic role of conversation. In speech recognition, the speech captured by a microphone or a telephone is converted from an acoustic signal into a sequence of words. It can be defined as follows:
“Automatic speech recognition (ASR) can be defined as the independent, computer‐driven transcription of spoken language into readable text in real time.”
Automatic speech recognition is primarily used to convert spoken words into computer text. It is also used for authenticating users via their voice (biometric authentication) and for performing an action based on spoken instructions. Some systems, typically speaker-dependent ones, require preconfigured or saved voice samples of the primary user(s). In that case, the user trains the system by storing their speech patterns and vocabulary in it.
Automatic speech recognition (ASR) systems convert speech from a recorded audio signal to text. Humans convert words to speech with their speech production mechanism. An ASR system aims to infer those original words given the observable signal.
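This inference is commonly framed as picking the word sequence W that maximizes P(W | O) for the observed signal O, which by Bayes' rule is proportional to P(O | W) P(W). The following is a minimal sketch of that idea, not a real recognizer: the candidate sentences, the probability values, and the names `decode`, `acoustic_likelihood`, and `language_prior` are all invented for illustration.

```python
import math

def decode(candidates, acoustic_likelihood, language_prior):
    """Return the candidate word sequence with the highest posterior score.

    The score is log P(O | W) + log P(W). The shared term P(O) is dropped
    because it does not change which candidate wins.
    """
    def score(words):
        return math.log(acoustic_likelihood[words]) + math.log(language_prior[words])
    return max(candidates, key=score)

# Toy numbers, made up for this example only.
candidates = ["recognize speech", "wreck a nice beach"]
acoustic_likelihood = {"recognize speech": 0.30, "wreck a nice beach": 0.35}
language_prior = {"recognize speech": 0.020, "wreck a nice beach": 0.001}

best = decode(candidates, acoustic_likelihood, language_prior)
```

Here the second candidate fits the audio slightly better, but the language prior makes the first one the more plausible transcription overall.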
Figure 1: Speech Recognition System
Speech is dynamic by nature. An acoustic model is used to model the statistics of speech features for each speech unit of the language, such as a phone or a word. Figure 1 shows the basic block diagram of a speech recognition system. As can be seen there, acoustic models are required to analyze the speech feature vectors for their acoustic content.
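To make the idea of an acoustic model concrete, the sketch below scores one feature vector against a per-phone statistical model. It assumes a single diagonal-covariance Gaussian per phone, which is a deliberate simplification (practical systems use richer models such as Gaussian mixtures with hidden Markov models, or neural networks). The phone labels, means, variances, and the names `log_gaussian` and `most_likely_phone` are hypothetical.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

# Hypothetical per-phone models: (mean vector, variance vector).
acoustic_model = {
    "aa": ([1.0, 0.5], [0.2, 0.1]),
    "iy": ([-0.8, 1.2], [0.3, 0.2]),
}

def most_likely_phone(feature_vector):
    """Pick the phone whose Gaussian gives the frame the highest log-likelihood."""
    return max(
        acoustic_model,
        key=lambda phone: log_gaussian(feature_vector, *acoustic_model[phone]),
    )

frame = [0.9, 0.6]  # one made-up 2-dimensional feature vector
phone = most_likely_phone(frame)
```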
Speech Recognition Challenges
The speech analytics market is also expected to grow along with the automatic speech recognition market. Speech analytics, also known as audio mining, is widely used to derive meaning from captured words. The study of voice data is expected to support better decisions on operational and strategic issues.
Inaccuracy in ASR systems is one of the biggest challenges faced by the speech-based biometrics industry. Reduced accuracy due to surrounding noise is a significant disadvantage for voice recognition applications that must be highly reliable. This sensitivity of ASR systems to their acoustic environment is a key obstacle to the acceptance of such applications.
A lack of efficient IT infrastructure is expected to hinder overall market growth. Further, a lack of knowledge in some organizations, and of the ability to adopt new technology, is anticipated to restrain industry growth.
Voice recognition broadly utilizes front-end and back-end techniques. Front-end techniques are limited by challenges of time and accuracy. Back-end recognition techniques, owing to their high speed and precision, are more widely used. They are expected to handle errors and disturbances caused by noise. Such a system also needs to detect low-pitched sounds and is therefore highly sensitive.
How Does Automatic Speech Recognition (ASR) Work?
The goal of an ASR system is to accurately and efficiently convert a speech signal into a text transcription of the spoken words, independent of the speaker, the environment, or the device used to record the speech (i.e. the microphone). The process begins when a speaker decides what to say and actually speaks a sentence. The software then captures a speech waveform, which embodies the words of the sentence as well as the extraneous sounds and pauses in the spoken input. Next, the software attempts to decode the speech into the best estimate of the sentence. First, it converts the speech signal into a sequence of vectors that are measured throughout the duration of the signal. Then, using a syntactic decoder, it generates a valid sequence of representations.
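The step of converting the signal into a sequence of vectors can be sketched as follows. This is an illustrative simplification: it cuts the signal into short overlapping frames and computes only two simple values per frame (log energy and zero-crossing rate), whereas practical systems usually compute richer spectral features such as MFCCs. The function names, the synthetic test tone, and the frame sizes (25 ms frames every 10 ms, a commonly used setting) are assumptions made for this example.

```python
import math

def frame_signal(samples, frame_length, hop_length):
    """Split a list of samples into overlapping fixed-length frames."""
    return [
        samples[start:start + frame_length]
        for start in range(0, len(samples) - frame_length + 1, hop_length)
    ]

def frame_features(frame):
    """Compute a small 2-dimensional feature vector for one frame:
    log energy and zero-crossing rate."""
    energy = sum(s * s for s in frame)
    log_energy = math.log(energy + 1e-10)  # small floor avoids log(0)
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    return [log_energy, crossings / (len(frame) - 1)]

# Synthetic 440 Hz tone, 0.1 s at 8 kHz, standing in for recorded speech.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]

# 25 ms frames every 10 ms (200 and 80 samples at 8 kHz).
frames = frame_signal(signal, frame_length=200, hop_length=80)
features = [frame_features(f) for f in frames]
```

Each entry of `features` is one vector in the sequence that the later decoding stage would consume.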
Applications of Automatic Speech Recognition
Automatic speech recognition (ASR) is an application that consistently exploits advances in computational capability. With the availability of a new generation of highly parallel single-chip computation platforms, ASR researchers are faced with the question of how effectively unlimited computing power could be used to make speech recognition better.
There are three major reasons why so much research and effort has gone into teaching machines to recognize and understand speech:
- Accessibility for the deaf and hard of hearing
- Cost reduction through automation
- Searchable text capability
 What is Automatic Speech Recognition? Available online at: http://support.docsoft.com/help/whitepaper-asr.pdf
 “Chapter 2: Automatic Speech Recognition”, Statistical Pronunciation Modeling for Non-Native Speech Processing, Signals and Communication Technology, DOI: 10.1007/978-3-642-19586-0_2, Springer-Verlag Berlin Heidelberg 2011
 Adami, André Gustavo. “Automatic speech recognition: From the beginning to the Portuguese language”, in The International Conference on Computational Processing of Portuguese (PROPOR), 2010.