International Journal of Speech Technology

Editor-in-Chief: Amy Neustein

ISSN: 1381-2416 (print version)
ISSN: 1572-8110 (electronic version)

Journal no. 10772

Large Vocabulary Continuous Speech Recognition

This virtual issue contains leading-edge work by speech scientists at private companies and research labs who have developed voice applications that use some of the most advanced speech recognition technologies in real-world settings. The virtual issue opens with a demonstration of IBM Watson Research Center’s multimodal application, which uses natural language understanding with a WAP browser to access email messages on a cell phone. It is followed by a discussion of SONY’s techniques for reducing the size of acoustic models while maintaining or improving the accuracy of the recognition engine, which lowers the memory usage of an embedded Large Vocabulary Continuous Speech Recognition (LVCSR) application. Also included is an inside look, provided by an AT&T researcher, into the benchmark tests used to evaluate the performance of automated directory assistance against that of human operators.

Rounding out this issue are three further contributions: the University of Ulm Institute for Information Technology’s comparative analysis of three principal system architectures (embedded speech recognition, network speech recognition, and distributed speech recognition) for delivering automatic speech recognition (ASR) to mobile users; the University of Patras Wire Communication Laboratory’s findings on the use of various regression algorithms to estimate an unknown speaker’s height from speech, which can serve as an added biometric tool for speaker surveillance, speaker profiling or access authorization; and the University of Geneva and the Swiss International Institute of Management and Technology’s study of the actual adoption rate of Interactive Voice Response systems among companies, especially small and medium-sized enterprises. These are just a few highlights of the articles in this virtual issue, which show advanced applications of speech technology in everyday life for both consumers and commercial enterprises.

Examining modality usage in a conversational multimodal application for mobile e-mail access  

Jennifer Lai, Stella Mitchell and Christopher Pavlovski (IBM)

...This paper describes the architecture and implementation of a multimodal application (voice and text) that uses natural language understanding combined with a WAP browser to access email messages on a cell phone. We present results from the use of the system by users as part of a laboratory trial that evaluated usage...

Development of the compact English LVCSR acoustic model for embedded entertainment robot applications  

Xavier Menéndez-Pidal, Ajay Patrikar, Lex Olorenshaw and Hitoshi Honda (SONY)

In this paper we discuss two techniques to reduce the size of the acoustic model while maintaining or improving the accuracy of the recognition engine. The first technique, demiphone modeling, tries to reduce the redundancy existing in a context-dependent state-clustered Hidden Markov Model (HMM). Three-state demiphones optimally designed from the triphone decision tree are introduced to drastically reduce the phone space of the acoustic model and to improve system accuracy. The second redundancy elimination technique is a more classical approach based on parameter tying. Similar vectors of variances in each HMM cluster are tied together to reduce the number of parameters. The closeness between the vectors of variances is measured using a Vector Quantizer (VQ) to maintain the information provided by the variance parameters. The paper also reports speech recognition improvements using assignment of a variable number of Gaussians per cluster and gender-based HMMs. The main motivation behind these techniques is to improve the acoustic model and at the same time lower its memory usage. These techniques may help in reducing memory and improving accuracy of an embedded Large Vocabulary Continuous Speech Recognition (LVCSR) application.
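The parameter-tying idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain k-means vector quantizer with Euclidean distance, and the function name `tie_variances` and the toy variance vectors are hypothetical.

```python
import numpy as np

def tie_variances(variances, codebook_size, iterations=20):
    """Tie similar variance vectors by mapping each to a shared codeword.

    Assumption: plain k-means with Euclidean distance stands in for the
    vector quantizer; the abstract does not specify the distance measure.
    """
    variances = np.asarray(variances, dtype=float)
    # Deterministic initialisation: evenly spaced rows form the first codebook.
    idx = np.linspace(0, len(variances) - 1, codebook_size).astype(int)
    codebook = variances[idx].copy()

    def nearest(cb):
        # Squared distance from every variance vector to every codeword.
        dists = ((variances[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(iterations):
        assign = nearest(codebook)
        for k in range(codebook_size):
            members = variances[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    # Final assignment against the final codebook.
    return codebook, nearest(codebook)

# Hypothetical variance vectors from four Gaussians in one HMM cluster.
toy = [[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.8]]
codebook, assign = tie_variances(toy, codebook_size=2)
```

In this toy case four variance vectors are stored as two shared codewords plus one small index per Gaussian, which is where the memory saving comes from.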

Comparing machine and human performance for caller’s directory assistance requests  

Harry M. Chang (AT&T)

To understand how to systematically construct the machine models for automating Directory Assistance (DA) that are capable of reaching the performance level of human DA operators, we conducted a number of studies over the years. This paper describes the methods used for such studies and the results of laboratory experiments. These include a series of benchmark tests configured specifically for DA related tasks to evaluate the performance of state-of-the-art and commercially-available Hidden Markov Model (HMM) based Automatic Speech Recognition (ASR) technologies. The results show that the best system achieves a 57.9% task completion rate on the city-state-recognition benchmark test. For the most frequently-requested-listing benchmark test, the best system achieves a 40% task completion rate.

An efficient singular value decomposition algorithm for digital audio watermarking  

Fathi E. Abd El-Samie

The singular value decomposition (SVD) mathematical technique is utilized, in this paper, for audio watermarking in time and transform domains. Firstly, the audio signal in time or an appropriate transform domain is transformed to a 2-D format. The SVD algorithm is applied on this 2-D matrix, and an image watermark is added to the matrix of singular values (SVs) with a small weight, to guarantee the possible extraction of the watermark without introducing harmful distortions to the audio signal. The transformation of the audio signal between the 1-D and 2-D formats is performed in the well-known lexicographic ordering method used in image processing. A comparison study is presented in the paper between the time and transform domains as possible hosting media for watermark embedding. Experimental results are in favor of watermark embedding in the time domain if the distortion level in the audio signal is to be kept as low as possible with a high detection probability. The proposed algorithm is utilized also for embedding chaotic encrypted watermarks to increase the level of security. Experimental results show that watermarks embedded with the proposed algorithm can survive several attacks. A segment-by-segment implementation of the proposed SVD audio watermarking algorithm is also presented to enhance the detectability of the watermark in the presence of severe attacks.
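The time-domain embedding described in this abstract can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes the common SVD watermarking scheme in which the perturbed singular-value matrix is decomposed a second time and its factors are kept as a key for extraction. The function names, the `weight` value and the square block size are assumptions.

```python
import numpy as np

def embed(audio, watermark, weight=0.01):
    """Embed a square watermark into the first n*n samples of a 1-D signal."""
    n = watermark.shape[0]
    # Lexicographic (row-major) ordering: 1-D samples -> 2-D matrix.
    A = audio[: n * n].reshape(n, n)
    U, s, Vt = np.linalg.svd(A)
    # Add the watermark to the singular-value matrix with a small weight.
    D = np.diag(s) + weight * watermark
    # Assumed step: second SVD of the perturbed matrix.
    Uw, sw, Vwt = np.linalg.svd(D)
    Aw = U @ np.diag(sw) @ Vt
    marked = audio.copy()
    marked[: n * n] = Aw.reshape(-1)  # 2-D matrix -> 1-D samples
    return marked, (Uw, Vwt, s)       # the key is needed for extraction

def extract(marked, key, n, weight=0.01):
    """Recover the watermark from a marked signal using the stored key."""
    Uw, Vwt, s = key
    Aw = marked[: n * n].reshape(n, n)
    sw = np.linalg.svd(Aw, compute_uv=False)
    D = Uw @ np.diag(sw) @ Vwt
    return (D - np.diag(s)) / weight

rng = np.random.default_rng(0)
n = 16
audio = rng.standard_normal(n * n + 100)                 # stand-in for audio
mark = (rng.random((n, n)) > 0.5).astype(float)          # stand-in binary image
marked, key = embed(audio, mark)
recovered = extract(marked, key, n)
```

Without any attack the recovered matrix matches the embedded one up to floating-point error. A smaller `weight` lowers the audible distortion but leaves less margin against attacks, which is the trade-off the abstract refers to.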

Estimation of unknown speaker’s height from speech  

Iosif Mporas and Todor Ganchev

In the present study, we propose a regression-based scheme for the direct estimation of the height of unknown speakers from their speech. In this scheme every speech input is decomposed via the openSMILE audio parameterization to a single feature vector that is fed to a regression model, which provides a direct estimation of the person’s height. The focus in this study is on the evaluation of the appropriateness of several linear and non-linear regression algorithms on the task of automatic height estimation from speech. The performance of the proposed scheme is evaluated on the TIMIT database, and the experimental results show an accuracy of 0.053 meters, in terms of mean absolute error, for the best performing Bagging regression algorithm. This accuracy corresponds to an averaged relative error of approximately 3%. We deem that the direct estimation of the height of unknown people from speech provides an important additional feature for improving the performance of various surveillance, profiling and access authorization applications.
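The two error measures quoted in this abstract can be computed as in the short sketch below. The heights are made-up illustrative values, not TIMIT data, and the function names are hypothetical. As a rough sanity check, an error of 0.053 m for a speaker about 1.75 m tall is close to 3%.

```python
def mean_absolute_error(true_heights, predicted_heights):
    """Average absolute difference between true and predicted heights (meters)."""
    errors = [abs(t - p) for t, p in zip(true_heights, predicted_heights)]
    return sum(errors) / len(errors)

def mean_relative_error(true_heights, predicted_heights):
    """Average of the absolute error divided by each speaker's true height."""
    ratios = [abs(t - p) / t for t, p in zip(true_heights, predicted_heights)]
    return sum(ratios) / len(ratios)

# Hypothetical speakers: true heights and regression outputs in meters.
true_heights = [1.70, 1.80]
predicted_heights = [1.75, 1.74]
mae = mean_absolute_error(true_heights, predicted_heights)
mre = mean_relative_error(true_heights, predicted_heights)
```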

Explaining the (non) adoption and use of interactive voice response (IVR) among small and medium-sized enterprises 

Caroline Kähr and Martin Steinert

Typically, the penetration of interactive voice response systems (IVRs) is described as being very high, especially among large companies. The paper at hand discusses the use and adoption rate of such systems among companies, especially among small and medium-sized enterprises (SMEs). The study conducted shows that the penetration of IVRs is far lower (about 12%) than initially thought. The main reason stated for this low penetration level seems to be the incompatibility of the company’s business model with an automated telephone answering system. However, the evaluation of results gave evidence that this reason serves as a pretext only and that the real reason(s) for not adopting an interactive voice response system might be far more complicated and profound. It is supposed that the negative historic perception of automated speech systems still prevails and that IVR providers and sellers have failed to communicate the system’s progress as well as its benefits and its numerous areas of application.

Robust features for multilingual acoustic modeling  

C. Santhosh Kumar and V. P. Mohandas

In this paper, we propose a technique to derive robust features for multilingual acoustic modeling using hidden Markov model–Gaussian mixture models (HMM-GMM). We achieve this by discriminatively combining the phonetic contexts of the target languages (languages in the multilingual system). Phonetic context is captured using wide temporal context of the features, and the dimensionality of the resulting feature set is reduced to suit the HMM-GMM implementation using a neural network with a bottle-neck in one of the hidden layers. The output before the non-linearity at the bottle-neck layer of the neural network is the new feature. Since the features are optimized for the target languages in the multilingual recognizer, they are referred to as Target Languages Oriented Features (TLOF)...

Speaker verification under degraded conditions: a perceptual study  

Gayadhar Pradhan and S. R. Mahadeva Prasanna

This study analyzes the effect of degradation on human and automatic speaker verification (SV) tasks. The perceptual test is conducted by subjects with knowledge of speaker verification. An automatic SV system is developed using the Mel-frequency cepstral coefficients (MFCC) and Gaussian mixture model (GMM). The human and automatic speaker verification performances are compared for clean train and different degraded test conditions. Speech signals are reconstructed in clean and degraded conditions by highlighting different speaker-specific information and compared through a perceptual test. The perceptual cues that the human subjects used as speaker-specific information are investigated and their importance in degraded conditions is highlighted. The difference in the nature of human and automatic SV tasks is investigated in terms of falsely accepted and falsely rejected speech pairs. A discussion on human vs automatic speaker verification is carried out and the possibility of performance improvement of automatic speaker verification under degraded conditions is suggested.

Phone duration modeling: overview of techniques and performance optimization via feature selection in the context of emotional speech 

Alexandros Lazaridis, Todor Ganchev, Theodoros Kostoulas, Iosif Mporas and Nikos Fakotakis

Don't have access? 

Subscription access required to read the full text. Need the paper but don't have access? Contact me at scott.epstein@springer.com or +1-212-460-1728.



For authors and editors

  • Aims and Scope

    The International Journal of Speech Technology is a research journal that focuses on speech technology and its applications. It promotes research and description on all aspects of speech input and output, including theory, experiment, testing, base technology and applications.

    The journal is an international forum for the dissemination of research related to the applications of speech technology as well as to the technology itself as it relates to real-world applications. Articles describing original work in all aspects of speech technology are included. Sample topics include but are not limited to the following:

    applications employing digitized speech, synthesized speech or automatic speech recognition
    technological issues of speech input or output
    human factors, intelligent interfaces, robust applications
    integration of aspects of artificial intelligence and natural language processing
    international and local language implementations of speech synthesis and recognition
    development of new algorithms
    interface description techniques, tools and languages
    testing of intelligibility, naturalness and accuracy
    computational issues in speech technology
    software development tools
    speech-enabled robotics
    speech technology as a diagnostic tool for treating language disorders
    voice technology for managing serious laryngeal disabilities
    the use of speech in multimedia

    This is the only journal which presents papers on both the base technology and theory as well as all varieties of applications. It encompasses all aspects of the three major technologies: text-to-speech synthesis, automatic speech recognition and stored (digitized) speech.
  • Copyright information

    For Authors

    Submission of a manuscript implies: that the work described has not been published before (except in form of an abstract or as part of a published lecture, review or thesis); that it is not under consideration for publication elsewhere; that its publication has been approved by all co-authors, if any, as well as – tacitly or explicitly – by the responsible authorities at the institution where the work was carried out.

    Author warrants (i) that he/she is the sole owner or has been authorized by any additional copyright owner to assign the right, (ii) that the article does not infringe any third party rights and no license from or payments to a third party is required to publish the article and (iii) that the article has not been previously published or licensed. The author signs for and accepts responsibility for releasing this material on behalf of any and all co-authors. Transfer of copyright to Springer (respective to owner if other than Springer) becomes effective if and when a Copyright Transfer Statement is signed or transferred electronically by the corresponding author. After submission of the Copyright Transfer Statement signed by the corresponding author, changes of authorship or in the order of the authors listed will not be accepted by Springer.

    The copyright to this article, including any graphic elements therein (e.g. illustrations, charts, moving images), is assigned for good and valuable consideration to Springer effective if and when the article is accepted for publication and to the extent assignable if assignability is restricted by applicable law or regulations (e.g. for U.S. government or crown employees).

    The copyright assignment includes without limitation the exclusive, assignable and sublicensable right, unlimited in time and territory, to reproduce, publish, distribute, transmit, make available and store the article, including abstracts thereof, in all forms of media of expression now known or developed in the future, including pre- and reprints, translations, photographic reproductions and microform. Springer may use the article in whole or in part in electronic form, such as use in databases or data networks for display, print or download to stationary or portable devices. This includes interactive and multimedia use and the right to alter the article to the extent necessary for such use.

    Authors may self-archive the Author's accepted manuscript of their articles on their own websites. Authors may also deposit this version of the article in any repository, provided it is only made publicly available 12 months after official publication or later. He/she may not use the publisher's version (the final article), which is posted on SpringerLink and other Springer websites, for the purpose of self-archiving or deposit. Furthermore, the Author may only post his/her version provided acknowledgement is given to the original source of publication and a link is inserted to the published article on Springer's website. The link must be accompanied by the following text: "The final publication is available at link.springer.com".

    Prior versions of the article published on non-commercial pre-print servers like arXiv.org can remain on these servers and/or can be updated with Author's accepted version. The final published version (in pdf or html/xml format) cannot be used for this purpose. Acknowledgement needs to be given to the final publication and a link must be inserted to the published article on Springer's website, accompanied by the text "The final publication is available at link.springer.com". Author retains the right to use his/her article for his/her further scientific career by including the final published journal article in other publications such as dissertations and postdoctoral qualifications provided acknowledgement is given to the original source of publication.

    Author is requested to use the appropriate DOI for the article. Articles disseminated via link.springer.com are indexed, abstracted and referenced by many abstracting and information services, bibliographic networks, subscription agencies, library networks, and consortia.

    For Readers

    While the advice and information in this journal is believed to be true and accurate at the date of its publication, neither the authors, the editors, nor the publisher can accept any legal responsibility for any errors or omissions that may have been made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

    All articles published in this journal are protected by copyright, which covers the exclusive rights to reproduce and distribute the article (e.g., as offprints), as well as all translation rights. No material published in this journal may be reproduced photographically or stored on microfilm, in electronic data bases, video disks, etc., without first obtaining written permission from the publisher (respective the copyright owner if other than Springer). The use of general descriptive names, trade names, trademarks, etc., in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.

    Springer has partnered with Copyright Clearance Center's RightsLink service to offer a variety of options for reusing Springer content. For permission to reuse our content please locate the material that you wish to use on link.springer.com or on springerimages.com and click on the permissions link or go to copyright.com, then enter the title of the publication that you wish to use. For assistance in placing a permission request, Copyright Clearance Center can be contacted directly via phone: +1-855-239-3415, fax: +1-978-646-8600, or e-mail: info@copyright.com.

    © Springer Science+Business Media New York
