Srinivas Desai

Phone: +91 99491 58857 / Email:

http://research.iiit.ac.in/~srinivasdesai

Objective:

·  To apply my skills and knowledge by carrying out innovative research and development in the domain of Speech Technology.

Education:

·  MS by Research (Expected September 2009), CGPA: 8.3/10, IIIT-Hyderabad.

·  B.E. (ECE), VTU, Belgaum, Karnataka, 2005.

Areas of Competence:

·  Voice Conversion, Speech Synthesis, Speaker Identification, Speech Recognition, Natural Language Processing.

Work Experience:

·  Research Assistant, Speech Lab, LTRC, IIIT-H (August 2006 – Present).

·  Project Assistant, Speech Lab, LTRC, IIIT-H (February 2006 – August 2006).

Skill set:

·  Operating Systems : Linux, Windows.

·  Programming Languages : C, C++, C#, MATLAB.

·  Scripting Languages : Perl.

·  Other Tools : Festival/Festvox, Sphinx, CART, SPTK.

Publications:

·  Srinivas Desai, B. Yegnanarayana, Alan W Black, Kishore Prahallad, "Spectral Mapping Using Artificial Neural Networks for Voice Conversion", submitted for review to IEEE Transactions on Audio, Speech and Language Processing.

·  Srinivas Desai, E. Veera Raghavendra, B. Yegnanarayana, Alan W Black, Kishore Prahallad, "Voice Conversion using Artificial Neural Networks", in Proceedings of IEEE ICASSP 2009, Taipei, Taiwan, April 2009.

·  E. Veera Raghavendra, Srinivas Desai, B. Yegnanarayana, Alan W Black, Kishore Prahallad, "Global Syllable Set for Speech Synthesis in Indian Languages", in Proceedings of the IEEE 2008 Workshop on Spoken Language Technology, Goa, India, December 2008.

·  E. Veera Raghavendra, Srinivas Desai, B. Yegnanarayana, Alan W Black, Kishore Prahallad, "Blizzard 2008: Experiments on Unit Size for Unit Selection Speech Synthesis", in Blizzard Challenge 2008 workshop, Brisbane, Australia, September 2008.

MS Thesis:

Topic: Mono-lingual and Cross-lingual Voice Conversion using Artificial Neural Networks.

Adviser: Mr. S. Kishore Prahallad

Description: Speech is the most important and natural way of communication among human beings. It transmits not only the information that is meant to be conveyed but also carries information about the attitude, emotion and individuality of the speaker. The sound of a person’s voice, i.e., the speaker’s identity, allows us to differentiate between speakers. Voice Conversion (VC) has emerged as a way to control the speaker identity of a spoken utterance.

The goal of a VC system is to morph an utterance of a source speaker so that it is perceived as if spoken by a specified target speaker. The original motivation for building VC systems was to create new speakers for Text-to-Speech (TTS) systems. However, VC has a number of other interesting applications, such as voice quality analysis, emotional speech synthesis and speech recognition. Fields such as speech-to-speech translation, education, health and entertainment have also developed applications using techniques involved in VC.

In this thesis, we propose ANN-based spectral mapping for voice conversion. We also compare the performance of spectral mapping performed by ANN and GMM models, and conclude that the ANN-based mapping is both more accurate and easier to implement. Typically, to build voice conversion models, a common set of utterances has to be recorded by both the source and target speakers, called parallel data. Such parallel data may not always be available. Hence, there have been efforts to use nonparallel data and speaker adaptation techniques, which build models from data recorded by the two speakers without requiring the same utterances. It is therefore important to investigate techniques which perform voice conversion by capturing the speaker-specific characteristics of the target speaker, without any need for the source speaker's data either for training or for adaptation. Such conversion models not only allow an arbitrary speaker to transform his/her voice into that of a pre-defined target speaker, but also extend to cross-lingual voice conversion, i.e., where the source and target speakers speak different languages.
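The mapping step above can be illustrated with a toy sketch: a small feedforward network trained by backpropagation to map time-aligned source frames to target frames. The frame data, network size and learning rate below are invented placeholders for illustration, not the configuration used in the thesis:

```python
import math, random

random.seed(0)

# Toy "parallel data": time-aligned (source, target) spectral frames.
# A fixed linear warp stands in for the real source-to-target relationship.
def warp(x):
    return [0.6 * x[0] + 0.2 * x[1], -0.3 * x[0] + 0.9 * x[1]]

src = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
tgt = [warp(x) for x in src]

# One hidden layer: 2 -> 4 (tanh) -> 2 (linear).
D_IN, D_H, D_OUT = 2, 4, 2
W1 = [[random.uniform(-0.5, 0.5) for _ in range(D_IN)] for _ in range(D_H)]
b1 = [0.0] * D_H
W2 = [[random.uniform(-0.5, 0.5) for _ in range(D_H)] for _ in range(D_OUT)]
b2 = [0.0] * D_OUT

def forward(x):
    h = [math.tanh(sum(W1[j][i] * x[i] for i in range(D_IN)) + b1[j])
         for j in range(D_H)]
    y = [sum(W2[k][j] * h[j] for j in range(D_H)) + b2[k]
         for k in range(D_OUT)]
    return h, y

def mse():
    return sum(sum((yk - tk) ** 2 for yk, tk in zip(forward(x)[1], t))
               for x, t in zip(src, tgt)) / len(src)

LR = 0.05
for _ in range(300):                      # plain per-frame SGD
    for x, t in zip(src, tgt):
        h, y = forward(x)
        dy = [2 * (y[k] - t[k]) for k in range(D_OUT)]           # dL/dy
        dh = [(1 - h[j] ** 2) * sum(dy[k] * W2[k][j] for k in range(D_OUT))
              for j in range(D_H)]                               # backprop
        for k in range(D_OUT):
            for j in range(D_H):
                W2[k][j] -= LR * dy[k] * h[j]
            b2[k] -= LR * dy[k]
        for j in range(D_H):
            for i in range(D_IN):
                W1[j][i] -= LR * dh[j] * x[i]
            b1[j] -= LR * dh[j]

print("final MSE: %.4f" % mse())
```

At conversion time, each source frame would simply be passed through `forward` to obtain the predicted target-speaker frame; the real system operates on spectral features rather than these two-dimensional toy vectors.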

Funded Projects Involved:

·  Text-to-Speech System for Telugu (Developed for Bhrigus Inc, Hyderabad): The project aimed to develop a natural-sounding TTS engine for the Telugu language. It involved developing various modules such as Font Converters, Text Normalization, Prosodic Pause Prediction and Optimal Text Selection.

·  Text-to-Speech System for Hindi (Developed for NOKIA, Beijing): The project aimed to develop a natural-sounding TTS engine for the Hindi language. It involved developing various modules such as Font Converters, Text Normalization, Prosodic Pause Prediction, Optimal Text Selection, and Letter-to-Sound (LTS) Rules.

In-House (Research/Tools) Projects:

·  Language Independent Speaker Identification: Identifying a speaker from his/her voice is called speaker identification. As a person's voice is almost unique, it can be used to authenticate the speaker. The present system is trained and tested on unknown text spoken spontaneously, independent of the language used.

·  ConQuest: A Real-time Spoken Dialogue System: This system-building project developed a real-time spoken dialogue system to act as a help-desk at conferences. The project combined several technologies, including Speech Recognition, Language Understanding, Dialogue Management, Language Generation, and Text-to-Speech Synthesis.

·  Prosodic Pause Prediction using a reduced tag set: This was implemented as a course project aimed at predicting pauses in order to build an effective Text-to-Speech system. Identifying the locations in a sentence where a pause should be placed is important, as utterances with pauses in the wrong places can convey a different meaning. Our work used parts-of-speech tags to predict pauses with CART.
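As a rough illustration of the CART idea used above, the sketch below greedily picks the single best yes/no question over POS contexts using Gini impurity; the tag set, contexts and pause labels are invented for illustration, not the reduced tag set from the project:

```python
# Toy contexts: (POS of current word, POS of next word) -> pause after word?
data = [
    (("NOUN", "CONJ"), True),  (("VERB", "PUNCT"), True),
    (("NOUN", "PUNCT"), True), (("ADJ", "NOUN"), False),
    (("DET", "NOUN"), False),  (("NOUN", "VERB"), False),
    (("VERB", "ADV"), False),  (("NOUN", "CONJ"), True),
]

def gini(rows):
    """Gini impurity of a set of (features, label) rows."""
    if not rows:
        return 0.0
    p = sum(1 for _, y in rows if y) / len(rows)
    return 2 * p * (1 - p)

def best_question(rows):
    """Pick the CART-style question 'feature f == value v?' with max Gini gain."""
    best = None
    base = gini(rows)
    for f in range(2):
        for v in sorted({r[0][f] for r in rows}):
            yes = [r for r in rows if r[0][f] == v]
            no = [r for r in rows if r[0][f] != v]
            gain = base - (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
            if best is None or gain > best[0]:
                best = (gain, f, v)
    return best

gain, f, v = best_question(data)
print("split on feature %d == %r (gain %.3f)" % (f, v, gain))
```

A full CART recurses on the two resulting subsets until the leaves are pure or too small; here one split already separates likely pause sites (before conjunctions and punctuation in this toy data) from the rest.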

·  Pitch prediction from MFCCs for speech synthesis and voice conversion: With the aim of building an ultra-low bit-rate speech coding system, I worked on predicting pitch from MFCCs. MLSA is a synthesis filter that takes MFCCs and F0 as input to synthesize speech. Our aim was therefore to quantize the MFCCs using vector quantization and predict F0 from the MFCCs, so that only the bits encoding the MFCCs need to be transmitted, reducing the bit-rate compared to transmitting bits for both MFCCs and F0. We were also able to use this methodology to predict pitch from text, which is needed in a text-to-speech system.
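A minimal sketch of the joint-codebook idea behind this scheme: each codebook entry pairs a (pre-trained) MFCC centroid with an associated F0, so transmitting only the centroid index lets the decoder recover both the spectral envelope and the pitch. The centroids, dimensionality and F0 values below are made-up placeholders:

```python
# Hypothetical joint codebook: (MFCC centroid, mean F0 of frames quantized to it).
codebook = [
    ([1.0, 0.2], 120.0),   # e.g. a low-pitched voiced centroid
    ([0.1, 0.9], 180.0),   # a higher-pitched voiced centroid
    ([0.0, 0.0], 0.0),     # unvoiced: no F0
]

def predict_f0(mfcc):
    """Quantize an MFCC frame to its nearest centroid and return the stored F0."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(mfcc, c))
    _, f0 = min(codebook, key=lambda entry: dist2(entry[0]))
    return f0

print(predict_f0([0.9, 0.3]))   # → 120.0
```

In the real system the codebook would be trained with vector quantization on actual MFCC frames, and the stored F0 statistics would come from the same training data; only the codebook index per frame then needs to be transmitted.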

Miscellaneous:

·  Presented the paper “Voice Conversion Using Artificial Neural Networks” at IEEE ICASSP, Taipei, Taiwan, April 2009.

·  Participated in the 3rd Winter School in Speech and Audio Processing (WiSSAP-08), held at IIT-Madras, 2-5 January 2008.

·  One of the participants in the Blizzard Challenge 2008 speech synthesis evaluation from IIIT-H.

·  Gave demonstrations of CONQUEST – a conference question answering dialogue system – at IJCAI, 2006, Hyderabad and IJCNLP, 2007, Hyderabad.

·  Language independent speaker identification, IIIT-H, November 2007.

·  Prosodic Pause Prediction using Reduced Tag Set, IIIT-H, November 2006.

·  Guided students for summer internships and course projects.

·  Teaching assistantship for the LT specialization at MSIT, IIIT-H, 2008.

Languages Known:

English, Hindi, Telugu, Kannada, Marathi.

Reference:

Mr. Kishore Prahallad, Senior Research Scientist, IIIT-Hyderabad.

Email: