An articulatory feature (AF) is a representation of the shape of the mouth and the position of the tongue when a person pronounces a word. Using AFs in speech synthesis allows the system to simulate the voice production process more closely. For example, the system can reproduce the nasal voice of a person who has caught a cold by adjusting the AF parameters during synthesis.
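As a toy illustration of how adjusting AF parameters could mimic a nasal voice, consider a small set of hypothetical articulatory parameters (the feature names and values below are placeholders, not the system's actual parameter set):

```python
# Hypothetical AF parameters; names and values are illustrative only.
af = {
    "lip_aperture": 0.6,   # how open the lips are
    "tongue_height": 0.4,  # vertical tongue position
    "tongue_front": 0.7,   # front/back tongue position
    "velum_opening": 0.0,  # 0 = fully oral, 1 = fully nasal
}

def simulate_cold(af_params):
    """Raise the velum-opening parameter to mimic the nasal voice of a cold."""
    nasal = dict(af_params)
    nasal["velum_opening"] = min(1.0, nasal["velum_opening"] + 0.8)
    return nasal

nasal_af = simulate_cold(af)
```

Feeding the adjusted parameters into the synthesizer, rather than re-recording speech, is what makes this kind of manipulation possible.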
Our system uses a neural network to convert the AF parameters into vocal tract parameters, which model the differences in voice among individuals. Since this neural network is the only speaker-dependent component in the system, a person's voice can be modeled from a small amount of voice data. Moreover, voice conversion from one person to another can be performed by changing the output of the neural network. This process is implemented with a deep neural network.
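The speaker-dependent mapping can be sketched as a small feed-forward network from an AF vector to vocal tract parameters. The layer sizes and random weights below are placeholders, not the trained model; the point is that swapping in another speaker's network converts the voice while the AF input stays the same:

```python
import numpy as np

rng = np.random.default_rng(0)

class SpeakerNet:
    """Toy speaker-dependent mapping: AF vector -> vocal tract parameters.
    Dimensions and weights are illustrative placeholders."""
    def __init__(self, n_af=8, n_hidden=16, n_tract=10):
        self.w1 = rng.standard_normal((n_af, n_hidden)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.standard_normal((n_hidden, n_tract)) * 0.1
        self.b2 = np.zeros(n_tract)

    def __call__(self, af):
        h = np.tanh(af @ self.w1 + self.b1)  # hidden layer
        return h @ self.w2 + self.b2         # vocal tract parameters

# Voice conversion: route the same AF input through another
# speaker's network to obtain that speaker's vocal tract parameters.
speaker_a, speaker_b = SpeakerNet(), SpeakerNet()
af_vector = rng.standard_normal(8)
tract_a = speaker_a(af_vector)
tract_b = speaker_b(af_vector)
```

Because only this network differs between speakers, adapting to a new voice means training (or replacing) just this component.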
In the future, we plan to begin research on speech recognition based on this speech synthesis model.
AAM (Active Appearance Models) is a technique for detecting facial features (such as the positions of the mouth and eyes) based on a generative model of facial images. It iteratively redraws a synthetic facial image so as to minimize the difference between the captured facial image and the generated one. The parameters that produce the closest approximation to the captured image are then used in various facial image processing applications.
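The analysis-by-synthesis loop can be sketched with a toy linear generative model, image(p) = mean + basis @ p, fitted by gradient descent on the squared difference to the captured image. Real AAMs also model shape warps and use more sophisticated update rules; this only illustrates the iterative minimization:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_params = 64, 4

# Toy linear appearance model (stand-in for a trained AAM).
mean = rng.standard_normal(n_pixels)
basis = rng.standard_normal((n_pixels, n_params))

# Stand-in for the captured camera image.
true_p = rng.standard_normal(n_params)
captured = mean + basis @ true_p

# Iteratively adjust p to minimize 0.5 * ||generated - captured||^2.
p = np.zeros(n_params)
for _ in range(300):
    residual = (mean + basis @ p) - captured
    p -= 0.005 * (basis.T @ residual)  # gradient step

# p now approximates the parameters that reproduce the captured image.
```

The recovered parameters, not the pixels themselves, are what downstream applications such as expression cloning consume.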
We use AAM in a facial image cloning system that reproduces a person's facial expression on another facial image. We are also developing a lip-reading system that recognizes a person's utterance from lip movements.
Keyword detection is essential for effectively utilizing a speech database. Recently, several studies have focused on search speed, because fast search is crucial when querying very large speech/video databases such as digital archives of TV/radio programs or video sites on the Internet. However, existing methods do not scale to a 10,000-hour speech database.
We achieve fast keyword detection in a large speech database by using the following techniques.
These techniques enable a keyword to be found within several tens of seconds in a 10,000-hour virtual speech database.
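One common way such speed-ups are obtained, sketched below purely for illustration (this is not necessarily the technique used in our system), is to build an inverted index over phoneme n-grams offline, so that a query touches only the index rather than scanning the whole database:

```python
from collections import defaultdict

def bigrams(phonemes):
    """Consecutive phoneme pairs of a phoneme sequence."""
    return [tuple(phonemes[i:i + 2]) for i in range(len(phonemes) - 1)]

def build_index(utterances):
    """Offline step: map each phoneme bigram to the utterances containing it."""
    index = defaultdict(set)
    for uid, phonemes in utterances.items():
        for bg in bigrams(phonemes):
            index[bg].add(uid)
    return index

def search(index, query_phonemes):
    """Online step: intersect the posting sets of the query's bigrams."""
    hits = [index.get(bg, set()) for bg in bigrams(query_phonemes)]
    return set.intersection(*hits) if hits else set()

# Illustrative phoneme transcriptions, not real database entries.
db = {
    "utt1": ["k", "a", "s", "a"],
    "utt2": ["s", "a", "k", "a"],
    "utt3": ["k", "i", "s", "a"],
}
index = build_index(db)
result = search(index, ["k", "a", "s"])  # query bigrams: (k,a), (a,s)
```

The online cost then depends on the posting-list sizes for the query's n-grams rather than on the total length of the speech database.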
Interaction between human beings consists not only of speech but also of other channels such as gesture, gaze, and facial expression. Such interaction is called multi-modal interaction (MMI).
Many MMI systems have been proposed to realize multi-modal human-computer interaction. Although these systems have produced significant results in areas such as system architecture and authoring, few are widely used as human-computer interfaces. One reason is the complexity of installing, compiling, and otherwise setting up such systems.
To avoid this, we designed a web browser-based MMI system. It enables users to interact with an anthropomorphic agent simply by accessing a web site with a common web browser. An advantage of the system is that it runs on any browser that supports JavaScript, Java applets, and Flash, so it can be used not only on a PC but also on a PDA, smartphone, etc.