An articulatory feature (AF) is a representation of the shape of the mouth and the position of the tongue when a person pronounces a word. Using AFs in speech synthesis allows the system to simulate the voice production process more closely. For example, the system can reproduce the nasal voice of a person who has caught a cold by adjusting the AF parameters during synthesis.
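As a toy illustration of how adjusting AF parameters could mimic a nasal voice, consider a small set of hypothetical articulatory parameters (the feature names and values below are placeholders, not the system's actual parameter set):

```python
# Hypothetical AF parameters; names and values are illustrative only.
af = {
    "lip_aperture": 0.6,   # how open the lips are
    "tongue_height": 0.4,  # vertical tongue position
    "tongue_front": 0.7,   # front/back tongue position
    "velum_opening": 0.0,  # 0 = fully oral, 1 = fully nasal
}

def simulate_cold(af_params):
    """Raise the velum-opening parameter to mimic the nasal voice of a cold."""
    nasal = dict(af_params)
    nasal["velum_opening"] = min(1.0, nasal["velum_opening"] + 0.8)
    return nasal

nasal_af = simulate_cold(af)
```

Feeding the adjusted parameters into the synthesizer, rather than re-recording speech, is what makes this kind of manipulation possible.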
Our system uses a neural network to convert the AF parameters into vocal tract parameters, which model the differences in voice among individuals. Since this neural network is the only speaker-dependent component in the system, a person's voice can be modeled from a small amount of voice data. Moreover, voice conversion from one person to another can be performed by changing the output of the neural network. This process is implemented with a deep neural network.
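The speaker-dependent mapping can be sketched as a small feed-forward network from an AF vector to vocal tract parameters. The layer sizes and random weights below are placeholders, not the trained model; the point is that swapping in another speaker's network converts the voice while the AF input stays the same:

```python
import numpy as np

rng = np.random.default_rng(0)

class SpeakerNet:
    """Toy speaker-dependent mapping: AF vector -> vocal tract parameters.
    Dimensions and weights are illustrative placeholders."""
    def __init__(self, n_af=8, n_hidden=16, n_tract=10):
        self.w1 = rng.standard_normal((n_af, n_hidden)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.standard_normal((n_hidden, n_tract)) * 0.1
        self.b2 = np.zeros(n_tract)

    def __call__(self, af):
        h = np.tanh(af @ self.w1 + self.b1)  # hidden layer
        return h @ self.w2 + self.b2         # vocal tract parameters

# Voice conversion: route the same AF input through another
# speaker's network to obtain that speaker's vocal tract parameters.
speaker_a, speaker_b = SpeakerNet(), SpeakerNet()
af_vector = rng.standard_normal(8)
tract_a = speaker_a(af_vector)
tract_b = speaker_b(af_vector)
```

Because only this network differs between speakers, adapting to a new voice means training (or replacing) just this component.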
In the future, we plan to begin research on speech recognition based on this speech synthesis model.
AAM (Active Appearance Models) is a technique for detecting facial features (such as the positions of the mouth and eyes) based on a generative model of facial images. It iteratively redraws a synthetic facial image so as to minimize the difference between the captured facial image and the generated one. The parameters that produce the closest approximation to the captured image are then used in various facial image processing applications.
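The analysis-by-synthesis loop can be sketched with a toy linear generative model, image(p) = mean + basis @ p, fitted by gradient descent on the squared difference to the captured image. Real AAMs also model shape warps and use more sophisticated update rules; this only illustrates the iterative minimization:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_params = 64, 4

# Toy linear appearance model (stand-in for a trained AAM).
mean = rng.standard_normal(n_pixels)
basis = rng.standard_normal((n_pixels, n_params))

# Stand-in for the captured camera image.
true_p = rng.standard_normal(n_params)
captured = mean + basis @ true_p

# Iteratively adjust p to minimize 0.5 * ||generated - captured||^2.
p = np.zeros(n_params)
for _ in range(300):
    residual = (mean + basis @ p) - captured
    p -= 0.005 * (basis.T @ residual)  # gradient step

# p now approximates the parameters that reproduce the captured image.
```

The recovered parameters, not the pixels themselves, are what downstream applications such as expression cloning consume.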
We use AAM in a facial image cloning system that reproduces a person's facial expression on another facial image. We are also developing a lip-reading system that recognizes a person's utterance from lip movements.
Keyword detection is essential for effectively utilizing a speech database. Recently, several studies have focused on search speed, because fast search is crucial when querying very large speech/video databases such as digital archives of TV/radio programs or video sites on the Internet. However, existing methods do not scale to a 10,000-hour speech database.
We achieve fast keyword detection in a large speech database by using the following techniques.
These techniques enable a keyword to be found within several tens of seconds in a 10,000-hour virtual speech database.
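One common way such speed-ups are obtained, sketched below purely for illustration (this is not necessarily the technique used in our system), is to build an inverted index over phoneme n-grams offline, so that a query touches only the index rather than scanning the whole database:

```python
from collections import defaultdict

def bigrams(phonemes):
    """Consecutive phoneme pairs of a phoneme sequence."""
    return [tuple(phonemes[i:i + 2]) for i in range(len(phonemes) - 1)]

def build_index(utterances):
    """Offline step: map each phoneme bigram to the utterances containing it."""
    index = defaultdict(set)
    for uid, phonemes in utterances.items():
        for bg in bigrams(phonemes):
            index[bg].add(uid)
    return index

def search(index, query_phonemes):
    """Online step: intersect the posting sets of the query's bigrams."""
    hits = [index.get(bg, set()) for bg in bigrams(query_phonemes)]
    return set.intersection(*hits) if hits else set()

# Illustrative phoneme transcriptions, not real database entries.
db = {
    "utt1": ["k", "a", "s", "a"],
    "utt2": ["s", "a", "k", "a"],
    "utt3": ["k", "i", "s", "a"],
}
index = build_index(db)
result = search(index, ["k", "a", "s"])  # query bigrams: (k,a), (a,s)
```

The online cost then depends on the posting-list sizes for the query's n-grams rather than on the total length of the speech database.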
Interaction between human beings consists not only of speech but also of other channels such as gesture, gaze, and facial expression. Such interaction is called multi-modal interaction (MMI).
Many MMI systems have been proposed to realize multi-modal human-computer interaction. Although these systems have produced significant results in areas such as system architecture and authoring, few are widely used as human-computer interfaces. One reason is the complexity of installing, compiling, and otherwise setting up such systems.
To avoid this, we designed a web browser-based MMI system. It enables users to interact with an anthropomorphic agent simply by accessing a web site with a common web browser. An advantage of the system is that it runs on any browser that supports JavaScript, Java applets, and Flash, so it can be used not only on a PC but also on a PDA, smartphone, etc.