Building a talking baby robot

2005 ◽  
Vol 6 (2) ◽  
pp. 253-286 ◽  
Author(s):  
Jihène Serkhane ◽  
Jean-Luc Schwartz ◽  
Pierre Bessière

Speech is a perceptuo-motor system. A natural computational modeling framework is provided by cognitive robotics, or more precisely speech robotics, which is also based on embodiment, multimodality, development, and interaction. This paper describes the bases of a virtual baby robot which consists of an articulatory model that integrates the non-uniform growth of the vocal tract, a set of sensors, and a learning model. The articulatory model delivers the sagittal contour, lip shape and acoustic formants from seven input parameters that characterize the configurations of the jaw, the tongue, the lips and the larynx. To simulate the growth of the vocal tract from birth to adulthood, a process modifies the longitudinal dimension of the vocal tract shape as a function of age. The auditory system of the robot comprises a “phasic” system for event detection over time and a “tonic” system to track formants. The model of visual perception specifies the basic lip characteristics: height, width, area and protrusion. The orosensorial channel, which provides tactile sensation on the lips, the tongue and the palate, is elaborated as a model for the prediction of tongue-palatal contacts from articulatory commands. Learning involves Bayesian programming, which comprises two phases: (i) specification of the variables, decomposition of the joint distribution and identification of the free parameters through exploration of a learning set, and (ii) utilization, which relies on questions addressed to the joint distribution. Two studies were performed with this system, each focusing on one of the two basic mechanisms that ought to be at work in the initial periods of speech acquisition, namely vocal exploration and vocal imitation. The first study attempted to assess infants’ motor skills before and at the beginning of canonical babbling. It used the model to infer the acoustic regions, the articulatory degrees of freedom and the vocal tract shapes that actual infants most likely explore, according to their vocalizations. The second study aimed to simulate data reported in the literature on early vocal imitation, in order to test whether and how the robot was able to reproduce them and to gain some insight into the cognitive representations that might be involved in this behavior. Speech modeling in a robotics framework should contribute to a computational approach to sensori-motor interactions in speech communication, which seems crucial for future progress in the study of speech and language ontogeny and phylogeny.
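
For readers unfamiliar with Bayesian programming, the toy sketch below illustrates the two phases described above: specification and identification of a joint distribution from an exploration (babbling) set, then utilization by posing a question to that distribution. The variables, the motor-to-sound law and all numbers are hypothetical stand-ins, not the paper's actual model.

```python
# Minimal sketch of the two Bayesian-programming phases (specification +
# identification, then utilization). Everything here is a toy stand-in.
import numpy as np

rng = np.random.default_rng(0)

# --- Phase (i): specification and identification --------------------------
# Variables: a discrete motor command M and a discrete acoustic outcome A.
# Decomposition: P(M, A) = P(M) * P(A | M).
n_motor, n_acoustic = 5, 8

# "Exploration of a learning set": random babbling trials pair each motor
# command with an observed acoustic outcome; the counts identify the free
# parameters of P(A | M).
counts = np.ones((n_motor, n_acoustic))          # Laplace smoothing
for _ in range(1000):
    m = rng.integers(n_motor)
    a = (2 * m + rng.integers(3)) % n_acoustic   # toy motor-to-sound law
    counts[m, a] += 1

p_m = np.full(n_motor, 1.0 / n_motor)            # uniform prior on M
p_a_given_m = counts / counts.sum(axis=1, keepdims=True)

# --- Phase (ii): utilization -----------------------------------------------
# A "question" to the joint distribution: which motor command likely
# produced a heard sound? P(M | A=a) is proportional to P(M) * P(A=a | M).
def infer_motor(a_heard: int) -> np.ndarray:
    posterior = p_m * p_a_given_m[:, a_heard]
    return posterior / posterior.sum()

print(infer_motor(a_heard=4))   # posterior over motor commands
```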

2019 ◽  
Author(s):  
Alan Taitz ◽  
Diego E Shalom ◽  
Marcos A Trevisan

Silent reading is a cognitive operation that produces verbal content with no vocal output. One relevant question is the extent to which this verbal content is processed as overt speech in the brain. To address this, we investigated the signatures of articulatory processing during reading. We acquired sound, eye trajectories and vocal gestures during the reading of consonant-consonant-vowel (CCV) pseudowords. We found that the duration of the first fixations on the CCVs during silent reading is correlated with the duration of the transitions between consonants when the CCVs are actually uttered. An articulatory model of the vocal system was implemented to show that consonantal transitions measure the articulatory effort required to produce the CCVs. These results demonstrate that silent reading is modulated by subtle articulatory features such as the laryngeal abduction needed to devoice a single consonant or the reshaping of the vocal tract between successive consonants.
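
As a rough illustration of the central analysis, the sketch below correlates per-item first-fixation durations with inter-consonant transition durations. All numbers are fabricated placeholders, not the study's data.

```python
# Hedged sketch: correlate per-item first-fixation durations (silent
# reading) with inter-consonant transition durations (overt production).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

n_items = 40                                    # CCV pseudowords
transition_ms = rng.uniform(30, 120, n_items)   # articulated C-to-C transitions
# Toy assumption: fixation time tracks articulatory effort plus noise.
fixation_ms = 180 + 0.8 * transition_ms + rng.normal(0, 15, n_items)

r, p = pearsonr(fixation_ms, transition_ms)
print(f"r = {r:.2f}, p = {p:.3g}")
```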


2005 ◽  
Vol 40 ◽  
pp. 63-78 ◽  
Author(s):  
Ian S. Howard ◽  
Mark A. Huckvale

The goal of our current project is to build a system that can learn to imitate a version of a spoken utterance using an articulatory speech synthesiser. The approach is informed and inspired by knowledge of early infant speech development. Thus we expect our system to reproduce and exploit the utility of infant behaviours such as listening, vocal play, babbling and word imitation. We expect our system to develop a relationship between the sound-making capabilities of its vocal tract and the phonetic/phonological structure of imitated utterances. At the heart of our approach is the learning of an inverse model that relates acoustic and motor representations of speech. The acoustic-to-auditory mapping uses an auditory filter bank and a self-organizing phase of learning. The inverse model from auditory to vocal tract control parameters is estimated using a babbling phase, in which the vocal tract is essentially driven in a random manner, much like the babbling phase of speech acquisition in infants. The complete system can be used to imitate simple utterances through a direct mapping from sound to control parameters. Our initial results show that this procedure works well for sounds generated by the system's own voice. Further work is needed to build a phonological control level and achieve better performance with real speech.
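
The babbling-then-inversion idea can be sketched as follows. The `forward_model` below is a hypothetical stand-in for the articulatory synthesiser, and the nearest-neighbour lookup is one simple way (among many) to realize the inverse model.

```python
# Sketch of the babbling-then-inversion idea with a toy forward model.
import numpy as np

rng = np.random.default_rng(2)

def forward_model(motor: np.ndarray) -> np.ndarray:
    """Stand-in synthesiser: motor parameters -> 'formant-like' features."""
    f1 = 300 + 500 * motor[..., 0]
    f2 = 900 + 1500 * motor[..., 1] * (1 - 0.3 * motor[..., 0])
    return np.stack([f1, f2], axis=-1)

# Babbling phase: drive the vocal tract randomly, store (sound, motor) pairs.
babble_motor = rng.random((5000, 2))
babble_sound = forward_model(babble_motor)

def imitate(target_sound: np.ndarray) -> np.ndarray:
    """Inverse model by nearest neighbour over the babbled repertoire."""
    i = np.argmin(np.sum((babble_sound - target_sound) ** 2, axis=1))
    return babble_motor[i]

target = forward_model(np.array([0.4, 0.7]))      # a 'heard' utterance
print(imitate(target))                            # recovered motor command
```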


2011 ◽  
Vol 23 (3) ◽  
pp. 683-698 ◽  
Author(s):  
Ran Liu ◽  
Lori L. Holt

Native language experience plays a critical role in shaping speech categorization, but the exact mechanisms by which it does so are not well understood. Investigating category learning of nonspeech sounds with which listeners have no prior experience allows their experience to be systematically controlled in a way that is impossible to achieve by studying natural speech acquisition, and it provides a means of probing the boundaries and constraints that general auditory perception and cognition bring to the task of speech category learning. In this study, we used a multimodal, video-game-based implicit learning paradigm to train participants to categorize acoustically complex, nonlinguistic sounds. Mismatch negativity (MMN) responses to the nonspeech stimuli were collected before and after training, and changes in MMN resulting from the nonspeech category learning closely resemble patterns of change typically observed during speech category learning. This suggests that the often-observed “specialized” neural responses to speech sounds may result, at least in part, from the expertise we develop with speech categories through experience rather than from properties unique to speech (e.g., linguistic or vocal tract gestural information). Furthermore, particular characteristics of the training paradigm may inform our understanding of mechanisms that support natural speech acquisition.
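
For context, the sketch below shows how an MMN effect is commonly quantified: as the deviant-minus-standard difference wave, compared before and after training. The epoch arrays are simulated stand-ins, not the study's EEG recordings.

```python
# Illustrative sketch of MMN quantification on simulated epochs.
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(-0.1, 0.4, 256)                  # seconds around stimulus

def epochs(amplitude_uv: float, n: int = 200) -> np.ndarray:
    """Simulated single-trial EEG with an MMN-like negativity at ~150 ms."""
    mmn_wave = -amplitude_uv * np.exp(-((t - 0.15) / 0.04) ** 2)
    return mmn_wave + rng.normal(0, 2.0, (n, t.size))

for session, amp in [("pre-training", 0.5), ("post-training", 2.5)]:
    diff = epochs(amp).mean(0) - epochs(0.0).mean(0)   # deviant - standard
    window = (t > 0.1) & (t < 0.2)
    print(session, f"mean MMN amplitude: {diff[window].mean():+.2f} uV")
```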


Author(s):  
Quentin Derbanne ◽  
Guillaume de Hauteclocque

When the long-term behaviour of a floating unit is assessed, the environmental contour concept is often applied together with IFORM (Inverse First Order Reliability Method). This approach avoids direct computation on all sea-states, which is computationally very demanding and most often simply not feasible. Instead, only a few conditions (the contour) are assessed, yielding an accurate estimate of the long-term extreme. However, most of the available methods to derive the contour require knowledge of the joint distribution of the different random variables (waves, wind, current...), which is often difficult to derive accurately: complex dependences exist that these methods attempt to compress into too few coefficients. Another limitation of current environmental contours is the difficulty of dealing with temporal dependence. Indeed, extreme sea-states arise in groups (storms, hurricanes...) and are not independent. While de-clustering techniques exist and are quite straightforward in univariate problems, they become difficult to apply as the number of dimensions increases. In an attempt to tackle those challenges, this paper presents a novel approach to deriving IFORM contours. The method does not require any joint distribution and uses many more degrees of freedom to capture the dependence between variables. It also allows for easy de-clustering. The approach is illustrated at two locations, using actual hindcast data of significant wave height and period; the resulting contours are compared to those obtained with more traditional methods.
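
For contrast with the distribution-free proposal, a minimal sketch of a traditional IFORM contour follows, assuming a hypothetical Weibull marginal for Hs and a lognormal conditional for Tp; these parameters are illustrative, not fitted to any real hindcast.

```python
# Traditional IFORM contour sketch: a circle in standard-normal space,
# mapped to (Hs, Tp) through a Rosenblatt transform.
import numpy as np
from scipy import stats

# Reliability index for a 25-year return period with 3-hour sea states.
n_states = 25 * 365.25 * 8
beta = stats.norm.ppf(1 - 1 / n_states)

theta = np.linspace(0, 2 * np.pi, 360)
u1, u2 = beta * np.cos(theta), beta * np.sin(theta)   # circle in U-space

# Rosenblatt transform back to physical space:
# Hs ~ Weibull (marginal); Tp | Hs ~ lognormal with Hs-dependent median.
hs = stats.weibull_min.ppf(stats.norm.cdf(u1), c=1.5, scale=2.8)
mu_tp = 1.6 + 0.3 * np.log(hs + 1)          # toy conditional dependence
tp = stats.lognorm.ppf(stats.norm.cdf(u2), s=0.1, scale=np.exp(mu_tp))

print(f"beta = {beta:.2f}; Hs range on contour: "
      f"{hs.min():.2f}-{hs.max():.2f} m")
```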


Author(s):  
Marianne Pouplier

One of the most fundamental problems in research on spoken language is to understand how the categorical, systemic knowledge that speakers have in the form of a phonological grammar maps onto the continuous, high-dimensional physical speech act that transmits the linguistic message. The invariant units of phonological analysis have no invariant analogue in the signal—any given phoneme can manifest itself in many possible variants, depending on context, speech rate, utterance position and the like, and the acoustic cues for a given phoneme are spread out over time across multiple linguistic units. Speakers and listeners are highly knowledgeable about the lawfully structured variation in the signal and they skillfully exploit articulatory and acoustic trading relations when speaking and perceiving. For the scientific description of spoken language understanding this association between abstract, discrete categories and continuous speech dynamics remains a formidable challenge. Articulatory Phonology and the associated Task Dynamic model present one particular proposal on how to step up to this challenge using the mathematics of dynamical systems with the central insight being that spoken language is fundamentally based on the production and perception of linguistically defined patterns of motion. In Articulatory Phonology, primitive units of phonological representation are called gestures. Gestures are defined based on linear second order differential equations, giving them inherent spatial and temporal specifications. Gestures control the vocal tract at a macroscopic level, harnessing the many degrees of freedom in the vocal tract into low-dimensional control units. Phonology, in this model, thus directly governs the spatial and temporal orchestration of vocal tract actions.
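
A gesture of this kind can be sketched as a critically damped second-order linear system driving a tract variable toward its target. The parameter values below are illustrative choices, not taken from the Task Dynamic model itself.

```python
# Minimal sketch of a gesture as a damped second-order system acting on a
# tract variable (here, lip aperture). Parameters are illustrative only.
import numpy as np

def gesture(x0: float, target: float, k: float = 400.0,
            dt: float = 1e-3, steps: int = 300) -> np.ndarray:
    """Integrate x'' = -k (x - target) - b x' with critical damping."""
    b = 2.0 * np.sqrt(k)             # critical damping: no overshoot
    x, v = x0, 0.0
    traj = np.empty(steps)
    for i in range(steps):
        a = -k * (x - target) - b * v
        v += a * dt
        x += v * dt
        traj[i] = x
    return traj

# Lip-aperture closing gesture: from 12 mm open toward full closure.
traj = gesture(x0=12.0, target=0.0)
print(f"aperture after 300 ms: {traj[-1]:.3f} mm")
```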


2021 ◽  
Author(s):  
Sheena Waters ◽  
Elise Kanber ◽  
Nadine Lavan ◽  
Michel Belyk ◽  
Daniel Carey ◽  
...  

Humans have a remarkable capacity to finely control the muscles of the larynx, via distinct patterns of cortical topography and innervation that may underpin our sophisticated vocal capabilities compared with non-human primates. Here, we investigated the behavioural and neural correlates of laryngeal control, and their relationship to vocal expertise, using an imitation task that required adjustments of larynx musculature during speech. Highly trained human singers and non-singer control participants modulated voice pitch and vocal tract length (VTL) to mimic auditory speech targets, while undergoing real-time anatomical scans of the vocal tract and functional scans of brain activity. Multivariate analyses of speech acoustics, larynx movements and brain activation data were used to quantify vocal modulation behaviour, and to search for neural representations of the two modulated vocal parameters during the preparation and execution of speech. We found that singers showed more accurate task-relevant modulations of speech pitch and VTL (i.e., larynx height, as measured with vocal tract MRI) during speech imitation; this was accompanied by stronger representation of VTL within a region of right dorsal somatosensory cortex. Our findings suggest a common neural basis for enhanced vocal control in speech and song.
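
As background, one common way to estimate VTL acoustically treats the vocal tract as a uniform tube closed at the glottis, so each formant is a quarter-wavelength resonance. The sketch below applies that approximation; the formant values are illustrative, not the study's measurements.

```python
# VTL estimate from formants under the uniform quarter-wavelength-tube
# approximation: Fn = (2n - 1) c / (4 L), hence L = (2n - 1) c / (4 Fn).
import numpy as np

C = 35000.0  # speed of sound in warm moist air, cm/s

def vtl_from_formants(formants_hz) -> float:
    """Average the per-formant length estimates L = (2n - 1) c / (4 Fn)."""
    estimates = [(2 * n - 1) * C / (4 * f)
                 for n, f in enumerate(formants_hz, start=1)]
    return float(np.mean(estimates))

print(f"{vtl_from_formants([500, 1500, 2500]):.1f} cm")   # ~17.5 cm tube
```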


Author(s):  
Hisashi Kanda ◽  
Tetsuya Ogata ◽  
Kazunori Komatani ◽  
Hiroshi G. Okuno

1999 ◽  
Vol 3 (1) ◽  
pp. 49-77 ◽  
Author(s):  
Louis-Jean Boë ◽  
Shinji Maeda ◽  
Jean-Louis Heim

Since Lieberman and Crelin (1971) postulated the theory that Neandertals were a speechless species, the speech capability of Neandertals has been a subject of hot debate for over 30 years and remains a controversial question. These authors claimed that the acquisition of a low laryngeal position during evolution is a necessary condition for having a vowel space large enough to realize the vocalic contrasts necessary for speech; Neandertals, they argued, did not possess this anatomical basis and therefore could not speak, which presumably contributed to their extinction. In this study, we refute Lieberman and Crelin's theory by showing, first, through the analysis of biometric data, that the estimated laryngeal position for two Neandertals is relatively high, but not as high as claimed by the two authors. In fact, the length ratio of the pharyngeal cavity to the oral cavity, an acoustically important parameter, of the Neandertals corresponds to that of a modern adult female or of a child. Second, using an anthropomorphic articulatory model, the potential maximal vowel space estimated by varying the model morphology from a newborn to a child, an adult female and an adult male showed no relevant variation. We infer that a Neandertal could have had a vowel space no smaller than that of a modern human. Our study is strictly limited to the morphological aspects of the vocal tract; we therefore cannot offer any definitive answer to the question of whether Neandertals actually spoke. But we feel safe in saying that Neandertals were not morphologically handicapped for speech.
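
To see why this length ratio is acoustically important, consider a decoupled two-tube approximation of an /a/-like vocal tract, in which each cavity contributes roughly a quarter-wavelength resonance. The sketch below uses illustrative lengths, not measurements of any fossil.

```python
# Decoupled two-tube approximation: each cavity acts (roughly) as a
# quarter-wavelength resonator, so the two lowest formants follow from
# the pharyngeal and oral cavity lengths directly. Lengths illustrative.
C = 35000.0  # speed of sound, cm/s

def low_formants(pharynx_cm: float, oral_cm: float) -> tuple:
    """Lowest resonance of each cavity, F = c / (4 L); return (F1, F2)."""
    f_pharynx = C / (4 * pharynx_cm)
    f_oral = C / (4 * oral_cm)
    return tuple(sorted((f_pharynx, f_oral)))

# Adult-male-like ~1:1 ratio vs. a shorter-pharynx (child/female-like)
# geometry at the same total vocal tract length.
print(low_formants(8.5, 8.5))    # roughly (1029, 1029) Hz
print(low_formants(5.5, 11.5))   # pharynx short relative to oral cavity
```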

