Title:
A MECHANICAL VOCAL APPARATUS
Document Type and Number:
WIPO Patent Application WO/2017/134182
Kind Code:
A1
Abstract:
A mechanical vocal apparatus is provided and comprises a mouth roof and a tongue. The contour of the mouth roof is fixed and the tongue is movable relative to the mouth roof to vary the cross sectional area therebetween so as to produce different sounds. The contour of the mouth roof is defined by mathematical optimisation.

Inventors:
HOWARD IAN (GB)
Application Number:
PCT/EP2017/052296
Publication Date:
August 10, 2017
Filing Date:
February 02, 2017
Assignee:
UNIV PLYMOUTH (GB)
International Classes:
G09B23/32
Foreign References:
JP2002311974A2002-10-25
JP2006317878A2006-11-24
CN203325431U2013-12-04
Other References:
UNKNOWN: "Anthropomorphic Talking Robot Waseda Talker Series", 31 December 2009 (2009-12-31), XP002769569, Retrieved from the Internet [retrieved on 20170425]
Attorney, Agent or Firm:
BRYERS LLP et al. (GB)
Claims:
CLAIMS

1. A mechanical vocal apparatus comprising a mouth roof and a tongue, the contour of the mouth roof is fixed and the tongue is movable relative to the mouth roof to vary the cross sectional area therebetween so as to produce different sounds, the contour of the mouth roof being defined by mathematical optimisation.

2. Apparatus as claimed in claim 1, in which the tongue has a simple geometry.

3. Apparatus as claimed in claim 1 or claim 2, in which the geometry of the tongue is defined by mathematical optimisation.

4. Apparatus as claimed in any preceding claim, in which the tongue is manually movable.

5. Apparatus as claimed in any preceding claim, in which the movement of the tongue is automated.

6. Apparatus as claimed in any preceding claim, in which the apparatus is configured to produce vowel sounds.

7. A mechanical vocal apparatus comprising a mouth roof and a tongue, the contour of the mouth being fixed and the tongue having a simple geometry, the tongue being movable with respect to the mouth so as to vary the longitudinal cross-sectional area along the path therebetween, in which the mouth roof and the tongue are mathematically matched to a desired cross-sectional area.

8. Apparatus substantially as hereinbefore described with reference to, and as shown in, the accompanying drawings.

9. An artificial mouth cavity comprising apparatus as claimed in any preceding claim.

10. A method of designing a mechanical vocal apparatus, comprising the step of defining the shape of a roof of a mouth and the radii that specify an elliptical tongue section using non-linear optimisation.

11. A robotic vocal apparatus incorporating a tongue body that is moved by stepper motors under computer control.

12. The vocal apparatus of claim 11, comprising a central mouth region with a 2-dimensional tongue, a lip section and a nasal cavity.

13. The apparatus of claim 11 or claim 12, in which movement of articulators changes the vocal tract cross-sectional area, and thereby its acoustic properties, thus providing a means to generate different vocalic sounds.

14. The apparatus of any of claims 11 to 13, in which numerical optimization is used to design the mechanism.

15. The apparatus of claim 15, achieved by fitting its various dimensions to published articulatory data.

16. The apparatus of any of claims 11 to 15, built using 3D printing technology.

Description:
A MECHANICAL VOCAL APPARATUS

The present invention relates generally to a mechanical vocal apparatus and to the construction of a mechanical vocal apparatus.

Introduction

Motivation

There are several motivations for building mechanical models of speech production. An ideal mechanical model would have advantages over computer simulations because it would be a real-world physical system, which has several important implications. Kinematic and dynamic properties of the mechanism, as well as its intrinsic aerodynamic properties, would not need to be explicitly modelled since they would arise naturally. Real airflow within a mechanical model would naturally lead to phonation at the glottis and turbulent flow at constrictions within the vocal tract, with the former realizing voiced excitation and the latter producing fricative and plosive excitations. The human vocal apparatus exhibits non-linear source-filter interactions [1], which are either difficult to model or overlooked in many software simulations. Such complex effects would be implicitly captured in an ideal mechanical model since, for example, closure of the vocal tract would affect other behaviours, such as reducing phonation at the larynx and frication at constrictions. The incorporation of the pulmonary system in a mechanical model of speech production would also permit the largely neglected issues of speech breathing in phonetics to be addressed and investigated.

We believe the act of building models is informative because it focuses attention on aspects and basic principles of the vocal apparatus that affect speech production. These can then be objectively evaluated, since their influence on subsequent speech can be observed and analysed. Insight gained from building mechanical models may also lead to further questions that can then be investigated in real speakers. Invasive measurement techniques can easily be built into a mechanical model to record pressure and palatal contact. Indeed, even simple mechanical models have been valuable tools for scientific research in speech science. For example, frication has been investigated in this way [2]. Such invasive procedures are certainly difficult or uncomfortable to perform in real human subjects.

Basic design principles

Realizing the acoustic filtering properties of the human vocal tract needed to simulate vowel production with a mechanical model relies on modelling its area function with a tube whose cross-sectional area can be varied by the movement of the articulators (i.e. the jaw, tongue and lips). In this way an acoustic excitation from a glottal source applied at one end of the tube will be appropriately filtered, giving rise to acoustic output at the lips with a formant-like structure that can simulate the production of the different vowel qualities. Fricative and other excitations are filtered in a similar fashion. For such a vocal tract model to operate effectively it is important that it is resonant and airtight, that it does not excessively leak sound through its walls and through the tongue, and that it avoids unwanted resonances. This necessitates a sealed airtight construction and the use of acoustically damping material, particularly for the tongue [3].
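As a rough illustration of how tube geometry sets the formant structure, the simplest case is a uniform tube closed at the glottal end and open at the lips, whose resonances follow the quarter-wavelength rule F_n = (2n - 1)c / (4L). The Python sketch below evaluates that rule. The tube length of 17.5 cm and the sound speed of 350 m/s are illustrative assumptions and are not taken from this document; a real vocal tract with a varying area function departs from these values.

```python
def uniform_tube_formants(length_m, n_formants=3, c=350.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    length_m and c are illustrative assumptions, not values from the source.
    """
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_formants + 1)]


formants = uniform_tube_formants(0.175)
print([round(f) for f in formants])
```

Narrowing or widening parts of the tube, which is what the movable tongue does, shifts these resonances away from the evenly spaced neutral pattern.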

Previous mechanical models

Mechanical models of the speech apparatus have a long history, with one of the first models capable of reproducing a few steady state vowels [4]. This was shortly followed by a model which was capable of producing CV-like speech sounds [5]. Models that consisted of natural vocal-tract shapes were built much later [6], [7], and only more recently have models with the intricate shapes of the vocal tract been constructed using rapid prototyping technology from MRI images of vowel production [8].

Robotic models enable configurations to change in real-time, thereby realizing the production of dynamic speech sounds. One early such design was Motormouth, which was driven by 6 motors and used a single rotating mechanism to implement the tongue [9]. Other models did not emulate the physical structure of the vocal apparatus, but instead were based on a deformable silicone rubber tubular resonator with a nasal cavity [10]-[13]. These also incorporated an air pump and two-layer artificial silicone rubber vocal folds. A similar design was also built by the Asada group [14], [15]. Another mechanical vocal tract model was based on a Plexiglas and resin vocal cavity, a silicone tongue that is moved by a mechanism with 5 DOF, and a velo-pharyngeal port, lips and vocal folds [16]. A computer control mechanism has recently been added to an early sliding plastic strip design, enabling it to produce dynamic sequences of vowels [17].

Anton, an animatronic model of the vocal tract, has been developed based on details of human anatomy [18]. The design takes a biomimetics approach [19] and attempts to use components that mimic biology, only resorting to functional approximations where this is not technically feasible. The tongue is central to the design and is cast from silicone rubber, which captures the hydrostatic nature of a real tongue. To simulate the behaviour of the main extrinsic tongue muscles, Dyneema filaments innervate it. To simulate the effect of muscle contraction, they move and deform the tongue in a fashion that mimics real tongue behaviour. The filaments are located in channels in the tongue body and only actually attach to the tongue at their ends using a plastic mesh. The palate is constructed from silicone on the basis of MRI images. The parts are built into a plastic skull with a movable jaw. All articulators are driven using servomotors connected to the actuating filaments, and a loudspeaker is used to simulate voiced excitation. Although Anton had limited performance as a speech production device, it demonstrates the feasibility of this innovative approach.

Currently, the most sophisticated robotic model of speech production is the Waseda Talker [20]. One of the latest versions (WT-7R) models all the main functional aspects of speech production. The lungs have 1 DOF and generate a controlled airflow using motor driven pistons that move up and down in clear cylinders. The human larynx is a complex structure and the vocal folds design adopts a 5 DOF biomechanical implementation constructed using the synthetic rubber material Septon. Electric motor actuation is used to simulate the main muscle functions involved in phonation, which are to stretch and lengthen, thicken and shorten, and to abduct and adduct the vocal folds. The tongue was specially designed to capture the complex behaviour of the human tongue. This includes the intrinsic muscle functionality of tongue narrowing, lengthening and flattening, and tip movement, as well as the extrinsic tongue muscle functionality of raising the tongue front and rear, and pulling the tongue body down. The tongue was designed in three parts: the tip, the blade and the body. The tip of the tongue is driven by a 3 DOF parallel link, controlling front and back length, and rotation. Its vertical deformation is reproduced mainly by the jaw mechanism. Tongue blade and body are driven by a set of 2 DOF slider-crank links, to control length and rotation. To improve the resonance characteristics of the vocal tract, the tongue is filled with ethylene glycol. Lips play a key role in the acoustic properties of the vocal tract and were designed to mimic human lips. The lips are made of soft material and are connected by a vice mechanism to five direction links and one fixed point. They can reproduce the shapes needed for the generation of Japanese vowels using this 5 DOF rigid link actuation mechanism. They can be protruded, raised or lowered to achieve opening and closing, which is important in consonant articulation, and spread and rounded.

Recently the Asada group has developed a mechanical model of an infant vocal apparatus [21]. This is an exciting development and particularly relevant for the study of infant speech acquisition.

These current mechanical models all have some limitations in terms of speech production performance and therefore there is still useful work to do in the development of a high performance robotic vocal apparatus.

According to an aspect of the present invention there is provided mechanical vocal apparatus comprising a mouth roof and a tongue, the contour of the mouth roof is fixed and the tongue is movable relative to the mouth roof to vary the cross sectional area therebetween so as to produce different sounds, the contour of the mouth roof being defined by mathematical optimisation. The tongue may have a fixed geometry, for example formed from a rigid material; in other embodiments the tongue is formed from a compliant material and the geometry of the tongue may be variable.

In some embodiments the tongue has a simple geometry.

The geometry of the tongue may be defined by mathematical optimisation.

The tongue may be manually movable. Alternatively or additionally movement of the tongue may be automated. For example a robotic mechanism may be provided.

Aspects and embodiments of the present invention may further comprise one or more of: lips, front of mouth, teeth, vocal folds, nose, and lungs. These parts may be optimised. The parts may be designed so as to match data using mechanically realisable parts.

A driver may be used to emulate a larynx.

The apparatus may further comprise two (for example parallel) side sections.

In some embodiments the apparatus is configured to produce vowel sounds.

The present invention also provides a mechanical vocal apparatus comprising a mouth roof and a tongue, the contour of the mouth being fixed and the tongue having a simple geometry, the tongue being movable with respect to the mouth so as to vary the longitudinal cross-sectional area along the path therebetween, in which the mouth roof and the tongue are mathematically matched to a desired cross-sectional area.

The present invention also provides an artificial mouth cavity comprising apparatus as described herein.

In some embodiments the present invention may employ robotic activation of a tongue that moves within a mouth cavity. The mouth may consist of two parallel side sections sealed at the top by a curved roof and at the bottom by the tongue. The mouth geometry may provide a U-section in which the tongue can move freely. The tongue itself may consist of a 2-dimensional ellipse with thickness such that it exactly fits into the U-section of the mouth, providing an airtight seal with its sides. Changes in the cross-sectional areas of the vocal tract may be achieved by moving the tongue around within the mouth channel. The dimensions of the vocal tract articulators may be estimated from published data. This may consist of cross-sectional area measurements of a set of vowel sounds performed by a single male participant. These measurements represent the area of the vocal tract along a midline from the glottis to the lips. Non-linear optimization may be used to fit the mouth roof and tongue geometries using an objective function that represents the error between the target cross-sectional area data and that arising from the space between the tongue and roof of the mouth. This process may be constrained to yield a fixed mouth roof contour and a movable, but fixed geometry, elliptical tongue. The tongue itself may be moved around the mouth using linkages driven by small servomotors, which operate under computer control. A compression driver may be attached to the lower end of the assembly and may be driven by a Rosenberg voice source model to simulate acoustic glottal excitation.

In some embodiments the design employs an elliptical tongue of fixed radii and a linear jaw and lip section that move within a mouth cavity. The dimensions and geometries of the articulators and mouth were found by fitting them to published human vocal tract data. The dataset used may consist of the vocal tract area functions for a single male speaker for the production of nine American English vowels. The fit was achieved using an optimization running in Matlab, which minimized the mean-square error between the target vocal tract area functions and the area arising from the gap between the articulators and the roof of the mouth. The articulator and mouth roof dimensions may be optimised over the entire set of vowels, and the location of the articulators was found for each individual vowel. The optimized vocal tract parts were imported into a CAD package and incorporated into the design of the vocal apparatus. This was then built using 3D printing technology. A horn driver unit may be attached to the lower end of the assembly and driven by a voice source model to simulate voiced glottal excitation. We present results from the mechanical vocal tract for several vowel sounds which were achieved by moving the articulators by hand, and make a comparison to the acoustic output achieved from a male speaker and to a control set of 3D printed tubes based directly on the published vowel cross-sectional area data.

In some embodiments the present invention provides a robotic mechanical vocal tract based on articulators and dimensions that are numerically optimized to fit real human vocal tract data.

The intricate designs may be manufactured using 3D printing technology.

The present invention may comprise nonlinear optimization of articulators and their subsequent realization using 3D printing for robotic speech synthesis.

The present invention also provides a reference set of 3D printed tubes that exactly represent published vowel cross-sectional areas.

Different aspects and embodiments of the invention may be used separately or together. Further particular and preferred aspects of the present invention are set out in the accompanying independent and dependent claims. Features of the dependent claims may be combined with the features of the independent claims as appropriate, and in combination other than those explicitly set out in the claims.

The present invention will now be more particularly described, by way of example, with reference to the accompanying drawings.

Figure 1: Rectangular 3D printed vocal tract sections for the vowels /A/ and /U/ generated around the vocal tract midline. Control tubes shown here with 5mm wall thickness for clarity.

Figure 2: Vocal tract area functions fitted by optimization with fixed geometry roof, an elliptical tongue and arbitrary curved jaw-lip section. A-C: Configurations for vowels /i/, /A/ and /U/. Mouth roof plotted in black, vocal tract articulator target in red, fitted elliptical tongue shown in blue and fitted jaw-lip section shown in green. D: Variations in the overall tongue locations across the nine vowel dataset. The centre and movement directions of the ellipse are also shown.

Figure 3: LHS: 3D CAD assembly consisting of the fixed mouth roof and the movable tongue and jaw-lip sections, shown here without the side plates that seal the mouth cavity. RHS: Assembled mechanical vocal tract attached to horn driver to provide voiced excitation.

Figure 4: Spectrographic analysis for the three vowel sounds /A/, /i/ and /U/. Human vowels from a male subject are shown in the left column, vowel simulations from control tubes set to the exact published area functions are shown in the middle column and the mechanical vocal tract simulations are shown in the right column.

The example embodiments are described in sufficient detail to enable those of ordinary skill in the art to embody and implement the systems and processes herein described. It is important to understand that embodiments can be provided in many alternate forms and should not be construed as limited to the examples set forth herein.

Accordingly, while embodiments can be modified in various ways and take on various alternative forms, specific embodiments thereof are shown in the drawings and described in detail below as examples. There is no intent to limit to the particular forms disclosed. On the contrary, all modifications, equivalents, and alternatives falling within the scope of the appended claims should be included. Elements of the example embodiments are consistently denoted by the same reference numerals throughout the drawings and detailed description where appropriate.

Unless otherwise defined, all terms (including technical and scientific terms) used herein are to be interpreted as is customary in the art. It will be further understood that terms in common usage should also be interpreted as is customary in the relevant art and not in an idealized or overly formal sense unless expressly so defined herein.

In the following description, all orientational terms, such as upper, lower, radially and axially, are used in relation to the drawings and should not be interpreted as limiting on the invention.

Methods

Design philosophy

The overall goal is to develop a realistic physical 3D model of the speech apparatus, complete with robotic actuation and control mechanisms for the speech breathing apparatus, vocal folds and vocal tract. Although modelling the jaw, lips and a vocal cavity, as well as the vocal folds and air source, are needed and may be present in some embodiments for a full mechanical vocal tract simulation, in other embodiments we concentrate on the tongue and the fixed structures in the mouth.

Story et al. dataset

In this work we make use of area functions for vowel productions for a single male speaker from a published MRI study carried out by Story et al. [22]. This study provides one-dimensional measurements of cross-sectional area along the midline of the vocal tract. The area data is presented in tabular form. In this work we used the first nine entries corresponding to the American English vowels (represented here in SAMPA [23]): /i/, /I/, /E/, /{/, /V/, /A/, /O/, /o/, /U/. We did not use the vowel /u/ since this involved considerable extension of the lips, which was not the focus of our current project.

Story et al. also plot the vocal tract midline location for the single male speaker. We digitized this plot to recover these distance measurements.

Rectangular vocal tract control sections

A set of fixed configuration rectangular vocal tracts was built to act as measurement controls for three selected vowels, /i/, /A/ and /U/. To achieve this, the area functions for the respective vowels were used to calculate the height of the vocal tract aligned to the midline given a 2cm vocal tract width. The upper and lower extents of the roof and base of the rectangular section were then set from the calculation of the normal vectors of length ±height/2 at each midline sample point. The vocal tract surfaces were then plotted out in Matlab as sealed tubes with wall thickness 10 mm and saved in STL file format. The printer files were prepared using Simplify3D with 100% infill to minimize acoustic transmission through their walls, and the mechanical parts were manufactured in PLA using a Flashforge Creator Pro 3D printer. The resulting tube shapes are shown in Fig. 1. They are shown here with 5mm wall thickness for clarity (since thicker walls obscure the tube shapes).
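The height and normal-offset calculation described above can be sketched as follows. This is a minimal Python illustration and not the Matlab code used in this work: the function name, the finite-difference tangent estimate and the sample midline and area values are all assumptions made for the example, and it needs at least two distinct midline points.

```python
import math


def rectangular_section(midline, areas_cm2, width_cm=2.0):
    """Return (roof, base) point lists offset +/- height/2 along the midline normals."""
    roof, base = [], []
    last = len(midline) - 1
    for i, ((x, y), area) in enumerate(zip(midline, areas_cm2)):
        # Tangent from neighbouring samples (one-sided at the two ends).
        x0, y0 = midline[max(i - 1, 0)]
        x1, y1 = midline[min(i + 1, last)]
        tx, ty = x1 - x0, y1 - y0
        norm = math.hypot(tx, ty)
        nx, ny = -ty / norm, tx / norm   # unit normal: tangent rotated by 90 degrees
        half = 0.5 * area / width_cm     # half of height, where height = area / width
        roof.append((x + nx * half, y + ny * half))
        base.append((x - nx * half, y - ny * half))
    return roof, base


# Hypothetical straight midline with made-up areas (not the Story et al. data).
midline = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
areas = [2.0, 4.0, 1.0]
roof, base = rectangular_section(midline, areas)
```

For a 2 cm wide section, an area of 4 cm² gives a height of 2 cm, so the roof and base sit 1 cm either side of the midline at that sample.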

Mouth, tongue and jaw-lip design

Here we adopt a simple design that employs articulators that move within a fixed mouth cavity. Unlike in software models, which can use sophisticated geometries (e.g. [24]), we were careful to ensure the geometries are easy to realize using physical mechanical components. The mouth thus consisted of two parallel side sections separated by 2cm, sealed at the top by a curved roof and at the bottom by the tongue. This mouth geometry provides a U-section in which the tongue can move freely. Such a mouth construction with an arbitrarily shaped mouth roof is easy to build using 3D printing techniques. The tongue itself consists of a 2-dimensional elliptical structure with thickness such that it exactly fits into the U-section of the mouth, providing an airtight seal with its sides, which can be enhanced further by the application of a small quantity of lubricant (such as water or thin oil) between the articulator and the U-section. Changes in the cross-sectional areas of the vocal tract are achieved by moving the articulators around within the mouth channel by means of mechanical actuation. In this initial work, only a simple tongue was investigated since this ensured that it could be easily realized mechanically and fabricated using 3D printing. It consisted of an ellipse specified by its two radii, and these dimensions remained fixed for all articulations. The tongue had three degrees of freedom (two translational and one rotational), so its location in the mouth and orientation could change to articulate different vowels. Thus although the tongue could not change shape to articulate different vowel sounds, it could move around in the mouth cavity to change the effective cross-sectional area. A second articulator was used to model the area function that arises from the front jaw and lips. This was modelled as an arbitrary curve that could rotate around its starting position and also translate in two dimensions.

Fitting articulator geometries to the Story et al. dataset

The shape of the roof of the mouth, the radii that specified the elliptical tongue and the shape of the jaw-lip section were found using non-linear optimisation. This was achieved using the Matlab function fmincon to minimize the mean-square error between the target vocal tract area functions and the area arising from the gap between the articulators and the mouth roof. The fixed articulator and mouth roof dimensions were optimised over the entire set of vowels, and the translation and rotation of the articulators (i.e. tongue and jaw-lip sections) were found for each individual vowel configuration. This process was constrained to yield a fixed mouth roof contour and articulators that could be moved within the mouth to achieve the cross-sectional areas needed to realize each of the individual target vowels. The discovered articulator geometries were then imported into Autodesk Fusion 360, which was subsequently used to finalize the design of the mechanical vocal tract apparatus. Examples of the fitted mouth and articulators for three of the nine fitted vowels are shown in Fig. 2A-C. In addition, the different individual tongue and jaw-lip locations for all nine vowels in the dataset are shown in Fig. 2D.
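To make the objective concrete, the following Python sketch computes the area function produced by a rotated elliptical tongue beneath a fixed roof and scores it against a target by mean-square error. It is a simplified illustration under several assumptions that are not from this document: the gap is measured vertically rather than along the midline normal, a flat floor stands in where the tongue is absent, the roof curve and all numbers are invented, and a coarse grid search replaces fmincon so that the example needs only the standard library. It fits the pose for one vowel with fixed radii, whereas the procedure described above also optimises the roof and radii across all vowels.

```python
import math
from itertools import product

WIDTH_CM = 2.0  # spacing of the parallel side sections


def tongue_top(x, a, b, cx, cy, theta, floor_y=0.0):
    """Height of the upper surface of a rotated ellipse at horizontal position x."""
    dx, s, c = x - cx, math.sin(theta), math.cos(theta)
    # Quadratic in the vertical offset dy obtained from (u/a)^2 + (v/b)^2 = 1.
    qa = s * s / a**2 + c * c / b**2
    qb = 2.0 * dx * s * c * (1.0 / a**2 - 1.0 / b**2)
    qc = dx * dx * (c * c / a**2 + s * s / b**2) - 1.0
    disc = qb * qb - 4.0 * qa * qc
    if disc < 0.0:      # the vertical line misses the ellipse
        return floor_y  # assumption: flat floor where the tongue is absent
    return cy + (-qb + math.sqrt(disc)) / (2.0 * qa)


def area_function(xs, roof, radii, pose):
    """Cross-sectional areas (width times vertical gap, clipped at zero)."""
    a, b = radii
    cx, cy, theta = pose
    return [WIDTH_CM * max(r - tongue_top(x, a, b, cx, cy, theta), 0.0)
            for x, r in zip(xs, roof)]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def fit_pose(xs, roof, radii, target, grid):
    """Exhaustive search over candidate poses; returns (error, pose)."""
    return min((mse(area_function(xs, roof, radii, pose), target), pose)
               for pose in grid)


# Synthetic example: the target is generated from a known pose, then recovered.
xs = [0.5 * i for i in range(13)]
roof = [3.0 - 0.1 * (x - 3.0) ** 2 for x in xs]
radii = (2.0, 1.0)
true_pose = (3.0, 1.0, 0.2)  # (cx, cy, rotation in radians)
target = area_function(xs, roof, radii, true_pose)
grid = list(product([2.0, 2.5, 3.0, 3.5, 4.0], [0.5, 1.0, 1.5], [-0.2, 0.0, 0.2]))
best_err, best_pose = fit_pose(xs, roof, radii, target, grid)
print(best_err, best_pose)
```

A gradient-based constrained solver, as used in this work, would replace the grid search and would treat the roof samples and the two radii as additional shared parameters.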

The optimized mouth and articulator geometries were used to generate STL format files, which were again manufactured in PLA following the 3D printing procedure used for the control tubes. The CAD models of the mouth roof and articulators are shown on the LHS of Fig. 3.

Voice source

A Monacor die cast horn driver (model KU-516) was attached to the lower end of the 3D printed mechanical vocal apparatus assembly (see Fig. 3 RHS). This horn driver has a respectable frequency range of 160-6500 Hz, although to simulate male speech a unit with a lower cut-off would have been more desirable. It was driven from the sound output of a Mac computer via a Lepai LP-2020A Audio Mini Amplifier using a signal generated in software by a Rosenberg voice source model to simulate acoustic glottal excitation. To test the mechanical vocal apparatus, simple one-second-long linearly falling (140 Hz-120 Hz) intonation pitch contours were used.
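A glottal excitation of this kind can be sketched in Python as below. The pulse shape is one common trigonometric form of the Rosenberg model (raised-cosine opening, quarter-cosine closing, then a closed phase); this document does not say which variant was used, so the pulse shape, the opening and closing fractions of 0.4 and 0.16, and the 16 kHz sample rate are all assumptions. Only the one-second duration and the 140 Hz to 120 Hz linear fall come from the text.

```python
import math


def rosenberg_pulse(phase, open_frac=0.4, close_frac=0.16):
    """One cycle of a trigonometric Rosenberg-style glottal pulse, phase in [0, 1)."""
    if phase < open_frac:                   # opening phase: raised-cosine rise
        return 0.5 * (1.0 - math.cos(math.pi * phase / open_frac))
    if phase < open_frac + close_frac:      # closing phase: quarter-cosine fall
        return math.cos(0.5 * math.pi * (phase - open_frac) / close_frac)
    return 0.0                              # closed phase


def falling_pitch_source(fs=16000, duration_s=1.0, f_start=140.0, f_end=120.0):
    """Pulse train whose fundamental falls linearly from f_start to f_end."""
    n = int(fs * duration_s)
    samples, phase, cycles = [], 0.0, 0
    for i in range(n):
        f0 = f_start + (f_end - f_start) * i / n
        samples.append(rosenberg_pulse(phase))
        phase += f0 / fs                    # accumulate phase so pitch changes smoothly
        if phase >= 1.0:
            phase -= 1.0
            cycles += 1
    return samples, cycles


samples, cycles = falling_pitch_source()
```

Accumulating phase sample by sample, rather than computing each pulse position from a fixed period, keeps the waveform continuous while the pitch glides. The samples would still need scaling and writing to an audio device or file before they could drive an amplifier.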

Analysing acoustic output

Speech production was recorded at the lips of the apparatus using a Podcaster USB microphone and spectrographic analysis was carried out using Matlab.

Results

Preliminary results for the static vowel sounds /i/, /A/ and /U/ are shown in Fig. 4. This illustrates a spectrographic comparison of the acoustic outputs generated by a single adult male speaker, by the control 3D printed tubes and by the mechanical vocal apparatus. Good agreement for the lower formants can be seen across the conditions. The mechanical vocal apparatus generated respectable vowel qualities, which could easily be recognized. Subjectively it performed almost as well as the control tubes.

Discussion

Further embodiments

Here we used the data from Story et al. [22] to demonstrate the principles of our design approach. Currently this is limited to a one-dimensional model of area although in the future using more sophisticated datasets this could easily be used to build a 3D vocal tract.

The design employs robotic activation of a tongue that moves within a mouth cavity. The mouth consists of two parallel side sections sealed at the top by a curved roof and at the bottom by the tongue. The mouth geometry provides a U-section in which the tongue can move freely. The tongue itself consists of a 2-dimensional ellipse with thickness such that it exactly fits into the U-section of the mouth, providing an airtight seal with its sides. Changes in the cross-sectional areas of the vocal tract are achieved by moving the tongue around within the mouth channel. The dimensions of the vocal tract articulators were estimated from published data. This consisted of cross-sectional area measurements of a set of vowel sounds performed by a single male participant. These measurements represent the area of the vocal tract along a midline from the glottis to the lips. Non-linear optimization was used to fit the mouth roof and tongue geometries using an objective function that represents the error between the target cross-sectional area data and that arising from the space between the tongue and roof of the mouth. This process was constrained to yield a fixed mouth roof contour and a movable, but fixed geometry, elliptical tongue. The tongue itself is moved around the mouth using linkages driven by small servomotors, which operate under computer control. A compression driver is attached to the lower end of the assembly and is driven by a Rosenberg voice source model to simulate acoustic glottal excitation.

Articulator actuation of the tongue may be achieved by moving it around the mouth by hand. In other embodiments (not shown) a mechanism using linkages driven by small servomotors, which operate under computer control, may be used.

In further embodiments a more complex tongue design would give a better fit to the area functions. In addition, it would also potentially be much more effective in generating the constrictions needed for frication and for plosive sounds. Mathematically it is possible to increase tongue complexity. For example, the tongue's elliptical radii can also be fitted on a vowel-by-vowel basis to give a better fit to the dataset. However, it is still important to limit tongue designs to those that can be mechanically realized in practice.

In some embodiments a compression driver is used to generate the glottal waveform, which generates an acoustic excitation with no net airflow. In other embodiments a speech breathing apparatus and/or vocal folds may be implemented, thereby enabling the inclusion of these important aspects of speech production.

Conclusions

This embodiment frames the mechanical vocal tract design process as an optimization problem. This involved fitting articulators in a mechanical model to real measured vocal tract area functions. We present results from the mechanical vocal tract for several vowel sounds, and show that even using a simple tongue geometry it is still possible to generate acoustic output that compares reasonably well with more accurate vocal tract geometries.

References

[1] I. TITZE, T. RIEDE, AND P. POPOLO, "Nonlinear source-filter coupling in phonation: Vocal exercises." The Journal of the Acoustical Society of America 123.4 (2008): 1902-1915.

[2] A. BARNEY, C. H. SHADLE, AND P. O. A. L. DAVIES, "Fluid flow in a dynamic mechanical model of the vocal folds and tract. I. Measurements and theory." The Journal of the Acoustical Society of America 105.1 (1999): 444-455.

[3] K. NISHIKAWA, H. TAKANOBU, T. MOCHIDA, M. HONDA, AND A. TAKANISHI, "Speech production of an advanced talking robot based on human acoustic theory." Robotics and Automation, 2004. Proceedings. ICRA'04. 2004 IEEE International Conference on. Vol. 4. IEEE, 2004.

[4] C. KRATZENSTEIN, "Sur la raissance de la formation des voyelles." J. phys 21 (1782): 358-380.

[5] W. VON KEMPELEN, "Mechanismus der menschlichen Sprache." Degen, 1791.

[6] N. UMEDA AND R. TERANISHI, "Phonemic feature and vocal feature: Synthesis of speech sounds, using an acoustic model of vocal tract." J. Acoust. Soc. Jpn 22.4 (1966): 195-203.

[7] R. R. RIESZ, "Description and demonstration of an artificial larynx." The Journal of the Acoustical Society of America 1.2A (1930): 273-279.

[8] S. FUJITA AND K. HONDA, "An experimental study of acoustic characteristics of hypopharyngeal cavities using vocal tract solid models." Acoustical science and technology 26.4 (2005): 353-357.

[9] K. KLAEDEFABRIK, "Martin Riches-Maskinerne/The Machines." (2005): 10-13.

[10] H. SAWADA AND S. HASHIMOTO, "Mechanical Model of Human Vocal System and Its Control with Auditory Feedback." JSME International Journal Series C 43.3 (2000): 645-652.

[11] H. SAWADA AND S. HASHIMOTO, "Mechanical construction of a human vocal system for singing voice production," vol. 13, no. 7, pp. 647-661, Jan. 1998.

[12] T. HIGASHIMOTO, "A mechanical voice system: construction of vocal cords and its pitch control." International Conference on Intelligent Technologies. Vol. 7624768. 2003.

[13] H. SAWADA, M. KITANI, AND Y. HAYASHI, "A robotic voice simulator and the interactive training for hearing-impaired people." BioMed Research International 2008 (2008).

[14] K. MIURA, Y. YOSHIKAWA, AND M. ASADA, "Unconscious anchoring in maternal imitation that helps find the correspondence of a caregiver's vowel categories." Advanced Robotics 21.13 (2007): 1583-1600.

[15] Y. YOSHIKAWA, M. ASADA, K. HOSODA, AND J. KOGA, "A constructivist approach to infants' vowel acquisition through mother-infant interaction." Connection Science 15.4 (2003): 245-258.

[16] M. C. BRADY, "Prosodic timing analysis for articulatory re-synthesis using a bank of resonators with an adaptive oscillator." INTERSPEECH. 2010.

[17] T. ARAI, "Mechanical vocal-tract models for speech dynamics." INTERSPEECH. 2010.

[18] R. HOFE AND R. MOORE, "Towards an investigation of speech energetics using 'AnTon': an animatronic model of a human tongue and vocal tract." Connection Science 20.4 (2008): 319-336.

[19] Y. BAR-COHEN, "Biomimetics - using nature to inspire human innovation." Bioinspiration & Biomimetics 1.1 (2006): P1.

[20] K. FUKUI, K. NISHIKAWA, AND S. IKEO, "Development of a talking robot with vocal cords and lips having human-like biological structures." Intelligent Robots and Systems, 2005. (IROS 2005). 2005 IEEE/RSJ International Conference on. IEEE, 2005.

[21] N. ENDO, T. KOJIMA, Y. SASAMOTO, H. ISHIHARA, T. HORII, AND M. ASADA, "Design of an Articulation Mechanism for an Infant-like Vocal Robot 'Lingua'." Biomimetic and Biohybrid Systems. Springer International Publishing, 2014. 389-391.

[22] B. H. STORY, I. R. TITZE, AND E. A. HOFFMAN, "Vocal tract area functions from magnetic resonance imaging." The Journal of the Acoustical Society of America 100.1 (1996): 537-554.

[23] J. C. WELLS, "SAMPA computer readable phonetic alphabet." Handbook of standards and resources for spoken language systems 4 (1997).

[24] P. BIRKHOLZ, D. JACKEL, AND B. J. KROGER, "Construction and control of a three-dimensional vocal tract model." Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on. Vol. 1. IEEE, 2006.

In a further embodiment we present a robotic vocal apparatus incorporating a tongue body that is moved by stepper motors under computer control. The vocal tract consists of a central mouth region with a 2-dimensional tongue, a lip section and a nasal cavity. Movement of the articulators changes the vocal tract cross-sectional area, and thereby its acoustic properties, thus providing a means to generate different vocalic sounds. Numerical optimization in Matlab was used to design the mechanism by fitting its various dimensions to published articulatory data. The vocal tract was then built using 3D printing technology. We present some examples of vowel and nasal sound production using simulated voiced sound excitation. Finally, we demonstrate that by blowing air into the lower (larynx) end of the vocal tract while making appropriate constrictions, the apparatus can also generate different fricative sounds.

Introduction

Motivation for robotic models of speech production

The reasons for building a robotic vocal apparatus are manifold, and this is an active area of research [1]-[11]. From the inventor's perspective, a robotic speech apparatus provides a good alternative to software implementations of articulatory speech production and should therefore be a valuable tool to assist in building models of speech acquisition and speech motor control. In addition, it provides a means to investigate the importance of the vocal tract anatomy and mechanisms that are involved in speech production.

Extending our previous embodiment

One embodiment of the system could generate purely vocalic sounds, but modulating the vocal tract cross-sectional area required hand actuation of the tongue to move it around the mouth cavity [12]. In this embodiment, we extend the design in several ways. We now incorporate robotic actuation of the tongue body using linkages driven by stepper motors operated under computer control. In addition, we now include a nasal cavity. Finally, we show that such a system can generate fricative sounds by blowing air in at its base, which can lead to turbulent airflow and therefore noise generation at constrictions in the vocal tract.

Methods

Approach

We adopt a computational approach to mechanical design by first generating drawings in Matlab and importing them into AutoCAD Fusion 360 using Python scripts. This provides a means to incorporate sophisticated optimization procedures in the mechanical design process. The design employs an elliptical tongue of fixed radii and linear jaw and lip sections that move within a mouth cavity. The dimensions and geometries of the articulators and mouth were calculated by fitting them to published human vocal tract data. Similarly, the nasal cavity was matched to published nasal area data. The vocal apparatus was built using 3D printing technology. A horn driver unit was attached to the lower end of the assembly and driven by a voice source model to simulate voiced glottal excitation.

Area Data

The acoustic properties of the mouth cavity that play an important role in speech production are determined by its cross-sectional area, and changes in the cross-sectional area are achieved by movement of the speech articulators. To operate effectively as an acoustic filter (as in a real human vocal tract), a robotic vocal tract needs to have the appropriate cross-sectional area, be airtight and keep sound transmission through its walls to a low level.

To obtain the appropriate cross-sectional area, we again make use of area functions for vowel productions by a single male speaker from a published MRI study carried out by Story et al. [13]. This provides a midline path and one-dimensional measurements of area along the midline of the vocal tract. In total we used the first nine entries, corresponding to American English vowels. Their SAMPA representations [14] are as follows: /i/, /I/, /E/, /{/, /V/, /A/, /O/, /o/, /U/.

To add a nose to the model, we make use of nasal tract area data from Dang and Honda [15]. The degree of nasality is controlled by the opening of the velopharyngeal port, which we realize using a rotary flap. The nasal tract has a total length of about 11 cm and a volume of about 25 cm3. The narrowest part is at the nostrils, where the area reduces to about 1 cm2 - 2 cm2. The velum opening can be up to 1 cm2, but is normally 0.2 cm2 - 0.8 cm2. In a human nasal tract, the nasal cavity divides into two parts. However, for simplicity, here it is modelled as a single cavity.

Mouth, tongue and tongue-tip jaw-lip design

In this vocal apparatus we employ three articulators that move within a fixed mouth cavity: the tongue body, the tongue tip and the velum. The mouth consists of two parallel side sections separated by 2 cm, sealed at the top by a curved roof and at the bottom by the tongue and the base of the mouth. This mouth geometry provides a U-section in which the articulators can move freely. To ensure the mouth is airtight, its side plates are held down by several bolts and grease is used to hermetically seal the cavity. In addition, grease is applied to the sides of the walls to lubricate the moving parts and also to provide an airtight seal with the articulators.

We make use of a more complex tongue design than that used previously. The body of the tongue again consists of an ellipse specified by its two radii, and these dimensions remain fixed for all articulations. The tongue now has an adjustable hinged tip, which is used to model the jaw-lip area function that occurs at the front of the mouth (see Figs. 5 and 6). The lower part of the tongue now has a sliding section tensioned by a spring to close off the vocal tract at the tongue base. This ensures there is a continuous airtight passage through the vocal tract from the larynx section to the tip of the tongue. The tongue body has three degrees of freedom (two translational and one rotational), so its location in the mouth and its orientation can change to articulate different vowels. The tongue tip has a single rotational degree of freedom. Similarly, the velum has a single rotational degree of freedom. The tongue is moved by the links, which act on it from the rear (see Fig. 7).

Fitting articulator geometries

The shape of the roof of the mouth, the radii that specify the elliptical tongue and the shape of the jaw-lip tip section were found using non-linear optimisation. This was achieved using the Matlab function fmincon to minimize the mean-square error between the target vocal tract area functions and the area arising from the gap between the articulators and the mouth roof. The fixed articulator and mouth roof geometries were optimised over the entire set of vowels, and the translation and rotation of the articulators (i.e. tongue and tip jaw-lip sections) were found for each individual vowel configuration. The process was thereby constrained to yield a fixed mouth roof and palate contour with articulators that could be moved within the mouth to achieve the cross-sectional areas needed to realize each of the individual target vowels. Figs. 5A-C show examples of three fitted configurations. Fig. 5D shows the location of the centre of the tongue body. Note that these locations are similar to those seen on the classical vowel quadrilateral, except that the mouth is on the right side and not on the left.
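The fitting objective described above can be illustrated with a small sketch. This is not the Matlab code used for the design: it is a simplified Python illustration under stated assumptions. The cross-sectional area is approximated as the roof-to-tongue gap multiplied by the 2 cm side-plate separation, the tongue is an unrotated ellipse resting above a flat floor at y = 0, the roof is flat, and a coarse grid search stands in for fmincon. All function names and numerical values are hypothetical.

```python
import math

WIDTH_CM = 2.0  # side-plate separation stated in the text


def tongue_top(x, cx, cy, rx, ry):
    """Height of the tongue's upper surface at position x (floor is y = 0)."""
    u = (x - cx) / rx
    if abs(u) >= 1.0:
        return 0.0  # outside the ellipse: only the mouth floor remains
    return max(0.0, cy + ry * math.sqrt(1.0 - u * u))


def area_function(xs, roof, cx, cy, rx, ry):
    """Cross-sectional areas (cm2) of the roof-to-tongue gap at each x."""
    return [max(r - tongue_top(x, cx, cy, rx, ry), 0.0) * WIDTH_CM
            for x, r in zip(xs, roof)]


def mean_square_error(xs, roof, targets, cx, cy, rx, ry):
    """Mean-square difference between the model areas and the target areas."""
    areas = area_function(xs, roof, cx, cy, rx, ry)
    return sum((a - t) ** 2 for a, t in zip(areas, targets)) / len(targets)


def fit_position(xs, roof, targets, rx, ry, grid):
    """Coarse grid search for the tongue centre; returns (error, cx, cy)."""
    candidates = ((mean_square_error(xs, roof, targets, cx, cy, rx, ry), cx, cy)
                  for cx in grid for cy in grid)
    return min(candidates)


# Synthetic check: targets are generated from a known tongue position,
# so the search should recover that position.
xs = [0.5 * i for i in range(17)]        # 0 .. 8 cm along the tract
roof = [3.0] * len(xs)                   # flat roof, purely illustrative
targets = area_function(xs, roof, 4.0, 0.5, 3.0, 1.5)
grid = [0.5 * i for i in range(13)]      # candidate coordinates 0 .. 6 cm
best = fit_position(xs, roof, targets, 3.0, 1.5, grid)
print(best)
```

In the design process described in the text, the roof contour and the tongue radii are shared across all vowels while the tongue pose is fitted per vowel; the sketch covers only the per-vowel pose search for fixed radii and a fixed roof.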

Figure 5 - A-C: Examples of three vocal tract and nasal area functions generated in Matlab. The tongue section is shown in red, with the single line indicating the upper tongue tip contour. The roof of the mouth and nasal cavity are shown in black. The palate section is shown in blue. The velopharyngeal port is shown open and plotted in green. D: Tongue centre location for vowel production. The short line indicates orientation. It can be seen that the range of movement across vowel qualities is quite small.

The computed vocal tract geometries were imported into AutoCAD Fusion 360, which was subsequently used to finalize the design of the mechanical vocal tract apparatus. The optimized mouth and articulator geometries were used to generate STL format files and the mechanical parts were built in PLA using a Flashforge Creator Pro 3D printer. The AutoCAD model of the mouth roof and articulators is shown in Fig. 6.

Figure 6 - LHS: AutoCAD Fusion design of the central section of the vocal apparatus. Here the sides are not shown so that the tongue mechanism, the vocal and nasal cavities and the velar flap can be seen. RHS: Oblique front view of the 3D printed vocal apparatus. The black movable tongue body and white tongue tip can be seen, as can the white side plates of the apparatus.

Robotic actuation

Control of articulator position plays a critical role in speech production. As can be seen from the central location of the tongue (Fig. 5D), the actual movement range of the tongue across vowel qualities is quite small, changing by only about 5 mm over the range of vowels modelled in this study. We adopted a simple three-link (lower link, mid link, upper link) revolute arm construction to move the main tongue section in two dimensions and also to control its orientation (Fig. 7). It was operated using a GT2 timing belt drive. Miniature ball bearings were used at its joints to ensure all moving parts could rotate with high precision and with little frictional resistance despite high belt tension. All parts were designed in AutoCAD Fusion 360 and subsequently 3D printed in PLA.
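To illustrate how a three-link revolute arm determines the two translational and one rotational degree of freedom of the tongue body, the following Python sketch computes the planar forward kinematics of a serial three-joint arm. It is an assumption that the joints can be treated as a simple serial chain with relative joint angles; the link lengths and the function name are hypothetical and are not taken from the apparatus.

```python
import math


def forward_kinematics(lengths, angles):
    """Pose (x, y, phi) of the end of a planar serial revolute arm.

    lengths: link lengths, here in cm (hypothetical values).
    angles:  relative joint angles in radians, one per link.
    phi is the accumulated orientation of the final link.
    """
    x = y = phi = 0.0
    for length, angle in zip(lengths, angles):
        phi += angle                 # each joint rotates everything after it
        x += length * math.cos(phi)
        y += length * math.sin(phi)
    return x, y, phi


# Lower, mid and upper link; all three joints bent by 10 degrees.
pose = forward_kinematics([6.0, 6.0, 3.0], [math.radians(10)] * 3)
print(pose)
```

Three joint angles give three independent outputs, which is why a three-link arm is sufficient to set both the position and the orientation of the tongue body in the plane.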

To drive the tongue body we use three high-resolution 400 PPR NEMA 17 stepper motors, connected to the links by means of GT2 timing belts. Currently the tip of the tongue and the velum are moved by hand, but they could be similarly actuated. The vocal assembly and actuation are mounted on 20 mm aluminium profile and attached by means of bolts and T-nuts. This greatly facilitates adjustment of the apparatus. It also provides an elegant means to tension the timing belts, since the motors can simply be slid backwards and forwards until appropriate tension is achieved. The motors are driven from a Kuman 3D Printer Controller kit for Arduino with a RAMPS 1.4 Controller Board, which can operate up to 5 stepper motors. It was operated from an Arduino Mega 2560 R3 Microcontroller programmed in C++.
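With open-loop stepper control, a desired joint angle has to be converted into a step count. The sketch below shows that arithmetic in Python for a 400 step-per-revolution motor (0.9 degrees per full step). The microstepping factor and the belt pulley ratio are assumptions introduced for illustration, since neither is stated here, and the function names are hypothetical; the controller itself is programmed in C++.

```python
def angle_to_steps(angle_deg, steps_per_rev=400, microsteps=1, belt_ratio=1.0):
    """Nearest whole number of steps for a joint rotation of angle_deg.

    belt_ratio is motor revolutions per joint revolution (assumed 1:1).
    microsteps is the driver's microstepping factor (assumed full-step).
    """
    steps_per_joint_rev = steps_per_rev * microsteps * belt_ratio
    return round(angle_deg / 360.0 * steps_per_joint_rev)


def steps_to_angle(steps, steps_per_rev=400, microsteps=1, belt_ratio=1.0):
    """Joint angle in degrees actually reached after a whole number of steps."""
    return steps * 360.0 / (steps_per_rev * microsteps * belt_ratio)


steps = angle_to_steps(10.0)     # requested rotation of 10 degrees
reached = steps_to_angle(steps)  # angle reached after rounding to whole steps
print(steps, reached)
```

The difference between the requested and the reached angle is the quantisation error of the drive, which shrinks as the microstepping factor or the belt reduction increases.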

Adding airflow

Speech breathing and the respiratory system play an important role in speech production, and we previously touched briefly on this theme in simple speech sound learning simulations [16]. In short, we believe that incorporating breathing and airflow would make a robotic vocal apparatus more realistic and more useful for modelling the learning of speech production, and we therefore consider some of its characteristics below.

During normal adult speech, inhalation is performed under muscular control. Exhalation also takes advantage of the elastic properties of the chest walls, although muscular control is also involved in order to hold sub-glottal pressure constant during speech production. In contrast, in young infants this chest wall elasticity is almost absent [17]. Although normal adult male lungs have a vital capacity of 3000-5000 cm3, speech production tidal volume is typically only 500 cm3. An air pressure of about 20 cm H2O is required for normal phonation, although the lungs can generate up to 10 times this pressure during coughing. Air pressure during speech is typically around 5-10 cm H2O. Airflow during phonation is normally around 500 cm3 - 1000 cm3 per second displacement. Phonation threshold pressure is generally between 1 and 3 cm H2O, and it increases with fundamental frequency [18]. For conversational speech, pressure is typically about 5 cm H2O; for loud speech it can be 10 to 15 cm H2O [19]. We note that atmospheric pressure is typically 1030 cm H2O. Thus, compared to most commercial air pumps, which typically develop many times atmospheric pressure, the human respiratory system operates at very low pressures with only moderate airflow volumes.
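To put these pressures in perspective, the short Python sketch below converts them to pascals and expresses them as a fraction of atmospheric pressure. It uses the conventional conversion 1 cm H2O = 98.0665 Pa and the standard atmosphere of 101325 Pa (about 1033 cm H2O, consistent with the figure quoted above); the function names are illustrative only.

```python
PA_PER_CM_H2O = 98.0665           # conventional centimetre of water, in Pa
STANDARD_ATMOSPHERE_PA = 101325.0


def cm_h2o_to_pa(pressure_cm_h2o):
    """Convert a pressure in cm H2O to pascals."""
    return pressure_cm_h2o * PA_PER_CM_H2O


def fraction_of_atmosphere(pressure_cm_h2o):
    """Pressure expressed as a fraction of one standard atmosphere."""
    return cm_h2o_to_pa(pressure_cm_h2o) / STANDARD_ATMOSPHERE_PA


for label, p in [("phonation threshold (upper)", 3.0),
                 ("conversational speech", 5.0),
                 ("loud speech (upper)", 15.0)]:
    print(f"{label}: {cm_h2o_to_pa(p):.0f} Pa, "
          f"{100 * fraction_of_atmosphere(p):.2f}% of atmospheric")
```

Even loud speech therefore corresponds to an overpressure of only about 1.5% of one atmosphere, which is why an air source for such an apparatus needs fine low-pressure regulation rather than high output pressure.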

Airflow within the human vocal tract naturally leads to phonation at the glottis and turbulent flow at constrictions, with the latter leading to the production of fricatives and plosives. The location of the frication noise source within the vocal apparatus leads to acoustic filtering, thereby giving rise to a fricative spectral balance that depends on the place of frication. Ideally, to achieve the right airflow conditions, it would be necessary to model the vocal tract cross section for fricative production, which can again be achieved on the basis of published data. In addition, obstacles to airflow, such as the teeth and alveolar ridge, affect the generation of turbulence and need to be taken into account.

Sound sources for vocal excitation

As before, to simulate voiced excitation, a Monacor die cast horn driver (model KU-516) was attached to the lower end of the vocal apparatus and driven from a Mac computer via a Lepai LP-2020A Audio Mini Amplifier using a simulated glottal excitation [12].

Alternative airflow-based means of excitation were also investigated. Although a compressor or pump could be used to generate a suitable airflow (and this approach will be investigated in the future), a simpler approach was adopted here: the human respiratory system is an ideal source of such airflow, so it was used directly. Thus, in the current study, activation involved a single participant (the author) manually blowing down the air tube.

To examine airflow generated voiced excitation, a makeshift artificial larynx consisting of a duck call was attached at the lower end of the vocal tract (see supplementary material for results).

To examine fricative generation, an air tube was directly attached at the same location. So far we have only used the vocal tract optimized for vowels, but we were still able to show that frication can be achieved if the articulators are first positioned to generate a point of partial closure of the vocal tract. When air is then blown into the vocal tract at its base, it results in air turbulence at the constriction, leading to the production of a fricative.

Results

The vocal apparatus was used first to generate a set of static vowel sounds and then dynamic sounds. Speech production was recorded at the lips of the apparatus using a Podcaster USB microphone, and spectrographic analysis was carried out using Matlab. Preliminary results are presented as spectrographic comparisons between sounds produced by a single adult male speaker and those from the robotic vocal apparatus; these are shown in Fig. 8 for vowels and a CVCV sequence. They indicate some similarity in formant structure. To generate fricatives, the vocal tract was suitably constricted and air was blown into its base. Spectral comparisons between fricatives for the adult male speaker and the robot are shown in Fig. 9, and show similar spectral characteristics.

Figure 7 - Robotic vocal apparatus with stepper motor actuation mechanism. On the left is the PLA vocal tract. A horn driver is attached below to provide simulated voiced excitation for vowel production. The stepper motor actuated revolute arm mechanism can be seen on the right.

Figure 8 - Spectrographic analysis for the three vowel sounds /A/, l\l /U/ and the CVCV /mama/. The latter were made with the velopharyngeal port open to nasalize the sounds. Data from a male participant and the vocal apparatus are shown on the left and right respectively.

Figure 9 - Spectrographic analyses for fricatives. Those from a male participant are shown in the upper row and those from the robotic vocal apparatus are shown in the lower row.

Discussion

Summary

Here we present a design and results from a computer controlled robotic vocal tract that is able to generate vocalic, nasal and fricative sounds. Motor actuation of the tongue tip and the velum in particular could be implemented in further embodiments. Overall the focus here has been to adopt an agile approach to development and concentrate first on the main design principle, and then improve the design in future iterations.

Further embodiments

The above embodiments make no attempt to model the dynamics of speech production and instead use open-loop feed-forward stepper motor position control to simplify the task. In engineering systems, feedback control mechanisms are used to achieve operating goals in the presence of noise and to compensate for changes in plant characteristics. In the future we will investigate using force and compliant control to take better advantage of the dynamics of the system, and make use of state space and optimal feedback control strategies using somatosensory feedback. This could include state feedback control (SFC) methods, as outlined by Houde et al. [20].

In further embodiments of the apparatus, we will increase acoustic performance by adopting better area functions built on the basis of better datasets. The area functions that were used here represent early research in the field, and more sophisticated and accurate datasets now exist that could easily be used to generate an improved version of the vocal tract in 3D. Although we took initial steps to utilize real airflow in the model, generated by a participant blowing down a tube, further embodiments may involve implementing a computer-controllable speech breathing apparatus. In further embodiments the incorporation of a model of the vocal folds will also be tackled.

In some embodiments we use PLA to print the vocal tract. Utilising 3D printing of soft materials (e.g. NinjaFlex) also has great potential to improve many aspects of the design, including the construction of airtight flexible joints and a deformable tongue and lips. In some embodiments the cavity and tongue may be coated with metal (e.g. copper tape) or other conductive material so that contact between these parts can be detected.

Although illustrative embodiments of the invention have been disclosed in detail herein, with reference to the accompanying drawings, it is understood that the invention is not limited to the precise embodiments shown and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope of the invention.

References

[1] K. KLAEDEFABRIK, Martin Riches - Maskinerne / The Machines. Kehrer Verlag: Klaedefabrik, KB, 2005.

[2] H. SAWADA AND S. HASHIMOTO, "Mechanical Model of Human Vocal System and Its Control with Auditory Feedback," JSME International Journal Series C, vol. 43, no. 3, pp. 645-652, 2000.

[3] H. SAWADA AND S. HASHIMOTO, "Mechanical construction of a human vocal system for singing voice production," vol. 13, no. 7, pp. 647-661, Jan. 1998.

[4] T. HIGASHIMOTO, "A mechanical voice system: construction of vocal cords and its pitch control," Robotics and Automation, 2002. Proceedings. ICRA '02. IEEE, 2003.

[5] H. SAWADA, M. KITANI, AND Y. HAYASHI, "A Robotic Voice Simulator and the Interactive Training for Hearing-Impaired People," J. Biomed. Biotech., vol. 2008, pp. 1-8, 2008.

[6] K. MIURA, Y. YOSHIKAWA, AND M. ASADA, "Unconscious anchoring in maternal imitation that helps find the correspondence of a caregiver's vowel categories," 2007.

[7] Y. YOSHIKAWA, M. ASADA, K. HOSODA, AND J. KOGA, "A constructivist approach to infants' vowel acquisition through mother-infant interaction," Connection Science, vol. 15, no. 4, pp. 245-258, Dec. 2003.

[8] M. C. BRADY, "Prosodic Timing Analysis for Articulatory Re-Synthesis Using a Bank of Resonators with an Adaptive Oscillator," presented at the Eleventh Annual Conference of the International Speech Communication Association, 2010, pp. 1029-1032.

[9] T. ARAI, "Mechanical Vocal-Tract Models for Speech Dynamics," Interspeech, 2010.

[10] K. FUKUI, K. NISHIKAWA, AND S. IKEO, "Development of a Talking Robot with Vocal Cords and Lips Having Human-like Biological Structures," Intelligent Robots and Systems, 2005. (IROS 2005)., 2005.

[11] N. ENDO, T. KOJIMA, Y. SASAMOTO, H. ISHIHARA, T. HORII, AND M. ASADA, "Design of an Articulation Mechanism for an Infant-like Vocal Robot 'Lingua'," in Biomimetic and Biohybrid Systems, vol. 8608, no. 39, Cham: Springer International Publishing, 2014, pp. 389-391.

[12] I. S. HOWARD, "Towards a mechanical vocal apparatus for vowel production," presented at the ESSV Leipzig Germany, 2016.

[13] B. H. STORY, I. R. TITZE, AND E. A. HOFFMAN, "Vocal tract area functions from magnetic resonance imaging," The Journal of the Acoustical Society of America, vol. 100, no. 1, pp. 537-554, Jul. 1996.

[14] J. C. WELLS, "SAMPA computer readable phonetic alphabet." Handbook of standards and resources for spoken language systems 4, 1997.

[15] J. DANG, K. HONDA, AND H. SUZUKI, "Morphological and acoustical analysis of the nasal and the paranasal cavities," The Journal of the Acoustical Society of America, vol. 96, no. 4, pp. 2088-2100, Oct. 1994.

[16] I. HOWARD AND P. MESSUM, "Modeling motor pattern generation in the development of infant speech production," International Seminar on Speech Production, 2008.

[17] R. NETSELL, W. K. LOTZ, J. E. PETERS, AND L. SCHULTE, "Developmental patterns of laryngeal and respiratory function for speech production," J Voice, vol. 8, no. 2, pp. 123-131, Jun. 1994.

[18] I. R. TITZE, "Principles of Voice Production," National Center for Voice and Speech, 2000.

[19] T. J. HIXON, G. WEISMER, AND J. D. HOIT, Preclinical speech science. Plural Pub Inc, 2008.

[20] J. F. HOUDE AND S. S. NAGARAJAN, "Speech production as state feedback control," Front Hum Neurosci, vol. 5, p. 82, 2011.