Title:
CAPTION DELIVERY SYSTEM
Document Type and Number:
WIPO Patent Application WO/2019/063751
Kind Code:
A1
Abstract:
A system and method of delivering an information output to a viewer of a live performance. The information output can be displayed text or an audio description at predefined times in the live performance relative to stage events. A follower script with entries organised along a timeline, and metadata at timepoints between at least some of the entries is generated. The metadata is associated with stage events in the live performance. The system uses speech recognition to track spoken dialogue against the entries in the follower script, and the stage events, to aid in following the live performance.

Inventors:
LAMBOURNE ANDREW (GB)
RANSON PAUL (GB)
Application Number:
PCT/EP2018/076384
Publication Date:
April 04, 2019
Filing Date:
September 28, 2018
Assignee:
THE ROYAL NAT THEATRE (GB)
International Classes:
G10L15/26
Domestic Patent References:
WO2013144605A2 (2013-10-03)
WO2002089114A1 (2002-11-07)
Foreign References:
US20130120654A1 (2013-05-16)
US20150208139A1 (2015-07-23)
US20160007054A1 (2016-01-07)
AU2007100441A4 (2007-06-28)
US5648789A (1997-07-15)
Other References:
None
Attorney, Agent or Firm:
VIRGINIA ROZANNE DRIVER (GB)
Claims:

1. A caption delivery system for displaying captions during a live performance, the system comprising:

a memory component storing a follower script and a caption script, the follower script including performance-related metadata;

a speech follower component operable to recognise spoken dialogue of the performance and to compare the spoken dialogue with the follower script to track the location in the follower script of the spoken dialogue to identify when a caption is to be displayed;

a caption output module configured to access from the caption script the caption for display at each location in the follower script associated with a caption;

wherein the caption delivery system is configured to detect events in the live performance and to assist the speech follower component to determine the location based on the timing of the detected events.

2. A caption delivery system according to claim 1 comprising a metadata file storing the performance-related metadata.

3. A caption delivery system according to claim 2, wherein the metadata file comprises performance-related metadata corresponding to events in the live performance and usable to assist in tracking the location of the live performance relative to the follower script.

4. A caption delivery system according to claim 2, wherein the metadata file comprises cue-related metadata in at least one of: a first category which assists the system to determine its location in the follower script based on cues detected in the live performance; a second category of cue metadata which causes a control signal to be generated to a speech detection module to cause it to turn off when speech is not expected at that time in the performance, and to turn it on again when speech is expected; and a third category of cue metadata which causes a control signal to be provided to the caption output module to trigger the display of a non-dialogue caption which is associated with that cue.

5. A caption delivery system according to any preceding claim comprising a caption distribution module configured to receive captions from the caption output module and to generate caption messages for display.

6. A caption delivery system according to claim 5, wherein the memory component stores a first follower script for a first performance and an associated first caption script for the first performance, and a second follower script for a second performance and an associated second caption script for the second performance, wherein the caption distribution module generates caption messages for the first and second performances which are occurring at the same time, each in a respective channel with a performance identifier associated with the caption messages for that performance.

7. A caption delivery system according to claim 5 or 6, wherein the caption distribution module generates the caption output messages in a format which defines colour and style of font for display.

8. A caption delivery system according to claim 5, 6 or 7, wherein the caption distribution module is configured to generate caption messages in which the text of a caption for display is defined in plain text form.

9. A caption delivery system according to any of claims 5 to 8, wherein the caption distribution module is configured to create caption messages which define at least one of a background colour and foreground colour for display of the text.

10. A caption delivery system according to any preceding claim, wherein the memory component stores a translation script which stores for each entry in the follower script a caption in a language other than the language in which the performance is being delivered.

11. A caption delivery system according to any preceding claim, wherein the memory component stores an audio script which identifies at each of a plurality of timepoints in the follower script an audio recording to be played at those points.

12. A caption delivery system which comprises a plurality of display devices each equipped with a receiver for receiving caption messages with captions for display, and a display for showing the captions to a user of the receiving device.

13. A caption delivery system according to claim 12, wherein the display devices are glasses.

14. A caption delivery system according to claim 1, comprising a caption editing module arranged to receive a performance production script and to generate from the production script the follower script and the caption script.

15. A caption delivery system according to claim 14, wherein the caption editing module is configured to receive manual input during a rehearsal of a live performance to modify production of the follower script and caption script.

16. A method of delivering an information output to a viewer of a live performance, the information output being at least one of displayed text and an audio description at predefined times in the live performance relative to stage events by providing a follower script with entries organised in sequence, and metadata associated with at least some of the entries, wherein the metadata is associated with stage events in the live performance, and using speech recognition to track spoken dialogue against the entries in the follower script, and the stage events to aid in following the live performance.

17. A method according to claim 16 wherein the metadata comprises performance cues, and wherein the follower script comprises different variants of the live performance, the method comprising interpreting the performance cues as instructions to select one of the variants appropriate to that live performance.

18. A computer program product comprising computer readable instructions recorded on a transitory or non-transitory medium which when executed by a computer perform the method of claim 16 or 17.

Description:

CAPTION DELIVERY SYSTEM

Technical field

This disclosure relates to controlling the delivery of captions and other performance-synchronised services in live theatre.

Background

There has been a continuing requirement for captioning and subtitling services to display a text version of something which is spoken. Such captioning improves accessibility, allowing those suffering from hearing disabilities to follow a broadcast or performance by being able to read a transcript of a speech being delivered. Subtitles are also known, for example for conveying a foreign language text of speech in a film or broadcast.

Captioning and subtitling is provided in a number of different contexts. In one context, films may be provided with so-called 'open captions' which are synchronised to the speech during the film. Such synchronisation is relatively easy to perform, because the script and timing are known in advance and synchronised captions can be prepared ahead of delivery using the moving image of the film or its audio track. Captions on television have long been available as a service which can be accessed by people with a hearing disability. So-called 'closed captions' are viewed only when the service is selected by a viewer. More recently, heads-up displays on so-called smart glasses have become available, by means of which captions can be delivered directly to smart glasses worn by members of a cinema audience.

In theatre, live performances represent particular challenges for captioning. At present, captions are triggered manually and only provided for certain accessible theatre performances. Predefined captions are created based on the script which is to be spoken and the sound effects. The captions are designed to allow a deaf or hard of hearing person to follow the performance. The captions are manually triggered for display by a person (the caption cuer) whose task it is to follow the performance and manually trigger the display of each caption synchronised with the oral delivery of the lines of the performance. In live theatre, the timing of captions needs to accommodate variations in the rate of speech and the timing of pauses or noises other than speech, which may or may not be intentional in the performance. The caption cuer is trained to accommodate such variations by carefully watching and listening to the performance so that they provide the captions at the correct times.

Subtitles are also available in the case of live television broadcasts, again to assist deaf or hard of hearing viewers so that they may follow a broadcast even if they cannot clearly hear the audio. The subtitles may be created by a human intermediary who follows the broadcast and carefully and precisely re-speaks the words and punctuation to a highly accurate speech recognition system which provides a real-time transcript for display as subtitles.

Alternatively, a real-time machine shorthand transcription system can be used to generate the subtitle text. In such contexts, there are inevitably delays between the broadcast being delivered and the captions being displayed on a screen. Attempts have been made to play the broadcast soundtrack directly into a speech recognition system configured to provide a transcript of the speech as subtitles.

The term speech recognition system is used herein to denote a system capable of transcribing human speech into text to be displayed, the text corresponding to the speech which has been delivered. A speech recognition system might be used as part of a 'speech follower'. Speech followers are used to synchronise between an audio signal carrying speech and a corresponding script. This process may be performed by computer processing of a media file in order to time the script to the corresponding audio signal. The process may also be performed in real time to control the rate at which a script is displayed to a speaker, for example as a teleprompt. That is, a speaker is reading from a script in a live context, and a speech follower assists in making sure that the rate at which the script is displayed to the speaker matches the speaking rate of the speaker. These systems thus display part of a script corresponding to the location that they have detected the speaker has reached in the script.

Attempts have been made to provide a speech follower (using speech recognition) in the context of live theatre, to display a script in time with its delivery. The idea is to use a speech follower to follow the speech which is being spoken during the live performance, and to display captions created from the script at the correct time corresponding to the speech. However, it is very difficult to implement such speech followers in the context of live theatre, due to the many variables that can occur in live theatre. Previously, live speech follower systems have successfully been used in studio contexts with good quality audio. They are not suited generally to live theatre, which poses a number of different challenges. Because of the theatre surroundings, rather than a studio context, there may be poor audio quality. The system has to cope with a number of different styles and speeds of speech and therefore cannot be trained to a standard style and speed of delivery. It is known that speech following systems behave more accurately when they can be trained to a standard style and speed of delivery. Theatres are subject to general background noise, which may be part of the performance itself, or may be unexpected. There may be long pauses between utterances on stage, while the action proceeds without dialogue. Utterances in theatres may consist not only of words but also other utterances such as exclamations or cries or whimpers, and may be unusually loud or quiet. Actors may speak at the same time as each other. Real-time speech follower systems generally work well only in a context where the speech (more or less) is clear and consists of words and is spoken at an even pace and at a reasonably even volume. Actors may have different accents, or deliberately be speaking in an affected way or unusually quickly or slowly. The performance may consist of non-verbal sounds, such as music or effects, which are part of the performance and which a person who is hard of hearing would like to know something about.

For all of these reasons, it has not been possible to date to successfully implement a speech following system to automatically and reliably cue captions for display in the generality of performances. So far, theatre captioning services which have been provided, and which are increasingly in demand, have been manually cued.

Summary

The aim of the present invention is to provide automatically an accurately timed delivery of captions to supplement a live performance in a synchronised manner, and which overcomes the above challenges. This aim is met by tracking the performance against a pre-defined metadata script, which incorporates information about performance events as well as speaker dialogue. According to one aspect of the invention there is provided a caption delivery system for displaying captions during a live performance, the system comprising:

a memory component storing a follower script and a caption script, the follower script including waypoints associated with performance cues;

a speech follower component operable to recognise spoken dialogue of the performance and compare the spoken dialogue with the follower script to track the location in the follower script of the spoken dialogue to identify when a caption is to be displayed; a caption output module configured to access from the caption script the caption for display at each location in the follower script associated with a caption;

a cue handler component which stores performance cue identifiers with associated cue metadata and which is configured to receive performance cues detected in the live performance and output cue signals to the speech follower component to thereby assist the speech follower component to determine the location based on the waypoints at the detected cues.

According to another aspect of the invention there is provided a caption delivery system for displaying captions during a live performance, the system comprising:

a memory component storing a follower script and a caption script, the follower script including performance-related metadata;

a speech follower component operable to recognise spoken dialogue of the performance and to compare the spoken dialogue with the follower script to track the location in the follower script of the spoken dialogue to identify when a caption is to be displayed;

a caption output module configured to access from the caption script the caption for display at each location in the follower script associated with a caption; wherein the caption delivery system is configured to detect events in the live performance and to assist the speech follower component to determine the location based on the timing of the detected events.

The caption delivery system may alternatively be configured to use events in the live performance and information from the speech follower component to determine the location based on the timing of the detected events.

It will be appreciated that the memory component may be any suitable computer or electronic storage, which could be in one server or distributed in different computer entities. For example, the caption script may be stored separately from the follower script and separately from the metadata. The follower script may include waypoints or other timing points referencing a separately stored metadata file, rather than the metadata being part of the follower script itself.

According to another aspect of the invention there is provided a caption delivery system for displaying captions during a live performance, the system comprising:

a memory component storing a follower script and a caption script, the follower script including metadata associated with performance cues;

a speech follower component capable of recognising sufficient spoken dialogue of the performance and comparing that partial transcript with the follower script to track the location in the follower script of the spoken dialogue to identify when a caption is to be displayed;

a caption output module configured to access from the caption script the caption for display at each location in the follower script associated with a caption;

a cue handler component which stores performance cue identifiers with associated cue metadata and which is configured to receive performance cues detected in the live performance and output cue signals to the speech follower component to thereby assist the speech follower component to determine the location based on the timing of the detected cues.

In addition to the cues identified in the follower script, the follower script can comprise non-cue metadata corresponding to events and pauses in the live performance which is usable to assist in tracking the location of the live performance relative to the follower script.

An aspect of the invention provides a system and method for improving the reliability of a speech follower by using inline metadata and/or cues representing fixed waypoints. One category of metadata causes a control signal to be generated to the speech follower module to cause it to pause when speech is not expected at that time in the performance, and to resume again when speech is expected. This prevents the speech follower from attempting to follow audio when it is known from the performance timeline that a pause or other non-dialogue effect is expected.

Another category of metadata causes a control signal to be provided to the caption output module to trigger the display of a non-dialogue caption from the caption script which is associated with that location. For example, such a text could describe a sound effect such as 'DOG BARKING' or 'GUN SHOT'. The caption delivery system may comprise a caption distribution module configured to receive captions from the caption output module and to generate caption messages for display.

When configured to follow speech for multiple live performances, a first speech follower receives an audio feed from a first performance, a second speech follower receives an audio feed from a second performance, and so forth.

When configured to deliver captions for multiple live performances, the memory component stores a first follower script for a first performance and an associated first caption script for the first performance, and a second follower script for a second performance and an associated second caption script for the second performance, wherein the caption distribution module generates caption messages for the first and second performances which are occurring at the same time, each in a respective channel with a performance identifier associated with the caption messages for that performance. The memory component may be a single computer memory or distributed memory.

The caption distribution module may be configured to generate the caption output messages in a format which defines colour and style of font and text size and position for display. The caption distribution module may be configured to generate caption messages in which the text of a caption for display is defined in plain text form. The caption distribution module may be configured to create caption messages which define at least one of a background colour and foreground colour for display of the text. In one augmented version, the memory component stores a translation script which stores for each entry in the follower script a caption in a language other than the language in which the performance is being delivered.
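By way of illustration only, a caption message of this kind might be serialised as a small structured record carrying the plain caption text together with presentation attributes and a performance identifier for the channel. The sketch below is an assumption made for the example: the field names, defaults and JSON envelope are not defined by the system described here.

# Illustrative sketch only: one possible caption message layout with assumed
# field names (performance_id, text, colours, style, size, position).
import json
from dataclasses import dataclass, asdict

@dataclass
class CaptionMessage:
    performance_id: str        # identifies the channel for a given performance
    sequence: int              # ordering of captions within the performance
    text: str                  # caption text in plain text form
    fg_colour: str = "white"   # foreground (text) colour
    bg_colour: str = "black"   # background colour
    font_style: str = "regular"
    text_size: int = 24
    position: str = "bottom"   # nominal display position

    def to_json(self) -> str:
        # Plain-text payload wrapped in a simple JSON envelope for distribution.
        return json.dumps(asdict(self))

if __name__ == "__main__":
    msg = CaptionMessage(performance_id="stage-A", sequence=27,
                         text="DOG BARKING", fg_colour="yellow")
    print(msg.to_json())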

In another augmented version, the memory component stores an audio script which identifies at each of a plurality of points in the follower script an audio recording to be played at a predefined interval after each of those points. Note that these audio descriptions are thereby triggered at pauses in the dialogue to describe scenes for a blind person (who can hear the dialogue).

The caption delivery system may comprise a plurality of display devices each equipped with a receiver for receiving caption messages with captions for display, and a display for showing the captions to a user of the receiving device. The display devices can be glasses, e.g. so-called 'smart glasses', or tablets or other mobile devices.

The caption delivery system can comprise a caption and metadata editing module arranged to receive a production script, and to generate from the production script an initial follower script and an initial caption script. During a rehearsal of a live performance the caption and metadata editing module can be configured to receive the timeline of the spoken delivery of the script, along with the performance cues. The caption and metadata editing module may then be used manually to review all the information received and to prepare the follower script and caption script along with the associated metadata required to automate the caption delivery in subsequent performances.

A further aspect of the invention provides a method of delivering an information output to a viewer of a live performance, the information output being at least one of displayed text and an audio description at predefined times in the live performance relative to stage events, by providing a follower script with entries organised along a timeline, and metadata at timepoints between at least some of the entries, wherein the metadata corresponds to stage events or expected pauses or timing in the live performance. During a live performance this metadata is used, along with the information from the speech recognition module, to guide the speech follower in tracking spoken dialogue against the entries in the follower script and against the stage events.

A further aspect of the invention provides a method of delivering an information output to a viewer of a live performance, the information output being at least one of displayed text and an audio description at predefined times in the live performance relative to stage events by providing a follower script with entries organised in sequence, and metadata associated with at least some of the entries, wherein the metadata is associated with stage events in the live performance, and using speech recognition to track spoken dialogue against the entries in the follower script, and the stage events to aid in following the live performance.

In one embodiment the metadata comprises performance cues, and the follower script comprises different variants of the live performance, the method comprising interpreting the performance cues as instructions to select one of the variants appropriate to that live performance.

According to another aspect of the invention, there is provided a caption delivery system for displaying captions during a live performance, the system comprising:

a memory component storing a follower script and a caption script, the follower script including waypoints associated with event-related metadata;

a speech follower component operable to recognise spoken dialogue of the performance and to compare the spoken dialogue with the follower script to track the location in the follower script of the spoken dialogue to identify when a caption is to be displayed;

a caption output module configured to access from the caption script the caption for display at each location in the follower script associated with a caption; wherein the system is configured to detect events in the performance and to use the associated waypoints to assist the speech follower component to determine its location in the follower script.

According to another aspect of the invention, there is provided a caption delivery system for displaying captions during a live performance, the system comprising: a memory component storing a follower script and a caption script, the follower script including waypoints associated with performance-related metadata (e.g. associated with cues or other events such as pauses);

a speech follower component operable to recognise spoken dialogue of the performance and to compare the spoken dialogue with the follower script to track the location in the follower script of the spoken dialogue to identify when a caption is to be displayed;

a caption output module configured to access from the caption script the caption for display at each location in the follower script associated with a caption; wherein the system is configured to detect events in the live performance and to assist the speech follower component to determine its location based on the waypoints at the detected events.

It is noted that the production of such a follower script is a novel step. The invention provides in a further aspect a computer program comprising computer readable instructions which when loaded into a computer create a follower script from a caption script by removing at least one of speaker labels and sound effects from the caption script.

Preferably, spoken text in the caption script is automatically phoneticised by the computer program to make the text easier for a speech follower to follow.
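A minimal sketch of that derivation step is given below, assuming a caption script in which speaker labels are written in upper case followed by a colon and effect captions are enclosed in square brackets; both conventions, and the trivial phoneticisation hook, are assumptions made for illustration and not part of the described computer program.

# Illustrative sketch: deriving a follower script from a caption script by
# stripping speaker labels and effect captions. The caption-script conventions
# (e.g. "MACBETH:" speaker labels, "[GUN SHOT]" effect lines) are assumed.
import re

SPEAKER_LABEL = re.compile(r"^[A-Z][A-Z ]+:\s*")
EFFECT_LINE = re.compile(r"^\[.*\]$")

def phoneticise(text: str) -> str:
    # Placeholder for an optional phoneticisation pass; a real system might
    # rewrite hard-to-recognise words into a more phonetic spelling.
    return text

def derive_follower_script(caption_lines):
    follower = []
    for index, line in enumerate(caption_lines):
        line = line.strip()
        if not line or EFFECT_LINE.match(line):
            continue                               # drop sound-effect captions
        dialogue = SPEAKER_LABEL.sub("", line)     # drop speaker labels
        # Keep the caption-script line number so each follower entry can be
        # mapped back to its corresponding caption for display.
        follower.append((index, phoneticise(dialogue)))
    return follower

if __name__ == "__main__":
    captions = ["MACBETH: Is this a dagger which I see before me?",
                "[OWL SHRIEKS]",
                "LADY MACBETH: What, quite unmann'd in folly?"]
    for line_no, text in derive_follower_script(captions):
        print(line_no, text)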

Preferably the computer program generates a user interface for a human to interact with to amend the follower script manually.

The user interface is also configured to allow a human to generate or amend the metadata.

It will be appreciated that the techniques described herein for time-assisted and event-assisted speech following could be used in other contexts where there is poor recognition accuracy, not just live performances. For example, the system could be used at live conferences or events, such as sport or musical events. For a better understanding of the present invention and to show how the same may be carried into effect, reference will now be made by way of example to the accompanying drawings in which:

Figure 1 is a schematic block diagram showing server-side functionality of a caption delivery system operating in a performance;

Figure 2 is a schematic diagram of a cue handler;

Figure 3 is a schematic block diagram of a caption editing system for generating scripts;

Figure 4 shows an example timeline;

Figure 5 is a schematic block diagram of glasses for delivering captions; and

Figure 6 is a schematic block diagram showing delivery of audio information.

Aspects of the present invention enable accessibility information such as captions for deaf or hard-of-hearing people, translation texts, and audio description recordings, to be delivered automatically during a stage play or similar live performance or delivery. A system described herein has two components:

a data capture and editing system which enables accessibility information and associated metadata to be produced; and

a real-time replay system which monitors cues, timing and production audio and controls the delivery of accessibility information.

Data capture and editing system

The data capture and editing system enables information needed for automated access to live performances to be assembled, refined and tested, and conveniently delivered for use.

In one embodiment it comprises a software application which runs on a PC or similar and is used at various stages throughout the process of preparing for a live production. One example of the stages of production by trained operators using the data capture and editing system is given below:

Stage 1: pre-process a production script to remove all information except speaker labels and dialogue, and import this "cleaned script" into the software application, which creates a "caption script" for display in a native language, and a "follower script" which is a dialogue script which a real-time replay system will follow against audio of an actual live production;

during rehearsals for the live production, operators use the data capture and editing system to refine and correct the dialogue script and speaker labels, and to add any necessary "effects captions" describing music or sounds which may not be obvious to deaf people, as well as comments on style of content or delivery to guide operators during the later production stages if required; once the dialogue has been finalised, produce parallel translation text streams in as many languages as desired for translation captions;

once the production is finalised (normally on press night), use the data capture and editing system manually to cue the captions in time with the production, during which the system will capture timing information for each caption and all of the stage and lighting cues;

use the software application, guided by the comments and knowledge of the production, manually to select which of the stage and lighting cues are useful for automated synchronisation, and command the application to convert these into "triggers" which inherit data identifying the respective cues;

manually fill in additional information such as the trigger action required (e.g. "start speech following", "timed section", "waypoint", "pause"); manually add the timing information for timed sections and pauses; and save all this metadata in a metadata master file (one possible layout is sketched below, after these stages);

use information displayed by the system manually to identify gaps in the timing of displayed captions, and create an additional script in the metadata specifying the content of suitable audio descriptions, each with a unique identifier locating it in the master metadata file; these are then recorded using a separate tool, each recording being keyed to the master file by its unique identifier;

load the appropriate master metadata file into the real-time replay system and feed it with a recording of the production audio and cues or simulated cues to rehearse and test the accuracy of the speech following and the timing and content of all the accessibility data, and use the software application (and possibly other tools) to make adjustments as required to the metadata in the master file (including, where useful, converting some of the dialogue text in the "follower script" into a phonetic form more representative of the mode of speech delivery; and

adding specific vocabulary from the production into the lexicon of the speech follower in the automated replay system) in order to optimise quality and accuracy.
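Purely by way of illustration, a master metadata file of the kind assembled in the stages above might be laid out as follows. The key names, cue identifiers and trigger actions shown are assumptions made for the sketch; the file format itself is not defined here.

# Illustrative sketch only: one possible layout for the master metadata file.
# All key names and values are assumed for the example.
master_metadata = {
    "production": "Example Play",
    "follower_script": [                      # dialogue-only entries to follow
        {"line": 1, "text": "is this a dagger which i see before me"},
        {"line": 2, "text": "come let me clutch thee"},
    ],
    "caption_script": [                       # captions to display, by line
        {"line": 1, "text": "MACBETH: Is this a dagger which I see before me?"},
        {"line": 2, "text": "Come, let me clutch thee."},
    ],
    "waypoints": [                            # 'hard' cues and 'soft' inline cues
        {"cue_id": "LX-14", "kind": "hard", "action": "waypoint", "line": 1},
        {"cue_id": "SOFT-3", "kind": "soft", "action": "pause",
         "after_line": 2, "duration_s": 10},  # stop listening for 10 seconds
    ],
    "timed_sections": [                       # replayed to time, not followed
        {"cue_id": "SND-2", "captions": [{"offset_s": 0.0, "text": "(SONG BEGINS)"}]},
    ],
    "audio_descriptions": [                   # keyed recordings for gaps in dialogue
        {"id": "AD-7", "after_line": 2, "delay_s": 4.0, "file": "ad_007.wav"},
    ],
}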

Real-time replay system

The real-time replay system comprises a software application executed on a PC or similar which reads the master metadata file from the data capture and editing system, and is fed with live audio from the stage and with all of the sound and lighting cues (performance cues) as well as manual cues from a "nudge" application running on a portable tablet device. The replay system is connected to a speech follower application which can be run on the same PC or on a separate machine, and which is fed with the real-time audio signal and the metadata script, and continually feeds back the derived location in the follower script. The replay system may have network connections to other applications which are responsible for distributing textual captions to heads-up glasses, or audio description audio data to headsets.

Just before the start of the live production the master metadata file is loaded into the real-time replay system and a signal given to "go live". The replay system then monitors the live audio feed and the stage and lighting cues and acts in accordance with the sequence of information in the master file and the information from the speech follower system.

For example, the first trigger in the file may instruct the system to replay a timed sequence of announcement captions at the start of the performance, to match a pre-recorded announcement. When the performance cue corresponding to that trigger is received, the announcement captions are then played out according to times which are pre-stored with the captions, relative to the receipt of the cue.

When the sequence completes, the metadata is examined to see whether the replay system should automatically continue, for example, by following the dialogue audio against the follower script using the speech follower to provide timing information for each caption; or to wait for a performance cue before doing so; or to proceed to replay some timed effects captions which explain an introductory song or music.

In this manner, moving through the sequence of metadata items in response to their associated performance cues or by automated following-on, and performing each action as specified in the metadata, using timings either captured during the rehearsal or derived dynamically from the speech follower in accordance with the metadata instructions, the system progresses through the performance delivering the access information as it does so.

If the system were to drift, or the performance to deviate from the script, the option exists for a manual cue to be issued from a nudge application which causes the replay system to resynchronise either further ahead or further behind by one or more captions.
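As a rough, non-authoritative sketch of that control flow, the loop below steps through metadata items, holding for performance cues, replaying timed sections against a clock, handing dialogue sections to a speech follower, and accepting nudge offsets. The item kinds ("wait_cue", "timed", "follow") and the follower, cue-monitor and display interfaces are all invented for this illustration.

# Illustrative sketch of the replay progression described above; interfaces
# and item kinds are assumptions, not the patented implementation.
import time

class ReplaySystem:
    def __init__(self, items, follower, cues, display):
        self.items = items          # ordered metadata items from the master file
        self.follower = follower    # reports current follower-script position
        self.cues = cues            # yields performance cues and nudge cues
        self.display = display      # callable that shows a caption
        self.offset = 0.0           # manual nudge adjustment in seconds

    def run(self):
        for item in self.items:
            if item["kind"] == "wait_cue":
                self.cues.wait_for(item["cue_id"])        # hold until the cue fires
            elif item["kind"] == "timed":
                start = time.monotonic()
                for offset_s, text in item["captions"]:   # replay to the clock
                    while time.monotonic() - start + self.offset < offset_s:
                        time.sleep(0.05)
                    self.display(text)
            elif item["kind"] == "follow":
                for line, text in item["captions"]:       # replay to the follower
                    self.follower.wait_until_line(line)
                    self.display(text)

    def nudge(self, seconds):
        # Manual resynchronisation from the "nudge" application.
        self.offset += seconds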

Figure 1 is a schematic block diagram which schematically illustrates elements of both systems in an apparatus for delivering captions in the context of live theatre. A stage play can be considered to comprise a sequence of stage events. A stage event may be an utterance by an actor, a change in lighting, a triggered sound effect or a movement of scenery, for example. The stage events comprising a play are expected to, and normally do, occur in a prearranged sequence. However, in contrast to films, this prearranged sequence may not always have the same timeline. The length of each event might vary from performance to performance, the time between events might vary, and occasionally the order of events might vary. This is a factor of live performance. Access services for a stage play can be provided by delivering captions to assist people who are deaf or hard of hearing. In the systems described herein, these captions are presented as a sequence of caption events. For example, a given caption event will correspond either to an utterance in the sequence of utterance events, or to another stage event such as an effect event, for example the playing of particular music. As explained in the background portion above, currently a human captioner will follow the performance by listening to the utterance events and effects events, matching them against a caption script and pressing a button at the correct instant to deliver each caption for display. The reason a human captioner has been used to date is the need for human skill to deal with factors such as:

o variability in the duration of and intervals between utterances and stage events,

o content of utterances varying depending on the lead actor for that performance,

o the need to discriminate between utterances and other sounds for example audience noise or sound effects or exclamations; and

o the need to follow different modes and volumes of speech uttered in different ways or spoken at the same time as other dialogue.

The challenge which the arrangement described herein seeks to meet is how to automatically trigger each caption event at the time at which its corresponding stage event occurs. As described more fully in the following, data gathered from the script itself and during rehearsals aids the automated replay system described above and enables it to be loaded with the sequence of expected utterances and metadata including a sequence of expected stage cues, timing information and instructions. The stage cues give it fixed waypoints whereby its assumption about current location derived from audio input can be updated. Metadata is added to enable other useful information to be classified and parameterised in order to assist an automated follower to stay 'on track'. This is described in the following.

So-called 'performance cues' are commonly used in live performances. They are managed by a stage manager whose task it is to follow the performance against a performance script and to trigger performance cues to be issued at the appropriate points in the performance. Cues are triggered by activating an interface (e.g. pressing a button) which causes an electronic control signal to be generated which triggers the desired action at the cue, e.g. light control, audio effects, stage prop movement etc. Such cues are often delivered according to a standard referred to as OSC. In the present replay system these cues are used to aid following of the script to deliver captions. To summarise, in some embodiments what is stored is:

1) a standard "caption script" (speaker labels, sound effects, spoken text) which when timed against the rehearsal also acquires timing information (part of the set of metadata);

2) an automatically derived and then possibly manually adjusted "follower script" (no speaker labels, no sound effects, optionally phoneticisation to increase accuracy; the phoneticisation can be regarded as another part of the set of metadata);

3) a list of waypoints, which are either associated with selected 'hard' performance cues or with manually-defined 'soft' inline cues, each of which can be regarded as part of the metadata and contains further metadata which describes how to interpret or act on the cue when received (hard) or encountered inline (soft).

Taken together, all of this is stored in the master metadata file which is loaded in at runtime.

Waypoints are described herein separately from "metadata". However it will be understood that some waypoints may contain metadata, and the waypoints themselves form part of the overall master metadata - i.e. the complete set of data accompanying the basic script which all goes to enabling operation of the system.

As described in outline above, and in more detail later, in the described embodiments the real-time replay system feeds the audio to the speech follower and monitors what the speech follower perceives is the current position in the follower script (which may be right or may be wrong), as well as monitoring the clock and the cues and the metadata, and decides overall where the current position actually is. During a timed section, the speech follower is inactive in any case, but the real-time replay system just keeps working using the clock and the (timing) metadata. At a waypoint, the real-time replay system checks that the speech follower is in sync and adjusts the position if not. So the real-time replay system acts as the main control: the speech follower provides key input between waypoints, the cues provide key input at waypoints, and a clock provides key input for timed sections.
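That arbitration can be pictured as a small decision rule: between waypoints trust the speech follower, at a waypoint trust the cue and correct any drift gradually, and within a timed section trust only the clock. The sketch below is an assumption-laden illustration of that rule; the thresholds and the gradual-correction factor are invented and do not come from the described system.

# Illustrative sketch: deciding the current position from follower, cue and
# clock inputs. Numeric values are assumptions made for the example.
def arbitrate_position(mode, follower_time, clock_time, waypoint_time=None):
    """Return the position (in seconds of production time) to act on."""
    if mode == "timed_section":
        return clock_time                      # follower is inactive; trust the clock
    if mode == "waypoint" and waypoint_time is not None:
        drift = follower_time - waypoint_time
        if abs(drift) > 0.5:                   # follower has wandered off the cue
            # Correct gracefully rather than jumping, e.g. remove half the drift now.
            return follower_time - drift * 0.5
        return waypoint_time
    return follower_time                       # normal dialogue: trust the follower

if __name__ == "__main__":
    print(arbitrate_position("dialogue", 61.2, 60.0))
    print(arbitrate_position("waypoint", 63.0, 60.0, waypoint_time=60.0))
    print(arbitrate_position("timed_section", 0.0, 75.0))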

Figure 1 shows the real-time replay system in use during a live performance. As will be explained in more detail in the following, there are a number of different steps prior to the point at which the system is fully operative to deliver captions in a live performance. Thus, Figure 1 shows components uploaded to a server 100 ready to deliver captions to a live performance. These components may be implemented in hardware or software or any combination. Active modules may be implemented by a suitably programmed computer. In most cases software is uploaded to a server with modules pre-programmed to use the software. The dotted line in Figure 1 represents a differentiation between components on the server side and the live theatre portions. This is entirely diagrammatic and is just to illustrate use of the system. When the replay system is in use, captions may be delivered to a screen visible to members of the theatre audience, to individual devices such as tablets or smart phones, or, more preferably, 'smart glasses' used by members of the audience as required. The system comprises a speech detection component 1 which detects speech delivered by the actors on stage and converts it into a form which can be used by a speech follower 2 to compare the position of the speech with a predefined script that the actors are delivering. This is referred to herein as a follower script 3 and is shown stored in a memory component 4. This form could be phonemes, text or any other suitable form which enables a speech follower to track the speech against the predefined script. Speech followers are known per se in the field of teleprompting to enable a predefined script to be delivered at a rate matching an input speech, so that the speaker can follow it. Speech following is augmented in the present system by using the stage cues, and/or metadata in the follower script.

In the caption delivery system described herein, the memory component 4 holds at least two different scripts 3, 7. Note that these scripts are shown separately in Figure 1, but could be implemented as a single document framework as discussed later. The first script, the follower script 3, is for the purpose of assisting the speech follower to track the most likely point in the performance that the current speaker has reached. That is, the output of the speech detection component 1 (be it text or phonemes et cetera) is compared with text (or phonemes et cetera) in the follower script 3 at and around the currently determined position in order to continue to update the location of the currently determined position. The speech follower maintains a clock 20 which can be adjusted in line with the determined position to indicate location in the expected timeline of the production. The determined position is used to control the triggering of captions representing the dialogue. To achieve this, the second script which is held in the memory component 4 is the caption script 7. This is a script of the lines of text which are to be displayed as captions corresponding to the location which has been determined by the speech follower. The text in the follower script is related to equivalent text in the caption script during the preparation of the scripts. Each line in the follower script has a corresponding line in the caption script. The caption script also includes items which are not spoken dialogue, for example speaker labels, song lyrics and descriptions of audio effects. As the line position determined in the follower script is updated by the speech follower, the particular corresponding caption from the caption script is sent to a caption distribution module 8 under the control of a caption output module 9.

The follower script 3 incorporates additional metadata 3a about the performance which improves speech following and thus the accuracy of caption delivery. Metadata in the follower script can be used to aid following of the performance to give the speech follower the best possible chance of following a spoken performance in a live theatre environment. This metadata will be described in more detail later. So-called 'hard' metadata may be associated with cues which are expected to be delivered during the live performance; so-called 'soft' metadata may be associated with particular but non-cued events. Cues delivered during the live performance are detected by a cue detection component 5. The cue detection component supplies detected cues to a cue handler 6. Operation of the cue handler will be described later. The detection of cues and use of metadata allows the speech follower 2 to be periodically synchronised with the timeline of the live performance, as explained later. Note that while the metadata 3a is shown in the follower script 3, it could be stored in any suitable place accessible to the follower 2. For example, it may be in the master metadata file output from the data capture and editing system. By using metadata associated with the timeline in the follower script, speech following can be much improved. However, it may occasionally be necessary to 'reset' the precise location. For this, a nudge component 10 can be provided to advance or retard the clock 20 to more closely track the timeline of a production based on input from a human operator watching the performance.
The nudge component 10 has pushbutton inputs which feed specially designated cues to the cue handler 6, e.g.: "advance 1 second", "back 2 seconds".

Operation of the cue handler will now be described with reference to Figure 2. The cue handler 6 comprises a store 60 in which each performance cue received from theatre systems such as sound and lighting control systems is matched against a predefined list to retrieve metadata which controls how cue-related captions are displayed. Each cue has a cue identifier which is labelled Cue ID in the store 60, and a number of examples are shown by way of example only and not by limitation. Cue identifiers could be numbers, letters or any appropriate digital identification string. For ease of reference they are labelled ID 1, ID 2, ID 3 et cetera. Each cue is associated with metadata. This metadata may also be available in the master metadata file. The metadata can define a number of different cue responses and examples are shown in Figure 2. For example, Cue ID 1 could be associated with a particular waypoint in the follower script, but have no effect on caption display (e.g. close curtains). Cue ID 2 could be associated with an instruction to clear all text from the caption display. Cue ID 3 could be associated with a control signal which interrupts the listening function of the speech detection component 1, for example when the performance timeline reaches a pause or song etc. Cue ID 4 could denote the beginning of a certain predefined display insert, such as the lyrics of a song. These examples of metadata are non-limiting, because any particular performance will be associated with its own unique set of cues and each cue could have metadata relevant to that performance. It will be understood, however, that it is likely that many of the cues will be associated with waypoints in the follower script timeline which can be linked to specific captions. The cue handler 6 has cue handling logic 62 which detects an incoming cue, identifies it and accesses the store 60 to retrieve the metadata for that cue. A control signal (or cue signal) is output to the speech follower 2 to aid following, by comparing the cue with the determined position in the follower script and adjusting the position to bring them into line if necessary. Depending on the metadata, a control signal may be sent to the speech detection component 1 to control 'listening' (start/stop).
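A minimal sketch of such a cue store and dispatch step might look like the following; the cue identifiers, action names and component interfaces are chosen for illustration to mirror the examples above and are assumptions rather than a defined interface.

# Illustrative sketch of the cue handler: a store mapping cue identifiers to
# metadata, and logic that turns an incoming cue into control signals for the
# speech follower, speech detection and caption output. All names are assumed.
CUE_STORE = {
    "ID1": {"action": "waypoint", "line": 120},        # sync point, no caption effect
    "ID2": {"action": "clear_text"},                    # clear all displayed captions
    "ID3": {"action": "stop_listening", "seconds": 30},
    "ID4": {"action": "display_insert", "insert": "song_lyrics_1"},
    "NUDGE+1": {"action": "adjust_clock", "seconds": +1.0},
    "NUDGE-2": {"action": "adjust_clock", "seconds": -2.0},
}

def handle_cue(cue_id, follower, detector, caption_output):
    """Look up an incoming cue and emit the corresponding control signals."""
    meta = CUE_STORE.get(cue_id)
    if meta is None:
        return                                          # unknown cues are ignored
    action = meta["action"]
    if action == "waypoint":
        follower.align_to_line(meta["line"])            # correct follower position
    elif action == "adjust_clock":
        follower.adjust_clock(meta["seconds"])          # nudge from the tablet app
    elif action == "stop_listening":
        detector.stop(duration=meta["seconds"])         # pause speech detection
    elif action == "clear_text":
        caption_output.clear()
    elif action == "display_insert":
        caption_output.show_insert(meta["insert"])      # e.g. timed song lyrics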

A first category of cues can affect operation of the speech detection component 1, for example to cause it to 'stop listening' for a defined period at points in the play when the cue has indicated that no reasonable dialogue is likely to be picked up due to the point that has been reached in the performance. For example, it may be known that at that point the actors are not speaking (there is a predefined pause), there are audio effects, or there is another interlude such as a song. This avoids background noise interfering with the speech follower.

A second category of cues can guide operation of the speech follower 2. For example, if the cue denotes a certain waypoint, this can be matched with the position that the speech follower has determined the dialogue is at in the follower script, and if the same waypoint is not aligned in time, the speech follower can adjust the clock 20 to bring the determined position gracefully into line and so to improve following of the follower script subsequently. Such cues can also cause captions to be displayed.

A third category of cue addresses the problems that can arise when different actors may utter different dialogue (in different performances). These cues have metadata identified in the store as 'branch'. This is shown in Figure 2 as Cue ID 5. In fact, a script may include two different cues, each associated with a different branch (ID5X/ID5Y). If actor X is delivering the lines, cue 5X can be triggered. The metadata associated with that cue can cause the speech follower to branch to a certain portion of the follower script and output a sequence of utterances at that branch location associated with actor X. However, if actor Y is speaking the dialogue, the cue 5Y can be triggered and the follower script moves to a different branch location in the script which corresponds to a different sequence of utterances (to reflect the different dialogue spoken by each actor). One reason for this is that different actors performing the same play can nevertheless use different variations of the script at the same point in the dialogue. Such branching is referred to herein as branching from railway points, and allows the speech follower 2 to take a different route for lead actor X as compared with lead actor Y, and then come back to a common thread in the follower script after the particular sequence which pertains to those actors has been delivered. Note that such railway points may be provided both in the follower script 3 and also in the caption script 7. In the follower script they can be used to assess what point in time the play has reached. In the caption script 7, they are used to control which set of captions is displayed at that point.

A fourth category of cue controls the display without displaying dialogue. For example, there may be an instruction to clear all text, and this instruction will cause captions to be cleared off the display screen. It may be that there is a particular song which is going to be played at that point in time, and the caption output module 9 can access the lyrics of the song and display those at a predetermined rate while the song is being played in the live performance. The predetermined rate can be set during rehearsal as described later.
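The 'railway points' behaviour might be modelled as in the sketch below: two cue identifiers select alternative stretches of follower and caption entries before rejoining the common thread. The data layout and function name are assumptions made for illustration.

# Illustrative sketch of branch ("railway points") handling: cue 5X selects the
# entries recorded for lead actor X, cue 5Y those for actor Y, after which the
# script rejoins a common thread. The structure shown is assumed.
SCRIPT = {
    "common_before": ["Enter the lead actor."],
    "branches": {
        "ID5X": ["Actor X's variant of the speech."],
        "ID5Y": ["Actor Y's rather different variant."],
    },
    "common_after": ["The scene continues identically for both casts."],
}

def entries_for_performance(branch_cue):
    """Return the sequence of follower entries for the branch that was cued."""
    branch = SCRIPT["branches"].get(branch_cue, [])
    return SCRIPT["common_before"] + branch + SCRIPT["common_after"]

if __name__ == "__main__":
    for line in entries_for_performance("ID5Y"):
        print(line)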

Note that the follower script 3 includes metadata 3a which enables the determined location of the speech follower in the follower script to be frequently checked against the timeline of the live performance. The caption output module 9 is capable of outputting captions in the caption script matching the follower script as followed by the speech follower 2. As already mentioned, this might be the exact dialogue which is spoken, formatted for display, or some slightly modified version of it. In particular, it is important to understand that displayed captions could be spoken dialogue or other items based on the cue metadata. This could include such actions as displaying the lyrics of a song or text descriptions of audio effects. Each new determined line in the caption script will be displayed; metadata may then cause the text to be cleared after x seconds. Other lines may be replayed simply to time (e.g. a song) and not determined explicitly by the speech follower.

Generation of the scripts will now be described with reference to Figure 3, which illustrates the caption editing system 30. The caption editing system receives a production script 32. The script includes speaker labels denoting who is speaking at what time, information about special effects which will be generated at certain times, and other key events in the performance. Such scripts are available for live performances. The production script completely describes all events in the performance as well as the spoken dialogue. An extract of the dialogue only forms the follower script 3. This may be accessed as a subset of the caption script or created as a separate parallel script. This is the information against which the speech follower follows. It may be edited to phoneticise the text to make it easier to follow, and the vocabulary of the follower script will be used as the vocabulary of the speech recognition component in the speech follower. The caption script is also derived from the production script and includes all text to be displayed: speaker labels, event descriptions etc. A timeline 34 for the performance is created. The caption editing system includes a speech detection module 32 which can capture a recording of the spoken dialogue of the script to allow the timeline of the performance to be obtained. Further metadata is added to the follower script to indicate the timeline of the final rehearsals. Time points are stored with each caption in the caption script to match the timeline of the performance. During this phase, the caption editing system 30 also receives and stores relevant performance cues. It pre-filters the cue data to extract only 'action' cues rather than 'preparation' cues or status information. A cue list 37 is built, each cue associated with its time point in the timeline. By this means a human editor can then be shown a display using the editing system to select cues which represent useful waypoints, cues which may precede silences, cues which signify the start of a song etc. Where a branch may occur, the stage management team can be asked to insert a branch cue (ID5X or ID5Y) and cause it to be triggered at that point in the dialogue. Finally, if no explicit cue is available from stage management, the captioner may insert 'soft' metadata into the script to cause it to self-cue on reaching that point, for example to accommodate a pause of more than a few seconds, or some other event which requires control of the speech follower or caption delivery.

Figure 4 shows an example of some of the events which may be captured in a timeline for a performance. The top line denotes lighting cues, and the second line includes speech delivered by live actors (speech A/speech B). Lighting cues are shown as examples of cue points which might be associated with waypoints in the follower script. These are just some examples of cues which may be captured - there may be many different kinds of cues. A stage cue, such as a gunshot, is also shown. Key moments may be included in the timeline, such as dramatic pauses. Note that the first dramatic pause may be associated with a 'hard' cue/waypoint, while the second dramatic pause is associated with 'soft' metadata at the end of speech B. Other events could be scene end or interval. Just by way of example, these may be associated with performance cues which would cause text to be cleared from the display and/or an interval to be announced. The 'soft' non-cue metadata 3a is added manually to the metadata script. This metadata is not associated with an explicit performance cue. An example would be to tell the system that once caption 27 has been displayed it can stop listening for a defined time period, e.g. 10 seconds.

The production script and the timeline (associated with the utterances and events) are supplied to a script production component 36 of the caption editing system 30 which carries out the above steps. The script production component 36 outputs the follower script 3 and the caption script 7 (and optionally a translation script to be described later). Script production might also involve some manual input, as described above. What is important here is that the follower script entries and the caption script correspond, so that when the speech follower determines a location in the follower script the corresponding caption is displayed. Each entry in the follower script has a corresponding entry in the caption script. Where there are two separate scripts this can be done by line number, but it could also be done by a single electronic document with caption entries marked up, for example. When the speech follower component 2 (Figure 1) is following the live performance and detects a new line in the script, the caption associated with that line is displayed. Note that the cues and the metadata are defined at this stage, as part of the combined metadata script which also contains the follower script and the caption script(s) in the desired language(s). At this point, the cue metadata store 60 is established to associate appropriate metadata code with each cue.

Before the script generated by the script production component 36 and the cue metadata can be uploaded into the server for live performance, there is a rehearsal stage. In the rehearsal stage, the speech follower component 2 is active, and an assessment can be made as to how well the speech following component is following a recording of the rehearsal. This allows modifications to be made, for example to put into the follower script what the speech recogniser thought the actor was saying (maybe rather than what the dialogue would have indicated). This allows the speech recogniser to have guide utterances which might be more appropriate to enable the speech follower to track the script. The final production components (the scripts and the cue metadata) are uploaded into the server ready for the performance.

The caption delivery system can be augmented to provide captions in alternative languages to the script which is being delivered on stage. A translation script 7 provides a foreign language version of the script. The foreign language captions are pre-stored in the translation script 7, corresponding by line number (or mark-up) to the follower script 3.

Another augmentation is to provide an audio description script (not shown). This could be used to define and trigger audio descriptions, for example based on a pre-recorded sound file or using a real-time speech synthesiser. For example, it could point to different sound files in a memory which could be supplied on request to earpieces for blind or partially sighted people to allow the activity on stage to be described. It will be appreciated that such audio files are likely to be generated at points where there is no dialogue, the assumption being made that the blind or partially sighted people are able to hear the speech, but are not able to follow the play because they cannot see what is happening in between the speech. The audio files may be triggered by trigger times from the clock which tracks the production timeline, delayed by a specified time from the preceding caption.
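Purely as a sketch, an audio description script might associate each pre-recorded sound file with the preceding caption line and a delay; the file names, the on_caption_displayed hook and the blocking sleep are assumptions for illustration (in practice a timer on the production clock would be used).

import time

# Each entry: (follower line after which to trigger, delay in seconds, sound file).
audio_description_script = [
    (27, 4.0, "descriptions/act1_scene2.wav"),
    (113, 6.5, "descriptions/act2_set_change.wav"),
]

def on_caption_displayed(line_no, play_file):
    # Called once the caption for follower line `line_no` has been put out.
    for after_line, delay_s, wav in audio_description_script:
        if after_line == line_no:
            time.sleep(delay_s)   # in practice, a timer driven by the production clock
            play_file(wav)        # streamed to earpieces for blind or partially sighted users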

As mentioned, one preferred display system is so-called 'smart glasses' or 'smart eyewear'. Such glasses are known and so will not be described in depth here. In brief, smart glasses can be worn in the manner of normal glasses and have a lens in front of each eye of the user, the lens being transparent to allow the user to view what is beyond it, but also including a display portion which can display items to the viewer. Such glasses are used, for example, in augmented reality. Figure 5 shows a pair of glasses, with the lens designated by reference numeral 50 and the display area in each lens designated by reference 52. The glasses include a controller 54. This controller is connected to the display to control what is displayed there, and can receive as input user control inputs and server control inputs. The server control inputs come from the server side (from the caption distribution module 8), while the user control inputs can come from a user wearing the glasses to give them some control over how items are displayed to them. The glasses should preferably fulfil the following physical requirements. They are designed to be as lightweight as possible, preferably below 100 g. They are suitable for people who already wear glasses, as well as those who do not. They may be provided with an adjustable frame. Ideally, they have a minimal frame around the lenses (to maximise the display area).

Preferably they look discreet when in use. It is important in a theatre context that there is a low light output to nearby people, so that the performance is not disturbed by local light. A reasonable battery life is needed (for example, a minimum of 4 hours, as appropriate). They should be quick to issue, adjust to fit, and have an easy user control system to provide the user control inputs. For example, the user control input could select a language for display. Preferably they are provided with a rechargeable battery, with multiple headsets docking in a charging station. They could additionally be provided with headphones/built-in speakers for delivery of the audio description. The technical requirements are ideally that the display is adjustable to locate the text at the same apparent depth as the actors on the live stage. Ideally the text display has a high resolution, and it is advantageous if multiple lines of text may be displayed at one time. It is possible to provide an option for scrolling text and/or an option for pop-on/pop-off block display. A user control input could remotely switch between these options. The display brightness could be adjusted (either remotely from the server side or by the user), and similarly text size and text colour could be adjusted remotely and/or by the user. According to one option, there is the ability to select up to 10 display channels (for example, for different language translations). Contact with the server side could be wireless, for example Wi-Fi or Bluetooth (although it is expected that Wi-Fi is advantageous due to its longer range).

Communication between the server-side system, in particular the caption distribution module 8, and the server control interface on the controller 54 of the glasses may be implemented as a TCP (Transmission Control Protocol) connection which would normally remain open for the duration of the performance. However, given that each message is independent, closing and reopening a connection would not affect the output. The glasses display captions in real time as each individual message is received. The caption description is compatible with EBU-TT Part 3, defined in https://tech.ebu.ch/docs/text/tech3370.pdf. Each combination of language and performance is a distinct stream with its own stream identifier. The stream identifier is configured to be sufficient to route the message to the appropriate output devices. It is possible to have a separate connection from the caption output module 9 to the caption distribution module for each performance/language if that is more convenient. When utilised with the above system, there is no need for the presentation system at the glasses to manage any timing, for example the duration of display of a caption. As described above, the caption output module in association with the speech follower can explicitly issue a 'clear' instruction, or an appropriate 'scroll rate' parameter is sent so that the glasses know how fast to scroll the text. Each message therefore defines precisely what should be displayed at the moment it is processed; after that, the display is updated by further messages. Some caption examples follow.
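First, purely by way of illustration, the connection itself might look like the following minimal sketch, assuming a plain TCP connection and NUL-byte framing between messages; the host name, port and framing are assumptions and are not defined by the EBU-TT specification.

import socket

DISTRIBUTION_HOST = "captions.example.local"   # hypothetical address of the caption distribution module
DISTRIBUTION_PORT = 9100                       # hypothetical port

def send_caption(sock, ebu_tt_xml):
    # Each message is a complete EBU-TT document defining exactly what to display now.
    sock.sendall(ebu_tt_xml.encode("utf-8") + b"\x00")   # assumed NUL-delimited framing

# The connection would normally stay open for the duration of the performance,
# with further messages (such as Examples 1 to 5 below) pushed as the follower advances.
with socket.create_connection((DISTRIBUTION_HOST, DISTRIBUTION_PORT)) as sock:
    send_caption(sock, "<tt>...</tt>")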

Example 1 denotes a message which causes the text 'ladies and gentlemen: please switch off mobile phones' to be displayed in bold font in white text on a black background. A style identifier in the message indicates the background and foreground colours, and the font style (font weight). The span style parameter in the body of the message defines the text.

According to Example 2, the span style parameter in the body of the text provides the phrase 'dogs barking' as a caption. This will follow an identified cue which has been detected by the cue handler and which is associated with the metadata to display this text. In the performance itself, there will be an actual sound of dogs barking at this point in the timeline.

Example 3 is an example of a message which conveys actual dialogue in the play: 'I am a dealer not a doctor!'. Example 4 also follows dialogue from the play but shows two different pieces of dialogue on the same display, the second piece aligned to the right:

'Are you my friend?' / 'I don't think so!'

Example 5 shows an empty message.

Example 1

<tt xmlns:ebuttp="urn:ebu:tt:parameters" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns="http://www.w3.org/ns/ttml" xml:lang="en" ebuttp:sequenceIdentifier="cleansed_English_Theatre1" ebuttp:sequenceNumber="100001">
  <head>
    <styling>
      <style xml:id="s1" tts:backgroundColor="black" tts:color="white" tts:fontWeight="bold" />
    </styling>
  </head>
  <body>
    <div>
      <p>
        <span style="s1">Ladies and gentlemen:<br/>
        Please switch off mobile phones.</span>
      </p>
    </div>
  </body>
</tt>

Example 2

<tt xmlns:ebuttp="urn:ebu:tt:parameters" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns="http://www.w3.org/ns/ttml" xml:lang="en" ebuttp:sequenceIdentifier="cleansed_English_Theatre1" ebuttp:sequenceNumber="100002">
  <head>
    <styling>
      <style xml:id="s2" tts:backgroundColor="black" tts:color="white" tts:fontStyle="italic" />
    </styling>
  </head>
  <body>
    <div>
      <p>
        <span style="s2">DOGS BARKING</span>
      </p>
    </div>
  </body>
</tt>

Example 3

<tt xmlns:ebuttp="urn:ebu:tt:parameters" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns="http://www.w3.org/ns/ttml" xml:lang="en" ebuttp:sequenceIdentifier="cleansed_English_Theatre1" ebuttp:sequenceNumber="100003">
  <head>
    <styling>
      <style xml:id="s3" tts:backgroundColor="black" tts:color="white" />
    </styling>
  </head>
  <body>
    <div>
      <p>
        <span style="s3">I'm a dealer not a doctor!</span>
      </p>
    </div>
  </body>
</tt>

Example 4

<tt xmlns:ebuttp="urn:ebu:tt:parameters" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns="http://www.w3.org/ns/ttml" xml:lang="en" ebuttp:sequenceIdentifier="cleansed_English_Theatre1" ebuttp:sequenceNumber="100004">
  <head>
    <styling>
      <style xml:id="s3" tts:backgroundColor="black" tts:color="white" />
      <style xml:id="s4" tts:backgroundColor="black" tts:color="white" tts:textAlign="right" />
    </styling>
  </head>
  <body>
    <div>
      <p>
        <span style="s3">Are you my friend?</span>
        <br/>
        <span style="s4">I don't think so!</span>
      </p>
    </div>
  </body>
</tt>

Example 5

<tt xmlns:ebuttp="urn:ebu:tt:parameters" xmlns="http://www.w3.org/ns/ttml" xml:lang="en" ebuttp:sequenceIdentifier="cleansed_English_Theatre1" ebuttp:sequenceNumber="100005">
  <head>
  </head>
  <body/>
</tt>

The EBU-TT specification is useful in this context because the format permits 'decoration' with many parameters and is a tried and tested standard way to convey live text. It permits display of multiple languages with a potentially extensible range of mark-up to encode style, as well as enabling a stream context and font attributes to be specified. Any other suitable protocol could be used.

The controller 54 provides a receiver for receiving the messages. A socket is opened on the server side (caption distribution module) with a socket ID for each language of each production/show. For example, if a particular play is being captioned in three languages, three distinct channels will be opened. If another production is simultaneously being run at a different theatre in eight languages, an additional eight distinct socket connections are opened. In opening a connection, the production ID and the language are specified. Caption messages are pushed to that connection as required until the end of the performance. A commencement message, for example 'ready to roll', may be provided to the controller 54 to enable the controller to provide instructions to a wearer on how to set up the system if needed. Alternatively, such a message can simply trigger the controller 54 to be ready for incoming caption messages.
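A minimal server-side sketch of this arrangement is given below, assuming one listening socket per production/language pair; the port numbers, production identifier and helper names are assumptions for illustration.

import socket

streams = {}   # (production ID, language) -> listening socket

def open_stream(production_id, language, port):
    # One distinct socket per language of each production; caption messages are
    # pushed on accepted connections for that stream until the end of the performance.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("", port))
    server.listen()
    streams[(production_id, language)] = server

# A play captioned in three languages opens three distinct channels; another
# production at a different theatre in eight languages would open eight more.
for language, port in [("en", 9101), ("fr", 9102), ("de", 9103)]:
    open_stream("production_A", language, port)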

In a context where bandwidth may be restricted, and to avoid channel delay, broadcast/multicast methods may be utilised to send each caption update message once over an Internet connection to all headsets, which then select only the messages they are interested in based on their configuration by production ID and language. That is, all channels could be broadcast at all times, but individual glasses could be tuned to receive particular ones at particular times.

By using the user control, users may adjust text size and/or font. However, there may be advantages in disallowing users from doing this, since theatre captioners have in the past wished, for editorial reasons, to retain explicit control of line breaks, for example. As shown in Example 4, the protocol allows for explicit control of line breaks. Nevertheless, there may be circumstances where a local option could be set to ignore this and so cause any subsequent text to be appended to that already on display. A possible disadvantage is that the text on the display will be scrolling up, and a user-selected option may therefore cause incoming text to be appended to a half-displayed line. In one example, therefore, it is possible to define a fixed, adequate text size in the glasses from which it can be derived how many characters will fit on a row.

The controller on the glasses is configured to receive all messages and control signals from the single distribution port, and to filter them for its glasses depending on the show that the user is watching at the time. In another example, the server offers a single port but permits multiple connections. Such an arrangement also allows for an alternate sender to take over if a particular caption distribution module fails. Moreover, it also allows for an emergency message, such as 'please leave the building by the nearest emergency exit', to be sent to all glasses in all languages from a central safety system independently of the captioning system.
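The filtering can be sketched as follows, assuming (purely for illustration) that each message arrives in a small JSON envelope carrying its stream identifier; in the system described above, the identifier would instead be carried in the EBU-TT header (ebuttp:sequenceIdentifier).

import json

SELECTED_STREAM = "cleansed_English_Theatre1"   # from the wearer's chosen production ID and language
EMERGENCY_STREAM = "emergency_all"              # hypothetical identifier for central safety messages

def handle_message(raw_bytes, display):
    # The controller receives every message on the single distribution port and
    # displays only those for its selected stream, plus any emergency messages.
    message = json.loads(raw_bytes)
    if message.get("stream") in (SELECTED_STREAM, EMERGENCY_STREAM):
        display(message["caption"])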

In the caption delivery system described herein, use is made of two features in particular. There are predetermined waypoints, which are the points at which cues can be guaranteed to occur, and to occur in a fixed relationship to fixed items of dialogue, because they effectively follow cue triggers. What actually happens in the theatre is that someone manually calls stage cues, lighting cues and so forth. Use is made of the pre-existing human following of the play, by a stage manager whose job it is to cause staff to prepare for cue 153, trigger cue 153, etc. At that point someone presses a button, a cue signal goes out, the light comes on, or the music starts to play, and the system captures that signal to know quite precisely where it is in the timeline.

Some of the cues may not be useful, so part of the preparation of the metadata is to predetermine which cues are useful, where they occur in the follower script and what their nature is: whether they signal (because the person doing this will have watched the rehearsals) that there is going to be music, a song, a sound effect, silence or whatever it may be. The cues associated with metadata act as hard waypoints, something that can be relied on with near certainty. What the follower then needs to do is to travel between those waypoints as accurately as possible using the speech following technology. Preferably there is a cue every 3 or 4 minutes, and the dialogue in between is followed not just by using the speech follower alone but also by using knowledge of the timeline of the performance, previously captured as additional metadata associated with each entry in the follower script. So the captions will have times associated with them that indicate that, during the rehearsal at least, this was the approximate expected rate of delivery. The speech follower then takes into account the data from the audio (as it matches it against the follower script), plus the metadata which defines the broadly expected timeline, plus information as to how to interpret the 'hard' or 'soft' cues which tally with what is expected to occur at a particular waypoint. When all of this is taken into account, the system regulates the determined position so as to proceed as accurately and as smoothly as possible through the stepping out of the captions for display. In other words, if it has drifted a little (e.g. a few seconds late) and a 'hard' cue is received, it does not skip the captions that would have been expected to be delivered between its current position and the position the cue indicates; instead, the system advances the rate of delivery so as to catch up and then continues following. The system is designed to enable speech following in the context of imprecise information from the speech recognition component, because it is known to be working in a challenging environment, with audible noise and other sounds which could throw the speech follower off track, and also periods of silence. The system therefore uses that imprecise information along with the metadata and the synchronising cues and the flow of the timeline to determine as precisely as possible the correct position in the script in the context of theatre performances which vary slightly from each other.
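A simplified sketch of this catch-up behaviour follows (it is not the actual following engine); the regulate function and the rate rule are assumptions chosen only to illustrate that intervening captions are delivered faster rather than skipped.

def regulate(current_line, cue_line, base_interval_s):
    # Returns the next follower-script line to put out and the wait before the one after it.
    lag = cue_line - current_line
    if lag > 0:
        # Running behind a 'hard' cue: keep stepping through every caption, but faster.
        return current_line + 1, base_interval_s / (1 + lag)
    # On time or ahead: proceed at the nominal, rehearsal-derived rate.
    return current_line + 1, base_interval_s

# Example: three lines behind when a hard cue arrives - the next caption goes out after
# a quarter of the usual interval instead of jumping straight to the cue's position.
next_line, wait_s = regulate(current_line=27, cue_line=30, base_interval_s=4.0)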

The caption script can be considered the primary script. It is divided into entries or lines which correspond to the captions that are going to be displayed, so each line corresponds to a new caption scrolling out of the bottom of the LED display. For example, there may be 2,000 or 3,000 lines of dialogue and effects captions in an entire production. The caption script has speaker labels which are displayed with the lines. The system creates an initial follower script by stripping off the speaker labels and copying only the dialogue lines into equivalent lines for the follower to follow. So there is a one-to-one correspondence between each line in the follower script and the corresponding line in the caption script. Similarly, for translations, there is a one-to-one correspondence for each of the lines in each of the translated scripts. If the determined position of the follower (aided by cues and metadata) is line 13 of the follower script, the system puts out line 13 of the caption script and line 13 of each translation script (if any).
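This one-to-one correspondence can be illustrated with the following minimal sketch; the speaker labels, lines and the strip_speaker_label helper are placeholders for illustration, not the production's actual script or tooling.

caption_script = [
    "SPEAKER A: I'm a dealer not a doctor!",
    "SPEAKER B: Are you my friend?",
    "SPEAKER A: I don't think so!",
]
translation_scripts = {"fr": ["...", "...", "..."]}   # pre-stored, same line numbering

def strip_speaker_label(line):
    # Keep only the dialogue; the label is displayed with the caption but not followed.
    return line.split(":", 1)[1].strip() if ":" in line else line

follower_script = [strip_speaker_label(line) for line in caption_script]

def put_out(line_no):
    # Determined position line N -> caption line N and line N of each translation script.
    return caption_script[line_no], {lang: s[line_no] for lang, s in translation_scripts.items()}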

In the caption delivery system described herein, the core speech following engine may balance the input from the speech follower, the input from the cues, the manually-defined metadata and the pre-captured timeline to achieve following by:

• Responding to synchronising (rather than instructing) cues as firm (hard) waypoints.

• Proceeding between the waypoints based on information including that from real-time speech recognition, which is known to be imprecise due to the poor audio environment, but using techniques to give this the best possible chance, for example (a) audio compression, (b) seeding the speech recogniser with only the vocabulary of the script (see the sketch after this list) and (c) modifying the follower script to phoneticise or adjust it to match more closely what the recogniser thinks it is hearing.

• Guiding interpretation of speech recogniser output with knowledge of the expected rate of delivery (i.e. stopping it leaping too far ahead or falling too far behind; knowing it should be progressing forwards with some built-in elasticity to allow a degree of uncertainty due to the less than 100% perfect recognition in these circumstances).

• Supplementing speech following using metadata to assist in interpreting performance cues (hard cues) and inline cues (soft cues).

• Able to be relocalised if necessary using nudges.
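By way of illustration of item (b) above, the following minimal sketch assumes a speech recogniser that can be restricted to a supplied vocabulary; the set_vocabulary call is hypothetical and stands in for whatever interface the chosen recogniser provides.

import re

def script_vocabulary(follower_script):
    # Collect only the words that actually occur in the follower script.
    words = set()
    for line in follower_script:
        words.update(re.findall(r"[a-z']+", line.lower()))
    return words

vocabulary = script_vocabulary(["I'm a dealer not a doctor", "Are you my friend"])
# recogniser.set_vocabulary(vocabulary)   # hypothetical call on the chosen speech recogniser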